This is a read-only snapshot of the ComputerCraft forums,
taken in April 2020.
Lua: distinguish Raw from Hyper-text-markup-language
Started by EveryOS, 21 April 2016 - 09:07 PM
Posted 21 April 2016 - 11:07 PM
I would like to use the http API to distinguish raw pages from HTML (Hypertext Markup Language) pages.
Edited on 21 April 2016 - 09:07 PM
Posted 21 April 2016 - 11:19 PM
You want to parse HTML? Good Luck!
Posted 22 April 2016 - 12:19 AM
If you just want to tell them apart, and not parse it or anything, try:
function isHTML(text)
  -- Matches "<!doctype html>" in any letter case, with optional whitespace
  local pattern = "<%s-!doctype%s+html%s->"
  text = text:lower()
  return text:find(pattern) ~= nil
end
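To go from a URL to an answer, the check can be wrapped around a download. This is only a sketch: classifyURL is a made-up helper name, and it assumes ComputerCraft's http.get returns nil on failure or a handle with readAll() and close(). isHTML is repeated so the snippet runs on its own.

```lua
-- Same doctype check as in the post, repeated so this snippet is self-contained.
function isHTML(text)
  local pattern = "<%s-!doctype%s+html%s->"
  return text:lower():find(pattern) ~= nil
end

-- Hypothetical helper: download a URL and report "html" or "raw".
-- Assumes http.get(url) returns nil on failure, otherwise a handle
-- with readAll() and close().
function classifyURL(url)
  local handle = http.get(url)
  if not handle then
    return nil, "request failed"
  end
  local body = handle.readAll()
  handle.close()
  return isHTML(body) and "html" or "raw"
end
```

Calling classifyURL with a page address would then give "html", "raw", or nil plus an error message if the request failed.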
Edited on 21 April 2016 - 10:20 PM
Posted 22 April 2016 - 12:54 AM
OK
Posted 22 April 2016 - 06:08 AM
That's a good idea, but many pages don't contain such a line. It would be safer to check the first two layers of actual HTML, I believe.
<html>.+<head>.+</head>.+<body>.+</body>.+</html>
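A possible Lua version of that structural check is sketched here. Written exactly as above, the pattern is case-sensitive, needs at least one character between tags because of .+, and cannot match tags with attributes such as <html lang="en">. The variant below lowercases the text, uses .- for the gaps, and uses [^>]- to allow attributes. The function name looksLikeHTMLDocument is made up for illustration.

```lua
-- Hypothetical helper: true if the text contains the html/head/body skeleton in order.
function looksLikeHTMLDocument(text)
  local lowered = text:lower()
  -- [^>]- lets an opening tag carry attributes; .- allows empty gaps between tags.
  local pattern = "<html[^>]->.-<head[^>]->.-</head>.-<body[^>]->.-</body>.-</html>"
  return lowered:find(pattern) ~= nil
end
```

This is still a rough heuristic: a page that leaves out <head> or the closing tags would be reported as not HTML.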
I guess the pattern could look something like that. But I'm not too practiced with Lua patterns, so please correct me if I did something wrong ^^
Posted 22 April 2016 - 12:53 PM
I think we have two problems:
Some raw pages still contain HTML elements; they just don't get rendered.
Some HTML pages have a longer doctype declaration, for example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
Posted 22 April 2016 - 02:09 PM
For your second point,
"<%s-!doctype%s+html.->"
should fix the problem.
For your first point, I'm not exactly sure what you mean, but a raw page should not start with <!doctype html>.
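A quick sketch of how the relaxed pattern behaves on the long HTML 4.01 doctype. The leading ^%s* anchor is my own addition, not part of the pattern above. It assumes the doctype is the first thing on the page, so that a raw file which merely mentions a doctype further down is not counted. hasDoctype is a hypothetical name.

```lua
-- Relaxed pattern from the post, plus an assumed "^%s*" anchor so that only
-- text that starts with a doctype (after optional whitespace) matches.
local ANCHORED_DOCTYPE = "^%s*<%s-!doctype%s+html.->"

-- Hypothetical helper: true if the text begins with an HTML doctype of any flavour.
function hasDoctype(text)
  return text:lower():find(ANCHORED_DOCTYPE) ~= nil
end

-- The long HTML 4.01 doctype quoted earlier in the thread.
strictDoctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
  .. '"http://www.w3.org/TR/html4/strict.dtd">'
```

Here hasDoctype(strictDoctype) is true, while the original pattern ending in %s-> does not match that string because of the extra text before the closing bracket.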
Posted 22 April 2016 - 02:38 PM
Posted 22 April 2016 - 02:55 PM
xD
Posted 22 April 2016 - 03:53 PM
Because in a case like that it's impossible to tell if it's raw or HTML without prior context.
Edited on 22 April 2016 - 02:04 PM
Posted 22 April 2016 - 05:50 PM
In fact, it would be easy to figure out if we had access to the HTTP headers and could check the Content-Type field.
Posted 22 April 2016 - 09:20 PM
Yes, but sadly we do not. :/
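For reference, the header idea could be sketched like this. It assumes a ComputerCraft build whose response handles offer getResponseHeaders(), which as far as I know only arrived in releases after this thread and in CC: Tweaked. On a version without it, this does not apply. isHTMLContentType is a made-up helper name that takes a plain table of header names and values.

```lua
-- Hypothetical helper: true if a header table declares an HTML body.
-- Header names are compared case-insensitively.
function isHTMLContentType(headers)
  for name, value in pairs(headers) do
    if name:lower() == "content-type" then
      -- Values look like "text/html; charset=utf-8"; only the media type matters.
      return value:lower():find("^%s*text/html") ~= nil
    end
  end
  return false
end

-- Assumed usage on a build that provides getResponseHeaders():
--   local handle = http.get(url)
--   local html = isHTMLContentType(handle.getResponseHeaders())
--   handle.close()
```

A server can still send the wrong Content-Type, so this is a hint and not a guarantee.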