This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.

Lua: distinguish Raw from Hyper-text-markup-language

Started by EveryOS, 21 April 2016 - 09:07 PM
EveryOS #1
Posted 21 April 2016 - 11:07 PM
I would like to use the http API to tell whether a downloaded page is raw text or HTML (HyperText Markup Language).
Edited on 21 April 2016 - 09:07 PM
KingofGamesYami #2
Posted 21 April 2016 - 11:19 PM
You want to parse HTML? Good Luck!
Morganamilo #3
Posted 22 April 2016 - 12:19 AM
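If you just want to tell them apart without parsing anything, try this:

-- Returns true if the text contains a "<!doctype html>" declaration.
function isHTML(text)
	-- %s- allows optional whitespace; lowercasing makes the check case-insensitive
	local pattern = "<%s-!doctype%s+html%s->"
	text = text:lower()

	return text:find(pattern) ~= nil
end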
Edited on 21 April 2016 - 10:20 PM
EveryOS #4
Posted 22 April 2016 - 12:54 AM
OK
H4X0RZ #5
Posted 22 April 2016 - 06:08 AM
That's a good idea, but many pages don't contain such a line. I believe it would be safer to check for the first two levels of the actual HTML structure.

<html>.+<head>.+</head>.+<body>.+</body>.+</html>
I guess the pattern could look like this. I'm not very experienced with patterns, though, so please correct me if I did something wrong ^^
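
One possible correction of the pattern above, offered as a sketch rather than a definitive check: `.+` needs at least one character between tags, so a compact page like `<html><head></head>...` would not match, and tags carrying attributes (`<html lang="en">`) are missed as well. The version below uses the lazy `.-` instead and allows attributes with `[^>]*`. The function name `hasHTMLStructure` is hypothetical, and pages that leave out the optional `<head>` or `<body>` tags will still not match.

```lua
-- Sketch only: checks for the html > head + body skeleton.
-- hasHTMLStructure is a made-up name, not part of any API.
local function hasHTMLStructure(text)
  -- [^>]* tolerates attributes inside the opening tags;
  -- .- matches zero or more characters (newlines included), as few as possible.
  local pattern = "<html[^>]*>.-<head[^>]*>.-</head>.-<body[^>]*>.-</body>.-</html>"
  -- Lowercase first so <HTML> and <html> are treated alike.
  return text:lower():find(pattern) ~= nil
end
```

Because the pattern is unanchored, it also accepts pages that put a doctype or comments before the `<html>` tag.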
EveryOS #6
Posted 22 April 2016 - 12:53 PM
I think we have two problems:
Some raw pages still contain HTML elements; they just don't get rendered.
Some HTML pages have a longer doctype declaration, for example:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
Morganamilo #7
Posted 22 April 2016 - 02:09 PM
For your second point,

"<%s-!doctype%s+html.->"
should fix the problem.

For your first point, I'm not exactly sure what you mean, but a raw page should not start with <!doctype html>.
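
The two points above can be combined in a small sketch. It uses the relaxed pattern (`.-` before the closing `>`) so that longer doctype declarations match, and it adds a `^%s*` anchor so that only text that starts with a doctype counts. The anchor is an assumption about what "start with" should mean here, and the function name `startsWithDoctype` is made up for illustration.

```lua
-- Sketch only: true when the text begins (after optional whitespace)
-- with a doctype declaration that names html.
local function startsWithDoctype(text)
  -- ^%s*  : only leading whitespace may come before the declaration
  -- .-    : anything up to the first ">", covering public identifiers and DTD URLs
  local pattern = "^%s*<%s-!doctype%s+html.->"
  -- Lowercase first so <!DOCTYPE HTML ...> and <!doctype html> both match.
  return text:lower():find(pattern) ~= nil
end
```

With the anchor in place, raw text that merely mentions a doctype further down is not flagged. A raw page that really does begin with a doctype is still reported as HTML.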
Bomb Bloke #8
Posted 22 April 2016 - 02:38 PM
a raw page should not start with <!doctype html>.

Why not? :P
Sewbacca #9
Posted 22 April 2016 - 02:55 PM
You want to parse HTML? Good Luck!
xD
Morganamilo #10
Posted 22 April 2016 - 03:53 PM
a raw page should not start with <!doctype html>.

Why not? :P

Because in a case like that it's impossible to tell if it's raw or HTML without prior context.
Edited on 22 April 2016 - 02:04 PM
Anavrins #11
Posted 22 April 2016 - 05:50 PM
Because in a case like that it's impossible to tell if it's raw or HTML without prior context.
In fact, it would be easy to figure out if we had access to the HTTP headers and could check the Content-Type field.
Morganamilo #12
Posted 22 April 2016 - 09:20 PM
Because in a case like that it's impossible to tell if it's raw or HTML without prior context.
In fact, it would be easy to figure out if we had access to the HTTP headers and could check the Content-Type field.

Yes, but sadly we do not. :/
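
For readers in an environment that does expose response headers, the Content-Type check mentioned above might look roughly like this. It is a sketch under assumptions: the thread says headers were not available at the time, and the header table is assumed to come from something like `getResponseHeaders()` on the response handle in later versions (for example CC: Tweaked). The function name `isHTMLContentType` is hypothetical.

```lua
-- Sketch only: decides from a table of response headers whether the body is HTML.
-- `headers` maps header names to values; where it comes from is an assumption.
local function isHTMLContentType(headers)
  for name, value in pairs(headers) do
    -- Header names are case-insensitive, so compare in lowercase.
    if type(name) == "string" and name:lower() == "content-type" then
      -- Keep only the media type, dropping parameters such as "; charset=UTF-8".
      local mediaType = tostring(value):lower():match("^%s*([^;%s]+)")
      return mediaType == "text/html" or mediaType == "application/xhtml+xml"
    end
  end
  -- No Content-Type header: nothing to go on, so report "not HTML".
  return false
end

-- Assumed usage, where response headers are available:
-- local response = http.get(url)
-- if response then print(isHTMLContentType(response.getResponseHeaders())) end
```

Where headers are not available, the pattern-based checks earlier in the thread remain the only option.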