This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.

Getting specific information from a website

Started by remiX, 05 February 2013 - 06:09 AM
remiX #1
Posted 05 February 2013 - 07:09 AM
Hey guys, I need to get information from a website which looks like this:


    <ul id="linkGroupsUL"><li class="linkGroup" id="linkGroup_3026216">
          <div class="linkGroupDiv" id="linkGroupDiv_3026216">


            <ul id="links-3026216" class="categoryLinks"><li class="link" id="link_7016734">
                  <table cellpadding="0" cellspacing="0" class="linkLayout"><tbody><tr><td class="img"><a href="http://pastebin.com/0qGFzhrU" target="_blank">
                          <img src="http://custom.pagepeeker.com/thumbs.php?encode=anB2cDt5JGJja2suenF5dm1jZ2N%2FYWtkenZocCBhcG0xf29gbXxufGg6YW5h%0AID5iXl92fXRB%0A" alt="http://pastebin.com/0qGFzhrU">
                      </a></td><td class="content">
                      <a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
                        <div class="linkURL" id="link_url_7016734">
                          pastebin.com/0qGFzhrU
                        </div>
                        <div class="linkURLFull" id="link_url_full_7016734" style="display:none">paste<wbr></wbr>bin.c<wbr></wbr>om/0q<wbr></wbr>GFzhr<wbr></wbr>U</div>
                      <p class="linkDescription">Test Description</p>
                        <div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>
                    </td></tr></tbody></table>
                </li><li class="link" id="link_7016751">
                  <table cellpadding="0" cellspacing="0" class="linkLayout"><tbody><tr><td class="img"><a href="http://pastebin.com/Xd01uZgc" target="_blank">
                          <img src="http://custom.pagepeeker.com/thumbs.php?encode=anB2cDt5JGJja2suenF5dm1jZ2N%2FYWtkenZocCBhcG0xf29gbXxufGg6YW5h%0AIFZ3KSh5T2F3%0A" alt="http://pastebin.com/Xd01uZgc">
                      </a></td><td class="content">
                      <a href="http://pastebin.com/Xd01uZgc" target="_blank">Sample OS</a>
                        <div class="linkURL" id="link_url_7016751">
                          pastebin.com/Xd01uZgc
                        </div>
                        <div class="linkURLFull" id="link_url_full_7016751" style="display:none">paste<wbr></wbr>bin.c<wbr></wbr>om/Xd<wbr></wbr>01uZg<wbr></wbr>c</div>
                      <p class="linkDescription">My Sample OS</p>
                        <div class="linkTimestamp">
                          Posted February 4, 2013 at 12:20 PM
                        </div>
                    </td></tr></tbody></table>
                </li></ul>
          </div>
        </li></ul>
  </div>

There's quite a lot there and I need to get information from sources like this:


<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
                        <div class="linkURL" id="link_url_7016734">
                          pastebin.com/0qGFzhrU
                        </div>
                        <div class="linkURLFull" id="link_url_full_7016734" style="display:none">paste<wbr></wbr>bin.c<wbr></wbr>om/0q<wbr></wbr>GFzhr<wbr></wbr>U</div>
                      <p class="linkDescription">Test Description</p>
                        <div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>

I need:
1. Name - <a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a> - "Test"
2. Code - pastebin.com/0qGFzhrU (I need the last part)
3. Description - <p class="linkDescription">Test Description</p> - "Test Description"
4. Timestamp - Posted February 4, 2013 at 12:07 PM - "February 4, 2013 at 12:07 PM"

How can I achieve this? :X
zekesonxx #2
Posted 05 February 2013 - 07:25 AM
I think the best way would probably be Regular Expressions.
Cranium #3
Posted 05 February 2013 - 07:28 AM
You can use string.gmatch for this.

local nameTable = {}
-- pageText holds the page source as a string
for name in string.gmatch(pageText, '<a href="http://pastebin.com/[^>]->([^<]-)</a>') do
	table.insert(nameTable, name)
end
That should match every link to a pastebin code whose content is plain text, and add the name to the table.

EDIT: Of course, pageText would be whatever http.get returns (read it with readAll()).

EDIT 2: Actually, you can take a look at how I matched strings with my SmartPaste program. It should be from lines 545 - 560 if you're interested.
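
For reference, a minimal sketch of the gmatch approach run against a cut-down copy of the markup from the first post. The names sampleHtml and names are made up for illustration, and the hard-coded string stands in for the text that http.get would return in ComputerCraft.

```lua
-- Cut-down sample of the page source (hypothetical test data based on post #1).
local sampleHtml = [[
<td class="img"><a href="http://pastebin.com/0qGFzhrU" target="_blank">
  <img src="thumb.png" alt="http://pastebin.com/0qGFzhrU">
</a></td><td class="content">
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
<a href="http://pastebin.com/Xd01uZgc" target="_blank">Sample OS</a>
]]

local names = {}
-- [^>]-> skips the rest of the opening tag; ([^<]-) captures text up to the
-- next tag, so the image link (whose content contains <img>) does not match.
for name in sampleHtml:gmatch('<a href="http://pastebin%.com/[^>]->([^<]-)</a>') do
	table.insert(names, name)
end

for i, name in ipairs(names) do
	print(i, name)
end
```

Using [^<]- instead of %w- for the capture lets names with spaces, such as "Sample OS", come through.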
Edited on 05 February 2013 - 06:31 AM
remiX #4
Posted 05 February 2013 - 07:40 AM
Yeah I've been messing around with string.gmatch but this is my first time using it so I'm kind of clueless!

I'm trying to get all four things into a table of a table:

t = {}

t[1] = {}

t[1].code = "First code it finds"
t[1].name = "First name it finds which has to match the code above"
t[1].desc = "First description it finds which matches the code/name"
t[1].timestamp = "The posted time"

So it can easily be printed, put together, etc.

What I have made over the past 30 mins (yes, I know it's bad xD)

t = {}

term.clear() term.setCursorPos(1, 1)

local response = http.get("http://cc-youtube.webs.com/apps/links/")
l = response.readAll()
--[[while l do
	table.insert(t, l)
	l = response.readLine()
end]]
response.close()

i = 1
for code in l:gmatch('<a href="http://pastebin.com/(.-)" target="_blank">') do
	b = false
	for q = 1, #t do
		if t[q].code == code then b = true end
	end
	if not b then
		t[i] = {}
		t[i].code = code
		print(code)
		for name in l:gmatch('<a href="http://pastebin.com/' .. code .. ' target="_blank">(.-)</a>') do
			t[i].name = name
			print(name)
			break
		end
		for desc in l:gmatch('<p class="linkDescription">(.-)</p>') do
			t[i].desc = desc
			print(desc)
			break
		end
		for td in l:gmatch(' Posted (.-) (.-) ') do
			t[i].postDate = td
			print(td)
			break
		end
		i = i + 1
		sleep(5)
	end
end

for z = 1, #t do
	print(t[i].name .. " (" .. t[i].code .. ") - " .. t[i].postDate .. "\n")
	print(t[i].desc .. "\n")
end

I know that you can use something like
for content in string.gmatch(urlText, "pastebin%.com/(%w+)") do print(content) end
I'm able to do that but now how do I add it into a table in the right index, etc.

Going to take a look at your SmartPaste program now …
Cranium #5
Posted 05 February 2013 - 07:46 AM
My mistake, the lines I gave you were wrong. I meant to say starting around 685.
remiX #6
Posted 05 February 2013 - 08:06 AM
Yeah I used find to find it… But getting specific information from pastebin is easier because everything is encased in <> and you're inserting everything into one table. Would that work for me? And then combine them at the end. I think it would but I have no clue how xD
Cranium #7
Posted 05 February 2013 - 08:10 AM
Well, for each variable you are trying to match, you are going to need a new string.gmatch command, but you can put them all in the same table with a different index. Like response.name[1] would be the first instance it returns with the name variable. So it would be written to like this:


-- text holds the page source; each field gets its own list
response = { name = {}, code = {} }
for name in string.gmatch(text, "matchCommand") do
	table.insert(response.name, name)
end
for code in string.gmatch(text, "matchCommand2") do
	table.insert(response.code, code)
end
It is a super simplified example, because I just don't want to have to write out the whole command.
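
A rough sketch of how the per-field lists could be filled and then combined into one table of entries. The sample text and the names fields and programs are assumptions for illustration, and it relies on entry i of every list belonging to the same link.

```lua
-- Hypothetical sample text; each link has one code, one name and one description.
local text = [[
<a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
<p class="linkDescription">Test Description</p>
<a href="http://pastebin.com/Xd01uZgc" target="_blank">Sample OS</a>
<p class="linkDescription">My Sample OS</p>
]]

-- One list per field. A pattern with two captures yields both values per match.
local fields = { code = {}, name = {}, desc = {} }
for code, name in text:gmatch('<a href="http://pastebin%.com/(%w+)"[^>]*>([^<]-)</a>') do
	table.insert(fields.code, code)
	table.insert(fields.name, name)
end
for desc in text:gmatch('<p class="linkDescription">(.-)</p>') do
	table.insert(fields.desc, desc)
end

-- Combine by index: entry i of each list is assumed to describe the same link.
local programs = {}
for i = 1, #fields.code do
	programs[i] = { code = fields.code[i], name = fields.name[i], desc = fields.desc[i] }
end

for _, p in ipairs(programs) do
	print(p.name .. " (" .. p.code .. "): " .. p.desc)
end
```

If a link on the page were missing its description, the lists would fall out of step, so this only holds while every entry has every field.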
remiX #8
Posted 05 February 2013 - 08:15 AM
That won't work because there will be more than 1 code/name/description/title.

edit: misread, I'll check what I can do now…

edit2: Btw, forgot to ask: how do I get the full date?


using string.gmatch(text, "Posted (.-)") returns "February"

Posted February 4, 2013 at 12:07 PM
Cranium #9
Posted 05 February 2013 - 08:29 AM
You can do

string.gmatch(text, '<div class="linkTimestamp">(.-)</div>')
That should capture anything within those tags (text being the page source).
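
A small sketch of why a lazy capture followed by a space only returns "February", and one possible way to get the whole date without the surrounding whitespace. The sample string and variable names are made up for illustration.

```lua
-- Hypothetical sample reproducing the indentation inside the timestamp div.
local text = [[
<div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>
]]

-- A lazy capture followed by a space stops at the first space it reaches.
local firstWord = text:match("Posted (.-) ")

-- %s* absorbs the newlines and indentation, so only the date text is captured.
local stamp = text:match('<div class="linkTimestamp">%s*Posted%s+(.-)%s*</div>')

print(firstWord)
print(stamp)
```

Anchoring the capture on the closing </div> forces it to run to the end of the date instead of stopping early.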
remiX #10
Posted 05 February 2013 - 08:40 AM
Yeah but it has spaces:



<div class="linkTimestamp">
                          Posted February 4, 2013 at 12:07 PM
                        </div>

Anyway, looks like I got it!


i = 1
for code in sourceText:gmatch('><a href="http://pastebin.com/(.-)" target="_blank">') do
	t_Programs[i] = {}
	t_Programs[i].code = code
	i = i + 1
end

for k = 1, #t_Programs do
	-- look up the link text that belongs to this entry's code
	for title in sourceText:gmatch('<a href="http://pastebin.com/' .. t_Programs[k].code .. '" target="_blank">(.-)</a>') do
		t_Programs[k].name = title
	end
end

i = 1
for desc in sourceText:gmatch('<p class="linkDescription">(.-)</p>') do
	t_Programs[i].desc = desc
	i = i + 1
end

i = 1
for pDate in sourceText:gmatch([[<div class="linkTimestamp">
                          Posted (.-)
                        </div>]]) do
	t_Programs[i].postDate = pDate
	i = i + 1
end

for z = 1, #t_Programs do
	print(t_Programs[z].name .. " (" .. t_Programs[z].code .. ") - " .. t_Programs[z].postDate .. "\n")
	print(t_Programs[z].desc .. "\n")
end

Thanks :P
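
A possible single-pass variant, sketched under the assumption that every entry sits in its own <li class="link"> ... </li> block as in the first post. It pulls all four fields out of one block at a time, so they cannot drift out of step. The sample text is trimmed test data, not the real page.

```lua
-- Hypothetical sample: two <li class="link"> entries, trimmed from post #1.
local sourceText = [[
<li class="link" id="link_7016734">
  <td class="img"><a href="http://pastebin.com/0qGFzhrU" target="_blank"><img src="a.png"></a></td>
  <a href="http://pastebin.com/0qGFzhrU" target="_blank">Test</a>
  <p class="linkDescription">Test Description</p>
  <div class="linkTimestamp">
    Posted February 4, 2013 at 12:07 PM
  </div>
</li><li class="link" id="link_7016751">
  <a href="http://pastebin.com/Xd01uZgc" target="_blank">Sample OS</a>
  <p class="linkDescription">My Sample OS</p>
  <div class="linkTimestamp">
    Posted February 4, 2013 at 12:20 PM
  </div>
</li>
]]

local t_Programs = {}
-- Walk the page one link block at a time, then match each field inside it.
for block in sourceText:gmatch('<li class="link".-</li>') do
	-- Two captures: the paste code and the link text (the image link is skipped
	-- because its content starts with a tag rather than plain text).
	local code, name = block:match('<a href="http://pastebin%.com/(%w+)"[^>]*>([^<]-)</a>')
	local desc = block:match('<p class="linkDescription">(.-)</p>')
	local postDate = block:match('Posted%s+(.-)%s*</div>')
	if code then
		table.insert(t_Programs, { code = code, name = name, desc = desc, postDate = postDate })
	end
end

for _, p in ipairs(t_Programs) do
	print(p.name .. " (" .. p.code .. ") - " .. tostring(p.postDate))
	print(tostring(p.desc))
end
```

The %s patterns also mean the timestamp match no longer depends on the exact indentation of the page.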
Cranium #11
Posted 05 February 2013 - 08:41 AM
Glad I could help!