This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.

Verifying that two files are the same

Started by oeed, 03 May 2014 - 07:16 AM
oeed #1
Posted 03 May 2014 - 09:16 AM
So, I've been getting a ton of crash reports from OneOS and a huge amount (over 80%) are from people mucking around with the code. Now, I don't have a problem with this really, it's just that they're meant to set a variable at the top of startup to true to prevent these reports. But they aren't doing that.

To try and conquer this I've decided that the best way would be to detect whether any of the system files have been altered and, if so, not send the error report. I'm aware that there are lots of programs and AAP topics that hash strings and files, but none specifically about this. However, as I've barely ever used or looked into checksums or those sorts of things, I'm not really sure what the best way of doing this is. Storage space is also one of my main concerns: OneOS already uses about 80% of the default storage space and these system files are about 100KB, so I can't have huge hashes of them. There are about 30 of them too. Speed is also an issue; I don't want it freezing for 10 seconds while it compares all the files.

So yea, what would the best way to do this be? I'm thinking of storing the hashes, or whatever I use, in a table keyed by file name. Checking if the file sizes are different is another possibility, but it may not be accurate enough.
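To make the file-size idea concrete, here is a minimal sketch, not anything taken from OneOS. The `sizesMatch` helper and the `expectedSizes` table are made-up names, and file contents are passed in as strings so it runs outside CC (inside CC, `fs.getSize(path)` could supply the size instead). It also shows why sizes alone may not be accurate enough: an edit that keeps the length the same goes unnoticed.

```lua
--# Hypothetical helper: compare each file's length against a stored size.
--# `expectedSizes` maps file name -> size in bytes recorded at release time.
--# `contents` maps file name -> current content.
local function sizesMatch(expectedSizes, contents)
  for name, size in pairs(expectedSizes) do
    local current = contents[name]
    if current == nil or #current ~= size then
      return false, name --# missing or resized file
    end
  end
  return true
end

local expectedSizes = { startup = 11 }

print(sizesMatch(expectedSizes, { startup = "print('hi')" }))  --# unmodified
print(sizesMatch(expectedSizes, { startup = "print('hi!')" })) --# one byte longer, caught
print(sizesMatch(expectedSizes, { startup = "print('ho')" }))  --# same length, edit missed
```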
Edited on 03 May 2014 - 07:19 AM
logsys #2
Posted 03 May 2014 - 09:22 AM
It would be faster if it uploaded the file to a PHP script, which calculated the MD5 hash there and returned true or false. It would save some time.
Bomb Bloke #3
Posted 03 May 2014 - 09:32 AM
It sorta strikes me that an easier route would be to have error reporting off by default - if an error occurs that otherwise would be reported, you'd then throw up a message telling the user to consider enabling it. This'd give you an opportunity to force your "don't bug me with error reports if you're modifying my code" message somewhere it'll be seen before a user starts sending you "false positives".
logsys #4
Posted 03 May 2014 - 10:04 AM
I have an idea: put an empty line at the end of the file. If the code was modified, that line's number will change.
awsmazinggenius #5
Posted 04 May 2014 - 03:47 AM
Except you can put multiple Lua statements/whatever in one line, so if someone really wanted to be a pain, or they just can't read or they know nothing about conventions, they could take that route.

[dumb moment]I would take the SHA1 hash of the computer's content and compare it to the hash of the latest commit on GitHub, since the readme gets downloaded and everything. You would upload the computer, hash it server-side (just in case someone modifies the hash-checking code) and compare it.[/dumb moment]
EDIT: I'm an idiot, that way, you couldn't save files on the computer. You would only upload the default files.
Edited on 04 May 2014 - 01:52 AM
oeed #6
Posted 04 May 2014 - 03:56 AM
Except you can put multiple Lua statements/whatever in one line, so if someone really wanted to be a pain, or they just can't read or they know nothing about conventions, they could take that route.

[dumb moment]I would take the SHA1 hash of the computer's content and compare it to the hash of the latest commit on GitHub, since the readme gets downloaded and everything. You would upload the computer, hash it server-side (just in case someone modifies the hash-checking code) and compare it.[/dumb moment]
EDIT: I'm an idiot, that way, you couldn't save files on the computer. You would only upload the default files.
If someone's modifying the hashing code then they're just being malicious, that's another issue. The people causing this just aren't reading.
Would it be possible to create a single hash from a few files? I don't want a hash of everything, people have their own documents and settings files. I don't want to hash it server side either, uploading about a megabyte of files isn't practical.
awsmazinggenius #7
Posted 04 May 2014 - 04:03 AM
It should be, Git makes hashes from multiple files, so this should just be a matter of picking and choosing which files to hash.
Bomb Bloke #8
Posted 04 May 2014 - 04:15 AM
If you choose to take the hashing route, you'll indeed be better off implementing the hashing code into the local copy of the script rather than sending the data to be hashed off to the internet somewhere. Either way you've got to read the complete set of data off the drive and it really doesn't take that long. My system can easily hash a hundred megs of data within a second or three.

This may sound obvious, but I'm getting the impression you're not familiar with file hashes and checksums, so I'll point out that the hashes themselves need not be more than a few dozen bytes at most (their size is typically not related to the size of the files being hashed). Whether you have one or multiple hashes for your entire set of files is completely up to you.
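As a rough illustration of the point about hash size, here is a small sketch. Adler-32 is used purely as a stand-in because it fits in a few lines of plain Lua; it is not one of the algorithms recommended in this thread, and `adler32` is just a name chosen for the example. Whatever the input length, the digest stays at 8 hex characters.

```lua
--# Adler-32, used here only as a small stand-in checksum.
local function adler32(data)
  local a, b = 1, 0
  for i = 1, #data do
    a = (a + data:byte(i)) % 65521
    b = (b + a) % 65521
  end
  return string.format("%08x", b * 65536 + a)
end

local small = adler32("Wikipedia")
local large = adler32(string.rep("x", 100000))
print(#small, #large) --# both digests are 8 hex characters long
```

Storing one such value per file for about 30 files would cost a few hundred bytes at most.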
oeed #9
Posted 04 May 2014 - 04:18 AM
This may sound obvious, but I'm getting the impression you're not familiar with file hashes and checksums, so I'll point out that the hashes themselves need not be more than a few dozen bytes at most (their size is typically not related to the size of the files being hashed). Whether you have one or multiple hashes for your entire set of files is completely up to you.

Ah ok, that makes a lot more sense actually. If I were to use GravityScore's API for example, would the best way to combine all the files into one be to put all the files into one string and hash that, or is there a better way?
awsmazinggenius #10
Posted 04 May 2014 - 04:38 AM
You should hash with SHA1, in the same way that Git hashes multiple files, so you can easily compare to the hash of the latest commit on GitHub. It will also allow you to not report the error if it is for an outdated version of OneOS.
theoriginalbit #11
Posted 04 May 2014 - 04:40 AM
It will also allow you to not report the error if it is for an outdated version of OneOS.
it would also be handy for an update checker.
Edited on 04 May 2014 - 02:40 AM
oeed #12
Posted 04 May 2014 - 04:52 AM
You should hash with SHA1, in the same way that Git hashes multiple files, so you can easily compare to the hash of the latest commit on GitHub. It will also allow you to not report the error if it is for an outdated version of OneOS.
Ah I see, so I wouldn't even have to make a list of hashes. So I take it you'd just read the file content into a string, hash that, then compare it.
HometownPotato #13
Posted 04 May 2014 - 04:53 AM
Couldn't you just do something like this:

local x = fs.open("file1", "r") --# fs.open needs a mode; "r" opens for reading
local c = x.readAll()
x.close()

x = fs.open("file2", "r")
local c2 = x.readAll()
x.close()

if c == c2 then
  --# the two files have identical contents
end

It would check if the contents are the same
oeed #14
Posted 04 May 2014 - 05:13 AM
Couldn't you just do something like this:

local x = fs.open("file1", "r") --# fs.open needs a mode; "r" opens for reading
local c = x.readAll()
x.close()

x = fs.open("file2", "r")
local c2 = x.readAll()
x.close()

if c == c2 then
  --# the two files have identical contents
end

It would check if the contents are the same
The context behind what I'm trying to do is important. I don't have the original copy of the file nor can I download it really due to both file size issues and the number of files.
logsys #15
Posted 04 May 2014 - 11:40 AM
In PHP there is an MD5 hash function. Put all the content into one string and send it to PHP. In PHP, compare the resulting MD5 against the expected MD5 value, then return true or false.
viluon #16
Posted 04 May 2014 - 12:05 PM
Hmm, wouldn't hashing all the files one by one and then hashing the hashes be better?

You could then compare the two hashes - local and github
Edited on 04 May 2014 - 10:06 AM
awsmazinggenius #17
Posted 04 May 2014 - 02:51 PM
I'm not sure exactly how Git goes about SHA1ing multiple files - whether it just concatenates them, or hashes them all and then hashes the hashes, or if it concatenates them but puts some character in between the files to separate them, etc. - but you would want to do it how Git does it, which might take a little research. I'll see if I have time to look into it.

EDIT: The point is not to have to hash the local files and the GitHub files, it is to be able to compare to the already available SHA1 hash of the latest commit on GitHub.
Edited on 04 May 2014 - 12:53 PM
Saldor010 #18
Posted 04 May 2014 - 02:54 PM
I think I have an idea. How about, whenever someone opens up a core file of OneOS, it will pop up with a message saying "Are you about to modify OneOS's code? If so, I suggest you turn off "Auto Report"."
MKlegoman357 #19
Posted 04 May 2014 - 02:57 PM
I think I have an idea. How about, whenever someone opens up a core file of OneOS, it will pop up with a message saying "Are you about to modify OneOS's code? If so, I suggest you turn off "Auto Report"."

And what if I edit those files outside of OneOS, outside of CC?
Saldor010 #20
Posted 04 May 2014 - 03:10 PM
I think I have an idea. How about, whenever someone opens up a core file of OneOS, it will pop up with a message saying "Are you about to modify OneOS's code? If so, I suggest you turn off "Auto Report"."

And what if I edit those files outside of OneOS, outside of CC?

Then oh well. It's not OneOS's job to track all of that.
awsmazinggenius #21
Posted 04 May 2014 - 03:19 PM
Yeah, that just won't cut it, as CC sucks for editing large Lua files, or programs that you want to be polished enough to release.

I've found half of how Git hashes multiple files, here is how Git finds the hash for a single file. It isn't exactly sha1(handle.readAll()).
Edit: forgot to link to it, but it is basically this:

sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua. 
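For what it's worth, here is a small sketch of building that string in Lua. It only assembles the input for the hash and leaves the SHA-1 step to whichever implementation you pick; `gitBlobPreimage` is a made-up name for this example. Lua 5.1 accepts `"\0"` as a null byte, and `#content` gives the length in bytes.

```lua
--# Builds the string Git feeds to SHA-1 for a single file (a "blob").
--# Hashing is left out; pass the result to whichever sha1 function you use.
local function gitBlobPreimage(content)
  return "blob " .. #content .. "\0" .. content
end

local preimage = gitBlobPreimage("hello\n")
--# The header "blob 6", one null byte, then the 6 bytes of content.
print(#preimage)
```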
Edited on 04 May 2014 - 04:45 PM
skwerlman #22
Posted 05 May 2014 - 06:11 PM
Yeah, that just won't cut it, as CC sucks for editing large Lua files, or programs that you want to be polished enough to release.

I've found half of how Git hashes multiple files, here is how Git finds the hash for a single file. It isn't exactly sha1(handle.readAll()).
Edit: forgot to link to it, but it is basically this:

sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua.
I'm pretty sure lua supports (some? most?) standard C escape sequences, including \nnn (in decimal) for ASCII chars (including null bytes).
blipman17 #23
Posted 05 May 2014 - 06:42 PM
What about a file that stores the exact number of times every single letter is used in your code, which you then compare your current code against? As far as I know it would be a bit faster than hashing, although two different files ending up with the same count of every character is more likely than a collision with a good hashing algorithm.
skwerlman #24
Posted 05 May 2014 - 06:50 PM
what about a file that stores the exact quantity every single letter is used in your code? and compare your current code with it? It would as far as I know a bit faster that hashing. Although the possibility for an equal amounth of all characters is more likely than with a good hashing algorithm.
If someone replaces a line with one of the same length, no difference would be detected.

EDIT: A simple, temporary fix would be to ask whether the user would like to report the error. That way, if someone generates 20 errors in 7 min, they'd have to confirm all 20 error reports. Obviously this won't stop someone from being malicious, but it should help prevent unintentional report spam.
Edited on 05 May 2014 - 05:01 PM
MKlegoman357 #25
Posted 05 May 2014 - 06:58 PM
If someone replaces a line with one of the same length, no difference would be detected.

The idea is not having it compare the file size, but how many time different characters appear. But it wouldn't make any difference if characters would only be shuffled around. For ex.:


local id = os.getComputerID()

--// Change that to:

local os.getComputerID = id() --// This one has the same amount of every letter the above one has

It's not likely that someone would change the code to be different but have the same amount of every letter the original code has, but it is still possible.
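A quick sketch of that tally check, using the two lines above as input, shows the blind spot. The function names `charCounts` and `sameCounts` are invented for this example.

```lua
--# Counts how many times each character occurs in a string.
local function charCounts(text)
  local counts = {}
  for i = 1, #text do
    local ch = text:sub(i, i)
    counts[ch] = (counts[ch] or 0) + 1
  end
  return counts
end

--# True when both strings contain exactly the same characters the same
--# number of times, regardless of order.
local function sameCounts(a, b)
  local ca, cb = charCounts(a), charCounts(b)
  for ch, n in pairs(ca) do
    if cb[ch] ~= n then return false end
  end
  for ch, n in pairs(cb) do
    if ca[ch] ~= n then return false end
  end
  return true
end

local original = "local id = os.getComputerID()"
local shuffled = "local os.getComputerID = id()"
print(original == shuffled)           --# false: the text differs
print(sameCounts(original, shuffled)) --# true: the tallies cannot tell them apart
```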
Edited on 05 May 2014 - 05:00 PM
skwerlman #26
Posted 05 May 2014 - 07:05 PM
If someone replaces a line with one of the same length, no difference would be detected.

The idea is not having it compare the file size, but how many time different characters appear. But it wouldn't make any difference if characters would only be shuffled around. For ex.:


local id = os.getComputerID()

--// Change that to:

local os.getComputerID = id() --// This one has the same amount of every letter the above one has

It's not likely that someone would change the code to be different but have the same amount of every letter the original code has, but it is still possible.
Oh, I misread that. That's certainly better than checking file size, but it sounds fairly slow, since each char is checked and tallied individually. Remember, OneOS is huge, so we need a relatively fast algorithm.
awsmazinggenius #27
Posted 05 May 2014 - 11:01 PM
Depending on the speed of your web server, it might actually be smart to send the files off somewhere. If the OS crashes, you can compress the code and send it off to the web where PHP calculates the md5 and checks it. You would need to have an algorithm to compress the code as best as you can, though, and then you'd need to reimplement it in PHP to decompress the code.

EDIT: The reason I say this is because it seems like calculating the SHA1 of the latest commit on GitHub won't work, because in this hash Git also includes the previous commit's hash. You could still, though, hash all the files (in this case, since we are not required to use SHA1, you would probably want to use Grav's SHA256 snippet, as (I would think) it has fewer collisions), concatenate the hashes and then hash again, then send this hash off to the web to check against the one you've calculated for the latest version of OneOS. Again, just a matter of picking and choosing what to hash, but also remembering to recalculate for each release.
Edited on 05 May 2014 - 09:05 PM
oeed #28
Posted 05 May 2014 - 11:04 PM
Depending on the wood of your web server, it might actually be smart to send the files off somewhere. If the OS crashes, you can compress the code and send it off to the web where PHP calculates the md5 and checks it. You would need to have an algorithm to compress the code as best as you can, though, and then you'd need to reimplement it in PHP to decompress the code.
Uploading 1MB on some connections (i.e. every single one in Australia) would take ages. I'll just SHA1 them and compare it to GitHub.
awsmazinggenius #29
Posted 05 May 2014 - 11:12 PM
Looking at what you quoted, you haven't seen my edit, as I also fixed an obvious spelling mistake in that edit. Also, I forgot about the varying-internet-speeds problem. Something makes me wonder how you play online games…
oeed #30
Posted 05 May 2014 - 11:23 PM
Looking at what you quoted, you haven't seen my edit, as I also fixed an obvious spelling mistake in that edit. Also, I forgot about the varying-internet-speeds problem. Something makes me wonder how you play online games…
Hmm I see. I might just make a hash each release and compare it to that.

I often wonder that too, as do many of the people on Cranium's server. I don't even want to mention what it's like when my brother plays GuildWars 2…..
awsmazinggenius #31
Posted 05 May 2014 - 11:29 PM
Yes, that is what will need to happen, as a Git SHA1 is not just the files, apparently. Just SHA256 (using Grav's snippet) all the files, concatenate the hashes in the same order each time (maybe "alphabetically" using 0-9 a-f (I don't know the word), but the only thing that matters is that it is in the same order each time), hash those concatenated hashes, and send 'em off to your server where you already handle reporting.
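Roughly, the "same order each time" part could look like the sketch below. It sorts by file name rather than by hash value, which is an assumption made for this example; either works as long as it is consistent. `combinedDigest`, `toyHash` and the file names are all made up here, and `toyHash` is only a stand-in so the sketch runs on its own. A real hash function (SHA256, SHA1, CRC32) would be passed in its place.

```lua
--# `files` maps file name -> content; `hashFn` is any string -> string hash.
local function combinedDigest(files, hashFn)
  local names = {}
  for name in pairs(files) do
    names[#names + 1] = name
  end
  table.sort(names) --# pairs() has no guaranteed order, so impose one

  local digests = {}
  for i, name in ipairs(names) do
    digests[i] = hashFn(files[name])
  end
  return hashFn(table.concat(digests))
end

--# Toy stand-in hash so the sketch is self-contained; not suitable for real use.
local function toyHash(text)
  local sum = 0
  for i = 1, #text do
    sum = (sum * 31 + text:byte(i)) % 65536
  end
  return string.format("%04x", sum)
end

--# Hypothetical file names and contents.
local release = { startup = "print('hi')", ["System/API/Log.lua"] = "--# log" }
print(combinedDigest(release, toyHash))
```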
theoriginalbit #32
Posted 06 May 2014 - 12:45 AM

sha1("blob "..filesize.."\0"..filecontent)
--# where \0 is a null byte. I'm not sure how you do null bytes in Lua.
yes you're correct with the \0
awsmazinggenius #33
Posted 06 May 2014 - 04:14 AM
Essentially this pseudo-code:
(Sorry for mistakes, written on iPad)

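local hashes = {}
--# ipairs keeps the files in list order, so the result is the same every run
for _, filename in ipairs(filenames) do
  local h = fs.open(filename, "r")
  hashes[#hashes + 1] = sha256(h.readAll())
  h.close() --# release the file handle
end
local finalhash = sha256(table.concat(hashes))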
Edited on 06 May 2014 - 10:56 PM
MKlegoman357 #34
Posted 06 May 2014 - 12:45 PM
The problem I see with hashing it with sha256 is that there would be over 30 hash calculations (IIRC). Those files are big too. Wouldn't that be quite slow?
awsmazinggenius #35
Posted 07 May 2014 - 12:58 AM
If you have a decent computer, no. And SHA256 has less collisions than SHA1, so, why not?
theoriginalbit #36
Posted 07 May 2014 - 01:01 AM
If you have a decent computer, no. And SHA256 has less collisions than SHA1, so, why not?
In industry practice, file integrity is checked with CRC32, MD5, or SHA1… it's because these are enough; anything more is just overkill.
Edited on 06 May 2014 - 11:02 PM
skwerlman #37
Posted 07 May 2014 - 04:15 AM
The fastest pure-lua SHA1 implementation I've found is here.
The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Finally, the only pure-lua implementation of MD5 written for 5.1 (again, that I could find) is here.

All three appear to be released under the MIT license.

I hope one of these works well enough for this application.

EDIT: I forgot to mention that I haven't had time to actually test them in CC.
Edited on 07 May 2014 - 02:31 AM
oeed #38
Posted 07 May 2014 - 06:16 AM
On the performance aspect of this, it's worth noting that GravityScore's version actually runs quicker than the SHA1 implementations he's tried.
theoriginalbit #39
Posted 07 May 2014 - 07:02 AM
Honestly oeed I think your best method is to just do CRC's for each version, have a folder in the repo for the CRCs, each file has the system version number, the contents of the file are the full path of the file and the CRC for it. Download that CRC file and compare against the system. That was the easiest and quickest solution that NeverCast and I could find to implement for CCTube.
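A rough sketch of that approach, under a few assumptions: the manifest is shown as an in-memory table of path to CRC rather than a downloaded file, `findModified` and `readFile` are names invented for this example, and XOR is done with plain arithmetic so it runs in any Lua (inside CC the `bit` API's `bit.bxor` could likely replace the helper and be quicker). This is not the CCTube or OneOS code.

```lua
--# Bitwise XOR built from plain arithmetic so the sketch runs in any Lua.
local function bxor(a, b)
  local result, bitValue = 0, 1
  while a > 0 or b > 0 do
    if a % 2 ~= b % 2 then
      result = result + bitValue
    end
    a, b = math.floor(a / 2), math.floor(b / 2)
    bitValue = bitValue * 2
  end
  return result
end

--# Lookup table for the standard reflected CRC-32 polynomial 0xEDB88320.
local crcTable = {}
for i = 0, 255 do
  local c = i
  for _ = 1, 8 do
    if c % 2 == 1 then
      c = bxor(0xEDB88320, math.floor(c / 2))
    else
      c = math.floor(c / 2)
    end
  end
  crcTable[i] = c
end

local function crc32(data)
  local crc = 0xFFFFFFFF
  for i = 1, #data do
    local index = bxor(crc % 256, data:byte(i))
    crc = bxor(crcTable[index], math.floor(crc / 256))
  end
  return bxor(crc, 0xFFFFFFFF)
end

--# `manifest` maps file path -> expected CRC; `readFile` returns a file's
--# content or nil (in CC it would wrap fs.open(path, "r") and readAll()).
local function findModified(manifest, readFile)
  local modified = {}
  for path, expected in pairs(manifest) do
    local content = readFile(path)
    if content == nil or crc32(content) ~= expected then
      modified[#modified + 1] = path
    end
  end
  table.sort(modified)
  return modified
end

print(string.format("%08x", crc32("123456789"))) --# cbf43926, the standard check value
```

If `findModified` returns an empty table the system files match the release, so the error report can be sent; otherwise it can be skipped.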

The fastest pure-lua SHA1 implementation I've found is here.
That is extremely slow! This is the fastest SHA1 implementation I've found, and it's made by someone on these forums; it's a near-instant calculation compared to the one you've linked. I've also performed some cleanup on the code, which can be found here.

The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Another cleanup to get it working nicely in ComputerCraft found here ;)
Edited on 07 May 2014 - 05:25 AM
skwerlman #40
Posted 07 May 2014 - 06:53 PM
Honestly oeed I think your best method is to just do CRC's for each version, have a folder in the repo for the CRCs, each file has the system version number, the contents of the file are the full path of the file and the CRC for it. Download that CRC file and compare against the system. That was the easiest and quickest solution that NeverCast and I could find to implement for CCTube.

The fastest pure-lua SHA1 implementation I've found is here.
That is extremely slow! This is the fastest SHA1 implementation I've found, and its made by someone on these forums; its a near instant calculation compared to the one you've linked. I've also performed some cleanup on the code which can be found here.

The only 5.1 CRC32 implementation that appears to be CC-compatible (that I could find) is here. You'll need to comment out the first line of actual code, though. (module('CRC32', package.seeall))
Another cleanup to get it working nicely in ComputerCraft found here ;)
Wow, that SHA1 routine is stupid fast compared to the ones I've seen! Nice find!

Doesn't removing the license info constitute a violation of the license?
MIT License said:
--The above copyright notice and this permission notice shall be included in all
--copies or substantial portions of the Software.
theoriginalbit #41
Posted 07 May 2014 - 11:08 PM
Wow, that SHA1 routine is stupid fast compared to the ones I've seen! Nice find!
Thanks… an advantage of being on the forums for so long.

Doesn't removing the license info constitute a violation of the license?
yes and no. I've added the license back into the SHA1 (that was just forgetting to copy it).
awsmazinggenius #42
Posted 10 May 2014 - 01:21 AM
Yeah, I guess you should use the SHA1 snippet BIT linked, though using SHA256 honestly happens almost instantly on my machine.
theoriginalbit #43
Posted 10 May 2014 - 02:03 AM
Yeah, I guess you should use the SHA1 snippet BIT linked, though using SHA256 honestly happens almost instantly on my machine.
the thing is, it's just overkill. there's a reason that CRC32, MD5, and SHA1 are used in industry to verify integrity of files, and it's just 'cause it's all you need; anything more is just extra unrequired processing. and when dealing with a large quantity of files like oeed will be, it's just best to stick to something simple.
oeed #44
Posted 11 May 2014 - 05:19 AM
Yeah, I guess you should use the SHA1 snippet BIT linked, though using SHA256 honestly happens almost instantly on my machine.
the thing is, is its just overkill. there's a reason that CRC32, MD5, and SHA1 are used in industry to verify integrity of files, and its just 'cause its all you need, anything more is just extra unrequited processing. and when dealing with a large quantity of files like oeed will be, its just best to stick to something simple.
Yea, I've just tried using GravityScore's SHA-256 API and even a single file is too big really, it just grinds to a halt.

Anyone know of a good MD5 API for CC? I haven't really been able to find one that works.
theoriginalbit #45
Posted 11 May 2014 - 05:26 AM
Anyone know of a good MD5 API for CC, I haven't really been able to find one that works.
clearly haven't read the last few replies on the last page… :P
awsmazinggenius #46
Posted 11 May 2014 - 05:27 AM
I would go with SHA1 over MD5, but that's (sort-of) just a matter of preference.
oeed #47
Posted 11 May 2014 - 07:12 AM
Anyone know of a good MD5 API for CC, I haven't really been able to find one that works.
clearly haven't read the last few replies on the last page… :P
*facepalm*

I would go with SHA1 over MD5, but that's (sort-of) just a matter of preference.
I'll see which one is quicker and use that. I'm not overly concerned about accuracy, if one slips through it's not the end of the world.

Edit: CRC32 was quicker, so I'll be using that.

Edit 2: Got it all working now, it's doing its job excellently. Thanks everyone!
Edited on 11 May 2014 - 10:02 AM
awsmazinggenius #48
Posted 11 May 2014 - 10:40 PM
As in, if you were to release the next release of OneOS now the hashing part will all be working? Great!
oeed #49
Posted 11 May 2014 - 10:44 PM
As in, if you were to release the next release of OneOS now the hashing part will all be working? Great!

Yep, it's all done and working.
viluon #50
Posted 19 May 2014 - 03:51 PM
As in, if you were to release the next release of OneOS now the hashing part will all be working? Great!

Yep, it's all done and working.
I suggest editing the original post, so people around here don't have to read all the replies.