This is a read-only snapshot of the ComputerCraft forums,
taken in April 2020.
Learning about Compression
Started by Selim, 24 May 2016 - 09:37 PM
Posted 24 May 2016 - 11:37 PM
I am currently trying to learn how file compression works. I understand that it usually looks for redundancy in the file, but from there I have no clue. So, if anyone is knowledgeable about this and would like to help me out, I would appreciate it, or if you have a link to a good resource.
Posted 25 May 2016 - 12:00 AM
Moved to Ask a Pro.
Posted 25 May 2016 - 12:03 AM
I didn't think it belonged in AAP personally (as it isn't specifically Lua), but you are the mod.–snip–
Posted 25 May 2016 - 12:26 AM
OOH I just got done spending ages delving into file compression so allow me to try and help you here! (bomb bloke is better at this so he can probably help further)
File compression works in a number of ways using some type of algorithm. From what I have seen there are two major types: algorithms that lessen redundancy and (attempt to) lessen file size, and algorithms that raise redundancy and keep file size about the same. For the rest of this I will refer to entropy instead of redundancy (high redundancy means low entropy). The entropy of a file determines how "compressible" it is; entropy is essentially how random the data in a file is. The more you compress a file, the greater its entropy and thus the harder it is to compress further.
The most common type of compression I see is called RLE, or run length encoding. This is one of the simplest to implement, in my opinion, while still being quite effective. It stores the "count" of a run of identical bytes, so a run is saved as count, byte or something similar. The most common compression algorithms I see in major file types are also Huffman and LZW. Huffman is a variable-bit encoding algorithm: it makes various characters take up more or fewer than 8 bits depending on how common they are. LZW I haven't actually programmed into anything yet, so I think I will let bomb bloke explain that one since he knows it very well and I know he will show up. More minor algorithms include the BWT and MTF. These aim to lower entropy instead of lowering file size, so that other algorithms such as RLE can compress further and more effectively. There are tons more algorithms designed for specific situations. If you would like to see MOST of the ones I listed in Lua, to help you understand exactly how everything is working, feel free to see the below paste.
compression algorithms
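To make the RLE idea above concrete, here is a minimal sketch in Lua (a toy version, not the code from that paste). Each run is stored as a count byte followed by the byte itself, with the count capped at 255 so it fits in one byte:
-- Run-length encode: each run becomes <count><byte>, with the count
-- capped at 255; longer runs are simply split into several pieces.
local function rleEncode(data)
  local out, i = {}, 1
  while i <= #data do
    local byte = data:sub(i, i)
    local runLen = 1
    while runLen < 255 and data:sub(i + runLen, i + runLen) == byte do
      runLen = runLen + 1
    end
    out[#out + 1] = string.char(runLen) .. byte
    i = i + runLen
  end
  return table.concat(out)
end

-- Decoding just reads <count><byte> pairs and expands them again.
local function rleDecode(data)
  local out = {}
  for i = 1, #data, 2 do
    out[#out + 1] = string.rep(data:sub(i + 1, i + 1), data:byte(i))
  end
  return table.concat(out)
end

local original = ("A"):rep(1000) .. "BC"
print(#rleEncode(original), rleDecode(rleEncode(original)) == original) --> 12  true
It does wonders on long runs, but data with no runs at all would actually double in size with this scheme.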
Posted 25 May 2016 - 12:58 AM
So, it attempts to lower entropy (not file size) so that another algorithm can be run to then reduce the entropy and the file size with it?
Posted 25 May 2016 - 02:05 AM
Algorithms like BWT and MTF try to lower the entropy; algorithms like Huffman, RLE, and LZW try to lower file size by increasing entropy.
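To show what MTF actually does to the data, here is a rough move-to-front sketch in Lua (just an illustration, not taken from any paste in this thread). Recently seen bytes get small output values, so repetitive input turns into lots of zeroes, which something like RLE or Huffman can then squeeze down:
-- Move-to-front: output each byte's current position in a working
-- alphabet, then move that byte to the front. Repeated bytes come
-- out as 0s, which are much easier to compress afterwards.
local function mtfEncode(data)
  local alphabet = {}
  for i = 0, 255 do alphabet[i + 1] = i end
  local out = {}
  for i = 1, #data do
    local byte = data:byte(i)
    for pos = 1, 256 do
      if alphabet[pos] == byte then
        out[i] = pos - 1
        table.remove(alphabet, pos)
        table.insert(alphabet, 1, byte)
        break
      end
    end
  end
  return out
end

print(table.concat(mtfEncode("aaab"), " ")) --> 97 0 0 98
The output is the same length as the input, it just looks a lot less random.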
Posted 25 May 2016 - 02:10 AM
Recently I made a compression algorithm for in-game areas. While doing this I discovered that it is very important to think about compressing something very close to its origin. Let's say you want to make something like I did: a program that saves an in-game area to a really small file. You could save every block, put it into an array, write that down as JSON or so, and then compress the file. This would take a lot of space. A better idea is to ignore the most common block, like air, to reduce the amount of stored data. Also, saving stuff as bits and bytes and not as strings is really useful; this was probably the largest improvement for my project. After that you can still compress it down with some algorithm, if you can.
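As a rough illustration of the "bytes instead of strings" point (the block layout and IDs here are completely made up, just to show the idea):
-- Toy example: store only non-air blocks, each as four raw bytes
-- (x, y, z, id), instead of serialising the whole table as text.
-- Assumes coordinates and IDs fit in 0-255; a bigger area would
-- need more bytes or some bit packing per field.
local function packBlocks(blocks)   -- blocks = { {x=..., y=..., z=..., id=...}, ... }
  local out = {}
  for i = 1, #blocks do
    local b = blocks[i]
    if b.id ~= 0 then               -- treat id 0 as air and skip it entirely
      out[#out + 1] = string.char(b.x, b.y, b.z, b.id)
    end
  end
  return table.concat(out)          -- 4 bytes per stored block
end

local area = { {x = 1, y = 2, z = 3, id = 4}, {x = 1, y = 3, z = 3, id = 0} }
print(#packBlocks(area)) --> 4, instead of the dozens of characters JSON would take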
No idea what you want to compress (if you want to compress something at all), you might be able to find something similar done previously by someone else you can learn a lot from.
You might also want to look into something like inter-frame compression, which is used for video data. There are a lot of algorithms for different types of data, and you can learn something from each of them, too.
Posted 25 May 2016 - 02:28 AM
Compression basically centers around "information density". Consider a large (say 1000x1000) bitmap containing a representation of a polar bear in a snow storm (ie a pure white image) - the file size might be quite high, but there's not much information in there, so we'd say the information density is low. On the other hand, the words "1000 square pure white image" are relatively dense - but they're all you need to build the image itself. Compression aims to reduce the data to the minimum amount required to express the original set of data (or a close approximation of the original set of data, if you're prepared to use a lossy technique).
There are loads of different ways of doing this, and there are pros and cons to each.
For example, RLE is a very simple scheme, whereby you simply tally up "runs" of identical bytes, and replace them with a few codes saying "repeat this byte this many times". On a pure-white image, for example, or a text file full of spaces, whatever, that sort of compression will shrink things down nicely - and hardly any RAM or CPU power is required to perform it.
But what if the original data starts out a bit denser than that? For example, what if you created a text document out of a thousand completely random characters? There would hardly be any runs at all - and so attempting to apply RLE might even increase the file size!
This is where you might perform a Burrows-Wheeler Transform - this rather clever technique allows you to reshuffle the content of a set of data, such that the size remains the same, the result can be reversed, and runs become more common. This is not a form of compression in itself, but it does lower entropy within the file (creating low-density pockets of information), meaning that a real compression technique (such as RLE) can be used more effectively. You still pay for it - a BWT chews a bit of RAM and processor time.
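In code, the naive way of doing the forward transform looks something like this (a sketch only; note it builds every rotation in memory, which gets expensive for big inputs):
-- Naive BWT: build every rotation of the input (plus a unique end
-- marker), sort them, and keep only the last character of each.
local function bwtEncode(data, markChar)
  data = data .. markChar
  local rotations = {}
  for i = 1, #data do
    rotations[i] = data:sub(i) .. data:sub(1, i - 1)
  end
  table.sort(rotations)
  local out = {}
  for i = 1, #rotations do
    out[i] = rotations[i]:sub(-1)
  end
  return table.concat(out)
end

print(bwtEncode("banana", "$")) --> annb$aa
Same length as the input, but notice how identical characters have been pulled together into runs.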
Then you've got other techniques. LZW, for example, is a dictionary based system that works by building a table filled with streams of bytes, and replacing matches for those streams within the data to compress with just the table index values. The clever bit is that the dictionary table doesn't need to be saved alongside the output - the decompressor can figure out that table's content based on the order of the indexes found in the compressed file! Again, more work for the processor and RAM chips, but dramatically better results than basic RLE can provide.
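A bare-bones sketch of the compression side looks something like this (the codes are left as plain Lua numbers; a real implementation would pack them into 9+ bit fields):
-- LZW: the dictionary starts with all 256 single bytes and grows as
-- the input is scanned. The decompressor can rebuild the exact same
-- dictionary from the codes alone, so it is never written out.
local function lzwCompress(data)
  local dict, nextCode = {}, 257
  for i = 0, 255 do dict[string.char(i)] = i + 1 end
  local out, current = {}, ""
  for i = 1, #data do
    local c = data:sub(i, i)
    if dict[current .. c] then
      current = current .. c           -- keep extending the current match
    else
      out[#out + 1] = dict[current]    -- emit the code for the longest match
      dict[current .. c] = nextCode    -- and teach the dictionary a new entry
      nextCode = nextCode + 1
      current = c
    end
  end
  if current ~= "" then out[#out + 1] = dict[current] end
  return out
end

print(#lzwCompress("abababababababab")) --> 7 codes for 16 input bytes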
LZ77 is similar, only the dictionary file IS written to disk (making things a lot simpler for the decompressor). But if multiple files are compressed using the one dictionary file, you can still achieve a smaller total file size overall.
Then you've got Huffman's, which essentially removes slack space within bytes by replacing them with variable-length bit codes read off a binary tree. That technique also uses a dictionary, and needs to include it alongside the compressed output.
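The code assignment step looks roughly like this (only a sketch; the actual bit packing, and storing the table alongside the output, are left out):
-- Huffman: count byte frequencies, then repeatedly merge the two
-- least-frequent nodes into a tree. Common bytes end up near the
-- root (short codes), rare bytes end up deep (long codes).
local function huffmanCodes(data)
  local freq = {}
  for i = 1, #data do
    local c = data:sub(i, i)
    freq[c] = (freq[c] or 0) + 1
  end
  local nodes = {}
  for c, f in pairs(freq) do nodes[#nodes + 1] = {symbol = c, weight = f} end
  while #nodes > 1 do
    -- Re-sorting the whole list each pass is wasteful, but keeps the sketch short.
    table.sort(nodes, function(a, b) return a.weight < b.weight end)
    local left, right = table.remove(nodes, 1), table.remove(nodes, 1)
    nodes[#nodes + 1] = {left = left, right = right, weight = left.weight + right.weight}
  end
  local codes = {}
  local function walk(node, prefix)
    if node.symbol then
      codes[node.symbol] = prefix ~= "" and prefix or "0"
    else
      walk(node.left, prefix .. "0")
      walk(node.right, prefix .. "1")
    end
  end
  walk(nodes[1], "")
  return codes
end

for symbol, code in pairs(huffmanCodes("aaaabbc")) do print(symbol, code) end
-- "a" gets a one-bit code, while "b" and "c" each get two bits.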
When push comes to shove, though, there's a maximum amount of information density that you can sink into a given number of bytes. One byte can only represent 256 different values, after all - your choice of compression technique basically defines which values those are. Hence some compression techniques are better for some distributions of data, and some compression techniques are better for others. There's no one single, perfect method that suits all use cases ideally.
Posted 25 May 2016 - 04:38 PM
Update on my learning, I have working Burrows-Wheeler Transform encoding and decoding functions.
Posted 26 May 2016 - 02:56 PM
And, now, functional RLE functions, but my Burrows-Wheeler Transform functions are quite slow. Any tips on optimizing them?
Current code: i9LJ3edP
Edited on 26 May 2016 - 09:03 PM
Posted 27 May 2016 - 02:35 AM
It'd probably go a bit faster if you had some source of events to trigger resumes after each yield… I guess you're tapping the keyboard or something to get it to proceed?
This:
local returnValue = ""
for i=1,#rotations do
  returnValue = returnValue..string.sub(rotations[i],#rotations[i])
  coroutine.yield()
end
return returnValue
… would technically go faster like this (assuming any appreciable amount of data):
-- Collect the pieces in a table and join them once at the end,
-- instead of rebuilding an ever-longer string on every pass.
local returnValue = {}
for i=1,#rotations do
  local thisRotation = rotations[i]
  returnValue[i] = thisRotation:sub(#thisRotation)
end
return table.concat(returnValue)
… but a far better speed gain is available for your decoder, where you only need to perform one sort (and don't need to populate a table with full-sized strings, either). This article covers the relevant technique.
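Roughly, the decoder only needs something like this (a sketch of the general idea rather than that article's exact code, and it assumes the same trick of appending a unique markChar before encoding):
-- Inverse BWT with a single sort: sort the positions of the encoded
-- string by character (ties keep their original order), then follow
-- that mapping once per character, starting from the row that ends
-- with the marker (that row is the original string).
local function bwtDecode(encoded, markChar)
  local n = #encoded
  local order = {}
  for i = 1, n do order[i] = i end
  table.sort(order, function(a, b)
    local ca, cb = encoded:sub(a, a), encoded:sub(b, b)
    if ca == cb then return a < b else return ca < cb end
  end)
  local out, p = {}, encoded:find(markChar, 1, true)
  for i = 1, n do
    p = order[p]
    out[i] = encoded:sub(p, p)
  end
  return table.concat(out)   -- still ends with markChar; strip it if you like
end
No rotated copies of the string are ever built, so the memory cost stays linear in the size of the data rather than quadratic.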
Posted 27 May 2016 - 03:19 AM
Thanks! You have been a great source of help. I will perform the modifications tomorrow, and get back with the results.–snip–
Posted 27 May 2016 - 02:28 PM
Alright, I encountered a problem, though I am not sure how solvable it is. When I attempt to encode a large file, it kills Lua. I assume this is a problem with memory management and the fact that the encoder is creating a massive table for the file, which is more than 5k lines long. I tried it in CCEmu Redux along with Lua in my Debian terminal; the Lua in the terminal died, and the CCEmu Redux one just stopped. My thought is that Lua just cannot handle such a large table, and that the compression should be handled in a different language.
Posted 27 May 2016 - 03:25 PM
Well, you're kinda using an amount of memory about equal to the string size squared. But that's only ~5mb, and frankly I'm somewhat surprised it wouldn't work; I mean, this worked. Yes, that's a 30mb script.
Did you actually test it in-game? I've come to suspect LuaJ is a lot more tolerant than some other VMs out there. (Related: MC 1.8.9 has a much better garbage collector system than 1.7.10.)
If the worst comes to worst, you could always just… not store rotated copies of the string. Build a table with numbers 1 through to #stringLength in it instead, then sort it like this:
function encode(theString)
  -- Sort rotation start offsets instead of the rotated strings
  -- themselves, so only one full-sized copy of the data is kept.
  local theString = theString .. markChar
  local numbersTable, stringLen = {}, #theString
  for i = 1, stringLen do numbersTable[i] = i end
  table.sort(numbersTable, function(a, b)
    -- Rotations are rebuilt on the fly for each comparison instead of being stored.
    a = theString:sub(stringLen - a + 1) .. theString:sub(1, stringLen - a)
    b = theString:sub(stringLen - b + 1) .. theString:sub(1, stringLen - b)
    return a < b
  end)
  -- The output is the last character of each rotation in sorted order.
  local results = {}
  for i = 1, stringLen do
    local thisNum = stringLen - numbersTable[i]
    if thisNum == 0 then thisNum = stringLen end
    results[i] = theString:sub(thisNum, thisNum)
  end
  return table.concat(results)
end
I wouldn't expect it to operate at the same speed (there should theoretically be quite a few more function calls in there), but it ought to reduce the stress on your RAM.
You can often swap processing time for memory space in this sort of manner.
Edited on 27 May 2016 - 01:36 PM
Posted 27 May 2016 - 03:31 PM
Well, doesn't LuaJ run CCEmu Redux as well? I thought it did…
Posted 27 May 2016 - 03:36 PM
Oh, 5k lines, you said. A bit more than 5mb for the rotation table, then. (Edit: argh 25mb) Lessee, say 100 chars per line, a 500kb file, five hundred thousand squared… storing all those rotations would take something like ~250gb?
Yeah, um, maybe go with the code above, then.
Edited on 27 May 2016 - 01:44 PM
Posted 27 May 2016 - 03:38 PM
Okay, unless I want to invest in a computer with more than 250GB of RAM, I will swap that out later, and get back to you again.
Posted 27 May 2016 - 10:32 PM
Update: I was able to encode the file. It took a while, as expected, but it worked :)
Now I just need to adjust my compression algorithm so that it can handle runs of more than 255 identical characters in a row.
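One way to handle that would be to spend two bytes on the count, something like this (just a sketch):
-- Two-byte counts allow a single run of up to 65535 characters.
local function encodeRun(count, char)
  return string.char(math.floor(count / 256), count % 256) .. char
end

local function decodeRun(data, i)   -- returns the expanded run and the next index
  local count = data:byte(i) * 256 + data:byte(i + 1)
  return string.rep(data:sub(i + 2, i + 2), count), i + 3
end

local run = decodeRun(encodeRun(1000, "A"), 1)
print(#run) --> 1000
The other common fix is to simply split a long run into several runs of at most 255.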
Edited on 28 May 2016 - 02:09 AM
Posted 09 June 2016 - 07:01 PM
Posted 09 June 2016 - 08:35 PM
At this point, I have it all worked out, but it's a good resource for anyone else trying, thanks!