Compression Algorythm in Lua [WIP] - ComputerCraft Forums (archive)

Goof #1

Goof's profile picture

767 posts

Posted 08 March 2015 - 08:08 PM

Hello

I've decided to try and make a "compressor" in Lua, with some different algorithms in mind;
Currently im working on an algorythm, where i have the string, and add numbers / Letters corresponding to the amount of that letter etc;
Example:


OriginalString = "AAAAABBA"
CompressedString = "5A2B1A"

Explanation:

A small GIF, with a better "vision" of how it works

Currently im making the code, and so far im getting it:


--# Implement basic API's
table.size=function(tbl)local Count=0 for k,v in pairs(tbl) do Count=Count+1 end return Count end


--# Main Block --
local Str = 'AAAAABBA' --# Length: 8

local function Compress(Str)
  local Length = string.len(Str)
  local Output = {}
  OutputCurrent = 1
  CurrentChar = ''
  for I = 1, Length do
	local Previous = string.sub(Str, I-1, I-1)
	local Current = string.sub(Str, I, I)
	local Next = string.sub(Str, I+1, I+1)

	if not Output[OutputCurrent] then
	  Output[OutputCurrent] = {}
	end
	Output[OutputCurrent]['Amount'] = 1

	if Next == Current then
	  Output[OutputCurrent]['Amount'] = Output[OutputCurrent]['Amount'] + 1
	end
	Output[OutputCurrent]['Char'] = Current
	OutputCurrent = OutputCurrent + 1
  end
  if table.size(Output) > Length then
	print(Str)
	return false, Str
  end
  print(textutils.serialize(Output))
  return true, Output
end


Compress(Str)

I know it isn't much yet, but i "just" started this project (20 mins ago).

maybe i could make some improvements, but i can't really think of other compression, rather than using a "dictionary" and X pos of that text-field, etc.

So far, i would think this method is pretty good, but if the string does NOT have several "multiple" chars after each other, its going to INCREASE the size instead :(/>

Any suggestions / critics?

Thanks :P/>

Edited on 08 March 2015 - 07:33 PM

Lupus590 #2

Lupus590's profile picture

2427 posts

Location UK

Posted 08 March 2015 - 08:18 PM

Sounds/looks good, may I suggest a check at the end so that if the length of the string increases after going through you algorithm then it returns the default string? or returns false so that the program using this code knows to use it's input string.

Goof #3

Goof's profile picture

767 posts

Posted 08 March 2015 - 08:29 PM

Lupus590, on 08 March 2015 - 08:18 PM said:
Sounds/looks good, may I suggest a check at the end so that if the length of the string increases after going through you algorithm then it returns the default string? or returns false so that the program using this code knows to use it's input string.

Hmm. Yeah, of course… That would actually be much better, in terms of "securing" the size of the string. Being added

Lupus590 #4

Lupus590's profile picture

2427 posts

Location UK

Posted 08 March 2015 - 08:51 PM

I would suggest the bool method, as returning the input string could cause problems when they get your code to decompress it.

Edited on 08 March 2015 - 07:51 PM

Lupus590 #5

Lupus590's profile picture

2427 posts

Location UK

Posted 08 March 2015 - 09:06 PM

You may want to collaborate with this project http://www.computercraft.info/forums2/index.php?/topic/22198-craft-compress/

Goof #6

Goof's profile picture

767 posts

Posted 08 March 2015 - 09:22 PM

Lupus590, on 08 March 2015 - 08:51 PM said:
I would suggest the bool method, as returning the input string could cause problems when they get your code to decompress it.

Well Thats why i make both.
Arg1=State (Success compressing or not(false/true))
Arg2=Data

So im not sure what you're actually talking about :P/>

Lupus590, on 08 March 2015 - 09:06 PM said:
You may want to collaborate with this project http://www.computerc...craft-compress/

Naah.. not entirely sure about that, since that project doesn't quite fit the same as im doing.. I'm just goofing around and testing some "basics" of the compression.
Maybe i will ask, but at the moment, i can't declare whether i want or not

Bomb Bloke #7

Bomb Bloke's profile picture

7083 posts

Location Tasmania (AU)

Posted 08 March 2015 - 10:39 PM

With RLE methods like the one you're using, the trick is to have a method of temporarily disabling it. For example, you might say that if your number is less than 128, then it means "repeat the next value x times", and if it's more than 128, it means "dump the next x - 128 values directly into the output".

Many compression algorithms make use of a code table, one that's built during the compression process, but not actually saved to disk. The table is filled with strings of data (which might be anything from "A" to "BBBBBBBB" to "FAVRBTBADCREAC XZFG"), and the compressed code is basically a list of indexes into that. The decompressor is able to infer the same table when it goes to unpack the data, based on the order of codes in the compressor's output stream.

Back in the day I used this article as part of the documentation to help me write a GIF encoder / decoder (in QBASIC!). It describes one of the older and simpler methods which make use of this technique, LZW.

Goof #8

Goof's profile picture

767 posts

Posted 09 March 2015 - 02:42 PM

Bomb Bloke, on 08 March 2015 - 10:39 PM said:
With RLE methods like the one you're using, the trick is to have a method of temporarily disabling it. For example, you might say that if your number is less than 128, then it means "repeat the next value x times", and if it's more than 128, it means "dump the next x - 128 values directly into the output".

Hmm. I am not sure i quite understand that fully. So if the number is more than 128, then what? just paste it as a "local x = {'StringRepeated128times'} for i = 1, _Num do print(x[1]) end"

Bomb Bloke, on 08 March 2015 - 10:39 PM said:
Many compression algorithms make use of a code table, one that's built during the compression process, but not actually saved to disk. The table is filled with strings of data (which might be anything from "A" to "BBBBBBBB" to "FAVRBTBADCREAC XZFG"), and the compressed code is basically a list of indexes into that. The decompressor is able to infer the same table when it goes to unpack the data, based on the order of codes in the compressor's output stream.

Yeah, I've read about a dictionary compression method, by saving some strings which is repeated through the whole text, and placing that string into a table, with X positions.
However it might be pretty stupid to make such a method in CC, since it might take forever to execute that single function?

TurtleHunter #9

TurtleHunter's profile picture

139 posts

Posted 09 March 2015 - 02:51 PM

Mikk809h, on 09 March 2015 - 02:42 PM said:
Bomb Bloke, on 08 March 2015 - 10:39 PM said:
With RLE methods like the one you're using, the trick is to have a method of temporarily disabling it. For example, you might say that if your number is less than 128, then it means "repeat the next value x times", and if it's more than 128, it means "dump the next x - 128 values directly into the output".

Hmm. I am not sure i quite understand that fully. So if the number is more than 128, then what? just paste it as a "local x = {'StringRepeated128times'} for i = 1, _Num do print(x[1]) end"

Bomb Bloke, on 08 March 2015 - 10:39 PM said:
Many compression algorithms make use of a code table, one that's built during the compression process, but not actually saved to disk. The table is filled with strings of data (which might be anything from "A" to "BBBBBBBB" to "FAVRBTBADCREAC XZFG"), and the compressed code is basically a list of indexes into that. The decompressor is able to infer the same table when it goes to unpack the data, based on the order of codes in the compressor's output stream.

Yeah, I've read about a dictionary compression method, by saving some strings which is repeated through the whole text, and placing that string into a table, with X positions.
However it might be pretty stupid to make such a method in CC, since it might take forever to execute that single function?

Imagine you have this string ABCDEABCDE
That would mean that your script would do
1a1b1c1d1e1a1b1c1d1e

With a directionary based you could do this
2abcde

Also, how you compressor handles numbers?

Lupus590 #10

Lupus590's profile picture

2427 posts

Location UK

Posted 09 March 2015 - 03:52 PM

Could implement an escape character, most use \ so if the next value is a number you can do \1 telling your system to interpret the 1 as a 1 and not part of the number of next chars. If you need to encode a \ you use \\ . I wastes a character but how many times are you going to compress numbers?

Also, I would recommend using the method that TurtleHunter suggested, and make your code look for repeating patterns, the chances of having data with one repeating character is rare. (I can only think of game maps.)

Goof #11

Goof's profile picture

767 posts

Posted 09 March 2015 - 06:17 PM

Lupus590, on 09 March 2015 - 03:52 PM said:
Could implement an escape character, most use \ so if the next value is a number you can do \1 telling your system to interpret the 1 as a 1 and not part of the number of next chars. If you need to encode a \ you use \\ . I wastes a character but how many times are you going to compress numbers?

Also, I would recommend using the method that TurtleHunter suggested, and make your code look for repeating patterns, the chances of having data with one repeating character is rare. (I can only think of game maps.)

Yeah, Implementing an escape char for numbers is the best possible solution i can think of, which im going to add.

Well, the method TurtleHunter suggested is actually pretty good, but in that case you need to "search" the whole string/text and check for every possible double combination available, which might be hard, since that would need some kind of AI, to be sure the "selected" string is duplicated anywhere..

But for a giant file with many "local X = {" and so on, its going to be pretty easy, since we could just "hardcode" it to add all local to a Value (#0AS1 or something like a "key")

Bomb Bloke #12

Bomb Bloke's profile picture

7083 posts

Location Tasmania (AU)

Posted 10 March 2015 - 01:03 AM

You don't need to implement an escape character for numbers. You just handle them the exact same way you handle characters.

Only now it's dawning on me that you're intending to write your entire output stream as "characters", aren't you, in which case that won't work… I suggest looking into binary-mode file handles instead. You'll find things significantly easier. Forget about writing your data as a serialised table, you'll likely only make your "compressed" files larger than the originals.

Take your "counter" value, and write it directly to your file. Then take the byte to be repeated, and write that one. Rinse and repeat. Eeeezeeee.

Mikk809h, on 09 March 2015 - 02:42 PM said:
Yeah, I've read about a dictionary compression method, by saving some strings which is repeated through the whole text, and placing that string into a table, with X positions.
However it might be pretty stupid to make such a method in CC, since it might take forever to execute that single function?

You'd be surprised. ;)/> It's not like you'll encounter any significant amounts of data to compress anyway, as the computers are limited to just under a megabyte per drive.

SpencerBeige #13

SpencerBeige's profile picture

283 posts

Posted 10 March 2015 - 01:32 AM

can u read it back as original? or is it just making a hash?
edit: oops wait. i see how it can be easy to read ;p

Edited on 10 March 2015 - 12:33 AM