This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.
immibis's profile picture

Compressor

Started by immibis, 26 February 2013 - 06:13 PM
immibis #1
Posted 26 February 2013 - 07:13 PM
Inspired by urielsalis's compressor because it's a fun programming problem, especially if you don't know much about compression to start with.

Downloads:
Compressor (pastebin ID q5UhqCJN)
Decompressor (pastebin ID jH6YuJSi)

How to use:

compress <uncompressed file> <compressed file>
decompress <compressed file> <uncompressed file>

Designed to work on Lua files that are either not indented or indented with tabs (not spaces) - it'll still work on other files but it probably won't compress them very well.

This compressor compresses itself at 67%, which could easily be improved to 50% if LuaJ didn't mangle non-ASCII characters.
For comparison, 7-Zip is able to compress the compressor program at a ratio of 35% (lower is better), and urielsalis's compressor achieves a whoppingly useless 100.097%.
Obviously this isn't too useful as 7-Zip does a much better job, but it was fun making it.

How it works: The compression algorithm has three stages.
First, indentation is converted into &amp;+ and &amp;- symbols. This only works for tabs.
Using some code from the compressor itself as an example - original code:
Spoiler

local inFN, outFN

local args = {...}
if #args ~= 2 then
	error("Usage: "..shell.getRunningProgram().." <input file> <output file>", 0)
end
inFN = args[1]
outFN = args[2]

local function countTabs(l)
	for k = 1, #l do
		if l:sub(k, k) ~= "\t" then
			return k - 1
		end
	end
	return #l
end

...

for k = 1, #inText do
	local ch = inText:sub(k, k)
	local isIdent = idents:find(ch, 1, true) ~= nil
	if isIdent ~= lastIdent then
		if #parsed > 0 then
			onFinishSegment(lastIdent, parsed[#parsed])
		end
		parsed[#parsed+1] = ""
	end
	lastIdent = isIdent
	parsed[#parsed] = parsed[#parsed]..ch
end
After processing indentation:
Spoiler

local inFN, outFN

local args = {...}
if #args ~= 2 then
	error("Usage: "..shell.getRunningProgram().." <input file> <output file>", 0)
end
inFN = args[1]
outFN = args[2]

local function countTabs(l)
&amp;+for k = 1, #l do
&amp;+if l:sub(k, k) ~= "\t" then
	return k - 1
end
&amp;-end
return #l
&amp;-end

...

for k = 1, #inText do
&amp;+local ch = inText:sub(k, k)
local isIdent = idents:find(ch, 1, true) ~= nil
if isIdent ~= lastIdent then
&amp;+if #parsed > 0 then
	onFinishSegment(lastIdent, parsed[#parsed])
end
parsed[#parsed+1] = ""
&amp;-end
lastIdent = isIdent
parsed[#parsed] = parsed[#parsed]..ch
&amp;-end
&amp;+ is added wherever the indentation level increases, and &amp;- where it decreases. Single indented lines are left alone.

Stage 2: The compressor finds all "identifier groups" - sequences of alphanumeric characters and underscores. For example "if", "then", "onFinishSegment", "lastIdent", and "1". There's an option in the compressor called compressNonIdentGroups - if this was true then it would also find "punctuation groups" - sequences of other characters. The decompressor doesn't support this, so don't change it unless you're curious about its effect on the compressed file.

Groups are sorted in order of the number of bytes that the compressor estimates it would save by compressing that group. To compress a group, the compressor assigns it a unique sequence of alphanumeric characters, then replaces all the group's occurrences in by that sequence. If the group is not compressed (for example, it's shorter than the next available code) then an underscore is prepended to each occurrence of that group.

All compressed groups are written on the first line of the compressed file, so that the decompressor can un-replace them.

After this stage the file looks like this: (linebreaks added to the first line because it's extremely long)
Spoiler

139^local^treeFragments^end^isIdent^function^parsed^inText^lastIndent^then^groupsSorted^symbolCounts^
thisIndent^symbol^id_literal_escape^pc_literal_escape^compressedLength^estSavings^compressNonIdentGroups^
table^print^return^out^for^validchars_pc^isIdentString^count^groupCounts^bitsPerByte^tree^symbolCodes^
lines^onFinishSegment^numSymbols^validchars^symbolBitWidth^nextCompressed^countTabs^lastIdent^writeTree^
do^and^true^newCount^rv^line^lineNo^idents^total^if^prevIndent^testBit^nextIndent^avgCompressedLength^
iterate^false^saved^the^chars^groups^shallowCopyTable^ipairs^insert^byte^path^sub^firstCharacter^symbols^
pairs^segment^validchars_id^close^else^error^root^find^write^convert^while^sym^huffmanEncode^alphanumerics^
or^encode^compressing^in^number^str^gsub^next^args^size^open^group^digit^outFN^bit^compress^fragment^internal^
inFN^file^len^huffman^them^sort^ch^new^encoded^remove^padbit^occurs^elseif^string^to^of^we^about^codes^nil^
input^place^not^each^fs^math^need^type^it^char^bits^but^ok^by^32^so^
abcdefghijklmnopqrstvuwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_^subtree_1^subtree_0^_^b Mb, Hb

b Cb = {...}
X #Cb ~= _2 j
	lb("_Usage: ".._shell._getRunningProgram().." <6b Nb> <_output Nb>", _0)
d
Mb = Cb[_1]
Hb = Cb[_2]

b f L(_l)
&amp;+x _k = _1, #_l O
&amp;+X _l:db(_k, _k) ~= "\_t" j
	u _k - _1
d
&amp;-d
u #_l
&amp;-d

...

x _k = _1, #h O
&amp;+b Sb = h:db(_k, _k)
b e = V:nb(Sb, _1, Q) ~= 5b
X e ~= M j
&amp;+X #g > _0 j
	G(M, g[#g])
d
g[#g+_1] = ""
&amp;-d
M = e
g[#g] = g[#g]..Sb
&amp;-d
X #g > _0 j
	G(e, g[#g])
d

Stage 3 is to Huffman-compress the output from stage 2, and then encode it in a base64-like encoding; this produces total gibberish:
Spoiler

 M3=8./J1G3W<EL!2D-9)JBD8(/1H2EX=J](=]"PJ*UDT'C;1=QNNUO@WY0[;,:"W
SKDK91F:3LJ *FEQ1NVW06F&amp;-C15/N4_Z$S3A6SZH("7TX9;K<""VRQ,LD!3@X4+^
2>>CP80J4KBBI0U^DRX,B4:,_'8XXYG7X<6?*!>_5GN)$#R3H;>)4P[$BQK4'=R*S
1\67]PT-O$1<B$CC?!9"BN3(&amp;GU.AX=ZSKW"2.OGENYP:Z&amp;YX&amp;E7'Q@<6[2H@H[$B
QAUT-RQYEW*>)+U?AK)XY5K$OUIB7R2]7X:">>U*Q/S[VJ/!AI;[D0=8MY)(C\%0)
J#D?FWM5?""K>.2P[$B!V^\ZM@<V5@GXMZ+1,>?B5O'QSKSX5R[Y 7XEWGMOZA:K3
>)4P[$B1FU7BZP/SKC8\.]YU;94>=$#RC2&amp;DH4?MJAEXL^4B#KI;DC;4^="RQ+1)(
/'!>)J$VF<J3W$BQ1U[2NFN1..D#<A7>?V^J'RAUT-RQK5)S6EWLV<4B$SH\NMZ30
8ZF?F7'Q(^*@<\+(3P[$BQ-T7R2$WG2\[<^%ZFSZKGXHD!CRKC8\> .>)*QQ+1IJG
S+O3(&amp;DSR97X\R*\6]) !1W)$CS206M*9<4B$30T="QX7 =,*O;K^$$FNM3"UAUF7
R2]7L$OR%.,QA7 I;.K=B?V^J'R[TG7O%D#;?ACG;PH\6U0\5@GXEMN*[A\2$,USX
1-L$OU%K-'6/X5*[A,C:?U?=R2]<8-=C<<(OU59/D#3=ZF0,.P%>Y]9[K>H6>S/QS
A)>"'FXI^^\>>?>.C7I$=S-T7R2\S<$0JGQ-TN$/QAUTEWN>E*=O0"'=F?GX7XM:?
AJW-U7>"YY]\$Z,_.Q/S[VJ/!AEH.\2$'5O;DZY06/O#L>P1)2,^5?'('66U$/O5"
Q[TG7O%_NL3)CG7G!>UY1)C?FWM5?""Y5ZOZT-A8\2$+UCNPYE(PK.O2)>BCG0\VU
K4I[%BS:GZRAY],^9;.*9@G0GYWI;>?J;.0BWZZ.*9\\SSZ]0
...
Shnupbups #2
Posted 26 February 2013 - 07:38 PM
Interesting! Very nice to have a working compressor!
TwelveEight #3
Posted 26 February 2013 - 07:56 PM
This looks really nice, I'll make sure to check it out very soon, i.e. now.
superaxander #4
Posted 26 February 2013 - 07:59 PM
Looks good!
theoriginalbit #5
Posted 26 February 2013 - 09:00 PM
Nice. :)/> yay compression! :D/>
FuuuAInfiniteLoop(F.A.I.L) #6
Posted 27 February 2013 - 05:25 AM
mine code also compress an entire folder…
Lyqyd #7
Posted 27 February 2013 - 09:02 AM
mine code also compress an entire folder…

I think you mean "inflates", since your "compression" method increases file size. It would be a trivial matter to write a script to compress several files with immibis' script. I am curious to see if he adds multi-file archives, though.
immibis #8
Posted 27 February 2013 - 02:00 PM
mine code also compress an entire folder…

I think you mean "inflates", since your "compression" method increases file size. It would be a trivial matter to write a script to compress several files with immibis' script. I am curious to see if he adds multi-file archives, though.
I don't plan on adding any more features, since there's not much point in compressing things inside CC, and outside CC any off-the-shelf compression program will do better.
H4X0RZ #9
Posted 18 April 2013 - 10:37 AM
Where did you got the huffman code for lua?