Downloads:
Compressor (pastebin ID q5UhqCJN)
Decompressor (pastebin ID jH6YuJSi)
How to use:
compress <uncompressed file> <compressed file>
decompress <compressed file> <uncompressed file>
Designed to work on Lua files that are either not indented or indented with tabs (not spaces). It will still work on other files, but it probably won't compress them very well.
This compressor compresses itself to 67% of its original size, which could easily be improved to 50% if LuaJ didn't mangle non-ASCII characters.
For comparison, 7-Zip compresses the compressor program to a ratio of 35% (lower is better), and urielsalis's compressor achieves a whopping, useless 100.097%.
Obviously this isn't too useful, since 7-Zip does a much better job, but it was fun to make.
How it works: the compression algorithm has three stages.
Stage 1: indentation is converted into &+ and &- symbols. This only works for tabs.
Using some code from the compressor itself as an example - original code:
Spoiler
local inFN, outFN
local args = {...}
if #args ~= 2 then
	error("Usage: "..shell.getRunningProgram().." <input file> <output file>", 0)
end
inFN = args[1]
outFN = args[2]
local function countTabs(l)
	for k = 1, #l do
		if l:sub(k, k) ~= "\t" then
			return k - 1
		end
	end
	return #l
end
...
for k = 1, #inText do
	local ch = inText:sub(k, k)
	local isIdent = idents:find(ch, 1, true) ~= nil
	if isIdent ~= lastIdent then
		if #parsed > 0 then
			onFinishSegment(lastIdent, parsed[#parsed])
		end
		parsed[#parsed+1] = ""
	end
	lastIdent = isIdent
	parsed[#parsed] = parsed[#parsed]..ch
end
After stage 1 the same code looks like this:
Spoiler
local inFN, outFN
local args = {...}
if #args ~= 2 then
error("Usage: "..shell.getRunningProgram().." <input file> <output file>", 0)
end
inFN = args[1]
outFN = args[2]
local function countTabs(l)
&+for k = 1, #l do
&+if l:sub(k, k) ~= "\t" then
return k - 1
end
&-end
return #l
&-end
...
for k = 1, #inText do
&+local ch = inText:sub(k, k)
local isIdent = idents:find(ch, 1, true) ~= nil
if isIdent ~= lastIdent then
&+if #parsed > 0 then
onFinishSegment(lastIdent, parsed[#parsed])
end
parsed[#parsed+1] = ""
&-end
lastIdent = isIdent
parsed[#parsed] = parsed[#parsed]..ch
&-end
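The core idea of stage 1 can be sketched like this (in Python, for illustration - the compressor itself is Lua). This minimal version emits one &+ per indentation level gained and one &- per level lost relative to the previous line; as the sample above shows, the real compressor places markers a little differently, so treat this as the concept rather than the exact algorithm.

```python
def encode_indent(text):
    # Replace leading tabs with relative markers: one "&+" per level
    # gained and one "&-" per level lost versus the previous line.
    out, prev = [], 0
    for line in text.split("\n"):
        body = line.lstrip("\t")
        depth = len(line) - len(body)
        delta = depth - prev
        marker = "&+" * delta if delta > 0 else "&-" * -delta
        out.append(marker + body)
        prev = depth
    return "\n".join(out)

def decode_indent(text):
    # Inverse transform: rebuild tab indentation from the markers.
    out, depth = [], 0
    for line in text.split("\n"):
        while line.startswith("&+"):
            depth, line = depth + 1, line[2:]
        while line.startswith("&-"):
            depth, line = depth - 1, line[2:]
        out.append("\t" * depth + line)
    return "\n".join(out)
```

Round-tripping a tiny snippet: encode_indent("if a then\n\tprint(a)\nend") gives "if a then\n&+print(a)\n&-end", and decode_indent restores the tabs.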
Stage 2: The compressor finds all "identifier groups" - sequences of alphanumeric characters and underscores, such as "if", "then", "onFinishSegment", "lastIdent", and "1". There's an option in the compressor called compressNonIdentGroups - if set to true, it also finds "punctuation groups" - sequences of all other characters. The decompressor doesn't support this, so don't enable it unless you're curious about its effect on the compressed file.
Groups are sorted by the number of bytes the compressor estimates it would save by compressing each group. To compress a group, the compressor assigns it a unique sequence of alphanumeric characters, then replaces every occurrence of the group with that sequence. If a group is not compressed (for example, because it's shorter than the next available code), an underscore is prepended to each of its occurrences instead.
All compressed groups are written on the first line of the compressed file, so that the decompressor can un-replace them.
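The replacement step can be sketched like this (a simplified Python sketch, not the compressor's actual code; the savings estimate and code-assignment order are assumptions):

```python
import re
from itertools import count, product
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def short_codes():
    # Yield replacement codes in order of length: a, b, ..., 9, aa, ab, ...
    for length in count(1):
        for chars in product(ALPHABET, repeat=length):
            yield "".join(chars)

def compress_groups(text):
    counts = Counter(re.findall(r"[A-Za-z0-9_]+", text))
    # Rank groups by a crude savings estimate:
    # (group length - 1 byte for the code) * number of occurrences.
    ranked = sorted(counts, key=lambda g: (len(g) - 1) * counts[g], reverse=True)
    gen, mapping = short_codes(), {}
    code = next(gen)
    for g in ranked:
        if len(code) < len(g):  # only compress when it actually saves bytes
            mapping[g] = code
            code = next(gen)
    def sub(m):
        g = m.group(0)
        # Uncompressed groups get "_" prepended so the decompressor can
        # tell them apart from replacement codes.
        return mapping.get(g, "_" + g)
    # Compressed groups go on the first line ("^"-separated, as in the
    # sample below) so the decompressor can un-replace them.
    return "^".join(mapping) + "\n" + re.sub(r"[A-Za-z0-9_]+", sub, text)
```

For example, compress_groups("local parsed = parsed + parsed") yields "parsed^local" as the dictionary line and "b a = a + a" as the body, while single-character groups like "x" and "1" are left alone and escaped as "_x" and "_1".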
After this stage the file looks like this (line breaks added to the first line because it's extremely long):
Spoiler
139^local^treeFragments^end^isIdent^function^parsed^inText^lastIndent^then^groupsSorted^symbolCounts^
thisIndent^symbol^id_literal_escape^pc_literal_escape^compressedLength^estSavings^compressNonIdentGroups^
table^print^return^out^for^validchars_pc^isIdentString^count^groupCounts^bitsPerByte^tree^symbolCodes^
lines^onFinishSegment^numSymbols^validchars^symbolBitWidth^nextCompressed^countTabs^lastIdent^writeTree^
do^and^true^newCount^rv^line^lineNo^idents^total^if^prevIndent^testBit^nextIndent^avgCompressedLength^
iterate^false^saved^the^chars^groups^shallowCopyTable^ipairs^insert^byte^path^sub^firstCharacter^symbols^
pairs^segment^validchars_id^close^else^error^root^find^write^convert^while^sym^huffmanEncode^alphanumerics^
or^encode^compressing^in^number^str^gsub^next^args^size^open^group^digit^outFN^bit^compress^fragment^internal^
inFN^file^len^huffman^them^sort^ch^new^encoded^remove^padbit^occurs^elseif^string^to^of^we^about^codes^nil^
input^place^not^each^fs^math^need^type^it^char^bits^but^ok^by^32^so^
abcdefghijklmnopqrstvuwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_^subtree_1^subtree_0^_^b Mb, Hb
b Cb = {...}
X #Cb ~= _2 j
lb("_Usage: ".._shell._getRunningProgram().." <6b Nb> <_output Nb>", _0)
d
Mb = Cb[_1]
Hb = Cb[_2]
b f L(_l)
&+x _k = _1, #_l O
&+X _l:db(_k, _k) ~= "\_t" j
u _k - _1
d
&-d
u #_l
&-d
...
x _k = _1, #h O
&+b Sb = h:db(_k, _k)
b e = V:nb(Sb, _1, Q) ~= 5b
X e ~= M j
&+X #g > _0 j
G(M, g[#g])
d
g[#g+_1] = ""
&-d
M = e
g[#g] = g[#g]..Sb
&-d
X #g > _0 j
G(e, g[#g])
d
Stage 3 Huffman-compresses the output of stage 2, then encodes the result in a base64-like encoding; this produces total gibberish:
Spoiler
M3=8./J1G3W<EL!2D-9)JBD8(/1H2EX=J](=]"PJ*UDT'C;1=QNNUO@WY0[;,:"W
SKDK91F:3LJ *FEQ1NVW06F&-C15/N4_Z$S3A6SZH("7TX9;K<""VRQ,LD!3@X4+^
2>>CP80J4KBBI0U^DRX,B4:,_'8XXYG7X<6?*!>_5GN)$#R3H;>)4P[$BQK4'=R*S
1\67]PT-O$1<B$CC?!9"BN3(&GU.AX=ZSKW"2.OGENYP:Z&YX&E7'Q@<6[2H@H[$B
QAUT-RQYEW*>)+U?AK)XY5K$OUIB7R2]7X:">>U*Q/S[VJ/!AI;[D0=8MY)(C\%0)
J#D?FWM5?""K>.2P[$B!V^\ZM@<V5@GXMZ+1,>?B5O'QSKSX5R[Y 7XEWGMOZA:K3
>)4P[$B1FU7BZP/SKC8\.]YU;94>=$#RC2&DH4?MJAEXL^4B#KI;DC;4^="RQ+1)(
/'!>)J$VF<J3W$BQ1U[2NFN1..D#<A7>?V^J'RAUT-RQK5)S6EWLV<4B$SH\NMZ30
8ZF?F7'Q(^*@<\+(3P[$BQ-T7R2$WG2\[<^%ZFSZKGXHD!CRKC8\> .>)*QQ+1IJG
S+O3(&DSR97X\R*\6]) !1W)$CS206M*9<4B$30T="QX7 =,*O;K^$$FNM3"UAUF7
R2]7L$OR%.,QA7 I;.K=B?V^J'R[TG7O%D#;?ACG;PH\6U0\5@GXEMN*[A\2$,USX
1-L$OU%K-'6/X5*[A,C:?U?=R2]<8-=C<<(OU59/D#3=ZF0,.P%>Y]9[K>H6>S/QS
A)>"'FXI^^\>>?>.C7I$=S-T7R2\S<$0JGQ-TN$/QAUTEWN>E*=O0"'=F?GX7XM:?
AJW-U7>"YY]\$Z,_.Q/S[VJ/!AEH.\2$'5O;DZY06/O#L>P1)2,^5?'('66U$/O5"
Q[TG7O%_NL3)CG7G!>UY1)C?FWM5?""Y5ZOZT-A8\2$+UCNPYE(PK.O2)>BCG0\VU
K4I[%BS:GZRAY],^9;.*9@G0GYWI;>?J;.0BWZZ.*9\\SSZ]0
...
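The two halves of stage 3 - Huffman coding, then packing the bit stream into printable characters - can be sketched like this (a Python sketch, not the compressor's actual code; the 6-bit chunk size and the "!" starting character are assumptions on my part):

```python
import heapq
from collections import Counter

def huffman_codes(data):
    # Build a Huffman code table from symbol frequencies. Each heap entry
    # is (total frequency, tie-breaker, {symbol: bit string}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two rarest subtrees: prefix one side's codes with "0"
        # and the other side's with "1".
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

def huffman_encode(data):
    table = huffman_codes(data)
    bits = "".join(table[s] for s in data)
    bits += "0" * (-len(bits) % 6)  # pad to a whole number of 6-bit chunks
    # Map each 6-bit chunk onto a printable character starting at "!",
    # producing the kind of gibberish shown above.
    return "".join(chr(0x21 + int(bits[i:i + 6], 2))
                   for i in range(0, len(bits), 6))
```

Frequent symbols get short bit strings (for "aaabbc", "a" gets a 1-bit code while "c" gets 2 bits), which is where the actual size reduction comes from; the printable re-encoding afterwards only exists so the result survives being stored as text.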