How to create unique binary combinations that do not overlap for data compression

Creator #1

2679 posts

Location You will never find me, muhahahahahaha

Posted 31 March 2015 - 07:40 PM

Hi guys,

I am trying to make a compression algorithm for CC. It uses combinations like these:

These combinations are used to replce characters. The shortest combinations replace the most commmon caharacters, and the longest, the less frequent. the are more combinations, but I do not understand how they work neither what they are called.
The idea is that when they are conctenated in a file, only one posibiliti of unconcatenating exists.

For example,

010110111

Should only be unconcatenated like this

The question is, how do I generate the automatically, or whare can I find them.
It is close to the Huffman binary tree.

Thanks in advance,

Creator

Edited on 31 March 2015 - 06:13 PM

Bomb Bloke #2

Bomb Bloke's profile picture

7083 posts

Location Tasmania (AU)

Posted 31 March 2015 - 09:15 PM

You need to store all these values using a number of bytes equal to that required to store the largest value you might write. For example, if you might write 7 (which is three bytes - 111), then that means that 0 must be written with three bytes (000), 2 must be written with three bytes (010), etc. As soon as 8 becomes available for writing (1000), you then need to start reading and writing four bits at a time, and so on.

Edited on 31 March 2015 - 07:15 PM

Creator #3

2679 posts

Location You will never find me, muhahahahahaha

Posted 31 March 2015 - 09:23 PM

Bomb Bloke, on 31 March 2015 - 09:15 PM said:
You need to store all these values using a number of bytes equal to that required to store the largest value you might write. For example, if you might write 7 (which is three bytes - 111), then that means that 0 must be written with three bytes (000), 2 must be written with three bytes (010), etc. As soon as 8 becomes available for writing (1000), you then need to start reading and writing four bits at a time, and so on.

This is not what I am asking, What I am asking is that your combinations are ambigouos, meaning then can be iterpreted in many ways. For example, the final result 000010000 could be parsed as 000 010 000 or as 0000 1000 00. Waht I need is something like the example I gave. You can arrange them in any way you want, there is only one way to parse it. It would be something like:

but in a way that they cannot be parsed in an ambigouos way. They will also have different lengths.

valithor #4

valithor's profile picture

1023 posts

Posted 31 March 2015 - 09:54 PM

Creator, on 31 March 2015 - 09:23 PM said:
Bomb Bloke, on 31 March 2015 - 09:15 PM said:
You need to store all these values using a number of bytes equal to that required to store the largest value you might write. For example, if you might write 7 (which is three bytes - 111), then that means that 0 must be written with three bytes (000), 2 must be written with three bytes (010), etc. As soon as 8 becomes available for writing (1000), you then need to start reading and writing four bits at a time, and so on.

This is not what I am asking, What I am asking is that your combinations are ambigouos, meaning then can be iterpreted in many ways. For example, the final result 000010000 could be parsed as 000 010 000 or as 0000 1000 00. Waht I need is something like the example I gave. You can arrange them in any way you want, there is only one way to parse it. It would be something like:
01
001
011
010
0001
but in a way that they cannot be parsed in an ambigouos way. They will also have different lengths.

You might have misunderstood what he was saying, but there really is only one way to interpret 000010000. This is because "As soon as 8 becomes available for writing (1000), you then need to start reading and writing four bits at a time, and so on." They should all be the same length so there will only one way for it to be interpreted, in this given example that length is 3, so that is the only length that should be assumed to be used. The length would be set before hand, so there would be no reason for misunderstanding between reading/writing.

As for having them all different lengths there really is no way of doing this without another line of numbers/string giving instructions for how they are to be aranged. This may be counter productive and increase the size of the file instead of decrease it.

Assuming we have a list of lets say that consists of: 1001 010 01010 100 10001
we could put them together into a singe string, but then they loose their ability to be distinquished from each other. 10010100101010010001 There is no obvious way to break apart this list of numbers, and the only reason you can tell what they were before is because i changed the color of them and you have the list already. Without something seperating your characters, or a set length for all of them (which is what bomb bloke was saying) there is just no way to be able to divide up a string of numbers (apart from trial and error).

Edited on 31 March 2015 - 07:55 PM

Creator #5

2679 posts

Location You will never find me, muhahahahahaha

Posted 31 March 2015 - 10:03 PM

What I meant is that they can havedifferent lengths, and thus there is a special way to create them so there is no ambiguity.

valithor #6

valithor's profile picture

1023 posts

Posted 31 March 2015 - 10:12 PM

Creator, on 31 March 2015 - 10:03 PM said:
What I meant is that they can havedifferent lengths, and thus there is a special way to create them so there is no ambiguity.

There really is no way to do this, which is more or less what i implied.

"As for having them all different lengths there really is no way of doing this without another line of numbers/string giving instructions for how they are to be aranged. This may be counter productive and increase the size of the file instead of decrease it."

GopherAtl #7

GopherAtl's profile picture

871 posts

Posted 31 March 2015 - 10:15 PM

to have unique, variable-length encodings, you want to organize the items you're representing in binary into a binary tree. Every node of a binary tree has two nodes, and each of those nodes can either be a leaf (in this case, one of the values to be encoded) or another node. The path through the tree - left being '0', right being '1' - defines the output to encode for each.

Set-length encodings like people are talking about can be thought of as binary trees where every leaf is the same number of nodes down, so, for example with bytes, the path from the root to any element is always 8 steps. This gives fixed-width outputs which are easy to read and manipulate. But variable-width encoding is also possible, and sometimes desirable - what if you don't have 256 different values to encode? what if you only had, say, 150? too many to encode in 7 bits, which would cap out at 128 different items, but still leaving a lot of unused or invalid values in an 8 bit encoding.

Here's an example binary tree, which just encodes the letters 'A' through 'D'.


  {
	{ 'A', 'B' },
	{ 'C', 'D' },
  },

This ends up being a fixed-width tree; we'll say element 1 in each node array is left (0), and element 2 is right (1).
from the root, left gives you the node {'A','B'}. Left again, just 'A'. So, left left, or "00", is 'A'. right, right, or "11", would be 'D'.
But supposed you added "E" and "F". We could expand it to a 3-bit fixed width encoding, keeping these as 000b - 011b and adding 100 = E and 101 = F. Then 110 and 111 would be meaningless, wasted patterns in our output encoding.
Instead, with binary trees, we can do this:


{
  {
	{ 'A', 'B' },
	{ 'C', 'D' },
  },
  { 'E', 'F'},
}

the entire previous tree just became the left branch of a new node, with the right node containing the two new values. So now, left takes us to the old tree, and 000 (left, left, left) would take us from root to 'A', 011 (left, right, right) would take us to 'D.' But starting at root, right takes us to {'E','F'} and from there a left and right give us 'E' and 'F', so E is "10" and F is "11." Not fixed length anymore, and 2 of our 6 symbols take up only 2 bits instead of 3, and we have no patterns that end up meaning nothing - the only invalid cases would come at the end of the entire file. If you had an extra "0" at the very end, that doesn't decode to a character; a single "left" instruction just gives you the node containing A,B,C, and D.

Huffman encoding is a compression algorithm for building binary trees based on repetition counts - the most frequent things get the shortest encodings, the shortest branches in the tree, while the least frequent get the longest branches/encodings. But any method will give a valid encoding, so long as it builds a valid binary tree.

The big limitation of variable-length encoding is that you must read data encoded this way from the beginning, always. You can't jump in in the middle and start reading, because you will have no way of knowing which bits end one value and begin the next one. 'E' is encoded as "10", but not every "10" in output is an 'E'; 'C' was encoded as "110", which obviously has "10" right there in it! Fixed width encodings remove this limitation, which is a big part of why they're favored by programmers (even when, as in the case of utf8 (which is variable-width) vs ascii (fixed-width), there's a lot of very good reasons to use the variable width one!)

Edited on 01 April 2015 - 03:50 AM

Bomb Bloke #8

Bomb Bloke's profile picture

7083 posts

Location Tasmania (AU)

Posted 01 April 2015 - 04:46 AM

Interesting reading - I've certainly learned something - though I think you got your lefts and rights (… and F) a bit mixed up there. ;)/>

GopherAtl #9

GopherAtl's profile picture

871 posts

Posted 01 April 2015 - 04:55 AM

Bomb Bloke, on 01 April 2015 - 04:46 AM said:
Interesting reading - I've certainly learned something - though I think you got your lefts and rights (… and F) a bit mixed up there. ;)/>

er, so I did. Tried to clean those up, hopefully didn't miss any.