This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.

CCUnicode - Interact with Minecraft in Unicode

Started by Pyuu, 29 May 2015 - 05:08 AM
Pyuu #1
Posted 29 May 2015 - 07:08 AM
CCUnicode

For the first time, you can now talk in Japanese in CC peripherals.

Version 1.0


Hello, I spent some time wondering why peripherals can somehow let CC "see" Unicode strings even though we can't create them normally. So I spent the last few hours experimenting with string.char, and eventually found a pattern.

0xeABBCC turned out to be the format CC uses for these strings - it's 3 bytes, not 2. I don't know much about Unicode or how it's formatted, but I was at least able to figure this much out. Group A, the main set of characters, is a group that contains 4,096 different characters; Group B can go up to a maximum value of 64 (0xeA40) before looping, and Group C goes up to 64 as well, adding up to a total of 0xFFFF characters, which matches the standard Unicode format (U+ABCD).

This is apparently UTF-8 encoding.
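Roughly speaking, for characters in the U+0800 to U+FFFF range the pattern boils down to something like this (just a sketch in plain Lua arithmetic, not the actual API):

-- Sketch: pack a code point into the three-byte eABBCC pattern described above.
-- Only valid for code points from 0x0800 to 0xFFFF (the 3-byte range).
local function toUTF8(codepoint)
  local c = codepoint % 64                   -- Group C: low 6 bits
  local b = math.floor(codepoint / 64) % 64  -- Group B: middle 6 bits
  local a = math.floor(codepoint / 4096)     -- Group A: high 4 bits
  return string.char(0xE0 + a, 0x80 + b, 0x80 + c)
end

print(("%02x %02x %02x"):format(toUTF8(0x3057):byte(1, 3))) -- e3 81 97, hiragana "shi"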

So I created a little API that translates the U+ABCD format into the way CC stores Unicode, and I also took the time to include a small language pack (JP Basic) for testing purposes.


Why should you care?
This lets people actually use Unicode characters. My own use will be in Chat Blocks, to color the text using the section symbol (§); others can use it for translation purposes. However, Unicode will not display on the CC computers themselves, only in peripherals or items such as floppy disks (http://puu.sh/i4M2m/fac039315e.jpg).

This also works in fs operations; however, some external editors may show symbols instead of the proper characters when viewing the files.

I also tested in Chat Blocks to see if it'd work properly, and sure enough: http://puu.sh/i4uyc/78c61e8a35.jpg

How did I do this? (if you care)
I spent a few hours literally string.char'ing random patterns together until I found something that made a bit of sense, used the Chat Block to send data in through the chat_message event, and stared at the hexadecimal of the strings for a while.
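As a rough sketch of the approach (not the exact code I used), you can hex-dump whatever a chat_message event hands you and stare at that:

-- Wait for chat_message events and dump every string argument as hex.
-- No assumptions made here about the event's argument order.
while true do
  local event = { os.pullEvent("chat_message") }
  for i = 2, #event do
    if type(event[i]) == "string" then
      print((event[i]:gsub(".", function(ch)
        return string.format("%02x ", ch:byte())
      end)))
    end
  end
end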

Source Code: http://pastebin.com/sseR1jk4

Usage:


Some globals you have to be aware of until the next update:
split ( str String ) - like string.split in other languages.
num2hex ( int32 number ) - Converts a number to hexadecimal; not sure if it's necessary.

API Usage:
unicode.char ( int32 number ) - Converts a Unicode code point (0xABCD) to a UTF-8 representation that is compatible with CC.
unicode.format ( int32 number ) - Converts a number to a properly formatted hexadecimal string representation.
unicode.unformat ( str hexadecimal ) - Reverses .format.
unicode.read ( str char ) - Same as unicode.char; will be deprecated in the next update.
unicode.readString ( str String , str LanguagePack ) - Returns a UTF-8 representation of the string based on the Language Pack.
Example: unicode.readString("a|i|u|e|o","jp_basic")

Data Sets:
unicode.set.eng
unicode.set.jp_basic
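For example, a quick (hypothetical) session using the calls and data sets above could look like this - it assumes the pastebin file is saved as "unicode" and loaded as a normal CC API, and the output path is just a placeholder:

os.loadAPI("unicode")

-- single code point -> CC-compatible UTF-8 string (U+3057, hiragana "shi")
local shi = unicode.char(0x3057)

-- romaji -> kana via the jp_basic language pack, as in the example above
local vowels = unicode.readString("a|i|u|e|o", "jp_basic")

-- it won't render on the computer's own screen, but it can be written to a
-- file or handed to a peripheral that uses Minecraft's font renderer
local h = fs.open("kana.txt", "w")
h.write(shi .. vowels)
h.close()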

Thank you for reading this post! I hope it was of help.
Edited on 10 June 2015 - 07:11 PM
Creator #2
Posted 29 May 2015 - 03:40 PM
This is actually amazing.
Pyuu #3
Posted 29 May 2015 - 07:22 PM
Thank you! If you have any suggestions / requests, feel free to ask!
ElvishJerricco #4
Posted 29 May 2015 - 07:33 PM
So wait can this be used to fix that LuaJ string bug, where it automatically encodes binary strings as unicode, screwing with binary HTTP requests?
Pyuu #5
Posted 29 May 2015 - 07:40 PM
So wait can this be used to fix that LuaJ string bug, where it automatically encodes binary strings as unicode, screwing with binary HTTP requests?

Give me an example of the bug please! :)
SquidDev #6
Posted 29 May 2015 - 08:08 PM
So wait can this be used to fix that LuaJ string bug, where it automatically encodes binary strings as unicode, screwing with binary HTTP requests?

Give me an example of the bug please! :)

The thread for it is here. I've tried fixing it from the Java side but it doesn't seem possible. The issue stems from strings != binary data, but in CC they are represented the same.
Pyuu #7
Posted 29 May 2015 - 08:32 PM
So wait can this be used to fix that LuaJ string bug, where it automatically encodes binary strings as unicode, screwing with binary HTTP requests?

Give me an example of the bug please! :)

The thread for it is here. I've tried fixing it from the Java side but it doesn't seem possible. The issue stems from strings != binary data, but in CC they are represented the same.
Alright, through some testing I found some inconsistencies.
Saving to a file is accurate, which means storing it in RAM is fine.
However, loading from a file caused some "distortion" in the string.

e3 82 80 e3 81 97 e3 82 80 e3 81 97 - After Read
e3 02 00 e3 01 17 e3 02 00 e3 01 17 - Original RAM
It appears to add 128 to the bytes that aren't the original Unicode marker… the reason for this might be the wrap-around effect in the Unicode rendering done by Minecraft.
Each Unicode section is split into different font files, such as this one for Hiragana (http://puu.sh/i55ts/04c3ba977a.png), so perhaps when an index goes over the number of characters in that section it just wraps around.

Now, don't get me wrong, the data is different; however, when rendered by peripherals, both strings look identical. I tested this via the ChatBlock peripheral.

However, when saving a second time using the modified string e3 82 80 e3 81 97 e3 82 80 e3 81 97 and then reloading it into a different variable, the values are identical and saving/loading doesn't cause any further distortion. So I guess the answer is: yes, this fixes the LuaJ bug.
Also, even though the Unicode is saved into the file in the LuaJ format I found (eABBCC), when opened in external editors it really is Unicode.
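For reference, here's a rough sketch of the kind of round-trip comparison I'm describing (the path and test bytes are arbitrary):

-- Dump a string as hex, write it to a file, read it back, dump it again.
local function hexDump(s)
  return (s:gsub(".", function(ch)
    return string.format("%02x ", ch:byte())
  end))
end

local original = string.char(0xe3, 0x81, 0x97) -- one three-byte character

local h = fs.open("unicode_test", "w")
h.write(original)
h.close()

h = fs.open("unicode_test", "r")
local reloaded = h.readAll()
h.close()

print("original: " .. hexDump(original))
print("reloaded: " .. hexDump(reloaded))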
Bomb Bloke #8
Posted 30 May 2015 - 02:00 AM
However, when saving a second time using the modified string e3 82 80 e3 81 97 e3 82 80 e3 81 97 and then reloading it into a different variable, the values are identical and saving/loading doesn't cause any further distortion. So I guess the answer is: yes, this fixes the LuaJ bug.
Also, even though the Unicode is saved into the file in the LuaJ format I found (eABBCC), when opened in external editors it really is Unicode.

The problem is that people sending specific strings want them to be unaltered. They don't care about the unicode representation, they care about what values they get back from string.byte(); so any changes are undesirable.
Pyuu #9
Posted 30 May 2015 - 02:09 AM
However, when saving a second time using the modified string e3 82 80 e3 81 97 e3 82 80 e3 81 97 and then reloading it into a different variable, the values are identical and saving/loading doesn't cause any further distortion. So I guess the answer is: yes, this fixes the LuaJ bug.
Also, even though the Unicode is saved into the file in the LuaJ format I found (eABBCC), when opened in external editors it really is Unicode.

The problem is that people sending specific strings want them to be unaltered. They don't care about the unicode representation, they care about what values they get back from string.byte(); so any changes are undesirable.

I can look into fixing that if it's a big deal, probably just some value shifting and some changes to how the files get read to prevent the looping thing.
I'm not sure why someone would want to use string.byte() on a Unicode string; it wouldn't make much of a difference unless what you want is a 0-255 byte-based save system, which probably can't be done efficiently.

The strings made by this program all follow a certain pattern, and that pattern isn't plain ASCII binary, so U+FF30 won't be FF 30 in string.byte in the first place.
I'll probably write a function later, like unicode.byte, and stuff like that.
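For the curious, a unicode.byte-style decoder could be roughly this simple under the eABBCC pattern (hypothetical sketch, not in the current API):

-- Read one three-byte sequence starting at position i and return its code point.
-- The % 64 works whether or not the 0x80 continuation marker is present.
local function decodeChar(s, i)
  i = i or 1
  local a, b, c = s:byte(i, i + 2)
  return (a - 0xE0) * 4096 + (b % 64) * 64 + (c % 64)
end

print(string.format("U+%04X", decodeChar("\227\129\151"))) -- e3 81 97 -> U+3057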
Bomb Bloke #10
Posted 30 May 2015 - 02:54 AM
I'm not sure why someone would want to use string.byte() on a Unicode string; it wouldn't make much of a difference unless what you want is a 0-255 byte-based save system, which probably can't be done efficiently.

With the http API (for eg), the only way to send data is as a string. Users may not care about unicode at all - they may just want to eg read a BMP file into a single string and post that - and the translation is obviously undesirable in such cases.

Of course, such scenarios are the reason base64 exists, and so scripts have already been written to work around the issue.

But say it could be dealt with by writing a wrapper for eg http.post() - that could potentially be implemented into the mod itself, making things a bit easier for users.
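Something along these lines, for eg - just a sketch, with the base64 step written out in plain Lua 5.1 arithmetic (the server on the other end would of course need to decode it again):

local b64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

-- Encode arbitrary bytes as base64 so the string stays plain ASCII all the way through.
local function base64encode(data)
  local out = {}
  for i = 1, #data, 3 do
    local a, b, c = data:byte(i, i + 2)
    b, c = b or 0, c or 0
    local n = a * 65536 + b * 256 + c
    local c1 = math.floor(n / 262144) % 64
    local c2 = math.floor(n / 4096) % 64
    local c3 = math.floor(n / 64) % 64
    local c4 = n % 64
    out[#out + 1] = b64chars:sub(c1 + 1, c1 + 1) .. b64chars:sub(c2 + 1, c2 + 1)
      .. b64chars:sub(c3 + 1, c3 + 1) .. b64chars:sub(c4 + 1, c4 + 1)
  end
  -- pad out inputs that aren't a multiple of 3 bytes
  local rem = #data % 3
  if rem > 0 then
    out[#out] = out[#out]:sub(1, rem + 1) .. ("="):rep(3 - rem)
  end
  return table.concat(out)
end

-- The wrapper itself: callers hand over raw bytes, it posts text-safe base64.
local function postBinary(url, data)
  return http.post(url, base64encode(data))
end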
Pyuu #11
Posted 30 May 2015 - 03:00 AM
I'm not sure why someone would want to use string.byte() on a Unicode string; it wouldn't make much of a difference unless what you want is a 0-255 byte-based save system, which probably can't be done efficiently.

With the http API (for eg), the only way to send data is as a string. Users may not care about unicode at all - they may just want to eg read a BMP file into a single string and post that - and the translation is obviously undesirable in such cases.

Of course, such scenarios are the reason base64 exists, and so scripts have already been written to work around the issue.

But say it could be dealt with by writing a wrapper for eg http.post() - that could potentially be implemented into the mod itself, making things a bit easier for users.

Ahh, I get ya.
I'd have to experiment a bit with strings and see if the same interference happens if you just http.post a string with ASCII values;
As far as I know, control characters like 0x79 are where the game starts acting weird with storing information.
Pyuu #12
Posted 10 June 2015 - 09:02 PM
Some updates / news regarding this:
I'll be making an update to this to support Pure Lua (i.e. the console):
Spoiler

And I'll further be updating the language packs and possibly adding an Input Mode, so you can switch keyboards on CC / automatically convert text based strings into UTF-8 formatting (i.e. eABBCC).

First language packs that will be added:
  1. Japanese Hiragana and Katakana pack.
  2. Korean pack.
  3. Special Characters pack (TM/R/etc.)
  4. Spanish Letters / Accents pack.
After that, requests or something else.
I'll be working on adding some Kanji into the JP pack as well, and adding some downloadable stuff in there as well (such as an update function which downloads the latest version and a function to download language packs).

I'm playing with the Lua Console to figure out the difference between LuaJ and Pure Lua so I can find the main reasons behind the inability to use certain strings in CC.

That's all for now.
Bomb Bloke #13
Posted 11 June 2015 - 02:37 AM
I'm playing with the Lua Console to figure out the difference between LuaJ and Pure Lua so I can find the main reasons behind the inability to use certain strings in CC.

Something that may or may not be worth knowing is how ComputerCraft deals with font rendering.

The base "ascii.png" file containing the font ComputerCraft uses is populated with square glyphs, each of which is 8x8 in the default set. ComputerCraft crops these to 6x8 and renders whatever's left. If a character can't be cropped in this manner without losing pixel data (eg symbol 0xDB, which is an entirely filled-in square), ComputerCraft will render a ? symbol instead.

Of course, none of this should affect non-rendering operations, but it does mean you won't be getting Asian characters on-screen, not without drawing them as eg paintutils images.
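A quick way to see which glyphs make that cut is to just print the extended range on a computer and look for the question marks:

-- Anything the font can't crop to 6x8 is drawn as "?" instead.
for i = 128, 255 do
  write(string.char(i))
end
print()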
Creator #14
Posted 11 June 2015 - 05:36 AM
How difficult would it be to integrate unicode chars in CC? And printing them?
Pyuu #15
Posted 11 June 2015 - 06:04 AM
How difficult would it be to integrate unicode chars in CC? And printing them?
Honestly, depending on how CC is implemented in the first place, it could either be easy or hard.
Minecraft font rendering by default supports UTF-8, so any peripheral that uses Minecraft's default renderer would be able to display Unicode.
Not sure if this is OC compatible, maybe it uses the default rendering.

We definitely won't be able to use CC to display UTF-8 characters on the computers themselves without modding it (which I'm not sure anyone would go about doing).

I'm playing with the Lua Console to figure out the difference between LuaJ and Pure Lua so I can find the main reasons behind the inability to use certain strings in CC.

Something that may or may not be worth knowing is how ComputerCraft deals with font rendering.

The base "ascii.png" file containing the font ComputerCraft uses is populated with square glyphs, each of which is 8x8 in the default set. ComputerCraft crops these to 6x8 and renders whatever's left. If a character can't be cropped in this manner without losing pixel data (eg symbol 0xDB, which is an entirely filled-in square), ComputerCraft will render a ? symbol instead.

Of course, none of this should affect non-rendering operations, but it does mean you won't be getting Asian characters on-screen, not without drawing them as eg paintutils images.
Thanks for the information, I wasn't sure what files CC was using.
The Pure Lua console can render the text just fine though, so compatibility there is nice, mainly for educational purposes.
Creator #16
Posted 11 June 2015 - 10:28 AM
OC supports Unicode. I saw it on one of their videos.
MindenCucc #17
Posted 11 June 2015 - 10:41 AM
CC's display supports ISO-8859-1. I used edit, then I entered an 'á', and it was displayed, and the cursor was offset by 1 char. After entering a regular character, the 'á' turned into "??". I think it was CC1.5 or lower. But that 'á' is in the ISO-8859-1 charset too.

Edit: Proof
Spoiler
Note that there aren't any spaces. This is because those characters are 2 bytes big.

You can enter ISO-8859-1 characters by typing that character, then pressing END on the keyboard.


Edit 4: since Minecraft uses ISO-8859-1 encoding, CC supports that too. Partially, but it's supported. In my language there's "ű", and that's a Unicode character. I was pressing the button like crazy, but nothing happened.

Edit 5: "ű" is key43, and no "char" event is triggered.
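For reference, an event logger as simple as this shows which char / key events a key press actually generates (hold Ctrl+T to stop it):

-- Print every char and key event the computer receives.
while true do
  local event, value = os.pullEvent()
  if event == "char" or event == "key" then
    print(event, value)
  end
end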

Last edit I promise:
Spoiler

Not all chars are printable, but some are:


Also, a new bug :P
Edited on 11 June 2015 - 10:24 AM
Creator #18
Posted 11 June 2015 - 12:22 PM
Some updates / news regarding this:
I'll be making an update to this to support Pure Lua (i.e. the console):
Spoiler

And I'll further be updating the language packs and possibly adding an Input Mode, so you can switch keyboards on CC / automatically convert text based strings into UTF-8 formatting (i.e. eABBCC).

First language packs that will be added:
  1. Japanese Hiragana and Katakana pack.
  2. Korean pack.
  3. Special Characters pack (TM/R/etc.)
  4. Spanish Letters / Accents pack.
After that, requests or something else.
I'll be working on adding some Kanji into the JP pack as well, and adding some downloadable stuff in there as well (such as an update function which downloads the latest version and a function to download language packs).

I'm playing with the Lua Console to figure out the difference between LuaJ and Pure Lua so I can find the main reasons behind the inability to use certain strings in CC.

That's all for now.
Where did you get the console from? Please explain step by step. Thanks.
MindenCucc #19
Posted 11 June 2015 - 12:28 PM
This?

Spoiler

This is just a standard Windows cmd window. (Mine is a stupid ANSI one :( )
You can download the Lua shell from http://lua-users.org/wiki/LuaBinaries
The closest one to CC is located here: http://sourceforge.net/projects/luabinaries/files/5.1.4/Executables/
Edited on 11 June 2015 - 10:30 AM
Creator #20
Posted 11 June 2015 - 12:35 PM
How exactly? Which link exactly? What did you do to compile it? Lua 5.1, if possible.
MindenCucc #21
Posted 11 June 2015 - 01:24 PM
The last link is the Lua 5.1.x binary. Just pick the one for your operating system. For example, if you have Windows x64, then you should download the one ending with "win64_bin.zip", not the Cygwin one.
Edited on 11 June 2015 - 11:26 AM
Creator #22
Posted 11 June 2015 - 02:15 PM
Thank you.