This is a read-only snapshot of the ComputerCraft forums, taken in April 2020.

Help getting a lua bytecode parser working

Started by ardera, 31 May 2014 - 01:55 PM
ardera #1
Posted 31 May 2014 - 03:55 PM
Hello, I'm trying to build a bytecode parser that reads all the data in a bytecode chunk. Almost everything works, but I seem to be missing something: between the instruction list length integer and the first constant list index, something goes wrong. I've been looking for the cause for about an hour now and can't find it. I'll demonstrate it on a hexdump of a file that was made using


string.dump(function() return "athin" end)


Here's a picture of what I mean:
[attachment=1706:hexdump aktuell.PNG]

But - 250 constants? That doesn't work. There is only one constant.

If there were one more 00 byte between "Instructions" and "Number of Constants", it would work perfectly: the constant count would be 1, the type of the first constant would be a string, the next 4 bytes would be the length of the string including the null at the end, etc.

So, what am I missing?
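
A minimal sketch of how the constant list described above could be read, assuming a big-endian chunk with 4-byte integers and 4-byte string sizes. The names readU32BE and readConstants are made up for illustration, and only string constants are handled:

```lua
-- Read a big-endian unsigned 32-bit integer starting at 1-based position pos.
-- Returns the value and the position of the next unread byte.
local function readU32BE(s, pos)
  local b1, b2, b3, b4 = string.byte(s, pos, pos + 3)
  assert(b4, "unexpected end of chunk")
  return ((b1 * 256 + b2) * 256 + b3) * 256 + b4, pos + 4
end

-- Read the constant list of one function block (hypothetical helper).
-- Only string constants (type tag 4) are handled in this sketch.
local function readConstants(s, pos)
  local count
  count, pos = readU32BE(s, pos)
  local constants = {}
  for i = 1, count do
    local tag = string.byte(s, pos)
    pos = pos + 1
    if tag == 4 then
      local size
      size, pos = readU32BE(s, pos) -- the size includes the trailing null
      constants[i] = string.sub(s, pos, pos + size - 2)
      pos = pos + size
    else
      error("constant type " .. tostring(tag) .. " not handled in this sketch")
    end
  end
  return constants, pos
end

-- Constant list bytes as they should look for: return "athin"
local sample = string.char(0, 0, 0, 1, 4, 0, 0, 0, 6, 97, 116, 104, 105, 110, 0)
local constants, nextPos = readConstants(sample, 1)
print(#constants, constants[1], nextPos)
```

With a byte missing earlier in the chunk, the same reader starts one byte too early and produces a nonsense constant count, which matches the symptom in the screenshot.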
SquidDev #2
Posted 03 June 2014 - 12:33 PM
I'm not sure I can help with your problem specifically, but have you had a look at A No-Frills Introduction to Lua 5.1 VM Instructions and ChunkSpy?
AgentE382 #3
Posted 05 June 2014 - 08:20 PM
You are, in fact, missing an 0x00 byte. As the "Instructions", you have "00 00 00 01 01 00 00 1E 00 00 1E 00"

Looking at the Lua VM reference SquidDev linked, that translates to:
LOADK 0 0 -- Load constant at index 0 into register number 0.
RETURN 0 2 -- Return 1 value, starting at register number 0.
MOVE 120 0 -- Copy the value of register number 0 into register number 120.
That last one doesn't make any sense. Why would the bytecode generator insert such a ridiculous instruction that will never be executed?

If you add one 0x00 byte to the last instruction, it reads as, "00 00 00 01 01 00 00 1E 00 00 00 1E".
That translates to:
LOADK 0 0 -- Load constant at index 0 into register number 0.
RETURN 0 2 -- Return 1 value, starting at register number 0.
RETURN 0 0 -- Return all values from register number 0 to the top of the stack.
If you read the PDF, you will find that the bytecode generator always adds a return statement to the end of the bytecode, even if there's already an explicit return statement in the Lua source. Therefore, this disassembly makes sense.
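
The decoding above can be reproduced with a small sketch. It assumes the Lua 5.1 instruction layout from the introduction (6-bit opcode, 8-bit A, 9-bit C, 9-bit B, with Bx overlapping C and B) and big-endian byte order. decode and OPNAMES are hypothetical names, and only the three opcodes mentioned here are named:

```lua
-- Subset of the Lua 5.1 opcode table, just enough for this example.
local OPNAMES = { [0] = "MOVE", [1] = "LOADK", [30] = "RETURN" }

-- Decode one 32-bit instruction given as four big-endian bytes.
-- Plain arithmetic is used instead of bit operations so it runs on Lua 5.1.
local function decode(b1, b2, b3, b4)
  local word = ((b1 * 256 + b2) * 256 + b3) * 256 + b4
  local op = word % 64                          -- bits 0-5
  local a  = math.floor(word / 64) % 256        -- bits 6-13
  local c  = math.floor(word / 16384) % 512     -- bits 14-22
  local b  = math.floor(word / 8388608) % 512   -- bits 23-31
  local bx = math.floor(word / 16384)           -- bits 14-31
  return { name = OPNAMES[op] or ("OP_" .. op), a = a, b = b, c = c, bx = bx }
end

-- The twelve "Instructions" bytes from the posted hexdump.
local bytes = { 0, 0, 0, 1,  1, 0, 0, 30,  0, 0, 30, 0 }
for i = 1, #bytes, 4 do
  local ins = decode(bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3])
  print(ins.name, ins.a, ins.b, ins.c, ins.bx)
end
```

For LOADK the relevant operands are a and bx; for MOVE and RETURN they are a and b.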

Anyway, if you add an extra 0x00 byte there, it shifts the rest of the bytecode over so it makes sense, like you said. It's just that the missing 0x00 byte isn't between "Instructions" and "Number of Constants", it's part of an instruction.

Now, I have no idea how this could be useful to you, since the output is directly from CC, but that's the problem.

Note: After modifying ChunkSpy to accept big-endian chunks, it errored on the bytecode as you posted it, but worked fine with the bytecode modified either the way you suggested or the way I suggested.
Edited on 05 June 2014 - 06:27 PM
ardera #4
Posted 08 June 2014 - 09:39 AM
Yes, it seems the byte is missing from the 3rd instruction. But if you add it, there's one byte too many at the end. ChunkSpy didn't complain about that because it only follows the counters, but isn't it odd that one byte is missing and there's one byte too many at the end? Is it possible that it moved or something?

This is definitely a bug; we/I just have to find out whether it's a LuaJ or a CC one, although I think CC doesn't have anything to do with bytecode dumping.

I tried to dump now using vanilla LuaJ and it's working there. Hexdump:
[attachment=1724:Hexdump LuaJ.PNG]

I saw that CC uses some other packages of LuaJ too. Maybe there are some additional compilers causing the error.

I started writing the parser from the introduction SquidDev linked, and tried to track down the problem from the OP with ChunkSpy, which I tried to modify to accept big-endian numbers (like AgentE382 did), but I couldn't get it working.

EDIT: Searched for the bug in the LuaJ bug list, but it isn't in there. This really seems to be a CC thing.
Edited on 08 June 2014 - 07:48 AM
Lyqyd #5
Posted 08 June 2014 - 08:08 PM
Looks fine here:


27,76,117,97,81,0,0,4,4,4,8,0,
0,0,0,0,
0,0,0,1,
0,0,0,1,
0,0,0,2,
0,
0,
0,
3,
0,0,0,1,
1,0,0,30,
0,128,0,30,
0,0,0,1,
4,
0,0,0,6,
97,116,104,105,110,0,
0,0,0,0,
0,0,0,0,
0,0,0,0,
0,0,0,0,

How are you dumping the generated byte code to file?

Edit: Looks like you must be trying to write the whole dumped string to the file at once using a non-binary file handle. That's not going to work, because one of the values is > 127. If you look carefully, you'll notice that that byte is missing from your output. I'm not sure why it ends up missing entirely, but you may have better luck using a binary file handle and writing each byte to the file one at a time using string.byte. For reference, I just ran a quick for loop in the lua prompt and used string.byte to dump into a normal file handle, which is why my dump is comma-separated decimal values rather than hex values. You can see that the dump is still accurate, though. This is not a CC bug.
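
Both approaches from the edit above could look roughly like this. It is only a sketch: toDecimalList, writeBytes and newMemoryHandle are made-up names, and the handle is assumed to expose write(number) the way a ComputerCraft handle opened with fs.open(path, "wb") does. An in-memory stand-in is used so the code also runs outside CC:

```lua
-- Turn a binary string into comma-separated decimal byte values.
local function toDecimalList(s)
  local out = {}
  for i = 1, #s do
    out[i] = tostring(string.byte(s, i))
  end
  return table.concat(out, ",")
end

-- Write a string one byte at a time through a binary-style handle.
-- Assumption: handle.write accepts a single byte value as a number.
local function writeBytes(handle, s)
  for i = 1, #s do
    handle.write(string.byte(s, i))
  end
end

-- In-memory stand-in for a binary file handle (not a CC API).
local function newMemoryHandle()
  local bytes = {}
  return {
    bytes = bytes,
    write = function(b) bytes[#bytes + 1] = b end,
    close = function() end,
  }
end

-- The instruction from the dump that contains a byte above 127.
local data = string.char(0, 128, 0, 30)
print(toDecimalList(data))

local h = newMemoryHandle()
writeBytes(h, data)
h.close()
print(#h.bytes)
```

Inside CC, the stand-in would presumably be replaced by a handle from fs.open with mode "wb", and data by the result of string.dump.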
AgentE382 #6
Posted 09 June 2014 - 02:15 AM
Lyqyd's the man!
<snip> because one of the values is > 127. <snip>
I keep forgetting about this (extremely annoying) issue.
ardera #7
Posted 09 June 2014 - 11:14 AM
Thanks Lyqyd, that fixed everything.
Off-topic: What's causing this 127 byte issue anyway? Do handles only accept 7 bits per char or something?
Edited on 09 June 2014 - 09:17 AM
Lyqyd #8
Posted 09 June 2014 - 07:39 PM
Something in the translation of Lua strings to Java strings doesn't like non-ASCII values. You can see the bug appear when trying to send strings through rednet as well.
ardera #9
Posted 10 June 2014 - 09:12 PM
Yes, I already know that; it becomes 239. I googled a bit and searched through the LuaJ code, and it has to do with UTF-8. Non-ASCII characters such as extended ASCII (128-255) have the highest bit of the byte set to 1, and UTF-8 should encode them as two bytes. The problem is that functions like string.dump don't let LuaString process and encode the characters; they store them directly in the byte array that is meant for already-UTF-8-encoded bytes. When a method then wants to convert a LuaString into a Java string, the LuaString class decodes that byte array as UTF-8 into Java chars. If the highest bit of a byte is 1, the UTF-8 decoder treats it as part of a multi-byte sequence, and there's the problem.
239 is the first byte of the UTF-8 replacement character, which gets written to the stream when something goes wrong. Normally it should write two more bytes too, 191 and 189; I don't know why they're not being output.
string.char and string.byte work too, since they don't convert to UTF-8 either. The only problem is LuaString.tojstring (and LuaString.valueOf(String s)), because it interprets bytes as UTF-8 that aren't encoded in UTF-8.
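
The byte values mentioned above can be checked with a short sketch. utf8Bytes and utf8Role are hypothetical helpers written for this illustration (they are not LuaJ functions), and the encoder only covers code points up to U+FFFF:

```lua
-- Encode a Unicode code point (up to U+FFFF) as a list of UTF-8 bytes.
local function utf8Bytes(cp)
  if cp < 0x80 then
    return { cp }
  elseif cp < 0x800 then
    return { 0xC0 + math.floor(cp / 64), 0x80 + cp % 64 }
  else
    return {
      0xE0 + math.floor(cp / 4096),
      0x80 + math.floor(cp / 64) % 64,
      0x80 + cp % 64,
    }
  end
end

-- Classify a single byte by the role it would play in a UTF-8 stream.
local function utf8Role(byte)
  if byte < 0x80 then return "ascii"
  elseif byte < 0xC0 then return "continuation"
  elseif byte < 0xE0 then return "lead2"
  elseif byte < 0xF0 then return "lead3"
  else return "lead4-or-invalid" end
end

-- The replacement character U+FFFD (65533) as UTF-8 bytes.
print(table.concat(utf8Bytes(0xFFFD), ","))
-- What a correct encoder would emit for character 128.
print(table.concat(utf8Bytes(128), ","))
-- How a decoder sees the raw byte 128 from the dump.
print(utf8Role(128))
```

A raw 128 is a continuation byte with no lead byte in front of it, so a decoder that treats the array as UTF-8 has to reject it, and substituting U+FFFD is the usual way to do that.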

Why does LuaJ convert these bytes into UTF-8? Normal files are ANSI, so just bytes, and string.byte only accepts bytes, so there's no way to make use of UTF-8, and it's buggy.

Also, there are some unknown details, like why the terminal prints '?' instead of a symbol (OK, that could be because there's no symbol for it), why only rednet gets the UTF-8 replacement character (65533, so 239, 191 and 189), why it's not outputting the other two bytes of the replacement character, or why the UTF-8 "madness" just "consumed" a byte here.

Couldn't CC just fix this by using the byte array from the LuaString class instead of using tojstring? It's extremely rare that something uses LuaString.valueOf(String astring) with a String that contains bytes bigger than 127. If I'm right on how term.write uses LuaStrings, it doesn't use a ByteStream of the byte array in the LuaString class, meaning that this would open the possibility to get all ANSI characters. The weird rednet bug would be fixed, and everyone would be happy.
Edited on 11 June 2014 - 01:07 PM