Binary vs Text Files - ComputerCraft Forums (archive)

blunty666 #1

79 posts

Posted 12 December 2015 - 02:14 PM

I've got a couple of questions regarding manipulating files using either binary or text mode:

1) Is it possible to tell when a file has been written in binary mode vs text mode?

2) If I were to copy a binary file by opening it in text mode and then writing it to the new path in text mode, would I be at risk of corrupting the data?

2)ii) Similarly, if I were to copy a text file by opening it in binary mode and then writing it to the new path in binary mode, would I be at risk of corrupting the data?

3) What are the 'end-of-line conversions' that happen when opening a file in text mode?

Edited on 12 December 2015 - 01:15 PM

Lupus590 #2

2427 posts

Location UK

Posted 12 December 2015 - 02:48 PM

I'm not sure, but I'll try to answer.

1) Not that I know of.
2) Binary in text mode: I think you will have problems. Text in binary mode: fairly sure that it will be fine.
3) Linux, Windows and Mac all have a different way of storing a new line. The conversion process converts the other two to the one your OS prefers.

Edited on 12 December 2015 - 01:49 PM

blunty666 #3

79 posts

Posted 12 December 2015 - 02:57 PM

Ok gotcha

Lupus590, on 12 December 2015 - 02:48 PM said:
Linux, Windows and Mac all have a different way of storing a new line. The conversion process converts the other two to the one your OS prefers.

Do the conversions only happen when opening a file in text mode, or when writing as well?

MKlegoman357 #4

1140 posts

Location Kaunas, Lithuania

Posted 12 December 2015 - 03:11 PM

There's no "conversion" going on. When reading the file, CC automagically detects which line endings are being used. You can always open a file in binary mode without corrupting it, because it returns the raw data that is saved inside the file. However, text mode can only get representable characters, so reading and then saving in text-mode will corrupt binary files.

Edited on 12 December 2015 - 02:11 PM

Bomb Bloke #5

7083 posts

Location Tasmania (AU)

Posted 12 December 2015 - 11:29 PM

On a little bit of a tangent, network-based file transfers typically take "text mode" as an excuse to only transmit seven bits per byte, reducing the total data stream by 12.5%; obviously this ruins anything that doesn't fall strictly within the 128-character ASCII range.
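As a rough illustration of that seven-bit arithmetic, here is a minimal Python sketch (run outside ComputerCraft, and not taken from any real transfer protocol). The names pack_7bit and unpack_7bit are hypothetical.

```python
def pack_7bit(data: bytes) -> bytes:
    """Pack bytes into a stream that carries only the low seven bits of each."""
    acc = 0      # bit accumulator
    nbits = 0    # number of not-yet-emitted bits held in acc
    out = bytearray()
    for b in data:
        acc = (acc << 7) | (b & 0x7F)   # the top bit is thrown away here
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1         # keep only the bits still pending
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)


def unpack_7bit(packed: bytes, count: int) -> bytes:
    """Recover `count` seven-bit values from a packed stream."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7 and len(out) < count:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    return bytes(out)
```

Eight ASCII characters fit into seven bytes this way, which is where the 12.5% comes from. A byte such as 0xC8 comes back as 0x48, so anything outside the 128-character range is damaged.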

MKlegoman357, on 12 December 2015 - 03:11 PM said:
There's no "conversion" going on.

Er, yes there is. Even file.readAll() will strip out carriage returns, and so will file.write() - this goes regardless as to whether you put them back in again between reading and re-writing!

Lupus590, on 12 December 2015 - 02:48 PM said:
Linux, Windows and Mac all have a different way of storing a new line. The conversion process converts the other two to the one your OS prefers.

Running CC under Windows (as I imagine most users do), I find it seems locked to "convert to Unix" - line-feeds only, no carriage returns.

blunty666, on 12 December 2015 - 02:14 PM said:
Is it possible to tell when a file has been written in binary mode vs text mode?

No, because binary mode allows you to write a file exactly how you want it, bit-for-bit; this means that it's always possible a file was written using binary mode.

By inspecting file content via binary mode, it becomes possible to tell whether it's "safe" to use text mode on it… but unless you want to actually use the text in a file for text-based purposes, there's no point; text mode makes it easy to parse that data into strings, but that's the only benefit to using it. If you just want to copy files around, stick with binary mode all the way.
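A minimal sketch of such an inspection, written in Python rather than CC Lua so it can be run anywhere. It assumes, going by what is reported elsewhere in this thread, that the only bytes text mode alters are carriage returns (0x0D) and values above 0x7F; other values may be affected too. The function name is hypothetical.

```python
def safe_for_text_mode(data: bytes) -> bool:
    """Return True if no byte in `data` is one that text mode is assumed to alter.

    Assumption: only carriage returns (0x0D) and bytes above 0x7F are changed.
    """
    return all(b <= 0x7F and b != 0x0D for b in data)


if __name__ == "__main__":
    print(safe_for_text_mode(b"print('hi')\n"))     # plain ASCII with line feeds
    print(safe_for_text_mode(b"line one\r\n"))      # contains a carriage return
    print(safe_for_text_mode(bytes([0x89, 0x50])))  # contains a byte above 0x7F
```

The data would be read in binary mode first, and only handed to text-based code if the check passes.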

blunty666 #6

79 posts

Posted 13 December 2015 - 10:54 AM

Bomb Bloke, on 12 December 2015 - 11:29 PM said:
Running CC under Windows (as I imagine most users do), I find it seems locked to "convert to Unix" - line-feeds only, no carriage returns.

Brilliant, thanks for the extra detail. I'm asking because I was looking at creating a modified filesystem that allows mounts (inspired by Lyqyd's raidmount for LyqydOS), and I was unsure if I needed to worry about any changes that are made when reading in text mode.

If I'm going to create any kind of raid array, I'll have to stripe the data (writing in binary mode), and then read in binary mode to piece it back together. So my questions were based on if I would have to do any end-of-line / carriage return conversions to account for the fact the data was opened in binary mode and hence hasn't had the conversions applied.

Would it break anything if I didn't?

Bomb Bloke #7

7083 posts

Location Tasmania (AU)

Posted 13 December 2015 - 11:21 AM

blunty666, on 13 December 2015 - 10:54 AM said:
Would it break anything if I didn't?

Nothing comes immediately to mind, but I'd do it anyway if I were you. If text mode was requested of your RAID, then it shouldn't be overly difficult to simply discard \r (character 0x0D) whenever you see it.
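Roughly what that could look like, as a Python sketch over raw bytes already read in binary mode (this is not the CC API; strip_cr and read_lines are hypothetical names). Treating a trailing line feed as not starting an extra empty line is an assumption.

```python
CR = 0x0D  # carriage return, \r


def strip_cr(raw: bytes) -> bytes:
    """Discard every carriage return, leaving line feeds as the only line break."""
    return bytes(b for b in raw if b != CR)


def read_lines(raw: bytes) -> list:
    """Split binary data into lines the way a text-mode reader might present them."""
    lines = strip_cr(raw).split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # assumption: a trailing line feed does not add an empty line
    return lines


if __name__ == "__main__":
    print(read_lines(b"one\r\ntwo\n"))
```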

'spose I should elaborate on another point; ComputerCraft can't read certain characters in text mode at all, and converts them to char 0x3F (a question mark) if it encounters them. Does the same thing in reverse when writing to disk. Most all values above 0x7F are affected, though perhaps some others are too; easy enough for you to test out for yourself. As far as I'm aware there isn't any code out there which relies on this behaviour (though there are more than a few complaints about it), so you could probably get away with ignoring it.
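To make the effect concrete, a small Python model of that substitution (not ComputerCraft code). It assumes every byte above 0x7F is affected and nothing else, which is a simplification of the "most all values" described above; text_mode_view is a hypothetical name.

```python
QUESTION_MARK = 0x3F


def text_mode_view(raw: bytes) -> bytes:
    """Model text mode: replace each byte above 0x7F with a question mark.

    Assumption: bytes 0x00-0x7F pass through unchanged.
    """
    return bytes(b if b <= 0x7F else QUESTION_MARK for b in raw)


if __name__ == "__main__":
    original = bytes([0x41, 0x80, 0xFF, 0x7F])
    print(text_mode_view(original))
```

Since distinct inputs such as 0x80 and 0xFF both become 0x3F, the original bytes cannot be recovered afterwards, which is why copying binary data through text mode is lossy.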

The package.open() function (a doppelganger of fs.open()) in my Package API strips carriage returns when reading lines in text mode, but doesn't bother to "corrupt" non-ASCII characters as fs.open()'s handles do. Seemed to work fine in my tests but it likely hasn't gotten much use outside of those.

blunty666 #8

79 posts

Posted 13 December 2015 - 01:40 PM

Bomb Bloke, on 13 December 2015 - 11:21 AM said:
Nothing comes immediately to mind, but I'd do it anyway if I were you. If text mode was requested of your RAID, then it shouldn't be overly difficult to simply discard \r (character 0x0D) whenever you see it.

Thanks for this, going to go and play around and see what happens. Will probably just try raid 0 striping to start off with, no parity checks.
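For reference, a minimal sketch of the RAID 0 idea in Python (not CC Lua, and not raidmount's actual code). The names stripe and unstripe and the 4-byte chunk size are made up for illustration; everything stays as raw bytes, so no end-of-line handling is involved.

```python
from typing import List


def stripe(data: bytes, n_disks: int, chunk: int = 4) -> List[bytes]:
    """Split data into fixed-size chunks and deal them round-robin across disks."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks] += data[i:i + chunk]
    return [bytes(d) for d in disks]


def unstripe(disks: List[bytes], chunk: int = 4) -> bytes:
    """Reassemble the original data by reading chunks back in round-robin order."""
    out = bytearray()
    offsets = [0] * len(disks)
    d = 0
    # The first disk found to be exhausted marks the end of the data,
    # because chunks were dealt out strictly in order.
    while offsets[d] < len(disks[d]):
        out += disks[d][offsets[d]:offsets[d] + chunk]
        offsets[d] += chunk
        d = (d + 1) % len(disks)
    return bytes(out)


if __name__ == "__main__":
    parts = stripe(b"abcdefghij", 2)
    print(parts)
    print(unstripe(parts))
```

With no parity there is no redundancy: losing any one disk loses the file.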

BTW I checked writing to a disk in text mode and can confirm any character over 0x7F (127) gets converted to a question mark.