There is a bug in BinaryOutputCapsule.write(String…). It saves the number of characters, not the number of bytes that readString in BinaryInputCapsule expects.
Here is a patch:
RCS file: /cvs/jme/src/com/jme/util/export/binary/BinaryOutputCapsule.java,v
retrieving revision 1.6
diff -u -r1.6 BinaryOutputCapsule.java
--- src/com/jme/util/export/binary/BinaryOutputCapsule.java 17 Dec 2007 14:48:28 -0000 1.6
+++ src/com/jme/util/export/binary/BinaryOutputCapsule.java 11 Mar 2008 10:35:30 -0000
@@ -519,7 +519,7 @@
byte[] bytes = value.getBytes();
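For anyone following along, the shape of the fix described above is roughly this. It's a hypothetical sketch, not the actual jME method body (the real code writes through the capsule's internal stream), but it shows why writing the byte length keeps the writer and reader in sync:

```java
import java.io.DataOutputStream;
import java.io.IOException;

public class StringWriteSketch {

    // Sketch of the corrected write: store the BYTE length, not
    // value.length() (the character count). A reader that allocates a
    // byte[] of the stored size then stays in sync with the stream even
    // when characters encode to more than one byte.
    static void writeString(DataOutputStream out, String value) throws IOException {
        byte[] bytes = value.getBytes(); // platform default encoding, as in the patch
        out.writeInt(bytes.length);      // the bug was writing value.length() here
        out.write(bytes);
    }
}
```

With a multi-byte character like 'é', `value.length()` and `bytes.length` differ, which is exactly the mismatch tbutter hit.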
I just saw a commit go in…
That's better but still far from ideal; the behaviour is still platform dependent. While most "platforms"/OSs use some form of 1-byte encoding (or better put, the Sun JVM implementation often decides that one of these is the default), it's still very unpredictable and not very future proof.
It's best to settle on one encoding such as UTF-8 or UTF-16 ( getBytes("UTF-8") ). There is one downside to switching though… it will break loading of some models that were saved with non-ASCII characters in them (if that leads to a byte sequence that is illegal UTF-8). What we could do is catch the exception when decoding the UTF-8 string and fall back to some 1-byte encoding (such as latin1) for any models that cause this. It might lead to some garbled characters in older models, but at least you can still load them. Still better than the current situation (if I save a model on a Chinese PC it will open differently on an American one, for instance).
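A minimal sketch of that fallback idea, assuming strict UTF-8 first and latin1 as the rescue encoding (none of this is committed code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class FallbackDecode {

    // Try strict UTF-8 first; if the bytes are not valid UTF-8 (e.g. an old
    // model saved with a 1-byte platform encoding), fall back to ISO-8859-1,
    // which maps every possible byte and therefore can never fail.
    static String decode(byte[] bytes) {
        try {
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Fallback may garble non-latin1 legacy text, but the model loads.
            return new String(bytes, Charset.forName("ISO-8859-1"));
        }
    }
}
```

Note the caveat mentioned below: a legacy byte sequence that happens to be valid UTF-8 will take the first branch and decode "wrong" silently.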
Should I implement this, or does anyone else have other ideas? (at work right now so no CVS access)
@llama: No, IMO, you are correct. I think that defaulting to 8859_1 is a good idea if the UTF-8 decode blows up. I would also say that this might be a configuration parameter too… either on Spatial or in the general game configs, defaulting to UTF-8 and gracefully failing over to 8859_1 (basic Latin).
I don't know what the best way to implement it is, but it becomes possible that older models use 8859_1 and newer ones use UTF-8, in which case I'd like to be able to specify that per model and avoid the secondary check… but either way works, I think.
As well, I don't think it would be a bad idea to investigate conversions for models from 8859_1 to UTF-8 then require UTF-8 across the board.
AFAIK, the JVM and some DBs use UTF-16 internally, but do a lot of conversion from UTF-16 to whatever the DB setting is (e.g. 8859_1, UTF-8, etc.). I have not yet run into a case where UTF-16 was preferred by a common language.
I'll start implementation.
UTF-16 in use cases like this (text storage) is mostly used outside of Europe and the US (a lot of Asian characters take 3 bytes in UTF-8 and only 2 in UTF-16). I don't think any platform will report UTF-16 as the default locale though.
UTF-8, however, could very well be the default locale… (might be the case for tbutter, or he would not have found this bug; if you use only iso_8859_1 the old code should work just fine). This is sometimes the case on Linux, for example.
Defaulting to 8859_1 is kind of nasty, that's like saying some Europeans and Americans can have special characters but the rest can not.
Converting your model from 8859_1 to UTF-8 would mean loading it and saving it, after I've made the changes to the code. However if your 8859_1 sequence parses as valid UTF-8 some characters will be wrong… well, too bad I'd say.
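The re-encoding itself is trivial once the loader and saver agree; a sketch, assuming the legacy bytes really are ISO-8859-1:

```java
import java.io.UnsupportedEncodingException;

public class Reencode {

    // Re-encode a legacy ISO-8859-1 byte sequence as UTF-8. Loading with
    // ISO-8859-1 can never fail (every byte maps to a char), but as noted
    // above, old bytes that HAPPEN to also be valid UTF-8 would be picked
    // up wrong by a loader that tries UTF-8 first.
    static byte[] latin1ToUtf8(byte[] oldBytes) throws UnsupportedEncodingException {
        String s = new String(oldBytes, "ISO-8859-1");
        return s.getBytes("UTF-8");
    }
}
```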
The binary format is very volatile and breakable… you should store your models in some other format for long term storage anyway. It's about fast storage and fast loading, not making all kinds of strange options for legacy bugs.
We should get rid of that getBytes() quickly. Java can handle utf-8 very well, so we really should use it. It should not be a problem on any java platform (as we require 5+ anyway).
String always had an issue with encoding. The getBytes was simply a patch to correct an improper count of bytes.
I agree we should consider encoding, but I wonder if it would make sense to do that in 2.0 to avoid breaking existing assets. No strong preference though.
Also, to add a ++ to llama's point above, binary format is not intended for an authoritative version of your files. Here at work we store everything in collada format (or images in their native format) and then convert them nightly or as needed as part of the build process.
Great job! Would there perhaps be any reason to allow more charsets, like UTF16 for example?
UTF-8 stores anything that can be in a Java String, in a standardized, portable fashion. (I did notice that Sun does some very weird escaping for some Unicode characters it couldn't map to my native encoding, not something you can rely on though, but it probably explains how tbutter ran into the issue.)
UTF-8 also uses only 1 byte for most Latin characters (still the most used, I would think). The only real advantage of UTF-16 over UTF-8 is that many Asian/Arabic characters are only 2 bytes (most characters are 2 bytes in UTF-16; only those outside the basic plane take 4) instead of 3 in UTF-8. So there's no real reason to support anything else (at least not until Java changes 'char' from 16 bit to 32 bit, which I doubt will happen at all).
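To make the size trade-off concrete, byte counts per Java's standard charset implementations (UTF-16BE is used here so no BOM is added to the count):

```java
public class EncodingSizes {
    public static void main(String[] args) throws Exception {
        String latin = "hello";         // plain ASCII/Latin text
        String cjk = "\u4F60\u597D";    // "ni hao", two CJK characters

        // UTF-8: 1 byte per ASCII char, 3 bytes per CJK char
        System.out.println(latin.getBytes("UTF-8").length);    // 5
        System.out.println(cjk.getBytes("UTF-8").length);      // 6

        // UTF-16BE: 2 bytes per BMP char, regardless of script
        System.out.println(latin.getBytes("UTF-16BE").length); // 10
        System.out.println(cjk.getBytes("UTF-16BE").length);   // 4
    }
}
```

So UTF-16 wins by 2 bytes per character on CJK text, and UTF-8 wins by 1 byte per character on Latin text.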
As llama stated, UTF-8 is the accepted default… Asian/Arabic/Hebrew/Syriac/Hindi and a few others occasionally use UTF-16 natively. I do tons of internationalization work in Arabic, Russian and Japanese and have no problems with UTF-8 (nor do the files sent to me by native users). I still haven't run into a case where UTF-8 didn't do what I needed because something was UTF-16. I've run into many cases converting from 8859_1 and vice versa.
I think it would be acceptable to add UTF-16, however, UTF-32 is still up and coming.
I really don't see that many people try to push UTF-16 as files. I actually see more UTF-7, believe it or not. I have seen UTF-6 and 9 as well. I think by keeping the de facto standard at UTF-8 for now, there will still be a conversion.
You could also detect the character encoding being used and simply convert from that encoding to UTF-8. I haven't messed much with UTF-16 lately, so this may not work.
You could save the encoding name inside the jME model file and then use the Java Charset API to read all strings inside it; then the format would be backward/forward compatible if another conversion ever becomes necessary.
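A sketch of what that could look like. This is a hypothetical layout, not the actual jME binary format: the charset name is written once in a header, and every string is then a byte length followed by the bytes, decoded via the Charset API:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class TaggedStrings {

    // Header names the charset once for the whole file (charset names are ASCII).
    static void writeHeader(DataOutputStream out, String charsetName) throws IOException {
        byte[] name = charsetName.getBytes("US-ASCII");
        out.writeInt(name.length);
        out.write(name);
    }

    static Charset readHeader(DataInputStream in) throws IOException {
        byte[] name = new byte[in.readInt()];
        in.readFully(name);
        return Charset.forName(new String(name, "US-ASCII"));
    }

    // Each string: byte length, then the bytes in the file's charset.
    static void writeString(DataOutputStream out, String s, Charset cs) throws IOException {
        byte[] b = s.getBytes(cs.name());
        out.writeInt(b.length);
        out.write(b);
    }

    static String readString(DataInputStream in, Charset cs) throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return new String(b, cs.name());
    }
}
```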
For almost any other case I would agree adding support for other charsets is good, but in this use case I just fail to see the relevance. UTF-8 can store any Java String.
If Java Strings in the future support "UTF-32" (not technically the right way to put it, but you get the idea), we'll probably still want to use a mapping to UTF-8 to support older Java versions anyway (which likely won't even require a single letter of code to change). And UTF-8 will still be the most obvious choice because on average it leads to the smallest file. This is all assuming users will use characters not in UTF-16 to begin with (how silly is that?)!!
Adding the encoding name to the file format will break the old format, it won't make loading and saving faster, and at most it will save one or two bytes per character if you use Chinese/Arabic/etc. strings. It will also open up a host of issues (what if I use an encoding that isn't supported on your platform?).
So, no real benefits now or in the future. Supporting different encodings is good for interoperability, but the binary format is not for that. We have COLLADA/others and now Hevee's XML format for that. Since they use XML, all encodings will work there, and now the binary format can store any content used in them.
If people have time or energy left to further go into this topic, I'd encourage them to focus their energies on Hevee's XML format, not this one
So, let's put it to rest