[SOLVED] Code error when checking UTF-8 data

Hi, dear monkeys. Long time no see. :grinning:

Problem

I ran into a problem exporting a 3D scene to a j3o file.

I gave a Light a name containing Chinese characters, and when the exported j3o is loaded the name is not recognized as UTF-8.

Here is the log:

WARNING: Your export has been saved with an incorrect encoding for its String fields which means it might not load correctly due to encoding issues. You should probably re-export your work. See ISSUE 276 in the jME issue tracker.
Nov 27, 2020 11:24:12 AM com.jme3.export.binary.BinaryInputCapsule readString

I read the source code of BinaryInputCapsule#readString and found that the check for UTF-8 data has a bug.

Take the 3-byte UTF-8 sequence [0xE4, 0x8A, 0xBC]. When b = 0xE4 (1110 0100), the lead byte is classified as the start of a 2-byte sequence.

See this part:

                if (b < 0x80) {
                    // good
                }
                else if ((b & 0xC0) == 0xC0) {//   (0xE4 & 0xC0) == 0xC0     =====>  true
                    utf8State = UTF8_2BYTE;
                }
                else if ((b & 0xE0) == 0xE0) {//   (0xE4 & 0xE0) == 0xE0      =====>  true, but never reached
                    utf8State = UTF8_3BYTE_1;
                }
                else {
                    utf8State = UTF8_ILLEGAL;
                }

Because the first check also matches every 3-byte lead byte, 3-byte UTF-8 sequences will always be treated as 2-byte sequences.
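
To make the problem easy to reproduce outside the engine, here is a minimal sketch that mirrors the branch order of the snippet above. The class and method names (LeadByteMaskDemo, classifyWithOriginalMasks) are made up for this illustration, and it assumes b holds the unsigned byte value (0 to 255); it is not the actual jME code.

```java
class LeadByteMaskDemo {
    // Hypothetical helper: returns the sequence length implied by the
    // original masks, or -1 for "illegal". Assumes b is in 0..255.
    static int classifyWithOriginalMasks(int b) {
        if (b < 0x80) {
            return 1;
        } else if ((b & 0xC0) == 0xC0) {
            // Matches 11xxxxxx, which includes every 1110xxxx lead byte.
            return 2;
        } else if ((b & 0xE0) == 0xE0) {
            // Unreachable: any byte that passes this test passed the one above.
            return 3;
        } else {
            return -1;
        }
    }

    public static void main(String[] args) {
        int b = 0xE4; // lead byte of a 3-byte sequence
        System.out.println(Integer.toBinaryString(b));      // 11100100
        System.out.println(classifyWithOriginalMasks(b));   // 2
    }
}
```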

Bugfix

UTF-8 encodes data this way:

bytes  encoding
1      0xxxxxxx
2      110xxxxx 10xxxxxx
3      1110xxxx 10xxxxxx 10xxxxxx
4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(The 5- and 6-byte forms come from the original UTF-8 design; RFC 3629 limits UTF-8 to 4 bytes.)

Which means:

  • For a 2-byte sequence, the code should check that the first 3 bits of the lead byte are 110, and that the first 2 bits of the following byte are 10.
  • For a 3-byte sequence, the code should check that the first 4 bits of the lead byte are 1110, and that the first 2 bits of each following byte are 10.

This is the fix:

                if (b < 0x80) {
                    // good
                }
                else if ((b & 0xE0) == 0xC0) {//   (0xE4 & 0xE0) == 0xC0     =====>  false
                    utf8State = UTF8_2BYTE;
                }
                else if ((b & 0xF0) == 0xE0) {//   (0xE4 & 0xF0) == 0xE0      =====>  true
                    utf8State = UTF8_3BYTE_1;
                }
                else {
                    utf8State = UTF8_ILLEGAL;
                }
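
For anyone who wants to try the corrected masks in isolation, the following is a rough standalone sketch. The names (Utf8LeadByteCheck, expectedLength, isContinuation, looksLikeUtf8) are hypothetical and not part of jME. Like the snippet above it only recognizes sequences of up to 3 bytes, and it does not reject overlong encodings, so treat it as an illustration of the bit checks rather than a complete validator.

```java
class Utf8LeadByteCheck {
    // Sequence length implied by a lead byte (0..255), or -1 if not accepted.
    static int expectedLength(int b) {
        if (b < 0x80) return 1;               // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;     // 1110xxxx
        return -1;                            // everything else is rejected here
    }

    // Continuation bytes must look like 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    // Walks the array and checks every lead byte and its continuation bytes.
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int len = expectedLength(data[i] & 0xFF);
            if (len < 0 || i + len > data.length) return false;
            for (int k = 1; k < len; k++) {
                if (!isContinuation(data[i + k] & 0xFF)) return false;
            }
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xE4, (byte) 0x8A, (byte) 0xBC};
        System.out.println(expectedLength(0xE4));   // 3
        System.out.println(looksLikeUtf8(sample));  // true
    }
}
```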

issue & pr


Nice detective work.

I wonder if this code is even needed anymore. It’s more than 10 years old and it seems it’s attempting to work around a problem where an older j3o might have platform-encoded strings. (And this was a really bad way to fix that problem because it’s super ugly and not even guaranteed to fix the problem it tries to fix.)

I suspect we can just get rid of all of that checking and always treat string data as UTF-8 now.


can’t agree more


Does anyone have different ideas?

Shall I just modify the method like this?

    protected String readString(byte[] content) throws IOException {
        int length = readInt(content);
        if (length == BinaryOutputCapsule.NULL_OBJECT) {
            return null;
        }

        byte[] bytes = new byte[length];
        for (int x = 0; x < length; x++) {
            bytes[x] =  content[index++];
        }

        return new String(bytes);
    }

I think you will want to use the String(bytes, Charset) constructor to explicitly force it to use UTF-8. The bytes-only constructor uses the platform’s default Charset.
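
A small sketch of the difference, independent of the jME classes. Utf8RoundTrip and its methods are hypothetical names used only for this example; java.nio.charset.StandardCharsets is the standard JDK class.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode explicitly as UTF-8 instead of relying on the platform default.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode explicitly as UTF-8; new String(bytes) alone would use the
    // platform's default charset and can differ between machines.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u706F\u5149"; // two Chinese characters, 3 bytes each in UTF-8
        byte[] bytes = encode(name);
        System.out.println(bytes.length);               // 6
        System.out.println(name.equals(decode(bytes))); // true
    }
}
```

With both sides pinned to UTF-8, the result no longer depends on the locale of the machine that wrote or reads the file.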


As danielp says, it needs to be UTF8 or we are right back to the original problem that the 10+ year old code was trying to fix.

Write it in UTF8, read it in UTF8.


OK, commit updated.
