[SOLVED] Code error on checking UTF-8 data

yan · November 27, 2020, 8:58am

Hi, dear monkeys. Long time no see.

Problem

I ran into a problem exporting a 3D scene to a j3o file.

The problem is that when I set a Light's name with Chinese characters, the exported j3o can’t save the name as UTF-8.

Here is the log:

WARNING: Your export has been saved with an incorrect encoding for its String fields which means it might not load correctly due to encoding issues. You should probably re-export your work. See ISSUE 276 in the jME issue tracker.
Nov 27, 2020 11:24:12 AM com.jme3.export.binary.BinaryInputCapsule readString

I read the source code of BinaryInputCapsule#readString and found that the code has an error when checking UTF-8 data.

github.com

jMonkeyEngine/jmonkeyengine/blob/2196e4ce3ad0b1a764ca392a41baed1e8df9559c/jme3-core/src/plugins/java/com/jme3/export/binary/BinaryInputCapsule.java#L1039


 * a multibyte codepoint:
 * (b & 0x80) == 0x80  (in other words, first bit must be 1)
 */
private final static int UTF8_START = 0; // next byte should be the start of a new
private final static int UTF8_2BYTE = 2; // next byte should be the second byte of a 2 byte codepoint
private final static int UTF8_3BYTE_1 = 3; // next byte should be the second byte of a 3 byte codepoint
private final static int UTF8_3BYTE_2 = 4; // next byte should be the third byte of a 3 byte codepoint
private final static int UTF8_ILLEGAL = 10; // not an UTF8 string
// String
protected String readString(byte[] content) throws IOException {
    int length = readInt(content);
    if (length == BinaryOutputCapsule.NULL_OBJECT)
        return null;
    /*
     * @see ISSUE 276
     *
     * We'll transfer the bytes into a separate byte array.
     * While we do that we'll take the opportunity to check if the byte data is valid UTF-8.
     *

Take a 3-byte UTF-8 sequence [0xE4, 0x8A, 0xBC]. When b = 0xE4 (1110 0100), it is treated as the start of a 2-byte sequence.

See this part:

                if (b < 0x80) {
                    // good
                }
                else if ((b & 0xC0) == 0xC0) {//   (0xE4 & 0xC0) == 0xC0     =====>  true
                    utf8State = UTF8_2BYTE;
                }
                else if ((b & 0xE0) == 0xE0) {//   (0xE4 & 0xE0) == 0xE0      =====>  true
                    utf8State = UTF8_3BYTE_1;
                }
                else {
                    utf8State = UTF8_ILLEGAL;
                }

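The first test already matches every byte whose top two bits are 11, so the 3-byte branch is never reached: 3-byte UTF-8 data will always be treated as 2-byte UTF-8 data.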
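To see the misclassification concretely, here is a small sketch that mirrors the branch order quoted above. The class and method names are hypothetical; only the two mask tests come from the quoted code, and I assume b has already been widened to an unsigned value in the range 0..255.

```java
// Hypothetical demo class, not part of jME. Only the two mask tests
// are taken from the quoted readString code.
final class Utf8LeadByteBugDemo {

    /**
     * Mirrors the quoted branch order and returns the sequence length
     * the old check would assume for lead byte b (0..255), or -1 for
     * "illegal".
     */
    static int lengthAssumedByOldCheck(int b) {
        if (b < 0x80) {
            return 1;                      // plain ASCII
        } else if ((b & 0xC0) == 0xC0) {
            return 2;                      // also matches 1110xxxx lead bytes
        } else if ((b & 0xE0) == 0xE0) {
            return 3;                      // unreachable: such bytes matched above
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] leadBytes = {0x41, 0xC3, 0xE4};
        for (int b : leadBytes) {
            System.out.printf("0x%02X -> %d%n", b, lengthAssumedByOldCheck(b));
        }
    }
}
```

With this order, 0xE4 comes back as 2 rather than 3, so the third byte of the sequence is later checked as if it were a start byte.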

Bugfix

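UTF-8 encodes data this way (the 5- and 6-byte forms come from the original design; current UTF-8, as defined in RFC 3629, stops at 4 bytes):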

bytes	encoding
1	0xxxxxxx
2	110xxxxx 10xxxxxx
3	1110xxxx 10xxxxxx 10xxxxxx
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Which means:

For 2-byte data, the code should check that the first 3 bits of the lead byte are 110 (110xxxxx) and that the first 2 bits of the following byte are 10 (10xxxxxx).
For 3-byte data, the code should check that the first 4 bits of the lead byte are 1110 (1110xxxx) and that the first 2 bits of each of the two following bytes are 10 (10xxxxxx).
…
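The rules above can be sketched as a small lead-byte classifier plus a structural check. This is only an illustration under my own assumptions: the class and method names are hypothetical, it stops at 4-byte sequences, and it does not reject overlong encodings or surrogate code points, so it is not a full UTF-8 validator.

```java
// Hypothetical helper, not part of jME: classifies a lead byte by its
// high bits and checks that the right number of 10xxxxxx bytes follow.
final class Utf8LeadByte {

    /** Sequence length announced by lead byte b (0..255), or -1 if b cannot start a sequence. */
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // continuation byte or unsupported form
    }

    /** True if b (0..255) has the form 10xxxxxx. */
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }

    /** Structural check only: lead bytes followed by the expected continuation bytes. */
    static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int len = sequenceLength(data[i] & 0xFF);
            if (len < 0 || i + len > data.length) {
                return false;             // bad lead byte or truncated sequence
            }
            for (int k = 1; k < len; k++) {
                if (!isContinuation(data[i + k] & 0xFF)) {
                    return false;
                }
            }
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xE4, (byte) 0x8A, (byte) 0xBC};
        System.out.println(sequenceLength(0xE4));
        System.out.println(looksLikeUtf8(sample));
    }
}
```

Masking one bit more than the prefix length (0xE0 for 110, 0xF0 for 1110) is what keeps a 3-byte lead from also matching the 2-byte test.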

This is the fix:

                if (b < 0x80) {
                    // good
                }
                else if ((b & 0xE0) == 0xC0) {//   (0xE4 & 0xE0) == 0xC0     =====>  false
                    utf8State = UTF8_2BYTE;
                }
                else if ((b & 0xF0) == 0xE0) {//   (0xE4 & 0xF0) == 0xE0      =====>  true
                    utf8State = UTF8_3BYTE_1;
                }
                else {
                    utf8State = UTF8_ILLEGAL;
                }

issue & pr

pspeed · November 27, 2020, 9:09am

Nice detective work.

I wonder if this code is even needed anymore. It’s more than 10 years old and it seems it’s attempting to work around a problem where an older j3o might have platform-encoded strings. (And this was a really bad way to fix that problem because it’s super ugly and not even guaranteed to fix the problem it tries to fix.)

I suspect we can just get rid of all of that checking and always treat string data as UTF-8 now.

yan · November 27, 2020, 9:30am

Can’t agree more.

yan · November 28, 2020, 6:58am

Does anyone have a different idea?

Shall I just modify the method like this?

    protected String readString(byte[] content) throws IOException {
        int length = readInt(content);
        if (length == BinaryOutputCapsule.NULL_OBJECT) {
            return null;
        }

        byte[] bytes = new byte[length];
        for (int x = 0; x < length; x++) {
            bytes[x] =  content[index++];
        }

        return new String(bytes);
    }

danielp · November 28, 2020, 8:17am

I think you will want to use the String(bytes, Charset) constructor to explicitly force it to use UTF-8. The bytes-only constructor uses the platform’s default Charset.

pspeed · November 28, 2020, 10:53am

As danielp says, it needs to be UTF8 or we are right back to the original problem that the 10+ year old code was trying to fix.

Write it in UTF8, read it in UTF8.
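As a rough illustration of "write it in UTF8, read it in UTF8", here is a standalone sketch. It is not the actual jME commit: Utf8StringCodec and its methods are hypothetical names, and it leaves out the length prefix and index bookkeeping that readString does. It only uses the standard String.getBytes(Charset) method and the String(byte[], int, int, Charset) constructor.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone helper, not the jME implementation.
final class Utf8StringCodec {

    /** Encodes with an explicit charset instead of the platform default. */
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes length bytes starting at offset, always as UTF-8. */
    static String decode(byte[] content, int offset, int length) {
        return new String(content, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two CJK characters, written as escapes so the source file's own
        // encoding does not matter.
        String name = "\u706F\u5149";
        byte[] bytes = encode(name);

        // Same charset on both sides: the name survives the round trip.
        System.out.println(name.equals(decode(bytes, 0, bytes.length)));

        // Different charset on the reading side: the name is garbled.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(name.equals(wrong));
    }
}
```

Passing the charset explicitly also removes the need for a copy loop, since the constructor can decode directly from an offset and length in the source array.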

yan · November 30, 2020, 9:12am

OK, commit updated.