JME's source code contains non-ASCII characters!

Hello,



I am using Netbeans 6.0, on WinXP Traditional Chinese version.

I am trying to compile the sources of JME, and I see that you guys didn't realize that your source code contains some characters that are specific to your charset. In other words, you should run your source code through a tool that converts the characters from your native locale to the UTF-8 standard, which can be read on any platform or OS in any country.



Here is some sample of the warnings that I get:



C:\Documents and Settings\Vincent.Cantin\My Documents\NetBeansProjects\Vincent Cantin stuffs\jme\src\jmetest\scene\geometryinstancing\TestGeometryInstancing.java:67: warning: unmappable character for encoding UTF-8

  • @author Patrik Lindegr?n



    C:\Documents and Settings\Vincent.Cantin\My Documents\NetBeansProjects\Vincent Cantin stuffs\jme\src\jmetest\terrain\TestFluidSimHeightmap.java:55: warning: unmappable character for encoding UTF-8
  • @author Frederik B?lthoff





    The tool to transform your source code is called "native2ascii.exe"; it is part of every JDK and is fully documented in the JDK documentation.



    Your special characters will be transformed into something that looks like "\u0045", which is accepted without warnings by the compiler on any OS in any locale.
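As a sketch of what that transformation produces (the class name and the sample string here are made up for illustration), a Java unicode escape compiles to exactly the same character as the literal it replaces:

```java
public class UnicodeEscapeDemo {
    public static void main(String[] args) {
        // javac turns the escape \u00E9 into the character 'é' while reading
        // the source, so a file holding only escapes stays pure ASCII.
        String escaped = "Lindegr\u00E9n";
        String literal = "Lindegrén";
        System.out.println(escaped.equals(literal)); // prints "true"
    }
}
```

This is why running native2ascii over the sources changes nothing for the compiler: only the on-disk representation becomes locale-independent.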

In total there are (only) 27 warnings of this kind.

Maybe you could fix them by hand instead of having Unicode-escaped notation in your comments.



For example, "D

I've seen the same problem.

I'm guessing this is all contributor names?

renanse said:

I'm guessing this is all contributor names?


There are also a few other places. Here is the complete list of places where it appears:

com\jme\math\Matrix3f.java:1069: warning: unmappable character for encoding UTF-8
    * @see "Tomas M?ller, John Hughes "Efficiently Building a Matrix to Rotate

com\jme\scene\geometryinstancing\GeometryBatchInstance.java:47: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

com\jme\scene\geometryinstancing\GeometryBatchInstanceAttributes.java:42: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

com\jme\scene\geometryinstancing\instance\GeometryBatchCreator.java:42: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

com\jme\scene\geometryinstancing\instance\GeometryInstance.java:40: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

com\jme\scene\geometryinstancing\instance\GeometryInstanceAttributes.java:42: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

com\jme\scene\state\lwjgl\LWJGLLightState.java:358: warning: unmappable character for encoding UTF-8
                // with a call to glLightfv(GL_LIGHT_POSITION,?). If you later change
                                                              ^

com\jme\scene\state\lwjgl\LWJGLLightState.java:362: warning: unmappable character for encoding UTF-8
                // light?s position, you must again specify the light position with a
                        ^

com\jme\scene\state\lwjgl\LWJGLLightState.java:363: warning: unmappable character for encoding UTF-8
                // call to glLightfv(GL_LIGHT_POSITION,?).
                                                      ^

com\jme\scene\state\lwjgl\LWJGLLightState.java:378: warning: unmappable character for encoding UTF-8
                // with a call to glLightfv(GL_LIGHT_POSITION,?). If you later change
                                                              ^

com\jme\scene\state\lwjgl\LWJGLLightState.java:382: warning: unmappable character for encoding UTF-8
                // light?s position, you must again specify the light position with a
                        ^

com\jme\scene\state\lwjgl\LWJGLLightState.java:383: warning: unmappable character for encoding UTF-8
                // call to glLightfv(GL_LIGHT_POSITION,?).
                                                      ^

com\jmex\subdivision\SubdivisionButterfly.java:53: warning: unmappable character for encoding UTF-8
* 'Interpolating Subdivision for Meshes with Arbitrary Topology', Denis Zorin, Peter Schr?der, Wim Sweldens. Computer Graphics, Ann. Conf. Series, vol. 30, pp. 189-192, 1996.<br>

com\jmex\terrain\util\FluidSimHeightMap.java:46: warning: unmappable character for encoding UTF-8
* @author Frederik B?lthoff

com\jmex\terrain\util\HillHeightMap.java:45: warning: unmappable character for encoding UTF-8
* @author Frederik B?lthoff

jmetest\input\VTextIcon.java:258: warning: unmappable character for encoding UTF-8
    * employ a horizontal baseline that is rotated by 90? counterclockwise so

jmetest\input\VTextIcon.java:260: warning: unmappable character for encoding UTF-8
    * numbers may be rotated 90? clockwise so that the characters are also

jmetest\scene\geometryinstancing\TestGeometryInstancing.java:67: warning: unmappable character for encoding UTF-8
* @author Patrik Lindegr?n

jmetest\terrain\TestFluidSimHeightmap.java:55: warning: unmappable character for encoding UTF-8
* @author Frederik B?lthoff

jmetest\terrain\TestHillHeightmap.java:54: warning: unmappable character for encoding UTF-8
* @author Frederik B?lthoff

You have a point.



Would it be possible to have the offending files saved using UTF-8 encoding, please?

I agree; pure ASCII or UTF-8 should be the default encoding. I have the same problems here, even though my Eclipse somehow ignores them automatically… I don't know how.



Just to make the point: in recent years character encodings have caused a lot of confusion… there are web sites, for example, that are impossible to read correctly. The cause is probably an inconsistency between the declared charset, the charset used in the HTML, and the actual encoding of the files! Total chaos!

Ender said:

there are web sites, for example, that are impossible to read correctly. The cause is probably an inconsistency between the declared charset, the charset used in the HTML, and the actual encoding of the files! Total chaos!


I can't tell you how many times I've seen an HTTP server send a UTF-8 content header while the HTML declares an ISO- charset, or vice versa. I definitely agree that jME should use only one character encoding. I just lean towards UTF, since I've done a lot of work with internationalization and have always found it to be much easier with UTF.  :D

Yes, I have seen a lot of charset mistakes in HTTP servers recently. I think the main cause is HTML developers, who should take care of these things but don't.


Haibijon said:
I just lean towards UTF since I've done a lot of work with internationalization and have always found it to be much easier with UTF.  :D


Yes. UTF-8 is also fully compatible with ASCII: ASCII characters in UTF-8 encoded files are encoded exactly as in ASCII (one byte per character), to maintain compatibility with older systems. So if you write a UTF-8 encoded file that contains only ASCII characters, it can also be read by applications that still do not support UTF-8.
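A small sketch of that compatibility (the class name is made up; it uses only the standard java.nio.charset API, available in newer JDKs): the byte stream of an ASCII-only string is identical under both encodings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8Demo {
    public static void main(String[] args) {
        String ascii = "plain ASCII text";
        // ASCII characters occupy exactly one byte in UTF-8, with the same
        // values as in US-ASCII, so the two byte streams are identical.
        byte[] asAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8 = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asAscii, asUtf8)); // prints "true"

        // A non-ASCII character, by contrast, needs more than one byte.
        System.out.println("é".getBytes(StandardCharsets.UTF_8).length); // prints "2"
    }
}
```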

In fact, some Unix command line utilities still have a lot of problems with UTF-8.

One of the problems with HTML, for example, is that developers no longer use entities like "&egrave;" for the "è" character.
Ender said:
One of the problems with HTML, for example, is that developers no longer use entities like "&egrave;" for the "è" character.
Haibijon said:
The W3C seems to disagree with you


Oh yes, if we are talking about XHTML/HTML, as at the beginning of that page. And moreover I agree that UTF-8 has to be used, though I am also open to using ASCII if there is any reason that makes it a requirement. Old HTML 3 and 4 will often not work with such characters typed directly. UTF-8 support was introduced in newer browsers about 7-8 years ago.

I know that only a subset of UTF-8 is compatible with ASCII. That is exactly what I said in my post; I was just a bit less technical than you. What I wanted to point out is that if we keep the code itself to ASCII characters (leaving non-ASCII only in comments), our UTF-8 files will also be compatible with ASCII.

And believe me, there are utilities on Unix systems that still do not support UTF-8. Bash, for example, can have a lot of problems reading UTF-8. Luckily it is not a big trouble, because only the output messages come out wrong. But there are more important utilities that cannot work with UTF-8. You are lucky because you use en_US, but for other languages there is a lot of confusion.

For jME, I still agree that UTF-8 would be the perfect choice.

I don't care too much, since my locale is compatible with the offending files, but I agree that we should use an encoding that allows everyone to use jME comfortably.



I couldn't understand some of the things you said. I think charsets and encodings have been mixed in this thread.



Anyway, I think Eclipse (contrary to what was said above) can handle different encodings per file or globally:


  1. Right click on a source file -> Properties -> Resource: you can see and edit the text file encoding.


  2. Right click on a project -> Properties -> Resource: you can see and edit the text file encoding.


  3. Menu Window -> Preferences -> General -> Workspace: you can set the default text file encoding.



    The Java compiler can happily manage UTF-8 files, and many Unix utilities I'd use (like grep) can too. As for the other utilities, well, I guess it's time for them to update (we don't have to wait for all of them!), and anyway UTF-8 is quite compatible with ASCII, so unless you are searching/counting/etc. within non-English strings you'll notice no difference.



    Most IDEs (Eclipse, Netbeans, MS Visual Studio) can handle and search along files encoded in different formats.



    If someone changes all files to UTF-8 and commits them, our only concern will be setting the project to UTF-8 (well, that could be changed in the CVS too, I think), although mine already says "determined from content: UTF-8".



    My drastic point of view is that if anyone is still managing strings in the old char[] fashion, that's their problem alone. Programming evolves, and char[] is no longer a good format for storing text. Guys, use appropriate string libraries or die  }:-@.



    Corrections are welcome.
jjmontes said:

I couldn't understand some of the things you said. I think charsets and encodings have been mixed in this thread.


According to Wikipedia, charsets and encodings are synonymous in practical use... However, I've always seen files as being encoded using a specific charset, i.e. a file which uses the ASCII charset is encoded as ASCII, not sure how other people use them though.
Haibijon said:

jjmontes said:

I couldn't understand some of the things you said. I think charsets and encodings have been mixed in this thread.


According to Wikipedia, charsets and encodings are synonymous in practical use... However, I've always seen files as being encoded using a specific charset, i.e. a file which uses the ASCII charset is encoded as ASCII, not sure how other people use them though.


Umm, however, I think they are not always synonyms.

Unicode (which is a large charset) can be encoded as UTF8, UTF16 or ASCII with escape sequences.

ASCII can be encoded as ASCII 8 bits per character or 7 bits if you only want to use the first 128 characters.

Most codepages (Latin-1, etc.) are usually encoded as a byte stream, 8 bits per character, since they contain 256 characters, but it was my understanding that people working with large charsets (e.g. Chinese) are used to encoding strings in other formats (I don't know which formats, though).

Unicode is handy because it can represent almost all alphabets, and UTF-8 and UTF-16 are handy because they are a good standard for encoding Unicode characters, and there are plenty of libraries for handling Unicode/UTF out there.
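To illustrate the charset-versus-encoding distinction in Java terms (the class name and sample string are made up): the same Unicode string yields different byte streams under different encodings, while the code points stay the same.

```java
import java.nio.charset.StandardCharsets;

public class CharsetVsEncoding {
    public static void main(String[] args) {
        // One character set (Unicode), two encodings of the same string:
        // the code points are identical, the bytes on disk are not.
        String s = "Möller"; // 6 characters
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // prints "7"  ('ö' takes 2 bytes)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // prints "12" (2 bytes per char)
    }
}
```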

That said, it is true that I use ASCII to talk about either the charset and the encoding, and the same for UTF (which applies to the Unicode charset).

The CVS client usually translates the encoding while checking out. So if you check out from CVS, all you need to do is configure your CVS client correctly to get the files in UTF-8 (e.g. in Eclipse).

jjmontes said:
Unicode (which is a large charset) can be encoded as UTF8, UTF16 or ASCII with escape sequences.

That said, it is true that I use ASCII to talk about either the charset and the encoding, and the same for UTF (which applies to the Unicode charset).


In fact I never talked about Unicode. As far as I know, UTF-8 is able to represent just a subset of Unicode.
The point is that UTF-8 can be compatible with ASCII.

irrisor said:

The CVS client usually translates the encoding while checking out. So if you check out from CVS, all you need to do is configure your CVS client correctly to get the files in UTF-8 (e.g. in Eclipse).


Great! Thanks a lot for the info. I tried to set the encoding in the Properties panel of the project I have locally in Eclipse. It was MacRoman, because Macs use it by default. After setting it to ISO-8859-1, I am able to see non-ASCII characters correctly.

I guess that to convert the encoding while checking out, I should modify some setting in the CVS section of the Eclipse preferences panel. Right?
irrisor said:

The CVS client usually translates the encoding while checking out. So if you check out from CVS, all you need to do is configure your CVS client correctly to get the files in UTF-8 (e.g. in Eclipse).


I am using Netbeans 6.0 to check out, and I didn't find such an option in the SVN plugin.
I still feel it would be better to have the files converted directly in the Subversion repository.

In Eclipse there is a page of the Project Properties panel that lets the user modify the default encoding (or charset?) of the project. I have simply been able to change it and now see non-MacRoman characters correctly in the editor window. But I don't know whether it affects the CVS client included with Eclipse and makes it automatically convert encodings while downloading new files.