Crash in rendering ListSort

phr00t · August 19, 2014, 4:04pm

Hey guys,

Got this odd crash in jMonkeyEngine, using the latest git version. Looks like it is happening when it tries to sort the opaque rendering queue:

[java]Thread: LWJGL Renderer Thread

java.lang.ArrayIndexOutOfBoundsException: 5
at com.jme3.util.ListSort.sort(ListSort.java:230)
at com.jme3.renderer.queue.GeometryList.sort(GeometryList.java:145)
at com.jme3.renderer.queue.RenderQueue.renderGeometryList(RenderQueue.java:318)
at com.jme3.renderer.queue.RenderQueue.renderQueue(RenderQueue.java:374)
at com.jme3.renderer.RenderManager.renderViewPortQueues(RenderManager.java:782)
at com.jme3.renderer.RenderManager.flushQueue(RenderManager.java:738)
at com.jme3.renderer.RenderManager.renderViewPort(RenderManager.java:1002)
at com.jme3.renderer.RenderManager.render(RenderManager.java:1049)
at com.jme3.app.SimpleApplication.update(SimpleApplication.java:252)
at com.jme3.system.lwjgl.LwjglAbstractDisplay.runLoop(LwjglAbstractDisplay.java:151)
at com.jme3.system.lwjgl.LwjglDisplay.runLoop(LwjglDisplay.java:185)
at com.jme3.system.lwjgl.LwjglAbstractDisplay.run(LwjglAbstractDisplay.java:230)
at java.lang.Thread.run(Thread.java:745)[/java]

This is the line the crash happens on:

[java]runsIndices[nbRuns] = low;[/java]

nbRuns somehow got set to 5, when runsIndices can only accept 0-4. The allocateStack(len) function sets the length of runsIndices to 5, via apparently this block of code:

[java] int stackLen = (len < 1400 ? 5
: len < 15730 ? 10
: len < 1196194 ? 19 : 40);

    //Same remark as with the temp array
    if (runsIndices == null || stackLen &gt; runsIndices.length) {
        runsIndices = new int[stackLen];
        runsLength = new int[stackLen];           
    }[/java]

Anyway… any idea why this would happen? I’ve only seen it happened once but I wanted to report it right away since I’m not sure how to go about fixing it. Could it be a threading issue inside my code somewhere else…? :?

Empire_Phoenix · August 19, 2014, 4:44pm

Did you do anything multithreaded that is attached to the scenegraph?

phr00t · August 19, 2014, 4:51pm

@Empire Phoenix said: Did you do anything multithreaded that is attached to the scenegraph?

I haven’t gotten any “Scenegraph changed!” exceptions & I don’t think I’m doing anything in my other threads to cause this problem. However, I’ve never seen this specific exception before, and it has only happened once (even after trying to reproduce it), so I’m not sure what in my other threads could cause this problem to happen. Someone familiar with this function might be able to shed some light on what could cause it to happen, perhaps?

pspeed · August 19, 2014, 7:33pm

There can be lots of things you can mistakenly do from other threads that will not trigger the “Scenegraph changed!” warning. That only catches a very narrow set of things… changes basically.

If an object is attached to the main scene graph then you cannot change it, read from it, manipulate, or basically do anything with it from another thread. You can’t even call getters safely on it.

So if you do any of those things then you have an issue you need to fix. It’s hard to say why it may cause an issue in this case but the sorting relies on the sort order not changing during runtime… and not all of those issues will be easy to catch in the algorithm.

nehon · August 20, 2014, 7:23am

That’s weird, any chance you narrow it down to a test case?
I used this code for one year before committing it to core, and I never had any issue.
Since It’s in core there have been two crash reports (including this one) but unfortunately, the first one couldn’t be reproduced.
I guess that we could revert back to the previous merge sort, since it’s barely slower, but it uses twice the memory.

pspeed · August 20, 2014, 7:58am

@nehon said: That's weird, any chance you narrow it down to a test case? I used this code for one year before committing it to core, and I never had any issue. Since It's in core there have been two crash reports (including this one) but unfortunately, the first one couldn't be reproduced. I guess that we could revert back to the previous merge sort, since it's barely slower, but it uses twice the memory.

I think there have been more than two. Some get the error about comparator consistency or somesuch but those have (so far) all turned out to be threading related.

phr00t · August 20, 2014, 1:13pm

@nehon said: That's weird, any chance you narrow it down to a test case? I used this code for one year before committing it to core, and I never had any issue. Since It's in core there have been two crash reports (including this one) but unfortunately, the first one couldn't be reproduced. I guess that we could revert back to the previous merge sort, since it's barely slower, but it uses twice the memory.

I’d love to be able to narrow it down to a test case, but I haven’t seen the problem happen again… I don’t know where to begin to narrow the problem down.

The advice of “If an object is attached to the main scene graph then you cannot change it, read from it, manipulate, or basically do anything with it from another thread. You can’t even call getters safely on it” is extremely limiting… I can understand not changing things, but not being able to read or call getters seems like something is a little broken. I’d be for a more reliable solution at the cost of a little memory (although it shouldn’t build up junk over time that the garbage collector needs to deal with regularly).

zzuegg · August 20, 2014, 1:51pm

Correct me if am wrong, but calling getters on methods that do NOT use TempVars should be save…

Empire_Phoenix · August 20, 2014, 1:54pm

Well if you read about the java memory model (and understand it) you can easily distinguish what is save what not, and what kind of race faults are acceptable vs wich ones are not.

Eg a getLocalTranslation returns a Vector3f
-> or memory wise 12bytes(+overhead)
now when calling this, you could get any of those bytes from old update and from new update mixed. Determining if that is acceptable for your use case or not is then up to you.

Using immutable objects for game logic is one way to reduce this kind of problem from the root, but wont help you with renderer specific problems.

Then anyway usually you should not need to get stuff from the renderer, as all data should be in your game logic layer, and its mostly a datapush to jme to render it.

pspeed · August 20, 2014, 6:30pm

@phr00t said: I'd love to be able to narrow it down to a test case, but I haven't seen the problem happen again... I don't know where to begin to narrow the problem down.
The advice of “If an object is attached to the main scene graph then you cannot change it, read from it, manipulate, or basically do anything with it from another thread. You can’t even call getters safely on it” is extremely limiting… I can understand not changing things, but not being able to read or call getters seems like something is a little broken. I’d be for a more reliable solution at the cost of a little memory (although it shouldn’t build up junk over time that the garbage collector needs to deal with regularly).

As others have said, in Java multithreading it is never safe to call unprotected getters. You may get random data as each thread may have its own cached view. In JME’s case, sometimes data is cached on access to avoid recalculating it 1000x.

It may sound extremely limiting but in fact it is extremely liberating because you don’t have to guess what is ok: nothing is. So if you need to do anything with the attached scene graph then you have to enqueue a Callable. But if your game is properly designed then you should never have to query anything from the main scene graph anyway. It’s just the view, after all.

phr00t · August 20, 2014, 6:52pm

@pspeed said: As others have said, in Java multithreading it is never safe to call unprotected getters. You may get random data as each thread may have its own cached view. In JME's case, sometimes data is cached on access to avoid recalculating it 1000x.
It may sound extremely limiting but in fact it is extremely liberating because you don’t have to guess what is ok: nothing is. So if you need to do anything with the attached scene graph then you have to enqueue a Callable. But if your game is properly designed then you should never have to query anything from the main scene graph anyway. It’s just the view, after all.

Just for clarification, are we saying:

A) Getting random data is expected when calling unprotected getters, so you shouldn’t use them. However, it shouldn’t cause crashes in jme’s core & something should be done about that (use a more reliable method, perhaps?).

B) Getting random data is expected when calling unprotected getters & it can sporadically crash jme’s core, so you shouldn’t use them.

If it’s B, then so be it. However, that doesn’t inspire much confidence in jme’s ability to handle heavily multithreaded programs. In my humble opinion, unprotected getters should be at least identified, and at best converted to protected getters if it can cause core jme crashes.

pspeed · August 20, 2014, 7:08pm

[java]
public class MyMainThreadThing {
public Vector3f someLoc = new Vector3f();

// main thread modifies someLoc frequently

}
[/java]

Accessing someLoc from another thread may return random results… like the x part may have been updated but not the y and z. And that situation could last for many seconds. This is a Java thing, not a JME thing.

So, given that you should never do it anyway… I’m not sure we should slow down the scene graph so support a use case that should never be done.

For example, if you change the location of a spatial we avoid recalculating the world transform until it has been requested. Should we
a) recalculate it every time something changes creating needless garbage and sapping performance (note that the transform is built from multiple things so would be recalculated when changing location, then again when scale is changed, then again when rotation is changed… even if you are changing all three in a row… or when any of these things chance in the parent. Changing a node’s location would have to rebuild all of the transforms for all of the children.).
b) add synchronization around such methods “just in case” someone is doing something they shouldn’t from a poorly designed game. Note: synchronization incurs overhead even if there is no contention.

Neither option is really palatable.

JME makes many tradeoffs in the name of speed and expects users to avoid doing bad things. It trades trust for performance (which is why Vector3f.ZERO.set(1, 2, 3) can really mess you up). JME protects users from themselves only when it is performance-safe to do so.

Bottom line: don’t touch attached Spatials from a separate thread. If your app design is such that you must query the scene graph for something then you must do it from a callable (and you should think long and hard about what you’ve done :)).

Empire_Phoenix · August 20, 2014, 7:26pm

Well you can do it, but you then reall yhave to know what the java memory model specifies, and what exactly is inside each getter.
Eg getLocalTranslation is save in the sense that it would not produce crashes getWorldTranslation is probably not save for above mentiond reasons.

If you spend some time with the java memory model, it is often possible to use borderline suicide methods for increased performance, but this is nothing I would generally recommend.

nehon · August 20, 2014, 8:34pm

You’re mixing things…
Multithreading is dangerous if not handled properly, period. It’s not a JME thing, not even a java thing, it’s multithreading thing.

We don’t even know if the issue in sorting is multithreading related so that’s just throwing around assumptions.

I’ll look into the issue, but in the meantime, l’ll revert back to the old sort.

phr00t · August 21, 2014, 1:56am

@nehon said: You're mixing things... Multithreading is dangerous if not handled properly, period. It's not a JME thing, not even a java thing, it's multithreading thing.
We don’t even know if the issue in sorting is multithreading related so that’s just throwing around assumptions.

I’ll look into the issue, but in the meantime, l’ll revert back to the old sort.

Well said & thank you. If I can somehow reproduce it again, I’ll let you know.

nehon · September 7, 2014, 11:59am

I reverted to merge sort in the latest revision

Momoko_Fan · September 7, 2014, 8:12pm

I did some testing on the new sort. The issue can only occur in a quite narrow set of circumstances:

You have more than 128 opaque geometries on the screen
You are changing a texture on a material on one of those geometries from another thread

The new sorting algorithm now catches an error that previously would be unhandled. The previous sort was actually worse in that aspect by not throwing an exception, since it can cause rendering glitches.

phr00t · September 9, 2014, 1:01pm

Thanks for looking into this. Unfortunately, I just got the same crash again using the new changes. I’m trying to narrow down why it is happening in my application… could any other material changes cause the crash, or just setTexture(…)?

I assume that is an “128 opaque geometries” and “changing textures” to cause a crash, and not an or?

pspeed · September 9, 2014, 1:23pm

@phr00t said: Thanks for looking into this. Unfortunately, I just got the same crash again using the new changes. I'm trying to narrow down why it is happening in my application... could any other material changes cause the crash, or just setTexture(...)?
I assume that is an “128 opaque geometries” and “changing textures” to cause a crash, and not an or?

Yes, should be an AND.

Just about any material parameter change from another thread could cause the issue. Better to just never change material parameters from another thread.

Momoko_Fan · September 9, 2014, 2:34pm

@phr00t said: Thanks for looking into this. Unfortunately, I just got the same crash again using the new changes. I'm trying to narrow down why it is happening in my application... could any other material changes cause the crash, or just setTexture(...)?
I assume that is an “128 opaque geometries” and “changing textures” to cause a crash, and not an or?

That’s the scenario that reliably causes the aforementioned error. I tried other things like changing material parameters that effect defines (which would also change the sort ID), however, I kept getting other errors like ConcurrentModificationException.

But Paul is correct that unsynchronized access from other threads gets you deep into “undefined” territory… It is not a jME3 thing or a Java thing, just a general thing in computer science.