Driver crash?

pokemoen · January 5, 2009, 1:21pm

Hello,

I have a scene with a piece of terrain and a geometryinstancebatch. I was running it on Windows XP machines with Quadro FX 1600M GPUs, and the driver seems to freeze up if I add instances too quickly. The canvas goes black, the mouse is barely responsive, and after a couple of minutes the whole machine gives up: sometimes it freezes, sometimes it's a BSOD…

I now suspect the cause is one of the following:

Because I'm running JME in an AWT canvas, instances are being added from the event queue thread (because that's where repaint() is called), or
an OutOfMemoryError gets swallowed (by the driver?) before Eclipse gets a chance to show it in the console, and it takes the whole system down.

On OSX all is fine.

On Vista it looks ok also.

On other Windows XP machines it sometimes goes out of memory but not these huge crashes…

Any ideas for working around this? Does anyone know why I even get OOM errors? We're talking about somewhere between 3,500 and 15,000 instances.

Thanks in advance,

Alex

PS. I got the same problem when I was just adding boxes to the scene, so making it a geometry batch didn't solve anything…

pokemoen · January 6, 2009, 8:30am

Right, when I run it in my profiler it doesn't take down the system so I got a stacktrace:

[ERROR] (GameTask.java:152) Exception
Jan 6, 2009 9:10:08 AM class com.jme.util.GameTask invoke()
SEVERE: Exception
org.lwjgl.opengl.OpenGLException: Out of memory (1285)
    at org.lwjgl.opengl.Util.checkGLError(Util.java:54)
    at com.jmex.awt.swingui.LWJGLImageGraphics.update(LWJGLImageGraphics.java:133)
    at com.jmex.awt.swingui.SPImageGraphics.update(SPImageGraphics.java:46)
    at com.jmex.awt.swingui.ImageGraphics.update(ImageGraphics.java:97)
    at com.jmex.awt.swingui.SPImageGraphics.update(SPImageGraphics.java:53)
    at nl.tygron.sge.client.jme.map.interactor.SelectionArea$1.call(SelectionArea.java:112)
    at com.jme.util.GameTask.invoke(GameTask.java:140)
    at com.jme.util.GameTaskQueue.execute(GameTaskQueue.java:111)
    at com.jmex.awt.lwjgl.LWJGLCanvas.paintGL(LWJGLCanvas.java:136)
    at nl.tygron.constructit.client.gui.CILWJGLCanvas.paintGL(CILWJGLCanvas.java:64)
    at org.lwjgl.opengl.AWTGLCanvas.paint(AWTGLCanvas.java:290)
    at org.lwjgl.opengl.AWTGLCanvas.update(AWTGLCanvas.java:321)
    at sun.awt.RepaintArea.updateComponent(Unknown Source)
    at sun.awt.RepaintArea.paint(Unknown Source)
    at sun.awt.windows.WComponentPeer.handleEvent(Unknown Source)
    at java.awt.Component.dispatchEventImpl(Unknown Source)
    at java.awt.Component.dispatchEvent(Unknown Source)
    at java.awt.EventQueue.dispatchEvent(Unknown Source)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
    at java.awt.EventDispatchThread.run(Unknown Source)

I'm not sure whether this is the cause of the freeze or just a symptom of the context crashing. If the GPU runs out of memory, why?
I also found out that it doesn't only happen when I add boxes quickly; sometimes it happens when I just change my camera angle, or even spontaneously.

The card is an NVIDIA Quadro FX 1600M (in a Dell Precision M6300 laptop) with 512 MB of memory. (I know, crappy exotic hardware.)

Any ideas how to tame this beast? A workaround would be acceptable, so we can at least use the app. this week with the client's laptops..

I'll try a driver update, since the thing doesn't go OOM on any other setup..

Cheers,
Alex

mulova · January 6, 2009, 9:20am

Have you tried VM option "-Xmx"?

The OOM is thrown by VM, not by gfx card.

pokemoen · January 6, 2009, 9:23am

Yeah, I'm using -Xmx1024M -Xms512M.

Some more runs in the profiler:

Run 1:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError
    at sun.misc.Unsafe.$$YJP$$allocateMemory(Native Method)
    at sun.misc.Unsafe.allocateMemory(Unknown Source)
    at java.nio.DirectByteBuffer.<init>(Unknown Source)
    at java.nio.ByteBuffer.allocateDirect(Unknown Source)
    at com.jme.util.geom.BufferUtils.createByteBuffer(BufferUtils.java:850)
    at com.jme.util.TextureManager.loadImage(TextureManager.java:674)
    at com.jme.util.TextureManager.loadTexture(TextureManager.java:432)
    at com.jme.util.TextureManager.loadTexture(TextureManager.java:423)
    at nl.tygron.sge.client.jme.map.SimMap.updateMultiTexture(SimMap.java:622)
    at nl.tygron.sge.client.jme.map.SimMap.updateTextures(SimMap.java:731)
    at nl.tygron.sge.client.jme.map.SimMap.updateTextures(SimMap.java:652)
    at nl.tygron.sge.client.jme.map.SimMap$2.call(SimMap.java:698)
    at com.jme.util.GameTask.invoke(GameTask.java:140)
    at com.jme.util.GameTaskQueue.execute(GameTaskQueue.java:111)
    at com.jmex.awt.lwjgl.LWJGLCanvas.paintGL(LWJGLCanvas.java:136)
    at org.lwjgl.opengl.AWTGLCanvas.paint(AWTGLCanvas.java:290)
    at org.lwjgl.opengl.AWTGLCanvas.update(AWTGLCanvas.java:321)
    ...

Run 2:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError
    at sun.misc.Unsafe.$$YJP$$allocateMemory(Native Method)
    at sun.misc.Unsafe.allocateMemory(Unknown Source)
    at java.nio.DirectByteBuffer.<init>(Unknown Source)
    at java.nio.ByteBuffer.allocateDirect(Unknown Source)
    at com.jme.util.geom.BufferUtils.createFloatBuffer(BufferUtils.java:731)
    at com.jme.util.geom.BufferUtils.createVector2Buffer(BufferUtils.java:445)
    at nl.tygron.constructit.client.jme.MaquetteBlockModelBuilder.update(MaquetteBlockModelBuilder.java:254)
    at nl.tygron.sge.client.jme.map.SimMap$2.call(SimMap.java:711)
    at com.jme.util.GameTask.invoke(GameTask.java:140)
    at com.jme.util.GameTaskQueue.execute(GameTaskQueue.java:111)
    at com.jmex.awt.lwjgl.LWJGLCanvas.paintGL(LWJGLCanvas.java:136)
    at org.lwjgl.opengl.AWTGLCanvas.paint(AWTGLCanvas.java:290)
    at org.lwjgl.opengl.AWTGLCanvas.update(AWTGLCanvas.java:321)
    ...

The profiler shows a fairly constant memory use of 200-230 MB heap (512 MB allocated, 1024 MB max) and 26 MB non-heap.
Any ideas on how to debug the native OpenGL stuff going OOM? Is something leaking?

Thanks again,
Cheers,
Alex

pokemoen · January 6, 2009, 10:04am

From "How to avoid OutOfMemoryError when using Bytebuffers and NIO?" on Stack Overflow:

This can depend on the particular JDK vendor and version.

There is a bug in GC in some Sun JVMs. Shortages of direct memory will not trigger a GC in the main heap, but the direct memory is pinned down by garbage direct ByteBuffers in the main heap. If the main heap is mostly empty they may not be collected for a long time.

This can burn you even if you aren't using direct buffers on your own, because the JVM may be creating direct buffers on your behalf. For instance, writing a non-direct ByteBuffer to a SocketChannel creates a direct buffer under the covers to use for the actual I/O operation.

The workaround is to use a small number of direct buffers yourself, and keep them around for reuse.

Could this be my problem? :/
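
A minimal sketch of the "small number of direct buffers, kept around for reuse" workaround quoted above. DirectBufferPool and its methods are hypothetical names made up for illustration; they are not part of jME, LWJGL or the JDK. The idea is that native memory is allocated once up front, so freeing it never depends on when the GC gets around to collecting discarded buffer objects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayDeque;

// Hypothetical helper (not a jME/LWJGL class): allocates a fixed set of
// direct buffers once and hands them out for reuse.
final class DirectBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<ByteBuffer>();
    private final int bufferSize;

    DirectBufferPool(int bufferCount, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < bufferCount; i++) {
            free.push(ByteBuffer.allocateDirect(bufferSize).order(ByteOrder.nativeOrder()));
        }
    }

    // Returns a cleared buffer, or null when every buffer is currently in use.
    synchronized ByteBuffer acquire() {
        ByteBuffer b = free.poll();
        if (b != null) {
            b.clear(); // reset position and limit; old contents are simply overwritten
        }
        return b;
    }

    // Gives a buffer back to the pool so the next acquire() can reuse it.
    synchronized void release(ByteBuffer b) {
        if (b == null || !b.isDirect() || b.capacity() != bufferSize) {
            throw new IllegalArgumentException("buffer does not belong to this pool");
        }
        free.push(b);
    }

    // Number of buffers currently available.
    synchronized int available() {
        return free.size();
    }
}
```

Callers would acquire() a buffer, fill it, hand it to whatever consumes it, and release() it afterwards instead of dropping the reference. Returning null on exhaustion is just one possible policy; blocking or falling back to a fresh allocation would work too.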

pokemoen · January 6, 2009, 3:01pm

Hello again,

It looks like something is indeed eating up direct buffers. Things are crashing all over the place on these machines (I have 10 of them set up here now): some crash in the OpenGL native code, some in the texture creation code, and some in the add-block code I suspected before.

All suggestions are welcome… any ideas?

Is there a way to get a dump of all buffers without referencing them? (So I can see if the GC's actually cleaning up old ones)

Thanks,

Alex
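
One possible way to watch buffers "without referencing them" is to register each one through a weak reference at the point where it is created. This is only a sketch: BufferTracker is a hypothetical debugging aid, not something jME or the JDK provides, and it assumes you can route buffer creation through track(). Because the registry holds weak references only, it never keeps a buffer alive itself.

```java
import java.lang.ref.WeakReference;
import java.nio.Buffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical debugging aid: remembers buffers through weak references only.
final class BufferTracker {
    private final List<WeakReference<Buffer>> refs = new ArrayList<WeakReference<Buffer>>();

    // Call this wherever a buffer is created; returns the same buffer.
    synchronized <T extends Buffer> T track(T buffer) {
        refs.add(new WeakReference<Buffer>(buffer));
        return buffer;
    }

    // Number of tracked buffers whose weak reference the GC has not cleared yet.
    synchronized int liveCount() {
        prune();
        return refs.size();
    }

    // Summed capacity of the surviving buffers, in elements (not bytes).
    synchronized long liveCapacity() {
        long total = 0;
        for (WeakReference<Buffer> ref : refs) {
            Buffer b = ref.get();
            if (b != null) {
                total += b.capacity();
            }
        }
        return total;
    }

    // Drops entries whose buffer has already been collected.
    private void prune() {
        for (Iterator<WeakReference<Buffer>> it = refs.iterator(); it.hasNext();) {
            if (it.next().get() == null) {
                it.remove();
            }
        }
    }
}
```

If liveCount() keeps climbing while the scene stays the same size, orphaned buffers are not being collected (or something still references them). The count only drops after a GC cycle has actually run, so a high number is a hint rather than proof of a leak.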

Core_Dump · January 6, 2009, 5:29pm

It won't solve anything, but increasing the direct memory size may help:

-XX:MaxDirectMemorySize=<value>

list of vm options:

http://www.jmonkeyengine.com/wiki/doku.php?id=links

Momoko_Fan · January 6, 2009, 10:10pm

Seems like you've got a memory leak. It doesn't matter where the crash occurs, because if there's not enough memory anything that tries to allocate some will fail. What you have to do is to track where all that memory is being allocated (forget stack traces, use your profiler memory usage tracker).

pokemoen · January 6, 2009, 10:55pm

Thanks, Core_Dump, for the possible temporary workaround. I'll try it out and see if it helps.

Yeah Momoko, you're right, though the leak isn't strictly my fault. I am creating a lot of new buffers every time I update my geometry batch… The GC just doesn't appear to destroy the direct buffers that are orphaned at that time. The only thing I can think of is reusing one buffer that is preset to a large capacity, but that feels like a hack as well… and it just eats up the whole direct memory up front. Furthermore, there is other code creating its own buffers the same way (texture initialisation, etc.), just not as often, which might be a good one to look into. (The code assumes that a discarded buffer will be cleaned up once it is replaced by a newly created one.)

Know of other ways to update a geometrybatchcreator's mesh when the instances have changed, that doesn't involve creating new buffers for it?
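
A middle ground between "new buffer on every update" and "one huge buffer up front" is a buffer that is refilled in place and only reallocated when the data outgrows it. The sketch below assumes the batch can be pointed at the returned FloatBuffer again after each refill; ReusableFloatBuffer is a hypothetical helper, and how the buffer gets handed to the mesh is left out on purpose.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper: one long-lived direct FloatBuffer that is refilled on
// every update and reallocated only when its capacity is too small.
final class ReusableFloatBuffer {
    private FloatBuffer buffer;
    private int allocations;

    ReusableFloatBuffer(int initialFloats) {
        buffer = allocate(initialFloats);
    }

    private FloatBuffer allocate(int floats) {
        allocations++;
        return ByteBuffer.allocateDirect(floats * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Copies data into the buffer and returns it flipped, ready to be read.
    FloatBuffer fill(float[] data) {
        if (data.length > buffer.capacity()) {
            // Grow geometrically so a slowly growing batch does not reallocate on every update.
            buffer = allocate(Math.max(data.length, buffer.capacity() * 2));
        }
        buffer.clear();
        buffer.put(data);
        buffer.flip();
        return buffer;
    }

    // How many direct allocations have happened so far (useful when debugging).
    int allocationCount() {
        return allocations;
    }
}
```

With 3,500 to 15,000 instances this would allocate a handful of times while the batch grows and then stop, instead of allocating once per update. Note that the limit of the returned buffer, not its capacity, tells the consumer how many floats are valid.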

Know of a profiler that shows direct memory? YourKit Profiler appears to only show me 'heap' and 'non heap'… but it might be in there somewhere.
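
For what it's worth, on Java 7 and later the JVM itself reports direct buffer usage through java.lang.management.BufferPoolMXBean, so no profiler is strictly needed. That API does not exist on Java 6, so this sketch assumes a newer JVM than the one discussed in this thread. DirectMemoryReport is just an illustrative class name.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Illustrative class (Java 7+): reads the JVM's own direct buffer statistics.
final class DirectMemoryReport {

    // Bytes currently used by the "direct" buffer pool, or -1 if the JVM reports no such pool.
    static long directBytesUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB of direct memory
        long after = directBytesUsed();
        System.out.println("direct bytes before=" + before
                + " after=" + after + " (allocated capacity=" + buf.capacity() + ")");
    }
}
```

Logging directBytesUsed() once per frame or per batch update should show whether direct memory climbs steadily towards the limit set by -XX:MaxDirectMemorySize while the heap graph stays flat.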

pokemoen · January 8, 2009, 9:51am

Just wanted to correct my earlier comment: It does crash on Vista. But only on those Dell Precision M6300s…

Those Quadro drivers sure are picky by the way, they don't allow exceptions of any kind to occur!

basixs · January 8, 2009, 5:56pm

Historically the Nvidia Quadro chipset/drivers were developed as a 'Professional CAD' system (and still must adhere to some pretty strict standards), so it almost stands to reason that they would be a little stricter and inform the developer of any potential problems, rather than just trying to take care of it as a 'game based card' might…

pokemoen · January 10, 2009, 7:53pm

Well, we had some playtests this week anyway on a MacBook Pro (8800M) and a bunch of Toshibas (ATI cards). These held up pretty well, though the Toshibas crashed a couple of times in native code and even the Mac crashed once.

At least I'm more aware of where the problem is coming from, so I'll rewrite the geometry instancing code so this just isn't an issue anymore…

Thanks for the tips!