[BUG] Multiple instances on linux

Netzapper · July 14, 2011, 8:39pm

We’re writing a networked multiplayer game, so I’m trying to launch two instances of my application in separate JVMs in order to test stuff.

And every time I launch the second instance, the first instance crashes. This happens immediately after the second instance’s window is put on the screen.

I’ve tracked this down, and it’s blowing up deep inside LWJGL in org.lwjgl.opengl.LinuxDisplay.nLockAWT. When it does JAWT_GetAWT, the program segfaults. I had thought that it was related to keyboard input, as the initial instance of this bug showed a stack trace that left Java and entered native code while it was polling the keyboard. But, I’ve subsequently removed all input, and now it blows up in glEnable–and core dumps gdb, no less, so I can’t even be sure exactly what’s going on for this crash.

Also, to be clear, the second instance continues running just fine (no exceptions, no glitches, nothing). And, launching a third instance will kill the second instance. So, basically, launch n copies, and only the n’th one will continue running. Launch two of them simultaneously, and you get a race, with whoever comes in first losing.

I’ve tried multiple instances of every LWJGL demo that I can find. They all work.

However, any two jme3 applications act the same way as mine, with the second instance crashing the first. I’ve tested this with about half a dozen of the jme demos available in the DemoChooser. (They blow up in different code paths with different combinations of apps, but nearly always the top of the Java stack will be nLockAWT.) No combination of a single jme3 app and another opengl app (lwjgl or native) will crash either of them.

I’ve tested this on three different versions of Ubuntu: 9.10, 10.04, and 11.04. I’ve tried it with every version of the nVidia drivers easily available to me (so far, 3 versions). I’ve tried it with desktop effects on, with them off. I’ve tried it with single monitors, multiple monitors. I’ve tried every resolution and color depth available. I’ve turned AA on and off. I’ve tried three different machines, one of them freshly installed for these purposes, with three different video card models.

I’ve tried both the most recent stable engine-only release of jme3 and as a SVN checkout from today.

Please, help me.

Netzapper · July 14, 2011, 8:46pm

Oh, I’ve also tried the version of lwjgl that ships with jme, the latest stable binaries from their site, and building from source (with and without debug) from svn.

Momoko_Fan · July 15, 2011, 12:41am

It seems like a bug on their side then … Perhaps you can raise this issue on their forum? Possibly with a full stack trace

Netzapper · July 15, 2011, 12:57am

How does that make any sense, Momoko_Fan?

Every scenario that I’ve tested with just LWJGL has worked flawlessly. Every combination of two jme3 apps fails. That points toward jme3 having a defect, not LWJGL.

I assume you attempted to reproduce the issue and found no defect, and that’s why you’re punting it off to the other project (whose software appears to work just fine in this situation).

Momoko_Fan · July 15, 2011, 1:23am

Did you try LWJGL running in canvas?

Netzapper · July 15, 2011, 2:05am

No, I didn’t. I’ll see if there’s a demo available tomorrow at work.

But, I traced the initialization of jme while I was trying to find the fix for this, and there’s a switch I remember seeing that had Canvas as an option and Display as another option. If I recall correctly, the execution path skipped Canvas and went through to Display. Which makes sense, since I’m asking for a context with “context = JmeSystem.newContext(videoSettings, JmeContext.Type.Display);”.

I want to be perfectly clear here: I am not trying to open two windows from one JVM. I am launching two totally separate JVMs. In one of my tests, I even went so far as to make sure that their most recent common ancestor process was init.

So what I’m looking for is pollution between the two instances of jme3. Like, a statically named temp file, or some hashed value that always comes out the same, or perhaps the system looks up windows by name? An environment variable? A shared library loaded with some sort of shared memory turned on? Is it maybe that thing where jme3 blindly copies native libs into the working directory? (It took me a long time to figure out that it does that every time, even if the file exists already. I kept recompiling lwjgl with debug enabled, and I kept getting “no symbol table”.)

Even if you don’t have the time to debug it for me, some pointer toward where I might find structures like that would be nice. And, perhaps, somebody running a different distro, or with an ATI card, could give it a shot for me and let me rule out system stuff.

Any way it works out, I need to build a fix or a workaround for this, because I only have one workstation at work, and trying to coordinate with the other guy (“No, stash your code and pull my one-line debug println patch so I can test this”) is getting real old. Also, it’s totally spamming up my commit log.

Momoko_Fan · July 15, 2011, 2:51am

There were several issues before with running LWJGL canvas on Linux. One of them was fixed in the LWJGL nightly builds which are actually now used in jME3.

There were a lot of other issues, especially canvas related, that we were not able to fix or even understand yet.

jME3 does not share anything between instances. The native extraction mechanism will attempt to overwrite an existing native library only if it is not currently in use, so you should still be able to run multiple instances simultaneously.

Netzapper · July 15, 2011, 2:59am

Okay. Well, like I said, I’m not using canvas. Are there problems with Display, too?

But, have other people reported the same issue I have? Or is this totally unheard of? Am I totally on my own here because everybody else can run multiple instances? Or is this just not something anybody has tested?

Momoko_Fan · July 15, 2011, 4:39am

Well, you said it is using nLockAWT. LWJGL on Linux doesn’t use AWT, so to me it didn’t make sense that it crashed there.

Netzapper · July 15, 2011, 11:28am

No, it isn’t using actual AWT. But it does appear to use JAWT. Perhaps for the input system? Or window management?

I’m reporting my findings here, and they are unintuitive. But, I do this for a living and I spent all day tracing. This bug report has at least 8 hours of work behind it.

Netzapper · July 15, 2011, 1:11pm

Okay, so I’ve found the fault. I made a complete separate copy of my application’s directory, and ran the two instances from two separate directories. And, lo and behold, shit worked perfectly.

So, I went back to tracing. It turns out that you are blindly copying the native library into the current working directory every time, on linux at least.

In Natives.extractNativeLib(String, String, boolean, boolean):

[java]try {
    OutputStream out = new FileOutputStream(targetFile);
    int len;
    while ((len = in.read(buf)) > 0) {
        out.write(buf, 0, len);
    }
    in.close();
    out.close();
} catch (FileNotFoundException ex) {
    if (ex.getMessage().contains("used by another process")) {
        return;
    }
    throw ex;
} finally {
    if (load) {
        System.load(targetFile.getAbsolutePath());
    }
}[/java]

Clearly that was written by a Windows hacker. The FileNotFoundException is never going to be thrown on linux because another process is using the file. Simply having a file open does not lock anything on unix systems, and even using the POSIX file locking system calls isn’t guaranteed to actually lock the file from a process not designed to respect them. Furthermore, you’re not dealing with an even kind-of normal file access–it’s almost assuredly mmap()ed by dlopen.

Now, it is true that in normal cases of opening a file, you’d have a copy-on-write, separate-inodes kind of situation. But, like I said, dlopen does some atypical shit. And depending on the flags (which are buried deep off in the Java JNI implementation), it’s possible to set it up to specifically, intentionally act like this–for instance, if you mmap() the file with MAP_SHARED.

So, I’m going to write a patch to fix this. It appears that you aren’t doing any versioning checks as it is, so it shouldn’t be necessary to take a hash of the file and compare that with the desired version. So, I’m just going to add a check that the file exists. Cool?

I’ll submit the patch in a couple hours.

Momoko_Fan · July 15, 2011, 1:45pm

Netzapper said:
Simply having a file open does not lock anything on unix systems, and even using the POSIX file locking system calls isn’t guaranteed to actually lock the file from a process not designed to respect them. Furthermore, you’re not dealing with an even kind-of normal file access–it’s almost assuredly mmap()ed by dlopen.

Linux sucks... Being able to override a library that is currently used by the OS is really dumb to me.

Netzapper said:
So, I’m just going to add a check that the file exists. Cool?

We had it work that way before but people were having crashes when they updated to newer versions of jME3 which used new LWJGL but the natives in their folder were still old.

Perhaps this might help?
http://stackoverflow.com/questions/128038/how-can-i-lock-a-file-using-java-if-possible

Netzapper · July 15, 2011, 3:15pm

Momoko_Fan said:
Linux sucks... Being able to override a library that is currently used by the OS is really dumb to me.

Not just linux, but any unix is going to act this way. It's part of the basic POSIX standard. I expect OSX acts the same way.

But, it should be noted, in the vast majority of cases, this isn't an issue. POSIX defines a copy-on-write semantic for files opened by two separate processes. Usually, a process has a coherent view of any file descriptor that it's opened, regardless of what other processes are doing to that file. However, I suspect that dlopen() is doing some sort of memory mapping with MAP_SHARED or something like that.

(Perhaps I'll pull down the glibc sources today and determine, for sure, what dlopen is doing.)

Momoko_Fan said:
We had it work that way before but people were having crashes when they updated to newer versions of jME3 which used new LWJGL but the natives in their folder were still old.

Perhaps this might help?
http://stackoverflow.com/questions/128038/how-can-i-lock-a-file-using-java-if-possible

Eh, that's not going to work.

I just wrote a little test case that uses two processes and a trivial shared object I wrote. One of the processes creates a RandomAccessFile in read/write mode on the shared object, does getChannel().lock() on it, then does System.load(). The second process simply creates a FileOutputStream on the same file (truncating it), and writes in "blahblahblah". The second process succeeds in writing the file, even while the first process is still running.
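The two-process test described above can be sketched roughly as follows. This is a reconstruction, not the poster's actual code: the class name `AdvisoryLockDemo`, the `lock`/`write` command-line modes, and the 60-second hold are assumptions made for illustration, and the `System.load()` step is left out so the sketch runs against any scratch file.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; names and modes are illustrative only.
class AdvisoryLockDemo {

    // Process 1: take an exclusive lock on the whole file and hold it for a while.
    static void holdLock(Path file, long millis) throws IOException, InterruptedException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileLock lock = raf.getChannel().lock()) {
            System.out.println("lock held, valid = " + lock.isValid());
            Thread.sleep(millis);
        }
    }

    // Process 2: open the same file with a plain FileOutputStream (which truncates it)
    // and write junk, without ever asking for a lock.
    static void overwrite(Path file) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write("blahblahblah".getBytes(StandardCharsets.US_ASCII));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: AdvisoryLockDemo lock|write <file>");
            return;
        }
        Path file = Paths.get(args[1]);
        if (args[0].equals("lock")) {
            holdLock(file, 60_000L); // assumed hold time, long enough to run the writer
        } else {
            overwrite(file);
            System.out.println("overwrote " + file);
        }
    }
}
```

Running the `lock` mode in one terminal and the `write` mode on the same scratch file in another should show whether the writer is blocked. With advisory locking, as reported above for linux, the write would be expected to go through while the lock is still held.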

This is apparently because the basic file-locking systems on linux are advisory. They're designed so that after one process calls flock(), then all subsequent calls to flock() by other processes fail. But, there's no kernel-level enforcement of the lock.

Mandatory locking is possible on linux, but requires that you remount the filesystem with mandatory locking enabled. And that you flag the files you want mandatory locking to affect with a special set of (otherwise-nonsensical) permission bits. This obviously doesn't work for us, since it's unreasonable to expect that anybody running a jme3 game on linux will remount their filesystem. It also doesn't work because Java has no way of setting those particular permissions flags (that I can find), as they're highly system-dependent.

Even trying to cooperatively lock between the two processes doesn't appear to work. I'm quite surprised by that, actually. But, my surprise doesn't change what appear to be the facts.

So, the question becomes: what is the Right Thing to do with regards to the native libraries? Is the idea that the newest version of the libraries (from jME3-lwjgl-natives.jar) always be used? Or, should there be a flag to disable writing so that people can supply their own version without rebuilding the natives.jar?

I mean, the most heavyweight approach would be to take a hash of both files, and if they differ, overwrite the existing one.
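A minimal sketch of that hashing idea, assuming the bundled library has already been read into a byte array (for example from `getResourceAsStream()`). The class `HashedExtractor` and the method `extractIfChanged` are hypothetical names, and SHA-256 is an arbitrary choice of digest; none of this is taken from the jme3 sources.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; not part of jme3.
class HashedExtractor {

    // Digest of a byte array. SHA-256 is required to be present on every Java platform.
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes the bundled library to target only if target is missing or its
    // contents differ. Returns true if the file was (re)written.
    static boolean extractIfChanged(byte[] bundled, Path target) throws IOException {
        if (Files.exists(target)) {
            byte[] existing = Files.readAllBytes(target);
            if (MessageDigest.isEqual(sha256(bundled), sha256(existing))) {
                return false; // same contents, leave the file alone
            }
        }
        Files.write(target, bundled);
        return true;
    }
}
```

With this shape, a second instance started from the same directory finds identical contents and never touches the file. Note that a genuinely different file is still overwritten in place, so an upgrade performed while an old instance is running would hit the same crash.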

The lightest weight approach would simply be to dump a pid lock file into the working directory. So, if my pid is 7, I'd drop 7.lock into the directory; on exit, I'd delete 7.lock. At initialization, if there are no existing .lock files, then I write in the jarred libs. If there are, I don't.

This approach comes with the caveat that if the program *crashes*, it's probably not going to properly clean up its pid.lock file. Which, naturally, means that the next time the program runs, even if there's been an upgrade, no rewrite is going to occur. You're going to have to manually clean up the lock files to get it working normally again.

Although, having done a bit of googling, the second approach doesn't work. Apparently Java gives you no way to get the PID--which makes sense, since I don't think a numeric PID exists on Windows.
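For reference, Java 9 and later (which postdate this thread) do expose the process ID through `ProcessHandle.current().pid()`, so the lock-file idea can be sketched on a newer JVM. The class `PidLock` and both method names are hypothetical, and the stale-lock caveat after a crash still applies, since `deleteOnExit()` only runs on a normal JVM shutdown.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; requires Java 9+ for ProcessHandle.
class PidLock {

    // True if any "<pid>.lock" file is present in the directory.
    static boolean otherInstancesPresent(Path dir) throws IOException {
        try (DirectoryStream<Path> locks = Files.newDirectoryStream(dir, "*.lock")) {
            return locks.iterator().hasNext();
        }
    }

    // Drops "<pid>.lock" into the directory and asks the JVM to delete it on normal exit.
    static Path writeLock(Path dir) throws IOException {
        Path lock = dir.resolve(ProcessHandle.current().pid() + ".lock");
        Files.write(lock, new byte[0]);
        lock.toFile().deleteOnExit();
        return lock;
    }
}
```

At startup an instance would call `otherInstancesPresent` first, extract the natives only if it returns false, and then call `writeLock`. There is still a window between the check and the write in which two instances launched simultaneously could both decide to extract.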

So, hashing?

pspeed · July 15, 2011, 5:34pm

You could also check the last modified dates since that could be set when the file is extracted. It’s not as reliable as a hash but reliable enough if one errs on the side of “always extract the file” since the use-cases where that causes problems are fairly isolated.

The idea is to make sure that the users (who are primarily the game players, not us developers) are always running with the native libraries that are compatible with the jar set that they are running. This is important because debugging some problem when they are not, and/or determining that they are not in that case, is nearly impossible.

Netzapper · July 15, 2011, 6:32pm

Does that really work, though?

Obviously, there’s File.lastModified() for the existing (or non-existent) file.

But, we’re grabbing the native binary by doing getResourceAsStream(). There’s no concept that it’s a file, let alone that it has a modification time. We could manually open the jar file and access the ZipEntry, which has a getTime() method. But, then we have to keep track of the actual jar file that holds the native libraries.

Do y’all want to go that route? We lose a bit of the elegance, IMO.

pspeed · July 15, 2011, 6:35pm

If you use getResource() then you can get additional info from the URL… even when the files are in jars. Though I don’t know if last modified is one of those things.

I mean, though, if this is a critical issue then the build time could be saved as an additional class resource (I do this in all of my builds). It doesn’t have to be the actual time of the native… just something that can be used to mark the extracted native to see if it has changed.
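One way the build-stamp idea could look, as a sketch only: the stamp string is assumed to have been read from a class resource written at build time, and a small marker file next to the extracted native records which stamp it came from. `StampedExtractor`, `needsExtraction`, and `recordStamp` are hypothetical names, not anything in jme3.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the stamp format is up to the build.
class StampedExtractor {

    // True if the native is missing, has no marker, or was extracted from a different build.
    static boolean needsExtraction(String bundledStamp, Path nativeFile, Path stampFile)
            throws IOException {
        if (!Files.exists(nativeFile) || !Files.exists(stampFile)) {
            return true; // err on the side of "always extract the file"
        }
        String recorded = new String(Files.readAllBytes(stampFile), StandardCharsets.UTF_8).trim();
        return !recorded.equals(bundledStamp);
    }

    // Call after a successful extraction to mark which build the native came from.
    static void recordStamp(String bundledStamp, Path stampFile) throws IOException {
        Files.write(stampFile, bundledStamp.getBytes(StandardCharsets.UTF_8));
    }
}
```

The marker file doubles as debugging information, since it shows at a glance which build the natives in a user's folder were extracted from.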

…and this information would be useful for debugging, too. Which is why I do it in my programs.

Momoko_Fan · July 17, 2011, 4:36am

Should be fixed in SVN

Can you please check?

Netzapper · July 18, 2011, 1:08pm

Alright, checked it. Seems to work okay.

But, may I ask why you skip doing the System.load() on the already-extracted code path?

Momoko_Fan · July 18, 2011, 4:27pm

Okay, now it is fixed.