Multithreading Minie

My August project is to add multithreaded physics spaces to Minie. Currently, I’m stuck. This post provides background. My next post will be a status report and a plea for help/advice.

As the complexity of a Minie physics simulation increases, it tends to become CPU-bound. Modern computers provide multiple CPU threads, allowing software threads (lightweight processes) to execute multiple tasks in parallel. The Bullet software that underlies Minie already includes code to exploit task-level parallelism on multiple CPU threads. By exposing this feature in Minie, I hoped to enable real-time simulation of more complex worlds, with greater accuracy.


Multithreaded physics spaces are not the only form of parallelism available, nor are they a panacea for all Minie performance issues.

From jme3-bullet, Minie inherited ThreadingType.PARALLEL, the capability to dedicate a single Java thread to physics. This allows physics simulation to proceed while the rendering thread is blocked, or (potentially) in parallel with rendering. Bullet’s multithreading occurs at a lower level and a finer grain; it is orthogonal to this feature.
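For context, enabling that inherited feature takes one line. Here’s a minimal sketch using the standard BulletAppState API:

    import com.jme3.app.SimpleApplication;
    import com.jme3.bullet.BulletAppState;

    public class ParallelPhysicsApp extends SimpleApplication {

        public static void main(String[] args) {
            new ParallelPhysicsApp().start();
        }

        @Override
        public void simpleInitApp() {
            BulletAppState bulletAppState = new BulletAppState();
            // dedicate a Java thread to physics, instead of stepping on the render thread
            bulletAppState.setThreadingType(BulletAppState.ThreadingType.PARALLEL);
            stateManager.attach(bulletAppState);
        }
    }

The default is ThreadingType.SEQUENTIAL, which steps physics on the render thread.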

Most physics games use a single physics space. If a game used multiple physics spaces, one could dedicate a Java thread to each physics space. Again, that’s orthogonal to the feature I’m pursuing.

Bullet v3 reportedly has the capability to exploit the SIMD parallelism provided by graphics adapters. Minie is still based on Bullet v2, so it doesn’t yet have access to this feature.

AFAIK, Bullet’s multithreading support is limited to btDiscreteDynamicsWorld::stepSimulation(), which corresponds to Minie’s PhysicsSpace.update(). If there’s a performance benefit, I expect it will be most pronounced for a PhysicsSpace with a large number of dynamic rigid bodies. I don’t expect any speedup for soft bodies, multibody objects, sweep tests, ray tests, or contact tests. I’m unsure how much multithreading will benefit kinematic rigid bodies, ghost objects, or characters.
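To make that concrete, here’s a hedged sketch (not one of my actual tests) of the kind of workload I have in mind: a PhysicsSpace stepped directly, dominated by dynamic rigid bodies. PhysicsSpace.update() is the wrapper around stepSimulation():

    import com.jme3.bullet.PhysicsSpace;
    import com.jme3.bullet.collision.shapes.SphereCollisionShape;
    import com.jme3.bullet.objects.PhysicsRigidBody;
    import com.jme3.math.Vector3f;

    public class ManyDynamicBodies {

        public static void main(String[] args) {
            // assumes the bulletjme native library has already been loaded
            PhysicsSpace space = new PhysicsSpace(PhysicsSpace.BroadphaseType.DBVT);

            SphereCollisionShape shape = new SphereCollisionShape(0.5f);
            for (int i = 0; i < 1000; ++i) { // many dynamic rigid bodies
                PhysicsRigidBody body = new PhysicsRigidBody(shape, 1f);
                body.setPhysicsLocation(new Vector3f(0f, 2f * i, 0f));
                space.addCollisionObject(body);
            }

            for (int step = 0; step < 600; ++step) {
                space.update(1f / 60); // each call invokes stepSimulation()
            }
        }
    }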

Not all of stepSimulation() is parallelized, and threads will doubtless conflict somewhat over shared resources such as locks and memory bandwidth. So even in ideal cases, I don’t expect stepSimulation() to execute 12x faster on hardware with 12 CPU threads.
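Amdahl’s law gives a rough ceiling here. The parallel fraction p of stepSimulation() is unknown to me, so the p = 0.75 below is purely illustrative:

    S(n) = \frac{1}{(1 - p) + p/n}
    S(12) = \frac{1}{0.25 + 0.75/12} = 3.2 \quad (\text{for } p = 0.75)

So even a mostly-parallel step would top out at about 3.2x on 12 threads.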


Bullet exploits task-level parallelism using an abstraction layer. The layer interfaces to 3 different thread-management APIs: OpenMP, Microsoft’s Parallel Patterns Library, and Intel’s Threading Building Blocks. To exploit task-level parallelism, the BT_THREADSAFE macro must be defined, and a thread-management API must be selected at compile time.


Meanwhile, open-source development of Bullet appears to have stopped. There have been no commits to the main repo since May, and no official releases since November. Most of the user documentation is in a PDF that hasn’t been updated in 6 years.

7 Likes

Just putting this wrinkle out there because it’s something that even JME’s multithreading doesn’t handle well…

Pure lock-step parallel execution, the way JME does it, has the potential for odd frame stutter… and depending on how Bullet is stepped, it may result in varying accuracy on that side, as frames are sliced up differently. Imagine the case where rendering and physics take almost the same amount of time. If physics exceeds the 60 Hz budget by even just a few microseconds, we may drop a whole visual frame. It can also work in the other direction, where Bullet has to slice up frames and ends up with one that’s super, super small (which, with float, will definitely affect accuracy at some point).

Just something to think about.

My own physics integrations tend to be based around a networking model, so I’ve already built in buffering and tweening. Physics and rendering can run completely independently, as long as the view slightly lags the simulation (by a single frame at minimum). The view is always interpolated between known-good values, so motion is always smooth.

It comes with its own new set of things to worry about, but it is nice that the physics loop and render loop are otherwise completely decoupled (you can run rendering at 120 Hz and physics at 50 Hz and everyone is still happy).
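In case it helps, here’s a bare-bones sketch of that idea. The names are hypothetical, and a real integration would also buffer rotations and timestamps:

    import com.jme3.math.Vector3f;

    // Double-buffers physics snapshots so the render thread can tween between them.
    public class TweenBuffer {

        private final Vector3f prev = new Vector3f(); // snapshot from the previous step
        private final Vector3f next = new Vector3f(); // snapshot from the latest step

        // called from the physics loop after each completed step
        public synchronized void push(Vector3f position) {
            prev.set(next);
            next.set(position);
        }

        // called from the render loop; alpha in [0, 1] is progress between steps
        public synchronized Vector3f interpolate(float alpha) {
            return new Vector3f().interpolateLocal(prev, next, alpha);
        }
    }

Since the view only ever reads prev and next, it lags the simulation by at most one step, which is the lag described above.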

2 Likes

Bullet has to slice up frames and ends up with one that’s super, super small

I’m pretty sure Bullet’s btDiscreteDynamicsWorld::stepSimulation() doesn’t simulate short steps unless you pass maxSubSteps = 0. (Default is 4.) To avoid short steps, Bullet simulation time lags clock time. This behavior also extends to PhysicsSpace and BulletAppState … in Minie and also in modern releases of jme3-bullet.
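For anyone following along, here’s my Java paraphrase of the fixed-timestep accounting (the maxSubSteps > 0 path), based on my reading of the Bullet source. It’s a sketch, not the actual code:

    /** Sketch of btDiscreteDynamicsWorld::stepSimulation()'s fixed-timestep path. */
    class StepAccounting {

        private float localTime; // clock time not yet simulated

        int stepSimulation(float timeStep, int maxSubSteps, float fixedTimeStep) {
            localTime += timeStep;
            int numSubSteps = (int) (localTime / fixedTimeStep); // whole substeps only
            localTime -= numSubSteps * fixedTimeStep; // remainder lags, never simulated as a short step
            int clamped = Math.min(numSubSteps, maxSubSteps); // excess substeps are dropped
            for (int i = 0; i < clamped; ++i) {
                singleStep(fixedTimeStep); // every substep is full length
            }
            return clamped;
        }

        private void singleStep(float fixedTimeStep) {
            // stands in for btDiscreteDynamicsWorld::internalSingleStepSimulation()
        }
    }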

I know you’ve had conflicting experiences. We’ve had this discussion before. I don’t wish to rehash it here.


High-level status:

  • With a bleeding-edge build of Minie, applications can exploit 4 CPU threads on my Linux desktop. Said build passes all functional tests. On a simple performance test, CPU utilization (reported by “mpstat -P ALL 1”) hits 50-75%, but performance is measurably worse than single-threaded Minie.
  • There’s good reason to expect even worse behavior on Windows platforms.
  • Multithreaded Minie isn’t being built for macOS, due to limitations of Xcode.
  • I have doubts about whether standard JVMs and debuggers are compatible with multithreaded native code.

Details:

I’ve been using the OpenMP API because it looked portable across a broad range of compilers and platforms. However, I discovered that OpenMP requires special compile/link options:

  • for Microsoft Visual C++: /openmp
  • for GCC: -fopenmp / -lgomp
  • for LLVM: -fopenmp / -lomp

Apple’s Xcode is based on LLVM, but I had trouble using it to build for macOS. Xcode’s OpenMP support is reportedly disabled: OpenMP on macOS with Xcode tools

I should be able to work around this with build tools other than Xcode, but for now I’m focusing on Linux and Windows platforms.

For reasons I don’t yet understand, multithreading exposed an old bug in GImpact, which you can read about here: pure virtual method called during TestCloneShapes on Linux · Issue #17 · stephengold/Minie · GitHub

I’m concerned about the compile/link options because Minie’s native libraries get dynamically loaded into a JVM which was probably built without these options. How compatible are these binaries with one another? Do the necessary OpenMP libraries get loaded at runtime? I don’t know.

My concern deepened when I used GDB to step through multithreaded native code. GDB generally has good thread support. However, I noticed garbled local variable values in the stack traces. For instance:

(gdb) bt
#0  btDiscreteDynamicsWorldMt::predictUnconstraintMotion(float)
    (this=0x7f64f471b1e0, timeStep=4.56991455e-41)
    at /home/sgold/Git/Libbulletjme/src/main/native/bullet3/BulletDynamics/Dynamics/btDiscreteDynamicsWorldMt.cpp:213
#1  0x00007f6460280acb in btDiscreteDynamicsWorld::internalSingleStepSimulation(float)
    (this=0x7f6468034a30, timeStep=0.0166666675)
    at /home/sgold/Git/Libbulletjme/src/main/native/bullet3/BulletDynamics/Dynamics/btDiscreteDynamicsWorld.cpp:462

Notice how the value of timeStep (which is passed without modification from internalSingleStepSimulation() to predictUnconstraintMotion()) appears to become garbage. I’m pretty sure this is a GDB artifact, since stack corruption of this sort would break the app’s functionality. Without a reliable debugger, it’s difficult to be sure what’s going on.

For performance testing, I modified the TestBatchNodeTower app from jme3-examples (the timing instrumentation is sketched after this list):

  • disable ThreadingType.PARALLEL
  • setLinearSleepingThreshold(0f) to prevent deactivation of dynamic bodies
  • capture System.currentTimeMillis() before and after each physics step
  • log the average wall-clock time spent in physics after every 50 steps
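Roughly, the instrumentation looks like this: a sketch using Minie’s PhysicsTickListener, with System.nanoTime() for finer resolution. (The actual app appears later in this thread.)

    import com.jme3.bullet.PhysicsSpace;
    import com.jme3.bullet.PhysicsTickListener;

    public class StepTimer implements PhysicsTickListener {

        private int numSteps;
        private long preTickNs;
        private long totalNs;

        @Override
        public void prePhysicsTick(PhysicsSpace space, float timeStep) {
            preTickNs = System.nanoTime(); // capture the clock before each step
        }

        @Override
        public void physicsTick(PhysicsSpace space, float timeStep) {
            totalNs += System.nanoTime() - preTickNs; // time spent in this step
            if (++numSteps % 50 == 0) { // log a running average every 50 steps
                System.out.printf("millisPerStep = %.1f%n", totalNs * 1e-6f / 50);
                totalNs = 0L;
            }
        }
    }

Register an instance using physicsSpace.addTickListener().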

Here’s a typical log for a non-multithreaded run:

Libbulletjme version 11.1.0 initializing
millisPerStep = 3.3
shoot
millisPerStep = 3.4
millisPerStep = 3.2
millisPerStep = 2.6
millisPerStep = 2.2
millisPerStep = 1.9
millisPerStep = 1.8

And here’s a multithreaded run:

Mt_Libbulletjme version 11.1.0 initializing
millisPerStep = 6.2
shoot
millisPerStep = 5.1
millisPerStep = 4.1
millisPerStep = 3.8
millisPerStep = 2.7
millisPerStep = 3.1
millisPerStep = 2.3

As you can see, multithreading actually made physics simulation take more wall-clock time.

Meanwhile, I have a colleague who tests Libbulletjme, which is basically Minie with all the JMonkeyEngine dependencies removed. They’ve run their own performance tests of multithreading. Their app creates 500 bodies on a Windows system with 12 CPU threads. They report that all 12 CPU threads go to 100% utilization, yet (qualitative) performance is no better than it was without multithreading.

Questions? Possible explanations? Ideas?

1 Like

Without knowing how the multithreading was actually implemented, poor scalability almost always means heavy contention over shared data… but then it’s weird to have all cores operating at 100%. It’s like they traded contention for shuffling data around into thread-specific buffers, but then one thread (the visualization) still has to wait for all of the buffers to sync up.

Weird.

2 Likes

To be clear: I haven’t seen 100% CPU on Linux. More like 50%-75%. 100% was something my colleague saw on Windows 11 with many more CPU threads.

I’ll gather my own results on Windows, so I can directly compare between OSes.

1 Like

Today I measured wall-clock time per physics step on a tower-of-bricks app. I tried with multithreading and without it, on both Windows 7 and Linux. The same hardware (2012 desktop with 4 CPU threads) was used in all tests.

It wasn’t super scientific, but it sufficed to convince me that MT sped up physics on Windows while slowing it down on Linux. However, the single-threaded Windows performance was much worse to start with:

[chart: wall-clock time per physics step, Windows 7 versus Linux, MT versus non-MT]

Note that the Y axis is time per step, so lower results are better.

I plan to re-run the Windows tests with Windows 10 on a modern laptop.

Also, I got Quickprof working on the native libraries. Analyzing some Quickprof dumps may reveal where the performance bottlenecks are.

7 Likes

Wow, I did not expect to see such a discrepancy between Windows and Linux.

2 Likes

Same, @tlf30. That’s a pretty wild disparity between them…

2 Likes

Can you post your testing app, as source or an executable, so we can run it on our machines? That way, we can collect data from different devices, including mobile ones.

1 Like

How is it possible that there’s so much difference between Windows and Linux?

It shocked me. Are you sure Windows wasn’t running its disk checks or other background tasks that slow everything down? :slight_smile:

1 Like

Me neither! The difference could derive from the compilers (GCC 9.3 -O3 versus Visual Studio 2015 default optimization). However, I suspect Windows 7 was intentionally crippled around the time of its EOL, to encourage customers to upgrade. That’s why I wanted to re-test on Windows 10.

Yes! My previous post was rushed, and I neglected to include that information:

https://github.com/stephengold/Minie/blob/16e011bee8048d21cc34f145a5a4ad463a34cedb/MinieExamples/src/main/java/jme3test/batching/TowerPerformance.java

Multithreading support isn’t in any Minie releases yet, so build from source. For maximum comparability, check out commit hash 16e011be.

If you run the app using Gradle, I recommend disabling assertions (line 20 of “common.gradle”).

Multithreading is enabled for Windows64 and Linux64 … and is not available for other platforms such as Android. To disable multithreading, edit lines 82 and 88 of “MinieLibrary/build.gradle” to remove “Mt” from the build flavor … and then perform a clean rebuild of the entire project.


Here are test results from the Windows 10 laptop…

[Imgur chart: wall-clock time per physics step on the Windows 10 laptop]

Notice:

  1. Multithreading made the performance worse.
  2. The Visual Studio C++ compiler improved noticeably in 4 years!

The configuration is:

Processor        Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz   2.59 GHz
Installed RAM    32.0 GB (31.8 GB usable)
System type      64-bit operating system, x64-based processor
Edition    Windows 10 Home
Version    20H2
OS build   19042.1165

I’m not sure. If I re-run the tests, what would be a good way to check for that?

I just mean look at Task Manager: when sorted by name, it shows “background processes”, where you can check whether something is slowing down the system. Usually (at least for me), for the first 5-10 minutes after startup it uses the disk at 100%, which slows down the system too (I don’t know if it’s defragmenting or what). So it would be good to check there that CPU/RAM/disk usage is low when you start the test.

Anyway, it sounds like MT helps when the CPU is slow at physics, while performance is worse on a fast CPU.
Something is odd with these graphs, but it’s hard to say what.

edit: it seems I will need to upgrade NetBeans to stop using the Gradle 7.* wrapper for the Minie project, since I have an issue:

1 Like

it would be good to check there that CPU/RAM/disk usage is low when you start the test

Thanks for the tip.

I re-ran yesterday’s Windows 7 measurements to see how reproducible they are. Before starting the first test, I terminated startup tasks (MRT.exe and various update managers) using Task Manager. I also verified that CPU utilization was mostly 0%. Performance improved, mostly. However, yesterday’s findings still hold:

  • MT provides modest speedups throughout the test
  • Linux with GCC 9.3 -O3 outperforms (by a large margin) Windows 7 with Visual Studio 2015 and default optimization

[Imgur chart: wall-clock time per physics step, re-run Windows 7 measurements versus Linux]

During the multithreaded test, I saw utilizations near 100% on all CPU threads.

I calculated speedups (Non-MT time divided by MT time) for the new measurements. They ranged from 1.18x to 1.78x. But it’s clear to me that Minie’s performance on Windows has major room for improvement!

The plan for today: test/compare various Visual Studio releases and compiler optimizations.

I will need to upgrade NetBeans to stop using the Gradle 7.* wrapper for the Minie project

I’m still using JMonkeyEngine SDK v3.2.4, which I believe is based on NetBeans 7. When I run builds from the SDK using the Gradle v7 wrapper, I get warnings because the “-c” option is deprecated in Gradle v7. So I may have issues when the Gradle v8 wrapper is released, but for now I’m OK.

If the Gradle v7 wrapper is an issue, you can either:

  • build and run from the command line OR
  • downgrade to the Gradle v6.9 wrapper, by editing line 3 of “gradle/wrapper/gradle-wrapper.properties” (see below)
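Line 3 is the distributionUrl. Assuming the standard wrapper layout, the edited line would read:

distributionUrl=https\://services.gradle.org/distributions/gradle-6.9-bin.zip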
1 Like

test/compare various Visual Studio releases and compiler optimizations

The upshot was a new release of native libraries for Windows, compiled with the “/O2” and “/Ob3” options. The good news is that the new libraries boost Windows performance. The boost was about 5x on non-MT tests. The gap between Windows and Linux performance (without multithreading) is now only 1.5x to 2x instead of 7.5x to 10x. The bad news is that MT performance did not improve as much as non-MT performance did. The MT speedups previously seen on Windows are now essentially gone.

2 Likes

Hi Stephen. I guess that for the test app to work, one needs to include Minie > 4.1.0 in the Gradle config:

implementation 'com.github.stephengold:Minie:4.2.0+for34'

But for some reason, Gradle cannot find any Minie > 4.1.0.
Can you check & deploy the newest version?

Thanks!

1 Like

Using Linux Debian:

MT:

Warning: assertions are enabled.
Mt_Libbulletjme version 11.2.0 initializing
millisPerStep = 7,662
millisPerStep = 5,869
millisPerStep = 5,803
millisPerStep = 4,629
shoot
millisPerStep = 7,420
millisPerStep = 5,293
millisPerStep = 5,274
millisPerStep = 3,886
millisPerStep = 4,401
millisPerStep = 6,998
millisPerStep = 3,114
millisPerStep = 4,771
Warning: can't access CProfileManager!

Non-MT (removed “Mt” from build 64, BTW):

Warning: assertions are enabled.
Libbulletjme version 11.2.0 initializing
millisPerStep = 3,207
millisPerStep = 3,164
millisPerStep = 3,237
millisPerStep = 3,214
shoot
millisPerStep = 3,057
millisPerStep = 2,606
millisPerStep = 2,393
millisPerStep = 1,918
millisPerStep = 1,764
millisPerStep = 1,694
millisPerStep = 1,681
millisPerStep = 1,721
Warning: can't access CProfileManager!

I don’t know when I’ll be able to check Windows, though.

1 Like

There’s no such thing as v4.2.0+for34. Minie v4.2 only supports JME v3.4, so it doesn’t need the “for34” qualifier. Try:

implementation 'com.github.stephengold:Minie:4.2.0'
2 Likes

Gradle downloaded Minie 4.2.0 fine, but I have an issue with the code below:
the NativeLibrary.resetQuickprof() and NativeLibrary.dumpQuickprof() methods are not recognized.

    @Override
    public void physicsTick(PhysicsSpace space, float timeStep) {
        // accumulate wall-clock time since prePhysicsTick() captured preTickNs
        physicsNs += System.nanoTime() - preTickNs;
        ++numSteps;
        if ((numSteps % 50) == 0) { // log the average time per step every 50 steps
            float millisPerStep = (physicsNs * 1e-6f) / 50;
            System.out.printf("millisPerStep = %.3f%n", millisPerStep);
            physicsNs = 0L;
        }

        if (numSteps == 200) { // reset profiling, then shoot on the render thread
            NativeLibrary.resetQuickprof();
            enqueue(new Callable<Void>() {
                @Override
                public Void call() throws Exception {
                    shoot();
                    return null;
                }
            });

        } else if (numSteps == 600) { // dump the profile, then stop the app
            NativeLibrary.dumpQuickprof();
            enqueue(new Callable<Void>() {
                @Override
                public Void call() throws Exception {
                    stop();
                    return null;
                }
            });
        }
    }
1 Like

Yes, those were just added last week. They don’t exist in v4.2. Expect them in v4.3.

1 Like

It’s probably changed a lot over the years, but a decade ago that would have been a HUGE difference.