Static temp objects and threading

I noticed static temp variables all across jME code, for example some Vector3f in Quaternion. Static temps are nice for performance, but my guess is that they kill multithreading. It is quite possible that two threads will execute the same code at the same time, thrashing each other's temp data. Making the methods synchronized is not the way to go; those variables would be best made thread-specific. Am I right with my guess, and how could it be solved?


The temp variables should be instance members; that would resolve the threading issue. I'm not sure what you mean by "thread-specific", but if you are implying the Thread-Local pattern, then you should consider the speed penalty of using thread-local variables. Basically it's memory saving with thread-local vs. speed with temp member fields.
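
For illustration, a minimal sketch of the Thread-Local pattern being described here (the helper class and names are hypothetical, not jME code): each thread lazily gets its own temp vector instead of all threads sharing a single static instance.

public class TempVectors {
    private static final ThreadLocal<Vector3f> TEMP = new ThreadLocal<Vector3f>() {
        @Override
        protected Vector3f initialValue() {
            return new Vector3f();
        }
    };

    public static Vector3f get() {
        // safe without synchronization: no other thread ever sees this instance
        return TEMP.get();
    }
}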

Thread-local variables also have a big performance penalty. Right now the solution is using one thread, or synchronizing your access across threads. For accessing OpenGL related methods you have to do this anyway…

IMO the vector/quaternion classes should be made immutable, and then you wouldn't need those temp variables… and the code might actually become faster, since Java has very fast allocation.
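
As a sketch of what that immutable approach could look like (a hypothetical class, not jME's actual Vector3f): every operation returns a new instance, so there is nothing shared for threads to corrupt.

public final class ImmutableVec3 {
    public final float x, y, z;

    public ImmutableVec3(float x, float y, float z) {
        this.x = x; this.y = y; this.z = z;
    }

    // each operation allocates a fresh result instead of mutating state
    public ImmutableVec3 add(ImmutableVec3 o) {
        return new ImmutableVec3(x + o.x, y + o.y, z + o.z);
    }

    public ImmutableVec3 mult(float s) {
        return new ImmutableVec3(x * s, y * s, z * s);
    }
}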

Almost all of the static variables were added after the fact because of huge slowdowns due to object allocation. Yes, we need to look at thread safety, but please also consider that these things were not added on a hunch. In any case, it's something that is on the backlog of things to consider for 2.0.

Momoko_Fan said:

IMO the vector/quaternion classes should be made immutable, and then you wouldn't need those temp variables… and the code might actually become faster, since Java has very fast allocation.


How would making them immutable make anything faster? That would mean we'd have to allocate an object for each calculation. I also don't see how doing extra allocation (as opposed to doing nothing) "might" make something faster. Also, allocation (while it would slow things down for sure) is not as much of a problem as deallocation.

As for things that *would* work, I can think of supplying your own temp variables (very ugly, both in the API and in that we'd make it the user's problem), and using temporary (scope-limited) primitives, which would require updating the math API to accept primitive arguments aside from just objects. I'm also curious whether this could cover all cases. And it's still ugly of course, though this is the kind of ugly you'd only notice in the extra method signatures; aside from that, nothing would change for the user.
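
A sketch of both options, on a hypothetical Vector3f-like class (not the current jME API, just an illustration):

public class Vec3 {
    public float x, y, z;

    // option 1: the caller supplies the result/temp object, so no shared statics are needed
    public Vec3 cross(Vec3 other, Vec3 store) {
        store.x = y * other.z - z * other.y;
        store.y = z * other.x - x * other.z;
        store.z = x * other.y - y * other.x;
        return store;
    }

    // option 2: primitive arguments, so all temporaries are plain floats on the stack
    public void crossLocal(float ox, float oy, float oz) {
        float rx = y * oz - z * oy;
        float ry = z * ox - x * oz;
        float rz = x * oy - y * ox;
        x = rx; y = ry; z = rz;
    }
}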

lex said:

The temp variables should be instance members; that would resolve the threading issue. I'm not sure what you mean by "thread-specific", but if you are implying the Thread-Local pattern, then you should consider the speed penalty of using thread-local variables. Basically it's memory saving with thread-local vs. speed with temp member fields.


Making temp variables member fields is too much overhead, I think. Just by instantiating a Quaternion (4 floats), I would get another 9 floats from 3 Vector3f's, and 3 extra objects created for each Quaternion object created.
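
For clarity, this is the instance-member variant being objected to (a sketch, not the actual jME class): every quaternion would carry three private Vector3f temps, i.e. three extra objects and nine extra floats per instance.

public class QuaternionWithInstanceTemps {
    public float x, y, z, w;

    // per-instance temps instead of statics: no sharing across instances, but costly in memory
    private final Vector3f tmpXaxis = new Vector3f();
    private final Vector3f tmpYaxis = new Vector3f();
    private final Vector3f tmpZaxis = new Vector3f();
}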

llama said:

Thread-local variables also have a big performance penalty.

I have just looked at the ThreadLocal class; it has JVM support in the Thread class (as the spec says), so it has to be faster than a hand-rolled solution. But surely not as fast as a static variable.

Momoko_Fan said:

IMO the vector/quaternion classes should be made immutable, and then you wouldn't need those temp variables… and the code might actually become faster, since Java has very fast allocation.

Allocation is one thing; garbage creation and JVM pauses due to garbage collection are another. I can imagine that if the JVM knows an object is created inside and used only in one method, it could dispose of it when the method returns (objects created on the stack). I read a page about it, and it suggests that temp objects can be created freely, because Java 1.6 knows how to allocate on the stack.
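
A sketch of the kind of non-escaping temporary being described (nothing jME-specific; whether the VM actually elides or stack-allocates it depends on its escape analysis, e.g. HotSpot's -XX:+DoEscapeAnalysis option):

public final class EscapeDemo {
    public static float length(float x, float y, float z) {
        // this temp array is used only inside the method and never escapes,
        // so an escape-analysing JIT may avoid the heap allocation entirely
        float[] tmp = { x, y, z };
        return (float) Math.sqrt(tmp[0] * tmp[0] + tmp[1] * tmp[1] + tmp[2] * tmp[2]);
    }
}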

renanse said:

Almost all of the static variables were added after the fact because of huge slowdowns due to object allocation. Yes, we need to look at thread safety, but please also consider that these things were not added on a hunch. In any case, it's something that is on the backlog of things to consider for 2.0.

I'm not asking you to fix it, I'm asking how to fix it myself.  :wink:

I think I'll run the profiler to see if Java 1.6 can really allocate objects on the stack.

vear said:

I think I'll run the profiler to see if Java 1.6 can really allocate objects on the stack.

Be careful with the profiler options (if you actually use one). It's likely that the measurements/instrumentation of a profiler alter the JVM's strategy!

Very interesting results indeed!!



disclaimer: post filled with wild guesses about what's going on…



My guess is that at least part of it is due to your access patterns during your tests: you're deliberately accessing that one (or three, in this case) static Vector from several threads at once, almost as much as you possibly can. But of course for some multithreading scenarios that's exactly what you'd want to do (work on a math problem on several cores at once), so it's not a bad test…



The VM is allowed to keep copies of variables in the thread's local memory space unless they are declared volatile; however, that doesn't mean it actually does this. Maybe when an object is accessed from several threads the VM automatically decides it shouldn't do that, because even though it would be technically correct for the values inside that Vector to differ between threads for an undetermined amount of time (seconds, hours, days, years, etc.) as long as you don't enter a synchronized block, it would be "weird".
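
To illustrate the memory-model rule referred to above (a hypothetical class, just for the example): a plain field may be cached per thread for an arbitrarily long time, a volatile one may not.

class SharedTemps {
    static volatile float visibleX; // writes are guaranteed to become visible to other threads
    static float maybeStaleY;       // another thread may keep reading an old cached copy
}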



It seems like it then DOES do this for the thread local created objects, because they (regardless of the fact that they're created through a ThreadLocal) are only accessed from one thread.



You could try running the static test with the -XX:+UseTLAB option… it might make the VM mysteriously lean more towards keeping copies of variables in the thread's local memory space.



This all would of course imply static is still faster than ThreadLocal for a single thread, did you test that?

With hash maps, if you keep the load factor down, you do not actually get much slower access with more objects (of course a bigger map does take up more space, and especially in your L1 cache that hurts). I think Java even uses one big specialized hash map per thread for all ThreadLocal objects.



Your "trick" of using one class that contains all the temps is definitely a good one.



But it's still surprising that it's that much faster… at how many threads is the "sweet spot"? (the point where ThreadLocal beats static access by the largest margin). Is it at 2?



With some further guessing: for the thread locals it's probably using the SSE registers for the x, y and z variables, without storing and retrieving them from the (L1 cached) memory all the time (or at least half as much as for the static version). That would explain why you get such a huge boost from hyperthreading too: hyperthreading can do much more useful work when the two threads aren't contending for the same memory.

Uhm yes, the "one big hashmap per thread" bit is very important.



Would you be willing to run the test with hyperthreading off, or to post all the code for the tests so I can run it on my single core Pentium M?

Here are the classes of the test:



The main class:


package localtest;

public class LocalTest {

    public static long runs = 100000000;
   
    public static long runSingle() {
        Vector3fM dir=new Vector3fM(2f, 0.1f, 1.5f );
        Vector3fM up = new Vector3fM(0.1f, 1f, 0.1f );
        Quaternion quat=new Quaternion();
       
        double accum=0;
        long runtime=System.currentTimeMillis();
        for(long i=0;i<runs;i++) {
            quat.lookAt(dir, up);
            accum+= quat.w + quat.x + quat.y + quat.z;
        }
        runtime=System.currentTimeMillis()-runtime;
        System.out.println("Checksum "+accum+" should be 1.345954418182373E7");
        return runtime;
    }
   
    public static long runSingleT() {
        Vector3fM dir=new Vector3fM(2f, 0.1f, 1.5f );
        Vector3fM up = new Vector3fM(0.1f, 1f, 0.1f );
        QuaternionT quat=new QuaternionT();
       
        double accum=0;
        long runtime=System.currentTimeMillis();
        for(long i=0;i<runs;i++) {
            quat.lookAt(dir, up);
            accum+= quat.w + quat.x + quat.y + quat.z;
        }
        runtime=System.currentTimeMillis()-runtime;
        System.out.println("Checksum "+accum+" should be 1.345954418182373E7");
        return runtime;
    }
   
    public static long runMultiple(int num) {
        System.out.println("Threads: "+num);
        Vector3fM dir=new Vector3fM(2f, 0.1f, 1.5f );
        Vector3fM up = new Vector3fM(0.1f, 1f, 0.1f );
       
        long allruntime=System.currentTimeMillis();
       
        // start all the threads
        QuatCompute computes[] = new QuatCompute[num];
        for(int i = 0; i < num ; i++) {
            computes[i]=new QuatCompute(dir, up, runs/num);
            new Thread(computes[i]).start();
        }

        // wait till all the threads complete
        boolean more=true;
        while(more ) {
            try {
                Thread.sleep(500);
                more=false;
                for(int i=0; (i < num) && !more ; i++) {
                    if(computes[i].accumall == 0) {
                        more=true;
                    }
                }
            } catch ( Exception e) {

            }
        }
       
        // calculate total checksum and the biggest run time
        double accum=0;
        long runtime=0;
        for(int i=0;i<num;i++) {
            accum+=computes[i].accumall;
            if(computes[i].runtime>runtime) {
                runtime=computes[i].runtime;
            }
        }
        allruntime=System.currentTimeMillis()-allruntime;
        System.out.println("Checksum "+accum+" should be 1.345954418182373E7");
        System.out.println("Total runtime " + allruntime);
        return runtime;
    }
   
    public static void main(String[] args) {
        System.gc(); System.gc(); System.gc(); System.gc();
        System.gc(); System.gc(); System.gc(); System.gc();
        System.gc(); System.gc(); System.gc(); System.gc();
        System.gc(); System.gc(); System.gc(); System.gc();
        System.out.println(System.getProperty("java.version"));
        System.out.println("Processors: " + Runtime.getRuntime().availableProcessors());
        long usedmem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
       
        long runtime = //runSingleT();
                //runSingle();
                runMultiple(2);
       
        System.out.println("Runtime: " + runtime);
        long usedmem1 = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.println("Consumed memory: " + (usedmem1 - usedmem)/1024);
    }

}



The class executed in each thread:


package localtest;
public class QuatCompute implements Runnable {

    // volatile: the main thread polls these from another thread, so the writes must be visible
    public volatile double accumall = 0;
    final Vector3fM dir;
    final Vector3fM up;
    volatile long runtime;
    final long count;
   
    public QuatCompute(Vector3fM dir, Vector3fM up, long count) {
        this.dir = dir;
        this.up = up;
        this.count = count;
    }
   
    public void run() {
        // replace with QuaternionT for ThreadLocal version
        Quaternion quat=new Quaternion();
        runtime=System.currentTimeMillis();
        double accum=0;
        for(long i=0;i<count;i++) {
            quat.lookAt(dir, up);
            accum+= quat.w + quat.x + quat.y + quat.z;
        }
        runtime=System.currentTimeMillis()-runtime;
        accumall = accum;
    }
}



For making the QuaternionT class, use the code from one of my previous posts.
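
Since that earlier post isn't reproduced here, a rough sketch of what a ThreadLocal-based QuaternionT could look like (assuming the Vector3f chaining API shown later in this thread; the real class is in the referenced post):

public class QuaternionT extends Quaternion {

    // one set of temp axes per thread instead of one static set shared by all threads
    private static final ThreadLocal<Vector3f[]> AXES = new ThreadLocal<Vector3f[]>() {
        @Override
        protected Vector3f[] initialValue() {
            return new Vector3f[] { new Vector3f(), new Vector3f(), new Vector3f() };
        }
    };

    @Override
    public void lookAt(Vector3f direction, Vector3f up) {
        Vector3f[] t = AXES.get();
        t[2].set(direction).normalizeLocal();                  // z axis
        t[0].set(up).crossLocal(direction).normalizeLocal();   // x axis
        t[1].set(direction).crossLocal(t[0]).normalizeLocal(); // y axis
        fromAxes(t[0], t[1], t[2]);
    }
}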

ThreadLocal 1.6.0_03:



Threads: 1

Runtime: 30766, 30719

Consumed memory: 97

That was on a Pentium M 740 by the way, a single core processor.



Thoughts:



As expected, on a single core system static is still always faster than ThreadLocal. ThreadLocal is still pretty impressive for thread-safe access though. 1.6 is also a lot faster than 1.5! On a hyperthreading or multicore system, however, it's ThreadLocal that becomes faster with 2 or more threads.



Speculation: This is because the VM seems to optimize access to variables only accessed from 1 thread, probably by storing the different variables in different CPU register sets, enabling concurrent usage, as opposed to trying to synchronize them in the case of a single static variable.



Still, on a single core, 2 threads are almost as fast as 1, or even inexplicably faster! (see the last 1.5 test, which I just repeated with the same result… crazy)

llama said:


Still, on a single core, 2 threads are almost as fast as 1, or even inexplicably faster! (see the last 1.5 test, which I just repeated with the same result… crazy)


Well, the computer's OS runs a lot of processes anyway. I cannot even imagine how many context switches the CPU does in a second. It would be pretty bad design for a modern, even non-hyperthreading, CPU if one extra context switch cost more than whatever the JVM can gain with some (who knows what kind of) optimization.

I developed the system property idea further.



public class LocalContext extends ThreadLocal<Context> {

    private static final LocalContext manager;
    private static final Context staticContext;
    private static final boolean useMultithreading;

    static {
        if (Boolean.getBoolean("com.jme.multithreading")) {
            manager = new LocalContext();
            staticContext = null;
            useMultithreading = true;
        } else {
            manager = null;
            staticContext = new Context();
            useMultithreading = false;
        }
    }

    public static boolean isMultithreading() {
        return useMultithreading;
    }

    @Override
    protected Context initialValue() {
        return new Context();
    }

    public static Context getContext() {
        return useMultithreading ? manager.get() : staticContext;
    }
}




public class Context {

    // temp variables for Quaternion
    public final Vector3f tmpYaxis = new Vector3f();
    public final Vector3f tmpZaxis = new Vector3f();
    public final Vector3f tmpXaxis = new Vector3f();

    // temp variables for Ray
    public final Vector3f tmpVa = new Vector3f();
    public final Vector3f tmpVb = new Vector3f();
    public final Vector3f tmpVc = new Vector3f();
    public final Vector3f tmpVd = new Vector3f();
}




    public void lookAt(Vector3f direction, Vector3f up ) {
        Context tmp = LocalContext.getContext();
        tmp.tmpZaxis.set( direction ).normalizeLocal();
        tmp.tmpXaxis.set( up ).crossLocal( direction ).normalizeLocal();
        tmp.tmpYaxis.set( direction ).crossLocal( tmp.tmpXaxis ).normalizeLocal();
        fromAxes( tmp.tmpXaxis, tmp.tmpYaxis, tmp.tmpZaxis );
    }
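
One thing to watch with this setup: the property is read once in LocalContext's static initializer, so it has to be set before the class is first touched, either with -Dcom.jme.multithreading=true on the command line or from code (a hypothetical launcher, just to show the ordering):

public class EnableMultithreadingDemo {
    public static void main(String[] args) {
        // must happen before anything triggers LocalContext's static initializer
        System.setProperty("com.jme.multithreading", "true");
        System.out.println("ThreadLocal temps enabled: " + LocalContext.isMultithreading());
    }
}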



Measurements with Java 1.7 Early Access:
1.7.0-ea

Static
Runtime: 13516
Runtime: 13437

Static context
Runtime: 13579
Runtime: 13563

ThreadLocal
Runtime: 14672
Runtime: 14704

ThreadLocal 2 Threads
Runtime: 10469
Runtime: 10859

1.6.0_03

Static
Runtime: 18172
Runtime: 18156

Static context
Runtime: 17813
Runtime: 18188

ThreadLocal:
Runtime: 18047
Runtime: 18032

ThreadLocal 2 threads
Runtime: 13344
Runtime: 13188

Good code… make it look a bit more neat :slight_smile:

Also looks like the VM is smart enough to do the same branch optimizations within a static method call, and it looks like it might even be smart enough to understand that (when not using multithreading) for



LocalContext.getContext().tmpZaxis



it can just refer to that particular instance of tmpZaxis at (JIT) compile time, or something almost like that.



It's a bit weird to see ThreadLocal suddenly "beat" static on a single thread under 1.6 though… Considering how close they were before on your system, that might just be a measuring error.



It's also good (or bad for 1.5 and 1.6) to see how 1.7-ea again reduces the times all across the board. I wonder if that's due to faster startup times or better instruction compilation.



(Quickly adding a small for loop around the test shows for 1.6 the "warmup" bonus is very very small)