Efficiency Improvements

phr00t · February 4, 2014, 3:41pm

Hey all,

There are many places I’ve found in jME3 to improve performance, both in memory usage & CPU usage. I’ve tested changes through the profiler to see significant improvements. The first I’d like to share requires changes to the Material, Technique & DefineList classes. Basically, temporary iterators are being created that is taking up a bunch of CPU & memory in the DefineList’s “equalParams” function. We don’t need to make these iterators – we can access the ListMap’s values directly since there is a backing array. The start of the function can be changed to this:

[java] public boolean equalsParams(ListMap params, TechniqueDef def) {

    int size = 0;

    for(int i = 0; i &lt; params.size() ; i++ ) {
        MatParam param = (MatParam)params.getValue(i);[/java]

… it requires passing the ListMap to this function, instead of Collection<MatParam>. To make that change, Material needs to return its ListMap and Technique needs to pass this variable along. Anyway, I’m benefiting from this change (and more) in my local repository & I’d like to push at least this one to the SVN.

Phr00t

abies · February 4, 2014, 3:48pm

Please take a look at
http://hub.jmonkeyengine.org/forum/topic/huge-garbage-creation-from-technique-makecurrent/page/2/

In patch 2 there are some optimalizations which reduce amount of garbage generated around these areas. I’m not sure if it will still apply cleanly, but you can take a look there.

Not sure if patch1 was applied - it was already long time ago.

phr00t · February 4, 2014, 4:03pm

Good find. These solutions need to be worked into the engine quicker, so people like me don’t need to solve the same problems again. I’ll push my changes to the SVN, which are clean. Before anyone says “garbage collection is fast & this isn’t a concern”: it might not be for you, but apparently it is for many people:

I also have a solution for shadows, but it isn’t as clean & I’d like other input. Basically, I create a pool of BoundingVolumes to use, instead of always creating new ones:

[java] // re-use BoundingVolumes so we don’t keep piling on the memory usage
private static BoundingVolume[] tempBVs;
private static int BVindex;

static {
    tempBVs = new BoundingVolume[2048];
    for(int i=0;i&lt;tempBVs.length;i++) tempBVs[i] = new BoundingBox();
    BVindex = 0;
}

private static BoundingVolume GetFreeBV() {
    BVindex = (BVindex + 1) % tempBVs.length;
    return tempBVs[BVindex];
    //return new BoundingBox();
}[/java]

… which fixes garbage collection problems and is very fast. However, I’m wondering if the fixed 2048 number of BoundingVolumes works universally… works for me.

abies · February 4, 2014, 4:12pm

On the other hand, people are mostly hit by long-lived garbage, not by immediate garbage - and at least my patches were reducing short-lived garbage. I have done them mostly to be able to profile my own code - without them, ‘garbage garbage’ from engine was dominating ‘real garbage’ from user code. Plus, I had trial version of profiler, so I had to use it

phr00t · February 4, 2014, 4:19pm

You can add the NetBeans profiler as a plugin to the jME3 SDK, which is free to use. It has been very helpful, and I’ve also noticed the garbage from the engine greatly overwhelms any garbage in my code. I’ve also used the CPU profiler to clean up other spots in the engine – for example, using FastMath.abs(x) is slower than just doing “(x <= 0.0f) ? 0.0f - x : x;”, but you see FastMath.abs(x) used everywhere. I’d love to start submitting those changes too in a few key places.

abies · February 4, 2014, 4:27pm

I really don’t understand why FastMath.abs would be slower. Be sure that you are doing proper measurements - this kind of code was inlined by Hotspot already 10 years ago. Are you using server jvm and allowing it to warm up? In any case, I really doubt that any difference here would be visible in real game profile.

There is a difference between cutting 80% of garbage by changing few lines versus optimizing 0.001% of speed by making 100 lines unreadable. We need to balance these things.

phr00t · February 4, 2014, 4:43pm

For example, this is what Matrix3f.absoluteLocal() looks like not using FastMath.abs:

[java] public void absoluteLocal() {
// <editor-fold defaultstate=“collapsed” /* 0: aload_0
* 1: aload_0
* 2: getfield #3 * 5: fconst_0
* 6: fcmpg
* 7: ifgt 19
* 10: fconst_0
* 11: aload_0
* 12: getfield #3 * 15: fsub
* 16: goto 23
* 19: aload_0
* 20: getfield #3 * 23: putfield #3 * 26: aload_0
* 27: aload_0
* 28: getfield #4 * 31: fconst_0
* 32: fcmpg
* 33: ifgt 45
* 36: fconst_0
* 37: aload_0
* 38: getfield #4 * 41: fsub
* 42: goto 49
* 45: aload_0
* 46: getfield #4 * 49: putfield #4 * 52: aload_0
* 53: aload_0
* 54: getfield #5 * 57: fconst_0
* 58: fcmpg
* 59: ifgt 71
* 62: fconst_0
* 63: aload_0
* 64: getfield #5 * 67: fsub
* 68: goto 75
* 71: aload_0
* 72: getfield #5 * 75: putfield #5 * 78: aload_0
* 79: aload_0
* 80: getfield #6 * 83: fconst_0
* 84: fcmpg
* 85: ifgt 97
* 88: fconst_0
* 89: aload_0
* 90: getfield #6 * 93: fsub
* 94: goto 101
* 97: aload_0
* 98: getfield #6 * 101: putfield #6 * 104: aload_0
* 105: aload_0
* 106: getfield #7 * 109: fconst_0
* 110: fcmpg
* 111: ifgt 123
* 114: fconst_0
* 115: aload_0
* 116: getfield #7 * 119: fsub
* 120: goto 127
* 123: aload_0
* 124: getfield #7 * 127: putfield #7 * 130: aload_0
* 131: aload_0
* 132: getfield #8 * 135: fconst_0
* 136: fcmpg
* 137: ifgt 149
* 140: fconst_0
* 141: aload_0
* 142: getfield #8 * 145: fsub
* 146: goto 153
* 149: aload_0
* 150: getfield #8 * 153: putfield #8 * 156: aload_0
* 157: aload_0
* 158: getfield #9 * 161: fconst_0
* 162: fcmpg
* 163: ifgt 175
* 166: fconst_0
* 167: aload_0
* 168: getfield #9 * 171: fsub
* 172: goto 179
* 175: aload_0
* 176: getfield #9 * 179: putfield #9 * 182: aload_0
* 183: aload_0
* 184: getfield #10 * 187: fconst_0
* 188: fcmpg
* 189: ifgt 201
* 192: fconst_0
* 193: aload_0
* 194: getfield #10 * 197: fsub
* 198: goto 205
* 201: aload_0
* 202: getfield #10 * 205: putfield #10 * 208: aload_0
* 209: aload_0
* 210: getfield #11 * 213: fconst_0
* 214: fcmpg
* 215: ifgt 227
* 218: fconst_0
* 219: aload_0
* 220: getfield #11 * 223: fsub
* 224: goto 231
* 227: aload_0
* 228: getfield #11 * 231: putfield #11 * 234: return
* */
// </editor-fold>
}[/java] desc=“Compiled Code”>
// Field m00:F
// Field m00:F
// Field m00:F
// Field m00:F
// Field m01:F
// Field m01:F
// Field m01:F
// Field m01:F
// Field m02:F
// Field m02:F
// Field m02:F
// Field m02:F
// Field m10:F
// Field m10:F
// Field m10:F
// Field m10:F
// Field m11:F
// Field m11:F
// Field m11:F
// Field m11:F
// Field m12:F
// Field m12:F
// Field m12:F
// Field m12:F
// Field m20:F
// Field m20:F
// Field m20:F
// Field m20:F
// Field m21:F
// Field m21:F
// Field m21:F
// Field m21:F
// Field m22:F
// Field m22:F
// Field m22:F
// Field m22:F

Then, with FastMath.abs():

[java] public void absoluteLocal() {
// <editor-fold defaultstate=“collapsed” desc=“Compiled Code”>
/* 0: aload_0
* 1: getfield #3 // Field m00:F
* 4: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 7: pop
* 8: aload_0
* 9: getfield #4 // Field m01:F
* 12: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 15: pop
* 16: aload_0
* 17: getfield #5 // Field m02:F
* 20: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 23: pop
* 24: aload_0
* 25: getfield #6 // Field m10:F
* 28: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 31: pop
* 32: aload_0
* 33: getfield #7 // Field m11:F
* 36: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 39: pop
* 40: aload_0
* 41: getfield #8 // Field m12:F
* 44: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 47: pop
* 48: aload_0
* 49: getfield #9 // Field m20:F
* 52: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 55: pop
* 56: aload_0
* 57: getfield #10 // Field m21:F
* 60: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 63: pop
* 64: aload_0
* 65: getfield #11 // Field m22:F
* 68: invokestatic #13 // Method com/jme3/math/FastMath.abs:(F)F
* 71: pop
* 72: return
* */
// </editor-fold>
}[/java]

… then each FastMath.abs():

[java] public static float abs(float fValue) {
// <editor-fold defaultstate=“collapsed” desc=“Compiled Code”>
/* 0: fload_0
* 1: fconst_0
* 2: fcmpg
* 3: ifge 9
* 6: fload_0
* 7: fneg
* 8: freturn
* 9: fload_0
* 10: freturn
* */
// </editor-fold>
}[/java]

Basically, skipping FastMath.abs() inlines the functionality – no function calls needed. It probably isn’t that noticeable for most projects, but in tight loops it adds up. I noticed it in shadow processing, if I remember correctly. It is a bit harder to read, but I modify each line like this:

[java] public void absoluteLocal() {
m00 = (m00 <= 0.0f) ? 0.0f - m00 : m00; //FastMath.abs(m00);
[/java]

Anyway, this is relatively minor to other improvements. For example, Quaternion was taking up a significant amount of memory in toRotationMatrix. I made a non-temporary Vector3f to store results, so one doesn’t need to be created every call:

[java] private Vector3f originalScale;
public Matrix4f toRotationMatrix(Matrix4f result) {
if( originalScale == null ) {
originalScale = new Vector3f();
}
result.toScaleVector(originalScale);[/java]

abies · February 4, 2014, 4:53pm

Don’t look at bytecode, look at generated machine opcodes after long-term compilation kicks in. Inlining is done in Hotspot, not in javac.
https://wikis.oracle.com/display/HotSpotInternals/PrintAssembly

There are cases where inlining is getting disabled due to too long methods etc, but I would be really suprised if it fails to inline FastMath.abs.

phr00t · February 4, 2014, 5:01pm

Interesting. I do remember using the profiler & seeing a modest CPU reduction when not using FastMath.abs(), so I stuck without it. Anyway, not big enough to push changes to the main SVN, but people looking to squeeze a few more FPS might find some benefit there.

I’m considering pushing the toRotationMatrix one, though. It does add a null check to the function, but that is to prevent a new Vector3f being created with every Quaternion unnecessarily – it instead gets created when needed only once for all calls to toRotationMatrix(…)

nehon · February 4, 2014, 5:34pm

My bad about the patches Abies posted, I should have merged them.

@phr00t what @abies says about warming up, micro benching is trickier than one may think, and have a good design for the test is crucial to have meaningful results.
I can be wrong , but I seriously doubt java7 is not inlining this kind of things through hot spot. Byte code won’t give you what really happens in the long run.

About the shadows patch is that in ShadowUtil? If yes 2048 seems a bit excessive, usually one would use some kind of growing pool instead of static instances.
Also your pool system doesn’t guarantee to have a “free” bounding volume as it cycles through the array to avoid freeing instance (i suppose). I agree it may not occur in the general case since you have 2048 instances…but what if you need more boundingBox than 2048…you will just get the first one and have some silent failure… honestly that’s not the way to do it.

From what I gathered there are 3 instanciations of boundingBox in ShadowUtil.updateShadowCam (that’s 3 per frame), but I’d really like to see the numbers about the garbage it generates in the end, and how it’s managed by java.

nehon · February 4, 2014, 5:37pm

@phr00t said: I'm considering pushing the toRotationMatrix one, though. It does add a null check to the function, but that is to prevent a new Vector3f being created with every Quaternion unnecessarily -- it instead gets created when needed only once for all calls to toRotationMatrix(...)

Please, don't. We have tempVars for this, so please use them.

phr00t · February 4, 2014, 5:56pm

@nehon said: Please, don't. We have tempVars for this, so please use them.

Then jMonkeyEngine should follow your own advice & toRotationMatrix should use tempVars to get the same benefit as in my code.

@nehon said: My bad about the patches Abies posted, I should have merged them.
@phr00t what @abies says about warming up, micro benching is trickier than one may think, and have a good design for the test is crucial to have meaningful results.
I can be wrong , but I seriously doubt java7 is not inlining this kind of things through hot spot. Byte code won’t give you what really happens in the long run.

About the shadows patch is that in ShadowUtil? If yes 2048 seems a bit excessive, usually one would use some kind of growing pool instead of static instances.
Also your pool system doesn’t guarantee to have a “free” bounding volume as it cycles through the array to avoid freeing instance (i suppose). I agree it may not occur in the general case since you have 2048 instances…but what if you need more boundingBox than 2048…you will just get the first one and have some silent failure… honestly that’s not the way to do it.

From what I gathered there are 3 instanciations of boundingBox in ShadowUtil.updateShadowCam (that’s 3 per frame), but I’d really like to see the numbers about the garbage it generates in the end, and how it’s managed by java.

Yes, I believed there was a better, more universal way of handling the ShadowUtil problem & you are right. My system has been working for my situation, but a very similar system with your concerns accounted for should be put in place as soon as possible.

nehon · February 4, 2014, 6:01pm

@phr00t said: Then jMonkeyEngine should follow your own advice & toRotationMatrix should use tempVars to get the same benefit as in my code.
Yes, I believed there was a better, more universal way of handling the ShadowUtil problem & you are right. My system has been working for my situation, but a very similar system with your concerns accounted for should be put in place as soon as possible.

Dude you make patches…make them right.
I couldn’t care less that it works for 3089 or whatever you’re working on. We didn’t gave you commit access to push your own changes in the engine to your convenience, I’d like you to at least wait for us to discuss your changes before committing them.
You have valid points here, but the way you solve them and force them into the engine is not right.

phr00t · February 4, 2014, 6:07pm

@nehon said: Dude you make patches...make them right. I couldn't care less that it works for 3089 or whatever you're working on. We didn't gave you commit access to push your own changes in the engine to your convenience, I'd like you to at least wait for us to discuss your changes before committing them. You have valid points here, but the way you solve them and force them into the engine is not right.

I haven’t pushed any of FastMath.abs, ShadowUtil updates or toRotationMatrix ones because I’m doing just what you suggest: discussing the changes.

I know you don’t care if it works for 3089 or not – that is why I don’t push those updates, but I take time to discuss them here.

I’m not forcing anything on the engine, but I am trying to be efficient and avoid improvements not making them into the engine at all due to bureaucracy. For example, the DefineList change was very small, but saw huge improvements & fixes problems existing for months.

phr00t · February 4, 2014, 6:19pm

Here is the toRotationMatrix code using TempVars. Tested to work OK in 3089 – good for submission?

[java] public Matrix4f toRotationMatrix(Matrix4f result) {
TempVars tempv = TempVars.get();
Vector3f originalScale = tempv.vect1;

    result.toScaleVector(originalScale);
    result.setScale(1, 1, 1);
    float norm = norm();
    // we explicitly test norm against one here, saving a division
    // at the cost of a test and branch.  Is it worth it?
    float s = (norm == 1f) ? 2f : (norm &gt; 0f) ? 2f / norm : 0;

    // compute xs/ys/zs first to save 6 multiplications, since xs/ys/zs
    // will be used 2-4 times each.
    float xs = x * s;
    float ys = y * s;
    float zs = z * s;
    float xx = x * xs;
    float xy = x * ys;
    float xz = x * zs;
    float xw = w * xs;
    float yy = y * ys;
    float yz = y * zs;
    float yw = w * ys;
    float zz = z * zs;
    float zw = w * zs;

    // using s=2/norm (instead of 1/norm) saves 9 multiplications by 2 here
    result.m00 = 1 - (yy + zz);
    result.m01 = (xy - zw);
    result.m02 = (xz + yw);
    result.m10 = (xy + zw);
    result.m11 = 1 - (xx + zz);
    result.m12 = (yz - xw);
    result.m20 = (xz - yw);
    result.m21 = (yz + xw);
    result.m22 = 1 - (xx + yy);

    result.setScale(originalScale);
    
    tempv.release();
    
    return result;
}[/java]

Thank you for pointing out TempVars so I could “get it right” – we can help each other out here without hostilities!

pspeed · February 4, 2014, 6:44pm

@phr00t said: I haven't pushed any of FastMath.abs, ShadowUtil updates or toRotationMatrix ones because I'm doing just what you suggest: discussing the changes.
I know you don’t care if it works for 3089 or not – that is why I don’t push those updates, but I take time to discuss them here.

I’m not forcing anything on the engine, but I am trying to be efficient and avoid improvements not making them into the engine at all due to bureaucracy. For example, the DefineList change was very small, but saw huge improvements & fixes problems existing for months.

FYI: It wasn’t bureaucracy in this case. It just needed a bump. Since the patches were posted as a download link and Nehon had already looked at them, I think the rest of us just moved on.

Point here: while it’s good to post download links to avoid patch formatting issues. It’s also good to post the diffs right to the thread so that ‘we of little time’ can read them. So, thanks for doing that Phr00t. These changes/discussion have been much easier to follow because of that.

I looked over the commits so far and they are good. We will have to move them over to the new build for them to really take effect but they look fine to me.

phr00t · February 4, 2014, 7:02pm

@pspeed said: FYI: It wasn't bureaucracy in this case. It just needed a bump. Since the patches were posted as a download link and Nehon had already looked at them, I think the rest of us just moved on.
Point here: while it’s good to post download links to avoid patch formatting issues. It’s also good to post the diffs right to the thread so that ‘we of little time’ can read them. So, thanks for doing that Phr00t. These changes/discussion have been much easier to follow because of that.

I looked over the commits so far and they are good. We will have to move them over to the new build for them to really take effect but they look fine to me.

The toRotationMatrix change look good? I’d like to submit that.

I’m also curious about these new builds and the new fangled “JOGL” stuff… wonder if I should be preparing for it… maybe for the next game? Another thread, perhaps…

pspeed · February 4, 2014, 7:09pm

@phr00t said: The toRotationMatrix change look good? I'd like to submit that.

What’s the “diff” version look like? Is it just as long or does it do a good job of showing the difference over the old one?

(I don’t have the code open at the moment to do an eye-ball diff so I don’t know how much changed.)

nehon · February 4, 2014, 7:16pm

Your 3 changes looks fine (already committed), but they touches classes that are very central to the engine, so it raised the red flag when I saw them. As a rule of thumb you should post this kind of changes before committing them.

The toRotationMatrix change looks ok, go ahead.

I know I’m not as active as before but life’s a bitch. I’ll eventually commit those patches (abies ones), one main reason is that you guys won’t stop to remind me to do it. But one point of the core team is to centralize contributions and we don’t want to discover issues months later due to commits we weren’t aware of.

phr00t · February 4, 2014, 7:35pm

Submitted toRotationMatrix changes.

Looks like the only potentially useful stuff remaining in abies’s patches are string caching stuff, since the other things are now fixed. Looks like we will still need a solution for the BoundingVolumes in ShadowUtil, though. Not sure when I will get to a better solution, since I’ve hacked one that works for me, but I do like contributing to jME3 when I can.