Possible minor optimization in scaleTextureCoordinates

jonathan_pikalek · February 15, 2012, 8:19pm

I was digging through Mesh.java & checking out the code for scaleTextureCoordinates(Vector2f scaleFactor) and I noticed the following loop:

[java]for (int i = 0; i < fb.capacity() / 2; i++){

float x = fb.get();

float y = fb.get();

fb.position(fb.position()-2);

x *= scaleFactor.getX();

y *= scaleFactor.getY();

fb.put(x).put(y);

}[/java]

Given that the variable i isn’t used in the loop body, wouldn’t it be more efficient to rewrite the loop header as follows below to avoid division?

[java]for (int i = 0; i < fb.capacity(); i+=2)[/java]

Admittedly, this method probably doesn’t see a lot of use…

Sploreg · February 15, 2012, 9:06pm

It’s hard to tell how the loop would be optimized internally in java, but I’m guessing that it doesn’t end up doing the /2 every time.

A sure test would be to profile it both ways and see if there is a performance impact. If there is, then it is worth changing. Or just divide it by 2 first and store that in a local variable.

kwando · February 15, 2012, 9:13pm

There are almost better things to do than microoptimizing. The JVM is smart and will probably optimize it if it needs to.

sbook · February 16, 2012, 5:48am

I created two tests and ran javap on it. Based on the byte code, it seems that the JVM is doing the calculation each time… which would make sense since it can’t guarantee that the buffer will be of the same length on the loop’s next iteration.

[java]

public class ForLoopOld{

private int size;

public ForLoopOld(){

size = 100;

}

public int getSize(){

return size;

}

public static void main(String args){

ForLoopOld old = new ForLoopOld();

for(int i=0; i<old.getSize() / 2; i++){

// Do Something Here

}

}

}

[/java]

Compiled from "ForLoopOld.java"
public class ForLoopOld extends java.lang.Object{
public ForLoopOld();
Code:
0: aload_0
1: invokespecial #1; //Method java/lang/Object."":()V
4: aload_0
5: bipush 100
7: putfield #2; //Field size:I
10: return

public int getSize();
Code:
0: aload_0
1: getfield #2; //Field size:I
4: ireturn

public static void main(java.lang.String[]);
Code:
0: new #3; //class ForLoopOld
3: dup
4: invokespecial #4; //Method "":()V
7: astore_1
8: iconst_0
9: istore_2
10: iload_2
11: aload_1
12: invokevirtual #5; //Method getSize:()I
15: iconst_2
16: idiv
17: if_icmpge 26
20: iinc 2, 1
23: goto 10
26: return

}

[java]
public class ForLoopNew{

private int size;

public ForLoopNew(){
size = 100;
}

public int getSize(){
return size;
}

public static void main(String[] args){
ForLoopNew newClass = new ForLoopNew();
int limit = newClass.getSize() / 2;
for(int i=0; i<limit; i++){
// Do Something Here
}
}
}
[/java]

Compiled from "ForLoopNew.java"
public class ForLoopNew extends java.lang.Object{
public ForLoopNew();
Code:
0: aload_0
1: invokespecial #1; //Method java/lang/Object."":()V
4: aload_0
5: bipush 100
7: putfield #2; //Field size:I
10: return

public int getSize();
Code:
0: aload_0
1: getfield #2; //Field size:I
4: ireturn

public static void main(java.lang.String[]);
Code:
0: new #3; //class ForLoopNew
3: dup
4: invokespecial #4; //Method "":()V
7: astore_1
8: aload_1
9: invokevirtual #5; //Method getSize:()I
12: iconst_2
13: idiv
14: istore_2
15: iconst_0
16: istore_3
17: iload_3
18: iload_2
19: if_icmpge 28
22: iinc 3, 1
25: goto 17
28: return

}

Empire_Phoenix · February 16, 2012, 5:54am

Yes makes sense that it cant optimize then, but woudn’t this be the better aproch then

int c = fb.capacity()/2

for (int i = 0; i < c; i++)

As far as I know the i++ can be done in one operation,while a i+=2 needs two operations.

But I guess this kind of optimisation makes no impact at all anyway, I don#t think the method is called that often, so I guess we should use the best readable/understandable version.

sbook · February 16, 2012, 6:01am

@EmpirePhoenix said:But I guess this kind of optimisation makes no impact at all anyway, I don#t think the method is called that often, so I guess we should use the best readable/understandable version.

Yeah I don't think it'll ever make a detectable difference. Then again, this could be more optimal and as you wrote it doesn't really impact readability.

Jeff Atwood, exposer of logical traps.

tralala · February 16, 2012, 6:19am

i+=2 means i = i + 2 which needs only 1 operation.
you cant compare byte-code and understand which is faster. I tried in a big project a tool that automatically optimizes java bytecode, result same fps. Which means that java hotspot generally does its job better than static optimizing of source code.

Empire_Phoenix · February 16, 2012, 6:51am

not really, at least in bytecode it would be a read two values to stack, add them, write new value to varaible

te other option would be :read varaible to stack, increase, write out, since ++ has a own bytecode operation.
complelty agree, and even if, the next java version might complelty kill that when the jit works in another way, and optimizes better by default.

jonathan_pikalek · February 16, 2012, 3:54pm

Given that I wasn’t originally going through the code looking for tweaks (I actually wanted a quick & dirty means to flip textures on demand (and I found what I needed for that)), I didn’t expect to generate this much discussion. I briefly thought about looking at the byte code, but frankly, was too lazy. In light of the contributions everyone else has made to the discussion however, I felt obligated to dig a little deeper myself & so I ran some tests of the form:[java] public long doTest01(long reps){

Vector2f scaleFactor = new Vector2f(1, 1);

final long startTime = System.nanoTime();

final long endTime;

for(long a=0; a<reps; a++){

VertexBuffer tc = spriteMesh.getBuffer(Type.TexCoord);

FloatBuffer fb = (FloatBuffer) tc.getData();

fb.clear();

for (int i = 0; i < fb.capacity() / 2; i++){

float x = fb.get();

float y = fb.get();

fb.position(fb.position()-2);

x *= scaleFactor.getX();

y *= scaleFactor.getY();

fb.put(x).put(y);

}

fb.clear();

tc.updateData(fb);

}

endTime = System.nanoTime();

return endTime - startTime;

}[/java]

Variation 01: [java]for (int i = 0; i < fb.capacity() / 2; i++){[/java]

Variation 02: [java]for (int i = 0; i < fb.capacity(); i+=2){[/java]

Variation 03: [java]int max = fb.capacity() / 2;

for (int i = 0; i < max; i++){[/java]

Variation 04: [java]int max = fb.capacity() >> 1;

for (int i = 0; i < max; i++){[/java]

Since I’ve been burned in the past by doing 1 time runs, I harnessed them as follows:[java] System.out.println(doTest01(1000000));

System.out.println(doTest01(1000000));

System.out.println("

");
System.out.println(doTest02(1000000));
System.out.println(doTest02(1000000));
System.out.println("

");
System.out.println(doTest03(1000000));
System.out.println(doTest03(1000000));
System.out.println("

");
System.out.println(doTest04(1000000));
System.out.println(doTest04(1000000));
System.out.println("

");
[/java]
And here were my results:
215869634
211115082

199974337
203226892

201448184
202007579

204168997
201960206

Because it's just what I happened to have on hand, spriteMesh is a 2 triangle quad. Presumably, the results would become somewhat more differentiated as the number of vertices increases. Admittedly, this is weak testing due to the low number of reps & the unvaried nature of the meshs, but it more or less demonstrates that I've probably taken up your time rather needlessly.

Sploreg · February 16, 2012, 4:01pm

It’s never a waste of time to check for performance improvements. Your look into this is appreciated.

sbook · February 16, 2012, 4:10pm

@jonathan-pikalek said:Given that I wasn't originally going through the code looking for tweaks (I actually wanted a quick & dirty means to flip textures on demand (and I found what I needed for that)), I didn't expect to generate this much discussion. I briefly thought about looking at the byte code, but frankly, was too lazy. In light of the contributions everyone else has made to the discussion however, I felt obligated to dig a little deeper myself & so I ran some tests of the form

Curiosity brings about greatness ;)

pspeed · February 16, 2012, 4:17pm

For the future, depending on iterations, usually 3-5 times per test is best to get things to settle down. This is just a rule of thumb from experience of lots of microbenchmarking. That’s about the point where things stop fluctuating.

Also, usually I would run all tests together in a loop… rather then letting them each run sequentially. Otherwise, you penalize the early tests because they don’t get the same system-level hotspotting as the later tests.

pspeed · February 16, 2012, 4:18pm

By that last part, I mean:

loop five times {

doTest1

doTest2

doTest3

doTest4

}

jonathan_pikalek · February 16, 2012, 7:47pm

Thank you @Sploreg and @sbook for the positive responses; I agree with you, but in the future I’ll try to remember to test before rather than after.

@pspeed That’s a great tip. Really, I figured I should have separated the tests into independent executables; your suggestion provides a nice middle ground. Also, I’ve never done any serious empirical testing with VM based languages, so I also appreciate you and @tralala bringing hotspot biasing to my attention.

tralala · February 17, 2012, 12:36pm

I created my test to confirm the results.

[java]

package tests.performance.other;

import com.jme3.util.BufferUtils;

import java.nio.FloatBuffer;

public final class TestPerformance

{

private static final FloatBuffer fb = BufferUtils.createFloatBuffer(1024);

private final int times;

private static double random;

public TestPerformance(int times)

{

this.times = times;

}

public void doSomething()

{

for (int i = 0; i < fb.capacity() / 2; i++)

{

random += Math.random();

}

}

public final void doTest()

{

for (int i = 0; i < times; i++)

{

doSomething();

}

long time = System.currentTimeMillis();

for (int i = 0; i < times; i++)

{

doSomething();

}

System.out.println(System.currentTimeMillis() - time);

}

public static void main(String[] args)

{

final int TIMES = 100000;

TestPerformance test = new TestPerformance(TIMES);

test.doTest();

System.out.println("random "+TestPerformance.random);

}

}

[/java]

jvm Server: reps = 100000, 500000

Test01 : 2968, 14797

Test02 : 2906, 14657

Test03 : 2782, 13860

Test04 : 2766, 13813

pspeed · February 17, 2012, 4:10pm

Hmmm… that test code does not match the output. There’s something missing, I guess. What exactly were you confirming?

tralala · February 17, 2012, 7:31pm

from oracle hotspot manual :

The server compiler likes a loop with an int counter (int i = 0), a constant stride (i++), and loop-invariant limit (i <= n).

Make your array be loop invariant, typically by loading it into a local variable.
Use at most simple linear expressions to index each array.

https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination

pspeed · February 17, 2012, 7:36pm

Yes, I understand that… and of course it’s true.

I just didn’t understand how your posted code was testing that… and the code didn’t match the output you showed anyway. Just curious, really.

sbook · February 17, 2012, 7:40pm

@pspeed said:I just didn't understand how your posted code was testing that... and the code didn't match the output you showed anyway. Just curious, really.

+1, trying to get a handle on what that code did. The doSomething() method does the same "something" on every run :)

tralala · February 17, 2012, 7:44pm

the doSomething does :

for (int i = 0; i < fb.capacity() / 2; i++)

the rest is code that to ensure it doesn’t get inlined.