Speeding up FastMath.floor(float)

Hey guys,

recently I was reviewing the code that creates generated textures (during my work on sky loading in blender importer).
I started to think what to do to speed it up.
Since I do not (yet) have any improved algorithms I decided to bring forward some simple ‘technical’ optimisations.

I looked into Math.floor() method and I think it is really complicated.
This method is used in FastMath.floor().

What do you think about replacing current implementation of FastMath.floor() with this code:
[java]
private static float floor(float a) {
if (a >= 0) {
return (int) a;
} else {
if (a == (int) a) {
return (int) a;
}
return (int) a - 1;
}
}
[/java]

At the end of this post there is a shor application that tests it.
It compares the speed and the validity of the results.
It looks like the speed improvement is about 25% - 35% depending on where the loop with my method is placed :slight_smile: (java does some runtime optimisations during execution of the bytecode).
But even if it is (the worst case) 20% improvement - I think it is wotrh trying.
The method is heavily used during generated textures loading (loops can be repeated milions of times if the texture is large).

Here is the test code. Proper computations of this method are most important so if anyone know of any case that would cause errors please let me know.
Anyway it is a minor change so I could live without it :wink:

[java]
import java.util.Random;

public class FloorTest {
private static final int LOOP_COUNT = 60000000;

public static void main(String[] args) {
    float[] randoms = new float[LOOP_COUNT];

    float[] floors1 = new float[LOOP_COUNT];
    float[] floors2 = new float[LOOP_COUNT];
    long time;
    
    System.out.println("Generating random numbers.");
    Random r = new Random(0);
    for (int i = 0; i < LOOP_COUNT; ++i) {
        randoms[i] = (r.nextFloat() - 0.5f) * 2000;
    }
    
    time = System.currentTimeMillis();
    for (int i = 0; i < LOOP_COUNT; ++i) {
        floors1[i] = (float) Math.floor(randoms[i]);
    }
    System.out.println("JME floor: " + (System.currentTimeMillis() - time));

    time = System.currentTimeMillis();
    for (int i = 0; i < LOOP_COUNT; ++i) {
        floors2[i] = FloorTest.floor(randoms[i]);
    }
    System.out.println("MY floor: " + (System.currentTimeMillis() - time));

    System.out.println("Comparing the results.");
    for (int i = 0; i < LOOP_COUNT; ++i) {
        if (floors1[i] != floors2[i]) {
            System.err.println("Inequal values: " + floors1[i] + " != " + floors2[i] + ". Input value: " + randoms[i]);
        }
    }
    System.out.println("Comparing completed.");
}

/**
 * The method to test.
 */
private static float floor(float a) {
    if (a >= 0) {
        return (int) a;
    } else {
        if (a == (int) a) {
            return (int) a;
        }
        return (int) a - 1;
    }
}

}
[/java]

Looks reasonable. It’s a shame there is no easy way to get rid of that second branching…subtracting 0.999999 instead of the comparison then -1 might work but I have a feeling there might be holes once float accuracy issues hit.

Just a minor thing before I really get into it but the micro-benchmark is invalid for lots of reasons:

  1. it doesn’t warm up first. You should run the tests at least 5 times before sampling values.
  2. it uses the millis timer instead of the nano timer so your timings could be a full 10 ms off in either direction… or up to 20 ms off in total (more on some platforms).
  3. make sure you use a loop count enough so that the loop takes 10 or more seconds to run.
  4. put your floor method as a static method in a different class. The private method is likely inlined plus you also avoid the method lookups, etc…

I’d be very surprised if there were any difference in performance over all… certainly I’d be surprised if this had any impact at all on the generated texture stuff.

Also, your method comes with pretty severe limitations that would need to be documented. There are a whole bunch of floats that this will just plain fail for. Float can represent much higher and lower numbers than int can. Maybe they don’t come up in the generated texture case… but then this isn’t a general purpose solution and you can tailor a floor to that use-case.

It’s still be tremendously surprised if it matters at all as I suspect that Math.* routines are some of the first to get converted to native code.

Well the default floor ues double, wich would explain most of the difference.

Also at least in openjdk that method seems to be native (read jni? or similar)
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/Math.java#Math.floor(double)
->
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/StrictMath.java#StrictMath.floor(double)
and since hotspot and openjdk share much codebase, this might be the same in oracles jvm. Both togehter could explain a up to 30% improvement.

Open is the question, what does your method do if you encouter a Float.Nan or Positive/Negative infinity?
The accuracy problems are not that problematic, as the original itself is also a but unclean when passing numbers that cannot be represented anymore as a integer.

Btw the core problem goes deeper
http://stereopsis.com/FPU.html#convert

The lack of inlining could explain all of the difference. The JDK does most floating point operations in double because that’s generally what the hardware is supporting. Any time you do float math instead of double, there is the potential of a little overhead related to that.

Anyway, you’d be surprised how expensive a method dispatch is in this case versus inlining the code. Especially when the code itself is doing almost nothing.

The accuracy problems are not that problematic, as the original itself is also a but unclean when passing numbers that cannot be represented anymore as a integer.

Maybe you misunderstood what I was talking about regarding “accuracy” problems. The version above will fail on any float > max int and any float < min int. Period. It will return totally erroneous values.

He can change his test to compare his version with the Math.floor() version using this number:
float f = Integer.MAX_VALUE;
f++;
Or:
f *= 2;

…see what happens.

Not to mention… if this actually has a real measurable impact on the performance of generated textures then there is something super wrong with the generated textures code.

and what about bitwise operations? something like return float | float, but his is only valid for float > 0

OK looks like it would be better to drop the issue.

First: because it is really complicated and I wouldn’t like to risk any errors in such basic math operation.
Second: I changed the FastMath code locally and went through the importer code to make sure it is used everywhere. It seems that the change did not speed it up significantly.

@pspeed said: Not to mention... if this actually has a real measurable impact on the performance of generated textures then there is something super wrong with the generated textures code.

The generated textures code is really heavy - I agree.
I have already found some more websites about it and will try to find out if there are other ways of doing the same.
I will improve the current code at the moment. I will try to change the algorithm later if I can.

Cheers :wink: