~4x Faster Quaternion, Vector multiplication

pspeed · February 26, 2016, 1:55am

In my tests with any reasonable amount of code around the mult, it was overall about a 10% improvement and a lot less accurate.

…unless you did everything in double. Then the accuracy holds.

destroflyer · February 26, 2016, 2:43am

The thing that I’m missing in this discussion is: why is everyone talking about replacing the old method with the new one instead of trying to offer both? This could be done similarly to the hardware skinning method, with something like setSuperNewFasterButSlightlyInaccurateMultiplicationEnabled(boolean).

Of course, this would mean a bit of a mess for the quaternion code (like having an interface for multiplication and just have the setEnabled method create an object and use that for the calculation). And I am totally fine with people wanting the core to stay as “pure” as possible, if I may phrase it that way. Also I might be underestimating how this would affect the overall performance and might even make it worse.
I’m not sure if offering a choice to the user would be great from a core perspective, but I’m surprised nobody is talking about it. I see that there are advantages and disadvantages, but it may come in handy for the 1 out of 1000 users. On the other hand… if someone needs this super performance, he might need to implement other stuff in a different way as well.

Anyways, if I had to choose one out of the two options, I would go with the old method. Performance is nice and all, but as long as it’s not really a big problem, a high accuracy should always be a main goal of core methods.

ia97lies · February 26, 2016, 6:19am

Fair enough but … read carefully @pspeed talks about 10% improvement and not 50 or even 100%. 10% is far away from super fast. It even sounds to me “not really worth the effort”.
Performance tuning is the devels right hand

pspeed · February 26, 2016, 7:42am

Yes, the indirection instead of a final method would definitely negate any benefits. Swappable code will always have an overhead associated with it and in this case the gain is tiny to begin with. It would be swamped by the other.

And note: my 10% figure was with code doing little else other than calling this mult. It was not only calling this mult, as it had some code around it. Still, the code around it was teeny tiny compared to the code around a general quaternion.mult in JME.

For 36,000 (yes thousand) calls, I was seeing somewhere between .3 and .4 millisecond improvement. I haven’t counted, but I suspect that JME calls Quaternion.mult(Vector3f) very few times a frame, depending on how animated things are. Like, I know it will be called when calculating the world transform for a Geometry (which is cached) but normal view space rendering is multiplied in shader (and with a 4x4 matrix).

Hardware skinning also moves the other case where I thought Quaternion.mult(vec3) might be used a lot out to the shader.

So, even if we are generous and say that somehow quat.mult(vec3) is called 10x per object… you’d have to have ~7000 objects in your scene before it made a millisecond’s worth of difference. Putting that into real numbers, if you happened to somehow get 7000 objects running at 1000 FPS… (How?!?) Then this performance improvement will give you back 1 frame per second.

Else, if you have 1000 objects at 100 FPS… this will give you nothing back… an almost immeasurably small amount of time per frame. 100 FPS still.

The_Leo · February 26, 2016, 12:08pm

There is already a performance test posted in this thread. The performance improvement for my machine is 50-70%, for

That is 90-110% faster for that machine.

Not true at all. I already posted a precision test. And the result was that mult15 is much more precise than mult60.

Here is PrecisionTest2:

The results are:

Results: (15,60,Draws): 50,41,9
mult15 is more precise
Average Error mult15: 8.683967590332032E-4
Average Error mult60: 0.0011390304565429687

The average error of mult15 is smaller than that of jme mult60.

The thing is, mult15 is both faster and more precise. On top of that, the Nvidia SDK uses mult15, so it is not some random untested code.
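For readers without the thread's attachments, here is a minimal sketch of the two kinds of method being compared. It is not the code from this thread: `rotateExpanded` is a fully expanded sandwich product q * v * conj(q), and `rotateCross` is the widely used cross-product form v' = v + w t + u x t with t = 2 (u x v). The class and method names are made up for illustration, and whether either matches mult60 or mult15 line for line is an assumption. As written, the cross-product form uses 18 multiplications; doing the doubling by addition would bring it to 15.

```java
/** Sketch: two algebraically equivalent ways to rotate v by a unit quaternion. */
final class QuatRotateSketch {

    /** Fully expanded q * v * conj(q), with q = (x, y, z, w). Returns {x, y, z}. */
    static float[] rotateExpanded(float x, float y, float z, float w,
                                  float vx, float vy, float vz) {
        float rx = w*w*vx + 2*y*w*vz - 2*z*w*vy + x*x*vx
                 + 2*y*x*vy + 2*z*x*vz - z*z*vx - y*y*vx;
        float ry = 2*x*y*vx + y*y*vy + 2*z*y*vz + 2*w*z*vx
                 - z*z*vy + w*w*vy - 2*x*w*vz - x*x*vy;
        float rz = 2*x*z*vx + 2*y*z*vy + z*z*vz - 2*w*y*vx
                 - y*y*vz + 2*w*x*vy - x*x*vz + w*w*vz;
        return new float[] {rx, ry, rz};
    }

    /** Cross-product form: t = 2 (u x v); v' = v + w t + u x t, with u = (x, y, z). */
    static float[] rotateCross(float x, float y, float z, float w,
                               float vx, float vy, float vz) {
        // t = 2 * (u x v)
        float tx = 2 * (y*vz - z*vy);
        float ty = 2 * (z*vx - x*vz);
        float tz = 2 * (x*vy - y*vx);
        // v' = v + w * t + (u x t); only valid as a rotation when q is unit length
        return new float[] {
            vx + w*tx + (y*tz - z*ty),
            vy + w*ty + (z*tx - x*tz),
            vz + w*tz + (x*ty - y*tx)
        };
    }

    public static void main(String[] args) {
        // 90 degree rotation about the Z axis, applied to the X axis.
        float s = (float) Math.sin(Math.PI / 4);
        float c = (float) Math.cos(Math.PI / 4);
        float[] a = rotateExpanded(0, 0, s, c, 1, 0, 0);
        float[] b = rotateCross(0, 0, s, c, 1, 0, 0);
        System.out.println(java.util.Arrays.toString(a));
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

Both calls should print a vector close to (0, 1, 0). A precision comparison would evaluate the same rotation in double and measure each float result against it.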

pspeed · February 26, 2016, 12:11pm

More precise as compared to what?

My test compared the results to pure trig and the JME mult was less jittery. Accuracy is slightly less important than jitter.

The performance numbers I’m talking about are in real world examples. In nearly 100% of the types of cases that quat.mult() is used, one can expect overall at best a 10% improvement. In actual frame time, it’s closer to 0.000001% improvement.

The_Leo · February 26, 2016, 12:24pm

More precise when compared to the actual expected mathematical result.
Regarding jitter. Running your test results in:

Jme 0.0008137615737373594
Fast 0.0010742516680835035

It is true that the jitter from this test is higher in the fast method, but the difference is 0.0002604900943461441. Now, since the test was run with radius = 1000, the difference is in the 7th significant place.

It is much easier to see the 7th significant place if you run the test with radius = 1f; the results are then
7.669068973882674E-7
1.0513749158171017E-6
The difference is
2.844680184288344E-7
Thus the jitter difference is negligible, since single-precision floats offer only ~6-7 significant decimal digits of precision.
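To put the "~6-7 digits" figure into numbers, the gap between adjacent float values can be queried directly with `Math.ulp`. The sketch below is only an illustration and is not part of the tests in this thread; the class and helper names are invented.

```java
/** Sketch: spacing between adjacent float values at the two radii discussed. */
final class FloatSpacingSketch {

    /** Gap between the given float and the next larger representable float. */
    static float spacingAt(float magnitude) {
        return Math.ulp(magnitude);
    }

    public static void main(String[] args) {
        System.out.println("spacing near 1f    = " + spacingAt(1f));    // 2^-23
        System.out.println("spacing near 1000f = " + spacingAt(1000f)); // 2^-14
    }
}
```

The spacing is about 1.19e-7 near 1 and about 6.1e-5 near 1000, so the jitter values reported above are within roughly an order of magnitude of the smallest step a float can represent at those radii.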

pspeed · February 26, 2016, 12:28pm

If you are close to a spaceship that is 1000 units from the planet it is orbiting, 0.001 is potentially more than a pixel.

Count the number of times quat.mult() is called per frame… scaled by object count. Is it worth the possibility of visible jitter for almost no gain per frame?

I’ll let others with push power decide.

The_Leo · February 26, 2016, 1:12pm

Here is a test to see if such things do indeed occur:

The teapot above is moved with mult15, the teapot below is moved with jme quaternions.
I have not noticed any visible differences between the two.

pspeed · February 26, 2016, 1:26pm

I’m going to let another core developer take over as this has already taken more of my time than I can devote… for a statistically 0 increase per frame.

However, if you want the test to truly illustrate that it doesn’t matter… put a second set of teapots above the others, rotating with standard JME rotation (rotate the center node they are attached to)… then bump the radius out to 10,000 just to be sure.

…and to whomever is testing it, run the app full screen.

For fun, you could wrap the quat.mult() calls in System.nanoTime() timing, add them up, and average them over frames. See what the average per-frame time spent in quat multiply is… as JME shouldn’t be calling it otherwise in this test. Double or triple it if you like to be sure… and compare it to the total frame time.
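A minimal sketch of that measurement, assuming a small hypothetical helper class (`MultTimer` is not a JME class) whose `time` method wraps each call and whose `endFrame` method is called once per update:

```java
/** Sketch: accumulate time spent in selected calls and average it per frame. */
final class MultTimer {
    private long totalNanos;
    private long frames;

    /** Runs one timed section (for example a single quat.mult call). */
    void time(Runnable section) {
        long start = System.nanoTime();
        section.run();
        totalNanos += System.nanoTime() - start;
    }

    /** Call once per frame, after all timed sections of that frame. */
    void endFrame() {
        frames++;
    }

    /** Average nanoseconds per frame spent inside timed sections. */
    double averageNanosPerFrame() {
        return frames == 0 ? 0.0 : (double) totalNanos / frames;
    }

    public static void main(String[] args) {
        MultTimer timer = new MultTimer();
        float[] sink = new float[1]; // keeps the stand-in work from being optimized away
        for (int frame = 0; frame < 100; frame++) {
            for (int i = 0; i < 1000; i++) {
                final float v = i;
                // Stand-in for quat.mult(vec); replace with the real call in a JME app.
                timer.time(() -> sink[0] += v * 0.5f);
            }
            timer.endFrame();
        }
        System.out.printf("avg %.1f ns per frame in timed sections%n",
                timer.averageNanosPerFrame());
    }
}
```

Note that `System.nanoTime()` has its own cost and limited resolution, so for a call this small the measured total will likely overstate the time spent in the multiply. That errs on the generous side, which fits the "double or triple it" spirit of the comparison against total frame time.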

Ben1 · February 26, 2016, 3:33pm

Haven’t been on this forum for nearly a year. I pop in just to take a quick look at what changed. I see this thread. Yep, same old story you guys

destroflyer · February 27, 2016, 11:13pm

Fair enough but … read carefully @pspeed talks about 10% improvement and not 50 or even 100%. 10% is far away from super fast. It even sounds to me “not really worth the effort”.

10% for such a low level method is definitely worth the “effort” (which isn’t a fitting word here)… IF it comes without disadvantages, which turns out is not the case here.
I’m sure everybody would love to take even a minimal performance gain in a core method for free, but yeah… it’s sadly not for free it seems^^

bubuche · February 27, 2016, 11:26pm

Another test that you could do:
create a lot of random (and invalid) quaternions, mult them, and check whether your method gives the right result even when the quaternions are not valid.
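A sketch of such a test follows. It assumes the reference method is a plain sandwich product q * v * conj(q) without normalization and the fast method is the cross-product form v + w t + u x t with t = 2 (u x v); both are assumptions about the implementations being compared, and the class and method names are invented. Under those assumptions the two results agree only for unit quaternions, and for any other q they differ by exactly (|q|^2 - 1) * v, which the loop checks numerically.

```java
import java.util.Random;

/** Sketch: compare two rotation formulas on deliberately non-normalized quaternions. */
final class NonUnitQuatCheck {

    static double[] cross(double[] a, double[] b) {
        return new double[] {
            a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]
        };
    }

    /** q * v * conj(q) = (w^2 - u.u) v + 2w (u x v) + 2 (u.v) u, with q = {x, y, z, w}. */
    static double[] rotateSandwich(double[] q, double[] v) {
        double[] u = {q[0], q[1], q[2]};
        double w = q[3];
        double uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];
        double uv = u[0]*v[0] + u[1]*v[1] + u[2]*v[2];
        double[] c = cross(u, v);
        double[] r = new double[3];
        for (int k = 0; k < 3; k++) {
            r[k] = (w*w - uu) * v[k] + 2 * w * c[k] + 2 * uv * u[k];
        }
        return r;
    }

    /** Cross-product form: t = 2 (u x v); v' = v + w t + u x t. */
    static double[] rotateCross(double[] q, double[] v) {
        double[] u = {q[0], q[1], q[2]};
        double[] t = cross(u, v);
        for (int k = 0; k < 3; k++) {
            t[k] *= 2;
        }
        double[] ut = cross(u, t);
        double[] r = new double[3];
        for (int k = 0; k < 3; k++) {
            r[k] = v[k] + q[3] * t[k] + ut[k];
        }
        return r;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double worst = 0;
        for (int i = 0; i < 1000; i++) {
            // Random components, deliberately NOT normalized.
            double[] q = {rng.nextGaussian(), rng.nextGaussian(),
                          rng.nextGaussian(), rng.nextGaussian()};
            double[] v = {rng.nextGaussian(), rng.nextGaussian(), rng.nextGaussian()};
            double n2 = q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3];
            double[] a = rotateSandwich(q, v);
            double[] b = rotateCross(q, v);
            for (int k = 0; k < 3; k++) {
                worst = Math.max(worst, Math.abs(a[k] - b[k] - (n2 - 1) * v[k]));
            }
        }
        System.out.println("largest deviation from (|q|^2 - 1) * v: " + worst);
    }
}
```

If the printed deviation stays at rounding-error level, the two formulas differ on invalid quaternions only by that scaling term, so any code that relies on the behaviour for non-unit input would see a change.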

Also, I will likely use this method (the 15 mult) in Xiaoyu (a game engine of my own).