~4x Faster Quaternion, Vector multiplication

Recently, I have been reading stuff about maths, mainly solving non linear equations with rotations and I have stumbled about a method to multiply vector by quaternion with 15 multiplications and 15 additions. This obviously got me curious so I’ve checked the jme Quaternion.mult(Vector3f, Vector3f) implementation.

To my suprise, it uses 60 multiplications and 21 additions! Below I post the method along with a comparison test. The result shows no significant difference in precision.

After testing this method yourself, could one of you put it on git. I would also propose to change the quaternion’s x, y, z, w from protected to public, just as a Vector3f has its x, y, z public, but that’s for another topic.

Comparison Test

The following test compares results of Quaternion multiplication between the jme’s Quaternion.mult, the 15 mult 15 add method and one other method that i found on internet.
From the tests it seems that the 15 mult 15 add method would be the optimal candidate for quaternion vector multiplication.

public static void quatMult(Quaternion q, Vector3f v, Vector3f s) {
        //15 mult, 15 add
        float x = q.getX();
        float y = q.getY();
        float z = q.getZ();
        float w = q.getW();
        
        float vx = y*v.z-z*v.y;
        float vy = z*v.x-x*v.z;
        float vz = x*v.y-y*v.x;
        vx += vx; vy += vy; vz += vz;
        s.x = v.x + w*vx + y*vz-z*vy;
        s.y = v.y + w*vy + z*vx-x*vz;
        s.z = v.z + w*vz + x*vy-y*vx;
    }
     public static void quatMult2(Quaternion q, Vector3f vec, Vector3f result){
        float x = q.getX();
        float y = q.getY();
        float z = q.getZ();
        float w = q.getW();
         
        float num = x * 2f;
        float num2 = y * 2f;
        float num3 = z * 2f;
        float num4 = x * num;
        float num5 = y * num2;
        float num6 = z * num3;
        float num7 = x * num2;
        float num8 = x * num3;
        float num9 = y * num3;
        float num10 = w * num;
        float num11 = w * num2;
        float num12 = w * num3;
        result.x = (1f - (num5 + num6)) * vec.x + (num7 - num12) * vec.y + (num8 + num11) * vec.z;
        result.y = (num7 + num12) * vec.x + (1f - (num4 + num6)) * vec.y + (num9 - num10) * vec.z;
        result.z = (num8 - num11) * vec.x + (num9 + num10) * vec.y + (1f - (num4 + num5)) * vec.z;
    }
    public static void main(String[] args) {  
        Quaternion q = new Quaternion();
        q.fromAngleNormalAxis(0.4547f, new Vector3f(0.5454f,0.21f,0.12f).normalizeLocal());
        //For the following quaternion 0 difference
        //q.fromAngleNormalAxis(FastMath.PI, Vector3f.UNIT_Z);
        
        Vector3f res1 = new Vector3f();
        Vector3f res2 = new Vector3f();
        Vector3f res3 = new Vector3f();
        Vector3f v = new Vector3f(0.55f,0.48f,0.54f);
        
        for(int i = 0; i < 10; i++) {
            q.mult(v, res1);
            quatMult(q, v, res2);
            quatMult2(q, v, res3);
        
            System.out.println("v " + v);
//            System.out.println("r1 " + res1);
//            System.out.println("r2 " + res2);
            System.out.println("quatMult " + res1.subtract(res2));
            System.out.println("quatMult2 " + res1.subtract(res3));
        
            v.set(FastMath.nextRandomFloat(),
                    FastMath.nextRandomFloat(),
                    FastMath.nextRandomFloat());
         }
    }
8 Likes

You might want to test it with coordinate values outside the range 0…1. Having all coordinates right around 0 is not very realistic and larger coordinates may exhibit larger errors if there are any.

Off the top of my head, another thing to look for would be “jittering” in the larger coordinates. Like, if you had a 100 sized cube rotating smoothly, would the corners move smoothly or would they stutter.

I’m not saying the new approach will have these errors but I’m saying that if it did have them then it is a significant problem and therefore should be tested.

As to changing the fields from protected to public, I actually think they should be private. If a user needs a Vector4f then they can use a Vector4f but a Quaternion is a specific thing that should mostly be treated like a black box. It is not a general storage thing like Vector3f.

Did you make any performance test?

@pspeed Thanks for that idea, I will test the method with larger coordinates as well.

@nehon Not yet. I still have to make a proper precision test. As from the title wrote ~4x faster since it uses ~4x less instructions (15 mult 15 add) compared to (60 mult, 21 add)

well, i’d be surprised it’s that linear…

2 Likes

Agreed… the method calls, fields references, etc. themselves are probably 1/4 to 1/2 the time it takes to do the actual math.

1 Like

agree with nehon and pspeed, moreover even if it’s slightly faster, i doubt that will be the bottleneck of your game.

I have finished writing the precision test. You can find it at bottom of this post. From 1000 iterations 585 times was mult15 more precise,328 times mult60 and 87 times they had same results.

In general, mult15 is more precise.

To compare the two methods I wrote QuaternionHD class that uses BigDecimal for precision. I implemented both the mult15 and mult60.
However, even when using BigDecimal mult15 and mult60 did not return same results. Thus, as a reference I used SageMath (snipped included below). Here is an example of mult15, mult60 and SageMath output as a reference:

Result15: 8660.6676565037429751590, -975.005134151990581755232, -5685.59870990733054087308
Result60: 8660.66773692422773365288, -975.005003052134989081516, -5685.59847736184420776348
SageMath: (8660.66765650374, -975.005134151991, -5685.59870990733)

It can be seen that mult15 is precise for every digit. Mult60 is precise for ~7 digits.

In the end, it is interesting that Mult60 still managed to be more precise 328/1000 times when done with floating points.

Next coming up: Performance Test.

Sage Math:

i = -0.005492622
j = 0.7388727
k = -0.30399343
w = 0.60135263

m = matrix([[1-2*j*j-2*k*k, 2*(i*j-k*w), 2*(i*k+j*w)],
            [2*(i*j+k*w), 1-2*i*i-2*k*k, 2*(j*k-i*w)],
            [2*(i*k-j*w), 2*(j*k+i*w), 1-2*i*i-2*j*j]])
v = vector([3001.57, 4893.099, 8679.4])
prod = m*v
print prod

Precision test:

1 Like

Testing in on my machine mult15 is 1.5-1.6x faster than mult60. It’s still a nice improvement, especially this can speed thing up Android apps.

Cool!
I’m a bit confused though how harshly the community is reacting to this! It’s a contribution to the engine, and some of the replies give the effect of, “ehh, even if it is a performance boost it’s probably not that much of a performance boost. It’s not even worth it really…” or “eh, you probably did something wrong.”

1 Like

Replacing core code with something that might be slightly faster but also might be error prone deserves scrutiny. I thought I was pretty clear that I just wanted more tests to verify that it wasn’t going to cause issues… not that I suspected that there were actual issues. But it’s all too common that “this math routine uses 900 less instructions” often equates to “this math routine has some odd behavior”. So verifying is important.

Whenever one faces a case where “This might save you 2 ms of your total application run time but could cause really strange behavior” (an exaggeration for effect) one should be cautious.

Now, having not seen the performance testing code it’s hard to say if it’s constructed correctly. Microbenchmarks are notoriously hard to get right… and in this case, unless JME’s Quaternion math was extracted into its own static method, it’s possible it JME’s version had an unfair advantage.

I’d be curious to see the performance test.

1 Like

Sorry I didn’t want to sound dissmissive. We are usually very cautious when replacing code so deep in the core, and especially this kind of code that was there even in JME2 times. And in case you didn’t get it…we suck at math. So we just can’t say if the method is correct just by looking the code… So we need tests that prove the validity.

Also I was pointing out that x4 less instruction wouldn’t necessarily mean x4 faster. Ifit’s 1.5 to 1.6 faster as Op is saying, it’s nice, but I’m not sure you’ll see the difference in a full fledge game. So the gain can be not worth the risk.

Anyway we’ll wait for the perf tests. Don’t be discouraged @The_Leo, it’s nice work, go ahead :wink:

Maths are magic !

Would you please provide more perf numbers ?
As for the precision, can’t we prove it’s more or less precise in a mathematical/theorical way ?
(I suck at maths too so I can’t help for that)

I hope it’s as good as it looks so it can find its way into the core. Thank you for the contribution sir.

@nomnom Thanks nomnom.
@pspeed Yes, that’s why I make these tests, to ensure it’s correctness.
@nehon It’s okay, it’s nothing personal. I did assume you knew the math behind the code.
@guillaumevds Thanks, that’s what I’ll answer below

Precision:
The two methods are mathematically equal. [1]

[1] http://en.wikipedia.org/wiki/Quaternions_and_spatial_rotation#Performance_comparisons

Regarding the precision tests I did, I was interested in the results of both methods on floating points, since otherwise they are mathematically equal.

I have read an article regarding floating point arithmetic [2]. A summary: the more operations you do with floating points, the more error you introduce. Multiplication introduces more error then addition. From this is follows that, 15 mult, 15 add generates less error then 60 mult, 21 add. To double check, I also wrote a precision test, which agreed with the previous point.

[2] http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Performance Test:

2 Likes

Nice, the test looks good to me. So the result is what you announced before? 1.5 times faster?

Yes, ~1.5x faster, on my machine it’s on average 1.6x, peak 1.7x.

I only looked far enough to see:
long millis = System.currentTimeMillis();

…and didn’t bother reading much further.

Microbenchmarks are bad already but even worse if not using nanoTime(). You can often chuck up to 10 ms right out the window just for possible precision error.

Also, you should warm up a few times before taking measurements… but generally, I run a full timing test 10 times or so and print each value… then I can see if the numbers are stabilizing. You also need to use enough inner iterations to get timings of several seconds per run or you aren’t really measuring the code you think you are.

It’s possible that it won’t change the results much but sometimes it really can by a lot.

Edit: and note, this is no longer about whether to apply the change or not as they seem good but it would be nice to know what the real perf difference is just for your bragging rights. :slight_smile:

2 Likes

Haha yes … benchmarking in java is a real pain.
(for those interested : http://www.ibm.com/developerworks/library/j-benchmark1/)

Maybe a profiler can help … ?

I must admit I once reimplemented an algorithm in C just to benchmark it :slight_smile:

Out of interest I wanted to run the test myself. I modified the test slightly to use nanoTime() instead and I ran this on two different machines to get some kind of comparison of hardware and OS as well.


Machine 1 (Laptop):
Kubuntu Linux 3.10.9-031009-generic x64
Intel Core i7-3840QM @ 2.8Ghz
24Gb RAM

Results for Machine 1:
Testing 10 times with 3000 iterations each time calculating for 100000 items
15: 2807ms 60: 5288ms Difference: 1.9x
15: 2671ms 60: 5633ms Difference: 2.1x
15: 2635ms 60: 5303ms Difference: 2.0x
15: 2644ms 60: 5149ms Difference: 1.9x
15: 2655ms 60: 5134ms Difference: 1.9x
15: 2772ms 60: 5236ms Difference: 1.9x
15: 2691ms 60: 5151ms Difference: 1.9x
15: 2752ms 60: 5111ms Difference: 1.9x
15: 2603ms 60: 5116ms Difference: 2.0x
15: 2831ms 60: 5310ms Difference: 1.9x


Machine 2 (Desktop):
Windows 8.1 x64
Intel Core i7-4790K @ 4.6Ghz
16Gb RAM

Results for Machine 2:
Testing 10 times with 3000 iterations each time calculating for 100000 items
15: 1953ms 60: 3284ms Difference: 1.7x
15: 1733ms 60: 3157ms Difference: 1.8x
15: 1719ms 60: 3151ms Difference: 1.8x
15: 1731ms 60: 3229ms Difference: 1.9x
15: 1719ms 60: 3219ms Difference: 1.9x
15: 1728ms 60: 3253ms Difference: 1.9x
15: 1838ms 60: 3198ms Difference: 1.7x
15: 1761ms 60: 3287ms Difference: 1.9x
15: 1772ms 60: 3164ms Difference: 1.8x
15: 1804ms 60: 3422ms Difference: 1.9x


3 Likes

Uh if nanotime are kind of integer, mult60 seems to be faster here, or i misunderstood smth

run:
=== Test ===
items : 100000
test  : 1000
reps  : 100 * 100000000
CSV output style :
rep, mult15 nanoS, mult60 nanoS
1, 2661495763006, 1550433945
2, 2663755856633, 1548449792
3, 2666013420188, 1548967155
4, 2668270708423, 1550862799
5, 2670532270205, 1551108074
6, 2672793386882, 1549030119
7, 2675051197411, 1549578804
8, 2677310563706, 1551400577
9, 2679572662489, 1549568619
10, 2681831384826, 1564110575
11, 2684108104612, 1551440088
12, 2686369287065, 1549955663
13, 2688630971585, 1551065525
14, 2690890396654, 1553016552
15, 2693153497842, 1556189067
16, 2695428301650, 1550908215
17, 2697687215641, 1552716962
18, 2699951381920, 1548136096
19, 2702210443062, 1550451758
20, 2704472532529, 1552128791
21, 2706735607960, 1551769732
22, 2708995104990, 1550265241
23, 2711255175365, 1553454866
24, 2713519272938, 1560833690
25, 2715790597471, 1549883541
26, 2718050399485, 1549853025
27, 2720307983455, 1548780910
28, 2722563672535, 1551670048
29, 2724824076899, 1547180525
30, 2727080443977, 1547397920
31, 2729336247521, 1549147991
32, 2731594993019, 1551386577
33, 2733854070867, 1550080396
34, 2736111746077, 1547730430
35, 2738369590225, 1549451629
36, 2740627725691, 1550073055
37, 2742887339766, 1549240135
38, 2745148104622, 1548779669
39, 2747409253152, 1551847974
40, 2749671474225, 1548688037
41, 2751930586458, 1549298157
42, 2754188726509, 1551633221
43, 2756451272924, 1572073042
44, 2758741089494, 1560642054
45, 2761045754070, 1582979510
46, 2763349986318, 1576166413
47, 2765651491175, 1561943768
48, 2767937129912, 1559017180
49, 2770209955395, 1558565368
50, 2772485949749, 1559250244
51, 2774761242998, 1559782011
52, 2777036414690, 1560521002
53, 2779312897863, 1559221335
54, 2781585062084, 1556754276
55, 2783856336448, 1559225160
56, 2786129706337, 1556191677
57, 2788400903662, 1558523360
58, 2790672722844, 1561515884
59, 2792949677593, 1560416670
60, 2795226328220, 1558549877
61, 2797496882236, 1554722257
62, 2799764929276, 1558673901
63, 2802038840437, 1556989478
64, 2804309357546, 1556167622
65, 2806579226320, 1559700595
66, 2808855412160, 1557097426
67, 2811127673308, 1560401399
68, 2813400784227, 1557486802
69, 2815674328147, 1557330888
70, 2817945206847, 1558698642
71, 2820219390445, 1555848396
72, 2822489947665, 1556065894
73, 2824760470528, 1558868233
74, 2827033434152, 1558999227
75, 2829306550997, 1558100024
76, 2831579705886, 1560612310
77, 2833854898619, 1561095084
78, 2836132665825, 1560569240
79, 2838414732139, 1588381550
80, 2840729209995, 1608635051
81, 2843080726454, 1567973712
82, 2845370800717, 1587039847
83, 2847670377069, 1559456747
84, 2849942936215, 1556386203
85, 2852211267820, 1559462337
86, 2854483264668, 1560312093
87, 2856757659347, 1559965068
88, 2859029563996, 1555804373
89, 2861298950306, 1558587959
90, 2863572873822, 1559440004
91, 2865845773140, 1554634106
92, 2868115636303, 1557924238
93, 2870386001275, 1562627036
94, 2872656368893, 1549884983
95, 2874913321420, 1547991048
96, 2877168805524, 1555411744
97, 2879434320528, 1550825102
98, 2881692877108, 1548654265
99, 2883949182555, 1548133331
100, 2886208645928, 1548878112
BUILD SUCCESSFUL (total time: 3 minutes 47 seconds)

mult60 seem to be ~1700 times faster than mult15, and also mult 15 take longer and longer time to calculate.
Im quite sure i did a mistake : http://pastebin.com/VLMxUWnp

Edit : This is impossible sorry : 2697687215641 ~= 2 000 sec (33 min) (build only took 3min for 100 time)