Vectors and hash are not unique

Hi. Interesting problem with Vector3f. The hashCode implementation does not adhere to the hash-equals contract in that it does not return a unique hash code for different objects. Here are the test cases and a possible solution. As you can see, (1, 1, 1), (-1, 1, 1) and (1, -1, 1) all return the same hash. The custom hash function returns a unique hash for these cases, but it takes more computation.

I would recommend one of the following: a) remove the equals and hashCode methods from Vector3f, b) add documentation stating that the hash code is not unique, or c) use a more elaborate algorithm to compute the hash code.

PS: this is not a bug report or help request; I shouldn’t be using Vector3f in a hash map anyway. It was just a quick and dirty solution for some debugging.

import java.util.stream.Stream

import org.junit.jupiter.params.ParameterizedTest
import org.junit.jupiter.params.provider.MethodSource

import com.jme3.math.Vector3f

import static org.junit.jupiter.params.provider.Arguments.of

class Vector3fTest {

    // Current behavior: three distinct vectors share the hash -627115336.
    static test_hashCode() {
        Stream.of(
                of(new Vector3f(0, 0, 0), 2030264), //
                of(new Vector3f(1, 1, 1), -627115336), //
                of(new Vector3f(-1, 1, 1), -627115336), //
                of(new Vector3f(1, -1, 1), -627115336), //
                of(new Vector3f(1, 1, -1), 1520368312))
    }

    @ParameterizedTest
    @MethodSource
    void test_hashCode(def v, def expected) {
        assert v.hashCode() == expected
    }

    // Proposed alternative: each of the five vectors gets a distinct hash.
    static test_hashCode_custom() {
        Stream.of(
                of(new Vector3f(0, 0, 0), 59), //
                of(new Vector3f(1, 1, 1), 1098907707), //
                of(new Vector3f(-1, 1, 1), 1082130491), //
                of(new Vector3f(1, -1, 1), 1065353275), //
                of(new Vector3f(1, 1, -1), 1084227643))
    }

    @ParameterizedTest
    @MethodSource
    void test_hashCode_custom(def v, def expected) {
        assert toHash(v) == expected
    }

    // Fold the three components into one value with calcN, then hash that.
    static int toHash(def v) {
        final int PRIME = 59;
        int result = 1;
        float f = calcN(calcN(v.x, v.y), v.z);
        result = (result * PRIME) + (Float.hashCode(f));
        return result;
    }

    // Cantor-style pairing of two values into one.
    static float calcN(float x, float y) {
        return (x + y) * (x + y + 1) / 2 + y;
    }
}

I think you misunderstand how hash codes work.

It’s perfectly fine for different objects to return the same hash code. In fact, it’s impossible to keep creating distinct objects without eventually hitting a collision. (Just think of how the hash code for Long must work.)

There are only 2^32 hash codes but there are waaaaaaaay more Vector3fs than that.
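To see the Long example concretely, here is a quick sketch of mine (using the standard Long.hashCode(long) from java.lang, which folds the 64 bits down to 32): distinct longs collide trivially.

public class LongHashCollisionDemo {
    public static void main(String[] args) {
        // Long.hashCode(value) is (int) (value ^ (value >>> 32)), per the Javadoc.
        // 0L and -1L are obviously different values but share the hash code 0.
        System.out.println(Long.hashCode(0L));   // 0
        System.out.println(Long.hashCode(-1L));  // 0
        System.out.println(Long.valueOf(0L).equals(Long.valueOf(-1L))); // false
    }
}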

The contract says if two objects are equals()=true then they MUST return the SAME hash code. It makes no statement at all about the uniqueness of hash codes. In fact, it fits the contract to just return 0 from hashCode().

Good hashes make hashtables more efficient. That’s it.
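To make the "it fits the contract to just return 0" point concrete, here is a minimal sketch (a made-up Point class, nothing to do with JME) where hashCode() always returns 0. It satisfies the contract and the HashMap still returns the right values; it just degrades every lookup in that map to a linear scan of one bucket.

import java.util.HashMap;
import java.util.Map;

public class ConstantHashDemo {

    // Legal but terrible: equal points get equal hash codes (required),
    // unequal points also get equal hash codes (allowed, merely slow).
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }

        @Override
        public int hashCode() { return 0; }
    }

    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 1), "a");
        map.put(new Point(-1, 1), "b");
        System.out.println(map.get(new Point(1, 1)));  // a
        System.out.println(map.get(new Point(-1, 1))); // b (correct, just O(n) per lookup)
    }
}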

Maybe I didn’t convey my comment correctly. I know that hashes are not perfectly unique. In a perfect world they would be.

At the very least it’s not expected: (1, 1, 1) != (-1, 1, 1) != (1, -1, 1), therefore the hashes should be different. If a behavior is not expected, then I would either change it to match the expectation or document it.

This is your misunderstanding.

If a.equals(b), then a.hashCode() == b.hashCode()

That’s the only rule. There is no rule at all about the case where a.equals(b) is false. None.

They cannot ever be. Never ever unique, unless the number of possible object instances is < 2^32.

They are never intended to be unique.

In fact, simply being unique is not even a good hash!

Hash codes are meant to find a good bucket in a hash table. That’s it. They should be well-distributed but not unique. Lots of deep math goes into making sure certain hashes are well distributed under common use-cases (see String.java). Collisions should be minimized but can never be avoided. (Edit: and note that ‘collision’ in this case isn’t even about the whole hash code but just the interesting bits that a particular hashtable might be interested in.)
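(For a flavor of what I mean by that, String’s hash is basically a 31-based polynomial over the characters; a simplified restatement of it, written by me, looks roughly like this:)

// Simplified form of String.hashCode(): h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
// 31 is an odd prime that is cheap to multiply by and historically gave a decent spread
// for common strings.
static int stringHash(CharSequence s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}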

No, this is wrong. Again, no rule about this. You are expecting WAY too much from hash codes.

My guess is that you are attempting to use hash codes for something they were not intended to be used for.

Edit:

It is fine to use Vector3f in a hash map. It does not matter that those three values end up in the same bucket. The hash map will function correctly. Collisions happen. They are unavoidable. Depending on the size of your hash map, it’s probably only using a small number of the bits anyway.

2 Likes

To add to pspeed’s answer, a hash is supposed to do these things:

  1. Randomly selected objects are approximately evenly split between hashes (optional but highly desirable)
  2. The hash is fast (optional but highly desirable)
  3. Objects which are equal have equal hashcodes (mandatory)

What it’s not required to do:

  • Avoid hash collisions for hand-picked values
  • Non-equal values have non-equal hashcodes

Points 2 and 3 are not what’s being discussed, so I wrote this program to try to show that it also achieves (1), and it does:

import java.util.Random;

import com.jme3.math.Vector3f;

public class Vector3fHashDistribution {

    public static void main(String[] args) {
        int noOfHashBins = 10;

        int[] bucketHits = new int[noOfHashBins];

        Random rnd = new Random(1);

        // Hash 10,000 random vectors with components in (-10000, 10000)
        // and count how many land in each of the 10 bins.
        for (int i = 0; i < 10000; i++) {
            Vector3f item = new Vector3f(
                    (rnd.nextBoolean() ? -1 : 1) * rnd.nextFloat(10000),
                    (rnd.nextBoolean() ? -1 : 1) * rnd.nextFloat(10000),
                    (rnd.nextBoolean() ? -1 : 1) * rnd.nextFloat(10000));

            int hash = item.hashCode();
            int bin = Math.abs(hash % noOfHashBins);
            bucketHits[bin]++;
        }

        for (int i = 0; i < bucketHits.length; i++) {
            System.out.println(i + "\t" + bucketHits[i]);
        }
    }
}

To be clear, the fact that (1, 1, 1), (-1, 1, 1) and (1, -1, 1) have the same hash is absolutely fine (mildly surprising, but not a problem).

I shouldn’t use Vector3f in a hash map anyway

Why not? That’s perfectly fine. I can’t imagine it coming up very often that you’d want to do that, but if it has, there is no problem.

Aside

Interestingly, I did find that Vector3f doesn’t bucket very well when all the values are small, tending to put 75% of all random numbers in the even buckets. (My first version of this didn’t have rnd.nextFloat(10000), it just had rnd.nextFloat(), but all hashCode functions have those sorts of trade-offs. That said, Vector3d’s hashCode, which is very similar to Vector3f’s, has special code in it to remove that effect.)

2 Likes

Thinking about this statement after some sleep, I wonder if you were seeing some problem using Vector3f in a hashmap and (incorrectly) thought it was related to hash collisions.

This code will work fine with each key being “unique” in the hashmap:

Vector3f key1 = new Vector3f(1, 1, 1);
Vector3f key2 = new Vector3f(-1, 1, 1);
Vector3f key3 = new Vector3f(1, -1, 1);
HashMap<Vector3f, String> test = new HashMap<>();
test.put(key1, "key1");
test.put(key2, "key2");
test.put(key3, "key3");
// All three keys share a hash code but still map to their own values.
if( !test.get(key1).equals(test.get(key2)) ) {
    System.out.println("Hashmaps are working as expected.");
}
1 Like

Whenever I use Vector3f in a hash collection, I “standardize” the vectors so they don’t contain -0 components. For some applications, this substantially reduces the chance of hashing collisions.
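Something along these lines (a sketch of a hypothetical helper, not an actual JME method): the point is that Float.floatToIntBits(0.0f) and Float.floatToIntBits(-0.0f) differ, so a -0 component would otherwise hash differently from +0.

import com.jme3.math.Vector3f;

public final class VectorKeys {

    // Hypothetical helper: map any -0.0f component to +0.0f before using the
    // vector as a hash key. (-0.0f == 0f is true, so the ternary catches both.)
    public static Vector3f standardize(Vector3f v) {
        return new Vector3f(
                v.x == 0f ? 0f : v.x,
                v.y == 0f ? 0f : v.y,
                v.z == 0f ? 0f : v.z);
    }
}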

1 Like

I understand this topic is more about “Vectors hash optimization”,

since even if the hashes are not unique, it just means the map will need to do an additional O(n) pass with equals and nothing more.

There is no BUG, just an optimization issue at most.

I’m not sure myself whether it’s sometimes faster to have keys share a bucket, or whether it’s always better to have them unique per key.

I didn’t test it, so I don’t know whether there is a case where a bucket shared by several keys improves the speed of a map, or whether a unique hash per key is always optimal. Maybe someone here could tell me, since I don’t know if a shared bucket could ever improve speed.

My guess is that unique hashes will always be faster than non-unique ones (not counting size-rehashing, of course).

FALSE.

A hash table is only using part of the bits of a hash. You could have 100% unique hashes that all find themselves in the same bucket.

Anyway, there are 4,294,967,296 possible unique hash code values.

There are 79,228,162,514,264,337,593,543,950,336 possible unique Vector3f values.

No matter what scheme you choose, there will be 79,228,162,514,264,337,589,248,983,040 possible collisions at BEST. That’s a lot. And there is nothing you can do about it.

Being well-distributed (especially in the most-used bits) is the most important feature of a hash meant for hashtables.

(FWIW, by my in-head math, the OP’s version of the hash has many trivial collisions as well… for example, I think [9, -10, 0] and [10, -10, 0] produce the same hash, as would [9, -10, 1] and [10, -10, 1], and so on for any shared value of z…)
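(Plugging those numbers into the calcN pairing step from the OP’s test class above confirms it:)

public class PairingCollisionCheck {

    // The pairing step from the OP's toHash() above.
    static float calcN(float x, float y) {
        return (x + y) * (x + y + 1) / 2 + y;
    }

    public static void main(String[] args) {
        // calcN(9, -10)  = (-1 * 0) / 2 + (-10) = -10
        // calcN(10, -10) = ( 0 * 1) / 2 + (-10) = -10
        System.out.println(calcN(9, -10));   // -10.0
        System.out.println(calcN(10, -10));  // -10.0
        // Same intermediate value, so the final hash collides for any shared z.
    }
}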

It would be pretty much impossible to write a better hash without also (at minimum) knowing how hash tables work, doing a bunch of common usage analysis, and a mess of statistical testing. And then it will still be ‘less optimal’ in some common use-cases.

I find richtea’s analysis really interesting. I’d also be curious to know how the 0…1 version of Vector3fs fared when normalized. Positions and direction vectors seem to be the most common use-cases for Vector3f in JME. Positions seem to hash pretty well distributed, so I wonder about normalized directions.
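Adapting richtea’s program above to that case would be straightforward; here is a sketch of the variation (I haven’t run it, so I won’t claim anything about how it actually distributes):

import java.util.Random;

import com.jme3.math.Vector3f;

public class NormalizedVectorBuckets {

    public static void main(String[] args) {
        int noOfHashBins = 10;
        int[] bucketHits = new int[noOfHashBins];
        Random rnd = new Random(1);

        for (int i = 0; i < 10000; i++) {
            // Random vector in the unit cube, then normalized, as a stand-in
            // for the "direction vector" use-case.
            Vector3f item = new Vector3f(
                    rnd.nextFloat() * 2 - 1,
                    rnd.nextFloat() * 2 - 1,
                    rnd.nextFloat() * 2 - 1).normalizeLocal();

            int bin = Math.abs(item.hashCode() % noOfHashBins);
            bucketHits[bin]++;
        }

        for (int i = 0; i < bucketHits.length; i++) {
            System.out.println(i + "\t" + bucketHits[i]);
        }
    }
}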

1 Like

I didn’t specifically mean Vector3f here. I fully understand that buckets have a limit. I just meant that if we stay within the limit, then what I said would apply, right?

For Vector3f there are too many possibilities that go over the limit; I understand that. There is no way to keep each value in its own bucket, but like you said, someone who knows how hash tables work could at least optimize it a bit more. In general, though, there’s no sense in trying.

Depending on what you mean by “what I said” and what you mean by “within limit”, no.

Think of the case of Integer which can have perfectly unique hash codes. For a given hashtable and a given set of Integer keys, you can still have 100% collisions because of the bits chosen for the buckets. A hashtable is not using all of the 32 bits of the hash code.

In fact, HashMap.java does some key mangling in order to improve bit distribution in common cases… and in my reading it actually greatly reduces the uniqueness of the hash values themselves.
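(For reference, in OpenJDK’s HashMap from Java 8 onward, that mangling looks roughly like this: the high half of the hash is XORed into the low half, because the bucket index is later computed from only the low bits.)

// HashMap.java (OpenJDK, Java 8+): spread higher bits downward, because the bucket
// index is later computed as (table.length - 1) & hash, i.e. from the low bits only.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}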

So uniqueness is not really the most desirable trait.

1 Like

thanks.

So there are even two reasons for it: not just which bits are chosen, but also the magic happening inside HashMap that reduces the uniqueness of the hash.

Sounds like a topic where trying to optimize usage is pointless anyway.

And by the way, since I’ve just looked at the source code again for the first time in probably 10 years:

If you want to learn how hashtables (the CS general concept) work then Hashtable.java is a better learning example. It implements a purer form of hashtables.

The get() method is exactly as straightforward as one might expect:

    public synchronized V get(Object key) {
        Entry<?,?> tab[] = table;
        int hash = key.hashCode();
        int index = (hash & 0x7FFFFFFF) % tab.length;
        for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
            if ((e.hash == hash) && e.key.equals(key)) {
                return (V)e.value;
            }
        }
        return null;
    }

Translation:

  • mask off the sign bit and reduce the hash modulo the table length to get a bucket index
  • iterate over the collisions in the bucket until we find the match

So for a small hash table it may only be using 5-6 bits of the hash code.
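For example (a quick sketch of mine, using Hashtable’s default capacity of 11), hash codes that differ by a multiple of the table length land in the same bucket:

public class BucketIndexDemo {
    public static void main(String[] args) {
        int tableLength = 11; // Hashtable's default initial capacity

        // Very different hash codes, identical bucket index (all map to bucket 3).
        for (int hash : new int[] { 3, 14, 1_000_000_004 }) {
            System.out.println(hash + " -> bucket " + ((hash & 0x7FFFFFFF) % tableLength));
        }
    }
}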

HashMap.java has a bunch of other code to better optimize large hash maps, linked hash maps, etc… It makes it more confusing to read if you just want to know how hashtables work.

2 Likes

Yup, I see. Thanks.

So I just need to keep in mind that initialCapacity and loadFactor are the only knobs worth touching for optimization purposes. At least for small maps, I understand that if I keep a low loadFactor, the map will create more buckets, so there is less chance of a conflict (and a lower chance of an O(n) scan within a single bucket); see the sketch below.

And for very big collections, as I see it, the map can do a lot of the magic itself (“linked hash maps, etc.”).
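Something like this, I guess (just example numbers, nothing tuned):

import java.util.HashMap;
import java.util.Map;

import com.jme3.math.Vector3f;

public class TunedMapExample {
    public static void main(String[] args) {
        // Example numbers only: a larger initial capacity and a lower load factor
        // mean more buckets sooner, so fewer entries share a bucket on average.
        Map<Vector3f, String> positions = new HashMap<>(256, 0.5f);
        positions.put(new Vector3f(1, 1, 1), "spawn");
        System.out.println(positions.get(new Vector3f(1, 1, 1)));
    }
}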

And probably, mostly, don’t worry about it. Hashtables are still considered O(1) even with their ‘worst case’ considerations.

In case OP has realized where their misunderstanding was and perhaps feels a little embarrassed to follow up to this thread… there is a very embarrassing reason I know a lot about this particular topic (and doubled-down on my CS knowledge in general).

Somewhere around 1997/1998, I was a fresh young Java developer on my “first big Java project” after having done a decade of hardcore C work. (I had dropped out of college in year 2 or so to do C work for startup companies but tried to supplement my CS knowledge where I could… but obviously there were giant gaps.) Java made it very easy to treat a lot of “advanced” data structures like a black box. I could quote Gang-of-Four software patterns like reading from the gospel but had never really looked into Hashtable or TreeMap, etc. beyond recognizing the terms of the implementation.

So “be me” and discover a bunch of hash code collisions in Long.java because I was abusing hash codes for a purpose they weren’t intended for. Run to team lead and explain the situation. Team lead (who should have known better) and I file a bug report with Sun’s Java bug database. Yes, we really did!!! We were curtly-but-politely told how hashtables work, and our bug was canceled within about an hour of filing it.

Since then I’ve tried to assume that I’m the one who’s wrong, and then do the research to prove or disprove that hypothesis… to hopefully avoid “whole Java community” levels of embarrassment.

But yeah, I still think about that 25 years later.

8 Likes