What is the best algorithm for overriding GetHashCode? – Dev

The best answers to the question “What is the best algorithm for overriding GetHashCode?” in the category Dev.

QUESTION:

In .NET, the GetHashCode method is used in a lot of places throughout the .NET base class libraries. Implementing it properly is especially important to find items quickly in a collection or when determining equality.

Is there a standard algorithm or best practice on how to implement GetHashCode for my custom classes so I don’t degrade performance?

ANSWER:

ValueTuple – Update for C# 7

As @cactuaroid mentions in the comments, a value tuple can be used. This saves a few keystrokes and more importantly executes purely on the stack (no Garbage):

(PropA, PropB, PropC, PropD).GetHashCode();

(Note: The original technique using anonymous types seems to create an object on the heap, i.e. garbage, since anonymous types are implemented as classes, though this might be optimized out by the compiler. It would be interesting to benchmark these options, but the tuple option should be superior.)

Anonymous Type (Original Answer)

Microsoft already provides a good generic HashCode generator: Just copy your property/field values to an anonymous type and hash it:

new { PropA, PropB, PropC, PropD }.GetHashCode();

This will work for any number of properties. It does not use boxing. It just uses the algorithm already implemented in the framework for anonymous types.

ANSWER:

I usually go with something like the implementation given in Josh Bloch’s fabulous Effective Java. It’s fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it’s better to pick a large prime to multiply by instead. Apparently 486187739 is good… and although most examples I’ve seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I’ve used numbers which apparently work well – but the initial value isn’t a prime. (The multiplication constant is prime though. I don’t know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is “good enough” and it’s incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we’re using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn’t actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}

Note that one thing to be aware of is that ideally you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding it to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

  • You can compute the hash code from fields that are not mutable; or
  • You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled – The Art of Hashing

ANSWER:

Using System.HashCode

If you are using .NET Standard 2.1 or above, you can use the System.HashCode struct. On earlier frameworks it is available from the Microsoft.Bcl.HashCode package. There are two methods of using it:

HashCode.Combine

The Combine method can be used to create a hash code, given up to eight objects.

public override int GetHashCode() => HashCode.Combine(this.object1, this.object2);

HashCode.Add

The Add method helps you to deal with collections:

public override int GetHashCode()
{
    var hashCode = new HashCode();
    hashCode.Add(this.object1);
    foreach (var item in this.collection)
    {
        hashCode.Add(item);
    }
    return hashCode.ToHashCode();
}

GetHashCode Made Easy

An alternative to System.HashCode that is super easy to use while still being fast. You can read the full blog post ‘GetHashCode Made Easy’ for more details and comments.

Usage Example

public class SuperHero
{
    public int Age { get; set; }
    public string Name { get; set; }
    public List<string> Powers { get; set; }

    public override int GetHashCode() =>
        HashCode.Of(this.Name).And(this.Age).AndEach(this.Powers);
}

Implementation

public struct HashCode : IEquatable<HashCode>
{
    private const int EmptyCollectionPrimeNumber = 19;
    private readonly int value;

    private HashCode(int value) => this.value = value;

    public static implicit operator int(HashCode hashCode) => hashCode.value;

    public static bool operator ==(HashCode left, HashCode right) => left.Equals(right);

    public static bool operator !=(HashCode left, HashCode right) => !(left == right);

    public static HashCode Of<T>(T item) => new HashCode(GetHashCode(item));

    public static HashCode OfEach<T>(IEnumerable<T> items) =>
        items == null ? new HashCode(0) : new HashCode(GetHashCode(items, 0));

    public HashCode And<T>(T item) => 
        new HashCode(CombineHashCodes(this.value, GetHashCode(item)));

    public HashCode AndEach<T>(IEnumerable<T> items)
    {
        if (items == null)
        {
            return new HashCode(this.value);
        }

        return new HashCode(GetHashCode(items, this.value));
    }

    public bool Equals(HashCode other) => this.value.Equals(other.value);

    public override bool Equals(object obj)
    {
        if (obj is HashCode)
        {
            return this.Equals((HashCode)obj);
        }

        return false;
    }

    public override int GetHashCode() => this.value.GetHashCode();

    private static int CombineHashCodes(int h1, int h2)
    {
        unchecked
        {
            // Code copied from System.Tuple a good way to combine hashes.
            return ((h1 << 5) + h1) ^ h2;
        }
    }

    private static int GetHashCode<T>(T item) => item?.GetHashCode() ?? 0;

    private static int GetHashCode<T>(IEnumerable<T> items, int startHashCode)
    {
        var temp = startHashCode;

        var enumerator = items.GetEnumerator();
        if (enumerator.MoveNext())
        {
            temp = CombineHashCodes(temp, GetHashCode(enumerator.Current));

            while (enumerator.MoveNext())
            {
                temp = CombineHashCodes(temp, GetHashCode(enumerator.Current));
            }
        }
        else
        {
            temp = CombineHashCodes(temp, EmptyCollectionPrimeNumber);
        }

        return temp;
    }
}

What Makes a Good Algorithm?

Performance

The algorithm that calculates a hash code needs to be fast. A simple algorithm is usually going to be a faster one. One that does not allocate extra memory will also reduce need for garbage collection, which will in turn also improve performance.

In C# hash functions specifically, you often use the unchecked keyword which stops overflow checking to improve performance.

Deterministic

The hashing algorithm needs to be deterministic i.e. given the same input it must always produce the same output.

Reduce Collisions

The algorithm that calculates a hash code needs to keep hash collisions to a minumum. A hash collision is a situation that occurs when two calls to GetHashCode on two different objects produce identical hash codes. Note that collisions are allowed (some have the misconceptions that they are not) but they should be kept to a minimum.

A lot of hash functions contain magic numbers like 17 or 23. These are special prime numbers which due to their mathematical properties help to reduce hash collisions as compared to using non-prime numbers.

Hash Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range i.e. it should output a wide range of hashes based on its inputs that are evenly spread. It should have hash uniformity.

Prevent’s DoS

In .NET Core each time you restart an application you will get different hash codes. This is a security feature to prevent Denial of Service attacks (DoS). For .NET Framework you should enable this feature by adding the following App.config file:

<?xml version ="1.0"?>  
<configuration>  
   <runtime>  
      <UseRandomizedStringHashAlgorithm enabled="1" />  
   </runtime>  
</configuration>

Because of this feature, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection and they should never be persisted.

Read more about this here.

Cryptographically Secure?

The algorithm does not have to be a Cryptographic hash function. Meaning it does not have to satisfy the following conditions:

  • It is infeasible to generate a message that yields a given hash value.
  • It is infeasible to find two different messages with the same hash value.
  • A small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value (avalanche effect).

ANSWER:

Here is my hashcode helper.
It’s advantage is that it uses generic type arguments and therefore will not cause boxing:

public static class HashHelper
{
    public static int GetHashCode<T1, T2>(T1 arg1, T2 arg2)
    {
         unchecked
         {
             return 31 * arg1.GetHashCode() + arg2.GetHashCode();
         }
    }

    public static int GetHashCode<T1, T2, T3>(T1 arg1, T2 arg2, T3 arg3)
    {
        unchecked
        {
            int hash = arg1.GetHashCode();
            hash = 31 * hash + arg2.GetHashCode();
            return 31 * hash + arg3.GetHashCode();
        }
    }

    public static int GetHashCode<T1, T2, T3, T4>(T1 arg1, T2 arg2, T3 arg3, 
        T4 arg4)
    {
        unchecked
        {
            int hash = arg1.GetHashCode();
            hash = 31 * hash + arg2.GetHashCode();
            hash = 31 * hash + arg3.GetHashCode();
            return 31 * hash + arg4.GetHashCode();
        }
    }

    public static int GetHashCode<T>(T[] list)
    {
        unchecked
        {
            int hash = 0;
            foreach (var item in list)
            {
                hash = 31 * hash + item.GetHashCode();
            }
            return hash;
        }
    }

    public static int GetHashCode<T>(IEnumerable<T> list)
    {
        unchecked
        {
            int hash = 0;
            foreach (var item in list)
            {
                hash = 31 * hash + item.GetHashCode();
            }
            return hash;
        }
    }

    /// <summary>
    /// Gets a hashcode for a collection for that the order of items 
    /// does not matter.
    /// So {1, 2, 3} and {3, 2, 1} will get same hash code.
    /// </summary>
    public static int GetHashCodeForOrderNoMatterCollection<T>(
        IEnumerable<T> list)
    {
        unchecked
        {
            int hash = 0;
            int count = 0;
            foreach (var item in list)
            {
                hash += item.GetHashCode();
                count++;
            }
            return 31 * hash + count.GetHashCode();
        }
    }

    /// <summary>
    /// Alternative way to get a hashcode is to use a fluent 
    /// interface like this:<br />
    /// return 0.CombineHashCode(field1).CombineHashCode(field2).
    ///     CombineHashCode(field3);
    /// </summary>
    public static int CombineHashCode<T>(this int hashCode, T arg)
    {
        unchecked
        {
            return 31 * hashCode + arg.GetHashCode();   
        }
    }

Also it has extension method to provide a fluent interface, so you can use it like this:

public override int GetHashCode()
{
    return HashHelper.GetHashCode(Manufacturer, PartN, Quantity);
}

or like this:

public override int GetHashCode()
{
    return 0.CombineHashCode(Manufacturer)
        .CombineHashCode(PartN)
        .CombineHashCode(Quantity);
}