Last week, Artem described our implementation of blockhash for image matching. The second part of making this into a service that can find similar images is to search for hashes that are close but not always matches. To do this we have implemented an algorithm called HmSearch:
To determine how closely two perceptual hashes match each other you count the number of bits that are different between them, which is called hamming distance. This can easily be found by XOR:ing the two hashes and count the number of bits that are 1 in the result.
To find similar images we need to find all hashes in a database that are within a given hamming distance from the query hash. This is easy to do when the hamming distance is 1: just look for the query hash itself, and then for all the hashes you get by flipping each bit in it. For a 256-bit blockhash you’d end up doing 257 queries, which isn’t that bad. If you have disk and memory enough you could also create an index containing all these flipped hashes, and get away with a single query.
The problem starts when you want to find a hash with greater hamming distance. This quickly escalates, so that for a hamming distance of 5 of out 256 bits you have nearly 9 billion combinations, and for 10 bits you get 2.8*10^17 combinations. In our tests, we’ve determined that a hamming distance somewhere between these numbers is good to find resized and reprocessed images while avoiding most false positives. Something smarter than brute-force is obviously needed.
HmSearch is very clever. The algorithm splits the hash into partitions in such a way that we can do a search for matches within hamming distance 1 for each partition. As described above, this is pretty easy. By looking at how many of the partitions match exactly or with distance 1, the algorithm can filter out the hashes that should be within the desired hamming distance. For a 256 bit hash with distance 10 you then need just 262 queries. In a final step these candidates hashes are checked against the query hash to determine the exact distance and sort the results on the best matches.
As an aside, the paper describing HmSearch creates a full index of all one-bit-variations of each partition, but we found that that is unfeasible unless you have lots and lots of disk and memory. Instead we sacrifice a little theoretical lookup speed by just storing each partition, and then querying each one-bit variation of the partitions. In practice, this is quicker since a smaller index is quicker to search in.
Our implementation store the hashes in a Kyoto Cabinet file database. A database of 25M hashes takes about 8.5GB of disk space in the current implementation. With only rather basic tuning, on a year-old Thinkpad with SSD and 8GB memory a lookup into that database takes no more than 50ms and 85% of that is IO wait.
 Xiaoyang Zhang, Jianbin Qin, Wei Wang, Yifang Sun, and Jiaheng Lu. 2013. HmSearch: an efficient hamming distance query processing algorithm. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM), Alex Szalay, Tamas Budavari, Magdalena Balazinska, Alexandra Meliou, and Ahmet Sacan (Eds.). ACM, New York, NY, USA, , Article 19 , 12 pages. DOI=10.1145/2484838.2484842 http://doi.acm.org/10.1145/2484838.2484842