Wikimania part 2: what will do?

Posted by Jonas Öberg on August 22, 2014

Photo by Flickr user shindoverse, via photopin

Imagine being able to browse pages on the Internet, find images you like, and, with the click of a button in the corner, get information about who participated in the creation of the image and where it’s from. And if it’s from a gallery, library, archive or museum, you’d also see which collection it’s part of and where to find it. That’s the first step of what we’re working on over at

Before getting into the details of what our extension will do, though, it’s worth mentioning what it will not do:

  1. It won’t do anything unless you activate it! While we take every precaution to keep all data anonymous, matching images you see while browsing with images in our database means sending some information to our servers. Our extension will only do this if you activate it, and if you’re browsing incognito or privately, our extension will simply not run.
  2. It won’t automatically search for information about new images. If the location of a particular image is known to our database, it’ll be fairly quick and easy to look up information about that image, and we’ll do that automatically. There will be cases where we’ve never heard of that location before, though, for instance if someone takes an image from a museum collection and publishes it on their own blog. In such cases, we may need to do a perceptual comparison between the image and the image on record in our database. These calculations take some time, so we won’t do them automatically. We’ll only do them if you click a button to request it. The benefit, though, is that the next time you (or someone else) visit the same web page, the calculation is already done and you don’t need to wait.
  3. (But it might do it in the background!) Irrespective of what I said in the previous point, if our servers detect that more than a handful of people are browsing web pages where a particular image is used, we might pick that image and do the calculations even without anyone requesting that we do so.
  4. It won’t detect ALL modifications. There are limits to how well we can perceptually match images with one another. We’ll be rather conservative on this: we consider it worse if an image generates a false positive (meaning one image matches against a completely different image) than if an image generates a false negative (meaning we can’t match the images together, even if they are one and the same). So sometimes an image might exist in our database, but we can’t match it against the one you’re viewing. If an image has been modified to create a derivative work (by adding a border, rotating it, cropping it, or similar), then it’s likely our algorithms won’t match them together.
  5. (But you can match them manually.) You may be looking at an image and feel that it should exist in our database. We will not start with this, but future versions will allow you to contribute more information about the images you see. So if you find an image that exists in our database but that we can’t match automatically, you’ll be able to help by contributing that missing piece of information.
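
To make points 2 and 3 a bit more concrete, here is a minimal sketch of how a server might decide when to run the slow perceptual comparison. Everything in it is an assumption made for illustration: the `ImageRecord` class, the `should_run_perceptual_match` function and the threshold of five views are hypothetical, since the post only says "more than a handful of people".

```python
from dataclasses import dataclass

# Hypothetical value: the post only says "more than a handful of people".
BACKGROUND_VIEW_THRESHOLD = 5


@dataclass
class ImageRecord:
    """What a (hypothetical) server might track for an image URL it cannot match yet."""
    url: str
    view_count: int = 0    # anonymous count of page views that included this image
    matched: bool = False  # True once a perceptual comparison has been stored


def should_run_perceptual_match(record: ImageRecord, user_requested: bool) -> bool:
    """Decide whether to spend time on the slow perceptual comparison."""
    if record.matched:
        return False  # the result is already stored, nobody needs to wait again
    if user_requested:
        return True   # the user clicked the button to request it
    # Background case: the image is popular enough to be worth precomputing.
    return record.view_count > BACKGROUND_VIEW_THRESHOLD
```

With this policy, an unknown image seen once is left alone, a click on the button always triggers the calculation, and an image that keeps showing up gets processed in the background.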

So what exactly will it do then? It will:

  1. Identify the main images of the web page you’re viewing (large visible images)
  2. Look up the URL of those images in our catalog of creative works
  3. For all matches, put a visual mark in the corner of the image that you can click to find out more information
  4. If you click the manual search button, it will calculate one or more so-called perceptual hashes of the images that have not already been matched. It’ll submit those hashes to our catalog, which will crunch the numbers and try to identify images that are the same as ones already in our catalog
  5. For all matches, it will visually mark them as above, and also record the URL of the image for future lookups and other viewers
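
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extension's actual code: the perceptual hash shown is a simple 256-bit average hash (one common technique, not necessarily the one we will end up using), images are represented as plain rows of grayscale values, the catalog is two in-memory dictionaries, and exact hash equality stands in for the near-match search the catalog would really perform. The names `average_hash` and `lookup` are hypothetical.

```python
from typing import Dict, List, Optional

GRID = 16  # 16 x 16 cells gives a 256-bit hash


def average_hash(pixels: List[List[int]]) -> int:
    """Compute a 256-bit average hash from rows of grayscale values (0-255).

    The image is split into GRID x GRID cells; a bit is 1 when the cell's
    mean brightness is above the mean of all cells.
    """
    height, width = len(pixels), len(pixels[0])
    if height < GRID or width < GRID:
        raise ValueError("image must be at least 16 x 16 pixels")
    cell_means = []
    for gy in range(GRID):
        y0, y1 = gy * height // GRID, (gy + 1) * height // GRID
        for gx in range(GRID):
            x0, x1 = gx * width // GRID, (gx + 1) * width // GRID
            cell = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cell_means.append(sum(cell) / len(cell))
    overall = sum(cell_means) / len(cell_means)
    bits = 0
    for mean in cell_means:
        bits = (bits << 1) | (1 if mean > overall else 0)
    return bits


def lookup(url: str, pixels: List[List[int]],
           url_index: Dict[str, str], hash_index: Dict[int, str],
           manual_search: bool) -> Optional[str]:
    """Return the ID of the matching work, or None if there is no match."""
    work_id = url_index.get(url)          # step 2: look up the image URL
    if work_id is not None or not manual_search:
        return work_id                    # step 3 marks the image if this is a match
    image_hash = average_hash(pixels)     # step 4: only after the manual search click
    work_id = hash_index.get(image_hash)  # assumption: exact equality, for brevity
    if work_id is not None:
        url_index[url] = work_id          # step 5: remember the URL for future lookups
    return work_id
```

Once a manual search has succeeded for a URL, later calls with `manual_search=False` find it directly in `url_index`, which is the "already done, no need to wait" behaviour described earlier.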

That’s it! It may not sound like much, but there are quite a few details in there that we’re currently working through and experimenting with, such as how to search potentially millions of records for images that are similar. With a 256-bit hash, similar images would be those where only about 1-10 bits differ between the original and the copy. This kind of search isn’t supported by any of the currently popular databases and, while we can do it fairly easily for smaller collections, we’re still working out what gives us the best performance at a larger scale.
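
To illustrate what "only about 1-10 bits differ" means in practice, here is a minimal sketch of the brute-force approach that works for smaller collections. It assumes hashes are stored as Python integers in a dictionary; the names `hamming_distance` and `find_similar` and the cut-off of 10 bits are hypothetical choices for this example. The difficulty at scale is visible in the loop: every stored hash has to be compared against the query, because an ordinary exact-match index cannot answer "which values are within 10 bits of this one?".

```python
from typing import Dict, List, Tuple

MAX_DISTANCE = 10  # assumed cut-off, taken from the "1-10 bits" range above


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_similar(query: int, catalog: Dict[str, int],
                 max_distance: int = MAX_DISTANCE) -> List[Tuple[str, int]]:
    """Linear scan: compare the query hash against every hash in the catalog."""
    matches = []
    for work_id, stored_hash in catalog.items():
        distance = hamming_distance(query, stored_hash)
        if distance <= max_distance:
            matches.append((work_id, distance))
    return sorted(matches, key=lambda match: match[1])  # closest first
```

For a query hash that differs from one catalog entry in three bits, `find_similar` returns that entry with a distance of 3 and ignores entries that differ in, say, half of their bits.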

We’ll be talking more about this in upcoming posts, where we’ll dive into the technology behind all of this. If you’re interested in following along, please do look at our GitHub and see our activities there!
