Experimentation with Wikimedia Commons

Posted by Jonas Öberg on July 22, 2014
Photo by See-ming Lee, on Flickr, via photopin

One challenge in finding contextual information about a specific image is that, even if such information is published in a database, it’s difficult to match an image found on the Internet against its entry in that database. The image may have been cropped, resized, or otherwise changed so that it no longer matches the original exactly.

A way to approach this – even if it wouldn’t work for all images – is to use algorithms that try to measure mathematically how similar two images are. There are a few such algorithms available, the most promising being SIFT (scale-invariant feature transform). SIFT is patented, and therefore legally unusable, but I’ve a hunch that it may be technically usable, and I want to test this hypothesis.
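To give a feel for what "mathematically detecting similarity" can mean, here is a minimal sketch of an average hash. This is not SIFT: it is a far simpler technique that should survive resizing and uniform brightness changes but not cropping or rotation. The function names and the plain list-of-rows image format are assumptions made for illustration only.

```python
def average_hash(pixels, size=4):
    """Return a list of size*size bits describing a grayscale image.

    pixels: list of rows of grayscale values (0-255). For simplicity this
    sketch assumes the height and width are multiples of `size`.
    """
    height, width = len(pixels), len(pixels[0])
    block_h, block_w = height // size, width // size

    # Shrink the image to size x size cells by averaging each block.
    cells = []
    for by in range(size):
        for bx in range(size):
            block = [
                pixels[y][x]
                for y in range(by * block_h, (by + 1) * block_h)
                for x in range(bx * block_w, (bx + 1) * block_w)
            ]
            cells.append(sum(block) / len(block))

    # Each bit records whether a cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    return [1 if cell > mean else 0 for cell in cells]


def similarity(hash_a, hash_b):
    """Fraction of matching bits between two equal-length hashes (0.0 to 1.0)."""
    matching = sum(a == b for a, b in zip(hash_a, hash_b))
    return matching / len(hash_a)
```

Two images whose hashes agree on nearly every bit are candidates for being copies of each other. SIFT works very differently, by matching local features, which is why it can be expected to cope with changes such as cropping that defeat a whole-image hash like this one.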

In order to test it, I need a set of images, some of which are copies of each other and others which are not. And I need to know which ones are which, ideally with real world data. What I’ve been planning to do is to retrieve a number of images from a well known source (I’m thinking about Wikimedia Commons) and use TinEye and/or Google to do a reverse lookup of those images, to find other occurrences of the same image elsewhere on the Internet.

Neither Google nor TinEye produces 100% valid results. Some of the images returned are bound to be mismatches: images that are thought to be similar but in reality are not. Others won’t be true copies but will obviously be based on another image (such as a photo from Wikimedia Commons used on a book cover that the image search finds).

Feeding this list of images into a crowdsourcing platform like PyBossa and using the visual perception of humans – not machines – would (with some bias) result in a list of known values for the relations between each pair of images (identical, derivative of, or not at all related). I can then run SIFT as well as other algorithms against the same set of images and simply see which algorithm and parameters return the most accurate results.
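The comparison step could look roughly like the sketch below. It assumes that each algorithm produces a similarity score between 0 and 1 for every image pair, that several people label each pair, and that "identical" and "derivative of" both count as related for scoring purposes. All names and data shapes here are hypothetical; PyBossa's actual export format is not modelled.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common human label for a pair, or None on a tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear majority, so the pair is left out of scoring
    return ranked[0][0]


def accuracy(scores, truth, threshold):
    """Share of labelled pairs where the algorithm agrees with the humans.

    scores: {pair: similarity score}, truth: {pair: label or None}.
    A pair is predicted "related" when its score reaches the threshold.
    """
    correct = 0
    total = 0
    for pair, label in truth.items():
        if label is None or pair not in scores:
            continue
        predicted_related = scores[pair] >= threshold
        actually_related = label != "unrelated"
        correct += predicted_related == actually_related
        total += 1
    return correct / total if total else 0.0


def best_configuration(results, truth, thresholds):
    """Return (accuracy, algorithm name, threshold) for the best combination.

    results: {algorithm name: {pair: similarity score}}.
    Ties are broken by name and then threshold, which is arbitrary.
    """
    candidates = [
        (accuracy(scores, truth, threshold), name, threshold)
        for name, scores in results.items()
        for threshold in thresholds
    ]
    return max(candidates)
```

Collapsing three relations into related versus unrelated keeps the sketch short; a fuller evaluation would probably score "identical" and "derivative of" separately.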

I’m hoping for an accuracy of about 50-75% here. It would probably be difficult to get a higher degree of accuracy, and we’ll undoubtedly end up with a number of edge cases where the algorithms can’t reliably determine which pairs are copies and which are not. But if we can at least clear away the pairs that the algorithms can definitely identify as copies, that reduces the burden of manual checking.
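One way to picture "clearing away" the definite copies is to pick a strict score cut-off from the labelled set and send only the remaining pairs to manual checking. The sketch below assumes similarity scores where higher means more alike, and treats only pairs labelled "identical" as copies by default; both are assumptions, as are the function names.

```python
def strict_threshold(scores, truth, copy_labels=("identical",)):
    """Lowest cut-off at which every labelled pair at or above it is a copy.

    Returns None when no copy scores higher than every labelled non-copy.
    """
    labelled = [
        (score, truth[pair] in copy_labels)
        for pair, score in scores.items()
        if truth.get(pair) is not None
    ]
    highest_non_copy = max(
        (score for score, is_copy in labelled if not is_copy),
        default=float("-inf"),
    )
    safe = [score for score, is_copy in labelled
            if is_copy and score > highest_non_copy]
    return min(safe) if safe else None


def triage(scores, threshold):
    """Split pairs into those accepted automatically and those left for people."""
    automatic = sorted(p for p, s in scores.items() if s >= threshold)
    manual = sorted(p for p, s in scores.items() if s < threshold)
    return automatic, manual
```

A cut-off chosen this way is only as trustworthy as the labelled sample it was derived from, so it would need re-checking as more pairs are labelled.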

To get images from Wikimedia Commons, I reached out to Brian Wolff, whom my colleague Mathias Klang met at WikiConference USA earlier this year. Brian has a keen understanding of the data in Wikimedia Commons and quickly wired up a script to output a list of files from the Commons that have the most uses outside of Commons itself.

My theory was that the images most used outside of the Commons are most likely to be images that we can find “in the wild” on the Internet, and so would be useful images to do a reverse image search on for the initial database. Unfortunately, it turns out that the most used images include a whole lot of flags, the logotype of Wikimedia Commons itself, and the wiki letter W. Icons, more than photographs.

In retrospect, I should have anticipated this. Since Wikimedia Commons contains not only photographs but also icons used within Wikipedia articles, it’s natural that those icons would head the list.

Brian suggested using the list of pictures which have been featured on Wikimedia Commons, and indeed, I think this is what I should do: it seems to produce much more useful results on a cursory glance. Expect more to follow.

photo credit: See-ming Lee 李思明 SML via photopin cc