Okay, two months without a post, won’t happen again…
So, lately I’ve been moving out from the broadcast area and getting back into webapp development, but some of the things I’ve been working on touch quite heavily on deduplication, of images and music. This is quite an interesting topic, so let’s have a look at what we can do now.
Doing exact deduplication – stopping someone uploading the same file twice to a website – is pretty easy. You just hash the uploaded file (or de-encapsulate the data and hash that if you want to be a little more resilient) with something like SHA256 or SHA512. It’s fast, effective and easy. Lookups are as fast as your RDBMS is. This works with images, audio, videos, you name it.
What’s much harder is doing perceptual deduplication, or content deduplication. If I upload two files which are the same except one’s a PNG and one’s a JPEG, I want to be able to say “Hang on, you’ve already uploaded that!” when you upload the second file. Similarly, what if we resize an image? We want something resistant to that sort of attack.
There are already perceptual hashing libraries like pHash out there which work by generating hashes which look very similar to your average md5sum, but are actually generated perceptually – the Hamming distance between two hashes is a measure of the similarity of the images represented by those two hashes. This is a great thing, but it’s pretty useless on large datasets without some specialist software to manage databases of hashes and querying based on Hamming distance. The pHash guys will of course sell you this solution, but there’s the problem – there’s quite a bit of money to be made with this sort of product, and useful open source implementations seem to be quite rare if not non-existent.
pHash is also a bit of an odd thing in that it works on multiple media types – audio, video and images. More specifically for audio are techniques for audio fingerprinting, like AcoustID which aim to generate a specific fingerprint for each recording of a song. Distance between songs isn’t something we’re very interested in, because audio releases are typically few in number, and rarely are we looking for a song which sort of sounds like X – if it sounds different, it’s a different recording of the song, or has been mastered.
Images are very different because people often make small changes, or change the formatting or file type of an image, and these circulate and get thrown around all over the place. We want to be able to accept things that are legitimately changed, but flag up things that are almost perfectly identical to something we already have in a database.
So, what can we do? Well, we can use simpler techniques to reduce the number of images to compare down to a small number – say, 50 – and then compare the Hamming distance on full pHashes using that technique. But do we actually need this? Certainly for some databases, merely using those simpler techniques may well be sufficient.
I’ve been building an image board, lately, which I’m using Dragonfly for, an awesome on-the-fly image serving system. We’re storing everything in MongoDB, with GridFS for file storage. Here’s what we’re doing.
First, we need some analysis of the image contents. I’m using a roughly perceptually weighted average intensity metric. Here’s a tiny little Python script using Python-OpenCV’s SWIG bindings to perform that analysis rapidly.
Now we need to get Dragonfly in on this, and register a function we’ll use to handle GIFs. This goes in our Dragonfly initializer.
And finally we need to store all this and actually do the find-duplicates step. Note the aspect ratio stuff – with the ImageMagick analyser, Dragonfly will handle storing this for you, which makes life easier.
In our find_duplicates function we just look for images that have both a similar intensity and a similar aspect ratio. If both of these things are very close we’ve got ourselves a potential duplicate. We’re not doing exact matching on intensity/aspect ratio, more fuzzy matching, because compression changes and resizes can often affect both slightly.
Of course, the proof is in the pudding- this site only has a hundred images in it now, and we may need to adjust the distance variable, but so far it’s working very well and isn’t bringing up any false positives while correctly identifying duplicates. The way we’re doing this from a workflow perspective is important, too – when we do find duplicates, we ask the uploader of the image to check and make sure they’re not uploading anything we already have by presenting the duplicate images to them. They can then mark the duplicate (which deletes the image they just uploaded and leaves a redirect to the image their upload duplicates) or confirm it’s not a duplicate. The statistics from those actions will be interesting to see over time as a metric of success for this approach!
So, that’s how I’ve done image dedupe in a Rails app – what approach are you using?