Perceptual image and audio deduplication

Okay, two months without a post, won’t happen again…

So, lately I’ve been moving away from broadcast work and getting back into webapp development, but some of the things I’ve been working on touch quite heavily on deduplication of images and music. This is quite an interesting topic, so let’s have a look at what we can do now.

Doing exact deduplication (stopping someone uploading the same file twice to a website) is pretty easy. You just hash the uploaded file with something like SHA256 or SHA512, or de-encapsulate the data and hash that if you want to be a little more resilient to changes in the container. It’s fast, effective and easy. Lookups are as fast as an indexed query in your RDBMS. This works with images, audio, videos, you name it.
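As a rough illustration of the exact-match case, here is a minimal Python sketch using the standard library’s hashlib. The function names and the in-memory set are made up for the example; in a real webapp the set would more likely be a uniquely indexed digest column in your database.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a binary file-like object.

    Reads in chunks so a large upload never has to sit in memory whole.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def register_upload(stream, seen_digests):
    """Return True if the upload is new, False if it is an exact duplicate.

    `seen_digests` is a plain set standing in for a database lookup
    (an assumption made for this sketch).
    """
    digest = sha256_of_stream(stream)
    if digest in seen_digests:
        return False
    seen_digests.add(digest)
    return True
```

Note that changing a single byte of the file, for instance by re-saving it, gives a completely different digest, which is exactly the limitation the rest of this post is about.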

What’s much harder is doing perceptual deduplication, or content deduplication. If you upload two files which are the same except one’s a PNG and one’s a JPEG, I want to be able to say “Hang on, you’ve already uploaded that!” when you upload the second file. Similarly, what if we resize an image? We want something resistant to that sort of change.
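To give a feel for what “resistant to resizing” can mean, here is a small sketch of one well-known simple technique, the average hash (aHash). It is not necessarily the approach this post goes on to use. It assumes the image has already been decoded into a grid of grayscale values (a list of rows, at least 8x8), which a real system would get from an image library; all the names here are made up for the example.

```python
def shrink(pixels, size=8):
    """Box-average a 2D grayscale grid down to size x size.

    Assumes the grid is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels, size=8):
    """Return a size*size-bit integer: one bit per cell, set if above the mean."""
    cells = [value for row in shrink(pixels, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# A 16x16 gradient and the same picture scaled up 2x by pixel doubling.
image = [[x * 16 + y for x in range(16)] for y in range(16)]
bigger = [[image[y // 2][x // 2] for x in range(32)] for y in range(32)]

distance = hamming(average_hash(image), average_hash(bigger))  # 0 for this exact 2x upscale
```

Instead of asking whether two hashes are equal, you ask whether their Hamming distance is below some threshold, so small changes from rescaling or recompression can still count as a match. The threshold is something you would have to tune on your own data.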