<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Talk Unafraid</title>
	<atom:link href="http://www.talkunafraid.co.uk/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.talkunafraid.co.uk</link>
	<description>The (occasionally coherent) ramblings of a geek</description>
	<lastBuildDate>Sat, 07 Jan 2012 22:24:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Perceptual image and audio deduplication</title>
		<link>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/</link>
		<comments>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/#comments</comments>
		<pubDate>Sat, 07 Jan 2012 18:03:11 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Code Snippets and Examples]]></category>
		<category><![CDATA[Odds and Ends]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[dragonfly]]></category>
		<category><![CDATA[image processing]]></category>
		<category><![CDATA[opencv]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1375</guid>
		<description><![CDATA[Okay, two months without a post, won&#8217;t happen again&#8230; So, lately I&#8217;ve been moving out from the broadcast area and getting back into webapp development, but some of the things I&#8217;ve been working on touch quite heavily on deduplication, of images and music. This is quite an interesting topic, so let&#8217;s have a look at [...]]]></description>
			<content:encoded><![CDATA[<p>Okay, two months without a post, won&#8217;t happen again&#8230;</p>
<p>So, lately I&#8217;ve been moving out from the broadcast area and getting back into webapp development, but some of the things I&#8217;ve been working on touch quite heavily on deduplication, of images and music. This is quite an interesting topic, so let&#8217;s have a look at what we can do now.</p>
<p>Doing exact deduplication &#8211; stopping someone uploading the same file twice to a website &#8211; is pretty easy. You just hash the uploaded file (or de-encapsulate the data and hash that if you want to be a little more resilient) with something like SHA256 or SHA512. It&#8217;s fast, effective and easy. Lookups are as fast as your RDBMS is. This works with images, audio, videos, you name it.</p>
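<p>As a rough illustration of the exact-match case, here is a minimal Python sketch using hashlib from the standard library. The dict standing in for the database table and the function names are made up for this example; in a real app the digest would sit in an indexed column.</p>

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the exact-duplicate key."""
    return hashlib.sha256(data).hexdigest()


# Assumption: a plain dict stands in for the database table of known uploads.
seen = {}


def register_upload(name: str, data: bytes):
    """Return the name of an earlier identical upload, or None if it is new."""
    digest = content_digest(data)
    if digest in seen:
        return seen[digest]
    seen[digest] = name
    return None


first = register_upload("cat.png", b"example image bytes")
second = register_upload("cat-copy.png", b"example image bytes")
third = register_upload("dog.png", b"different bytes")
```

<p>Here the second upload comes back pointing at cat.png, while the first and third are accepted as new. Change a single byte and the digest changes completely, which is exactly why this only catches byte-identical files.</p>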
<p>What&#8217;s much harder is doing perceptual deduplication, or content deduplication. If I upload two files which are the same except one&#8217;s a PNG and one&#8217;s a JPEG, I want to be able to say &#8220;Hang on, you&#8217;ve already uploaded that!&#8221; when you upload the second file. Similarly, what if we resize an image? We want something resistant to that sort of attack.<span id="more-1375"></span></p>
<p>There are already perceptual hashing libraries like pHash out there which work by generating hashes which look very similar to your average md5sum, but are actually generated perceptually &#8211; the Hamming distance between two hashes is a measure of the similarity of the images represented by those two hashes. This is a great thing, but it&#8217;s pretty useless on large datasets without some specialist software to manage databases of hashes and querying based on Hamming distance. The pHash guys will of course <em>sell </em>you this solution, but there&#8217;s the problem &#8211; there&#8217;s quite a bit of money to be made with this sort of product, and useful open source implementations seem to be quite rare if not non-existent.</p>
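<p>To make the Hamming distance idea concrete, here is a small Python sketch. It treats each hash as an integer and does a plain linear scan, which is precisely the part that stops scaling on large datasets. The hash values and names are invented for illustration and are not real pHash output.</p>

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two equal-length hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def closest_matches(query: int, known: dict, max_distance: int) -> list:
    """Return (distance, name) pairs within max_distance, nearest first.

    This is a linear scan over every stored hash, so the cost grows with
    the size of the collection.
    """
    scored = [(hamming_distance(query, h), name) for name, h in known.items()]
    return sorted(pair for pair in scored if max_distance >= pair[0])


# Made-up 8-bit hashes; real perceptual hashes are much longer.
known_hashes = {"a": 0b10110010, "b": 0b10110011, "c": 0b01001101}
matches = closest_matches(0b10110010, known_hashes, max_distance=2)
```

<p>With these values, a and b fall within two bits of the query and c (which differs in every bit) is excluded.</p>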
<p>pHash is also a bit of an odd thing in that it works on multiple media types &#8211; audio, video and images. More specific to audio are fingerprinting techniques like <a href="http://acoustid.org/">AcoustID</a>, which aim to generate a specific fingerprint for each recording of a song. Distance between songs isn&#8217;t something we&#8217;re very interested in, because audio releases are typically few in number, and rarely are we looking for a song which sort of sounds like X &#8211; if it sounds different, it&#8217;s a different recording of the song, or has been remastered.</p>
<p>Images are very different because people often make small changes, or change the formatting or file type of an image, and these circulate and get thrown around all over the place. We want to be able to accept things that are legitimately changed, but flag up things that are almost perfectly identical to something we already have in a database.</p>
<p>So, what can we do? Well, we can use simpler techniques to reduce the number of images to compare down to a small number &#8211; say, 50 &#8211; and then compare the Hamming distance on full pHashes using that technique. But do we actually need this? Certainly for some databases, merely using those simpler techniques may well be sufficient.</p>
<p>I&#8217;ve been building an image board, lately, which I&#8217;m using Dragonfly for, an awesome on-the-fly image serving system. We&#8217;re storing everything in MongoDB, with GridFS for file storage. Here&#8217;s what we&#8217;re doing.</p>
<p>First, we need some analysis of the image contents. I&#8217;m using a roughly perceptually weighted average intensity metric. Here&#8217;s a tiny little Python script using Python-OpenCV&#8217;s SWIG bindings to perform that analysis rapidly.</p>
<p><script type="text/javascript" src="https://gist.github.com/1575463.js?file=simple_analyser.py"></script>Now we need to get Dragonfly in on this, and register a function we&#8217;ll use to handle GIFs. This goes in our Dragonfly initializer.<script type="text/javascript" src="https://gist.github.com/1575463.js?file=dragonfly.rb"></script></p>
<p>And finally we need to store all this and actually do the find-duplicates step. Note the aspect ratio stuff &#8211; with the ImageMagick analyser, Dragonfly will handle storing this for you, which makes life easier.</p>
<p><script src="https://gist.github.com/1575463.js?file=image.rb"></script></p>
<p>In our find_duplicates function we just look for images that have both a similar intensity and a similar aspect ratio. If both of these things are very close we&#8217;ve got ourselves a potential duplicate. We&#8217;re not doing <em>exact</em> matching on intensity/aspect ratio, more fuzzy matching, because compression changes and resizes can often affect both slightly.</p>
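<p>The fuzzy matching described above can be sketched in a few lines of Python. This is not the code from the gist, just an illustration of the idea: the field names, the 0 to 1 intensity scale and the default distance are all assumptions made for this example.</p>

```python
def find_duplicates(candidate: dict, stored: list, distance: float = 0.01) -> list:
    """Return stored images whose intensity and aspect ratio are both
    within `distance` of the candidate's values."""
    matches = []
    for image in stored:
        intensity_gap = abs(image["intensity"] - candidate["intensity"])
        aspect_gap = abs(image["aspect_ratio"] - candidate["aspect_ratio"])
        # Both metrics must be close; one close value on its own is not enough.
        if distance >= intensity_gap and distance >= aspect_gap:
            matches.append(image)
    return matches


# Hypothetical records; intensity is assumed to be normalised to 0..1.
stored_images = [
    {"id": 1, "intensity": 0.500, "aspect_ratio": 1.333},
    {"id": 2, "intensity": 0.504, "aspect_ratio": 1.335},
    {"id": 3, "intensity": 0.500, "aspect_ratio": 1.778},
]
upload = {"intensity": 0.502, "aspect_ratio": 1.334}
duplicate_ids = [image["id"] for image in find_duplicates(upload, stored_images)]
```

<p>Images 1 and 2 are flagged as potential duplicates; image 3 has a near-identical intensity but a different aspect ratio, so it is left alone. In a database this becomes a pair of range conditions, which is why it stays cheap.</p>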
<p>Of course, the proof is in the pudding &#8211; this site only has a hundred images in it now, and we may need to adjust the distance variable, but so far it&#8217;s working very well and isn&#8217;t bringing up any false positives while correctly identifying duplicates. The way we&#8217;re doing this from a workflow perspective is important, too &#8211; when we do find duplicates, we ask the uploader of the image to check and make sure they&#8217;re not uploading anything we already have by presenting the duplicate images to them. They can then mark the duplicate (which deletes the image they just uploaded and leaves a redirect to the image their upload duplicates) or confirm it&#8217;s not a duplicate. The statistics from those actions will be interesting to see over time as a metric of success for this approach!</p>
<p>So, that&#8217;s how I&#8217;ve done image dedupe in a Rails app &#8211; what approach are you using?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IRIS &#8211; The Interchangeable Radio Ingest System</title>
		<link>http://www.talkunafraid.co.uk/2011/10/iris-the-interchangeable-radio-ingest-system/</link>
		<comments>http://www.talkunafraid.co.uk/2011/10/iris-the-interchangeable-radio-ingest-system/#comments</comments>
		<pubDate>Thu, 13 Oct 2011 10:48:14 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Audio]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Projects]]></category>
		<category><![CDATA[Radio]]></category>
		<category><![CDATA[Rivendell]]></category>
		<category><![CDATA[Servers and Software]]></category>
		<category><![CDATA[EBU R128]]></category>
		<category><![CDATA[iris]]></category>
		<category><![CDATA[myriad]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[rivendell]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1367</guid>
		<description><![CDATA[Well, wow. After nearly forgetting to actually submit it and only writing the entry a few hours before the deadline, it turns out that the system I made while at Insanity Radio 1287AM has been nominated for the Best Technical Achievement award at the Student Radio Awards. So, I figured it would be worth actually [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://assets.talkunafraid.co.uk/2011/10/IRIS-ConceptualDrawing.png" rel="lightbox[1367]"><img class="alignright size-medium wp-image-1369" title="Conceptual Overview" src="http://assets.talkunafraid.co.uk/2011/10/IRIS-ConceptualDrawing-300x225.png" alt="" width="300" height="225" /></a>Well, wow. After nearly forgetting to actually submit it and only writing the entry a few hours before the deadline, it turns out that the system I made while at Insanity Radio 1287AM has been nominated for the Best Technical Achievement award at the Student Radio Awards. So, I figured it would be worth actually writing up a little bit about what it is and what it does. And why you can use it, too, if you&#8217;re involved with a student radio station.</p>
<p>IRIS was written to replace MACIS, a system I bodged up out of necessity. At Insanity, we had a computer failure weeks before we went on air at the start of the first term, and lost all the data- including the entire playout system. Lessons have been learned (I made sure we replaced that machine with a box that had RAID, for starters) since, but we had the unenviable challenge of repopulating a student radio playout system from scratch with little to no staff. Enter MACIS!</p>
<p><span id="more-1367"></span></p>
<p>MACIS was <em>dumb</em>. It talked to our playout system (PSquared&#8217;s Myriad 3.5) via the not-very-documented TCP/IP interface, had a web interface and drop box, and some background processing magic. It was implemented as a Ruby on Rails web application, since we already used Rails and Ruby for various tasks around the station (the website is all Rails, and Ruby was chosen for most scripting tasks because of its user friendliness to people not too familiar with programming). You passed it files, it converted them (the main purpose- Myriad doesn&#8217;t support AAC, MP4 or many other formats we were using for ingest), did a basic stab at normalization, and then imported the files. Myriad is slow at importing files- 1-2 minutes per file on average, so we let MACIS distribute workload across all four of our Myriad machines, speeding things up massively.</p>
<p>This was good and got us running, but then term started. We&#8217;re a student station and have specialist music shows, who upload their own content to the playout system every week, and new content for new music and chart shows came in regularly. MACIS was used for this, as it did the conversion automatically, saving our presenters loads of time. It also did batch imports quickly and efficiently, which sped things up. However, after a while, we stopped using it, and just provided a simple Ruby dropbox on a shared file server for conversion. MACIS was useful, but too buggy and inflexible. In addition, while it did a better job of getting metadata into Myriad than Myriad managed on its own, it had issues with some formats and material.</p>
<p>What was needed was a system that would let presenters upload content in any format, would sort out the metadata, handle conversion, and fire it off to the playout system for usage.</p>
<div id="attachment_1370" class="wp-caption alignright" style="width: 211px"><a href="http://assets.talkunafraid.co.uk/2011/10/IRIS-Screenshot-UploadInformation.png" rel="lightbox[1367]"><img class="size-medium wp-image-1370" title="Upload Information" src="http://assets.talkunafraid.co.uk/2011/10/IRIS-Screenshot-UploadInformation-201x300.png" alt="" width="201" height="300" /></a><p class="wp-caption-text">An example upload showing the log and graphs (huge image)</p></div>
<p>So, while presenters got back to using Myriad directly, I went back to the drawing board and scrapped MACIS. At this stage we were considering transitioning to the Rivendell open source playout system to replace Myriad, so I decided whatever I was going to make had to support both Myriad and Rivendell, and any other system you could imagine.</p>
<p>I also wanted to solve the loudness problem. Even with normalization applied to every imported track, we had huge loudness level differences between some tracks, making life quite tricky for presenters and impacting our on-air sound. Especially given our lack of transmitter processing (we only have a limiter and preemphasis box on the AM system- no AGC or multiband comps), I wanted to do all I could to get everything as perceptually loud as everything else. Enter EBU Recommendation 128 for loudness measurement- with the help of some great libraries (libebur128 in particular), I implemented a simple version of the recommended processing system for loudness normalization, including LRA correction using a compressor. Thus, everything you run through IRIS comes out sounding about the same as everything else in terms of perceptual levels &#8211; as much as is possible without impacting the sound. <em>Wish You Were Here</em> is still going to have a quiet bit at the beginning- but IRIS will gently compress the track to make the difference less severe, and will then use R128&#8217;s standards to normalize to -23 LUFS.</p>
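<p>The final normalization step boils down to simple arithmetic once the integrated loudness has been measured. Here is a minimal Python sketch of that arithmetic only. It assumes the measurement (which IRIS gets from libebur128) has already been done and handed in as a number, and the function names are made up for this example rather than taken from IRIS.</p>

```python
TARGET_LUFS = -23.0  # EBU R128 programme loudness target


def gain_to_target_db(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Gain in dB needed to move a track from its measured loudness to the target.

    A difference of 1 LU corresponds to a gain change of 1 dB.
    """
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples: list, gain_db: float) -> list:
    """Scale floating-point samples by the given gain."""
    factor = db_to_linear(gain_db)
    return [sample * factor for sample in samples]


# Hypothetical track measured at -17 LUFS: it needs 6 dB of attenuation.
gain_db = gain_to_target_db(-17.0)
quieter = apply_gain([0.5, -0.25, 1.0], gain_db)
```

<p>A quiet track gets a positive gain instead, which is where you need to watch for clipping. This sketch does nothing about that, nor about the LRA compression stage.</p>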
<p>Next up, user authentication. This was almost an afterthought, but added after talking to people about security. You register an account, and that account is either able to upload content only, upload and review (more on that in a sec), or administrate the system (ie modify users etc). This is done by user groups, which are pretty flexible, and easily adapted for your own usage via a simple permissions file. Uploads are linked to users- users can only see their uploads (unless they have permission to see more than that) and admins can see who owns a specific upload. You can also have emails sent on error conditions being met- so presenters know if a file they uploaded failed to make it to the playout system <em>before</em> they turn up to do their show and wonder where it is.</p>
<p>Metadata was one of the big issues I wanted to solve. Let&#8217;s say I have a track from a CD- I&#8217;ve ripped it and the ripper has embedded title/artist, maybe album metadata from Gracenote. For a playout system for radio this isn&#8217;t perfect- really we want to know the record label, copyright info, and so on. Enter MusicBrainz- a huge collaborative open database of music metadata. With some clever tools, IRIS matches up the track&#8217;s metadata to a MusicBrainz identifier and fills in the blanks. For most tracks, it can get everything- including ISRCs, year of release, and so on. This is great for music librarians and makes copyright reconciliation for PRS/PPL much simpler, since you&#8217;ve now got all this in a database.</p>
<p>Of course, if we know what the track is, we can do another useful step- especially so for student stations. Using the MusixMatch API service (commercial but free for nonprofits at present), we can get the lyrics to a track. This means we can do a quick once-over for words we don&#8217;t want on air (swearing). Of course, this assumes the track isn&#8217;t a radio edit. We do a quick check and skip the lyric pass for tracks that look like a radio edit, but if not, the track will be flagged for review.</p>
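<p>The lyric pass can be pictured roughly like this. It is a simplified Python sketch, not the IRIS implementation: the blocked word list is a placeholder, the radio edit check is a guess at one possible heuristic, and fetching the lyrics themselves is left out entirely.</p>

```python
import re

# Placeholder list; a real station would maintain its own.
BLOCKED_WORDS = {"damn", "hell"}


def looks_like_radio_edit(title: str) -> bool:
    """Hypothetical heuristic: trust titles that say they are a radio edit."""
    return "radio edit" in title.lower()


def needs_lyric_review(title: str, lyrics: str) -> bool:
    """Flag a track for human review if its lyrics contain a blocked word."""
    if looks_like_radio_edit(title):
        return False  # skip the lyric pass, as described above
    words = set(re.findall(r"[a-z']+", lyrics.lower()))
    return bool(words.intersection(BLOCKED_WORDS))


flagged = needs_lyric_review("Some Song", "Well, damn it all")
skipped = needs_lyric_review("Some Song (Radio Edit)", "Well, damn it all")
clean = needs_lyric_review("Other Song", "Nothing to see here")
```

<p>Matching on whole words rather than substrings avoids flagging innocent words that happen to contain a blocked one, at the cost of missing creative spellings. That is one reason a human review step still matters.</p>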
<p>Additionally, and this is intended specifically for situations where you have no control over the ingest quality that presenters are using to rip CDs or vinyl, tracks are flagged for review if they fail to meet quality restrictions. This can be specified as a function of sample rate and bit rate. This stops people trying to upload 96kbps MP3 at 32k sample rate. We don&#8217;t want that in the system. Not now, not ever. All of these parameters can be changed easily and simply by changing a single configuration file and restarting the app.</p>
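<p>As a sketch of what such a quality gate might look like, here it is in Python for brevity. The thresholds below are illustrative assumptions, not the defaults IRIS ships with, and in the real system they would come from the configuration file rather than constants.</p>

```python
# Assumed thresholds for illustration only.
MIN_SAMPLE_RATE_HZ = 44100
MIN_BIT_RATE_KBPS = 128


def needs_quality_review(sample_rate_hz: int, bit_rate_kbps: int) -> bool:
    """Flag an upload whose sample rate or bit rate is below the minimum."""
    too_low_rate = MIN_SAMPLE_RATE_HZ > sample_rate_hz
    too_low_bits = MIN_BIT_RATE_KBPS > bit_rate_kbps
    return too_low_rate or too_low_bits


low_quality = needs_quality_review(32000, 96)    # the 96kbps/32k MP3 above
acceptable = needs_quality_review(44100, 320)
```

<p>Flagging rather than rejecting outright keeps the door open for the odd archive recording that simply doesn&#8217;t exist in better quality.</p>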
<div id="attachment_1371" class="wp-caption aligncenter" style="width: 310px"><a href="http://assets.talkunafraid.co.uk/2011/10/IRIS-SystemArchitecture.png" rel="lightbox[1367]"><img class="size-medium wp-image-1371" title="System Architecture" src="http://assets.talkunafraid.co.uk/2011/10/IRIS-SystemArchitecture-300x225.png" alt="" width="300" height="225" /></a><p class="wp-caption-text">Overview of the system architecture</p></div>
<p>Once an upload is flagged for review an administrator or music librarian can review the track and either permanently reject it or approve it (in case of false positive lyric matches or odd uploads where quality restrictions aren&#8217;t able to be met).</p>
<p>Everything in IRIS is done through a web interface- uploads, management and monitoring. Every track has its own log of events, allowing administrators to debug and diagnose problems with ease, and giving clear and simple feedback to users. There are convenience functions such as automatic display of R128 loudness graphs pre/post normalization and compression, and display of all metadata available for tracks, plus lyrics if they were found.</p>
<p>The backend to IRIS is all Ruby and Rails, using a simple database server (PostgreSQL recommended) to store everything. Background processing is distributable over multiple computers with shared storage, allowing for CPU-intensive tasks to be spread across multiple machines. Given the R128 metering process includes a fourfold upsampling, this is particularly useful. You can run workers without running the whole web application, allowing you to install copies of the app onto lots of low-cost general purpose machines and have your own distributed ingest processing cluster on even a tight budget.</p>
<p>Of course, now you&#8217;ve got a track with lots of metadata and some normalized audio in WAV format (archived to FLAC pre-normalization, just in case you need the original audio). Now you need to get it into your playout system. Rivendell is supported, Myriad 3.6 now has dropbox support so you can just tell IRIS to export files to that dropbox in a Myriad-supported format, and you can also just do an export in any format you choose to an arbitrary folder. Export formats supported include FLAC, MP3, WAV, BWF (Broadcast WAVE Format) and AAC. Most of these flavours come with embedded metadata.</p>
<p>IRIS isn&#8217;t a perfect system, and it&#8217;s not an instant drop-in system; I don&#8217;t have the time to maintain it as such. What it is, though, is a flexible and powerful system that any average Linux user can install and have running in hours, and which can be used by any station looking to improve their import process.  The entire project is open source, and can be obtained <a href="https://github.com/JamesHarrison/iris">here</a>- there&#8217;s also a bugtracker and wiki with some documentation (unfinished) on it. If you&#8217;d like to contribute, feel free- as I&#8217;ve stepped out from student radio to work on student television, I&#8217;ve not got a huge amount of time to work on it at the moment.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2011/10/iris-the-interchangeable-radio-ingest-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The surprising thing about BlackBerry outages</title>
		<link>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/</link>
		<comments>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 09:36:36 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Cryptography]]></category>
		<category><![CDATA[Odds and Ends]]></category>
		<category><![CDATA[bbm]]></category>
		<category><![CDATA[blackberry]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1364</guid>
		<description><![CDATA[The surprising thing, to me at the least, whenever there&#8217;s a huge story all over the technology pages of the BBC or the Guardian about the BlackBerry Messenger/email services being down for huge periods of time, is that people are surprised at this. The internet has flourished and works so very well because it is [...]]]></description>
			<content:encoded><![CDATA[<p>The surprising thing, to me at the least, whenever there&#8217;s a huge story all over the technology pages of the BBC or the Guardian about the BlackBerry Messenger/email services being down for huge periods of time, is that people are surprised at this.</p>
<p>The internet has flourished and works so very well because it is decentralized, based on open protocols, and systems working together to let people communicate. Let&#8217;s just compare standard email with the BlackBerry flavour for a moment.<span id="more-1364"></span>If you send an email, your email client talks to a server in a protocol called SMTP, the Simple Mail Transfer Protocol. The server will then take that SMTP request and talk to the server which the target domain has nominated to be the mail server for that domain. This is found out using the open Domain Name System, and supports simple load balancing and failover by virtue of multiple MX records and priorities. The servers then have a quick conversation via SMTP, and at some point (assuming all goes to plan), your email is now sat in someone&#8217;s mailbox. Their client will talk to the server regularly (using POP3 or IMAP) and will pick up on the new email that just came in, and tell them about it.</p>
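<p>The failover behaviour that MX priorities give you can be sketched in a few lines of Python. The hostnames are placeholders and the delivery attempt is passed in as a function, so nothing here touches DNS or the network. It only illustrates the ordering rule: the lowest preference value is tried first.</p>

```python
def delivery_order(mx_records: list) -> list:
    """Sort (preference, host) pairs so the lowest preference value comes first.

    Real mail servers also spread load across hosts with equal preference;
    this sketch just sorts them by name for determinism.
    """
    return [host for preference, host in sorted(mx_records)]


def deliver(mx_records: list, try_host):
    """Try each mail server in order; return the first that accepts, or None."""
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None


# Placeholder records for an imaginary domain.
records = [(20, "backup.example.org"), (10, "mail.example.org")]
order = delivery_order(records)

# Simulate the primary being down: only the backup accepts the message.
accepted_by = deliver(records, lambda host: host.startswith("backup"))
```

<p>With the primary down, the message simply lands on the backup server. No single machine has to be up for mail to flow, which is the whole point.</p>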
<p>These protocols are all completely standard, simple and free to implement. They run over the standard TCP/IP network layer that makes up the internet, and there&#8217;s no centralization (except in the case of DNS, which is still massively decentralized compared to BBM). Every one of these connections can be encrypted in transit using the secure socket layer (SSL/TLS), a much more robust system for securing transmissions than BlackBerry&#8217;s unencrypted transports and symmetric keying.</p>
<p>Now let&#8217;s look at a typical BlackBerry user and compare them to an Android user (or iPhone, or Windows Phone, or even Palm Pilots &#8211; the protocols existed back then). Both are trying to send and receive emails. Most people rely on a free provider for their email, like Gmail or Hotmail, but many people run their own mail servers or use organizational servers (university mail servers, your office&#8217;s mail server, and so on). There&#8217;s a vast web of servers, all cooperating together to make things work. Enter the BlackBerry- you now remove those protocols from the device and instead put them into a mail server at the other end. This server talks SMTP to the rest of the world, and even speaks IMAP/POP to your mail server, behind the scenes- but access to your mails from your device requires the BlackBerry go-between to be functional. If that single point of failure fails, it&#8217;s all over- no connectivity for you.</p>
<p>So why would, on paper, you want to use the BlackBerry service? Well, BBM is an additional feature- group messaging based on&#8230; the BlackBerry equivalent of MAC addresses? Huh. Okay, so we&#8217;ve got a comms methodology from the 90s which basically implements a dumbed-down version of IRC or Jabber for your phone. Not like we&#8217;ve had that on other platforms for as long as I can remember &#8211; I was using Jabber on my phone years ago to talk to people on the other side of the world about EVE Online fleet operations instead of paying attention in maths lessons. Again, things like IRC and Jabber are open standard protocols, free to implement, and implemented in a manner that is widely distributed with little to no single points of failure. IRC was designed for the days of acoustic couplers, and is still used widely by millions of people. The BBM protocols aren&#8217;t even secure- they&#8217;re encrypted, but the key used for encryption is shared amongst all phones. It&#8217;s a bog standard symmetric crypto cipher. This means if you can convince a phone that it&#8217;s actually another phone, you can read any other person&#8217;s messages and send messages as them. Contrast with IRC &#8211; the network is still assumed secure (just as with BBM) and unencrypted, but links between servers and links between client and server can all use the secure socket layer- a very secure protocol.</p>
<p>On top of this we&#8217;ve got the push email feature &#8211; which lets you get emails sort-of-instantly (actually delayed unless your organization runs a BlackBerry Messaging Server, as I understand it) pushed to your phone instead of having to have the phone fetch new emails. Granted, that&#8217;s quite a nice feature. But it&#8217;s not a game-changer to the extent where I&#8217;d be willing to sacrifice my email connectivity on my device to have it- for days at a time, no less. In addition, there are open tools that will do this for you on Android and iPhone. I remember back when I was first using a Windows Mobile PDA, I set up push email in the course of a few days of hacking around with an off-the-shelf product. This was back in the days of the XDA, when push email was just kicking off, and it wasn&#8217;t too hard to make work. These days it&#8217;s even easier.</p>
<p>BlackBerry falls over because they are a quite heavily centralized company and their infrastructure is not designed for the modern age. Note that the BlackBerry announcement about their recent outages referred to &#8220;Europe, the Middle East and Africa&#8221;. One problem can hit all those users. That&#8217;s not a well-designed system, it&#8217;s a legacy system shoehorned in from the days when BlackBerry Messaging Server was the sort of thing you ran if you were a big company. It is bound to fail and when it fails it pisses off many millions of users.</p>
<p>The question of why people decided to go with BlackBerry remains a mystery to me- I honestly don&#8217;t know why people would choose a BlackBerry over any other phone. I have spent countless hours debugging and fixing broken BlackBerries for friends, and many more hours still trying to explain how to use them to other friends (who inevitably end up sending them back and getting a &#8216;droid). Perhaps there&#8217;s some secret feature I&#8217;m not seeing, but it just seems to me that the hardware&#8217;s mediocre, the software&#8217;s awful at best, and the infrastructure is flaky. Why choose that over solid and varied hardware choice, great software with flexibility, and open infrastructure with no reliance on single providers or services? I know which one of these combinations has given me years of flawless service with not one day without email.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

