<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Talk Unafraid &#187; Odds and Ends</title>
	<atom:link href="http://www.talkunafraid.co.uk/category/odds-and-ends/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.talkunafraid.co.uk</link>
	<description>The (occasionally coherent) ramblings of a geek</description>
	<lastBuildDate>Thu, 09 Feb 2012 02:27:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Serializing and scoping Mongoid criteria</title>
		<link>http://www.talkunafraid.co.uk/2012/02/serializing-and-scoping-mongoid-criteria/</link>
		<comments>http://www.talkunafraid.co.uk/2012/02/serializing-and-scoping-mongoid-criteria/#comments</comments>
		<pubDate>Thu, 09 Feb 2012 02:27:36 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Odds and Ends]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[ajax]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[mongoid]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[redis]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1381</guid>
		<description><![CDATA[So, while working on a project, I ran into a snag. I&#8217;ve got a partial which renders a list of images, and I want that partial to be auto-updated. That partial is used in multiple controllers and actions and is passed lots of different arbitrary data sets. How do you manage auto-updating that partial, and [...]]]></description>
			<content:encoded><![CDATA[<p>So, while working on a project, I ran into a snag. I&#8217;ve got a partial which renders a list of images, and I want that partial to be auto-updated. That partial is used in multiple controllers and actions and is passed lots of different arbitrary data sets. How do you manage auto-updating that partial, and additionally, gain the ability to access that list&#8217;s criteria in other areas? I also have a &#8216;grid view&#8217; of the images &#8211; I want to be able to pass my image list along to that and have it render it, but I still want to paginate (and on the grid view, pagination is done by different amounts).</p>
<p>Mongoid lets us build (much as ARel lets us do in ActiveRecord) criteria up, which consist of a few things, but mostly a selector (&#8220;Which records?&#8221;) and options (&#8220;How do you want them?&#8221;). So the answer is actually pretty straightforward &#8211; we serialize these objects and then use them to build our base criteria, on which we can then do pagination. Neat, right?</p>
<p>So, how do we do this? Easy, actually.<span id="more-1381"></span></p>
<p>Essentially we need a method for setting scope, and one for loading it. So let&#8217;s define those methods and set it up.</p>
<p><script src="https://gist.github.com/1776559.js"> </script></p>
<p>Okay &#8211; now we&#8217;ve got a scope key stored in our @list_scope instance variable. This refers to an entry in our cache store (in our case, Redis) which holds info on how to reconstruct that criteria. We&#8217;re tagging along a few special parameters, too, not just the essentials.</p>
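<p>To make the idea concrete outside of Rails, here is a minimal Python sketch of the same pattern. It is not Mongoid&#8217;s API and not the code in the gist above: the names <code>set_scope</code>, <code>load_scope</code> and <code>paginate</code> are hypothetical, and a plain dict stands in for Redis so that the example runs on its own.</p>

```python
import json
import uuid

# Stand-in for the cache store; the post uses Redis, a dict keeps this runnable.
CACHE = {}


def set_scope(selector, options, extra=None):
    """Serialize a query description and return the key it was stored under."""
    key = "scope:" + uuid.uuid4().hex
    CACHE[key] = json.dumps(
        {"selector": selector, "options": options, "extra": extra or {}}
    )
    return key


def load_scope(key):
    """Rebuild the stored query description, or return None if the key is unknown."""
    raw = CACHE.get(key)
    return json.loads(raw) if raw is not None else None


def paginate(scope, page, per_page):
    """Layer pagination on top of the stored options without mutating them."""
    options = dict(scope["options"])
    options["skip"] = (page - 1) * per_page
    options["limit"] = per_page
    return scope["selector"], options


# The list view and the grid view can share one key and page by different amounts.
key = set_scope({"tags": "landscape"}, {"sort": [["created_at", -1]]}, {"title": "Landscapes"})
selector, options = paginate(load_scope(key), page=3, per_page=20)
print(selector, options)
```

<p>Only the short key needs to travel in a link or a data- attribute; the selector and options stay on the server.</p>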
<p>Now that we&#8217;ve got this we can refer to that scope where we need to &#8211; for instance, in links to other controllers, or in data- attributes to specify regions to automatically reload on the page.</p>
<p><script src="https://gist.github.com/1776579.js"> </script></p>
<p>Pulling out a criteria (and any other data we might want) is trivial at this stage. This technique can let you throw arbitrary-complexity queries around in userspace, allowing for clever client-side stuff and a smoother workflow for many applications and users. Give it a shot and have a play with what you can do!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2012/02/serializing-and-scoping-mongoid-criteria/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Perceptual image and audio deduplication</title>
		<link>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/</link>
		<comments>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/#comments</comments>
		<pubDate>Sat, 07 Jan 2012 18:03:11 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Code Snippets and Examples]]></category>
		<category><![CDATA[Odds and Ends]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[dragonfly]]></category>
		<category><![CDATA[image processing]]></category>
		<category><![CDATA[opencv]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[rails]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1375</guid>
		<description><![CDATA[Okay, two months without a post, won&#8217;t happen again&#8230; So, lately I&#8217;ve been moving out from the broadcast area and getting back into webapp development, but some of the things I&#8217;ve been working on touch quite heavily on deduplication, of images and music. This is quite an interesting topic, so let&#8217;s have a look at [...]]]></description>
			<content:encoded><![CDATA[<p>Okay, two months without a post, won&#8217;t happen again&#8230;</p>
<p>So, lately I&#8217;ve been moving out from the broadcast area and getting back into webapp development, but some of the things I&#8217;ve been working on touch quite heavily on deduplication, of images and music. This is quite an interesting topic, so let&#8217;s have a look at what we can do now.</p>
<p>Doing exact deduplication &#8211; stopping someone uploading the same file twice to a website &#8211; is pretty easy. You just hash the uploaded file (or de-encapsulate the data and hash that if you want to be a little more resilient) with something like SHA256 or SHA512. It&#8217;s fast, effective and easy. Lookups are as fast as your RDBMS is. This works with images, audio, videos, you name it.</p>
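<p>As a rough sketch of that exact-match step, using Python&#8217;s standard <code>hashlib</code>: the <code>seen</code> dict and the <code>register_upload</code> name are assumptions for illustration, and in a real application the lookup would be an indexed column in your database.</p>

```python
import hashlib

# digest -> id of the first upload with that content; a database index in practice
seen = {}


def register_upload(upload_id, data):
    """Return the id of an earlier byte-identical upload, or None if this one is new."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in seen:
        return seen[digest]
    seen[digest] = upload_id
    return None


print(register_upload("first", b"same bytes"))   # new content
print(register_upload("second", b"same bytes"))  # duplicate of "first"
```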
<p>What&#8217;s much harder is doing perceptual deduplication, or content deduplication. If I upload two files which are the same except one&#8217;s a PNG and one&#8217;s a JPEG, I want the site to be able to say &#8220;Hang on, you&#8217;ve already uploaded that!&#8221; when I upload the second file. Similarly, what if we resize an image? We want something resistant to that sort of attack.<span id="more-1375"></span></p>
<p>There are already perceptual hashing libraries like pHash out there which work by generating hashes which look very similar to your average md5sum, but are actually generated perceptually &#8211; the Hamming distance between two hashes is a measure of the similarity of the images represented by those two hashes. This is a great thing, but it&#8217;s pretty useless on large datasets without some specialist software to manage databases of hashes and querying based on Hamming distance. The pHash guys will of course <em>sell </em>you this solution, but there&#8217;s the problem &#8211; there&#8217;s quite a bit of money to be made with this sort of product, and useful open source implementations seem to be quite rare if not non-existent.</p>
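<p>The Hamming distance itself is simple to compute. The sketch below assumes hashes are held as plain integers and does not use the pHash library; <code>closest</code> is a hypothetical helper. It also shows why large datasets hurt: without a specialised index, every query is a linear scan over every stored hash.</p>

```python
def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def closest(query, known_hashes, max_distance=8):
    """Linear scan: return (distance, hash) pairs within max_distance, nearest first."""
    hits = [(hamming_distance(query, h), h) for h in known_hashes]
    return sorted(pair for pair in hits if pair[0] <= max_distance)


print(hamming_distance(0b1011, 0b0001))
print(closest(0b1111, [0b1110, 0b0000], max_distance=1))
```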
<p>pHash is also a bit of an odd thing in that it works on multiple media types &#8211; audio, video and images. More specific to audio are fingerprinting techniques like <a href="http://acoustid.org/">AcoustID</a>, which aim to generate a specific fingerprint for each recording of a song. Distance between songs isn&#8217;t something we&#8217;re very interested in, because audio releases are typically few in number, and rarely are we looking for a song which sort of sounds like X &#8211; if it sounds different, it&#8217;s a different recording of the song, or has been remastered.</p>
<p>Images are very different because people often make small changes, or change the formatting or file type of an image, and these circulate and get thrown around all over the place. We want to be able to accept things that are legitimately changed, but flag up things that are almost perfectly identical to something we already have in a database.</p>
<p>So, what can we do? Well, we can use simpler techniques to reduce the number of images to compare down to a small number &#8211; say, 50 &#8211; and then compare the Hamming distance on full pHashes using that technique. But do we actually need this? Certainly for some databases, merely using those simpler techniques may well be sufficient.</p>
<p>I&#8217;ve been building an image board, lately, which I&#8217;m using Dragonfly for, an awesome on-the-fly image serving system. We&#8217;re storing everything in MongoDB, with GridFS for file storage. Here&#8217;s what we&#8217;re doing.</p>
<p>First, we need some analysis of the image contents. I&#8217;m using a roughly perceptually weighted average intensity metric. Here&#8217;s a tiny little Python script using Python-OpenCV&#8217;s SWIG bindings to perform that analysis rapidly.</p>
<p><script type="text/javascript" src="https://gist.github.com/1575463.js?file=simple_analyser.py"></script></p>
<p>Now we need to get Dragonfly in on this, and register a function we&#8217;ll use to handle GIFs. This goes in our Dragonfly initializer.</p>
<p><script type="text/javascript" src="https://gist.github.com/1575463.js?file=dragonfly.rb"></script></p>
<p>And finally we need to store all this and actually do the find-duplicates step. Note the aspect ratio stuff &#8211; with the ImageMagick analyser, Dragonfly will handle storing this for you, which makes life easier.</p>
<p><script src="https://gist.github.com/1575463.js?file=image.rb"></script></p>
<p>In our find_duplicates function we just look for images that have both a similar intensity and a similar aspect ratio. If both of these things are very close we&#8217;ve got ourselves a potential duplicate. We&#8217;re not doing <em>exact</em> matching on intensity/aspect ratio, more fuzzy matching, because compression changes and resizes can often affect both slightly.</p>
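<p>Stripped of the database, the fuzzy match boils down to two tolerance checks. This Python sketch is an illustration of the idea rather than the model code in the gist: the field names, the list of dicts standing in for the collection, and both tolerance values are assumptions you would tune against your own data.</p>

```python
def find_duplicates(candidate, images, intensity_tol=0.01, aspect_tol=0.02):
    """Return stored images whose intensity and aspect ratio are both close to the candidate's."""
    matches = []
    for image in images:
        close_intensity = abs(image["intensity"] - candidate["intensity"]) <= intensity_tol
        close_aspect = abs(image["aspect_ratio"] - candidate["aspect_ratio"]) <= aspect_tol
        if close_intensity and close_aspect:
            matches.append(image)
    return matches


stored = [
    {"id": 1, "intensity": 0.500, "aspect_ratio": 1.333},
    {"id": 2, "intensity": 0.500, "aspect_ratio": 1.778},  # same brightness, different shape
    {"id": 3, "intensity": 0.700, "aspect_ratio": 1.333},  # same shape, different brightness
]
upload = {"intensity": 0.503, "aspect_ratio": 1.335}
print([image["id"] for image in find_duplicates(upload, stored)])
```

<p>In a database this becomes a pair of range conditions, which ordinary indexes can serve.</p>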
<p>Of course, the proof is in the pudding- this site only has a hundred images in it now, and we may need to adjust the distance variable, but so far it&#8217;s working very well and isn&#8217;t bringing up any false positives while correctly identifying duplicates. The way we&#8217;re doing this from a workflow perspective is important, too &#8211; when we do find duplicates, we ask the uploader of the image to check and make sure they&#8217;re not uploading anything we already have by presenting the duplicate images to them. They can then mark the duplicate (which deletes the image they just uploaded and leaves a redirect to the image their upload duplicates) or confirm it&#8217;s not a duplicate. The statistics from those actions will be interesting to see over time as a metric of success for this approach!</p>
<p>So, that&#8217;s how I&#8217;ve done image dedupe in a Rails app &#8211; what approach are you using?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2012/01/perceptual-image-and-audio-deduplication/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The surprising thing about BlackBerry outages</title>
		<link>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/</link>
		<comments>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/#comments</comments>
		<pubDate>Wed, 12 Oct 2011 09:36:36 +0000</pubDate>
		<dc:creator>James Harrison</dc:creator>
				<category><![CDATA[Cryptography]]></category>
		<category><![CDATA[Odds and Ends]]></category>
		<category><![CDATA[bbm]]></category>
		<category><![CDATA[blackberry]]></category>
		<category><![CDATA[email]]></category>
		<category><![CDATA[rant]]></category>

		<guid isPermaLink="false">http://www.talkunafraid.co.uk/?p=1364</guid>
		<description><![CDATA[The surprising thing, to me at the least, whenever there&#8217;s a huge story all over the technology pages of the BBC or the Guardian about the BlackBerry Messenger/email services being down for huge periods of time, is that people are surprised at this. The internet has flourished and works so very well because it is [...]]]></description>
			<content:encoded><![CDATA[<p>The surprising thing, to me at the least, whenever there&#8217;s a huge story all over the technology pages of the BBC or the Guardian about the BlackBerry Messenger/email services being down for huge periods of time, is that people are surprised at this.</p>
<p>The internet has flourished and works so very well because it is decentralized, based on open protocols, and systems working together to let people communicate. Let&#8217;s just compare standard email with the BlackBerry flavour for a moment.<span id="more-1364"></span></p>
<p>If you send an email, your email client talks to a server in a protocol called SMTP, the Simple Mail Transfer Protocol. The server will then take that SMTP request and talk to the server which the target domain has nominated to be the mail server for that domain. This is found out using the open Domain Name System, and supports simple load balancing and failover by virtue of multiple MX records and priorities. The servers then have a quick conversation via SMTP, and at some point (assuming all goes to plan), your email is now sat in someone&#8217;s mailbox. Their client will talk to the server regularly (using POP3 or IMAP) and will pick up on the new email that just came in, and tell them about it.</p>
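<p>The failover behaviour that MX priorities give you can be sketched in a few lines of Python. This is a simplification, not a mail server: the hostnames are made up, <code>try_host</code> is a hypothetical callback standing in for a real SMTP conversation, and real servers also spread load across records that share a preference value.</p>

```python
def delivery_order(mx_records):
    """Sort (preference, host) pairs so the lowest preference value is tried first."""
    return [host for _, host in sorted(mx_records)]


def deliver(mx_records, try_host):
    """Try each mail server in order; return the first host that accepts, else None."""
    for host in delivery_order(mx_records):
        if try_host(host):
            return host
    return None


records = [(20, "backup.example.org"), (10, "mail.example.org")]
print(delivery_order(records))
# Pretend the primary is down: delivery falls through to the backup.
print(deliver(records, lambda host: host.startswith("backup")))
```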
<p>These protocols are all completely standard, simple and free to implement. They run over the standard TCP/IP network layer that makes up the internet, and there&#8217;s no centralization (except in the case of DNS, which is still massively decentralized compared to BBM). Every connection along the way can be encrypted using SSL/TLS (the Secure Sockets Layer and its successor), a much more robust system for securing transmissions than BlackBerry&#8217;s unencrypted transports and symmetric keying.</p>
<p>Now let&#8217;s look at a typical BlackBerry user and compare them to an Android user (or iPhone, or Windows Phone, or even Palm Pilots &#8211; the protocols existed back then). Both are trying to send and receive emails. Most people rely on a free provider for their email, like Gmail or Hotmail, but many people run their own mail servers or use organizational servers (university mail servers, your office&#8217;s mail server, and so on). There&#8217;s a vast web of servers, all cooperating together to make things work. Enter the BlackBerry- you now remove those protocols from the device and instead put them into a mail server at the other end. This server talks SMTP to the rest of the world, and even speaks IMAP/POP to your mail server, behind the scenes- but access to your mails from your device requires the BlackBerry go-between to be functional. If that single point of failure fails, it&#8217;s all over- no connectivity for you.</p>
<p>So why would, on paper, you want to use the BlackBerry service? Well, BBM is an additional feature- group messaging based on&#8230; the BlackBerry equivalent of MAC addresses? Huh. Okay, so we&#8217;ve got a comms methodology from the 90s which basically implements a dumbed-down version of IRC or Jabber for your phone. Not like we&#8217;ve had that on other platforms for as long as I can remember &#8211; I was using Jabber on my phone years ago to talk to people on the other side of the world about EVE Online fleet operations instead of paying attention in maths lessons. Again, things like IRC and Jabber are open standard protocols, free to implement, and implemented in a manner that is widely distributed with little to no single points of failure. IRC was designed for the days of acoustic couplers, and is still used widely by millions of people. The BBM protocols aren&#8217;t even secure- they&#8217;re encrypted, but the key used for encryption is shared amongst all phones. It&#8217;s a bog standard symmetric crypto cipher. This means if you can convince a phone that it&#8217;s actually another phone, you can read any other person&#8217;s messages and send messages as them. Contrast with IRC &#8211; the network is still assumed secure (just as with BBM) and unencrypted, but links between servers and links between client and server can all use the secure socket layer- a very secure protocol.</p>
<p>On top of this we&#8217;ve got the push email feature &#8211; which lets you get emails sort-of-instantly (actually delayed unless your organization runs a BlackBerry Enterprise Server, as I understand it) pushed to your phone instead of having to have the phone fetch new emails. Granted, that&#8217;s quite a nice feature. But it&#8217;s not a game-changer to the extent where I&#8217;d be willing to sacrifice my email connectivity on my device to have it- for days at a time, no less. In addition, there are open tools that will do this for you on Android and iPhone. I remember back when I was first using a Windows Mobile PDA, I set up push email in the course of a few days of hacking around with an off-the-shelf product. This was back in the days of the XDA, when push email was just kicking off, and it wasn&#8217;t too hard to make work. These days it&#8217;s even easier.</p>
<p>BlackBerry falls over because they are a quite heavily centralized company and their infrastructure is not designed for the modern age. Note that the BlackBerry announcement about their recent outages referred to &#8220;Europe, the Middle East and Africa&#8221;. One problem can hit all those users. That&#8217;s not a well-designed system, it&#8217;s a legacy system shoehorned in from the days when BlackBerry Enterprise Server was the sort of thing you ran if you were a big company. It is bound to fail and when it fails it pisses off many millions of users.</p>
<p>The question of why people decided to go with BlackBerry remains a mystery to me- I honestly don&#8217;t know why people would choose a BlackBerry over any other phone. I have spent countless hours debugging and fixing broken BlackBerries for friends, and many more hours still trying to explain how to use them to other friends (who inevitably end up sending them back and getting a &#8216;droid). Perhaps there&#8217;s some secret feature I&#8217;m not seeing, but it just seems to me that the hardware&#8217;s mediocre, the software&#8217;s awful at best, and the infrastructure is flaky. Why choose that over solid and varied hardware choice, great software with flexibility, and open infrastructure with no reliance on single providers or services? I know which one of these combinations has given me years of flawless service with not one day without email.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.talkunafraid.co.uk/2011/10/the-surprising-thing-about-blackberry-outages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

