Sustainable low-budget infrastructure

This month I finish my university career, and with that move I will sadly stop working at Insanity Radio, the student (now community) radio station whose tech I’ve been running for about three years. Needless to say, I’m going to miss the place and the people, and the challenges that came with that environment.

Specifically: no budget (a total income of about £3,000 a year, compared with the £75,000 average for community radio stations reported by Ofcom). No paid full-time staff. And a desire for 100% availability regardless.

Over the years the systems at Insanity have evolved and grown. We started out with a single playout computer and a single encoder and streaming server running Windows Media Encoder, with perhaps 50% availability in the best case. We now run 21 computers, are deploying high-availability clusters for streaming and encoding, and have very few single points of failure. Back in 2009 we had significant amounts of dead air; aside from one processor failure, we’ve had very few incidents since 2011.

Building systems for reliability on no money is tricky, and it’s even harder when the people maintaining the infrastructure can change every year. This post is a quick run-through of some of the most important things to focus on to make such a situation work, not just from a technological perspective but from a human one too.

Power to the People

In a volunteering situation, recruitment is important. We’re lucky in that two departments on campus, computer science and physics, require all of their students to have a basic understanding of Linux. However, in our last elections no significant outreach was attempted and nobody ran for head of technology, our chief engineer role. I went out and talked to some people who had expressed interest to me personally, and hopefully they’ll stick with the station next year. It’s crucial to understand the mindset of the people you’re trying to encourage to get involved.

One of the main political dramas this year involved a proposed takeover of our engineering by our students’ union’s commercial services department. I and others involved have opposed the move because it would dramatically limit the creative and experimental freedom of the people running the tech at the station, which is harmful both for the station and for recruitment; the mere suggestion of it has driven people away from even considering running the place. Not to mention that down this route lie significantly higher costs. The people who do well at technological management are the ones who can run some experiments, take some risks, and do so with the station’s requirements in mind. For instance, we tested a new logging system for months alongside our existing one before replacing it, and all of my little experiments that make it into production get a lot of testing beforehand.

But the crucial thing is that I have freedom in my role. I’m accountable to the production board and ultimately the board of directors, and if I screw up I’m responsible and have to explain myself. I still have to justify what I want to do to the board and the managers, but that’s fine: I’m the decision-maker, and I get to implement the ideas I come up with. People lose interest if all they ever do is implement someone else’s plans, or draw up plans that someone else implements.

Strength in Numbers

So that’s the human aspect. Computers are more fickle in some ways. Of the 21 machines we run at Insanity, 2 are provided by the university (our production workstations, running Windows), 1 was bought by Insanity (a specialist playout system with an £800 four-channel sound card), and 18 were donated or, in most cases, rescued from skips. They’re of questionable quality, age and capability. They are, however, free.

The trick is in structuring and planning your systems so that if a computer dies, it’s not the end of the world. You do this by never giving any task to just one computer. We’ve still got a few single points of failure (SPOFs) in our network and streaming infrastructure, but much of what we do is aimed at mitigating them; for instance, I’m currently finishing off a redundant streaming cluster that uses IP failover across our two icecast servers, plus fallbacks within icecast, to support two sets of streaming encoders. We can thus tolerate one encoder and one streaming server failing before we have to panic. And because we keep spare hardware and computers around (which we do) and plan things well, provisioning a replacement in the event of failure is pretty trivial.
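
By way of illustration, here’s a rough sketch (Python 2, which is what Ubuntu 12.04 ships) of the kind of check you could point Nagios at to tell you whether a cluster like that still has failover headroom. The hostnames and mount point are invented, not our real setup.

    #!/usr/bin/env python
    # Rough sketch of a cluster health check in the Nagios plugin style.
    # Hostnames and the mount point are invented for illustration.
    import sys
    import urllib2

    ICECAST_SERVERS = ["stream-a.example.org", "stream-b.example.org"]
    MOUNT = "/live"

    def mount_is_up(host):
        """Return True if this icecast server answers for the mount."""
        try:
            # urlopen returns once the headers arrive; a 200 means the mount is live.
            return urllib2.urlopen("http://%s:8000%s" % (host, MOUNT),
                                   timeout=5).getcode() == 200
        except Exception:
            return False

    up = [host for host in ICECAST_SERVERS if mount_is_up(host)]

    # Standard Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    if len(up) == len(ICECAST_SERVERS):
        print "OK: every icecast server is serving %s" % MOUNT
        sys.exit(0)
    elif up:
        print "WARNING: only %s serving %s; no failover headroom left" % (", ".join(up), MOUNT)
        sys.exit(1)
    else:
        print "CRITICAL: nothing is serving %s" % MOUNT
        sys.exit(2)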

Standard IT management practices like on- and off-site backups always apply, but for something like output logging, where losing even a minute of content can be disastrous, the only real option is to run multiple loggers, on-site and off-site where possible, so there is full redundancy around the clock. We log twice in the studio and a third time at our transmission site, for instance.
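
For the sake of illustration, an off-site logger could be as dumb as the sketch below, which just pulls the public stream and writes an hour of audio per file; the point is that several of these run independently, so no single failure loses content. The stream URL and archive path here are invented.

    #!/usr/bin/env python
    # Sketch of a dumb, independent stream logger (Python 2).
    # The stream URL and archive directory are invented for illustration.
    import time
    import urllib2

    STREAM_URL = "http://stream-a.example.org:8000/live"
    ARCHIVE_DIR = "/srv/logging"
    CHUNK_SECONDS = 3600  # roll over to a new file every hour

    while True:
        filename = time.strftime(ARCHIVE_DIR + "/%Y-%m-%d_%H%M.mp3")
        started = time.time()
        try:
            stream = urllib2.urlopen(STREAM_URL, timeout=10)
            with open(filename, "wb") as out:
                # Copy audio to disk until the hour is up, then start a new file.
                while time.time() - started < CHUNK_SECONDS:
                    data = stream.read(4096)
                    if not data:
                        break  # source dropped; reconnect on the next pass
                    out.write(data)
        except Exception:
            # Never die: the other loggers are still recording, and this one
            # should keep retrying until the stream comes back.
            time.sleep(5)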

Keeping Track

We use Nagios to monitor systems, processes, availability of services and so on. Nagios periodically pings all of our computers and checks that the services on them are healthy. We also lean heavily on Smokeping to make sure latency is within acceptable bounds on network links across campus and to external services; this helps enormously in identifying failure points (“oh, we have 80% packet loss to the outside world, so it’s probably the JANET connection and not our servers”). One machine sat in our office runs CoffeeSaint so that everyone knows the second anything goes wrong and it can be addressed quickly, and Nagios also emails us geeks.
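
Smokeping does this job far better, but to give an idea of the sort of thing involved, a packet-loss check in the same spirit might look like the sketch below (again Python 2; the target host and thresholds are made up).

    #!/usr/bin/env python
    # Sketch of a packet-loss check in the Smokeping/Nagios spirit.
    # The target host and thresholds are invented for illustration.
    import re
    import subprocess
    import sys

    TARGET = "gateway.example.ac.uk"
    WARN_LOSS, CRIT_LOSS = 10, 50  # percent packet loss

    # Send ten pings quietly and pull the loss figure out of the summary line.
    output = subprocess.Popen(["ping", "-c", "10", "-q", TARGET],
                              stdout=subprocess.PIPE).communicate()[0]
    match = re.search(r"(\d+)% packet loss", output)
    loss = int(match.group(1)) if match else 100

    if loss >= CRIT_LOSS:
        print "CRITICAL: %d%% packet loss to %s" % (loss, TARGET)
        sys.exit(2)
    elif loss >= WARN_LOSS:
        print "WARNING: %d%% packet loss to %s" % (loss, TARGET)
        sys.exit(1)
    print "OK: %d%% packet loss to %s" % (loss, TARGET)
    sys.exit(0)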

As well as keeping track of the current health and state of our machines, we keep all the tools we need to provision things quickly to hand. Using open source software means we don’t have to worry about licensing (our playout system, a closed source product called Myriad, has failed outside office hours a few times, leaving us playing out from VLC and iPods for half a day or a whole weekend until the licensing team at PSquared were back in the office), and setting up spares is easy. There are a couple of machines sat around the studio, unconnected but ready, with clean installs of Ubuntu 12.04 Server loaded onto them. If anything goes wrong we can just drop one in, install the specific software we need, and configure it.

I’m also looking at implementing change management software to keep track of changes made. This is probably a very good idea; as it stands, every time I go in and do something I send an email to the managers, the other tech team members and anyone else potentially affected, explaining what I’ve done and why. Email isn’t the best medium for this, though. The holy grail would be something that integrates nicely with Nagios, but I’ve found nothing of that nature yet.

Configuration files are another tricky bit. This could be solved by using puppet or chef and provisioning systems intelligently, but we don’t have a spare machine for a puppet/chef master to run on, and the infrastructure around those tools is quite involved. What we do have is a set of git repositories storing our main configuration files for liquidsoap, icecast2 and so on. That makes configuring a machine as simple as cloning the repository, copying the files into place and restarting services. Job done, and much easier to keep maintainable than puppet or chef: no need to learn another DSL or figure out how to make the tool work, no bootstrapping, and so on.
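
To give a flavour of how little machinery this needs, a provisioning helper along these lines would do the job. The repository URL, file mapping and service names below are placeholders rather than our actual layout.

    #!/usr/bin/env python
    # Sketch of the "clone, copy, restart" step for bringing up a spare (Python 2).
    # Repository URL, file mapping and service names are placeholders.
    import shutil
    import subprocess

    CONFIG_REPO = "git@git.example.org:insanity/configs.git"
    CHECKOUT = "/tmp/configs"

    # Which file in the repo goes where, and which service to kick afterwards.
    FILES = [
        ("icecast/icecast.xml", "/etc/icecast2/icecast.xml", "icecast2"),
        ("liquidsoap/encoder.liq", "/etc/liquidsoap/encoder.liq", "liquidsoap"),
    ]

    subprocess.check_call(["git", "clone", CONFIG_REPO, CHECKOUT])

    for source, destination, service in FILES:
        shutil.copy(CHECKOUT + "/" + source, destination)
        subprocess.check_call(["service", service, "restart"])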

Analysis and Forecasting

The other thing Nagios gives us is the ability to look at availability and uptime over long periods and spot trends. This helps guide infrastructure choices and, ultimately, where to spend your limited budget and time. If your playout system is up 99.98% of the time, for instance, buying a spare playout machine is a worse use of the money than fixing whatever is keeping the stream encoders at 98%.
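
To put those percentages in perspective, the arithmetic is worth doing explicitly (figures from the example above):

    # Quick back-of-the-envelope: what does an uptime percentage cost per year?
    HOURS_PER_YEAR = 365 * 24  # 8,760

    for name, uptime in [("playout", 99.98), ("stream encoders", 98.0)]:
        downtime = HOURS_PER_YEAR * (100 - uptime) / 100.0
        print "%s: roughly %.1f hours of downtime a year" % (name, downtime)

    # playout: roughly 1.8 hours of downtime a year
    # stream encoders: roughly 175.2 hours of downtime a year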

Nagios also helps with figuring out what is likely to break outright next: hardware that has minor hiccups is fairly likely to be on its way out, so if a machine starts dropping packets out of the blue or its programs start crashing at random, you can make plans before it dies completely.

Action vs Reaction, Disaster Recovery Planning

The most important part of managing low-budget infrastructure is to try to detach action from reaction. Say something breaks. Clearly you need to fix it, and if it’s a SPOF you need to fix it right now. But approaching everything this way leads to poor planning, poorly implemented systems and, inevitably, bad cable management, as you bodge things together in ways you never planned to.

First up, managing the unexpected is as important as managing the expected. Assign some of your budget to buying stuff you might never use but which would let you do bodges more neatly: long network cables, long power cables, spare hard disks, cable ties, audio interfaces, long audio cables, and plenty of patch cables and bodge adapters (XLR to phono and so on). Don’t buy crap; get stuff you’re happy to leave in production indefinitely if it comes to it. Being able to bodge things together neatly at no notice will save you some headaches, and turns unmaintainable bodges into ones that can at least be left in place and maintained if needed.

Planning for failure is the other part of it. Go through every bit of your infrastructure and figure out what could break. For each bit that can break (hint: everything), work out what you would need to fix it, in the short term and the long term. What are the acceptable workarounds? What do you need to implement them? Are there common workarounds for multiple problems, so that buying one thing acceptably covers problems X, Y and Z? This sort of thinking helps hugely when things do go wrong, particularly if you acquire in advance whatever you need to fix the most common problems, so nobody has to leave the room to go and buy parts. And crucially, for anything your licence requires, make sure you can reasonably stay compliant in the event of a failure.


I hope there’s some good advice hidden in that wall of text, and I really hope people can put any advice they find to good use. I’m going to miss student radio a heck of a lot, and I hope that everyone involved with Insanity Radio next year has as much fun as I did running the place.