Visualising the Ruby Global VM Lock

I’m working on Ruby bindings for Ceph’s RADOS client library – it’s the first native C Ruby extension I’ve written so I’m learning lots of new things.

I’m keen to ensure my extension releases Ruby’s Global VM Lock (GVL) wherever it’s waiting on IO, so that other threads can do work, and I’ve written a few simple test scripts to prove to myself it’s working correctly. The result is a textual visualisation of how releasing the GVL can improve the behaviour of threads in Ruby.

For example, I just added basic read and write support to my library so you can read and write objects stored in a Ceph RADOS cluster. My first pass was written without releasing the GVL – it just blocks waiting until Ceph has completed the read or write.

My test script starts three threads: one doing rados write operations in a loop and outputting a “w” to STDOUT when they succeed, one doing rados read operations and writing an “r”, and one just doing some CPU work in Ruby and writing a “.”
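A minimal sketch of a test script along those lines is shown here. It is not the actual script: fake_rados_write and fake_rados_read are hypothetical stand-ins that just sleep to simulate a blocking network round trip, and run_workers is a name invented for this example. Note that sleep releases the GVL, so this sketch can only show the "after" behaviour; reproducing the "before" behaviour needs a native call that blocks while holding the lock.

```ruby
# Hypothetical stand-ins for the rados calls. sleep simulates a blocking
# network round trip (and, unlike a naive C extension, releases the GVL).
def fake_rados_write
  sleep 0.01
end

def fake_rados_read
  sleep 0.01
end

# Pure Ruby CPU work, which always needs the GVL.
def cpu_work
  (1..20_000).reduce(:+)
end

# Runs the three workers for `duration` seconds, printing one mark per
# completed iteration, and returns how many iterations each one managed.
def run_workers(duration, out = $stdout)
  deadline = Time.now + duration
  jobs = {
    "w" => method(:fake_rados_write),
    "r" => method(:fake_rados_read),
    "." => method(:cpu_work)
  }

  threads = jobs.map do |mark, job|
    Thread.new do
      count = 0
      while Time.now < deadline
        job.call
        out.print(mark)
        count += 1
      end
      [mark, count]
    end
  end

  threads.map(&:value).to_h
end
```

Calling run_workers(2) prints the interleaved marks for two seconds and returns a hash of iteration counts per mark, which gives a rough number to compare between runs.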

This is the output from the script before I added GVL releases:

As you can see, it’s almost as if Ruby is switching round-robin style between the threads, waiting for each one to complete one iteration. In some cases, the CPU worker doesn’t get a look in for several read and write iterations!

So then I extracted the blocking parts out to a separate function and called them using Ruby 1.9’s rb_thread_blocking_region function, which releases the GVL, and then reran my test script:

As you can see, the thread doing CPU work in Ruby gets considerably more work done when the GVL is released. Those network-based IO operations can block for quite some time.

It’s exactly what is expected, but it’s neat to see it in action so clearly.

The code for the library is here on github, but it’s under heavy development at the moment and is in no way complete – I’ve only pushed it out so early so I can write this blog. And this is the commit showing just where I made the read/write operations release the GVL.

Beautiful command-line interface design talk

I spoke about writing beautiful command-line interfaces at Scottish Ruby Conference back in June and they’ve published the video, which is freely available for viewing now.

The slides are available here in pdf format (if you’re interested, they were made using emacs org mode and beamer).

There were loads of great talks recorded so check out the videos of them all here on the schedule.

Rate limiting with Apache and mod-security

Rate limiting by request in Apache isn’t easy, but I finally figured out a satisfactory way of doing it using the mod-security Apache module. We’re using it at Brightbox to prevent buggy scripts rinsing our metadata service. In particular, we needed the ability to allow a high burst of initial requests, as that’s our normal usage pattern. So here’s how to do it.

Install mod-security (on Debian/Ubuntu, just install the libapache2-modsecurity package) and configure it in your virtual host definition like this:

SecRuleEngine On

<LocationMatch "^/somepath">
  SecAction initcol:ip=%{REMOTE_ADDR},pass,nolog
  SecAction "phase:5,deprecatevar:ip.somepathcounter=1/1,pass,nolog"
  SecRule IP:SOMEPATHCOUNTER "@gt 60" "phase:2,pause:300,deny,status:509,setenv:RATELIMITED,skip:1,nolog"
  SecAction "phase:2,pass,setvar:ip.somepathcounter=+1,nolog"
  Header always set Retry-After "10" env=RATELIMITED
</LocationMatch>

ErrorDocument 509 "Rate Limit Exceeded"
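To make the behaviour of those rules easier to reason about, here is a rough Ruby model of the per-IP counter. It is only a sketch of the logic as I read it from the config, not mod-security's implementation: the class name SomePathCounter and its methods are invented for this example, and it applies the one-per-second decay at the start of each request, whereas the config applies it in phase 5.

```ruby
# A rough model of the ip.somepathcounter logic: every allowed request
# adds 1, the counter decays by 1 per second, and requests are denied
# while the counter is above the limit.
class SomePathCounter
  def initialize(limit: 60, decay_per_second: 1)
    @limit = limit
    @decay = decay_per_second
    @count = 0
    @last_decay = nil
  end

  # `now` is a time in seconds. Returns true if the request is allowed.
  def allow?(now)
    @last_decay ||= now
    elapsed = (now - @last_decay).floor
    if elapsed > 0
      # Models deprecatevar: drop the counter for each whole second passed.
      @count = [@count - elapsed * @decay, 0].max
      @last_decay += elapsed
    end

    # Models the "@gt 60" rule; denied requests do not bump the counter.
    return false if @count > @limit

    @count += 1
    true
  end
end

counter = SomePathCounter.new
burst = Array.new(62) { counter.allow?(0) }
puts burst.count(true)    # allowed requests in the initial burst
puts counter.allow?(10)   # a request ten seconds later
```

In this model a client can burst through 61 requests at once, the 62nd is denied, and after ten quiet seconds the counter has decayed far enough for requests to pass again.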

Continue reading Rate limiting with Apache and mod-security

Full text indexing of syslog messages with Riak

I’ve just released a little tool I wrote called riak-syslog which takes your syslog messages and puts them into a Riak cluster and then lets you search them using Riak’s full text search.

Rather than re-implement the wheel, riak-syslog expects that a syslog daemon will handle receiving syslog messages and will be able to provide them in a specific format – there is documentation on getting this running with rsyslog on Ubuntu.

I’ve used it to gather and store a few hundred gig of syslogs over the last several months on a small internal Riak cluster on Brightbox Cloud and it’s working well (which can’t be said of a similar setup I did with Solr, which caved in after a while and needed some fine tuning!)

There is documentation on getting it set up in the README, and some examples of how to conduct searches too.

If you want to play with Riak, you can build a four node cluster spanning two data-centres in five minutes on Brightbox Cloud.

You might also be interested in my post about indexing syslog messages with Solr.

Documentation that tells a story

When reading technical documentation I too often come across examples like this:

let’s assume you have a client called foo and a server called bar

or command examples like:

mysqldump -h server1 | mysql -h server2

When I write documentation, I prefer to tell a story. What is the client called? Steven? Are we taking a mysqldump of a production server and writing it to a staging server?

Human brains like stories. It’s much easier to keep track of facts if they have some kind of meaning. Many memory improvement techniques use stories to link things together. And when you’re reading documentation, you’re usually learning some new concept anyway – so you’re adding unnecessary cognitive load by using abstract labels like Foo and Bar or A and B.

In my Git submodules post I name the two example projects your_project and other_project and use those names consistently throughout. You never have to remember whether “Foo” is the remote project or not.

One of my own favourites is an old heartbeat cluster guide I wrote. It involves two clusters, each of which consists of two servers working together. I named the first cluster JuliusCaesar, naming the two nodes Julius and Caesar. The second cluster is called MarcusAurelius. Throughout the documentation, I’m able to refer to any server just by its name and you can know where it is in the network.

It’s part of why I like using rspec to do testing, because it encourages you to tell a story rather than to just test arbitrary values.

So, put some thought into your examples. Tell a story. Make it easier for the reader to keep track of all this new stuff they’re learning.

Inside Google Plus

Steven Levy interviewed Google’s Bradley Horowitz about Google+:

Wired: Some users are chafing at Google’s insistence that they provide real names. Explain the policy against pseudonyms.

Horowitz: Google believes in three modes of usage—anonymous, pseudonymous, and identified, and we have a spectrum of products that use all three. For anonymity, you can go into incognito mode in Chrome and the information associated with using the browser is not retained. Gmail and Blogger are pseudonymous—you can go be captainblackjack@gmail.com. But with products like Google Checkout, you’re doing a financial transaction and you have to use your real name. For now, Google+ falls into that last category. There are great debates going on about this—I saw one comment yesterday that claimed that pseudonyms protect the experience of women in the system. I felt compelled to respond, because I’ve gotten feedback from women who say that the accountability of real names makes them feel much more comfortable in Google+.

Notice that Horowitz did not answer the question, and what he did say was just ridiculous nonsense. Steven Levy at Wired didn’t seem to notice, or care.

Horowitz tries to make us think that we need our real name when making a financial transaction. Thousands of years of currency prove that is not the case.

Horowitz then goes on to blurrily equate making a financial transaction with sharing videos of cats on Google+.

And then the cherry on the top: Google+ protects women.

This was the closest there was to a serious question in the whole interview and Horowitz just laughed out of his arse at it.

Continue reading Inside Google Plus

Redirecting outgoing mail with Postfix

We have various staging deployments of our systems at Brightbox and need to test that the emails they send are correct. We have a bunch of test accounts registered with various email addresses and we wanted them all to go to our dev team, rather than the original recipient.

Rather than write support for this into our apps, we used Postfix to redirect the mail to our devs.

In our case, our staging deployments use a local installation of Postfix and the systems are generally not used by anything else, which makes this dead easy.

Firstly, write a rewrite map file, with the following one line of content. Call it /etc/postfix/recipient_canonical_map:

/./ devteam@example.com 

Then configure Postfix like this (in /etc/postfix/main.cf):

recipient_canonical_classes = envelope_recipient
recipient_canonical_maps = regexp:/etc/postfix/recipient_canonical_map

Now all mail going through this relay will be redirected to devteam@example.com. It rewrites only the envelope, so the important headers are not changed.

Puppet dependencies and run stages

I’m using Puppet to manage some apt repositories on Ubuntu and have had a dependency problem. I want to write the source configs before running apt-get update and I want to run that before installing any packages. Otherwise, a manifest that tries to install a package from a custom repository will fail, either because the repository is not configured or the apt metadata hasn’t been retrieved yet.

Due to Puppet changes being idempotent, this is usually solvable by running puppet a few times (ew). Or you can do this properly by diligently setting all the dependencies for all of your packages on your apt-get update command, and having that depend on your source configs, but that’s pretty fiddly.

Continue reading Puppet dependencies and run stages

Indexing syslog messages with solr

I’ve been thinking about centralized indexing and searching of logs for a while and the other day I came across a project called Graylog2 that does just that. It provides a service to receive messages over the network (in a couple of formats, including syslog) and writes them into mongodb. It then has a rails application that lets you browse and search the logs.

It’s neat but I wasn’t quite happy with the search options – I’ve always thought logs should be indexed with a real full text indexer. So, I knocked up a couple of scripts to do just that, as a proof of concept.

It uses rsyslog to receive the messages and write them to a named pipe.  A small ruby script called rsyslog-solr reads from the other end of the pipe and writes batches of the incoming messages to the full text indexer. I chose solr as the full text indexer as it has some very good options for scaling up, which will be necessary when indexing lots of logs.
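The read-and-batch loop can be sketched roughly like this. It is an illustration of the idea rather than the real rsyslog-solr code: each_batch and index_from_pipe are names invented here, the indexer is just any callable that accepts an array of lines (posting to Solr is left out), and the batch size of 100 is an arbitrary assumption.

```ruby
require "stringio"

# Yields arrays of up to batch_size chomped lines read from io.
# On a live pipe this waits until a full batch has arrived (or the
# writer closes the pipe), which is fine for a sketch.
def each_batch(io, batch_size = 100)
  return enum_for(:each_batch, io, batch_size) unless block_given?

  io.each_line.each_slice(batch_size) do |lines|
    yield lines.map(&:chomp)
  end
end

# Hands every batch to the indexer, which would post it to Solr.
def index_from_pipe(io, indexer, batch_size = 100)
  each_batch(io, batch_size) { |batch| indexer.call(batch) }
end

# Usage with an in-memory stand-in for the named pipe:
fake_pipe = StringIO.new("sshd: failed login\ncron: job started\nsshd: accepted key\n")
index_from_pipe(fake_pipe, ->(batch) { puts "indexing #{batch.size} messages" }, 2)
```

With a real named pipe, the StringIO would be replaced by a File opened on the pipe's path, and the lambda by whatever sends the batch to the indexer.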

Solr indexes, compresses and stores the messages sent to it, so we can retrieve the full text without having to store the original log. I wrote a custom schema definition optimized for this.

Then another script, rsyslog-solr-search, is used to query Solr and display the matching messages.

Querying is fun. For example, I’ve searched for all ssh authentication failures across all hosts and then searched on the originating IPs to see what other probes they made.

You don’t have to do advanced searches though. You can just display all logs from the last hour, or day, or whatever.

One important note: any user that can generate logs that are sent to the system can cause a denial of service by sending specially malformed messages. This can be fixed by moving the formatting of the log entries from rsyslog into the ruby script, but I’ve not done it yet.

I’ve pushed the code to github under the MIT license. Feel free to improve it.

The cost of free

Helienne Lindvall writes in the Guardian:

Cory Doctorow [will] cost you $25,000 (£15,800) to get him to speak at your conference…

But what does Doctorow speak about? Well, ironically, he’s a proponent of giving away content for free as a business model – and for years he’s been telling the music industry to adapt to it. Am I the only one to see the irony in this?

I don’t see the irony. This is exactly what Doctorow recommends. Give your content away and charge to perform it. Give your music away and charge for your gigs. I bet the content of his slides is creative commons, and I bet the recordings of his talks are creative commons even. But if watching a video of him isn’t enough for you and you want him in person, then you pay for it.

It seems that Helienne Lindvall does not understand even the basic ideas of free culture.

UPDATE: Helienne Lindvall seems to have been misinformed anyway, as per this tweet from Doctorow himself:

@helienne, I’m afraid you were badly misinformed. I don’t have a “booker”, I don’t charge anything like the sum quoted, most talks are free

UPDATE: Doctorow has since written an interesting article rebutting Lindvall.

ipq.co: create dns records instantly

ipq.co is a new service I put together to lower the barrier for dns management. It’s the tinyurl of the dns world – provide an IP address and you get a random dns record for it (or you can choose your own, if it’s available).  Looking at other dns management systems, I was surprised this hadn’t been done before (and by how awful most of the dns interfaces are out there!)

I wrote it in Ruby using the Rails 3 framework, with the dns records being served by the PowerDNS MySQL back end (though I’ll likely be switching it to use a custom back end using my powerdns_pipe library for more flexibility).

We’re building a big new cloud system over at Brightbox and we’ve been thinking how to provide convenient dns records for our customers. We already have some basic integration but the resulting records are quite a mouthful. ipq.co is just a bit of an experiment to explore other ways of solving the problem. There has already been some discussion over on Hacker News about possible applications (and implications) of the service – I’m interested in how people will use it.

I’ve got some plans for other features which I’ll be adding over the next few weeks, and then I’ll be selling it to Google for low 7 figures, so watch this space.

Wildcard IP lookups

You can now do wildcard IP lookups, as provided by the xip.io service, useful for development environments:

$ host whatever.10.0.0.5.ip.ipq.co
whatever.10.0.0.5.ip.ipq.co is an alias for 2rvxtx.ip.ipq.co.
2rvxtx.ip.ipq.co has address 10.0.0.5
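The core of a wildcard lookup is pulling the IP address back out of the queried name. A minimal Ruby sketch of that step follows; it is a guess at the general technique, not the ipq.co implementation (which, as the output above shows, answers with an alias to a generated record). The embedded_ip method and the WILDCARD_SUFFIX constant are names made up for this example.

```ruby
WILDCARD_SUFFIX = ".ip.ipq.co"

# Returns the IPv4 address embedded just before the wildcard suffix,
# or nil if the name does not contain one.
def embedded_ip(hostname)
  return nil unless hostname.end_with?(WILDCARD_SUFFIX)

  # Drop the suffix and keep the last four labels as candidate octets.
  octets = hostname[0...-WILDCARD_SUFFIX.length].split(".").last(4)
  return nil unless octets.length == 4
  return nil unless octets.all? { |o| o.match?(/\A\d{1,3}\z/) && o.to_i <= 255 }

  octets.join(".")
end

p embedded_ip("whatever.10.0.0.5.ip.ipq.co")
p embedded_ip("2rvxtx.ip.ipq.co")
```

Names that carry four valid octets before the suffix yield the address, and anything else (such as a plain generated record name) yields nil, so it can fall through to a normal record lookup.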

New additional domain name

A donor has transferred ownership of a new domain for use with ipq.co.  So now, rather childishly, you can create instant dns records as subdomains of mypen.is:

$ host localhost.mypen.is
localhost.mypen.is has address 127.0.0.1