Tag: indexing

Indexing syslog messages with solr

I’ve been thinking about centralized indexing and searching of logs for a while and the other day I came across a project called Graylog2 that does just that. It provides a service to receive messages over the network (in a couple of formats, including syslog) and writes them into mongodb. It then has a rails application that lets you browse and search the logs.

It’s neat but I wasn’t quite happy with the search options – I’ve always thought logs should be indexed with a real full text indexer. So, I knocked up a couple of scripts to do just that, as a proof of concept.

It uses rsyslog to receive the messages and write them to a named pipe.  A small ruby script called rsyslog-solr reads from the other end of the pipe and writes batches of the incoming messages to the full text indexer. I chose solr as the full text indexer as it has some very good options for scaling up, which will be necessary when indexing lots of logs.

Solr indexes, compresses and stores the messages sent to it, so we can retrieve the full text without having to store the original log. I wrote a custom schema definition optimized for this.

Then another script, rsyslog-solr-search, is used to query Solr and display the matching messages.

Querying is fun, for example I’ve searched all ssh authentication failures across all hosts and then searched on the originating IPs to see what other probes they made.

You don’t have to do advanced searches though, you can just display all logs from the last hour, or day or whatever.

One important note, any user that can generate logs that are sent to the system can cause a denial of service attack by sending specially malformed messages. This can be fixed by moving the formatting of the log entries from rsyslog into the ruby script, but I’ve not done it yet.

I’ve pushed the code to github under the MIT license. Feel free to improve it.

Xapian Fu: Full Text Indexing in Ruby

Xapian is an Open Source Search Engine Library written in C++. It has Ruby bindings, but they’re generated with SWIG, so they basically just mirror the C++ bindings – not very Ruby-like (and pretty ugly).

Being a self-confessed full text indexing nerd and a Ruby-lover, I wrote Xapian Fu: a library to provide access to Xapian that is more in line with “The Ruby Way”.

I started writing Xapian Fu exactly a year ago today but left it for a couple of months, then restarted work on it on the train on the way back from the 2009 Scotland on Rails conference.  Development was test driven, so it’s got an extensive test suite (using rspec).  Documentation is in rdoc and is quite detailed.  As of the latest version, it supports Ruby 1.9 too.

Xapian Fu basically gives you a Hash interface to Xapian – so you get a persistent Hash with full text indexing built in (and ACID transactions!).

Example

For example, create a database called example.db, put three documents into it and search them and print the results:

  require 'xapian-fu'
  include XapianFu
  db = XapianDb.new(:dir => 'example.db', :create => true,
                    :store => [:title, :year])
  db << { :title => 'Brokeback Mountain', :year => 2005 }
  db << { :title => 'Cold Mountain', :year => 2004 }
  db << { :title => 'Yes Man', :year => 2008 }
  db.flush
  db.search("mountain").each do |match|
    puts match.values[:title]
  end

There are of course a whole bunch more examples in the documentation.
(more…)

My NWRUG Ferret Talk

I did a short talk on Ferret, the Ruby “Information Retreival Library”, at the North West Ruby Users Group last Thursday.  We had a bit of a theme too, with Will Jessop speaking about Sphinx and Asa Calow speaking about Solr.

I got to have a bit of a nosey around the Manchester BBC building too – though I was worried I’d open the wrong door and end up on TV. Didn’t fancy having to apologise to Jeremy Paxman.

Brightbox also sponsored some pizza, and gave away t-shirts and stickers like candy (there was no candy though).

My slides are available here, and contain a little example file system indexer. I made my slides with webby and S6 if you’re interested.

News Sniffer, Ferret and Rails

I’ve been working on my News Sniffer project for the last few days, finishing up a two month experiment with using the Ruby Lucene implementation, Ferret, to index news articles and comments.  More info on the News Sniffer blog.  The project spanned two months due to some instability in the newer versions of Ferret, but the author responded to the bug reports and managed to fix all the problems so I decided to deploy.

Ferret offers huge improvements over the original MySQL full-text search method, and I’m looking forward to adding some fancy keyword statistics graphs in the future – perhaps showing censorship patterns in bbc comments with certain keywords.

Because News Sniffer is distributed across a number of servers, I used DRb (distributed Ruby) to allow them all to update one central Ferret index.  DRb seems to work very well generally, and is amazingly simple to use, but I ran into a few problems with recycled objects and invalid references whilst using Ferret across it, apparently due to the garbage collector on the service side collecting things still in use on the client side.  I think I eliminated most of them but they still crop up once in a while – I’ll be looking into this further.

(more…)