Xapian Fu: Full Text Indexing in Ruby

Xapian is an Open Source Search Engine Library written in C++. It has Ruby bindings, but they’re generated with SWIG, so they basically just mirror the C++ API – not very Ruby-like (and pretty ugly).

Being a self-confessed full text indexing nerd and a Ruby-lover, I wrote Xapian Fu: a library to provide access to Xapian that is more in line with “The Ruby Way”.

I started writing Xapian Fu exactly a year ago today but left it for a couple of months, then restarted work on it on the train on the way back from the 2009 Scotland on Rails conference.  Development was test driven, so it’s got an extensive test suite (using rspec).  Documentation is in rdoc and is quite detailed.  As of the latest version, it supports Ruby 1.9 too.

Xapian Fu basically gives you a Hash interface to Xapian – so you get a persistent Hash with full text indexing built in (and ACID transactions!).

Example

For example, this creates a database called example.db, puts three documents into it, searches them and prints the titles of the matches:

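  require 'xapian-fu'
  include XapianFu
  # Open (or create) the database and store the :title and :year fields
  db = XapianDb.new(:dir => 'example.db', :create => true,
                    :store => [:title, :year])
  # Each document is just a Hash
  db << { :title => 'Brokeback Mountain', :year => 2005 }
  db << { :title => 'Cold Mountain', :year => 2004 }
  db << { :title => 'Yes Man', :year => 2008 }
  # Write the pending changes to disk
  db.flush
  # Search and print the stored title of each match
  db.search("mountain").each do |match|
    puts match.values[:title]
  end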

There are of course a whole bunch more examples in the documentation.

Schema-less

The hard work of full text indexing and storage is of course done by the Xapian library, but I have added a couple of useful features.  One in particular is the ability to use symbols (or strings) as field names. Xapian has no real concept of fields, but you can store arbitrary data that it calls values in a numbered slot alongside each document.  Instead of making you deal with field numbers, Xapian Fu uses a hash function to convert your field names into numbers.  This means Xapian Fu is schema-less – you can add or omit fields whenever you like.  It’s useful to define fields when opening databases so that Xapian Fu can recognise them in searches or to give Xapian Fu some clues on the type of data you’ll be using, but it’s not necessary.
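To make the name-to-slot idea concrete, here is a minimal sketch in plain Ruby. The helper name `slot_for` is hypothetical and the choice of CRC32 is an assumption made for illustration; it is not necessarily the hash function Xapian Fu uses.

```ruby
require 'zlib'

# Hypothetical helper, not Xapian Fu's actual code: turn a field name into
# a numeric value slot. CRC32 is an assumption, chosen because it gives the
# same number on every run. Integers pass through as explicit slot numbers.
def slot_for(field)
  field.is_a?(Integer) ? field : Zlib.crc32(field.to_s)
end

puts slot_for(:title)
puts slot_for('title')  # same slot as the symbol
puts slot_for(:year)    # a different name hashes to a different slot
```

Because the slot is derived from the name alone, nothing needs to be declared up front, which is what makes the schema-less behaviour possible.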

Efficient storage of fields for ordering

If you tell Xapian Fu what type of data you’ll be storing in your fields, it can store them more efficiently. For example, if you don’t specify the type, integers are converted to strings as is, so 354,441,945,266,899 becomes “354441945266899” – that’s fifteen bytes! When you tell Xapian Fu that a field is going to hold an Integer, it stores the value in double precision floating point format, which takes 8 bytes and can represent integers of up to about 16 decimal digits. It’s also stored in big-endian format, so Xapian can still use the field for ordering results. Xapian Fu stores Time objects like this too, so again, it’s size efficient and can be used for ordering results.
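The size and ordering claims can be checked with Ruby’s `Array#pack`. This is only a sketch of the idea: the helper name `pack_number` is made up for illustration, and the ordering check only covers non-negative numbers.

```ruby
# Hypothetical helper: pack a number as an 8 byte big-endian double.
# 'G' is Array#pack's directive for double precision, network byte order.
def pack_number(n)
  [n.to_f].pack('G')
end

as_string = 354_441_945_266_899.to_s
as_double = pack_number(354_441_945_266_899)

puts as_string.bytesize  # 15
puts as_double.bytesize  # 8

# Comparing the packed bytes gives numeric order (for these non-negative
# values), whereas comparing the decimal strings does not.
values = [10, 9, 100, 2005]
sorted_by_bytes  = values.sort_by { |v| pack_number(v) }
sorted_by_string = values.sort_by(&:to_s)

p sorted_by_bytes   # [9, 10, 100, 2005]
p sorted_by_string  # [10, 100, 2005, 9]
```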

Since Xapian Fu now knows what type the field is, it can convert it back when you access it too, so you get an Integer or a Time object (rather than a String, which is how Xapian represents it internally). It currently supports Integer, Fixnum, Bignum (up to a certain size), Float, Time and Date.  You can add your own types easily by decorating your instances with special methods.
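As a rough illustration of why knowing the type matters when reading a value back, here is a sketch in plain Ruby. The `DECODERS` table and the `encode` and `decode` helpers are hypothetical names, not Xapian Fu’s API, and storing a Time as seconds since the epoch is an assumption.

```ruby
# Hypothetical decoders: how to turn 8 stored bytes back into each type.
DECODERS = {
  Integer => ->(bytes) { bytes.unpack('G').first.to_i },
  Time    => ->(bytes) { Time.at(bytes.unpack('G').first) }
}

# Both types are stored the same way: as a big-endian double.
def encode(value)
  [value.to_f].pack('G')
end

# Without the declared type, all we would have is an opaque String.
def decode(bytes, type)
  DECODERS.fetch(type).call(bytes)
end

year     = decode(encode(2005), Integer)
released = decode(encode(Time.at(1_262_304_000)), Time)

puts year.class      # Integer
puts released.class  # Time
```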

Stemming and stopping

Xapian has stemming support for loads of languages (via the Snowball library), but no stop word lists.  Xapian Fu uses the appropriate stemmer when you specify a language for your document or database and comes with stop word lists for 13 languages (also automatically used).  This means Xapian doesn’t have to index these common stop words, so you get faster indexing and search times, a smaller database and more relevant search results.
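Stop word removal itself is simple to picture. The sketch below uses a tiny made-up English list and a hypothetical `index_terms` helper; the lists shipped with Xapian Fu are much longer.

```ruby
require 'set'

# A tiny illustrative stop word list (an assumption, not Xapian Fu's list).
STOP_WORDS = Set.new(%w[a an and in of the to])

# Hypothetical helper: split text into lowercase words and drop stop words,
# leaving only the terms that are worth indexing.
def index_terms(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

p index_terms('The Lord of the Rings: The Return of the King')
# ["lord", "rings", "return", "king"]
```

Ten words go in and four terms come out, which is where the smaller database and faster indexing come from.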

Xapian Fu also knows that your searches won’t work right unless you stem them too! It automatically stems queries using the database language. This does fall down a bit at the moment if your database holds documents in different languages, but query stemming can be disabled, and support for mixed languages wouldn’t be too difficult to add.
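To see why the query has to be stemmed as well, here is a toy example. The `toy_stem` function just strips a trailing “s”; it is a stand-in invented for this sketch and is nothing like a real Snowball stemmer.

```ruby
# Toy stemmer (an assumption for illustration): lowercase and strip a
# trailing "s". Real stemmers handle far more than plurals.
def toy_stem(word)
  word.downcase.sub(/s\z/, '')
end

# Hypothetical helper: the stemmed terms that would go into the index.
def stemmed_terms(text)
  text.scan(/\w+/).map { |word| toy_stem(word) }
end

index = stemmed_terms('Misty Mountains')  # ["misty", "mountain"]

raw_query     = 'mountains'
stemmed_query = toy_stem('mountains')

puts index.include?(raw_query)      # false: the index only holds "mountain"
puts index.include?(stemmed_query)  # true
```

The document literally contains “Mountains”, yet the unstemmed query misses it because the index only holds the stem.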

Will Paginate support

will_paginate is a pagination library for ActiveRecord (and other DB abstraction layers).  It has helpers for drawing page list interfaces.  Xapian Fu supports will_paginate by using the same method names in result sets (such as :current_page and :total_pages).  You can pass a Xapian Fu result set to the will_paginate helpers and you’ll get lovely page list interfaces. You still need to accept the page parameters in your action and set up the next search yourself, of course!
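The approach is plain duck typing: anything that answers the right methods can be handed to the helpers. The class below is a hypothetical stand-in written for this sketch, not Xapian Fu’s result set class, and apart from :current_page and :total_pages the method names are assumptions.

```ruby
# Hypothetical result set exposing pagination methods by name.
class PagedResults
  attr_reader :matches, :current_page, :per_page, :total_entries

  def initialize(matches, current_page, per_page, total_entries)
    @matches       = matches
    @current_page  = current_page
    @per_page      = per_page
    @total_entries = total_entries
  end

  # Round up so a partly filled last page still counts as a page.
  def total_pages
    (total_entries / per_page.to_f).ceil
  end

  def previous_page
    current_page > 1 ? current_page - 1 : nil
  end

  def next_page
    current_page < total_pages ? current_page + 1 : nil
  end
end

results = PagedResults.new(%w[doc11 doc12], 2, 10, 35)
puts results.total_pages  # 4
puts results.next_page    # 3
```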

Active Record support

Xapian Fu does not yet have an Active Record plugin (I’ll add one soon) but as Xapian Fu uses the :id field as the Xapian primary key by default, it’s trivial to use it in your Rails app right now. See the “ActiveRecord Integration” section of the README for code examples.  In this case, you probably don’t need to actually store any data in the Xapian database, just the index information and the :id field, which is stored by default, so you get a smaller database. You do still need to store any fields that you want to order results by or group by (collapse).

Multi-master replicated full text indexing service

Xapian Fu doesn’t do this. I’m designing something that might though :)

Getting Xapian Fu

It’s available in gem form from Rubyforge/cutter.  The code is on github here.  You’ll need the Xapian Ruby bindings installed – on Debian/Ubuntu this is the libxapian-ruby1.8 package.  The gem named xapian claims to provide Xapian and the Ruby bindings but it failed for me on 64bit.  The gem named xapian-full claims to provide the Ruby bindings without Xapian (you’ll obviously need to build and install Xapian yourself) but I’ve not used that either.  RPMs, source files and other downloads are listed on the Xapian downloads page.

There is also this weird kinda splash page I made, in some kind of attempt to host something about Xapian Fu on my own domain. Not really sure what real purpose it serves.
