• Home
  • Personal
  • Tech
  • Politics
  • Photography
  • Xapian Fu: Full Text Indexing in Ruby

    January 31st, 2010

    Xapian is an Open Source Search Engine Library written in C++. It has Ruby bindings, but they’re generated with SWIG, so they basically just mirror the C++ bindings – not very Ruby-like (and pretty ugly).

    Being a self-confessed full text indexing nerd and a Ruby-lover, I wrote Xapian Fu: a library to provide access to Xapian that is more in line with “The Ruby Way”.

    I started writing Xapian Fu exactly a year ago today but left it for a couple of months, then restarted work on it on the train on the way back from the 2009 Scotland on Rails conference.  Development was test driven, so it’s got an extensive test suite (using rspec).  Documentation is in rdoc and is quite detailed.  As of the latest version, it supports Ruby 1.9 too.

    Xapian Fu basically gives you a Hash interface to Xapian – so you get a persistent Hash with full text indexing built in (and ACID transactions!).

    Example

    For example, create a database called example.db, put three documents into it and search them and print the results:

      require 'xapian-fu'
      include XapianFu
      db = XapianDb.new(:dir => 'example.db', :create => true,
                        :store => [:title, :year])
      db << { :title => 'Brokeback Mountain', :year => 2005 }
      db << { :title => 'Cold Mountain', :year => 2004 }
      db << { :title => 'Yes Man', :year => 2008 }
      db.flush
      db.search("mountain").each do |match|
        puts match.values[:title]
      end

    There are of course a whole bunch more examples in the documentation.
    Read the rest of this entry »

    Tags: active record, database, full text indexing, indexing, ruby, search, stemming, stopping, the ruby way, xapian, xapian-fu

    Posted in Ruby, Tech | 1 Comment »

  • My NWRUG Ferret Talk

    March 24th, 2009

    I did a short talk on Ferret, the Ruby “Information Retreival Library”, at the North West Ruby Users Group last Thursday.  We had a bit of a theme too, with Will Jessop speaking about Sphinx and Asa Calow speaking about Solr.

    I got to have a bit of a nosey around the Manchester BBC building too – though I was worried I’d open the wrong door and end up on TV. Didn’t fancy having to apologise to Jeremy Paxman.

    Brightbox also sponsored some pizza, and gave away t-shirts and stickers like candy (there was no candy though).

    My slides are available here, and contain a little example file system indexer. I made my slides with webby and S6 if you’re interested.

    Tags: ferret, indexing, inverse index, rails, ruby, search, solr, sphinx, talk

    Posted in Ruby on Rails, Tech | 2 Comments »

  • News Sniffer, Ferret and Rails

    April 20th, 2007

    I’ve been working on my News Sniffer project for the last few days, finishing up a two month experiment with using the Ruby Lucene implementation, Ferret, to index news articles and comments.  More info on the News Sniffer blog.  The project spanned two months due to some instability in the newer versions of Ferret, but the author responded to the bug reports and managed to fix all the problems so I decided to deploy.

    Ferret offers huge improvements over the original MySQL full-text search method, and I’m looking forward to adding some fancy keyword statistics graphs in the future – perhaps showing censorship patterns in bbc comments with certain keywords.

    Because News Sniffer is distributed across a number of servers, I used DRb (distributed Ruby) to allow them all to update one central Ferret index.  DRb seems to work very well generally, and is amazingly simple to use, but I ran into a few problems with recycled objects and invalid references whilst using Ferret across it, apparently due to the garbage collector on the service side collecting things still in use on the client side.  I think I eliminated most of them but they still crop up once in a while – I’ll be looking into this further.

    Read the rest of this entry »

    Tags: ferret, indexing, memcached, newssniffer, rails, ruby, searching

    Posted in Ruby on Rails | No Comments »

  • John Leach

    • John Leach is a human being living in Leeds, UK.
  • Twitter

    • John is finally sitting down to watch Terminator 2 after @louisa_ insisted we watch 1 first. She, of course, was right to insist. 8 hrs ago
    • More twitter updates →
  • Author Stuff

    • Brightbox Rails Hosting
    • Compost This
    • ELER Web Comic
    • New World Odour
    • News Sniffer
    • Photography
    • Profile and History
    • Recycle This
    • The Gillroyd Parade
    • Things to do today
    • Website
  • Friends

    • Caius Durling
    • Deb Bassett
    • Gianni Tedesco
    • Ian Higgins
    • Louisa Parry
    • Rahoul Baruah
    • Sleepy Kev
    • Tim Waters
    • Tom Hall
  • Stuff

    • ifup
    • Media Lens
    • Mia Bambina
    • News from nowhere
  • Meta

    • Log in
    • Entries RSS
    • Comments RSS
  • Search

Creative Commons License The text of this blog is licensed under the Creative Commons BY-ND license