I did a short talk on Ferret, the Ruby “Information Retrieval Library”, at the North West Ruby Users Group last Thursday. We had a bit of a theme too, with Will Jessop speaking about Sphinx and Asa Calow speaking about Solr.
I got to have a bit of a nosey around the Manchester BBC building too – though I was worried I’d open the wrong door and end up on TV. Didn’t fancy having to apologise to Jeremy Paxman.
Brightbox also sponsored some pizza, and gave away t-shirts and stickers like candy (there was no candy though).
My slides are available here, and contain a little example file system indexer. I made my slides with webby and S6 if you’re interested.
Whilst working with the Ruby text search engine library Ferret, I came across a segfault in the query parser. It had already been reported and fixed, but I realised it can lead to a denial of service.
If you use Ferret anywhere that allows users to execute queries, those users can crash the Ruby process with a specially crafted query. This was quite serious for a number of my sites (not to mention slowing development of a current app) so I applied the fix to the released 0.11.4 source and repackaged it as 0.11.4.1.
Obviously this isn’t in any way official, but it works for me and I’m sharing it here for anyone else affected. The gem, tgz and zip are available here, and just the patch is available here (derived from the author’s changeset to trunk).
The patch is against the release source, as the Subversion repository seems to be down atm (I got the changeset from the web-based Subversion viewer).
I’ve been working on my News Sniffer project for the last few days, finishing up a two month experiment with using the Ruby Lucene implementation, Ferret, to index news articles and comments. More info on the News Sniffer blog. The project spanned two months due to some instability in the newer versions of Ferret, but the author responded to the bug reports and managed to fix all the problems so I decided to deploy.
Ferret offers huge improvements over the original MySQL full-text search method, and I’m looking forward to adding some fancy keyword statistics graphs in the future – perhaps showing censorship patterns in BBC comments with certain keywords.
Because News Sniffer is distributed across a number of servers, I used DRb (distributed Ruby) to allow them all to update one central Ferret index. DRb seems to work very well generally, and is amazingly simple to use, but I ran into a few problems with recycled objects and invalid references whilst using Ferret across it, apparently due to the garbage collector on the service side collecting things still in use on the client side. I think I eliminated most of them but they still crop up once in a while – I’ll be looking into this further.
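To illustrate the shape of that setup, here is a minimal sketch of one central index service shared over DRb. It is not the News Sniffer code: IndexService, its add and search methods and the in-memory hash standing in for the Ferret index are all hypothetical, and only the DRb calls come from the standard library. The idea is to keep one long-lived front object on the server and pass only plain values (integers, strings, arrays) across the wire, so the client never holds a reference to a server-side object that the garbage collector could recycle.

```ruby
require 'drb/drb'

# Hypothetical stand-in for the central index: a hash of id => text.
# A real service would wrap a Ferret index here instead.
class IndexService
  def initialize
    @docs = {}
    @lock = Mutex.new # DRb serves each request in its own thread
  end

  # Add or replace a document. The arguments are plain values, so DRb
  # marshals them by value instead of handing out remote references.
  def add(id, text)
    @lock.synchronize { @docs[id] = text }
    true
  end

  # Return the matching ids as a plain array, which is copied to the caller.
  def search(term)
    needle = term.downcase
    @lock.synchronize do
      @docs.select { |_id, text| text.downcase.include?(needle) }.keys.sort
    end
  end
end

# Server side: hold the front object in a constant so it stays referenced
# for the life of the process. Port 0 asks the OS for any free port.
FRONT = IndexService.new
server = DRb.start_service('druby://127.0.0.1:0', FRONT)

# Client side: in a real deployment this runs in another process or on
# another machine, using the server's published URI.
index = DRbObject.new_with_uri(server.uri)
index.add(1, 'BBC comment about ferrets')
index.add(2, 'Unrelated news article')
results = index.search('ferret')
puts results.inspect
```

If the server does need to hand out references to its own objects, the standard library also ships DRb::TimerIdConv (in drb/timeridconv), which keeps exported objects alive for a while after they were last used. I haven’t tested whether it would cure the problems described above.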
Whilst planning some changes to my News Sniffer project, I thought I’d have a play with Active Resource.
Currently, all the forum and news article downloading and scraping happens on a different machine to the web server. It has a VPN connection to the database and memcache servers, but I’d like to integrate the Ferret text indexing system for better searching capabilities. To centralise Ferret, I have three options:
- regularly reindex new content from the database on the web server;
- DRb a Ferret Object;
- or use ActiveResource to access the models via the web service.
DRb-ing a Ferret Object would be quite elegant, but using ActiveResource would also replace the need for a database and memcache connection (and I could do much better fragment caching actually).
Anyway, I searched high and low for some docs – lots of blog entries about how great it is, but no real API docs. When I searched through the Rails code and found nothing either, I got suspicious. Finally I found a couple of blog entries stating that ActiveResource was dropped from Rails 1.2. It seems to be planned for Rails 2.0. Not sure how I missed this. I guess my search-fu is lacking.
I’ll be investigating other options. I’d much prefer not to build a SOAP or XMLRPC interface. Ugh.