I’ve been working on my News Sniffer project for the last few days, finishing up a two-month experiment with using the Ruby Lucene implementation, Ferret, to index news articles and comments. More info on the News Sniffer blog. The project spanned two months due to some instability in the newer versions of Ferret, but the author responded to the bug reports and managed to fix all the problems, so I decided to deploy.
Ferret offers huge improvements over the original MySQL full-text search method, and I’m looking forward to adding some fancy keyword statistics graphs in the future – perhaps showing censorship patterns in BBC comments with certain keywords.
Because News Sniffer is distributed across a number of servers, I used DRb (distributed Ruby) to allow them all to update one central Ferret index. DRb generally works very well and is amazingly simple to use, but I ran into a few problems with recycled objects and invalid references whilst using Ferret across it, apparently due to the garbage collector on the server side collecting objects that were still in use on the client side. I think I eliminated most of them but they still crop up once in a while – I’ll be looking into this further.
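One way to sidestep recycled-object errors of this kind, sketched here as an assumption rather than a description of the actual News Sniffer setup, is to have the DRb front object hand back only plain, marshalable values (strings, hashes, arrays). DRb copies those by value, so the client never holds a reference to something the server's garbage collector might reclaim. `CentralIndex`, `serve_index` and the URI are hypothetical names, and the in-memory hash merely stands in for the real index; none of this is Ferret's API.

```ruby
require 'drb/drb'

# Hypothetical stand-in for the central index. This is NOT Ferret's API;
# a plain Hash plays the role of the index so the sketch stays runnable.
class CentralIndex
  def initialize
    @docs = {}
    @lock = Mutex.new # DRb can serve several clients from separate threads
  end

  # Store a document passed as a plain Hash, e.g. { id: 1, body: "..." }.
  def add_document(doc)
    @lock.synchronize { @docs[doc[:id]] = doc.dup }
    true
  end

  # Return matches as an Array of plain Hashes. These are marshaled and
  # copied to the client, so no remote reference is created.
  def search(term)
    needle = term.downcase
    @lock.synchronize do
      @docs.values.select { |d| d[:body].downcase.include?(needle) }.map(&:dup)
    end
  end
end

# Start serving the index; meant to run on the central machine only.
# The URI is an assumption for illustration.
def serve_index(uri = 'druby://localhost:8787')
  DRb.start_service(uri, CentralIndex.new)
  DRb.thread.join
end

# On each client machine (same hypothetical URI):
#   index = DRbObject.new_with_uri('druby://localhost:8787')
#   index.add_document({ id: 1, body: 'A comment about the BBC' })
#   index.search('bbc')
```

The trade-off is that results are copied in full on every call, which is fine for small result sets but wasteful for large ones.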
I also moved from using memcached for cache fragment storage to FileStore. This allows me to expire fragments using regular expressions, which lets me use fragment caching more easily and more often (such as with paged listing). FileStore is rather slower than memcached, especially when expiring using these regular expressions, but being able to use it more often outweighed the performance hit. FileStore is obviously not distributed unless you have a shared file system, so I used DRb here too.
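To make the trade-off concrete, here is a minimal sketch of regular-expression expiry over a directory of fragment files. `TinyFileStore` is a hypothetical miniature written for illustration, not Rails' FileStore; it assumes fragment keys map directly to relative file paths. It also shows where the slowness comes from: every expiry has to walk the whole directory tree.

```ruby
require 'fileutils'
require 'find'

# Hypothetical miniature of a file-backed fragment store (not Rails' FileStore).
# Keys are assumed to be relative paths such as "articles/list/page/2".
class TinyFileStore
  def initialize(root)
    @root = root
  end

  def write(key, content)
    path = path_for(key)
    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, content)
  end

  def read(key)
    path = path_for(key)
    File.file?(path) ? File.read(path) : nil
  end

  # Delete every fragment whose key matches the pattern and return the
  # deleted keys. The whole tree is walked each time, which is the slow part.
  def delete_matched(pattern)
    deleted = []
    Find.find(@root) do |path|
      next unless File.file?(path)
      key = path.delete_prefix("#{@root}/")
      next unless key.match?(pattern)
      File.delete(path)
      deleted << key
    end
    deleted.sort
  end

  private

  def path_for(key)
    File.join(@root, key)
  end
end

# Usage sketch: expire every page of a paged listing in one call.
#   store = TinyFileStore.new('/tmp/fragments')
#   store.delete_matched(%r{\Aarticles/list/page/\d+\z})
```

A key-value store such as memcached has no cheap way to enumerate its keys like this, which is why the same operation is awkward there.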
It would be nice to add regular expression expiry to memcached, but I think this goes against the original design spec for memcached. I’m considering adding configurable memory limits to the Rails MemoryStore fragment store, where it’ll remove the least recently used fragments when the limit is approached (currently it would just keep allocating RAM until your OS killed your Ruby process).
I also found an (easily fixable on Linux/BSD) race condition in FileStore where you could theoretically retrieve a corrupted fragment when it’s used in a multi-process shared storage setup (though not a multi-thread setup, so my DRb’ed FileStore should be safe).
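A memory-limited store along those lines could look roughly like the sketch below. `LruFragmentStore` and its `max_bytes` parameter are hypothetical names, not part of Rails, and the size accounting counts only the byte size of each stored string, ignoring Ruby's per-object overhead. It relies on Ruby hashes preserving insertion order: reading a fragment moves it to the end, so the front of the hash is always the least recently used entry.

```ruby
# Hypothetical LRU fragment store with a byte limit (not part of Rails).
# Not thread-safe as written; wrap calls in a Mutex if threads share it.
class LruFragmentStore
  attr_reader :bytes_used

  def initialize(max_bytes)
    @max_bytes = max_bytes
    @bytes_used = 0
    @data = {} # insertion order doubles as recency order
  end

  # Reading re-inserts the entry so it becomes the most recently used.
  def read(key)
    return nil unless @data.key?(key)
    value = @data.delete(key)
    @data[key] = value
  end

  def write(key, value)
    delete(key)
    # A fragment larger than the whole limit can never fit; skip caching it.
    return value if value.bytesize > @max_bytes
    @data[key] = value
    @bytes_used += value.bytesize
    evict_oldest while @bytes_used > @max_bytes
    value
  end

  def delete(key)
    old = @data.delete(key)
    @bytes_used -= old.bytesize if old
    old
  end

  def keys
    @data.keys
  end

  private

  # Remove the entry at the front of the hash: the least recently used one.
  def evict_oldest
    _key, value = @data.shift
    @bytes_used -= value.bytesize
  end
end
```

Evicting strictly once the limit is exceeded, rather than as it is approached, keeps the sketch short; a real implementation might evict a little early to avoid doing it on every write.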
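The post does not spell out the fix, so the following is a guess at the usual approach, assuming the race is a reader opening a fragment while another process is still part-way through writing it. The standard remedy on POSIX systems is to write to a temporary file in the same directory and then rename it over the target, since `rename` replaces the destination atomically within one file system. `atomic_write` is a hypothetical helper name.

```ruby
require 'fileutils'
require 'securerandom'

# Write content so that concurrent readers see either the old file or the
# new one, never a half-written fragment. Hypothetical helper, POSIX only.
def atomic_write(path, content)
  dir = File.dirname(path)
  FileUtils.mkdir_p(dir)
  # The temp file lives in the same directory so the rename stays on one
  # file system; pid plus random suffix avoids clashes between writers.
  tmp = File.join(dir, ".#{File.basename(path)}.#{Process.pid}.#{SecureRandom.hex(4)}.tmp")
  begin
    File.open(tmp, File::WRONLY | File::CREAT | File::EXCL) do |f|
      f.write(content)
    end
    File.rename(tmp, path) # atomic replacement on POSIX file systems
  ensure
    # Clean up if anything failed before the rename happened.
    File.delete(tmp) if File.exist?(tmp)
  end
  path
end
```

Readers need no changes: whichever process opens the fragment gets a complete file, either the previous version or the new one.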
Hopefully, with the improved searching due to Ferret and the higher performance due to FileStore, News Sniffer will now be more useful.