Tag: newssniffer

News Sniffer, Ferret and Rails

I’ve been working on my News Sniffer project for the last few days, finishing up a two-month experiment with Ferret, the Ruby port of Lucene, to index news articles and comments. More info on the News Sniffer blog. The project took two months because of some instability in the newer versions of Ferret, but the author responded to the bug reports and fixed all the problems, so I decided to deploy.

Ferret offers huge improvements over the original MySQL full-text search method, and I’m looking forward to adding some fancy keyword statistics graphs in the future, perhaps showing censorship patterns in BBC comments with certain keywords.

Because News Sniffer is distributed across a number of servers, I used DRb (distributed Ruby) to allow them all to update one central Ferret index. DRb seems to work very well generally, and is amazingly simple to use, but I ran into a few problems with recycled objects and invalid references whilst using Ferret across it. These are apparently due to the garbage collector on the server side collecting things still in use on the client side. I think I eliminated most of them, but they still crop up once in a while, so I’ll be looking into this further.
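For anyone curious what this looks like, here is a minimal sketch of the general idea, not the actual News Sniffer code. `CentralIndex` is a hypothetical stand-in for the shared index and is not Ferret's API. The relevant DRb behaviour is that objects which can be marshalled are copied to the caller, while everything else is passed as a remote reference that the server looks up by object ID. That lookup does not stop the server's garbage collector from reclaiming the object, which is one plausible source of the recycled-object errors. The sketch sidesteps it by only ever returning plain values.

```ruby
require 'drb/drb'

# Hypothetical stand-in for the shared index. A real setup would wrap a
# Ferret index here; nothing below is Ferret's actual API.
class CentralIndex
  def initialize
    @docs = {}
    @lock = Mutex.new # DRb handles client connections in separate threads
  end

  # Store a document and return its id (a plain, marshalable value).
  def add(id, text)
    @lock.synchronize { @docs[id] = text }
    id
  end

  # Return matching ids as a plain Array, copied to the caller by value,
  # so the server never hands out a reference that its GC could recycle.
  def search(term)
    needle = term.downcase
    @lock.synchronize do
      @docs.select { |_id, text| text.downcase.include?(needle) }.keys.sort
    end
  end
end

# Server side: port 0 asks the OS for any free port (assumption: localhost only).
DRb.start_service('druby://127.0.0.1:0', CentralIndex.new)
index_uri = DRb.uri

# Client side: each scraper would connect using the server's URI.
index = DRbObject.new_with_uri(index_uri)
index.add(1, 'BBC comment removed by moderators')
index.add(2, 'Weather report for Tuesday')
hits = index.search('bbc')
DRb.stop_service
```

If a method does have to return a remote reference, the server needs to keep its own reference to that object for as long as any client might use it.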


Active Resource not in Rails 1.2!

Whilst planning some changes to my News Sniffer project, I thought I’d have a play with Active Resource.

Currently, all the forum and news article downloading and scraping happens on a different machine to the web server. It has a VPN connection to the database and memcache servers, but I’d like to integrate the Ferret text indexing system for better searching capabilities. To centralise Ferret, I have three options:

  1. regularly reindex new content from the database on the web server;
  2. DRb a Ferret Object;
  3. or use ActiveResource to access the models via the web service.

DRb-ing a Ferret Object would be quite elegant, but using ActiveResource would also replace the need for a database and memcache connection (and I could do much better fragment caching actually).

Anyway, I searched high and low for some docs. I found lots of blog entries about how great it is, but no real API docs. When I searched through the Rails code and found nothing either, I got suspicious. Finally I found a couple of blog entries stating that ActiveResource was dropped from Rails 1.2. It seems to be planned for Rails 2.0. Not sure how I missed this. I guess my search-fu is lacking.

I’ll be investigating other options. I’d much prefer not to build a SOAP or XML-RPC interface. Ugh.

News Sniffer: Revisionista

The latest News Sniffer project went live today: Revisionista. It tracks changes in corporate news articles and marks the differences, so you can choose a BBC news article and see how it’s changed since it was first created. Most changes are on breaking news articles, which get updated as more information becomes available, but some changes are rather telling of policy.
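To give a rough idea of what marking the differences between two versions can involve, here is a small word-level diff sketch. It is an illustration only and not necessarily how Revisionista does it: it uses a plain longest-common-subsequence table, and the helper names `lcs_table` and `mark_changes` as well as the `[-removed-]` / `{+added+}` markers are my own choices for the example.

```ruby
# Build the longest-common-subsequence length table for two word arrays.
def lcs_table(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_index do |i|
    b.each_index do |j|
      table[i + 1][j + 1] =
        a[i] == b[j] ? table[i][j] + 1 : [table[i][j + 1], table[i + 1][j]].max
    end
  end
  table
end

# Compare two versions of a text word by word and return one string in
# which removed words are wrapped in [-...-] and added words in {+...+}.
def mark_changes(old_text, new_text)
  a = old_text.split
  b = new_text.split
  table = lcs_table(a, b)
  out = []
  i = a.size
  j = b.size
  # Walk the table backwards from the end of both versions.
  while i > 0 || j > 0
    if i > 0 && j > 0 && a[i - 1] == b[j - 1]
      out.unshift(a[i - 1]) # word unchanged
      i -= 1
      j -= 1
    elsif j > 0 && (i.zero? || table[i][j - 1] >= table[i - 1][j])
      out.unshift("{+#{b[j - 1]}+}") # word only in the new version
      j -= 1
    else
      out.unshift("[-#{a[i - 1]}-]") # word only in the old version
      i -= 1
    end
  end
  out.join(' ')
end

puts mark_changes('Police say two people were hurt',
                  'Police say three people were injured')
# prints: Police say [-two-] {+three+} people were [-hurt-] {+injured+}
```

Splitting on whitespace keeps the example short. A real tool would also have to cope with punctuation, HTML markup and paragraph moves.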

Currently only the BBC is monitored, but it’s pretty easy for me to add support for any site with an RSS feed.

Some examples:
(more…)