Faceted Search on a Shoestring

There are any number of reasons for Solr’s status as the standard bearer of faceted full-text searching: it’s free, fast, works shockingly well out of the box without any tweaking, has a simple and intuitive HTTP API (making it available in the programming language of your choice) and is, by far, the easiest “enterprise-level” application to get up and running. None of its “competitors” (Sphinx, Xapian, Endeca, etc.), despite any individual advantages they might have, can claim all of these features, which goes a long way towards explaining Solr’s popularity.

The library world has definitely taken a shine to Solr: from discovery interfaces like VuFind and Primo, to repositories like Fedora, to full-text aggregators like Summon, you can find Solr under the hood of most of the hot products and services available right now. The fact that a library can install VuFind and, in about an hour, have a slick, jaw-droppingly powerful OPAC replacement that puts their legacy interface to shame is almost completely a by-product of how simple Solr is to get up and running. It’s no wonder so many libraries are adopting it (compare it to SOPAC, which is also built in PHP and about as old, but uses Sphinx for the full-text indexing and is hardly ever seen in the wild).

Without a doubt, Solr is pretty much a no-brainer if you are able to run Jetty (or Tomcat or JBoss or Glassfish or whatever): with enough hardware, Solr can scale up to pretty much whatever your need might be. The problem (at least the problem in my mind) is that Solr doesn’t scale down terribly well. If you host your content with a cheap, shared web hosting provider or on a VPS, for example, Solr is either not available or not practical (it doesn’t do well in low-memory environments). The hosted Solr options are fairly expensive, and while there are cheap, shared web hosting providers that do provide Java application servers, switching vendors just to provide faceted search for your mid-size Drupal or Omeka site might not be entirely practical or desirable.

I find myself proof-of-concept-ing a lot of hacks to projects like VuFind, Blacklight, Kochief and whatnot and run these things off of my shared web server.  It’s older, underpowered and only has 1GB of RAM.  Since I’m not running any of these projects in production (just really making things available for others to see), it was really annoying to have Solr gobbling up 20% of the available RAM for these little pet projects.  What I wanted was something that acted more or less like Solr when you pointed an application that expected Solr to be there, but I wanted it to have a small footprint that could run (almost) anywhere and more or less disappear when it was idle.

So it was for this scenario that I wrote CheapSkate: a Solr emulator written in Ruby.  It uses Ferret, the Ruby port of Lucene, as the full-text indexing engine and Sinatra to supply the HTTP API.  Ferret is fast, scales quite well and responds to the same search syntax as Solr, so I knew it could handle the search aspect pretty easily.  Faceting (as can be expected) proved the harder part.  Originally, I was storing the values of fields in an RDBMS and using that to provide the facets.  Read performance was ok, although anything over 5,000 results would start to bog down – the real problem was the write performance, which was simply woeful.  Part of the issue was that this design was completely schemaless:  you could send anything to CheapSkate and facet on any field, regardless of size.  It also tried to maintain the type of the incoming field value:  dates were stored as dates, numbers stored as integers and so on.  Basically the lack of constraints made it wildly inefficient.

Eventually, I dropped the RDBMS component and started playing around with Ferret’s terms capabilities. If you set a particular field to be untokenized, your field values appear exactly as you put them in. This is perfect for faceting (you don’t want stemming and whatnot on your query filters, and because your strings aren’t normalized or downcased or anything, they look right in the UI) and is basically the same thing Solr itself does. Instead of a schema.xml, CheapSkate has a schema.yml, but it works essentially the same way: you define your fields, which ones should be tokenized (that is, which fields allow full-text search) and which should not (i.e. facet fields), and what datatype each field should be.
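To make the idea concrete, here is a rough sketch of what a schema along these lines might look like and how it could be read in Ruby. The YAML layout, the key names (tokenized, type) and the field names are all made up for illustration; this is not CheapSkate’s real schema.yml format.

```ruby
require 'yaml'

# Hypothetical schema layout, NOT CheapSkate's actual schema.yml.
SCHEMA = YAML.safe_load(<<~SCHEMA_YAML)
  fields:
    title:
      tokenized: true
      type: string
    format:
      tokenized: false
      type: string
    publish_year:
      tokenized: false
      type: integer
SCHEMA_YAML

# Untokenized fields keep their values verbatim, so they are the facet fields.
def facet_fields(schema)
  schema['fields'].reject { |_name, opts| opts['tokenized'] }.keys
end

# Tokenized fields are the ones analyzed for full-text search.
def fulltext_fields(schema)
  schema['fields'].select { |_name, opts| opts['tokenized'] }.keys
end

# Coerce an incoming string value to the declared type (strings pass through).
def coerce(schema, field, value)
  case schema['fields'].fetch(field)['type']
  when 'integer' then Integer(value)
  when 'boolean' then value.to_s == 'true'
  else value.to_s
  end
end

puts facet_fields(SCHEMA).inspect
puts fulltext_fields(SCHEMA).inspect
```

Declaring types up front is what lets the indexer skip the guesswork that made the schemaless version so slow.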

CheapSkate doesn’t support all of the field types that Solr does, but it supports strings, numbers, dates and booleans.

One neat thing about Ferret is that you can pass a Ruby Proc to the search method as a search option. This proc then has access to the search results as Ferret is finding them. CheapSkate uses this to find the terms in the untokenized fields for each search hit, throws them in a Hash and generates a hit count for each term. This is a lot faster than getting all the document ids from the search, looping over them and generating your term hash after the search is completed. That said, this is still definitely the bottleneck for CheapSkate. If the search result has more than 10,000-15,000 hits, performance begins to get pretty heavily impacted by grabbing the facets. I’m not terribly concerned by this, since data sets with search results in the 20,000+ range start to creep into the “you would be better off just using Solr” domain. For my proofs of concept, this has only really raised its head in VuFind when filtering on something like “Book” (with no search terms) for a 50,000 record collection. What I mean to say is, this happens for fairly non-useful searches.
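The counting trick is easier to see in code. What follows is a minimal sketch in plain Ruby, with a toy search method standing in for the index; it does not use Ferret’s API, and the documents, method names and field names are invented. The point is only the shape of the technique: the tally runs inside the search loop rather than in a second pass over the results.

```ruby
# Made-up documents; :format and :language play the role of untokenized fields.
DOCS = [
  { id: 0, text: 'ruby search engine', format: 'Book',    language: 'English' },
  { id: 1, text: 'faceted search',     format: 'Book',    language: 'French'  },
  { id: 2, text: 'search on a budget', format: 'Journal', language: 'English' },
  { id: 3, text: 'cooking for one',    format: 'Book',    language: 'English' }
].freeze

# Hypothetical search: calls on_hit with each matching document as it is
# found, and returns the ids of the matches.
def search(term, on_hit)
  DOCS.each_with_object([]) do |doc, ids|
    next unless doc[:text].split.include?(term)

    on_hit.call(doc)
    ids << doc[:id]
  end
end

# Build facet counts while the search runs instead of afterwards.
def search_with_facets(term, facet_fields)
  counts = Hash.new { |hash, field| hash[field] = Hash.new(0) }
  tally = proc do |doc|
    facet_fields.each { |field| counts[field][doc[field]] += 1 }
  end
  ids = search(term, tally)
  [ids, counts]
end

ids, facets = search_with_facets('search', [:format, :language])
puts ids.inspect
puts facets.inspect
```

The work per hit is still one hash update per facet field, which is why the cost climbs steadily with the size of the result set.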

Overall, I’ve been pretty happy with how CheapSkate is working.  For regular searching it does pretty well (although, like I said, I’m not trying to run a production discovery system that pleases both librarians and users).  There’s a very poorly designed “more like this” handler that really needs an overhaul and there is no “did you mean” (spellcheck).  This hasn’t been a huge priority, because I don’t really like the spellcheck in Solr all that much, anyway.  That said, if somebody really wanted this and had an idea of how it would be implemented in Ferret, I’d be happy to add it.

Ideally, I’d like to see something like CheapSkate in PHP using Zend_Search_Lucene, since that would be accessible to virtually everybody, but that’s a project for somebody else.

In the meantime, if you want to see some examples of CheapSkate in action:

One important caveat for projects like VuFind and Blacklight: CheapSkate doesn’t work with Solrmarc, which requires Solr to return responses in the javabin format (it may be possible to hack out something that looks enough like javabin to fool Solrmarc; I just haven’t figured it out). My workaround has been to populate a local Solr index with Solrmarc and then just dump all of the documents out of Solr into CheapSkate.
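The “dump everything out of Solr” half of that workaround can be sketched roughly as follows. It pages through a match-all query using Solr’s standard select parameters (q, start, rows, wt=json). The method names, the page size and the localhost URL in the usage comment are assumptions for illustration, and the step that pushes each document into CheapSkate is deliberately left out, since it depends on the update handler you are targeting.

```ruby
require 'json'
require 'uri'

# Build a Solr select URL for one page of a match-all query.
def select_url(base, start, rows)
  query = URI.encode_www_form('q' => '*:*', 'start' => start, 'rows' => rows, 'wt' => 'json')
  "#{base}/select?#{query}"
end

# Walk the whole index one page at a time, yielding each document hash.
# `fetch` is any callable that takes a URL and returns the response body;
# in real use it could wrap Net::HTTP.get.
def each_solr_doc(base, fetch, rows: 100)
  start = 0
  loop do
    body = JSON.parse(fetch.call(select_url(base, start, rows)))
    docs = body['response']['docs']
    docs.each { |doc| yield doc }
    start += rows
    break if docs.empty? || start >= body['response']['numFound']
  end
end

# Usage sketch (assumes a local Solr at this example URL):
#   require 'net/http'
#   fetch = ->(url) { Net::HTTP.get(URI(url)) }
#   each_solr_doc('http://localhost:8983/solr', fetch) { |doc| puts doc['id'] }
```

Passing the fetcher in as a callable keeps the paging logic testable without a running Solr instance.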

3 comments
  1. MJ Suhonos said:

    Very Cool Stuff(TM), Ross! I’ll have to give CheapSkate a spin, which should be easy using Ruby on OSX.

    (Sidenote: I’m a LAMP guy originally, so the more lightweight, non-Java options we have — especially PHP — the happier I am.)

    I’ve been playing with ElasticSearch (http://www.elasticsearch.com/) for a similar alternative to Solr, and a bare naked instance is about 110MB of RAM, although that may just be the default JVM settings. With 90000 documents, half a dozen facets, and a pile of MLT queries, around 260M. Query performance is excellent.

    But you do need a JVM, so maybe it fits better under the “you would be better off just using Solr” domain. :-)

    MJ

  2. Ross said:

    MJ, it was designed and developed on OSX, so you should have no problems there.

    I have been keeping my eye on ElasticSearch, too – it definitely looks really slick. The only issue I see is that this (library – and I imagine others) ecosystem is so heavily invested in Solr’s API (specifically) that it will be hard for projects like ElasticSearch to make serious inroads without some significantly outstanding feature that everybody thinks is the new hotness. BTW, those JVM stats are almost identical to what I was getting with Solr.

    I’m not allergic to Java, per se (well, using it, anyway — I’ll do almost anything to avoid writing it), but it’s just not going to be feasible for everybody. Unfortunately, that’s also true for Ruby (although it’s a different problem).

    Like I said, I would love to see a PHP port, and hopefully CheapSkate can be used as a template for that. The similarities in API between Ferret and Zend_Search_Lucene should make that pretty easy. The major hurdle with Zend_Search_Lucene (there really needs to be a shortened nick for that) is going to be the facets, since you don’t have the handy lambda option that Ferret has. It may actually be worth revisiting the RDBMS concept for a PHP port (just for the untokenized/faceting fields), since you could store the field values keyed on the document id, and it doesn’t really cost you anything to just loop through all of the search results and get the document ids. The real performance hit is in opening the documents out of the index and reading the values.

    Of course, a GROUP BY/COUNT SQL query on 15,000 document ids is still a non-trivial performance consideration…

  3. Frank said:

    Hey Ross, I’m a front-end web developer and I have some basic back-end skills in php. I know enough to set up content management systems and basic database fetches. I was wondering if you could write an install guide for dummies that will help me set up CheapSkate on my shared web hosting provider. I’d really like to use faceted search on my next small scale project.
