Archive for the ‘libraries’ Category

* Dear Innovative Customers

Posted on April 9th, 2008 by Ross. Filed under libraries.


Why do you put up with this crap?

Maybe it’s about time you all started to explore this idea again.




* Filing an extension on my fifteen minutes

Posted on January 7th, 2008 by Ross. Filed under libraries, philosophizing.


I was reading Brian’s appeal for more Emerils in the library world (bam!), noticed Steven Bell’s comment (his blog posting was a response to one by Steven in the first place) and it got me thinking.

First off, I don’t necessarily buy into Brian’s argument.  Maybe it’s due to the fact that he’s younger than me, but my noisy, unwanted opinions aren’t because I didn’t get a pretty enough pony for my sixteenth birthday or because I saw Jason Kidd’s house on Cribs™ and want to see my slam dunk highlights on SportsCenter on my 40″ flat screens in every bathroom.  It’s because I feel I have something to offer libraries and I genuinely want to help effect change.  Really, I know this is what motivates Brian, too, despite his E! Network thesis, because we worked together and I know his ideas.

Brian doesn’t have to worry about his fifteen minutes coming to a close anytime soon.  Although at first blush it would appear that the niche he has carved out for himself is potentially flash-in-the-pan-y (Facebook, Second Life, library gaming, other Library 2.0 conceits), the motivation for why he does what he does is anything but.  He is really just trying to meet users where they are, on their terms, to help them with their library experience.

Technologies will change and so, too, will Brian, but that’s not the point.  He’ll adapt and adjust his methods to best suit what comes down the pike, as it comes down the pike (proactively, rather than reactively) and continue to be a vanguard in engaging users on their own turf.  More importantly, though, I think he can continue to be a voice in libraries because he works in a library and if you have some creative initiative it’s very easy to stand out and make yourself heard.

Brian and I used to joke about the library rock star lifestyle:  articles, accolades, speaking gigs, etc.  A lot of this comes pretty easily, however.  If you can articulate some rational ideas and show a little something to back those ideas up, you can quickly make a name for yourself.  Information science wants visionary people (regardless of whether or not they follow that leader) and librarians want to hear new ideas for how to solve old problems.  Being a rock star is pretty easy; being a revolutionary is considerably harder.

I made the jump from library to vendor because I wanted to see my ideas affect a larger radius than what I could do at a single node.  It has been an interesting adjustment and I’m definitely still trying to find my footing.  It has been much, much more difficult to stand out because I am suddenly surrounded by a bunch of people that are much smarter than me, much better developers than me, and have more experience applying technology on a large scale.  This is not to say that I haven’t worked with brilliant people in libraries (certainly I have, Brian among them), but the ratio has never been quite like this.  Add to that the fact that being a noisy, opinionated voice within a vendor has its immediate share of skeptics and cynics (who are the ‘rock stars’ in the vendor community?  Stephen Abram?  Shoot me.), and I may find myself falling into Steven Bell’s dustbin.  Then again, I might be able to eventually influence the sorts of changes that inspired me to make the leap in the first place.  I can do without the stardom in that case.




* Tales from the Open Content Alliance

Posted on October 22nd, 2007 by Ross. Filed under libraries.


San Francisco is a truly miserable town to try to recuperate from a sore throat.  Not that anyone would consider a place with the nickname ‘Fog City’ to be a good place to convalesce, especially since walking around in the wet and cold is so desirable given the charms of the city.  And so I found myself last week (and my first week at Talis), slogging through precipitation to the Open Content Alliance’s Annual Meeting at the Officer’s Club at the Presidio.

Festivities got a late start and began with everyone in the room (all hundred-something of us) standing up and introducing ourselves.  While this helped put names to faces (otherwise I never would have met John Mignault) and identify some of the OCA’s initiatives, it had the added effect of pushing us completely off schedule.

Brewster Kahle talked for a while about the need to focus on texts between 1923 and 1963 (or whatever); a vast majority of these works would likely be out of copyright, it just takes some research or contacting the rights holder.  We cannot just digitize works pre-1923; this has diminishing returns of value.  Post-1964 materials will likely never be available, so this middle group needs to be exploited.

Next, we got a report out on some of the activities the OCA has been undertaking for the last year: microfilm digitization from UIUC; the Biodiversity Heritage Library; printing on demand; and briefly on the Open Library.  The microfilm digitization project is interesting.  They can scan something like a roll every hour.  I had a similar idea when I was at Tech, although my plan had been to use the existing, public microform scanners.  Obviously that would have been a lot slower and with possibly mixed results, but a lot cheaper.

The Biodiversity Heritage Library is a consortium of 10 natural history museums and botanical gardens (including the Smithsonian, New York Botanical Garden, Kew Gardens) working to create a subject-specific portal to their collections.  If there were a lot of details said about this I must have tuned them out.  Same goes for the Open Library project.  I’m pretty sure both of these updates were rather light on specifics.  Printing on demand was from a vendor:  the simple message was that this is becoming affordable.  The cool part is that the economics are exactly the same for printing one book as they are for creating 1000 copies of the book, and that is around $0.01 per page.

After a break (where I talked to John Mignault the whole time), we had breakout sessions.  I attended “Sharing and Integration of Bibliographic Records” and intended to sit and listen.  Instead, I wound up talking.  A lot.  One of the main issues of conversation surrounded a proposal by Bowker (the ISBN issuing agency for North America) to supply ISBNs to digitized copies of works that would not originally have had an ISBN (which, in the context of the Open Content Alliance, would be nearly all of them).  Bowker has made a deal that they will offer 3 million ISBNs (250,000 per library) as a gift to the non-profit sources in the OCA.  After a library burns through their 250k, the ISBNs are $0.10 each.

Superficially, you could either love or hate this proposal, but as debate wore on, it became easy to both love and hate it.  I think I argued for both sides during the course of the session.  While the payoffs seem logical (it would be nice to discover these materials by related ISBN, certainly), there are also some pitfalls.  For instance, the OCA isn’t terribly effective at discouraging multiple institutions from digitizing the same edition/run of a particular book.  This means that two different scans of essentially the same book could potentially have two different ISBNs.  This actually complies with the ISBN specification, but it certainly wouldn’t comply with people’s expectations.

Also, it is unclear how these records should be treated in services like OCLC.  If they get an ISBN, does that mean a new record should be added?  If multiple scans are created, what is the appropriate 856 to add to an existing record?  What is the ‘authoritative’ URL?

We also talked a bit about the static nature of the Internet Archive’s metadata.  It is assumed that the metadata will not change after a scan is loaded into the archive, but this is hardly the reality.  How then does the IA know of the updates made at the owning institution?  This seems like the perfect application of RDF; the IA would just point at the owning institution’s record, but that would obviously require some infrastructure.  The notion of a pingback, like weblogs use, was raised.

After lunch, we got reports from the breakouts.  The ILL/Scan-on-demand group came up with a process to share still in copyright items.  Also they made the recommendation that no ILL charges be made on these requests.  This is really quite an interesting development and I’m awfully impressed they made the progress they did.  It’s obviously because I wasn’t there to argue about every little point.

Brewster Kahle then had those interested in the ISBN deal go back into a room and work things out.  The same issues came up, but this time I feel they were largely ignored.  Kahle wants to implement ISBNs and really wouldn’t take any other answer, so the plan is to figure out how to make this work.  Since he’s willing to pony up a lot of the cash to make it happen, it’s certainly his call.  His argument was, basically, “it will work or it won’t”.  Sure, but how will it affect the use of ISBNs in the meantime?

We broke up to listen to Carl Lagoze speak about life after we’ve digitized everything.  His thesis was, in a nutshell, what are we going to do after we’ve aggregated everything into repositories?  It’s imperative that we come up with value on top of this data we’re accumulating; it’s not enough just to collect it.  We need to make associations, content and means to allow our researchers to leverage the objects we have.  He opened the floor to discussion on this topic and comments varied.  I think I was still hung up on the ISBN thing and wasn’t in the mood to argue anymore.

I skipped the reception to meet Ian and Paul and get my Macbook Pro.  Unfortunately, I had to fly out the next morning, but all in all it was a good trip!




* HACK IS NOT A CRIME

Posted on August 15th, 2007 by Ross. Filed under coding, libraries, ruby.


…although LinuX_Xploit_Crew, with all due respect, I think it actually is.

Oh well, we’re back with a new theme (which nobody will see except to read comments, since I’m pretty sure all traffic comes from the code4lib planet) and an updated WordPress install. Look out, world!

So, in the downtime here’s a non-comprehensive rundown of what I’ve been working on:

  1. I’ve written an improved (at least, I think it’s improved) alternative to Docutek’s Eres RSS interface. Frankly, Docutek’s sucked. Maybe we have an outdated version of Eres, but the RSS feeds would give errors because you have to click through a copyright scare page before you can view reserves, but you can’t link the RSS links to this form and get the item. I wrote a little Ruby/Camping app that takes urls like: http://eres.library.gatech.edu/course/WS-1001-A/Fall/2007 and turns that into a usable feed. I needed the course id/term/year format to show them in Sakai. My favorite part of this project was finding Rubyscript2exe. This allows me to just bundle one file (the compiled camping app) plus a configuration file. Granted, an asp.net app would be even easier for sites to install, but I didn’t have time to learn asp.net. I have more ideas of what I would like to do with this (such as show current circ status for physical reserves), but in the chaos that is our library reorg, I haven’t gotten around to even showing anybody what I’ve written so far.
  2. I broke ground on a Metalib X-Server Ruby library. It took me a while to wrap my head around how this needed to be modelled, but I think it’s starting to take shape. It doesn’t actually perform queries, yet, but it connects to the server, allows you to set the portal association and find and set the category to search. Quicksets and MySets are all derivations of the same concept, so I don’t think it will take me long to incorporate actual searching. For proof-of-concept, I plan on embedding this library in MemoryHole, our Solr-based discovery app. I’ve actually stopped development of MemoryHole so we can focus on vuFind, since they do functionally the same thing and I’d rather help make vuFind better than replicate everything it does, only in Ruby. The reason I’m doing this proof-of-concept in MemoryHole rather than vuFind is solely due to familiarity and time.
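
For the Eres feed app in item 1, the URL handling might look roughly like the sketch below. This is not the actual Camping app; the helper name, the struct and the regular expression are all made up for illustration, and the only thing taken from the post is the course id/term/year path format.

```ruby
# Hypothetical helper: split a path such as "/course/WS-1001-A/Fall/2007"
# into the course id, term and year that the feed lookup would need.
CoursePath = Struct.new(:course_id, :term, :year)

# Assumed path shape: /course/<course id>/<term>/<four-digit year>
COURSE_PATH = %r{\A/course/(?<course_id>[^/]+)/(?<term>[^/]+)/(?<year>\d{4})/?\z}

def parse_course_path(path)
  match = COURSE_PATH.match(path)
  return nil unless match # not a course URL we understand

  CoursePath.new(match[:course_id], match[:term], match[:year].to_i)
end

course = parse_course_path("/course/WS-1001-A/Fall/2007")
puts "#{course.course_id} #{course.term} #{course.year}"
```

Anything that doesn't match the pattern comes back as nil, so the app can answer with a 404 rather than a broken feed.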

In other news, my last post seems to have caused a bit of a stir. My plan is to write a response, but the short of it is that I feel the arguments for an MLS are extremely classist.

Also my bathroom is finished and it looks great.

ERESidue





* Union Card

Posted on July 9th, 2007 by Ross. Filed under Master of Library Science, libraries, philosophizing.


Can anyone give a rational explanation as to why a job with a description like this:

DESCRIPTION: Provides technology and computer support for the Vanderbilt Library. The major areas of responsibility include developing, maintaining and assisting in the enhancement of interfaces to web-enabled database applications (currently implemented in perl, PHP, and MySQL). The position also helps establish and maintain guidelines (coding standards, version control, etc.) for the development of new applications in support of library patrons, staff, and faculty across the university. This position will also provide first line backup for Unix system administration. Other duties and assignments will be negotiated based on the successful candidate’s expertise, team needs, and library priorities.

would require an MLS? Library experience? Sure, I can see why that would be desirable. While I find it ridiculous when many libraries require an MLS for what is essentially an IT manager, Vandy is upping the ante here and requiring it for a developer/jr. sysadmin.

I guess that’s a way to prop up the profession.




* In search of… Bigfoot

Posted on March 29th, 2007 by Ross. Filed under coding, libraries, ruby.


Before I left for Guatemala, Ian Davis at Talis asked if I could give him a dump of our MARC records to load into Talis Platform. I had been talking in the #code4lib channel about how I was pushing the idea of using Talis Source to make simple, ad-hoc union catalogs; we could make one for Georgia Tech & Emory (we have joint degree programs) or Arche or Georgia Tech/Atlanta-Fulton Public Library, etc. My thinking was that by utilizing the Talis Platform, we could forgo much of the headache in actually making a union catalog for somewhat marginal use cases (the public library one notwithstanding).

About a week after I got back from Guatemala, I had an email from Richard Wallis with some urls to play around with to access my Bigfoot store. He showed me search services, facet services and augment services. I was unable to really dive into it much at the time, but since I’m working on a total site search project for the library, I thought this would be a good chance to kick the tires a bit and include catalog results.

After two days of poking around, I have formed some opinions of it, have some recommendations for it, and have written a Ruby library to access it.

1) The Item Service

This is certainly the most straightforward and, for many people, the most useful service of the bunch. The easiest way to think of the item service is as an HTTP-based Lucene service (a la Solr or Lucene-WS) of your bib records. It returns something OpenSearch-y (it claims to be an RSS 1.0 document), but it doesn’t validate. That being said, FeedTools happily consumed it (more on that later) and the semantics should be familiar to anyone that has looked at OpenSearch before. Each item node also contains a Dublin Core representation of the record and a link to a marcxml representation. I’m not sure if there’s a description document for Bigfoot.

Although the query syntax is pure Lucene (title:"The Lexus and the Olive Tree"), the downside is that it’s not documented anywhere what the indexes are and I doubt there would be any way to add new ones (for example, my guess is I wouldn’t be able to get an index for 490/440$v that I use for the Umlaut). I don’t see returning the results as OAI_DC being too much of a problem, since the RSS item includes a title (which would have been tricky between the DC and the marcxml). My Ruby library might not generate valid DC; I haven’t really looked into it.
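
To make the query syntax concrete, here is a small sketch of building an item search URL. The endpoint and the "query" parameter name are placeholders I made up (the real service URL and parameter aren't given here), and this is not how bigfoot-ruby does it; it only shows a Lucene phrase query being escaped for a GET request.

```ruby
require "uri"

# Placeholder endpoint, not the real Bigfoot item service URL.
ITEM_SERVICE = "http://example.org/stores/my-store/items"

# Build a Lucene phrase query on one field, escaping any embedded quotes.
def lucene_phrase(field, phrase)
  escaped = phrase.gsub('"') { '\"' }
  %(#{field}:"#{escaped}")
end

# The parameter name "query" is an assumption for illustration.
def item_search_url(query, param: "query")
  "#{ITEM_SERVICE}?#{URI.encode_www_form(param => query)}"
end

puts item_search_url(lucene_phrase("title", "The Lexus and the Olive Tree"))
```

Note the straight quotes: Lucene won't treat curly quotes as phrase delimiters, which is an easy thing to trip over when pasting a query from a web page.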

The docs also mention you can POST items to your Bigfoot store, but they don’t mention what your data needs to look like (MARC?) or what credentials you need to add something (I mean, it must be more than just your store name, right?). My hope is to add this functionality to bigfoot-ruby soon (especially since my data is from a bulk export from last October).

2) The Facet Service

This one is intriguing, definitely, since faceted searching is all the rage right now.  The search syntax is basically the same as the Item Service, except you also send a comma-delimited list of the fields you would like to facet on.  What you get back is either an XML or XHTML document of your results.

For each field you request, you get back a set of terms (you can specify how many you want, with a default of 5) that appear most frequently in your field.  You also get an approximation for how many results you would get in that facet and a url to search on that facet.  It’s quite fast, although, realistically, you can’t do much with the output of facet search alone.
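
As a rough illustration of what a facet response boils down to, the sketch below counts the most frequent terms per field over a handful of toy records. Everything in it is invented for the example (the records, the method names, the tie-breaking rule); the real service does this on the server against its own indexes, and returns approximations rather than exact counts.

```ruby
# Toy records invented for illustration; a real store holds indexed bib data.
RECORDS = [
  { subject: ["Globalization", "Capitalism"], creator: ["Friedman, Thomas L."] },
  { subject: ["Globalization"], creator: ["Friedman, Thomas L."] },
  { subject: ["Cartography"], creator: ["Author, Example"] }
].freeze

# Turn a comma-delimited field list ("subject, creator") into symbols.
def parse_fields(list)
  list.split(",").map { |name| name.strip.to_sym }
end

# Return the `limit` most frequent terms for a field as [term, count] pairs.
# Ties are broken alphabetically so the output is stable.
def top_terms(records, field, limit = 5)
  counts = Hash.new(0)
  records.each { |record| Array(record[field]).each { |term| counts[term] += 1 } }
  counts.sort_by { |term, count| [-count, term] }.first(limit)
end

parse_fields("subject, creator").each do |field|
  puts "#{field}: #{top_terms(RECORDS, field).inspect}"
end
```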

Again, it’s difficult to know what you can facet on (subject, creator and date are all useful, and I’m sure there are others) and the facet that (for me, at least) held the most promise, type, is too broad to do much with (it uses Leader position 7, but lumps the BKS and SER types all in a label called “text”).  I would like to see Talis implement something like my MARC::TypedRecord concept so one could facet on things like government document or conference.  You could separate newspapers from journals and globes from maps.  Still, the text analysis of the non-fixed fields is powerful and useful and beats the hell out of trying to implement something like that locally.
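
A very rough sketch of the kind of finer-grained typing I mean follows. It is not MARC::TypedRecord, and the method name and labels are made up; it only reads two MARC leader bytes (Leader/06, type of record, and Leader/07, bibliographic level). Telling newspapers from journals or globes from maps would need the 008 and 007 fields as well, which this ignores.

```ruby
# Classify a record from its 24-character MARC leader.
# Leader/06 = type of record, Leader/07 = bibliographic level.
def record_type(leader)
  type = leader[6]
  level = leader[7]

  case type
  when "a" # language material
    # Simplification: anything that isn't a serial is treated as a book.
    level == "s" ? :serial : :book
  when "e" # cartographic material
    :map
  else
    :other
  end
end

puts record_type("00000nam a2200000 a 4500")
puts record_type("00000nas a2200000 a 4500")
puts record_type("00000nem a2200000 a 4500")
```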

In bigfoot-ruby, I have provided two ways to do a faceted search:  you can just do the search and get back Facet objects containing the terms and search urls or you can facet with items which executes the item searches automatically (in turn getting a definitive number of results for the query, as well).  Since I didn’t bother to implement threading, getting facets with items can be pretty slow.
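
For what it’s worth, the threading I skipped would not take much. The sketch below is not bigfoot-ruby code: count_items is a stand-in that sleeps instead of making an HTTP request, and its fake result is just the length of the query string so the example stays deterministic. The point is only that one thread per facet term lets the item searches overlap instead of running back to back.

```ruby
# Stand-in for an item search; a real client would make an HTTP request here.
def count_items(query)
  sleep 0.05     # simulate network latency
  query.length   # fake "result count" so the example is deterministic
end

# Run one item search per query in its own thread and collect the counts.
def counts_for(queries)
  threads = queries.map do |query|
    Thread.new { [query, count_items(query)] }
  end
  # Thread#value waits for the thread to finish and returns the block's result.
  threads.map(&:value).to_h
end

p counts_for(["subject:maps", "subject:globes"])
```

With real requests you would probably want to cap the number of threads and rescue errors per request, so one slow or failed facet doesn't take the whole page down.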

3) The Augment Service

To be honest, I’m having a hard time figuring out useful scenarios for the augment service.  The idea is that you give it the URI of an RSS feed, and this service will enhance it with data from your Bigfoot store (at least, that is sort of how I understand it works).  Richard’s example for me was to feed it the output of an xISBN query (which isn’t in RSS 1.0, AFAIK, but, for the sake of example…) and the augment service would fill in the data for ISBNs your library holds.  The API example page mentions Wikipedia, but I don’t know where other than the Talis Platform that you can get Wikipedia entries formatted properly.  I tried sending it the results of an Umlaut2 OpenSearch query, but it didn’t do anything with it.  Presumably this RSS 1.0 feed needs the bib data to be sent in a certain way (my guess is in OAI_DC, like the Item Service), but I’m not sure.  The only use case I can think of for this service is a much simpler way to check for ISBN concordance (rather than isbn:(123456789X|223456789X|323456789X|etc.))
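
For comparison, the long-hand concordance query is easy enough to build yourself. A minimal sketch, assuming the isbn:(a|b|c) form shown above is what the store accepts; the method name is made up, and it does no check-digit validation.

```ruby
# Build an "isbn:(a|b|c)" query from a list of ISBNs, e.g. from xISBN.
# Hyphens and whitespace are stripped, and duplicates are dropped.
def isbn_query(isbns)
  cleaned = isbns.map { |isbn| isbn.gsub(/[\s-]/, "").upcase }.uniq
  "isbn:(#{cleaned.join('|')})"
end

puts isbn_query(["1-23456-789-X", "223456789x", "123456789X"])
```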

Overall, I’m really impressed with the Talis API.  It is a LOT easier to use than, say, Z39.50 and, by using OpenSearch, seems more natural to integrate into existing web services than SRU.

Bigfoot-ruby is definitely a work in progress.  I think I would like to split the Search class into ItemService and FacetService.  I don’t like how results is an Array for items and a Hash for facets. Just seems sloppy.  I need to document it, of course and I would like to implement Item POST.  This project also made me realize how bloody slow FeedTools is.  I am currently using it in both the Umlaut and the Finding Aids to provide OpenSearch, but I think it’s really too sluggish to justify itself.

Thanks, Talis, for getting me started with Bigfoot and giving me the opportunity to play around with it.  Also, thanks to Ed Summers for fixing SVN on Code4lib.org.  You wouldn’t be able to download it and futz around with it yourself, otherwise.




* A proposal to Endeavor Voyager customers

Posted on March 7th, 2007 by Ross. Filed under Polishing the Turd, libraries, philosophizing.


If YPOW, like MPOW, is an Endeavor Voyager site, you’ve got some decisions ahead.  Francisco Partners, naturally, would like you to migrate to Aleph, and I have no doubt that Ex Libris is, as I write this, busily working on a means to make that easy for Voyager libraries to do.  But ILS migrations are painful, no matter how easy the backend process might be.  There’s staff training,  user training, managing new workflows, site integration; lots of things to deal with.  Also, your functionality may not be a 1:1 relationship to what you currently have.  How do you work around services you depended upon?

Since soon our contracts with Endeavor Information Systems will be next to worthless, I propose, Voyager customers, that we take ownership of our systems.  For the price of a full Oracle (or SQL Server? — does Voyager support other RDBMSes?) license  (many of us already have this), we can get write permissions to our DB and make our own interfaces.  We wouldn’t need to worry about staff clients (for now), since we already have cataloging, circulation, acquisitions, etc. modules that work.  When we’re ready for different functionality, however, we can create a new middleware (in fact, I’m planning to break ground on this in the next two weeks) to allow for web clients or, even better, piggyback on Evergreen’s staff clients and let somebody else do the hard work.  If we had native clients in the new middleware, a library could use any database backend they wanted (just migrate the data from Oracle into something else).  The key is write access to the database.

By taking ownership of our ILS, we can push developments we want, such as NCIP, a ‘Next Gen OPAC’, better link resolver integration, better metasearch integration, etc. without the pain of starting all over again (with potentially the same results; who is to say that whatever you choose as an ILS wouldn’t eventually get bought and killed off, as well?).  Putting my money (or lack thereof) where my mouth is, I plan on migrating Fancy Pants to use such a backend (read-only db access, for now; we still have a support contract, after all).  I’m calling this project ‘Bon Voyage’.  After reading Birkin’s post on CODE4LIB, I would like to make a similar service for Voyager that would basically take the place of the Z39.50 server and access to the database.  Fancy Pants wouldn’t be integrated into Bon Voyage; it would just be another client (since it was always only meant as a stopgap, anyway).

What we’ll have is a framework for getting at the database backend (it’d be safe to say this will be a rails project) with APIs to access bib, item, patron, etc. information.  Once the models are created, it will be relatively simple to transition to ‘write’ access when that becomes necessary.  Making a replacement for WebVoyage would be fairly trivial once the architecture is in place.  Web based staff clients would also be fairly simple.  I think EG staff client integration wouldn’t be too hard since it would just be an issue of outputting our data to something the EG clients want (JSON, I believe) and translating the client’s response.  That would need to be investigated more, however (I’m on paternity leave and not doing things like that right now :)

Would anybody find this useful?

It seems the money we spend on an ILS could be better spent elsewhere.  I don’t think this would be a product we could distribute outside of the current Voyager customer base (at least, not until it was completely native… maybe not even then; we’d have to work this out with Francisco Partners, I guess), but I think that base is big enough to be sustainable on its own.




* YYZ

Posted on February 13th, 2007 by Ross. Filed under Toronto, libraries, presentations.


At the beginning of the month, I gave two presentations at the Ontario Library Association’s SuperConference. I had a good time. My first time to Toronto and it snowed. Although, with all due respect, Mr. Lee, YYZ totally sucked.

My presentations were:

Librarian’s Lib: Taking control of what’s ours

The Communicat: contribute to the collective collection, comrade

Some pictures I took while there




* Nice threads

Posted on December 21st, 2006 by Ross. Filed under Polishing the Turd, Ruby on Rails, coding, libraries.


I have been working on Fancy-Pants quite a bit in the last couple of weeks. This is an AJAX layer over Voyager’s WebVoyage — an attempt to de-suck-ify its interface a bit. Why is it called Fancy-Pants? Well, Voyager still has the same underwear, it’s just got a new set of britches.

There are two main problems that it’s trying to solve:

  1. For items that have more than one MFHD, WebVoyage won’t show any item information in the title list.
  2. We wanted to link to 856 URLs from the title list.

Now, we’re already doing the second one, but it’s not implemented particularly well. While we were solving those problems, we wanted to see what we could do about that god-awful table based display.

I took NCSU’s Endeca layout as the baseline template for what I wanted the results to look like. Right now, Fancy-Pants can only be accessed via this Greasemonkey script [get Greasemonkey here]. Greasemonkey, of course, wouldn’t be a requirement, but we’re using it to inject the initial javascript call since we’re having to work on a live system.

For the title list screen, the javascript is looping through the bib ids on the page (it grabs them from the ‘save record’ checkboxes) and sends them to a Ruby on Rails app that queries Voyager’s Oracle database and builds a new result set. The javascript hides the original page results (display: none) and inserts a div with the new results. If there are multiple 856es or locations, the result has expanding/collapsing divs to show/hide them.

I send the query terms to Yahoo’s spell check API and will return a link to any suggestions it gives. No, this isn’t the ideal, but I’m still in proof-of-concept stage.

Things I still want to do with title list screen are:

  1. Come up with a way to show what the item is (journal, microform, map, etc.) — I’ve started on this, but it’s very rough
  2. Make the ‘sort by’ dropdown a row of links
  3. Turn the ‘Narrow my search’ button/page into a faceted navigation menu with options that make sense for the result set (for instance, limiting language to Dutch, Middle (ca. 1050-1350) isn’t going to come into play that much). Also add some logical facets a la Evergreen
  4. Replace the ‘save record’ feature to work during the entire session and be able to save directly to Zotero, Endnote, Bibtex, CiteULike or Connotea.
  5. COinS and UnAPI
  6. Give it the same style as the rest of our new web design.

I’m currently not doing much with the record view page, but I am adding a direct link to the record.  I plan on integrating Umlaut responses here, as well as other context sensitive items - especially those that don’t conform well to OpenURL requests.

If you were able to install the Greasemonkey script and want to try it out, go to GIL’s keyword search and try:

  1. senate hearings — this is a good example of multiple mfhds/856es
  2. thomas friedmann — a good example of “Did you mean”

Also try a journal search for “Nature”.  Then try whatever floats your boat and let me know how it worked.  If you notice that it’s really slow, this is actually because of Voyager.  The “Available online” and relevance icons are all rendered dynamically and they just grind the output to a halt.  When we go live with this, we’d disable those features in WebVoyage to speed things up.

Fancy-Pants is by no means a final product.  I view this as a bridge between what we have and an upcoming Solr based catalog interface.  The Solr catalog will still need to interface with Voyager, so Fancy-Pants would transition to that.  Ultimately, I would like this whole process to eventually lead to the Communicat.




* The Soul-sucking vacuum that is EAD

Posted on December 20th, 2006 by Ross. Filed under Ruby on Rails, archives, coding, libraries, php.


I have a very conflicted relationship with our archives department. While their projects still need to get done, their services get very little use (especially when compared to other pending projects) and every time I get near any of their projects, it starts to become “Ross Singer and the Tar Baby”. Everything, EVERY SINGLE THING, archives/special collections has ever created is arcane, dense, and enormously time consuming. It always seems simple and then it somehow always turns into EAD.

About a year ago, I was tasked with developing something to help archives publish their recently converted EAD 2002 finding aids to HTML. I knew that XSLT stylesheets existed to do the heavy lifting in this so I thought it wouldn’t be too hard to get this up and running.

But it never works out that way with archives. The stylesheets that did what they liked worked only in Saxon and Saxon only works with Java (well, now with .Net, too — but that still didn’t help at the time). I wasn’t going to take on a language that I only poke at with a long stick on the most ambitious of days for some archives project (no offense, archives, but come on…). My whining caught Jeremy Frumkin’s attention who pointed me at a project that Terry Reese had done. It was a PHP/MySQL project that took the EAD, called Saxon from the commandline and printed the resulting HTML. It also indexed all these parts of the finding aid and put them in a MySQL database to enable search. I could never get the search part to work very well with our data, so I gave up that part, focusing instead on the Saxon->XSLT->cache to HTML part.

I set up a module in our intranet that let the archivists upload xml files, preview the result with the stylesheet, rollback to an earlier version, etc. No matter what, though, this was a hack. And, most importantly, I never got search working. Also, the detailed description (the dsc) was incredibly difficult to get to display how archives wanted it.

Another thing that nagged at me was, for all this work I (and they) had invested in this, how was this really any better than the HTML finding aids they were migrating from? They put all this work into encoding the EAD and all that we were really doing was displaying a web page that was less flexible in its output than the previous HTML.

This summer, I met with archives to discuss how we were going to enable searching. Their vision seemed pretty simple to implement.

Why do I keep falling for that? How many hours had I already invested in something I thought would be trivial?

The new system was (of course) built in Rails. The plan was to circumvent XSLT altogether so:

  1. The web designer could have more control over how things worked without punching the XSLT tarbaby.
  2. We could make some of that stuff in the EAD “actionable”, like the subject headings, personal names, corporate names, etc.
  3. We could avoid the XSLT nightmare that is displaying the Box/Folder/etc. list.

The fulltext indexing would be provided by Ferret. I thought I could hammer this out in a couple of weeks. I think dumb things like this a lot.

The infrastructure went up pretty quickly. Uploading, indexing, and displaying the browse lists (up until now, they had to get the web designer to add their newly encoded finding aid to the various pages to link to it) all took a little over a week, maybe. Search took maybe another week. For launch purposes, I wasn’t worried about pagination of search results. There were only around 210 finding aids in the system, so a search for “georgia” (which cast the widest net) didn’t really make a terribly unmanageable results page. That’s the nice thing about working with such a small dataset that’s not heavily used. Inefficiencies don’t matter that much. I’ve since added pagination (took a day or two).

No, like last time, the real burden was displaying the finding aid. My initial plan, parsing the XML and divvying it up into a bunch of Ruby objects, was taking a lot of time. EAD is inordinately difficult to put into logical containers that have easily usable attributes. That’s just not how EAD rolls. I found I was severely delaying launch trying to shoehorn my vision on the finding aid, which archives didn’t really care about, of course. They just wanted searching. So, while I was working out my EAD/Ruby object modeler, I deferred to XSLT to get the system out the door.

Rather than using Saxon this time, I opted for Ruby/XSLT (based on libxml/libxslt) for the main part of the document and Ruby scripting/templates for the detailed collection list. The former worked pretty well (and fast!) but the latter was turning into a nightmare of endless recursion. When I tried looping through all of the levels (EAD allows twelve numbered component levels, c01-c12, nested to whatever depth a finding aid needs), my vain attempts either exposed the horrid performance of REXML (a pure-Ruby XML parser) or turned into navigating the recursion in ways that would leave you clutching at the fibers of sanity that remain when you get this far in an EAD document.

Finally I found what I thought was my answer: a nifty little Ruby library called CobraVsMongoose that would transform a REXML document to a Ruby hash (or vice-versa). It was unbelievably fast and made working with this nested structure a WHOLE lot easier. There was some strangeness to overcome (naturally). For instance, if an element’s child nodes include an element that appears more than once, it will nestle the children in an Array. If there are no repeating elements, it will put the child nodes in a Hash (or the other way around, I can’t remember), so you have to check what the object is and process it accordingly. Still, it was fast, easy to manipulate and allowed me to launch the finding aids system.
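To make that Array-or-Hash check concrete, here is a minimal sketch of the kind of guard I mean. It does not call CobraVsMongoose at all; the hashes below are hand-written and only assume a BadgerFish-style shape (text under a '$' key), and as_list and titles are made-up helper names for illustration.

```ruby
# Hypothetical helper: always hand back an Array, whether the converter
# produced a single Hash (one child) or an Array of Hashes (repeated child).
def as_list(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Hypothetical helper: collect the unittitle text of every c02 under a c01.
def titles(c01)
  as_list(c01['c02']).map { |c| c['did']['unittitle']['$'] }
end

# Assumed shapes, written by hand rather than produced by the library.
single = {
  'c02' => { 'did' => { 'unittitle' => { '$' => 'Correspondence' } } }
}
repeated = {
  'c02' => [
    { 'did' => { 'unittitle' => { '$' => 'Correspondence' } } },
    { 'did' => { 'unittitle' => { '$' => 'Photographs' } } }
  ]
}

p titles(single)
p titles(repeated)
```

Wrapping the lookup once means the rest of the display code never has to care which of the two shapes it was handed.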

Everybody was happy.

And then archives actually looked at the detailed description. Anything in italics or bold or in those weird ‘emph’ tags wasn’t being displayed. Ok, no problem, I just need to mess with the CobraVsMongoose function… oh wait. Yeah, what I hadn’t really thought about was that Ruby hashes don’t preserve sort order, so there was no way to get the italicized or bolded or whatevered text to display where it was supposed to in the output.
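For contrast, here is a rough sketch of why walking the REXML tree itself keeps the emphasis where it belongs: the child nodes (text and elements alike) come back in document order. The render_mixed helper, the sample unittitle and the choice of an HTML em tag are all mine for illustration, not what the finding aids system actually does, and the text is not HTML-escaped here.

```ruby
require 'rexml/document'

# Hypothetical helper: flatten mixed content to a string, wrapping any
# <emph> element in <em> and keeping text and elements in document order.
def render_mixed(element)
  element.children.map do |node|
    case node
    when REXML::Text
      node.value
    when REXML::Element
      inner = render_mixed(node)
      node.name == 'emph' ? "<em>#{inner}</em>" : inner
    else
      '' # comments, processing instructions, etc. are dropped
    end
  end.join
end

doc = REXML::Document.new(
  '<unittitle>Letters from <emph render="italic">Atlanta</emph>, 1901</unittitle>'
)
puts render_mixed(doc.root)
```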

Damn you, tarbaby!

Back to the drawing board. I decided to return to REXML, emboldened now by what I thought was a better handle on the dsc structure and some better approaches to the recursion. Every c0n element would get mapped to a Ruby object (Collection, Series, Subseries, Folder, Item, etc.) and nest as children to each other and have partial views that would display them properly.
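Roughly, the recursion looks something like the sketch below. This is a simplified stand-in, not the real code: it uses one generic Component struct instead of separate Collection/Series/Folder classes, reads only the level attribute and the first text of unittitle, and the names Component and build_component are invented for the example.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the per-level classes described above.
Component = Struct.new(:level, :title, :children)

# Recursively turn a c0n element into a Component, nesting any child
# c0n elements. Children are matched by name rather than by XPath.
def build_component(element)
  title = nil
  kids = []
  element.each_element do |child|
    if child.name == 'did'
      child.each_element { |d| title = d.text if d.name == 'unittitle' }
    elsif child.name.match?(/\Ac\d\d\z/)
      kids << build_component(child)
    end
  end
  Component.new(element.attributes['level'], title, kids)
end

xml = <<~XML
  <c01 level="series">
    <did><unittitle>Correspondence</unittitle></did>
    <c02 level="file"><did><unittitle>1901</unittitle></did></c02>
    <c02 level="file"><did><unittitle>1902</unittitle></did></c02>
  </c01>
XML

series = build_component(REXML::Document.new(xml).root)
puts "#{series.level}: #{series.title} (#{series.children.size} children)"
```

Each Component could then be handed to a partial chosen by its level, which is the part that made the display manageable.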

On my first pass, it was taking 28 seconds for our largest finding aid to display. TWENTY EIGHT SECONDS?! I would like to note, as finding aids go, ours aren’t very large. So, after tweaking a bit more (eliminating checking for children in Item elements, for example), I got it down to 24 seconds. Still not exactly the breathtaking performance I had hoped for.

What was bugging me was how quick CobraVsMongoose was. Somehow, despite also using REXML, it was fast. Really fast. And here my implementation seemed more like TurtleVsSnail. I was all set to turn to Libxml-Ruby (which would require a bunch of refactoring to migrate from REXML) when I found my problem and its solution. This was last night at 11PM.

While poring over the REXML API docs, I noticed that the argument to the REXML::Element#each_element method was called ‘xpath’. Terry had written about how dreadfully slow XPath queries were with REXML and, as a result, I thought I was avoiding them. When I removed the path arg from the each_element call in one of my methods and just iterated through each child element to see if its name matched, it cut the processing time in half! So, while 12 seconds was certainly no thoroughbred, it was definitely the right track. When I eliminated every xpath in the recursion process, I got it down to about 5 seconds. Add a touch of fragment caching and the natural performance boost of a production vs. development site in Rails, and I think we’ve got a “good enough for now” solution.
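The change itself is tiny. Here is a sketch of the before and after, with made-up method names (children_via_xpath, children_via_name_check) and a toy document; it only shows that the two forms return the same elements, and the timings above came from the real finding aids, not from anything this small.

```ruby
require 'rexml/document'

# Before: the argument to each_element is an XPath expression.
def children_via_xpath(parent, name)
  found = []
  parent.each_element(name) { |child| found << child }
  found
end

# After: no path argument, just compare each child element's name.
def children_via_name_check(parent, name)
  found = []
  parent.each_element { |child| found << child if child.name == name }
  found
end

xml = <<~XML
  <c01 level="series">
    <did><unittitle>Correspondence</unittitle></did>
    <c02 level="file"><did><unittitle>1901</unittitle></did></c02>
    <c02 level="file"><did><unittitle>1902</unittitle></did></c02>
  </c01>
XML

series = REXML::Document.new(xml).root
puts children_via_xpath(series, 'c02').size
puts children_via_name_check(series, 'c02').size
```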

There’s a little more to do with this project. I plan on adding an OpenSearch interface (I have all the plumbing from adding it to the Umlaut), an OAI provider and an SRU interface (when I finally get around to porting that CQL parser). And, yeah, finishing the EAD Object Model.

But right now, archives and I need to spend a little time away from each other.

In the meantime, here’s the development site.

.