
Monthly Archives: March 2009

Jonathan Rochkind recently started a thread on the Code4lib mailing list asking how to register an info URI for SuDocs.  Ray Denenberg responded with an explanation of the process.  I won’t get into my opinions of info URIs or the merits of either side of the debate that spun out from this thread, but my takeaway was that Jonathan wasn’t really looking for an info URI, anyway.

What Jonathan wanted was:

  • A generic URI model to define SuDocs
  • For this model to be maintained and hosted by somebody other than him
  • If possible, for the URIs to be resolvable to something that makes sense for the supplied SuDoc

I think these are reasonable desires.

I also thought that there were existing structures out there that could meet his requirements without going through the “start-up costs” of registering an info URI.  Besides, info URIs are not of the web, so after going through the work of creating a ‘standard’, you still cannot use it directly to figure out what a given SuDoc refers to.

SuDocs (like all other aspects of Government Documents) are arcane, niche and not understood by anyone other than GovDoc librarians, who seem to be rare.  That being said, there is a pretty convenient web presence implicit in SuDoc — in order for a SuDoc to exist, it needs to appear in the GPO’s catalog.  Since anything that appears in the GPO’s catalog can be seen on the web, we have a basis for a dereferenceable URI structure.

The GPO uses Ex Libris’ Aleph, with what appears to be Aleph’s out-of-the-box web OPAC, for their catalog.  Last week, I was Googling for some information about SRU and Aleph and it led me to this page about constructing CCL queries into Aleph (note: please disregard almost everything written on this page about SRU, CQL, etc., as it’s almost completely b.s.).  Figuring there must be some way to search on SuDocs, I tried a couple of combinations of things until I found this page in the GPO catalog.  Ok, so the index for SuDocs is called “GVD”.

This gives us URLs like:  http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202
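Constructing one of these URLs from a raw SuDoc is trivial.  A minimal sketch in Ruby (the function name is mine; note that spaces are the only characters the example URL actually escapes):

```ruby
# Build a GPO catalog URL for a raw SuDoc class number, using the
# GVD index discovered by poking at the OPAC, as described above.
GPO_BASE = "http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D"

def gpo_url_for(sudoc)
  # Spaces become %20; slashes and colons pass through literally.
  GPO_BASE + sudoc.gsub(" ", "%20")
end

puts gpo_url_for("E 2.11/3:EL 2")
# => http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202
```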

Now, this could work, but it’s incredibly awkward.  It’s also extremely fragile, since it depends on the software (in this case, Aleph), and if it were to break, we’d need the GPO to redirect us to the right place.

This is, of course, exactly what PURLs were designed to do.  I had never actually set up a PURL, and almost didn’t for this, since purl.org said the service wasn’t working properly and would be disabled for another week.  However, all the links were there, so I forged ahead.  I was in the process of setting up a regular PURL when I ran across partial redirects.  I figured something like this had to exist for PURLs used for RDF vocabularies and the like, but I wasn’t aware of how they worked.

Anyway, they’re extremely simple.  Basically, you set up a base URL (http://purl.org/NET/foo/) and anything requested past that base URL (e.g. http://purl.org/NET/foo/bar) has its trailing path (“bar”) appended verbatim to the URL the PURL redirects to.

So, I set up a partial redirect PURL at the base:  http://purl.org/NET/sudoc/

The expectation is that the base will be followed by a properly URL-escaped SuDoc:  E 2.11/3:EL 2 becomes http://purl.org/NET/sudoc/E%202.11/3:EL%202, which tacks that SuDoc onto http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D and redirects you to http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202.
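In other words, the partial redirect is doing nothing more than string concatenation.  Here’s a sketch of the whole chain (purl_for and redirect_target are illustrative names, not anything the PURL server exposes):

```ruby
PURL_BASE = "http://purl.org/NET/sudoc/"
GPO_BASE  = "http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D"

# Mint the PURL for a raw SuDoc class number.
def purl_for(sudoc)
  PURL_BASE + sudoc.gsub(" ", "%20")
end

# What the partial redirect effectively does on the server side:
# strip the base, tack the remainder onto the target URL.
def redirect_target(purl)
  GPO_BASE + purl.sub(PURL_BASE, "")
end

purl = purl_for("E 2.11/3:EL 2")
puts purl                  # http://purl.org/NET/sudoc/E%202.11/3:EL%202
puts redirect_target(purl) # http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202
```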

What you have then is a unique identifier for a SuDoc that resolves to a human-readable representation of what the URI stands for.  If the GPO changes OPACs, or Ex Libris changes Aleph’s URL scheme, or the GPO comes up with a better representation of a SuDoc, it doesn’t matter, as long as the actual SuDoc class number can be used to redirect the user to the new location.

Obviously, there’s an expectation here that PURLs remain available indefinitely and that purl.org is never lost to a third party that repurposes it.  However, major parts of the web rely on purl.org, so there are a lot of people who would fight to keep that from happening.

Basically, these are the sorts of simple solutions I feel we should be using to solve these sorts of problems on the web.  We are no longer the center of the information universe; it’s time we accepted that and began using the tools the rest of the world is using to solve the same problems everybody else is dealing with.

How many other ‘identifiers’ could be minted as persistent, dereferenceable URIs this way?  I look forward to finding out.

By the way, there are currently three possible points of failure for this URI scheme:  purl.org, GPO and me.  I would prefer not to be a single point of failure, so if you would like to be added as a maintainer to this PURL, please let me know and I would be happy to add you.

Let me start this by saying this is not a criticism or a rant against any of the technologies I am about to mention.  The problems I am having are pretty specific and the fact that I am intentionally trying to use “off the shelf” commodity projects to accomplish my goal complicates things.  I realize when I tweet things like this, it’s not helpful (because there is zero context).

I’ve been in a bit of a rut this week.  Things were going ok on Monday, when I got the Jangle connector for Reserves Direct working, announced it and started generating some conversation around how to model course reserves (and their ilk) in Jangle.  However, this left me without anything specific to work on.  I have a general, hand-wavy project that I am supposed to be working on: a simple, general digital asset management framework that can work either on the Platform or with a local repository like Fedora, depending on the institution’s local needs or policies.  More on this in some other post.  The short of it is, I need to gather some general requirements before I can begin something like this in earnest, which led me to revive an old project.

When Jangle first started, about 15 months ago, Elliot and I felt we needed what we called “the Base Connector”.  The idea was that there would always be situations where a developer doesn’t have direct, live access to a system’s data and needs a surrogate external database to work with.  The Base Connector was an attempt to provide an out-of-the-box application that could simulate the basics of an ILS and be populated with the sorts of data you get from command-line ‘report’-type applications: the sort of thing you can cron and write out to a file on your ILS server.  Updates in catalog records.  Changes in user status.  Transactions.  That sort of thing.

After the amount of interest at Code4lib in Janglifying III’s Millennium, I decided to revisit the concept of the Base Connector.  Millennium’s lack of consistent database access (and, to an extent, Unicorn’s; there are no doubt others) makes it a good candidate for this duplicate database.  This time, though, I was hoping to take a somewhat different approach than Elliot and I had originally tried, and to come up with something:

  1. More generically “Jangle” and less domain specific to ILSes
  2. Easy to install
  3. Customizable with a simple interface
  4. Something, preferably, that could be taken mostly “off the shelf”, where the only real “coding” I had to do was to get the library data in and the connector API out.  I was hoping all “data model” and “management” stuff could be taken care of already.

In my mind, I was picturing using a regular CMS for this, although it needed to be able to support customized fields for resources.

Here is the rough scenario I am picturing.  Let’s say you have an ILS that you don’t have a lot of access to.  For your external ‘repository’, you’ll need it to accommodate a few things (sketched in code after the list).

  • Resources will need not just a title, an identifier and the MARC record, but also fields for ISBN, ISSN, OCLC number, etc.  They’ll also need some sort of relationship to the Items and Collections they’re associated with.
  • Actors could be simple system user accounts, but they’ll need first names and last names and whatnot.
  • Collections, I assume, can probably be contrived via tags and whatnot.
  • The data loading would probably need to be able to be done remotely via some command-line scripting.
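To make that concrete, here’s roughly the shape of the model I have in mind.  All of the class and field names here are illustrative, not anything Jangle prescribes:

```ruby
# Illustrative only: the minimal entities the bag of holding would store.
Resource = Struct.new(
  :id, :title, :marc,          # the basics
  :isbn, :issn, :oclc_number,  # extra searchable identifier fields
  :item_ids, :collection_ids   # relationships to Items and Collections
)

Item  = Struct.new(:id, :barcode, :status, :resource_id)

Actor = Struct.new(:id, :username, :first_name, :last_name, :status)

# Collections contrived via tags: a tag plus the resources carrying it.
Collection = Struct.new(:id, :tag, :resource_ids)
```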

I decided to try three different CMSes to accomplish this:  Drupal, Plone and Daisy.  I’ll go through each and where I ran into a snag.  I want to reiterate here that I know next to nothing about any of these.  My problems are probably not shortcomings of the projects themselves, but more due to my own ignorance.  If you see possible solutions to my issues (or know of other projects that fit my need even better), please let me know.  This is a cry for help, not a review.

Drupal

One of the reasons I targeted Drupal is that it’s easy to get running, can run on cheap shared hosting, has quite a bit of traction in libraries and has CCK.  I actually got the farthest with Drupal in this capacity.  With CCK, I was able to, in the native Drupal interface, build content types for Resources and Items.  For Actors, I had just planned on using regular user accounts (since then I could probably piggyback off of the contributed authentication modules and whatnot).  Collections would be derived from Taxonomies.

Where things went wrong:

My desire is to decouple the ‘data load’ aspect of the process from the ‘bag of holding’ itself.  What I’m saying is that I would prefer that the MARC/borrower/item status/etc. load not have to be built as a Drupal module, but instead be able to be written in whatever language the developer is comfortable with, with a simple way of getting that data into the bag of holding.

There are only two ways that I can see to use an external program to get data into Drupal:

  1. Use the XMLRPC interface
  2. Simulate the content creation forms with a scripted HTTP client.

I’m not above number two, but I would prefer not to if there’s a better way available.  The problem is that I can find almost zero documentation on the XMLRPC service.  What ‘calls’ are available?  How do I create a Resource content type?  How do I relate that to a user or an Item?  I have no idea where to look.  I don’t actually even know if the fields I created will be searchable (which was the whole point of making them).
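For what it’s worth, here’s roughly what I was hoping the XML-RPC route would look like, in Ruby.  Everything about the Drupal side of this is an assumption: that the Services module is enabled, that it exposes methods like system.connect and node.save, and that CCK fields come across as keys on the node structure.  That’s exactly the stuff I can’t find documented, so treat this as wishful thinking, not working code:

```ruby
require 'xmlrpc/client'

# ASSUMPTION: Drupal's Services module is enabled at this endpoint
# and exposes system.connect and node.save. None of this is verified.
drupal = XMLRPC::Client.new2("http://example.org/services/xmlrpc")

session = drupal.call("system.connect")

# ASSUMPTION: a CCK 'resource' content type whose custom fields
# show up as keys on the node hash.
node = {
  "type"       => "resource",
  "title"      => "Example title",
  "field_isbn" => [{ "value" => "0000000000" }],
  "field_oclc" => [{ "value" => "00000000" }]
}

drupal.call("node.save", session["sessid"], node)
```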

Drupal seems promising for this, but I don’t know where to go from here.

Plone

I really thought Plone was going to be a winner.  It’s completely self-contained (framework, webserver and database all rolled into one installer) and based on an OODB.  Since it’s Python-based, I feel I can fall back on Python to build the scripts that actually do the dirty work of massaging and loading the data.  The downside to Plone (and I have looked eye-to-eye with this downside before) is that it and Zope are total voodoo.

It didn’t take me long to run into a brick wall with Plone.  I installed version 3.2.1 thanks to the handy OS X installer and got it up and running.

And then I couldn’t figure out what to do next.  I think I want Archetypes.  I followed the (outdated) instructions to install it.  I see Archetypes stuff in the Zope control panel.  However, I never see anything in Plone.  I Google.  Nothing.  Feeling that it must be there and I’m just missing something, I follow this tutorial to start building new content types.  I build a new content type.  It doesn’t show up in the Plone quick installer.  Nothing in the logs.  I Google.  Nothing.

Nothing frustrates me more than software making me feel like a total dumbass.

I am at the point where I think Plone might be up to the task, but I don’t have the interest, time or energy to make it work.  At the end of the day, I’m still not entirely sure it would meet my basic criterion of the ‘content type’ being editable within the native web framework anyway.  I also have no idea if my plan of loading the data via an external Python (or, even better, Ruby) script is remotely feasible.

Plone got the brunt of my disgruntled tweeting.  This is mainly due to my frustration at seeing how well Plone would fit my vision and getting absolutely nowhere towards realizing that goal.

Daisy

What, you’ve never heard of it?  I have a history with Daisy, and I know, without a doubt, it could serve my needs.  The problem with Daisy is that it has a lot of moving parts.  To do what I want, you need both the data repository and the wiki running, as well as MySQL.  On top of that, some external web app (most likely Ruby/Sinatra) would need to actually do the Jangle stuff, interacting with Daisy’s HTTP API.  That is a lot of running daemons: a lot of daemons that might not all be running at any given time, which would break everything.  Daisy is a lot of things, but ‘self-contained’ isn’t one of them.

This is not a criticism.  If I was running a CMS, this would be ok.  When I was developing the Communicat, this was ok.  Those are commitments.  Projects that you think, “ok, I’m going to need to invest some thought and energy into this”.

The bag of holding is a stop-gap: “I need to use this until I can figure out a real way to accomplish what I need”.  Maybe it ultimately becomes a production service.  That’s fine, but it needs to scale down as far as it scales up.  I literally want something that somebody can easily get running anywhere, quickly, and start Jangling.

If anybody has any recommendations on how I can easily get up and running with any of the above projects, please let me know.

Alternatively, if anybody knows of something else (a simple, remotely accessible, dynamic, searchable data store), definitely enlighten me!  I realize the irony of this plea, given who I work for, but the idea here is for something not cloud-based, since I would like the user to be able to load in their sensitive patron data without having to submit it to some third-party service.  There’s also the fact that there’s no front end I could just ‘plug in’ to manage the data.

If I can’t get anything off the shelf working, I think I’ll be reduced to writing something simple in Merb or Sinatra with CouchDB or Solr or something.  I was really hoping to avoid having to do that, though.
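If it comes to that, the Sinatra version really wouldn’t be much code.  A minimal sketch, assuming CouchDB running at localhost:5984 with a database called bag_of_holding (both assumptions), just to show the scale of ‘simple’ I mean:

```ruby
require 'rubygems'
require 'sinatra'
require 'net/http'

# ASSUMPTION: CouchDB at localhost:5984, database 'bag_of_holding'.
COUCH = URI.parse("http://localhost:5984/bag_of_holding")

# Load a record: the client PUTs a JSON document and we pass it
# straight through to CouchDB.
put '/resources/:id' do
  Net::HTTP.new(COUCH.host, COUCH.port).put(
    "#{COUCH.path}/#{params[:id]}",
    request.body.read,
    "Content-Type" => "application/json"
  ).body
end

# Get the document back out.
get '/resources/:id' do
  Net::HTTP.new(COUCH.host, COUCH.port).get("#{COUCH.path}/#{params[:id]}").body
end
```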