* Linked Open LibraryThing

Posted on July 2nd, 2009 by Ross. Filed under Linked Data.


For Ian Davis’ birthday, Danny Ayers sent out an email asking people to make some previously unavailable datasets accessible as linked data as Ian’s present.  It was a pretty neat idea.  One that I wish I had thought of.

Given that Ian is my boss (prior to about a month ago, Ian was just nebulously “above me” somewhere in the Talis hierarchy, but I now report to him directly) one could cynically make the claim that by providing Ian a ‘linked data gift’ that I would just be currying favor by being a kiss-ass.  You could make that claim, sure, but evidently you are not aware of how I hurt the company.

Anyway, as my contribution, I decided to take the data dumps from LibraryThing that Tim Spalding pretty graciously makes available [whoa, in the time that I first started this post until now, the data has gone AWOL, I suppose I did this just in time].  The data isn’t always very current and not all of the files are terribly useful (the tags one, for example, doesn’t offer much since the tags aren’t associated with anything — it’s just words and their counts), but it’s data and between ThingISBN and the WikipediaCitations I thought it would be worth it.

I wanted to take a very pragmatic approach to this: no triple store, no search, no rdf libraries, minimal interface.  Mostly this was inspired by Ed Summers’ work with the Library of Congress Authorities, but, also, if Tim (or, whoever at LibraryThing) saw that making LibraryThing linked data was as easy as a few template tweaks (as opposed to a major change in their development stack) this exercise was much more likely to actually make its way into LibraryThing.

What I ended up with (the first pass released before the end of Ian’s birthday, I might add) was LODThing: a very simple application written in Ruby’s Sinatra framework, DataMapper and SQLite.  The entire application is less than 230 lines of Ruby (including the web app and data loader) plus 2 HAML templates and 2 builder templates for the HTML/RDFa and RDF/XML, respectively.  The SQL database has three tables, including the join table.  This is really simple stuff.  The only real reason it took a couple days to create was trying to get the data loaded into SQLite from these huge XML files.  Nokogiri is fast (well, Ruby fast), but a 200 MB XML file is pretty big.  It was nice to get acquainted with Nokogiri’s pull parser, though.

There are a few things to take away from this exercise.

  1. When data is freely available, it’s really quite simple to reconstitute it into linked data without any need to depart from your traditional technology stack.  There is nothing even remotely semantic-webby about LODThing except its output.
  2. We now have an interesting set of URIs and relationships to start to express and model FRBR relationships.
  3. The Wikipedia citations data is extremely useful and could certainly be fleshed out more.  One could imagine querying DBpedia or Freebase on these concepts and identifying if the Wikipedia article is actually referring to the work itself and use that.  Right now LODThing makes no claims about the relationships except that it’s a reference from Wikipedia.

LODThing isn’t really intended for human consumption, so there’s no real “default way in”.  The easiest way to use it is to make a URI from an ISBN:

If you know the LibraryThing ‘work ID’, you can get in that way, too:

Also, you can get all of these resources as RDF/XML by replacing the .html with .rdf.

So, Tim, you wrote on the LT API page that you would love to see what people are doing with your data: here you go.  It would be even more awesome if it made its way back into LT; after all, it would alleviate some of the need for you to have a special API for this stuff.

Also, special thanks to Toby Inkster for providing a ton of help in getting this to resemble something that a linked data aware agent would actually want and finally turning the httpRange-14 light bulb on over my head.  He also immediately linked to it from his Amazon API LODifiier, which is sort of cool, too.

I’ll be happy to throw the sources into a github repository if anybody’s interested in them.




* Better Paging Through Better Searching

Posted on June 7th, 2009 by Ross. Filed under Grails, Lucene, coding, jangle.


For the last couple of weeks I’ve returned to working on the Alto Jangle connector, at least part-time.  I had shelved development on it for a while; I had a hard time finding anybody interested in using it and had reached a point where the development database I was working against was making it difficult to know what to expect in a real, live Alto system.  After I got wind of a couple of libraries that might be interested in it, I thought I should at least get it in a usable state.

One of the things that was vexing me prior to my hiatus was how to get Sybase to page through results in a semi-performant way.  I had originally blamed it on Grails, then when I played around with refactoring the connector in PHP (using Quercus, which is pretty slick by the way, to provide Sybase access via JDBC — the easiest way to do it) I realized that paging is just outside of Sybase’s capabilities.

And when you’re so used to MySQL, PostgreSQL and SQLite, this sort of makes your jaw drop (although, in its defense, it appears that this isn’t all that easy in Oracle, either — however, it’s at least possible in Oracle).

There seem to be two ways to do something like getting rows 375,000 – 375,099 from all of the rows in a table:

  1. Use cursors
  2. use SET ROWCOUNT 375100 and loop through and throw out the first 375,000 results.

The first option isn’t really viable.  You need write access to the database and it’s unclear how to make this work in most database abstraction libraries.  I don’t actually know that cursors do anything differently than option 2 besides pushing the looping to the database engine itself.  I was actually using cursors in my first experiments in JRuby using java.sql directly, but since I wasn’t aware of this problem at the time, I didn’t check to see how well it performed.

Option 2 is a mess, but this appears to be how GORM/Hibernate deals with paging in Sybase.  Cursors aren’t available in Quercus’ version of PDO, so it was how I had to deal with paging in my PHP prototypes, as well.  When I realized that PHP was not going to be any faster than Grails, I decided to just stick with Grails (“regular C-PHP” is out, since compiling in Sybase support is far too heavy a burden).

This paging thing still needed to be addressed.  Offsets of 400,000 and more were taking more than twelve minutes to return.  How much more, I don’t know; I killed the request at the 12 minute mark.  While some of this might be the result of a bad or missing index, any way you cut it, it wasn’t going to be acceptable.
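To make the cost of option 2 concrete, here is a minimal sketch in Ruby of what "set the row count and throw out everything before the offset" amounts to.  It is not what GORM/Hibernate actually runs: page_by_discarding is a made-up name and a plain range stands in for the Sybase result set.

```ruby
# Sketch only: simulates paging by reading and discarding rows up to the offset.
# `rows` is any Enumerable standing in for a result set limited with
# SET ROWCOUNT (offset + limit); nothing here talks to a real database.
def page_by_discarding(rows, offset, limit)
  fetched = 0
  page = []
  rows.each do |row|
    fetched += 1
    next if fetched <= offset   # rows before the offset are read, then dropped
    page << row
    break if page.size == limit
  end
  [page, fetched]
end

page, fetched = page_by_discarding(1..500_000, 375_000, 100)
puts "kept #{page.size} rows, read #{fetched}"
```

To keep 100 rows, 375,100 have to be read, so the work grows with the offset rather than with the page size.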

I was kicking around the idea of exporting the “models” of the Jangle entities into a local HSQLDB (or whatever) mirror and then working the paging off of that.  I couldn’t help but think that this was sort of a waste, though — exporting from one RDBMS to another solely for the benefit of paging.  You’d have to keep them in sync somehow and still refer to the original Sybase DB for things like relationships and current item or borrower status.  For somebody that’s generally pretty satisfied with hacking together kludgy solutions to problems, this seemed a little too hack-y… even for my standards.

Instead, I settled on a different solution that could potentially bring a bunch of other features along with it.  Searchable is a Grails plugin for Compass, a project to easily integrate Lucene indexes with your Java domain classes (this would be analogous to Rails’ acts_as_ferret).  When your Grails application starts up, Searchable will begin to index whatever models you declared as, well, searchable.  You can even set options to store all of your attributes, even if they’re not actual database fields, alleviating the need to hit the original database at all, which is nice.  Initial indexing doesn’t take long: our “problem” table that took twelve minutes to respond takes less than five minutes to fully index.  It would probably take considerably less than that if the data were consistent (some of the methods to set the attributes can be pretty slow if the data is wonky, since they try multiple paths to find the actual values of the attribute).

What this then affords us is consistent access times, regardless of the size of the offset:  the 4,000th page is as fast as the second:  between 2.5 and 3.5 seconds (our development database server is extremely underpowered and I access it via the VPN — my guess is that a real, live situation would be much faster).

The first page is a bit slower.  I can’t use the Lucene index for the first page of results because there’s no way for Searchable to know if the WORKS_META table has changed since the last request, since these changes wouldn’t be happening through Grails.  Since performance for the first hundred rows out of Sybase isn’t bad, the connector just uses it for the first page, then syncs the Lucene index with the database at the end of the request.  Each additional page then pulls from Lucene.  Since these pages wouldn’t exist until after the Lucene index is created and the Lucene index is recreated every time the Grails app is started, I added a controller method that checks the count of the Sybase table and the count of the Lucene index to confirm that they’re in sync (it’s worth noting that if the Lucene index has already been created once, this will be available right away after Grails starts; the reindexing is still happening, but in a temp location that will be moved to the default location once it’s complete, overwriting the old index).
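A rough sketch of that routing, in Ruby rather than Groovy and not the connector’s actual code.  The method name is made up, and falling back to the database when the counts differ is purely my assumption for the sake of the example.

```ruby
# Sketch only: decides where a page of results should come from.
# The names and the fall-back-to-database rule are assumptions, not connector code.
def page_source(page_number, db_count, index_count)
  return :database if page_number <= 1       # first page always comes from Sybase
  return :index if db_count == index_count   # counts match, so the index is trusted
  :database                                  # assumed fallback while out of sync
end

puts page_source(1, 500_000, 500_000)
puts page_source(4_000, 500_000, 500_000)
```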

The side benefit to using Searchable is that it will make adding search functionality to the Alto connector that much easier.  Building SQL statements from the CQL queries in the OpenBiblio connector was a complete pain in the butt.  CQL to Lucene syntax should be considerably easier.  It seems like it would be possible for these Lucene indexes to potentially alleviate the need for the bundled Zebra index that comes with Alto, eventually, but that’s just me talking, not any sort of strategic goal.

Anyway, thanks to Lucene, Sybase is behaving mostly like a modern RDBMS, which is a refreshing change.




* Comment from an Alternate Universe

Posted on June 7th, 2009 by Ross. Filed under libraries.


In a world where library management systems are sophisticated and modern…

I was doing some Google searches about SKOS, trying to figure out the exact distinction between skos:ConceptScheme and skos:Collection (it’s much more clear to me now) and I came across this article in XML.com:

Introducing SKOS

The article is fine, but it’s not what compelled me to write a blog post.  I was struck by a comment on that page titled What about Topic Maps?:

This new W3C standard obviously has a huge overlap with the very mature ISO standard Topic Maps. Topic Maps were originally conceived for (almost) exactly the same problem space as SKOS, and they are widely used. (For example, all major library cataloging software either supports Topic Maps or soon will.)

However, Topic Maps proved to be more generally useful, so they are often compared and contrasted with RDF itself. The surprising difficulty of making Topic Maps and RDF work together is exactly the “extra level of indirection” mentioned by the author of this article about SKOS.

It is very strange that neither this article, nor the referenced XTech paper, mentions Topic Maps.

What is the relationship between SKOS and Topic Maps? How does this fit in with the work (as reported In Edd Dumbill’s blog) on interoperability between Topic Maps and RDF/OWL?

Now, I have no idea if “yitzgale” is some sort of alias of Alexander Johannesen, let’s assume “no” (for one thing, that comment is far too optimistic about library technology).  The sentence [f]or example, all major library cataloging software either supports Topic Maps or soon will is sort of stunning in both the claim it makes and its total lack of accuracy.  I feel pretty confident in my familiarity with library cataloging software and I can say with some degree of certainty that there is no support for topic maps today (hell, MARC21, MFHD and Unicode support are pushing it – and those are just incremental changes).  This comment was written four years ago.

And yet, there’s part of me that feels robbed.  Where is the topic map support in my library system?  I don’t even really know anything about TM, but I still feel it would be a damn sight better than what we’ve got now.  What reality is this that yitzgale is living in, with its fancy library systems and librarians and vendors willing to embrace a radical change in how things are done?  I want in.

I might even be able to jump off my RDF bandwagon for it.




* Why Jeremy Frumkin is Awesome

Posted on May 26th, 2009 by Ross. Filed under Che, code4libcon2008.


It took over a year to actually fit him, but…

Ches T-Shirt Front

Library nerd to be

And a special thanks to our sponsors...





* Mixed Feelings

Posted on May 26th, 2009 by Ross. Filed under Standards Schmandards, jangle.


I cannot perceive a day that I might charge for a webinar about Jangle.  I expect that that day will never come.

Still, it pains me to see a NISO Webinar on Interoperability:

http://www.niso.org/news/events/2009/interop09

It pains me for a couple of reasons — a hundred bucks for a webinar?  Come on, NISO, get over yourself.

Secondly, I have tremendous respect for and was happy to participate in the DLF ILS-DI Berkeley Accord, but it’s, at best, a half measure, is no longer being actively developed and has, for all intents and purposes, lost its sponsorship.

Jangle isn’t perfect and I realize there’s not a NISO standard to be found (well, you can send Z39.2…), but if you’re going to talk about interoperability, there’s not a more pragmatic and simple approach on the table, currently.




* Commoditizing the Stack

Posted on April 27th, 2009 by Ross. Filed under philosophizing.


I had the opportunity to attend and present at the excellent ELAG conference last week in Bratislava, Slovakia.  The event was advertised as being somewhat of a European Code4Lib, but in reality, the format seemed to me to be more in line with Access, which in my mind is a plus.

Being the ugly American that I am, I made a series of provocative statements both in my presentation and in the Twitter “back channel” (or whatever they call hash tagging an event) about vendors, library standards, and a seeming disdain for both.  I feel like I should probably clarify my position here a bit, since Twitter is a terrible medium for in-depth communication and I didn’t go into much detail in my presentation (outside of saying vendor development teams were populated by scalliwags and ne’er-do-wells from previous gigs in finance, communications and publishing).

Here was the point I was angling towards in my presentation:  your Z39.50 implementation is never going to get any better than it was in 2001.  Outside of critical bug fixes, I would wager the Z39.50 implementation has not even been touched since it was introduced, never mind improved.  The reason for this is my above “joke” about the development teams being staffed by people that do not have a library background.  They are literally just ignoring the Z-server and praying that nothing breaks in unit and regression testing.  There are only a handful of people that understand how Z39.50 works and they are all employed by IndexData.  For everybody else, it’s just voodoo that was there when they got here, but is a requirement for each patch and release.

Thing is, even as hardware gets faster, and ILSes (theoretically) get more sophisticated, the Z-server just gets worse.  You would think that if this is the most common and consistent mechanism to get data out of ILSes that we would have seen some improvement in implementations as the need for better interoperability increases, but this is just not a reality that I have witnessed.  The last two ILSes that I primarily worked with (Voyager and Unicorn) I would routinely, accidentally, bring down completely by trying to use the Z39.50 server as a data source in applications.  For the Umlaut, I had to export the Voyager bib database into an external Zebra index to prevent the ILS from crashing multiple times a day just to look up incoming OpenURL requests.  Let me note that a vast majority of these lookups were just ISSN or ISBN.  Unsurprisingly, the Zebra index held up with no problems.  It’s still working, in fact.

Talis uses Zebra for Alto.  It’s probably the main reason we can check off “SRU Support” in an RFP when practically nobody else can.  But, again, this means the Z/SRU-server is sort of “outside” the development plan, delegated to IndexData.  Our SRU servers technically aren’t even conformant to the spec, since we don’t serve explain documents.  I’m not sure anybody at Talis even was aware of this until I pointed it out last year.

All of this is not intended to demonize vendors (really!) or bite the hand that feeds me.  It’s also not intended to denigrate library standards.  I’m merely trying to be pragmatic and, more importantly, I’m hoping we can make library development a less frustrating and backwards exercise for all parties (even the cads and scalliwags).

My point is that initiatives like the DLF ILS-DI, on paper, make a lot of sense.  I completely understand why they chose to implement their model using a handful of library standards (OAI-PMH, SRU).  The standards are there, why not use them?  The problem is in the reality of the situation.  If the specification “requires” SRU for search, how many vendors do you think will just slap Yaz Proxy in front of their existing (shaky, flaky) Z39.50 server and call it a day?  The OAI-PMH provider should be pretty trivial, but I would not expect any company to provide anything innovative with regards to sets or different metadata formats.

As long as libraries are not going to be writing the software they use themselves, they need to reconcile themselves to the fact that their software is more than likely not going to be written by librarians or library technologists.  If this is the case, what’s the better alternative?  Clinging to half-assed implementations of our incredibly niche standards?  Or figuring out what technologies are developing outside of the library realm that could be used to deliver our data and services?  Is there really, honestly, no way we could figure out how to use OpenSearch to do the things we expect SRU to do?

I realize I have an axe to grind here, but this isn’t really about Jangle.

I have seen OpenURL bandied about as a “solution” to problems outside of its current primary use of “retrieving context based services from scholarly citations” (I know this is not what OpenURL’s sole use case is, but it’s all it’s being used for.  Period).  The most recent example of this was in a workshop (that I didn’t participate in) at ELAG about how libraries could share social data, such as tagging, reviews, etc. in order to create the economies of scale needed to make these concepts work satisfactorily.  Since they needed a way to “identify” things in their collection (books, journals, articles, maps, etc.) somebody had the (understandable, re: DLF) idea to use OpenURL as the identifier mechanism.

I realize that I have been accused of being “allergic” to OpenURL, but in general, my advice is that if you have a problem and you think OpenURL is the answer to said problem there’s actually probably a simpler and better answer to this if you approach it from outside of a library POV.

The drawbacks of Z39.88 for this scenario are numerous, but I didn’t go into details with my criticisms in Twitter.  Here are a few reasons why I would recommend away from OpenURL for this (and they are not exclusive to this potential application):

  1. OpenURL context objects are not identifiers.  They are a means to describe a resource, not identify it.  A context object may contain an identifier in its description.  Use that, scrap the rest of it.
  2. Because a context object is a description and not an identifier, it would have to be parsed to try to figure out what exactly it is describing.  This is incredibly expensive, error prone and more sophisticated than necessary.
  3. It was not entirely clear how the context objects would be used in this scenario.  Would they just be embedded in, say, an XML document as a clue as to what is being tagged or reviewed?  Or would the consuming service actually be an OpenURL resolver that took these context objects and returned some sort of response?  If it’s the former, what would the base URI be?  If it’s the latter… well, there’s a lot there, but let’s start simple, what sort of response would it return?
  4. There is no current infrastructure defined in OpenURL for these sorts of requests.  While there are metadata formats that could handle journals, articles, books, etc., it seems as though this would just scratch the surface of what would need context objects (music, maps, archival collections, films, etc.).  There are no ‘service types’ defined for this kind of usage (tags, reviews, etc.). The process for adding metadata formats or community profiles is not nimble, which would make it prohibitively difficult to add new functionality when the need arises.
  5. Such an initiative would have to expect to interoperate with non-library sources.  Libraries, even banding together, are not going to have the scale or attraction of LibraryThing, Freebase, IMDB, Amazon, etc.  It is not unreasonable to say that an expectation that any of these services would really adopt OpenURL to share data is naive and a waste of time and energy.
  6. There’s already a way to share this data, called SIOC.  What we should be working towards, rather than pursuing OpenURL, is designing a URI structure for these sorts of resources in a service like this.  Hell, I could even be talked into info URIs over OpenURLs for this.
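To illustrate the first point in the list, here is a minimal sketch of “use the identifier, scrap the rest” for a context object in key/encoded-value form.  The helper name and the sample values are made up; the only things taken from OpenURL itself are the url_ver and rft_id keys.

```ruby
require 'uri'

# Sketch only: pulls the referent identifiers (rft_id) out of a KEV context
# object and ignores the descriptive keys. The sample query is invented.
def identifiers_from_kev(query)
  URI.decode_www_form(query)
     .select { |key, _value| key == 'rft_id' }
     .map { |_key, value| value }
end

kev = 'url_ver=Z39.88-2004&rft_id=urn%3Aisbn%3A0836218310&rft.btitle=Some+Title'
puts identifiers_from_kev(kev).inspect
```

Everything other than the rft_id values is description, which is exactly the part that would otherwise have to be parsed and guessed at.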

We could further isolate ourselves by insisting on using our standards.  Navel gaze, keep the data consistent and standard.  To me, however, it makes more sense to figure out how to bridge this gap.  After all, the real prize here is to be able to augment our highly structured metadata with the messy, unstructured web.  A web that isn’t going to fiddle around with OpenURL.  Or Z39.50.  Or NCIP.  I have a feeling the same is ultimately true with our vendors.

There comes a point that we have to ask if our relentless commitment to library-specific standards (in cases when there are viable alternatives) is actually causing more harm than help.




* Parsing escaped unicode in Ruby

Posted on April 16th, 2009 by Ross. Filed under coding, ruby, unicode.


While what I’m posting here might be incredibly obvious to anyone that understands unicode or Ruby better than me, it was new to me and might be new to you, so I’ll share.

Since Ed already let the cat out of the bag about LCSubjects.org, I can explain the backstory here.  At lcsh.info, Ed made the entire dataset available as N-Triples, so just before he yanked the site, I grabbed the data and have been holding onto it since.  I wrote a simple little N-Triples parser in Ruby to rewrite some of the data before I loaded it into the platform store I have.  My first pass at this was really buggy; I wasn’t parsing N-Triple literals well at all and was leaving out quoted text within the literal and whatnot.  I also, inadvertently, was completely ignoring the escaped unicode within the literals and sending them verbatim.

N-Triples escapes unicode the same way Python string literals do (or at least this is how I’ve understood it), so 7⁰03ʹ43ʺN 151⁰56ʹ25ʺE is serialized into nt like: 7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE.  Try as I might, I could not figure out how to turn that back into unicode.

Jonathan Rochkind recommended that I look at the Ruby JSON library for some guidance, since JSON also encodes this way.  With that, I took a peek in JSON::Pure::Parser and modified parse_string for my needs.  So, if you have escaped unicode strings like this, and want them to be unicode, here’s a simple class to handle it.

$KCODE = 'u'
require 'strscan'
require 'iconv'
require 'jcode'
class UTF8Parser < StringScanner
  STRING = /(([\x0-\x1f]|[\\\/bfnrt]|\\u[0-9a-fA-F]{4}|[\x20-\xff])*)/nx
  UNPARSED = Object.new
  UNESCAPE_MAP = Hash.new { |h, k| h[k] = k.chr }
  UNESCAPE_MAP.update({
    ?"  => '"',
    ?\\ => '\\',
    ?/  => '/',
    ?b  => "\b",
    ?f  => "\f",
    ?n  => "\n",
    ?r  => "\r",
    ?t  => "\t",
    ?u  => nil,
  })
  UTF16toUTF8 = Iconv.new('utf-8', 'utf-16be')
  def initialize(str)
    super(str)
    @string = str
  end
  def parse_string
    if scan(STRING)
      return '' if self[1].empty?
      string = self[1].gsub(%r((?:\\[\\bfnrt"/]|(?:\\u(?:[A-Fa-f\d]{4}))+|\\[\x20-\xff]))n) do |c|
        if u = UNESCAPE_MAP[$&[1]]
          u
        else # \uXXXX
          bytes = ''
          i = 0
          while c[6 * i] == ?\\ && c[6 * i + 1] == ?u
            bytes << c[6 * i + 2, 2].to_i(16) << c[6 * i + 4, 2].to_i(16)
            i += 1
          end
          UTF16toUTF8.iconv(bytes)
        end
      end
      if string.respond_to?(:force_encoding)
        string.force_encoding(Encoding::UTF_8)
      end
      string
    else
      UNPARSED
    end
  rescue Iconv::Failure => e
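    # GeneratorError only exists inside the JSON library this was adapted from,
    # so raise a standard error class here instead.
    raise ArgumentError, "Caught #{e.class}: #{e}"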
  end
end
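The class above leans on $KCODE, iconv and jcode, which later Ruby versions dropped.  On Ruby 1.9 and later, something like the following should do the same job with no extra libraries.  Treat it as a sketch: unescape_ntriples and SIMPLE_ESCAPES are hypothetical names of my own, and it assumes the input string is UTF-8 (or plain ASCII).

```ruby
# Sketch only: turns N-Triples \uXXXX and \UXXXXXXXX escapes into UTF-8.
SIMPLE_ESCAPES = { 't' => "\t", 'n' => "\n", 'r' => "\r", '"' => '"', '\\' => '\\' }.freeze

def unescape_ntriples(literal)
  literal.gsub(/\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|([tnr"\\]))/) do
    hex = $1 || $2
    # A hex group means a code point escape; otherwise it is a one-letter escape.
    hex ? [hex.hex].pack('U') : SIMPLE_ESCAPES[$3]
  end
end

puts unescape_ntriples('7\u207003\u02B943\u02BAN 151\u207056\u02B925\u02BAE')
```

Array#pack with the 'U' directive does the code point to UTF-8 conversion that Iconv was doing in the original.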




* A URI Scheme for SuDocs

Posted on March 30th, 2009 by Ross. Filed under Problem Solving, SuDoc, URIs.


Jonathan Rochkind recently started a thread on the Code4lib mailing list asking how to register an info URI for SuDocs.  Ray Denenberg responded with an explanation of the process.  I won’t get into my opinions of info URIs or the merits of either side of the ensuing debate that spun out from this thread, but my takeaway was that Jonathan wasn’t really looking for an info URI, anyway.

What Jonathan wanted was:

  • A generic URI model to define SuDocs
  • For this model to be maintained and hosted by somebody other than him
  • If possible, the URIs be resolvable to something that made sense for the supplied SuDoc

I think these are reasonable desires.

I also thought that there were existing structures out there that could meet his requirements without going through the “start up costs” of registering an info URI.  Also, info URIs are not of the web, so after going through the work of creating a ‘standard’, you cannot actually use it directly to figure out what the SuDoc is referring to.

SuDocs (like all other aspects of Government Documents) are arcane, niche and not understood by anyone other than GovDoc librarians, who seem to be rare.  That being said, there is a pretty convenient web presence implicit in SuDoc — in order for a SuDoc to exist, it needs to appear in the GPO’s catalog.  Since anything that appears in the GPO’s catalog can be seen on the web, we have a basis for a dereferenceable URI structure.

The GPO uses Ex Libris’ Aleph and whatever Aleph’s out of the box web OPAC is for their catalog.  Last week, I was Googling for some information about SRU and Aleph and it led me to this page about constructing CCL queries into Aleph (note, please disregard almost everything written on this page about SRU, CQL, etc., as it’s almost completely b.s.).  Figuring there must be some way to search on SuDocs, I tried a couple of combinations of things, until I found this page in the GPO catalog.  Ok, so the index for SuDocs is called “GVD”.

This gives us URLs like:  http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202

Now, this could work, but it’s incredibly awkward.  It’s also extremely fragile since it’s software (in this case, Aleph) dependent, and if it were to break, it would require the GPO to redirect us to the right place.

This is, of course, exactly what PURLs were designed to do.  I had never actually set up a PURL and almost didn’t for this, since the purl.org service said that it wasn’t working properly so it would be disabled for another week.  However, all the links were there, so I forged ahead.  I was in the process of setting up a regular PURL, when I ran across partial redirects.  I figured something like this had to exist for PURLs that were used for RDF vocabularies and the like, but wasn’t aware of how they work.

Anyway, they’re extremely simple.  Basically you set up a base URL (http://purl.org/NET/foo/) and anything requested past that base URL (e.g. http://purl.org/NET/foo/bar) will be redirected to the PURL endpoint verbatim.

So, I set up a partial redirect PURL at the base:  http://purl.org/NET/sudoc/

The expectation is that it would be followed by a properly URL escaped SuDoc:  E 2.11/3:EL 2 becomes http://purl.org/NET/sudoc/E%202.11/3:EL%202 which then tacks that SuDoc onto http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D and redirects you to http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3DE%202.11/3:EL%202.
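The escaping and the partial redirect can be sketched in a few lines of Ruby.  This only mimics what purl.org does on its side; the method names are made up, and the choice of which characters to leave unescaped (letters, digits, “-._~”, “/” and “:”) is my assumption based on the example above.

```ruby
PURL_BASE  = 'http://purl.org/NET/sudoc/'
GPO_TARGET = 'http://catalog.gpo.gov/F/?func=find-c&ccl_term=GVD%3D'

# Percent-encode everything except the characters assumed safe for this scheme.
def escape_sudoc(sudoc)
  sudoc.gsub(%r{[^A-Za-z0-9\-._~/:]}) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

# Build the PURL for a SuDoc class number.
def sudoc_purl(sudoc)
  PURL_BASE + escape_sudoc(sudoc)
end

# Mimic the partial redirect: whatever follows the base is tacked onto the target.
def resolve(purl)
  return nil unless purl.start_with?(PURL_BASE)
  GPO_TARGET + purl[PURL_BASE.length..-1]
end

purl = sudoc_purl('E 2.11/3:EL 2')
puts purl
puts resolve(purl)
```

If the catalog URL ever changes, only the target behind the PURL has to change; the URIs built by sudoc_purl stay the same.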

What you have then is a unique identifier for a SuDoc that resolves to a human readable representation of what the URI stands for.  If the GPO changes OPACs or Ex Libris changes Aleph’s URL scheme or the GPO comes up with a better representation of SuDoc, it doesn’t matter as long as the actual SuDoc class number can be used to redirect the user to the new location.

Obviously, there’s an expectation here that PURLs remain indefinitely and that purl.org is never lost to a third party that repurposes it for other uses.  However, there are major parts of the web that rely on purl.org, so there are a lot of people that would fight to not see this happen.

Basically, these are the sorts of simple solutions that I feel we should be using to solve these sorts of problems on the web.  We are no longer the center of the information universe and it’s time that we accepted that and began to use the tools that the rest of the world is using to solve the same problems that everybody else is dealing with.

How many other ‘identifiers’ could have persistent, dereferenceable URIs minted this way?  I look forward to finding out.

By the way, there are currently three possible points of failure for this URI scheme:  purl.org, GPO and me.  I would prefer not to be a single point of failure, so if you would like to be added as a maintainer to this PURL, please let me know and I would be happy to add you.




* Superpatron Hollywood Redux

Posted on March 27th, 2009 by Ross. Filed under two-point-oh-no.


Ed Vielmetti appears again on Rocketboom.

Again via Twitter.  Incorrectly linked from the Rocketboom page.




* In search of a bag of holding

Posted on March 19th, 2009 by Ross. Filed under daisycms, drupal, jangle, plone.


Let me start this by saying this is not a criticism or a rant against any of the technologies I am about to mention.  The problems I am having are pretty specific and the fact that I am intentionally trying to use “off the shelf” commodity projects to accomplish my goal complicates things.  I realize when I tweet things like this, it’s not helpful (because there is zero context).

I’ve been in a bit of a rut this week.  Things were going ok on Monday, when I got the Jangle connector for Reserves Direct working, announced and started generating some conversation around how to model course reserves (and their ilk) in Jangle.  However, this left me without anything specific to work on.  I have a general, hand-wavy, project that I am supposed to be working on to provide a simple, general digital asset management framework that can either work on the Platform or with a local repository like Fedora, depending on the institution’s local needs or policies.  More on this in some other post.  The short of it is, I need to gather some general requirements before I can begin something like this in earnest, which led me to revive an old project.

When Jangle first started, about 15 months ago, Elliot and I felt we needed what we called “the Base Connector“.  The idea here was that there were always going to be situations where a developer doesn’t have direct, live access to a system’s data, and they would need a surrogate external database to work with.  The Base Connector was an attempt to provide an out of the box application that could simulate the basics of an ILS and be populated with the sorts of data you would get from commandline ‘report’ type applications.  The sort of thing you can cron and write out to a file on your ILS server.  Updates in catalog records.  Changes in user status.  Transactions.  That sort of thing.

After the amount of interest at Code4lib in Janglifying III Millennium, I decided to revisit the concept of the Base Connector.  Millennium’s lack of consistent database access (shared, to an extent, by Unicorn and no doubt others) makes it a good candidate for this duplicate database.  I was hoping to take a somewhat different approach to this problem than Elliot and I had originally tried, however.  I was hoping to be able to come up with something:

  1. More generically “Jangle” and less domain specific to ILSes
  2. Easy to install
  3. Customizable with a simple interface
  4. Something, preferably, that could be taken mostly “off the shelf”, where the only real “coding” I had to do was to get the library data in and the connector API out.  I was hoping all “data model” and “management” stuff could be taken care of already.

In my mind, I was picturing using a regular CMS for this, although it needed to be able to support customized fields for resources.

Here is the rough scenario I am picturing.  Let’s say you have an ILS that you don’t have a lot of access to.  For your external ‘repository’, you’ll need it to be able to accommodate a few things.

  • Resources will need not just a title, an identifier and the MARC record, but also have fields for ISBN, ISSN, OCLC number, etc.  They’ll also need some sort of relationship to the Items and Collections they’re associated with.
  • Actors could be simple system user accounts, but they’ll need first names and last names and whatnot.
  • Collections, I assume, can probably be contrived via tags and whatnot.
  • The data loading would probably need to be doable remotely via some command-line scripting.
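
To make the scenario above a little more concrete, here is a minimal sketch of those entities as plain Python dataclasses.  Nothing here comes from Jangle itself: the class names follow the list above, but the field names, the `status` default and the `resources_by_collection` helper are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Item:
    """A holding (copy) attached to a Resource."""
    identifier: str
    status: str = "available"  # hypothetical status value, not a Jangle term


@dataclass
class Resource:
    """A bibliographic record plus the fields pulled out for lookup."""
    identifier: str
    title: str
    marc: bytes = b""  # the raw MARC record, stored opaquely
    isbn: Optional[str] = None
    issn: Optional[str] = None
    oclc_number: Optional[str] = None
    items: List[Item] = field(default_factory=list)
    # Assumption: Collections are contrived from simple tags.
    collections: List[str] = field(default_factory=list)


@dataclass
class Actor:
    """A borrower; could map onto an ordinary system user account."""
    identifier: str
    first_name: str
    last_name: str


def resources_by_collection(resources: List[Resource]) -> Dict[str, List[str]]:
    """Group Resource identifiers under each tag they carry."""
    grouped: Dict[str, List[str]] = {}
    for resource in resources:
        for tag in resource.collections:
            grouped.setdefault(tag, []).append(resource.identifier)
    return grouped
```

A loader script run from cron would build these objects from the ILS report files and hand them to whatever store ends up behind the connector; that part is deliberately left out here.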

I decided to try three different CMSes to try to accomplish this:  Drupal, Plone and Daisy.  I’ll go through each and where I ran into a snag for each.  I want to reiterate here that I know next to nothing about any of these.  My problems are probably not shortcomings of the projects themselves, but more due to my own ignorance.  If you see possible solutions to my issues (or know of other possible projects that fit my need even better) please let me know.  This is a cry for help, not a review.

Drupal

One of the reasons I targeted Drupal is that it’s easy to get running, can run on cheap shared hosting, has quite a bit of traction in libraries and has CCK.  I actually got the farthest with Drupal in this capacity.  With CCK, I was able to, in the native Drupal interface, build content types for Resources and Items.  For Actors, I had just planned on using regular user accounts (since then I could probably piggyback off of the contributed authentication modules and whatnot).  Collections would be derived from Taxonomies.

Where things went wrong:

My desire is to decouple the ‘data load’ aspect of the process from the ‘bag of holding’ itself.  What I’m saying is that I would prefer that the MARC/borrower/item status/etc. load not have to be built as a Drupal module.  Instead, it should be possible to write it in whatever language the developer is comfortable with, given a simple way of getting that data into the bag of holding.

There are only two ways that I can see to use an external program to get data into Drupal:

  1. Use the XMLRPC interface
  2. Simulate the content creation forms with a scripted HTTP client.

I’m not above number two, but I would prefer not to if there’s a better way available.  The problem is that I can find almost zero documentation on the XMLRPC service.  What ‘calls’ are available?  How do I create a Resource content type?  How do I relate that to a user or an Item?  I have no idea where to look.  I don’t actually even know if the fields I created will be searchable (which was the whole point of making them).
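
One possible way to answer the “what calls are available?” question is XML-RPC introspection.  The sketch below uses Python’s standard `xmlrpc.client` to call `system.listMethods` and `system.methodHelp`, which are optional introspection methods that many XML-RPC servers implement.  Whether a given Drupal install answers them is an assumption to verify, and the function names and the example URL are hypothetical.

```python
import xmlrpc.client


def list_rpc_methods(endpoint_url):
    """Ask an XML-RPC endpoint which methods it exposes.

    Relies on the optional system.listMethods introspection call; a server
    that does not implement it will raise xmlrpc.client.Fault.
    """
    with xmlrpc.client.ServerProxy(endpoint_url) as proxy:
        return sorted(proxy.system.listMethods())


def describe_rpc_method(endpoint_url, method_name):
    """Fetch the server-supplied help text for one method, if any."""
    with xmlrpc.client.ServerProxy(endpoint_url) as proxy:
        return proxy.system.methodHelp(method_name)


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the site's own xmlrpc.php URL.
    for name in list_rpc_methods("http://example.org/xmlrpc.php"):
        print(name)
```

This only discovers what is exposed.  It says nothing about whether the CCK fields are reachable or searchable through those calls.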

Drupal seems promising for this, but I don’t know where to go from here.

Plone

I really thought Plone was going to be a winner.  It’s completely self-contained (framework, webserver and database all rolled into one installer) and based on an OODB.  Being Python based, I feel I can fall back on Python to build the scripts to actually do the dirty work of massaging and loading the data.  The downside to Plone (and I have looked eye-to-eye with this downside before) is that it and Zope are total voodoo.

It didn’t take me long to run into a brick wall with Plone.  I installed version 3.2.1 thanks to the handy OSX installer and got it up and running.

And then I couldn’t figure out what to do next.  I think I want Archetypes.  I followed the (outdated)  instructions to install it.  I see Archetypes stuff in the Zope control panel.  However, I never see anything in Plone.  I Google.  Nothing.  Feeling that it must be there and I’m just missing something I follow this tutorial to start building new content types.  I build a new content type.  It doesn’t show up in the Plone quick installer.  Nothing in the logs.  I Google.  Nothing.

Nothing frustrates me more than software making me feel like a total dumbass.

I am at the point where I think Plone might be up to the task, but I don’t have the interest, time or energy to make it work.  At the end of the day, I’m still not entirely sure that it would meet my basic criteria of the ‘content type’ being editable within the native web framework anyway.  I also have no idea if my plan of loading the data via an external Python (or, even better, Ruby) script is remotely feasible.

Plone got the brunt of my disgruntled tweeting.  This is mainly due to my frustration at seeing how well Plone would fit into my vision and being able to get absolutely nowhere towards realizing that goal.

Daisy

What, you’ve never heard of it?  I have a history with Daisy, and I know, without a doubt, it could serve my needs.  The problem with Daisy is that it has a lot of working parts.  To do what I want, you need both the data repository and the wiki running, as well as MySQL.  On top of that, some external web app would need to actually do the Jangle stuff (and, this would most likely be Ruby/Sinatra) interacting with the HTTP API.  This is a lot of running daemons.  A lot of daemons that might not be running at any given time which would break everything.  Daisy is a lot of things, but it’s not ‘self-contained’.

This is not a criticism.  If I was running a CMS, this would be ok.  When I was developing the Communicat, this was ok.  Those are commitments.  Projects that you think, “ok, I’m going to need to invest some thought and energy into this”.

The bag of holding is a stop-gap.  “I need to use this until I can figure out a real way to accomplish what I need”.  Maybe it’s the ultimate production service.  That’s fine, but it needs to scale down as far as it scales up.  I literally want something that somebody can easily get running anywhere, quickly and start Jangling.

If anybody has any recommendations on how I can easily get up and running with any of the above projects, please let me know.

Alternately, if anybody knows of something else, a simple, remotely accessible, dynamic, searchable data store, definitely enlighten me!  I realize the irony of this plea, given who I work for, but the idea here is for something not cloud based, since I would like for the user to be able to load in their sensitive patron data without having to submit it to some third party service.  There’s also the fact that there’s no front end that I can just ‘plug in’ to manage the data.

If I can’t get anything off the shelf working, I think I’ll be reduced to writing something simple in Merb or Sinatra with CouchDB or Solr or something.  I was really hoping to avoid having to do this, though.

.