Archive for the ‘coding’ Category

* Blindly groping towards ActivePlatform

Posted on April 9th, 2008 by Ross. Filed under activerdf, coding, platform, ruby.


Something I’ve taken it upon myself to do since I joined Talis is make ActiveRDF a viable client to access the Platform.  While this is mostly selfishness on my part (I want to keep developing in Ruby and there’s basically no RDF support right now, plus this gives me a chance to learn about the RDF/SPARQL-y aspects of the Platform), I also think that libraries like this can only help democratize the Platform.

So far, it’s been pretty ugly.  I haven’t had much time to work on it, granted, but the time I’ve spent on it has made me think that there will be a lot of work to do.  Couple this with some of the things that make the Platform difficult to work with in Ruby anyway (read: Digest Authentication) and this might be a more uphill battle than I’ll ever have time for, but I figure it’s either this or go back to Python and I’m not quite ready to give up on Ruby yet.

Currently, performance is abysmal with ActiveRDF against the Platform, so I’ll need to think of shortcuts to improve that (I’m not even considering write access presently).  Here’s some code (this is as much for my benefit, so I can remember what I’ve done) to work with Ian Davis’ Quotations Book Example store:

require ‘time’ # Otherwise ActiveRDF starts freaking out about DateTime
require ‘active_rdf’

$activerdf_without_xsdtype = true
# less than ideal, but without it, ActiveRDF sends
# ^^<http://www.w3.org/2001/XMLSchema#string> with string literals even if you don’t want
# to send the datatype.  I haven’t actually tried it with other datatypes to see how this breaks
# down the road.

ConnectionPool.set_data_source(:type => :sparql, :results => :sparql_xml, :engine=>:joseki,  :url=> “http://api.talis.com/stores/iand-dev2/services/sparql”)

Namespace.register :foaf, “http://xmlns.com/foaf/0.1/”
Namespace.register :dc, “http://purl.org/dc/elements/1.1/”
Namespace.register :quote, “http://purl.org/vocab/quotation/schema”

QUOTE::Quotations.find_by_dc::creator(”Loren, Sophia”).each do | quote |

# print the important stuff from each graph

# http://purl.org/vocab/quotation/schema#quote has to be manually added as a predicate
# the “#” seems to cause problems
quote.add_predicate(:quote, QUOTE::quote)
puts quote.quote
puts quote.subject
puts quote.rights
puts quote.isPrimaryTopicOf

end

If you actually try to execute this, you’ll see that it takes a long time to run (God help you if you try it on QUOTE::Quotations.find_by_dc::subject(”Age and Aging”)).  A really long time.

If you set some environment vars before you go into irb:

$ export ACTIVE_RDF_LOG_LEVEL=0
$ export ACTIVE_RDF_LOG=./activerdf.log

then you can tail -f activerdf.log and see what exactly is happening.

After ActiveRDF does it’s initial SPARQL query (SELECT DISTINCT ?s WHERE { ?s <http://purl.org/dc/elements/1.1/creator> “Loren, Sophia” . }), it’s doing two things for every request in the block:

  1. a SPARQL query for every predicate associated with the URI (http://api.talis.com/stores/iand-dev2/services/sparql?query=SELECT+DISTINCT+%3Fp+WHERE+%7B+%3Chttp%3A%2F%2Fapi.talis.com%2Fstores%2Fiand-dev2%2Fitems%2F1187139384317%3E+%3Fp+%3Fo+.+%7D+)
  2. a SPARQL query for the value of the attribute (predicate):  http://api.talis.com/stores/iand-dev2/services/sparql?query=SELECT+DISTINCT+%3Fo+WHERE+%7B+%3Chttp%3A%2F%2Fapi.talis.com%2Fstores%2Fiand-dev2%2Fitems%2F1187139384317%3E+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2Fcreator%3E+%3Fo+.+%7D

for every predicate in the graph.  You can imagine how crazily inefficient this is, since to get every value for a resource, you have to make a different HTTP request for each one.

Obviously this would be a lot easier if it used DESCRIBE rather than SELECT, but without a real RDF library to parse the resulting graph, I’m not sure how ActiveRDF would deal with what the triple store returned.

So, anyway, these are some of the hurdles in making ActiveRDF work with the Platform, but I’m not quite ready to throw in the towel, yet.

.



* Objectifying OpenURL

Posted on January 9th, 2008 by Ross. Filed under OpenURL, coding, ruby.


Sometime in November, I came to the realization that I had horribly misinterpreted the NISO Z39.88/OpenURL 1.0 spec.  I’m on the NISO Advisory Committee for OpenURL (which makes this even more embarrassing) and was reviewing the proposal for the Request Transfer Message Community Profile and its associated metadata formats when it dawned on me that my mental model was completely wrong.  For those of you that have primarily dealt with KEV based OpenURLs (which is 99% of all the OpenURLs in the wild), I would wager that your mental model is probably wrong, too.

A quick primer on OpenURL:

  • OpenURL is a standard for transporting ContextObjects (basically a reference to something, in practice, mostly bibliographic citations)
  • A ContextObject (CTX, for short from now on) is comprised of Entities that help define what it is.  Entities can be one of six kinds:
    • Referent - this is the meat of the CTX, what it’s about, what you’re trying to get context about.  A CTX must have one referent and only one.
    • ReferringEntity - defines the resource that cited the referent.  This is optional and can only appear once.
    • Referrer - the source of where the CTX came from (i.e. the A&I database).  This is optional and can only appear once.
    • Requester - this is information about who is making the request (i.e. the user’s IP address).  This is optional and can only appear once.
    • ServiceType - this defines what sorts of services are being requested about the referent (i.e. getFullText, document delivery services, etc.).  There can be zero or many ServiceType entities defined in the CTX.
    • Resolver - these are messages specifically to the resolver about the request.  There can be zero or more Resolver entities defined in the CTX.
  • All entities are basically the same in what they can hold:
    • Identifiers (such as DOI or IP Address)
    • By-Value Metadata (the metadata is included in the Entity)
    • By-Reference Metadata (the Entity has a pointer to a URL where you can retrieve the metadata, rather than including it in the CTX itself)
    • Private Data (presumably data, possibly confidential, between the entity and the resolver)
  • A CTX can also contain administrative data, which defines the version of the ContextObject, a timestamp and an identifier for the CTX (all optional)
  • Community Profiles define valid configurations and constraints for a given use case (for instance, scholarly search services are defined differently than document delivery).  Context objects don’t actually specify any community profile they conform to.  This is a rather loose agreement between the resolver and the context object source:   if you provide me with a SAP1, SAP2 or Dublin Core compliant OpenURL, I can return something sensible.
  • There are currently two registered serializations for OpenURL:  Key/Encoded Values where all of the values are output on a single string, formatted as key=value and delimited by ampersands (this is what majority of all OpenURLs that currently exist look like) and XML (which is much rarer, but also much more powerful)
  • There is no standard OpenURL ‘response’ format.  Given the nature of OpenURL, it’s highly unlikely that one could be created that would meet all expected needs.  A better alternative would be for a particular community profile to define a response format since the scope would be more realistic and focused.

Looking back on this, I’m not sure how “quick” this is, but hopefully it can bootstrap those of you that have only cursory knowledge of OpenURL (or less).  Another interesting way to look at OpenURL is Jeff Young’s 6 questions approach, which breaks OpenURL down to “who”, “what”, “where”, “when”, “why” and “how”.

One of the great failings of OpenURL (in my mind, at least) is the complete and utter lack of documentation, examples, dialog or tutorials about its use or potential.  In fact, outside of COinS, maybe, there is no notion of “community” to help promote OpenURL or cultivate awareness or adoption.  To be fair, I am as guilty as anybody for this failure, since I had proposed making a community site for OpenURL, but due to a shift in job responsibilities and then the wholesale change in employers, coupled with the hacking of the server it was to live on, left this by the wayside.  I’m putting this back on my to do list.

What this lack of direction leads to is that would-be implementors wind up making a lot of assumptions about OpenURL.  The official spec published at NISO is a tough read and is generally discouraged by the “inner core” of the OpenURL universe (the Herbert van de Sompels, the Eric Hellmans, the Karen Coyles, etc.) in favor of the “Implementation Guidelines” documents.  However, only the KEV Guidelines are actually posted there.  The only other real avenue for trying to come to grips with OpenURL is to dissect the behavior of link resolvers.  Again, in almost every instance this means you’re working with KEVs and the downside of KEVs is that they give you a very naive view of OpenURL.

KEVs, by their very nature, are flat and expose next to nothing about the structure of the model of the context object they represent.  Take the following, for example:

url_ver=Z39.88-2004&url_tim=2003-04-11T10%3A09%3A15TZD
&url_ctx_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Actx&ctx_ver=Z39.88-2004
&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&ctx_id=10_8&ctx_tim=2003-04-11T10%3A08%3A30TZD
&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=book&rft.aulast=Vergnaud
&rft.auinit=J.-R&rft.btitle=D%C3%A9pendances+et+niveaux+de+repr%C3%A9sentation+en+syntaxe
&rft.date=1985&rft.pub=Benjamins&rft.place=Amsterdam%2C+Philadelphia
&rfe_id=urn%3Aisbn%3A0262531283&rfe_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook
&rfe.genre=book&rfe.aulast=Chomsky&rfe.auinit=N&rfe.btitle=Minimalist+Program
&rfe.isbn=0262531283&rfe.date=1995&rfe.pub=The+MIT+Press&rfe.place=Cambridge%2C+Mass
&svc_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_svc&svc.abstract=yes
&rfr_id=info%3Asid%2Febookco.com%3Abookreader

Ugly, I know, but bear with me for a moment.  From this example, let’s focus on the Referent:

rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=book&rft.aulast=Vergnaud
&rft.auinit=J.-R&rft.btitle=D%C3%A9pendances+et+niveaux+de+repr%C3%A9sentation+en+syntaxe
&rft.date=1985&rft.pub=Benjamins&rft.place=Amsterdam%2C+Philadelphia

and then let’s make this a little more human readable:

rft_val_fmt:  info:ofi/fmt:kev:mtx:book
rft.genre:  book
rft.aulast:  Vergnaud
rft.auinit:  J.-R
rft.btitle:  Dépendances et niveaux de représentation en syntaxe
rft.date:  1985
rft.pub:  Benjamins
rft.place:  Amsterdam, Philadelphia

Looking at this example, it’s certainly easy to draw some conclusions about the referent, the most obvious being that it’s a book.

Actually (and this is where it gets complicated and I begin to look pedantic) it’s really only telling you, I am sending some by value metadata in the info:ofi/fmt:kev:mtx:book format, not that the thing is actually a book (although the info:ofi/fmt:kev:mtx:book metadata values do state that, but, ignore that for a minute since genre is optional).

The way this actually should be thought of:

ContextObject:
    Referent:
       Metadata by Value:
          Format:  info:ofi/fmt:kev:mtx:book
          Metadata:
             Genre:  book
             Btitle:  Dépendances et niveaux de représentation en syntaxe
             …
    ReferringEntity:
       Identifier:  urn:isbn:0262531283
       Metadata by Value:
           Format:  info:ofi/fmt:kev:mtx:book
           Metadata:
               Genre:  book
               Isbn:  0262531283
               Btitle:  Minimalist Progam
               …
    Referrer:
       Identifier:  info:sid/ebookco.com:bookreader
    ServiceType:
       Metadata By Value:
           Format:  info:ofi/fmt:kev:mtx:sch_svc
           Metadata:
               Abstract:  yes

So, this should still seem fairly straightforward, but the hierarchy certainly isn’t evident in the KEV.  It’s a good starting point to begin talking about the complexity of working with OpenURL, though, especially if you’re trying to create a service that consumes OpenURL context objects.

Back to the referent metadata.  The context object didn’t have to send the data in the “metadata by value” stanza.  It could have just sent the identifier “urn:isbn:9027231141″ (and note in the above example, it didn’t have an identifier at all).  It could also have sent metadata in the Dublin Core format, MARC21, MODS, ONIX or all of the above (the Metadata By Value element is repeatable) if you wanted to make sure your referent could be parsed by the widest range of resolvers. While all of these are bibliographic formats, in Request Transfer Message context objects (which would be used for document delivery, which got me started down this whole path), you would conceivably have one or more of the aforementioned metadata types plus a Request Transfer Profile Referent type that describes the sorts of interlibrary loan-ish types of data that accompany the referent as well as an ISO Holdings Schema metadata element carrying the actual items a library has, their locations and status.

If you only have run across KEVs describing journal articles or books, this may come as a bit of a surprise.  Instead of saying the above referent is a book, it becomes important to say that the referent contains a metadata package (as Jonathan Rochkind calls it) that is in this (OpenURL specific) book format.  In this regard, OpenURL is similar to METS.  It wraps other metadata documents and defines the relationships between them.  It is completely ambivalent about the data it is transporting and makes no attempt to define it or format it in any way.  The Journal, Book, Patent and Dissertation formats were basically contrived to make compatibility with OpenURL 0.1 easier, but they are not directly associated with OpenURL and could have just as easily been replaced with, say, BibTex or RIS (although the fact that they were created alongside Z39.88 and are maintained by the same community makes the distinction difficult to see).

What this means, then, is that in order to know anything about a given entity, you also need to know about the metadata format that is being sent about it.  And since that metadata could literally be in any format, it means there are lot of variables that need to be addressed just to know what a thing is.

For the Umlaut, I wrote an OpenURL library for Ruby as a means to parse and create OpenURLs.  Needless to say, it was originally written with that naive, KEV-based, mental model (plus some other just completely errant assumptions about how context objects worked) and, because of this, I decided to completely rewrite it.  I am still in the process of this, but am struggling with some core architectural concepts and am throwing this out to the larger world as an appeal for ideas or advice.

Overall the design is pretty simple:  there is a ContextObject object that contains a hash of the administrative metadata and then attributes (referent, referrer, requester, etc.) that contain Entity objects.

The Entity object has arrays of identifiers, private data and metadata.

And then this is where I start to run aground.

The original (and current) plan was to populate the metadata array with native metadata objects that are generated by registering metadata classes in a MetadataFactory class.  The problem, you see, is that I don’t want to get into the business of having to create classes to parse and access every kind of metadata format that gets approved for Z39.88.  For example, Ed Summers’ ruby-marc has already solved the problem of effectively working with MARC in Ruby, so why do I want to reinvent that wheel?  The counter argument is, by delegating these responsibilities to third party libraries, there is no consistency of APIs between “metadata packages”.  A method used in format A may very well raise an exception (or, worse, overwrite data) in format B

There is a secondary problem that third party libraries aren’t going have any idea that they’re in an OpenURL context object or even know what that is.  This means there would have to be some class that handles functionality like xml serialization (since ruby-marc doesn’t know that Z39.88 refers to it as info:ofi/fmt:xml:xsd:MARC21), although this can be handled by the specific metadata factory class.  This would also be necessary when parsing an incoming OpenURL since, theoretically, every library could have a different syntax for importing XML, KEVs or whatever other serialization is devised in the future.

So I’m looking for advice on how to proceed.  All ideas welcome.

.



* HACK IS NOT A CRIME

Posted on August 15th, 2007 by Ross. Filed under coding, libraries, ruby.


…although LinuX_Xploit_Crew, with all due respect, I think it actually is.

Oh well, we’re back with a new theme (which nobody will see except to read comments, since I’m pretty sure all traffic comes from the code4lib planet) and an updated WordPress install. Look out, world!

So, in the downtime here’s a non-comprehensive rundown of what I’ve been working on:

  1. I’ve written an improved (at least, I think it’s improved) alternative to Docutek’s Eres RSS interface. Frankly, Docutek’s sucked. Maybe we have an outdated version of Eres, but the RSS feeds would give errors because you have to click through a copyright scare page before you can view reserves, but you can’t link the RSS links to this form and get the item. I wrote a little Ruby/Camping app that takes urls like: http://eres.library.gatech.edu/course/WS-1001-A/Fall/2007 and turns that into a usable feed. I needed the course id/term/year format to show them in Sakai. My favorite part of this project was finding Rubyscript2exe. This allows me to just bundle one file (the compiled camping app) plus a configuration file. Granted, an asp.net app would be even easier for sites to install, but I didn’t have time to learn asp.net. I have more ideas of what I would like to do with this (such as show current circ status for physical reserves), but in the chaos that is our library reorg, I haven’t gotten around to even showing anybody what I’ve written so far.
  2. I broke ground on a Metalib X-Server Ruby library. It took me a while to wrap my head around how this needed to be modelled, but I think it’s starting to take shape. It doesn’t actually perform queries, yet, but it connects to the server, allows you to set the portal association and find and set the category to search. Quicksets and MySets are all derivations of the same concept, so I don’t think it will be take me long to actually incorporate actual searching. For proof-of-concept, I plan on embedding this library in MemoryHole, our Solr-based discovery app. I’ve actually stopped development of MemoryHole so we can focus on vuFind, since they do functionally the same thing and I’d rather help make vuFind better than replicate everything it does, only in Ruby. The reason I’m doing this proof-of-concept in MemoryHole rather than vuFind is solely due to familiarity and time.

In other news, my last post seems to have caused a bit of a stir. My plan is to write a response, but the short of it is that I feel the arguments for an MLS are extremely classist.

Also my bathroom is finished and it looks great.

ERESidue

Blogged with Flock

Tags:

.



* Super arrogant programmer clique

Posted on June 13th, 2007 by Ross. Filed under coding, sakai.


One thing I hope that never happens at Access (and hope it hasn’t already happened and I’m too much of a super arrogant programmer to notice) is the “hotshot developers” gather round a table and whip out their laptops and break only to give supra-obtuse presentations on how to use their technologies (and opening by saying they aren’t going to explain anything to those that might not know the specifics of their particular technologies).

If Sakai really wants to become a ‘community driven software’, they really need to address this impenetrable geeky cult of personality.

Code4lib, of course, is already completely populated by super arrogant programmer types, so I’m not as worried about it.

.



* In search of… Bigfoot

Posted on March 29th, 2007 by Ross. Filed under coding, libraries, ruby.


Before I left for Guatemala, Ian Davis at Talis asked if I could give him a dump of our MARC records to load into Talis Platform. I had been talking in the #code4lib channel about how I was pushing the idea of using Talis Source to make simple, ad-hoc union catalogs; we could make one for Georgia Tech & Emory (we have joint degree programs) or Arche or Georgia Tech/Atlanta-Fulton Public Library, etc. My thinking was that by utilizing the Talis Platform, we could forgo much of the headache in actually making a union catalog for somewhat marginal use cases (the public library one notwithstanding).

About a week after I got back from Guatemala, I had an email from Richard Wallis with some urls to play around with to access my Bigfoot store. He showed me search services, facet services and augment services. I was unable to be really dive into it much at the time but since I’m working on a total site search project for the library, I thought this would be a good chance to kick the tires a bit to include catalog results.

After two days of poking around, I have made some opinions of it, have some recommendations for it, and wrote a Ruby library to access it.

1) The Item Service

This is certainly the most straightforward and for many people, the most useful service of the bunch. The easiest way to think of the item service is an HTTP based Lucene service (a la Solr or Lucene-WS) of your bib records. It returns something OpenSearch-y (it claims to be a RSS 1.0 document), but it doesn’t validate. That being said, FeedTools happily consumed it (more on that later) and the semantics should be familiar to anyone that has looked at OpenSearch before. Each item node also contains a Dublin Core representation of the record and a link to a marcxml representation. I’m not sure if there’s a description document for Bigfoot.
Although the query syntax is pure Lucene (title:”The Lexus and the Olive Tree”), the downside is that it’s not documented anywhere what the indexes are and I doubt there would be any way to add new ones (for example, my guess is I wouldn’t be able to get an index for 490/440$v that I use for the Umlaut). I don’t see returning the results as OAI_DC being too much of a problem, since the RSS item includes a title (which would have been tricky between the DC and the marcxml). My Ruby library might not generate valid DC, I haven’t really looked into it.

The docs also mention you can POST items to your Bigfoot store, but they don’t mention what your data needs to look like (MARC?) or what credentials you need to add something (I mean, it must be more than just your store name, right?). My hope is to add this functionality to bigfoot-ruby soon (especially since my data is from a bulk export from last October).

2) The Facet Service

This one is intriguing, definitely, since Faceted searching is all the rage right now.  The search syntax is basically the same as the Item Service, except you also send a comma delimited list of the fields you would like to query.  What you get back is either an XML or XHTML document of your results.

For each field you request, you get back a set of terms (you can specify how many you want, with a default of 5) that appear most frequently in your field.  You also get an approximation for how many results you would get in that facet and a url to search on that facet.  It’s quite fast, although, realistically, you can’t do much with the output of facet search alone.

Again, it’s difficult to know what you can facet on (subject, creator and date are all useful — I’m sure there are others) and the facet that (for me, at least) held the most promise — type — is too overly broad to do much with (it uses Leader position 7, but lumps the BKS and SER types all in a label called “text”).  I would like to see Talis implement something like my MARC::TypedRecord concept so one could facet on things like government document or conference.  You could separate newspapers from journals and globes from maps.  Still, the text analysis of the non-fixed fields is powerful and useful and beats the hell out of trying to implement something like that locally.

In bigfoot-ruby, I have provided two ways to do a faceted search:  you can just do the search and get back Facet objects containing the terms and search urls or you can facet with items which executes the item searches automatically (in turn getting a definitive number of results for the query, as well).  Since I didn’t bother to implement threading, getting facets with items can be pretty slow.

3) The Augment Service

To be honest, I’m having a hard time figuring out useful scenarios for the augment service.  The idea is that you give it the URI of an RSS feed, and this service will enhance it with data from your Bigfoot store (at least, that is sort of how I understand it works).  Richard’s example for me was to feed it the output of an xISBN query (which isn’t in RSS 1.0, AFAIK, but, for the sake of example…) and the augment service would fill in the data for ISBNs your library holds.  The API example page mentions Wikipedia, but I don’t know where other than the Talis Platform that you can get Wikipedia entries formatted properly.  I tried sending it the results of an Umlaut2 OpenSearch query, but it didn’t do anything with it.  Presumably this RSS 1.0 feed needs the bib data to be sent in a certain way (my guess is in OAI_DC, like the Item Service), but I’m not sure.  The only use case I can think of for this service is a much simpler way to check for ISBN concordance (rather than isbn:(123456789X|223456789X|323456789X|etc.))

Overall, I’m really impressed with the Talis API.  It is a LOT easier to use than, say, Z39.50 and by using OpenSearch seems more natural to integrate into existing web services than SRU.

Bigfoot-ruby is definitely a work in progress.  I think I would like to split the Search class into ItemService and FacetService.  I don’t like how results is an Array for items and a Hash for facets. Just seems sloppy.  I need to document it, of course and I would like to implement Item POST.  This project also made me realize how bloody slow FeedTools is.  I am currently using it in both the Umlaut and the Finding Aids to provide OpenSearch, but I think it’s really too sluggish to justify itself.

Thanks, Talis, for getting me started with Bigfoot and giving me the opportunity to play around with it.  Also, thanks to Ed Summers for fixing SVN on Code4lib.org.  You wouldn’t be able to download it and futz around with it yourself, otherwise.

.



* To build or to modify

Posted on February 11th, 2007 by Ross. Filed under cms, coding, intranet.


I readily admit that I have a bit of a NIH problem. A lot of this is laziness (hey, it’s a lot of work to figure out how somebody’s code works) coupled with the fact that I write fast married to the dilemma of if I modify this program to meet our needs, what happens when we need to upgrade?

One place where I’ve really struggled with this is surrounding intranets. At Emory I built myStaffWeb specifically to handle situations like:

  1. Employee needs to submit timesheet.
  2. Timesheet needs to go to employee’s supervisor.
  3. If either of these tasks are late, emails need to be sent to the guilty party.
  4. The supervisor either approves the timesheet (sending it to HR) or rejects it (sending it back to the employee).

Workflows like this are pretty commonplace for intranet-y tasks. The problem is that most off the shelf portal systems can’t handle them. Drupal, Joomla/Mambo, the Nukes all have horizontally based roles (members, editors, admins, etc.), which work fine for ‘content’ based sites, but lacks the granularity needed for ‘doing business’.

I built something similar to myStaffWeb when I got to Tech to handle other, similar tasks (set GALILEO password, upload EAD finding aid, manage subject guides, etc.). For ‘custom’ type modules, such a system works well; I am able to build a new module quickly and based on user, group and role models set up workflows pretty easily.

The problem is that as staff want to use more commonplace web tools (wikis, blogs, simple web pages, file management), it gets more and more complicated to keep up. The ability to draw from a large development community to help make your knowledgebase (say) or FAQ makes life a lot easier (and allows you to spend time working on things that help your public rather than your staff).

There are, in fact, some portal/community CMSes that have both horizontal and vertical permissions (’manager’ of this group), such as Plone and TikiWiki, so I had held out hope they would work for us.

TikiWiki eliminated itself pretty quickly, however. It’s got an interface that only an engineer could love, is painfully slow, and uses the Galaxia workflow engine. I’m sure Galaxia is powerful and all, but it’s horribly over engineered for the simple workflows I need. Besides, most of our workflows are repetitive: employee -> supervisor -> task manager, to have to build that workflow in Galaxia every time for every app seems overkill.

Plone seemed perfect in every way. Our sysadmin installed it for the library collaboration committee to experiment with and it appeared to solve our problems. Groups and roles within groups, blogs, wikis, content. It looked like it could handle our intranet and our public website. Assuming we committed to it.

With every intranet I’ve developed, I have always looked at Zope. It seems to hold so much promise in this arena, but somehow, every time, something horrible goes awry. Zope (and therefore Plone) is an unholy beast and one of its biggest problems is that when something goes wrong, because it’s so alien (it uses its own webserver, its own database engine, it uses python in its own special way) there’s no body of knowledge to rely on to get you out of trouble. Nothing you’ve learned by using apache for the last 9 years helps you debug a Zope server problem. There may be ways to access the ZODB directly, but it’s not like being able to open phpMyAdmin and fixing a field value. Everytime I’ve dabbled with Zope, something bad has happened and I haven’t been able to fix it. And that sticks with you.

We had one of those problems early on with Plone (the sysadmin ‘accidentally locked the keys in the car’, as I put it — a bad incompatibility with the Zope control panel and Plone’s CAS plugin), but we got around it, documented it, made a policy for plugins and forged ahead. Despite my reservations about Zope, I suggested we commit to using Plone for the intranet and the public web. I didn’t see any point in trying to maintain two CMS systems if Plone was going to work for the intranet.

And then we found Plone’s fatal flaw. While working on a wiki to document how to troubleshoot the Umlaut while I was in Toronto, something happened and somehow I saved a much earlier revision of my page (losing about half my work). Since I had been saving my edits, I assumed I could just rollback to a previous save, but you know what happens when you assume. To my (and the sysadmin’s) horror, we realized that Plone has no concept of versioning.

There are Plone products that deal with this, but none have ’stable’ releases and, besides, depending on a product for such core functionality seems risky. When and if versioning is integrated into Plone core, will your product be compatible? Will you get stuck behind if a new version of Plone comes out and your particular product doesn’t work with it? This gaping hole in functionality (coupled with learning curve inherent in Plone to begin with) basically brought our Plone experiment to a grinding halt.

So last week I started redesigning the existing intranet. I’ve migrated it to rails, and yet again I’m amazed at how quickly I can get core functionality running. Ruby/Rails is so much better suited to object model in our intranet than PHP was it makes this tremendously easier. Rather than worry about building wikis and blogs and whatnot, we’re just using off the shelf products that are remotely editable. For wikis, we’re using OddMuse, for blogs we’re using WP-MU. This way, I only have to make clients in the intranet to access these remotely.  In essence, the intranet aggregates services and manages permissions (this wiki is only available to the group or the user or to the library, etc.) and handles our specific needs, like timesheets and student employment applications and the like.

So basically we’re finding a compromise between invented here and there.  Just in time for me to drop this project like a hot rock for a new position :) (more on that later).

.



* Üps

Posted on January 23rd, 2007 by Ross. Filed under coding, umlaut.


Argh.  I’ve recently migrated to using Opera on our desktop at home (long story, but basically the experiment with Ubuntu didn’t go over well; we went back to Windows; and IE7 is too slow to be considered useful in any capacity.  And Selena uses Firefox.).  While in the middle of a long post about how I’ve broken the Ümlaut (not the production or subversion versions — but the development server is FUBARed and a huge flaw has been exposed in its design), I managed to, with my stumpy, inaccurate man-fingers, hit some key combination by accident that caused me to leave my
editing page and sent me to some ’Opera search page’.  Thanks!

…pause while Ross saves…

Anyway, a huge refactoring will be taking place (although, honestly, I don’t think it will take me very long).  The long and short of it is:  I was trying to make the Ümlaut database independent.  SQLite had a hard time with the Ümlaut’s liberal use of Marshaling objects in the database.  PostgreSQL wasn’t working with Rails and when I upgraded Rails…  all hell broke loose.  Basically my Marshal plan wasn’t going to work with the direction of Rails development.

But, this has actually led me to a better way (and, I think, much more efficient) way to handle requests.

So, stay tuned for Ü2.

.



* Nice threads

Posted on December 21st, 2006 by Ross. Filed under Polishing the Turd, Ruby on Rails, coding, libraries.


I have been working on Fancy-Pants quite a bit in the last couple of weeks. This is an AJAX layer over Voyager’s WebVoyage — an attempt to de-suck-ify its interface a bit. Why is it called Fancy-Pants? Well, Voyager still has the same underwear, it’s just got a new set of britches.

There are two main problems that it’s trying to solve:

  1. For items that have more than one MFHD, WebVoyage won’t show any item information in the title list.
  2. We wanted to link to 856 URLs from the title list.

Now, we’re already doing the second one, but it’s not implemented particularly well. While we were solving those problems, we wanted to see what we could do about that god-awful table based display.

I took NCSU’s Endeca layout as the baseline template for what I wanted the results to look like. Right now, Fancy-Pants can only be accessed via this Greasemonkey script [get Greasemonkey here]. Greasemonkey, of course, wouldn’t be a requirement, but we’re using it to inject the initial javascript call since we’re having to work on a live system.

For the title list screen, the javascript is looping through the bib ids on the page (it grabs them from the ’save record’ checkboxes) and sends them to a Ruby on Rails app that queries Voyager’s Oracle database and builds a new result set. The javascript hides the original page results (display: none) and inserts a div with the new results. If there are multiple 856es or locations, the result has expanding/collapsing divs to show/hide them.

I send the query terms to Yahoo’s spell check API and will return a link to any suggestions it gives. No, this isn’t the ideal, but I’m still in proof-of-concept stage.

Things I still want to do with title list screen are:

  1. Come up with a way to show what the item is (journal, microform, map, etc.) — I’ve started on this, but it’s very rough
  2. Make the ’sort by’ dropdown a row of links
  3. Turn the ‘Narrow my search’ button/page into a faceted navigation menu with options that make sense for the result set (for instance, limiting language to Dutch, Middle (ca. 1050-1350) isn’t going to come into play that much). Also add some logical facets a la Evergreen
  4. Replace the ’save record’ feature to work during the entire session and be able to save directly to Zotero, Endnote, Bibtex, CiteULike or Connotea.
  5. COinS and UnAPI
  6. Give it the same style as the rest of our new web design.

I’m currently not doing much with the record view page, but I am adding a direct link to the record.  I plan on integrating Umlaut responses here, as well as other context sensitive items - especially those that don’t conform well to OpenURL requests.

If you were able to install the Greasemonkey script and want to try it out, go to GIL’s keyword search and try:

  1. senate hearings — this is a good example of multiple mfhds/856es
  2. thomas friedmann — a good example of “Did you mean”

Also try a journal search for “Nature”.  Then try whatever floats your boat and let me know how it worked.  If you notice that it’s really slow, this is actually because of Voyager.  The “Available online” and relevance icons are all rendered dynamically and they just grind the output to a halt.  When we go live with this, we’d disable those features in WebVoyage to speed things up.
Fancy-pants is by no means a final product.  I view this as a bridge between what we have and an upcoming Solr based catalog interface.  The Solr catalog will still need to interface with Voyager, so Fancy-pants would transition to that.  Ultimately, I would like this whole process to eventually lead to the Communicat.

.



* The Soul-sucking vacuum that is EAD

Posted on December 20th, 2006 by Ross. Filed under Ruby on Rails, archives, coding, libraries, php.


I have a very conflicted relationship with our archives department. While their projects still need to get done, their services get very little use (especially when compared to other pending projects) and every time I get near any of their projects, it starts to become “Ross Singer and the Tar Baby”. Everything, EVERY SINGLE THING, archives/special collections has ever created is arcane, dense, and enormously time consuming. It always seems simple and then it somehow always turns into EAD.

About a year ago, I was tasked with developing something to help enable archives publish their recently converted EAD 2002 finding aids to HTML. I knew that XSLT stylesheets existed to do the heavy lifting in this so I thought it wouldn’t be too hard to get this up and running.

But it never works out that way with archives. The stylesheets that did what they liked worked only in Saxon and Saxon only works with Java (well, now with .Net, too — but that still didn’t help at the time). I wasn’t going to take on a language that I only poke at with a long stick on the most ambitious of days for some archives project (no offense, archives, but come on…). My whining caught Jeremy Frumkin’s attention who pointed me at a project that Terry Reese had done. It was a PHP/MySQL project that took the EAD, called Saxon from the commandline and printed the resulting HTML. It also indexed all these parts of the finding aid and put them in a MySQL database to enable search. I could never get the search part to work very well with our data, so I gave up that part, focusing instead on the Saxon->XSLT->cache to HTML part.

I set up a module in our intranet that let the archivists upload xml files, preview the result with the stylesheet, rollback to an earlier version, etc. No matter what, though, this was a hack. And, most importantly, I never got search working. Also, the detailed description (the dsc) was incredibly difficult to get to display how archives wanted it.

Another thing that nagged at me was, for all this work I (and they) had invested in this, how was this really any better than the HTML finding aids they were migrating from? They put all this work into encoding the EAD and all that we were really doing was displaying a web page that was less flexible in its output than the previous HTML.
This summer, I met with archives to discuss how we were going to enable searching. Their vision seemed pretty simple to implement.

Why do I keep falling for that? How many hours had I already invested in something I thought would be trivial?

The new system was (of course) built in Rails. The plan was to circumvent XSLT altogether so:

  1. The web designer could have more control over how things worked without punching the XSLT tarbaby.
  2. We could make some of that stuff in the EAD “actionable”, like the subject headings, personal names, corporate names, etc.
  3. We could avoid the XSLT nightmare that is displaying the Box/Folder/etc. list.

The fulltext indexing would be provided by Ferret. I thought I could hammer this out in a couple of weeks. I think dumb things like this a lot.

The infrastructure went up pretty quickly. Uploading, indexing, and displaying the browse lists (up until now, they had to get the web designer to add their newly encoded finding aid to the various pages to link to it) all took a little over a week, maybe. Search took maybe another week. For launch purposes, I wasn’t worried about pagination of search results. There were only around 210 finding aids in the system, so a search for “georgia” (which cast the widest net) didn’t really make a terribly unmanageable results page. That’s the nice thing about working with such a small dataset that’s not heavily used. Inefficiencies don’t matter that much. I’ve since added pagination (took a day or two).

No, like last time, the real burden was displaying the finding aid. My initial plan, parsing the XML and divvying it up into a bunch of Ruby objects, was taking a lot of time. EAD is inordinately difficult to put into logical containers that have easily usable attributes. That’s just not how EAD rolls. I found I was severely delaying launch trying to shoehorn my vision on the finding aid, which archives didn’t really care about, of course. They just wanted searching. So, while I was working out my EAD/Ruby object modeler, I deferred to XSLT to get the system out the door.

Rather than using Saxon this time, I opted for Ruby/XSLT (based on libxml/libxslt) for main part of the document and ruby scripting/templates for the detailed collection list. The former worked pretty well (and fast!) but the latter was turning into a nightmare of endless recursion. When I tried looping through all of the levels (EAD can have 9 levels of recursion, c01-c09, starting at any number — I think — and going to any depth), my vain attempts either showed the horrid performance of REXML (a native Ruby XML parser) or attempts at navigating the recursion that would leave you clutching at the fibers of sanity that remain when you get this far in an EAD document.

Finally I found what I thought was my answer: a nifty little Ruby library called CobraVsMongoose that would transform a REXML document to a Ruby hash (or vice-versa). It was unbelievably fast and made working with this nested structure a WHOLE lot easier. There was some strangeness to overcome (naturally). For instance, if an element’s child nodes include an element that appears more than once, it will nestle the the children in an array. If there are no repeating elements, it will put then child nodes in a Hash, (or the other way around, I can’t remember) so you have to check what the object is and process it differently accordingly. Still, it was fast, easy to manipulate and allowed me to launch the finding aids system.

Everybody was happy.

And then archives actually looked at the detailed description. Anything in italics or bold or in those weird ‘emph’ tags wasn’t being displayed. Ok, no problem, I just need to mess with the CobraVsMongoose function… oh wait. Yeah, what I hadn’t really thought about was that Ruby hashes don’t preserve sort order, so there was no way to get the italicized or bolded or whatevered text to display where it was supposed to in the output.

Damn you, tarbaby!

Back to the drawing board. I decided to return to REXML, emboldened now by (what I thought) was a better handle on the dsc structure and some better approaches to the recursion. Every c0n element would get mapped to a Ruby object (Collection, Series, Subseries, Folder, Item, etc.) and nest as children to each other and have partial views that would display them properly.

On my first pass, it was taking 28 seconds for our largest finding aid to display. TWENTY EIGHT SECONDS?! I would like to note, as finding aids go, ours aren’t very large. So, after tweaking a bit more (eliminating checking for children in Item elements, for example), I got it down to 24 seconds. Still not exactly the breathtaking performance I had hoped for.

What was bugging me was how quick CobraVsMongoose was. Somehow, despite also using REXML, it was fast. Really fast. And here my implementation seemed more like TurtleVsSnail. I was all set to turn to Libxml-Ruby (which would require a bunch of refactoring to migrate from REXML) when I found my problem and its solution. This was last night at 11PM.

While poring over the REXML API docs, I noticed that the REXML::Element.each_element method’s argument was called ‘xpath’. Terry had written about how dreadfully slow XPath queries were with REXML and, as a result, I thought I was avoiding them. When I removed the path arg from the each_element call in one of my methods and just iterated through each child element to see if its name matched, it cut the processing time in half! So, while 12 seconds was certainly no thoroughbred, it was definitely the right track. When I eliminated every xpath in the recursion process, I got it down to about 5 seconds. Add a touch of fragment caching and the natural performance boost of a production vs. development site in rails, and I think we’ve got a “good enough for now” solution.

There’s a little more to do with this project. I plan on adding an OpenSearch interface (I have all the plumbing from adding it to the Umlaut), an OAI provider and an SRU interface (when I finally get around to porting that CQL parser). And, yeah, finishing the EAD Object Model.

But right now, archives and I need to spend a little time away from each other.

In the meantime, here’s the development site.

.



* “I dropped my pick”

Posted on September 18th, 2006 by Ross. Filed under About me, coding, geeks, libraries, philosophizing.


Library Geeks #4 is out. I’m back (instead of being the ‘Poltergeek’ of episode 3), and Dan set this one up in kind of a neat way. He certainly leads and guides the dialogue, but it’s much more of a roundtable and informal discussion. No doubt this is largely due to the fact we’re all pretty good friends. Still, I learned a lot about Ed that I didn’t know — pretty amazing since we hang out every day and work on so much together.

Bobby McFerrin, eat your heart out.

.