Archive for the ‘ruby’ Category

* Working around Ruby with XSLT

Posted on September 16th, 2008 by Ross. Filed under jangle, ruby, xslt.


My relationship with Ruby nowadays is roughly akin to that of somebody addicted to painkillers.  I know it’s not good for me (since everything I work on nowadays is RDF, XML or both) but I’m still able to be productive, and quitting, while better for everybody in the long run, is painful and just isn’t something I have time for right now.  Maybe someday I’ll make the jump back to Python (since it’s actually pretty good at dealing with both RDF and XML), but for now I’ll just find workarounds to my problems (unlike others, I am completely incapable of juggling more than one language).

I first ran into my big XML and Ruby problem a couple of weeks ago while working on the TalisLMS connector for Jangle.  I’ve, of course, run into it before, but it has never been a total show stopper like this.  In order to add the Resource entity (Jangle-ese for bibliographic records) to the TalisLMS connector, I am querying the Platform store the OPAC uses.  I’m using the Platform rather than the Zebra index that comes with Alto (the records are indexed in both places) because the modified date isn’t sortable in Zebra and that would be an issue when serializing everything to Atom.  The records are transformed into a proprietary RDF format (called BibRDF) when loaded into the Platform (this is for the benefit of Prism, our OPAC).  In order to get the MARC records (there’s no route back to the MARC from BibRDF), I have to pull the UniqueIdentifier field (which is the mapped 001) out of the BibRDF, throw the values into a Z39.50 client (Ruby/ZOOM) and query the Zebra index.  In order to get enough metadata to create a valid Atom entry, I needed to be able to parse the BibRDF (which comes out of the Platform as RDF/XML), since that is the default record format.

And this is where I’d run into problems.  I have Jangle set to return 100 records by default.  That’s a pretty sweet spot for both servers to handle the load and clients to deal with the resulting Atom document.  Well, you’d think it was, anyway, except REXML was taking about 10 seconds to parse the Platform response into Ruby objects.

I realize the Rubyists out there are already dismissing this and scrolling down to the comment box to write “well don’t use REXML, you dumbass”, but let me explain.  I generally don’t use REXML (unless it’s something very small and simple), instead opting for Hpricot for parsing XML.  I’ve tended to avoid LibXML in Ruby: when I first tried it, it segfaulted a lot, but that was the past.  My reason for avoiding it lately is that I have this stubborn ideal about having things work with JRuby, and that’s just not going to be an option with LibXML (before you scroll down and add another comment about the Ruby/ZOOM requirement, it will eventually be replaced with Ruby-SRU… probably).  Hpricot was falling flat on its face with the BibRDF namespace prefixes, though (j.0:UniqueIdentifier).  It seems to have problems with periods in the prefix, so that was a no go.

So I had REXML and I had horrible performance.  Now what?

Well, JSON is fast in Ruby, so I thought that might be an option.  The Platform has a transform service: if you pass an argument with the URL of an XSLT stylesheet, it will output the result in the format you want.  Googling found several projects that would turn XML into JSON via XSLT (this one seems the best if you have an XSLT 2.0 parser), but they weren’t quite what I needed.  I wanted to preserve the original RDF/XML since I was just going to be turning around and regurgitating it back to the Jangle server, anyway.  I just needed a quick way to grab the UniqueIdentifier, MainAuthor and LastModified fields and shove the rest of the XML into an object attribute.

I have always chafed at the thought of actually doing anything in XSLT.  In retrospect (after I’ve been using it almost exclusively for a month, now), I realize that my opinion was probably actually the result of the data that I was trying to transform (EAD, the metadata format designed to punish technologists) rather than XSLT itself (the project got sucked into a vortex when I tried working with the EAD directly with Ruby, too).  Still, I had always resisted.  The syntax is weird, variables confused me, I just never got the hang of it.

But, damn, it’s fast.

And, when I turned the XML into JSON (with XML), it was perfect.  Here’s my stylesheet.  Here’s what the output from the Platform looks like.  Here’s the output from the TalisLMS connector.

I wasn’t done, yet, though.  The DLF ILS-DI Adapter for Jangle’s OAI-PMH service was sooooo slow.  Requests were literally taking around 35 seconds each.  This was because I was using FeedTools to parse the Atom documents and Builder::XmlMarkup to generate the OAI-PMH output.  And this was silly.  Atom is a very short hop to OAI-PMH, and there was really no need to manipulate the data itself at all.  However, I did need to add stuff to the final XML output that I wouldn’t know until it was time to render.  So I wrote these two XSLTs.  I have patterns in there which are identified by “##verb##” or “##requestUrl##”, etc.  This way, I can load the XSLT file into my Ruby script, replace the patterns with their real values via regex, and then transform the Atom to OAI-PMH using libxslt-ruby.  Requests are now down to about 5 seconds.  Not bad.
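The pattern-replacement step can be sketched roughly like this. It is a minimal sketch, not the actual stylesheets: the template fragment, the `fill_placeholders` helper and the sample values are all made up for illustration, and the libxslt-ruby transform that would follow is left out entirely.

```ruby
# Hypothetical one-line stand-in for a stylesheet that contains
# ##name## patterns; the real XSLT files are not reproduced here.
OAI_TEMPLATE = <<~XSLT
  <request verb="##verb##">##requestUrl##</request>
XSLT

# Values land inside XML, so escape the characters that would break it.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Replace every ##name## pattern with its value, failing loudly when a
# pattern has no value instead of leaving it in the stylesheet.
def fill_placeholders(template, values)
  template.gsub(/##(\w+)##/) do
    key = Regexp.last_match(1)
    raise KeyError, "no value for placeholder #{key}" unless values.key?(key)
    values[key].to_s.gsub(/[&<>"]/, XML_ESCAPES)
  end
end

puts fill_placeholders(
  OAI_TEMPLATE,
  'verb' => 'ListRecords',
  'requestUrl' => 'http://example.org/oai?verb=ListRecords&set=x'
)
```

Escaping the values matters mostly for the request URL, which will usually contain an ampersand and would otherwise leave the stylesheet unparseable.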

All in all I’m pretty happy with this.  And I don’t have to quit my addiction just yet.

For those of you that noticed that libxslt-ruby doesn’t quite jibe with my JRuby requirement, well, I guess I’m not very dogmatic at the end of the day (which is right about now).




* Blindly groping towards ActivePlatform

Posted on April 9th, 2008 by Ross. Filed under activerdf, coding, platform, ruby.


Something I’ve taken it upon myself to do since I joined Talis is make ActiveRDF a viable client to access the Platform.  While this is mostly selfishness on my part (I want to keep developing in Ruby and there’s basically no RDF support right now, plus this gives me a chance to learn about the RDF/SPARQL-y aspects of the Platform), I also think that libraries like this can only help democratize the Platform.

So far, it’s been pretty ugly.  I haven’t had much time to work on it, granted, but the time I’ve spent on it has made me think that there will be a lot of work to do.  Couple this with some of the things that make the Platform difficult to work with in Ruby anyway (read: Digest Authentication) and this might be a more uphill battle than I’ll ever have time for, but I figure it’s either this or go back to Python and I’m not quite ready to give up on Ruby yet.

Currently, performance is abysmal with ActiveRDF against the Platform, so I’ll need to think of shortcuts to improve that (I’m not even considering write access presently).  Here’s some code (this is as much for my benefit, so I can remember what I’ve done) to work with Ian Davis’ Quotations Book Example store:

require 'time' # Otherwise ActiveRDF starts freaking out about DateTime
require 'active_rdf'

$activerdf_without_xsdtype = true
# less than ideal, but without it, ActiveRDF sends
# ^^<http://www.w3.org/2001/XMLSchema#string> with string literals even if you don't want
# to send the datatype.  I haven't actually tried it with other datatypes to see how this breaks
# down the road.

ConnectionPool.set_data_source(:type => :sparql, :results => :sparql_xml, :engine => :joseki, :url => "http://api.talis.com/stores/iand-dev2/services/sparql")

Namespace.register :foaf, "http://xmlns.com/foaf/0.1/"
Namespace.register :dc, "http://purl.org/dc/elements/1.1/"
Namespace.register :quote, "http://purl.org/vocab/quotation/schema"

QUOTE::Quotations.find_by_dc::creator("Loren, Sophia").each do |quote|
  # print the important stuff from each graph

  # http://purl.org/vocab/quotation/schema#quote has to be manually added as a predicate
  # the "#" seems to cause problems
  quote.add_predicate(:quote, QUOTE::quote)
  puts quote.quote
  puts quote.subject
  puts quote.rights
  puts quote.isPrimaryTopicOf
end

If you actually try to execute this, you’ll see that it takes a long time to run (God help you if you try it on QUOTE::Quotations.find_by_dc::subject("Age and Aging")).  A really long time.

If you set some environment vars before you go into irb:

$ export ACTIVE_RDF_LOG_LEVEL=0
$ export ACTIVE_RDF_LOG=./activerdf.log

then you can tail -f activerdf.log and see what exactly is happening.

After ActiveRDF does its initial SPARQL query (SELECT DISTINCT ?s WHERE { ?s <http://purl.org/dc/elements/1.1/creator> “Loren, Sophia” . }), it’s doing two things for every resource in the block:

  1. a SPARQL query for every predicate associated with the URI (http://api.talis.com/stores/iand-dev2/services/sparql?query=SELECT+DISTINCT+%3Fp+WHERE+%7B+%3Chttp%3A%2F%2Fapi.talis.com%2Fstores%2Fiand-dev2%2Fitems%2F1187139384317%3E+%3Fp+%3Fo+.+%7D+)
  2. a SPARQL query for the value of the attribute (predicate):  http://api.talis.com/stores/iand-dev2/services/sparql?query=SELECT+DISTINCT+%3Fo+WHERE+%7B+%3Chttp%3A%2F%2Fapi.talis.com%2Fstores%2Fiand-dev2%2Fitems%2F1187139384317%3E+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Felements%2F1.1%2Fcreator%3E+%3Fo+.+%7D

for every predicate in the graph.  You can imagine how crazily inefficient this is, since to get every value for a resource, you have to make a different HTTP request for each one.
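To put a rough number on that, here is a small sketch that rebuilds the two query URLs from the log and counts the round trips. The helper names and the figures passed to `request_count` are assumptions for illustration; none of this is ActiveRDF code, and the count assumes every predicate of every resource actually gets read.

```ruby
require 'uri'

ENDPOINT = 'http://api.talis.com/stores/iand-dev2/services/sparql'
ITEM     = 'http://api.talis.com/stores/iand-dev2/items/1187139384317'
CREATOR  = 'http://purl.org/dc/elements/1.1/creator'

# Form-encode a SPARQL query onto the endpoint URL.
def sparql_url(query)
  "#{ENDPOINT}?#{URI.encode_www_form(query: query)}"
end

# Query 1 from the log: which predicates does this subject have?
def predicates_query(subject)
  "SELECT DISTINCT ?p WHERE { <#{subject}> ?p ?o . }"
end

# Query 2 from the log: what is the value of one predicate?
def value_query(subject, predicate)
  "SELECT DISTINCT ?o WHERE { <#{subject}> <#{predicate}> ?o . }"
end

# One initial SELECT, then per resource one predicate lookup plus one
# value lookup for each predicate that is read.
def request_count(resources, predicates_per_resource)
  1 + resources * (1 + predicates_per_resource)
end

puts sparql_url(predicates_query(ITEM))
puts sparql_url(value_query(ITEM, CREATOR))
puts request_count(10, 8)
```

With ten matching quotations and eight predicates each (made-up figures), that is already 91 HTTP requests for one loop.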

Obviously this would be a lot easier if it used DESCRIBE rather than SELECT, but without a real RDF library to parse the resulting graph, I’m not sure how ActiveRDF would deal with what the triple store returned.

So, anyway, these are some of the hurdles in making ActiveRDF work with the Platform, but I’m not quite ready to throw in the towel, yet.




* Objectifying OpenURL

Posted on January 9th, 2008 by Ross. Filed under OpenURL, coding, ruby.


Sometime in November, I came to the realization that I had horribly misinterpreted the NISO Z39.88/OpenURL 1.0 spec.  I’m on the NISO Advisory Committee for OpenURL (which makes this even more embarrassing) and was reviewing the proposal for the Request Transfer Message Community Profile and its associated metadata formats when it dawned on me that my mental model was completely wrong.  For those of you that have primarily dealt with KEV based OpenURLs (which is 99% of all the OpenURLs in the wild), I would wager that your mental model is probably wrong, too.

A quick primer on OpenURL:

  • OpenURL is a standard for transporting ContextObjects (basically a reference to something, in practice, mostly bibliographic citations)
  • A ContextObject (CTX, for short from now on) is comprised of Entities that help define what it is.  Entities can be one of six kinds:
    • Referent - this is the meat of the CTX, what it’s about, what you’re trying to get context about.  A CTX must have one referent and only one.
    • ReferringEntity - defines the resource that cited the referent.  This is optional and can only appear once.
    • Referrer - the source of where the CTX came from (i.e. the A&I database).  This is optional and can only appear once.
    • Requester - this is information about who is making the request (i.e. the user’s IP address).  This is optional and can only appear once.
    • ServiceType - this defines what sorts of services are being requested about the referent (i.e. getFullText, document delivery services, etc.).  There can be zero or many ServiceType entities defined in the CTX.
    • Resolver - these are messages specifically to the resolver about the request.  There can be zero or more Resolver entities defined in the CTX.
  • All entities are basically the same in what they can hold:
    • Identifiers (such as DOI or IP Address)
    • By-Value Metadata (the metadata is included in the Entity)
    • By-Reference Metadata (the Entity has a pointer to a URL where you can retrieve the metadata, rather than including it in the CTX itself)
    • Private Data (presumably data, possibly confidential, between the entity and the resolver)
  • A CTX can also contain administrative data, which defines the version of the ContextObject, a timestamp and an identifier for the CTX (all optional)
  • Community Profiles define valid configurations and constraints for a given use case (for instance, scholarly search services are defined differently than document delivery).  Context objects don’t actually specify any community profile they conform to.  This is a rather loose agreement between the resolver and the context object source:   if you provide me with a SAP1, SAP2 or Dublin Core compliant OpenURL, I can return something sensible.
  • There are currently two registered serializations for OpenURL:  Key/Encoded Values, where all of the values are output on a single string, formatted as key=value and delimited by ampersands (this is what the majority of all OpenURLs that currently exist look like), and XML (which is much rarer, but also much more powerful)
  • There is no standard OpenURL ‘response’ format.  Given the nature of OpenURL, it’s highly unlikely that one could be created that would meet all expected needs.  A better alternative would be for a particular community profile to define a response format since the scope would be more realistic and focused.

Looking back on this, I’m not sure how “quick” this is, but hopefully it can bootstrap those of you that have only cursory knowledge of OpenURL (or less).  Another interesting way to look at OpenURL is Jeff Young’s 6 questions approach, which breaks OpenURL down to “who”, “what”, “where”, “when”, “why” and “how”.

One of the great failings of OpenURL (in my mind, at least) is the complete and utter lack of documentation, examples, dialog or tutorials about its use or potential.  In fact, outside of COinS, maybe, there is no notion of “community” to help promote OpenURL or cultivate awareness or adoption.  To be fair, I am as guilty as anybody for this failure, since I had proposed making a community site for OpenURL, but a shift in job responsibilities and then a wholesale change in employers, coupled with the hacking of the server it was to live on, left this by the wayside.  I’m putting this back on my to do list.

What this lack of direction leads to is that would-be implementors wind up making a lot of assumptions about OpenURL.  The official spec published at NISO is a tough read and is generally discouraged by the “inner core” of the OpenURL universe (the Herbert van de Sompels, the Eric Hellmans, the Karen Coyles, etc.) in favor of the “Implementation Guidelines” documents.  However, only the KEV Guidelines are actually posted there.  The only other real avenue for trying to come to grips with OpenURL is to dissect the behavior of link resolvers.  Again, in almost every instance this means you’re working with KEVs and the downside of KEVs is that they give you a very naive view of OpenURL.

KEVs, by their very nature, are flat and expose next to nothing about the structure of the model of the context object they represent.  Take the following, for example:

url_ver=Z39.88-2004&url_tim=2003-04-11T10%3A09%3A15TZD
&url_ctx_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Actx&ctx_ver=Z39.88-2004
&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&ctx_id=10_8&ctx_tim=2003-04-11T10%3A08%3A30TZD
&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=book&rft.aulast=Vergnaud
&rft.auinit=J.-R&rft.btitle=D%C3%A9pendances+et+niveaux+de+repr%C3%A9sentation+en+syntaxe
&rft.date=1985&rft.pub=Benjamins&rft.place=Amsterdam%2C+Philadelphia
&rfe_id=urn%3Aisbn%3A0262531283&rfe_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook
&rfe.genre=book&rfe.aulast=Chomsky&rfe.auinit=N&rfe.btitle=Minimalist+Program
&rfe.isbn=0262531283&rfe.date=1995&rfe.pub=The+MIT+Press&rfe.place=Cambridge%2C+Mass
&svc_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_svc&svc.abstract=yes
&rfr_id=info%3Asid%2Febookco.com%3Abookreader

Ugly, I know, but bear with me for a moment.  From this example, let’s focus on the Referent:

rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=book&rft.aulast=Vergnaud
&rft.auinit=J.-R&rft.btitle=D%C3%A9pendances+et+niveaux+de+repr%C3%A9sentation+en+syntaxe
&rft.date=1985&rft.pub=Benjamins&rft.place=Amsterdam%2C+Philadelphia

and then let’s make this a little more human readable:

rft_val_fmt:  info:ofi/fmt:kev:mtx:book
rft.genre:  book
rft.aulast:  Vergnaud
rft.auinit:  J.-R
rft.btitle:  Dépendances et niveaux de représentation en syntaxe
rft.date:  1985
rft.pub:  Benjamins
rft.place:  Amsterdam, Philadelphia

Looking at this example, it’s certainly easy to draw some conclusions about the referent, the most obvious being that it’s a book.

Actually (and this is where it gets complicated and I begin to look pedantic) it’s really only telling you “I am sending some by-value metadata in the info:ofi/fmt:kev:mtx:book format”, not that the thing is actually a book (the info:ofi/fmt:kev:mtx:book metadata values do state that, but ignore that for a minute, since genre is optional).

The way this actually should be thought of:

ContextObject:
    Referent:
       Metadata by Value:
          Format:  info:ofi/fmt:kev:mtx:book
          Metadata:
             Genre:  book
             Btitle:  Dépendances et niveaux de représentation en syntaxe
             …
    ReferringEntity:
       Identifier:  urn:isbn:0262531283
       Metadata by Value:
           Format:  info:ofi/fmt:kev:mtx:book
           Metadata:
               Genre:  book
               Isbn:  0262531283
               Btitle:  Minimalist Program
               …
    Referrer:
       Identifier:  info:sid/ebookco.com:bookreader
    ServiceType:
       Metadata By Value:
           Format:  info:ofi/fmt:kev:mtx:sch_svc
           Metadata:
               Abstract:  yes

So, this should still seem fairly straightforward, but the hierarchy certainly isn’t evident in the KEV.  It’s a good starting point to begin talking about the complexity of working with OpenURL, though, especially if you’re trying to create a service that consumes OpenURL context objects.
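To make the hidden hierarchy concrete, here is a rough sketch that folds a flat KEV string back into that nested shape using the key prefixes (rft, rfe, rfr, req, svc, res). The method name, the hash layout and the shortened sample string are my own assumptions for illustration; this is not the OpenURL library discussed below, and it ignores repeated keys (a second rft.au would overwrite the first).

```ruby
require 'uri'

# KEV key prefixes and the entity each one belongs to.
ENTITY_PREFIXES = {
  'rft' => :referent,
  'rfe' => :referring_entity,
  'rfr' => :referrer,
  'req' => :requester,
  'svc' => :service_type,
  'res' => :resolver
}.freeze

# A shortened version of the KEV above, joined into one string.
SAMPLE_KEV = [
  'ctx_ver=Z39.88-2004',
  'rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook',
  'rft.genre=book',
  'rft.aulast=Vergnaud',
  'rfe_id=urn%3Aisbn%3A0262531283',
  'rfe_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook',
  'rfe.btitle=Minimalist+Program',
  'svc_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Asch_svc',
  'svc.abstract=yes',
  'rfr_id=info%3Asid%2Febookco.com%3Abookreader'
].join('&')

# Group decoded key/value pairs by entity. Keys such as "rft.genre" are
# by-value metadata; "rft_id" and "rft_val_fmt" describe the entity itself.
def nest_kev(kev)
  ctx = { admin: {} }
  URI.decode_www_form(kev).each do |key, value|
    prefix, separator, rest = key.partition(/[._]/)
    entity = ENTITY_PREFIXES[prefix]
    if entity.nil?
      ctx[:admin][key] = value # url_ver, ctx_id, ctx_tim and friends
      next
    end
    slot = ctx[entity] ||= { identifiers: [], format: nil, metadata: {}, other: {} }
    if separator == '.'
      slot[:metadata][rest] = value
    elsif rest == 'id'
      slot[:identifiers] << value
    elsif rest == 'val_fmt'
      slot[:format] = value
    else
      slot[:other][rest] = value # by-reference and private data keys
    end
  end
  ctx
end

ctx = nest_kev(SAMPLE_KEV)
puts ctx[:referent][:format]
puts ctx[:referring_entity][:identifiers].inspect
```

Note that the format lands next to the metadata rather than inside it, which is the distinction the outline above is drawing.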

Back to the referent metadata.  The context object didn’t have to send the data in the “metadata by value” stanza.  It could have just sent the identifier “urn:isbn:9027231141” (and note in the above example, it didn’t have an identifier at all).  It could also have sent metadata in the Dublin Core format, MARC21, MODS, ONIX or all of the above (the Metadata By Value element is repeatable) if you wanted to make sure your referent could be parsed by the widest range of resolvers. While all of these are bibliographic formats, in Request Transfer Message context objects (which would be used for document delivery, which got me started down this whole path), you would conceivably have one or more of the aforementioned metadata types plus a Request Transfer Profile Referent type that describes the sorts of interlibrary loan-ish types of data that accompany the referent as well as an ISO Holdings Schema metadata element carrying the actual items a library has, their locations and status.

If you have only run across KEVs describing journal articles or books, this may come as a bit of a surprise.  Instead of saying the above referent is a book, it becomes important to say that the referent contains a metadata package (as Jonathan Rochkind calls it) that is in this (OpenURL specific) book format.  In this regard, OpenURL is similar to METS.  It wraps other metadata documents and defines the relationships between them.  It is completely agnostic about the data it is transporting and makes no attempt to define it or format it in any way.  The Journal, Book, Patent and Dissertation formats were basically contrived to make compatibility with OpenURL 0.1 easier, but they are not directly associated with OpenURL and could have just as easily been replaced with, say, BibTeX or RIS (although the fact that they were created alongside Z39.88 and are maintained by the same community makes the distinction difficult to see).

What this means, then, is that in order to know anything about a given entity, you also need to know about the metadata format that is being sent about it.  And since that metadata could literally be in any format, it means there are a lot of variables that need to be addressed just to know what a thing is.

For the Umlaut, I wrote an OpenURL library for Ruby as a means to parse and create OpenURLs.  Needless to say, it was originally written with that naive, KEV-based, mental model (plus some other just completely errant assumptions about how context objects worked) and, because of this, I decided to completely rewrite it.  I am still in the process of this, but am struggling with some core architectural concepts and am throwing this out to the larger world as an appeal for ideas or advice.

Overall the design is pretty simple:  there is a ContextObject object that contains a hash of the administrative metadata and then attributes (referent, referrer, requester, etc.) that contain Entity objects.

The Entity object has arrays of identifiers, private data and metadata.
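In code, that design looks roughly like the sketch below. The class and attribute names follow the description above, but the details (plural arrays for ServiceType and Resolver, the snake_case names) are assumptions for illustration and not the library’s actual API.

```ruby
# An entity holds the same three kinds of things regardless of its role.
class Entity
  attr_reader :identifiers, :private_data, :metadata

  def initialize
    @identifiers  = [] # e.g. "urn:isbn:0262531283"
    @private_data = []
    @metadata     = [] # zero or more metadata packages, in any format
  end
end

# The context object: a hash of administrative data plus entity slots.
class ContextObject
  attr_reader :admin, :referent, :service_types, :resolvers
  attr_accessor :referring_entity, :referrer, :requester

  def initialize
    @admin            = {}         # version, timestamp, identifier
    @referent         = Entity.new # exactly one, always present
    @referring_entity = nil        # optional, at most one
    @referrer         = nil        # optional, at most one
    @requester        = nil        # optional, at most one
    @service_types    = []         # zero or more
    @resolvers        = []         # zero or more
  end
end

ctx = ContextObject.new
ctx.admin['ctx_ver'] = 'Z39.88-2004'
ctx.referrer = Entity.new
ctx.referrer.identifiers << 'info:sid/ebookco.com:bookreader'
puts ctx.referrer.identifiers.first
```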

And then this is where I start to run aground.

The original (and current) plan was to populate the metadata array with native metadata objects that are generated by registering metadata classes in a MetadataFactory class.  The problem, you see, is that I don’t want to get into the business of having to create classes to parse and access every kind of metadata format that gets approved for Z39.88.  For example, Ed Summers’ ruby-marc has already solved the problem of effectively working with MARC in Ruby, so why do I want to reinvent that wheel?  The counter argument is, by delegating these responsibilities to third party libraries, there is no consistency of APIs between “metadata packages”.  A method used in format A may very well raise an exception (or, worse, overwrite data) in format B.

There is a secondary problem that third party libraries aren’t going to have any idea that they’re in an OpenURL context object or even know what that is.  This means there would have to be some class that handles functionality like XML serialization (since ruby-marc doesn’t know that Z39.88 refers to it as info:ofi/fmt:xml:xsd:MARC21), although this can be handled by the specific metadata factory class.  This would also be necessary when parsing an incoming OpenURL since, theoretically, every library could have a different syntax for importing XML, KEVs or whatever other serialization is devised in the future.
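One possible shape for that factory, purely as a sketch to make the trade-off concrete: formats register a builder block, and whatever native object the builder returns gets wrapped together with its Z39.88 format identifier. `MetadataPackage`, `register` and `build` are hypothetical names, and the builder here returns a plain Hash where a real one might hand off to something like ruby-marc.

```ruby
# Pairs a native object with the format identifier it arrived under, so
# serialization code knows the identifier even if the native object doesn't.
MetadataPackage = Struct.new(:format, :native)

class MetadataFactory
  @registry = {}

  class << self
    # Register a block that turns raw data into a native object.
    def register(format, &builder)
      @registry[format] = builder
    end

    # Build a package; unregistered formats keep the raw data untouched.
    def build(format, raw)
      builder = @registry[format]
      MetadataPackage.new(format, builder ? builder.call(raw) : raw)
    end
  end
end

# Stand-in builder for the KEV book format.
MetadataFactory.register('info:ofi/fmt:kev:mtx:book') do |raw|
  raw.transform_keys(&:to_sym)
end

book = MetadataFactory.build('info:ofi/fmt:kev:mtx:book', 'genre' => 'book')
puts book.format
puts book.native.inspect
```

This does nothing about the inconsistent-API problem: `native` is still whatever the third party library says it is. It only gives the OpenURL-specific knowledge a place to live.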

So I’m looking for advice on how to proceed.  All ideas welcome.




* HACK IS NOT A CRIME

Posted on August 15th, 2007 by Ross. Filed under coding, libraries, ruby.


…although LinuX_Xploit_Crew, with all due respect, I think it actually is.

Oh well, we’re back with a new theme (which nobody will see except to read comments, since I’m pretty sure all traffic comes from the code4lib planet) and an updated WordPress install. Look out, world!

So, in the downtime here’s a non-comprehensive rundown of what I’ve been working on:

  1. I’ve written an improved (at least, I think it’s improved) alternative to Docutek’s Eres RSS interface. Frankly, Docutek’s sucked. Maybe we have an outdated version of Eres, but the RSS feeds would give errors because you have to click through a copyright scare page before you can view reserves, but you can’t link the RSS links to this form and get the item. I wrote a little Ruby/Camping app that takes urls like: http://eres.library.gatech.edu/course/WS-1001-A/Fall/2007 and turns that into a usable feed. I needed the course id/term/year format to show them in Sakai. My favorite part of this project was finding Rubyscript2exe. This allows me to just bundle one file (the compiled camping app) plus a configuration file. Granted, an asp.net app would be even easier for sites to install, but I didn’t have time to learn asp.net. I have more ideas of what I would like to do with this (such as show current circ status for physical reserves), but in the chaos that is our library reorg, I haven’t gotten around to even showing anybody what I’ve written so far.
  2. I broke ground on a Metalib X-Server Ruby library. It took me a while to wrap my head around how this needed to be modelled, but I think it’s starting to take shape. It doesn’t actually perform queries, yet, but it connects to the server, allows you to set the portal association and find and set the category to search. Quicksets and MySets are all derivations of the same concept, so I don’t think it will take me long to incorporate actual searching. For proof-of-concept, I plan on embedding this library in MemoryHole, our Solr-based discovery app. I’ve actually stopped development of MemoryHole so we can focus on vuFind, since they do functionally the same thing and I’d rather help make vuFind better than replicate everything it does, only in Ruby. The reason I’m doing this proof-of-concept in MemoryHole rather than vuFind is solely due to familiarity and time.
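For the first item, the course id/term/year URL scheme boils down to something like this. It is a plain-Ruby sketch of the path parsing only; the regex and method name are assumptions, and the Camping routing and the Eres scraping are not shown.

```ruby
# Matches paths such as /course/WS-1001-A/Fall/2007
COURSE_PATH = %r{\A/course/(?<course_id>[^/]+)/(?<term>[^/]+)/(?<year>\d{4})\z}

# Returns the three pieces as a hash, or nil when the path doesn't fit.
def parse_course_path(path)
  match = COURSE_PATH.match(path)
  return nil unless match

  { course_id: match[:course_id], term: match[:term], year: match[:year].to_i }
end

p parse_course_path('/course/WS-1001-A/Fall/2007')
```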

In other news, my last post seems to have caused a bit of a stir. My plan is to write a response, but the short of it is that I feel the arguments for an MLS are extremely classist.

Also my bathroom is finished and it looks great.

ERESidue




* In search of… Bigfoot

Posted on March 29th, 2007 by Ross. Filed under coding, libraries, ruby.


Before I left for Guatemala, Ian Davis at Talis asked if I could give him a dump of our MARC records to load into Talis Platform. I had been talking in the #code4lib channel about how I was pushing the idea of using Talis Source to make simple, ad-hoc union catalogs; we could make one for Georgia Tech & Emory (we have joint degree programs) or Arche or Georgia Tech/Atlanta-Fulton Public Library, etc. My thinking was that by utilizing the Talis Platform, we could forgo much of the headache in actually making a union catalog for somewhat marginal use cases (the public library one notwithstanding).

About a week after I got back from Guatemala, I had an email from Richard Wallis with some urls to play around with to access my Bigfoot store. He showed me search services, facet services and augment services. I was unable to really dive into it much at the time but since I’m working on a total site search project for the library, I thought this would be a good chance to kick the tires a bit to include catalog results.

After two days of poking around, I have formed some opinions of it, have some recommendations for it, and have written a Ruby library to access it.

1) The Item Service

This is certainly the most straightforward and, for many people, the most useful service of the bunch. The easiest way to think of the item service is as an HTTP based Lucene service (a la Solr or Lucene-WS) of your bib records. It returns something OpenSearch-y (it claims to be an RSS 1.0 document), but it doesn’t validate. That being said, FeedTools happily consumed it (more on that later) and the semantics should be familiar to anyone that has looked at OpenSearch before. Each item node also contains a Dublin Core representation of the record and a link to a marcxml representation. I’m not sure if there’s a description document for Bigfoot.

Although the query syntax is pure Lucene (title:"The Lexus and the Olive Tree"), the downside is that it’s not documented anywhere what the indexes are and I doubt there would be any way to add new ones (for example, my guess is I wouldn’t be able to get an index for 490/440$v that I use for the Umlaut). I don’t see returning the results as OAI_DC being too much of a problem, since the RSS item includes a title (which would have been tricky between the DC and the marcxml). My Ruby library might not generate valid DC; I haven’t really looked into it.

The docs also mention you can POST items to your Bigfoot store, but they don’t mention what your data needs to look like (MARC?) or what credentials you need to add something (I mean, it must be more than just your store name, right?). My hope is to add this functionality to bigfoot-ruby soon (especially since my data is from a bulk export from last October).

2) The Facet Service

This one is intriguing, definitely, since Faceted searching is all the rage right now.  The search syntax is basically the same as the Item Service, except you also send a comma delimited list of the fields you would like to query.  What you get back is either an XML or XHTML document of your results.

For each field you request, you get back a set of terms (you can specify how many you want, with a default of 5) that appear most frequently in your field.  You also get an approximation for how many results you would get in that facet and a url to search on that facet.  It’s quite fast, although, realistically, you can’t do much with the output of facet search alone.

Again, it’s difficult to know what you can facet on (subject, creator and date are all useful, and I’m sure there are others) and the facet that (for me, at least) held the most promise, type, is too broad to do much with (it uses Leader position 7, but lumps the BKS and SER types all in a label called “text”).  I would like to see Talis implement something like my MARC::TypedRecord concept so one could facet on things like government document or conference.  You could separate newspapers from journals and globes from maps.  Still, the text analysis of the non-fixed fields is powerful and useful and beats the hell out of trying to implement something like that locally.

In bigfoot-ruby, I have provided two ways to do a faceted search:  you can just do the search and get back Facet objects containing the terms and search urls or you can facet with items which executes the item searches automatically (in turn getting a definitive number of results for the query, as well).  Since I didn’t bother to implement threading, getting facets with items can be pretty slow.
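For what it’s worth, the threading that is missing could look something like the sketch below: one thread per facet term, each running its own item search. This is not bigfoot-ruby code; `facet_with_items` and the block standing in for the item search are made up for illustration, and a real version would want a cap on the number of threads.

```ruby
# Run one item search per facet term concurrently and collect the
# results into a Hash keyed by term. The block stands in for whatever
# actually performs the HTTP request.
def facet_with_items(terms, &search)
  threads = terms.map do |term|
    Thread.new { [term, search.call(term)] }
  end
  threads.map(&:value).to_h # Thread#value waits for each thread to finish
end

# Stub search: pretend every term returns a result count.
counts = facet_with_items(%w[history science fiction]) do |term|
  sleep 0.1 # simulated network latency
  term.length
end
p counts
```

Since the time goes to waiting on HTTP responses rather than to computation, plain Ruby threads are enough for the searches to overlap.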

3) The Augment Service

To be honest, I’m having a hard time figuring out useful scenarios for the augment service.  The idea is that you give it the URI of an RSS feed, and this service will enhance it with data from your Bigfoot store (at least, that is sort of how I understand it works).  Richard’s example for me was to feed it the output of an xISBN query (which isn’t in RSS 1.0, AFAIK, but, for the sake of example…) and the augment service would fill in the data for ISBNs your library holds.  The API example page mentions Wikipedia, but I don’t know where, other than the Talis Platform, you can get Wikipedia entries formatted properly.  I tried sending it the results of an Umlaut2 OpenSearch query, but it didn’t do anything with it.  Presumably this RSS 1.0 feed needs the bib data to be sent in a certain way (my guess is in OAI_DC, like the Item Service), but I’m not sure.  The only use case I can think of for this service is a much simpler way to check for ISBN concordance (rather than isbn:(123456789X|223456789X|323456789X|etc.)).

Overall, I’m really impressed with the Talis API.  It is a LOT easier to use than, say, Z39.50 and, by using OpenSearch, seems more natural to integrate into existing web services than SRU.

Bigfoot-ruby is definitely a work in progress.  I think I would like to split the Search class into ItemService and FacetService.  I don’t like how results is an Array for items and a Hash for facets.  It just seems sloppy.  I need to document it, of course, and I would like to implement Item POST.  This project also made me realize how bloody slow FeedTools is.  I am currently using it in both the Umlaut and the Finding Aids to provide OpenSearch, but I think it’s really too sluggish to justify itself.

Thanks, Talis, for getting me started with Bigfoot and giving me the opportunity to play around with it.  Also, thanks to Ed Summers for fixing SVN on Code4lib.org.  You wouldn’t be able to download it and futz around with it yourself, otherwise.




* Pay no attention to the man behind the curtain

Posted on February 18th, 2007 by Ross. Filed under ruby.


port install rubygems
