Archive for the ‘SRU’ Category
* Explosive Diaereses
Posted on March 28th, 2006 by Ross. Filed under OpenURL, Polishing the Turd, Ruby on Rails, SRU, coding, libraries, php, unapi.
I’ve been waiting for a while to have this title. Well, actually, not a long while, and that’s testimony to how quickly I’m able to develop things in Rails.
While I think SFX is a fine product and we are completely and utterly dependent upon it for many things, it does still have its shortcomings. It is not a terribly intuitive interface (no link resolver that I’m aware of has one) and there are some items it just doesn’t resolve well, such as conference proceedings. Since conference proceedings and technical reports are huge for us, I decided we needed something that resolved these items better. That’s when the idea of the übeResolver (now mainly known as ‘the umlaut’) was born.
Although I had been working with Ed Summers on the Ruby OpenURL libraries before Code4Lib 2006, I really began working on umlaut earlier this month when I thought I might have something coherent together in time before the ELUNA proposal submission deadline. Although I barely had anything functional on the 8th (the deadline — 2 days after I had really broken ground), I could see that this was actually feasible and doable.
Three weeks later and it’s really starting to take shape (although it’s really, really slow right now). Here are some examples:
A book: ‘Advances in Communication Control Networks’
Granted, the conference proceeding is less impressive as a result of IEEE being available via SFX (although, in this case, it’s getting the link from our catalog) and the fact that I’m having less luck with SPIE conferences (they’re being found, but I’m having some problems zeroing in on the correct volume — more on that in a bit), but I think that since this is the result of < 14 days of development time, it isn't a bad start.
Now on to what it's doing.
If the item is a "book", it queries our catalog for ISBN; asks xISBN for other matches, queries our catalog for that; does a title/author search; does a conferenceName/title/year search. If there are matches, it then asks the opac for holdings data. If the item is either not held or not available, it does the same to our consortial catalog. Currently it’s doing both, regardless, because I haven’t worried about performance.
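As a rough illustration of that lookup cascade (not umlaut's actual code, which is written in Rails), here is a minimal Python sketch. Every name in it is hypothetical: each "strategy" stands in for one of the searches listed above (ISBN, xISBN alternatives, title/author, conferenceName/title/year), and `is_available` stands in for the holdings check. It also assumes the consortial catalog is consulted only when nothing local is available, which is the intended behaviour rather than the current "do both regardless" one.

```python
def find_matches(citation, strategies):
    """Run every search strategy in order and collect unique record ids."""
    seen = []
    for strategy in strategies:
        for record_id in strategy(citation):
            if record_id not in seen:
                seen.append(record_id)
    return seen


def resolve_book(citation, local_strategies, consortial_strategies, is_available):
    """Return (source, record_ids), preferring locally available copies.

    The consortial catalog is queried only when the item is not held
    locally or no local copy is available (an assumption of this sketch).
    """
    local = find_matches(citation, local_strategies)
    if any(is_available(record_id) for record_id in local):
        return ("local", local)
    consortial = find_matches(citation, consortial_strategies)
    if consortial:
        return ("consortial", consortial)
    return ("local", local) if local else ("none", [])
```

Keeping each search as its own callable means a new lookup can be added to the list without touching the fallback logic.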
It checks the catalog via SRU and tries to fill out the OpenURL ContextObject with more information (such as publisher and place). This would be useful to then export into a citation manager (which most link resolvers have fairly minimal support for). While it has the MODS records, it also grabs LCSH and Table of Contents (if they exist). When I find an item with more data, I’ll grab it as well (such as abstracts, etc.).
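For readers who have not seen SRU before, a searchRetrieve request is just an HTTP GET with a handful of standard parameters. The sketch below builds such a URL in Python. The base URL, the `bath.isbn` index and the `mods` schema name are assumptions for illustration; which indexes and schemas a given server actually supports is advertised in its explain response.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, record_schema="mods", maximum_records=10):
    """Build an SRU 1.1 searchRetrieve URL for the given CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "recordSchema": record_schema,  # schema short name; server-dependent
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical catalog endpoint and ISBN, purely for illustration.
url = sru_search_url("http://catalog.example.edu/sru", 'bath.isbn = "0123456789"')
```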
It then queries Amazon Web Services for more information (editorial content, similar items, etc.).
It still needs to check SFX, but, unfortunately, that would slow it down even more.
For journals, it checks SFX first. If there’s no volume, issue, date or article title, it will try to get coverage information. Unfortunately, SFX’s XML interface doesn’t send this, so I have to get this information from elsewhere. When I made our Ejournal Suggest service, I had to create a database of journals and journal titles and I have since been adding functionality to it (since I am running reports from SFX for titles and it includes the subject associations, I load them as well — it includes coverage, too, so including that field was trivial). So when I get the SFX result document back, I parse it for its services (getFullText, getDocumentDelivery, getCitedBy, etc.) and if no article information is sent, I make a web service request to a little PHP/JSON widget I have on the Ejournal Suggest database that gets back coverage, subjects and other similar journals based on the ISSN. The ‘other similar journals’ are 10 (arbitrary number) other journals that appear in the same subject headings, ordered by number of clickthroughs in the last month. This doesn’t appear if there is an article, because I haven’t decided if it’s useful in that case (plus the user has a link to the ‘journal level’ if they wish).
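The "other similar journals" ranking is simple enough to sketch. The real widget is PHP on top of a database; the Python below is only an in-memory approximation, and the two dictionaries (subjects per ISSN, clickthroughs per ISSN for the last month) are hypothetical stand-ins for those tables.

```python
def similar_journals(issn, subjects_by_issn, clicks_last_month, limit=10):
    """Return up to `limit` other ISSNs sharing a subject, most clicked first."""
    own_subjects = subjects_by_issn.get(issn, set())
    candidates = [
        other
        for other, subjects in subjects_by_issn.items()
        if other != issn and own_subjects & subjects  # at least one shared subject
    ]
    # Journals with no recorded clickthroughs sort last.
    candidates.sort(key=lambda other: clicks_last_month.get(other, 0), reverse=True)
    return candidates[:limit]
```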
Umlaut then asks the opac for holdings and tries to parse the holdings records to determine if a specific issue is held in print (this works well if you know the volume number — I have thought about how to parse just a year, but haven’t implemented it yet). If there are electronic holdings, it attempts to dedupe.
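To give a feel for the holdings check, here is a small Python sketch that decides whether a given volume falls inside a holdings statement. The statement format it assumes (`v.1(1990)-v.12(2001), v.15(2004)-`) is a guess at a common free-text style, not necessarily what our catalog returns, and real holdings data is far messier than this.

```python
import re

VOLUME = re.compile(r"v\.\s*(\d+)")


def volume_ranges(statement):
    """Parse a holdings statement into (start, end) volume ranges.

    `end` is None for an open-ended range such as "v.15(2004)-".
    """
    ranges = []
    for piece in re.split(r"[;,]", statement):
        volumes = [int(v) for v in VOLUME.findall(piece)]
        if not volumes:
            continue
        start = volumes[0]
        if len(volumes) > 1:
            end = volumes[-1]
        elif piece.rstrip().endswith("-"):
            end = None  # still being received
        else:
            end = start  # a single volume
        ranges.append((start, end))
    return ranges


def holds_volume(statement, volume):
    """True if `volume` falls inside any range of the holdings statement."""
    return any(
        start <= volume and (end is None or volume <= end)
        for start, end in volume_ranges(statement)
    )
```

Matching on a year alone would need the parenthesised dates as well, which this sketch deliberately ignores.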
There is still a lot more work to do with journals, although I hope to be able to implement this soon. The getCitedBy options will vary from grad students/faculty to undergrads. Since we have very limited seats to Web of Science, undergraduates will, instead, get their getCitedBy links to Google Scholar. Graduate students and faculty will get both Web of Science and Google Scholar. Also, if no fulltext results are found, it will then go out to the search engines to try to find something (whether it finds the original item or a postprint in arxiv.org or something). We will also have getAbstracts and getTOCs services enabled so the user can find other databases that might be useful or table of contents services, accordingly. Further, I plan on associating the subject guides with SFX Subjects and LCC, so we can make recommendations from a specific subject guide (and actually promote the guide a bit) based, contextually, on what the user is already looking at. By including the SFX Target name in the subject items (which is an existing field that’s currently unused), we could also match on the items themselves.
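The planned getCitedBy rule boils down to a lookup on user status. A minimal Python sketch, assuming hypothetical status labels ("undergraduate", "graduate", "faculty") that are not taken from any real patron database:

```python
def cited_by_targets(user_status):
    """Pick getCitedBy targets by user status.

    Everyone gets Google Scholar; the limited Web of Science seats are
    reserved for graduate students and faculty.
    """
    targets = ["Google Scholar"]
    if user_status in {"graduate", "faculty"}:
        targets.insert(0, "Web of Science")
    return targets
```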
The real value in umlaut, however, will come in its unAPI interface. Since we’ll have Z39.88 ContextObjects, MODS records, Amazon Web Services results and who knows what else, umlaut could feed an Atom store (such as unalog) with a whole hell of a lot of data. This would totally up the ante of scholarly social bookmarking services (such as Connotea and Cite-U-Like) by behaving more like personal libraries that match on a wide variety of metadata, not just url or title. The associations that users make can also aid umlaut in recommendations of other items.
The idea here is not to replace the current link resolver; the intention is to enhance it. SFX makes excellent middleware, but I think its interface leaves a bit to be desired. By utilizing its strengths, we can layer more useful services on top of it. Also, a user can add other affiliations that they belong to in their profile, so umlaut can check their local public library or, if they are taking classes at another university, they can include those.
At this point I can already hear you saying, “But Ross, not everyone uses SFX”. How true! I propose a microformat for link resolver results that could be parseable by umlaut (and in an ‘eating your own dog food’ fashion, will add this to umlaut’s template, eventually), making any link resolver available to umlaut.
There is another problem that I’ve encountered while working on this project, though. Last week and the week before, while I was doing the bulk of the SRU development, I kept noticing (and reporting) our catalog (and, more often, its Z39.50 server) going down. Like many times a day. After concluding that, in fact, I was probably causing the problem, I finally got around to doing something that I’ve been meaning to do for months (and that I would recommend to everyone else if they want to actually make useful systems): exporting the bib database into something better. Last week I imported our catalog into Zebra and sometime this week I will have a system that syncs the database every other hour (we already have the plumbing for this for our consortial catalog). I am also experimenting with Cheshire3 (since I think its potential is greater; it’s possible we may use both for different purposes). The advantage to this (besides not crashing our catalog every half hour) is that I can index it any way I want or need to, as well as store the data any way I need to, in order to make sure that users get the best experience they can.
Going back to the SPIE conferences, there is no way in Voyager that I can limit my results to fewer than 360+ results for “SPIE Proceedings” in 2003. At least, not from the citations I get from Compendex (which is where anyone would get the idea to look for SPIE Proceedings in our catalog, anyway). With an exported database, however, I could index the volume and pinpoint the exact record in our catalog. Or, if that doesn’t scale (for instance, if they’re all done a little differently), I can pound the hell out of our Zebra (or Cheshire3 or whatever) server looking for the proper volume without worrying about impacting all of our other services. I can also ‘game the system’ a bit and store bits in places that I can query when I need them. Certainly this makes umlaut (and other services) more difficult to share with other libraries (at least, other libraries that don’t have similar setups to ours), but I think these sorts of solutions are essential to improving access to our collections.
Oh yeah, and lest you think that mirroring your bib database is too much to maintain: Zebra can import marc records (so you can use your opac’s marc export utility) and our entire bib database (705,000 records) takes up less than 2GB of storage. The more indexes added, the larger the database size, of course, but I am indexing a LOT in that.
* But, you see, my wheel is nothing like all of those other seemingly identical wheels
Posted on October 5th, 2005 by Ross. Filed under DSpace, OAI, Python, SRU, coding, libraries.
I am still feeling my way around Python. I have yet to grasp the zen of being Pythonic, but I am at least coming to grips with real object orientation (as opposed to the named hashes of PHP) and am actually taking the leap into error handling, which, if you have dealt with any of the myriad bugs in any of my other projects, you’d know has been a bit of a foreign concept to me.
Python project #2 is known as RepoMan (thanks to Ed Summers for the name). It attempts to solve a problem that not one but two other open source projects already have solved admirably (I’ll go into more about this in a bit). RepoMan is an OAI Repository indexer that makes said repository available via SRU. I created it in an attempt to make our DSpace implementation searchable from remote applications (namely, the site search and the upcoming alternative opac). It’s an extremely simple two-script project that has only taken a week to get running, largely due to the existence of two similar and available Python scripts that I could modify for my own use. It’s also due to the help of Ed Summers and Aaron Lav.
The harvester is, basically, Thom Hickey’s one page OAI harvester with some minor modifications. I have added error handling (the two lines I added to compensate for malformed xml must have been over the “one page limit”) and instead of outputting to a text file, it shoves the records in a Lucene index (thanks to PyLucene). This part still needs some work (I’m not sure what it would do with an “updated” record, for example), but it makes a nice index of the Dublin Core fields, plus a field for the whole record, for “default” searches. This was a good exercise for me to work with xml, Python and Lucene, because I was having some trouble when trying to index the MODS records for the alternative opac.
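To make the indexing step concrete, here is a small sketch of turning an OAI-PMH response into index-ready documents: one entry per Dublin Core field, plus an "all" field for default searches. This is not RepoMan's code. It uses only the standard library, leaves out the Lucene part entirely, and the sample record and the returned dictionary layout are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample response, trimmed to a single record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:subject>Optics</dc:subject>
          <dc:subject>Lasers</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_to_documents(oai_xml):
    """Convert an OAI-PMH ListRecords response into a list of dicts."""
    try:
        root = ET.fromstring(oai_xml)
    except ET.ParseError:
        return []  # skip malformed responses instead of crashing the harvest
    documents = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier", default="")
        fields = {}
        for element in record.iter():
            text = (element.text or "").strip()
            if element.tag.startswith(DC) and text:
                fields.setdefault(element.tag[len(DC):], []).append(text)
        # One catch-all field holding every value, for "default" searches.
        everything = " ".join(v for values in fields.values() for v in values)
        documents.append({"id": identifier, "fields": fields, "all": everything})
    return documents


docs = records_to_documents(SAMPLE)
```

Keying each document on the OAI identifier is one way an "updated" record could later replace its earlier version, though the sketch does not implement that.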
The SRU server is, basically, Dan Chudnov’s SRU implementation for unalog. It needed to be de-Quixotefied and is, in fact, much more robust than Dan’s original (of course, unalog’s implementation doesn’t need to be as “robust”, since the metadata is much more uniform), but certainly having a working model to modify made this go much, much faster. The nice part is that there might be some stuff in there that Dan might want to put back into unalog.
So, here is the result. The operations currently supported are explain and searchRetrieve, and the majority of CQL relations are unsupported, but it does most of the queries I need it to do and, most importantly, it’s less than a week old.
So the burning question here is: why on earth would I waste time developing this when OCKHAM’s Harvest-to-Query is out there, and, even more specifically, OCLC’s SRW/U implementation for DSpace is available? Further, I knew full well that these projects existed before I started.
Lemme tell ya.
Harvest-to-Query looked very promising. I began down this road, but stopped about halfway down the installation document. Granted, anything that uses Perl, TCL and PHP has to be, well, something… After all, those were the first three languages I learned (and in the same order!). Adding in IndexData’s Zebra seemed logical as well since it has a built-in Z39.50 server. Still, this didn’t exactly solve my problem. I’d have to install yazproxy, as well, in order to achieve my SRU requirement. Requiring Perl, TCL, PHP, Zebra and yazproxy is a bit much to maintain for this project. Too many dependencies and I am too easily distracted.
OCLC’s SRW/U seemed so obvious. It seemed easy. It seemed perfect. Except our DSpace admin couldn’t get it to work. Oh, I inquired. I nagged. I pestered. That still didn’t make it work. I have very limited permissions on the machine that DSpace runs on (and no permissions for Tomcat), so there was little I could do to help. It also would have solved only one specific problem, without necessarily addressing any other OAI providers that we might have.
So, enter RepoMan. Another wheel that closely resembles all the other wheels out there, but possibly with minor cosmetic changes. Let a thousand wheels be invented.
* Metasearch Metadata Metaphor
Posted on September 15th, 2005 by Ross. Filed under SRU, libraries, philosophizing.
SRW/U is to Yngwie J. Malmsteen as OpenSearch is to Keith Richards.
Yngwie Malmsteen is technically superior, but aesthetically unlistenable (unimplementable, in the case of SRW/U).
Keith Richards is sloppy and unsophisticated, and writes timeless melodies that resonate with the masses (OpenSearch is sloppy and unsophisticated; time will tell if OpenSearch becomes “timeless” [seems doubtful, honestly], but there are certainly a lot of OpenSearch targets).
Mike Taylor wrote a very insightful reaction to Dan’s worklog posting (and incredibly objective, given his investment and relationship to SRW/U). And he’s right.
And I’m right. Unless SRW/U can capture some of the mojo that OpenSearch has, it might as well be Yngwie J. Malmsteen.