I have been slowly taking the MARC codes lists and modeling them as linked data. I released a handful of them several months ago (geographic area codes, countries and languages) and have added more as I get inspired or have some free time. Most recently, I’ve added the Form of Item, Target Audience and Instruments and Voices terms.
The motivation behind modeling these lists is that they are extremely low-hanging fruit: they are controlled vocabularies that (usually) appear in the MARC fixed (or, at any rate, controlled) fields. What this means is that they should be pretty easy to link to from our source data. The identifiers are based on the actual code values in an effort to not actually have to look anything up when converting MARC into RDF.
I’ll go over each code list and explain what its function is and how to link to it from MARC:
The purpose of the geographic area codes is a little vague: it’s hard to classify exactly what they are. There are states (Tennessee), countries (India), continents (Europe), geographic features (Andes, Congo River, Great Rift Valley), areas or regions (Tropics, “Southwest, New” –whatever that means–, “Africa, French-speaking Equatorial”), hemispheres (Southern hemisphere), planets (Uranus) and then there are entries for things like “Outer Space” and “French Community” (which, as I understand it, is sort of the French analog to the British Commonwealth); in short, they are all over the map (literally).
I have modeled these things as wgs84:SpatialThings. I don’t know if that is 100% appropriate (e.g. “French Community”) and am open to recommendations for other classes. Given that they are somewhat hierarchical and are used to define the geographic “subject” of a work, it might be more appropriate to model them using SKOS.
The geographic area code is found in the MARC 043$a (which is a repeatable subfield in a non-repeatable field) and should be a seven-character string (although this may vary based on local cataloging practices). Most codes are shorter than this, so the specification requires right-padding them with hyphens (“-”) to seven characters (“aa-----”). To turn this into a MARC Codes URI, you’ll drop the trailing hyphens and append “#location”:
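As a rough sketch of that normalization in Python: the function name `gac_uri` and the base URI are placeholders of mine, not the real namespace, and stripping surrounding whitespace is an extra assumption about messy data.

```python
def gac_uri(code, base="http://example.org/marccodes/gacs/"):
    """Build a URI from a raw 043$a value.

    `base` is a hypothetical placeholder, not the real MARC Codes namespace.
    """
    # Remove stray whitespace (an assumption), then the right-padding hyphens.
    # rstrip only touches the end, so internal hyphens are preserved.
    normalized = code.strip().rstrip("-")
    if not normalized:
        raise ValueError("empty geographic area code")
    return f"{base}{normalized}#location"


print(gac_uri("n-us---"))  # http://example.org/marccodes/gacs/n-us#location
print(gac_uri("e-uk-en"))  # http://example.org/marccodes/gacs/e-uk-en#location
```

Only the trailing hyphens go away; a code that already fills all seven characters passes through unchanged.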
I’m not sure what is actually the “best” property to use to link to these resources, but I have been using <http://purl.org/dc/terms/spatial> (although, admittedly, not consistently). This would entail that these resources are also a <http://purl.org/dc/terms/Location> which is something I can live with.
Not all of the geographic area codes are linked to anything, but some are linked to the authorities at http://id.loc.gov/authorities/, dbpedia, geonames, etc.
The country codes are a little more consistent than the geographic area codes, but they are definitely not all “countries”. With a few exceptions (United States Misc. Caribbean Islands) they are actual “political entities”: countries (Guatemala) and states, provinces or territories (Indiana, Nova Scotia, Gibraltar, Gaza Strip).
Like the geographic area codes, I’ve modeled these as wgs84:SpatialThings.
They can appear in several places in the MARC record. They will almost always appear in the 008 in positions 15-17 as the “country of publication”. If one code isn’t enough to convey the full story of the production of a particular resource (!), the code may also appear in the 044$a (repeatable subfield, non-repeatable field). There are a few other fields that the country codes could appear in (the 535$g, 775$f and 851$g), but I have no idea how common it would be to find them there, and they have a different meaning: the 535 and 851 define the location of the item, for example.
To generate the country code URI, take the value from the MARC 008[15-17] or 044$a, strip any leading or trailing spaces and append “#location”. The URIs look like:
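Something like the following Python would do it. This is only a sketch: the function names and the base URI are hypothetical placeholders, and the 008 is assumed to be available as a plain 40-character string.

```python
def country_code_from_008(field_008):
    """Return the code in 008 positions 15-17 (zero-based), minus padding spaces."""
    return field_008[15:18].strip()


def country_uri(code, base="http://example.org/marccodes/countries/"):
    """Build a URI from a country code; `base` is a placeholder namespace."""
    code = code.strip()
    if not code:
        raise ValueError("empty country code")
    return f"{base}{code}#location"


# A made-up 008 value: date entered, date type, dates, place, filler, language.
sample_008 = "850101s1985    xxu" + " " * 17 + "eng d"
print(country_uri(country_code_from_008(sample_008)))
# http://example.org/marccodes/countries/xxu#location
```

Values from the 044$a can be passed straight to `country_uri`, since it strips the padding itself.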
To link to these resources, I’ve been using the RDA:placeOfPublication property, although I’m sure there are plenty of others that are appropriate (seems like a logical property for BIBO, for example).
The original code lists are also grouped by region, but there are no actual codes for this. I created some for the purposes of linked data:
etc. (until 12).
Because MARC uses the country codes only to note the place of publication, they are far less valuable than the geographic area codes (which are much more ambiguous in meaning). It’s much more interesting when you can say that all of these things:
are referring to the same place as all of these things:
which, in turn, are referring to the same place as this:
which, in my mind, has tremendous potential.
Unbeknownst to me prior to undertaking this project, the Library of Congress is actually the maintenance agency for ISO 639-2 and the ISO codes are actually a derivative of the MARC codes list. They aren’t actually a 1:1 mapping (there are 22 codes that are different in the ISO list), but they’re extremely close. What is particularly nice about this is that most locale/language libraries are aware of these codes so it’s fairly easy to map to other locales (notably ISO 639-1, which is used by xml:lang).
The Library of Congress publishes an XML version of the list which is what I used to model it as linked data. One of the nice features of this list was that it has attributes on the name that denote whether or not there’s an authority record for it:
which we can then take, tack the substring “ language” onto it and look it up in http://id.loc.gov/authorities:
giving us a link between things created in a particular language and things created about that language.
To use the language codes, take the value of positions 35-37 of the 008 or the 041 (each subfield covers a different part of the resource that may be in a different language, so check the spec on this one). I doubt it appears often in actual data, but the 242$y might have the language of the translated title.
Take that value (be sure to strip any trailing/leading whitespace — it’s supposed to be 3 characters: no more, no less) and plug it into the following URI template:
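A minimal Python sketch of that check follows. The template string is a stand-in I made up, and insisting on lowercase ASCII letters is my own assumption about what a well-formed code looks like.

```python
import re

# Hypothetical template; the real one is not reproduced here.
LANGUAGE_TEMPLATE = "http://example.org/marccodes/languages/{code}"


def language_uri(raw, template=LANGUAGE_TEMPLATE):
    """Strip whitespace, insist on exactly three letters, fill in the template."""
    code = raw.strip()
    if not re.fullmatch(r"[a-z]{3}", code):
        raise ValueError(f"not a three-letter language code: {raw!r}")
    return template.format(code=code)


print(language_uri(" eng "))  # http://example.org/marccodes/languages/eng
```

Rejecting anything that isn’t three letters means blank or otherwise uncoded 008 values raise an error instead of producing a bogus URI.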
The language resources link to id.loc.gov (as mentioned above) as well as Lingvoj/Lexvo (they link to both, where appropriate, since there are likely many data sources out there still using the Lingvoj URIs). There are a handful (for example, Swedish) that link to dbpedia, but since those links are available in Lexvo, it’s not essential they appear here.
There are two codes lists that are directly related to music-based resources (sound recordings, scores and video): the musical composition codes and the Instruments and Voices codes. Given that there has been a lot of work put into modeling music data for the linked data cloud, I thought it would be most useful to orient both of these lists to be used with the Music Ontology.
The composition codes basically denote the “genre” of the music contained in the resource. It’s extremely classical-centric and sometimes lumps a lot of different forms into one genre code (try Divertimentos, serenades, cassations, divertissements, and notturni on for size), but they are definitely a start for finding like resources.
They are modeled as mo:Genre resources and include links to id.loc.gov, dbpedia and wikipedia. To get the code, use either positions 18-19 of the MARC 008 field or the 047$a (both the field and the subfield are repeatable). The normalized code should always be two alphabetic characters, lowercased.
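The normalization itself is tiny. Here is a sketch in Python; `composition_code` is just an illustrative name, and returning `None` for anything malformed is my choice rather than part of any spec.

```python
def composition_code(raw):
    """Normalize a raw composition code to two lowercase ASCII letters.

    Returns None when the value is not two letters after stripping.
    """
    code = raw.strip().lower()
    if len(code) == 2 and code.isascii() and code.isalpha():
        return code
    return None


print(composition_code("SY"))  # sy
print(composition_code("s1"))  # None
```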
They go into a URI template like:
It would be really useful to find other datasources that use mo:Genre to link these to.
Form of Item is a very small list that broadly describes the format of the resource being described. It is probably most useful with dcterms:format, so the terms have all been modeled with the rdf:type dcterms:MediaType. A full third of the codes describe microforms (granted, out of 9 total), which should give you some sense of how relevant these are.
Getting the code from the MARC record depends on the kind of record you’re looking at. For books, serials, sound recordings, scores, computer files and mixed materials, take position 23 of the 008. For visual materials and maps, use position 29. The code should be a single lowercase alphabetic character.
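In Python, that lookup might be sketched like this. The material-type labels are hypothetical keys of my own, and I’m reading the positions the way MARC numbers them, starting from zero.

```python
# Zero-based 008 position of the form of item code, keyed by material type.
# The key names are illustrative labels, not MARC terminology.
FORM_OF_ITEM_POSITION = {
    "books": 23,
    "serials": 23,
    "sound recordings": 23,
    "scores": 23,
    "computer files": 23,
    "mixed materials": 23,
    "visual materials": 29,
    "maps": 29,
}


def form_of_item_code(field_008, material):
    """Return the one-letter code, or None if the position holds no letter."""
    position = FORM_OF_ITEM_POSITION[material]
    code = field_008[position:position + 1]
    if len(code) == 1 and "a" <= code <= "z":
        return code
    return None


sample_008 = " " * 23 + "a" + " " * 16  # made-up 008 with only position 23 set
print(form_of_item_code(sample_008, "books"))  # a
print(form_of_item_code(sample_008, "maps"))   # None
```

A blank or otherwise non-alphabetic position comes back as `None`, so the caller can simply skip the link.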
URIs look like:
The resources link to http://id.loc.gov/authorities (think Genre/Form terms), http://id.loc.gov/vocabulary/graphicMaterials and (for a couple) dbpedia.
Ideally, these will eventually link to whatever is analogous in RDA (if somebody can point that out to me).
Unlike the previous code list, the frequency codes seem much more useful. They are used to define how often a continuing resource is updated. Unfortunately, the list is extremely print-centric (the only term more frequent than “daily” is “Continuously updated” which is defined as “Updated more frequent than daily.”), but some of the terms would seem to hold value even outside of the library context (Annual, Biweekly, Quarterly, etc.). It doesn’t take a tremendous leap of the imagination to see how these might be useful for events calendars (Monthly, etc.) or for GoodRelations-type datasets (“Semi-annual Blowout Sale!”).
To get the code from the MARC record, check the 008 or the 853-855$w. Presumably, this should only appear for continuing resources (SER). It’s a one letter code, lower cased.
The URIs look like:
They are modeled as dcterms:Frequency resources and link to dbpedia where available.
Target Audience is another fairly short, extremely generalized list. It is most likely to be useful for determining the age level of children’s resources (5 of the 8 terms are for juvenile age groups). The terms are of rdf:type dcterms:AgentClass. Resources are linked (where appropriate, and maybe even a few where it isn’t) to dbpedia and http://id.loc.gov/authorities/.
For books, music (scores, sound recordings), computer files and visual materials, get the code from the 008. It is one letter, lower cased. URIs follow the fairly consistent form we’ve seen thus far:
The Instruments and Voices terms describe the instruments or vocal groups that either appear on a resource (a sound recording, for example) or that it is intended for (a score). Like many of the other codes lists, these are quite general and maddeningly biased towards classical music (Continuo, Celeste, Viola d’amore, but no banjo or sitar, for instance). Like the form of musical composition terms, I modeled these to use with the Music Ontology, namely as the object of mo:instrument. mo:Instrument has this note:
Any taxonomy can be used to subsume this concept. The default one is one extracted by Ivan Herman
from the Musicbrainz instrument taxonomy, conforming to SKOS. This concept holds a seeAlso link
towards this taxonomy.
so these terms have been modeled as skos:Concepts. There are skos:exactMatch relationships to the Musicbrainz taxonomy where appropriate (as well as links to id.loc.gov/authorities and dbpedia). The original code lists had an implication of hierarchy (“Larger ensemble – Dance orchestra” should be thought of as “Dance orchestra” with broader term “Larger ensemble”), but that’s not actually used in MARC. I broke these broader terms out on their own for this vocabulary, since it seemed useful in a linked data context and wouldn’t actually hurt anything (the codes are two letters, so the “broader terms” are just using the first letter).
To get the code, use the MARC 048 subfield a or b (for ensemble or solo parts, respectively) and take the first two characters (which must be letters). This code may be followed by a two-digit number (left-padded with zeroes) signifying how many parts. Drop this number, if present.
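Here is one way to sketch that parsing in Python. The function name is illustrative, lowercasing the input is my own assumption, and the “broader” value simply follows the first-letter convention described earlier.

```python
import re

# Two letters, optionally followed by a two-digit part count.
_CODE_048 = re.compile(r"([a-z]{2})(\d{2})?")


def parse_048(value):
    """Return (code, broader_code) for an 048 $a/$b value, or None if malformed."""
    match = _CODE_048.fullmatch(value.strip().lower())
    if match is None:
        return None
    code = match.group(1)    # the part count in group 2 is dropped
    return code, code[0]     # broader term is the first letter


print(parse_048("ka01"))  # ('ka', 'k')
print(parse_048("sa"))    # ('sa', 's')
print(parse_048("k1"))    # None
```

Using `fullmatch` means trailing junk after the digits is treated as malformed rather than silently ignored.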
I am not sure when or if I will model any more codes lists. Ideally, the Library of Congress should be doing these (they’ve done the relator codes, and preservation events lists). The only other lists I can see much value in are the Specific Material Form Terms (the MARC 007) and the MARC Organization codes.
I have done a bit of work on the specific material forms list, but it’s fairly complicated. My current approach is a hybrid of controlled vocabularies and RDF schema (after all, it makes sense for a globe to be rdf:type <http://purl.org/NET/marccodes/smd/terms/Globe> rather than that be some property set on an untyped resource). For an RDF schema, though, I would prefer a “better” namespace than purl.org/NET/, although perhaps it doesn’t really matter much.
No matter what, it would certainly push the limits of my freebie Heroku account that this is currently running on.
I am definitely open to any ideas or recommendations people might have for these (and requests for other lists to be converted). I’d also be interested to see if you are able to use them with your data.