What we talk about when we talk about http://dbpedia.org/resource/Love

I’ve been accused of several things in the Linked Data community this week: being a circular reasoner, a defender of the status quo “just because that’s how we’ve always done it”, and (implicitly) an httpRange-14 apologist.  Quite frankly, none of these is true or quite what I mean (and I’m, of course, overdramatizing the accusations), but let’s focus on the last point for now (which may clear up some of the other points, as well).

Ed’s post (as he explains at the end) is a reference to me calling bullshit on his claim that “[he] think[s] httpRange-14 is an elaborate scholarly joke”.  Let me be clear from the outset that I am not particularly dogmatic on this issue.  That is, I don’t think the internet will break if the resource and carrier are conflated, but I also don’t think it’s that hard to keep them separated, and I think the value in doing so outweighs any perceived costs.

First off, let me explain what httpRange-14 is to the uninitiated (skip on ahead if you feel pretty comfortable with this).  In linked data (or semantic web, you can choose the words that feel best to you), we run into a problem with identifiers and what, exactly, they are identifying.  Let’s say I want to talk about Chattanooga.  Well, “Chattanooga” is not a web resource, but if I want to talk about it unambiguously, it needs an identifier, preferably an HTTP URI, so other people can refer to it unambiguously and say things about it and discover it.  Ideally, this web representation would also have human readable (HTML) and machine readable (RDF, XML, etc.) versions.  But the important distinction here is that the city of Chattanooga cannot be retrieved on the web, only these HTML, RDF, XML surrogates.  If the surrogate has the same URI (identifier) as the resource it’s describing, it starts to get difficult to figure out what we’re talking about.

So to try to make this a little clearer, let’s say I am making this representation of Chattanooga for people to use:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://www.geonames.org/ontology#P.PPL> ;
    <http://www.geonames.org/ontology#population> "155554"^^xsd:integer.

But I also feel I need to let people know some administrative data about it, so they know when it was last modified and by whom, etc., so:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://www.geonames.org/ontology#P.PPL> ;
    <http://www.geonames.org/ontology#population> "155554"^^xsd:integer ;
    dcterms:creator <http://dilettantes.code4lib.org/about#me> ;
    dcterms:created "2010-07-09"^^xsd:date ;
    dcterms:modified "2010-07-09T11:25:00-06:00"^^xsd:dateTime .

Now things get confusing.  My new assertions (dcterms:creator/created/modified) are being applied to the same resource as my city, so I am saying that I created a city of 155,554 people today (what have you done today, chump?).
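To make the conflation concrete, here is a minimal sketch in Python that treats the statements above as plain (subject, predicate, object) tuples. The `gn:` prefix (standing in for the geonames ontology namespace) and the `describe` helper are assumptions made for illustration; this is not how any particular RDF library works.

```python
# Triples as (subject, predicate, object) tuples; prefixed names are plain strings.
# "gn:" is an assumed shorthand for http://www.geonames.org/ontology#
DOC = "http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf"

triples = [
    (DOC, "rdf:type", "gn:P.PPL"),
    (DOC, "gn:population", 155554),
    (DOC, "dcterms:creator", "http://dilettantes.code4lib.org/about#me"),
    (DOC, "dcterms:created", "2010-07-09"),
]

def describe(subject, triples):
    """Collect every predicate/object pair asserted about one subject (hypothetical helper)."""
    return {p: o for s, p, o in triples if s == subject}

description = describe(DOC, triples)

# City facts and document facts hang off the same identifier, so nothing
# distinguishes "I created a document" from "I created a city".
conflated = "gn:population" in description and "dcterms:created" in description
print(conflated)
```

With only one URI in play, a consumer has no way to tell which statements are about the place and which are about the file.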

The way we get around this is through a layer of indirection: basically, we just use two URIs.  You request an RDF document from http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf and it has something like:

<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place>
  rdf:type <http://www.geonames.org/ontology#P.PPL> ;
  <http://www.geonames.org/ontology#population> "155554"^^xsd:integer.
<http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee.rdf>
    rdf:type <http://xmlns.com/foaf/0.1/Document> ;
    <http://xmlns.com/foaf/0.1/primaryTopic> <http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place> ;
    dcterms:creator <http://dilettantes.code4lib.org/about#me> ;
    dcterms:created "2010-07-09"^^xsd:date ;
    dcterms:modified "2010-07-09T11:25:00-06:00"^^xsd:dateTime .

And this keeps things a little clearer.  I created the document you’re looking at today, not the resource that the document is describing.  So this way when you say that my RDF is terrible (fair accusation) you’re not necessarily saying that about the city of Chattanooga (and vice versa).  You can read more about this at Cool URIs for the Semantic Web (by the way, I tend to favor the “hash URI” approach, for simplicity’s sake).
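Part of why the hash URI approach is simple: the fragment is never sent to the server, so the place URI can only ever lead you to a document. A small sketch with Python’s standard `urllib.parse.urldefrag` shows the split. How the fragment-less URL then gets you to the `.rdf` file (content negotiation, a redirect, or something else) is not covered here and is an assumption about server setup.

```python
from urllib.parse import urldefrag

PLACE = "http://dilettantes.code4lib.org/resources/Chattanooga_Tennessee#place"

# An HTTP client strips the fragment before making the request, so
# dereferencing the place URI can only ever fetch a document.
document_url, fragment = urldefrag(PLACE)

print(document_url)
print(fragment)
```

The thing (`#place`) and the document that describes it end up with distinct identifiers without any extra round trips.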

Now back to Ed’s post.  His argument is that if he uses http://en.wikipedia.org/wiki/William_Shakespeare as his identifier (referent, really) we should be smart enough to know when we say that this URI is a foaf:Person and that it was dcterms:created on “2001-10-14” that we’re referring to two different things.

The first comment is from Ian (full disclosure: my boss, fuller disclosure: this doesn’t mean I agree with him) who simultaneously “completely agrees” with Ed and yet supplies an argument that punches a gigantic hole in the side of Ed’s thesis.

To put it another way, sure, maybe we can tell that dcterms:created is a strange assertion for a foaf:Person and we have other ways to tell that Shakespeare was born in 1564 (via a bio:Birth resource or something), but this breaks down for books and all sorts of other entities.  So you have dcterms:created “2003-09-04” and dcterms:creator <http://en.wikipedia.org/wiki/Douglas_Coupland> on http://en.wikipedia.org/wiki/Girlfriend_in_a_Coma_%28novel%29 and we’ve now sown some confusion.  This ambiguity becomes more problematic down the road when the context changes (that is, assumptions I can make about Wikipedia and Wikipedia’s model don’t necessarily apply elsewhere).

Right around the time I graduated from high school, the guitarist in my band at the time made me a cassette copy of Jimi Hendrix’s “Jimi Plays Monterey”.  The sound quality was pretty terrible and, as I recall, my tape player ate it once, making it even worse.  Still, I loved that album (Jimi, while playing Dylan’s “Like a Rolling Stone” says “I know I missed a verse, it’s alright, baby.”): I love the songs, I love the playing, I love the energy of the performance.  The medium that album came to me on, however, was subpar.  There are general attributes of “cassette tapes” and then there was “this particular recording on this particular cassette”.

At the same time in my life, I had a compact disc of the BulletBoys’ eponymous album.  Fidelity-wise, the sound of this album was orders of magnitude better than my copy of “Jimi Plays Monterey”, but pretty much everything else about it sucked.

The carrier is not the content.  Being able to refer to the quality of my dilapidated cassette without dragging the Jimi Hendrix Experience into it is useful.  I should be able to say that my BulletBoys CD sounded better than my Hendrix tape without that being a staggering example of bad taste.

In libraries, we have a long history of data ambiguity.  We have struggled enough to figure out the semantics in our AACR2/ISBD data that when we have the chance to easily and concretely identify the things we are talking about, we should take it.  I am not proposing abstracting things into oblivion with resources on top of resources – just sensibly being sure you’re talking about what you say you are.

Unfortunately, one of my problems with the new RDA vocabularies is that in several instances they schmush multiple statements together to avoid modeling the “hard parts” (this is precisely the same issue I have with Ian’s later comment).  For example, RDA has a bunch of properties that are intended to “hand wave” around the complexities of FRBR, such as http://RDVocab.info/Elements/otherDistinguishingCharacteristicOfTheExpression.  So you’d have something like:

<http://example.org/1>
    <http://RDVocab.info/Elements/title> "Something: a something something" ;
    <http://RDVocab.info/Elements/titleOfTheWork> "Something" .

What you’ve done here with “titleOfTheWork” is say that <http://example.org/1> has a work, is itself not a work and the work’s title is “Something”.   That’s some attribute!  But if we can say all of that, why would we not just model the work?! Even if we don’t know where in the WEMI chain <http://example.org/1> falls, if we did something like this:

<http://example.org/1>
    dcterms:title "Something: a something something" ;
    ex:hasWork <http://example.org/works/1234> .

<http://example.org/works/1234>
    a <http://RDVocab.info/uri/schema/FRBRentitiesRDA/Work>;
    dcterms:title "Something" .

we’ve now done something useful, unambiguous and reusable (and we’re not ignoring FRBR while simultaneously defining it).  The closed nature of IFLA’s development of these vocabularies doesn’t lead me to have much hope, though.
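As a rough illustration of the “reusable” part, here is a sketch in Python with the modeled version held as plain tuples. The second resource (`http://example.org/2`) and the `work_title` helper are hypothetical and added only to show that once the work has its own URI, anything can point at it.

```python
WORK = "http://example.org/works/1234"

triples = [
    ("http://example.org/1", "dcterms:title", "Something: a something something"),
    ("http://example.org/1", "ex:hasWork", WORK),
    ("http://example.org/2", "ex:hasWork", WORK),  # hypothetical second resource
    (WORK, "dcterms:title", "Something"),
]

def work_title(resource, triples):
    """Follow ex:hasWork from a resource, then read the work's own title."""
    for s, p, o in triples:
        if s == resource and p == "ex:hasWork":
            for s2, p2, o2 in triples:
                if s2 == o and p2 == "dcterms:title":
                    return o2
    return None

print(work_title("http://example.org/1", triples))
print(work_title("http://example.org/2", triples))
```

The work’s title is stated once, on the work, instead of being restated as a special-purpose attribute on everything related to it.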

But, again, back to Ed.  Like I said, I really don’t think the internet will fall apart and satellites will come crashing to the earth if we don’t adhere consistently to httpRange-14.  No, the reason why I call bullshit on Ed’s statement is that he finds the use of owl:sameAs on resources such as http://purl.org/NET/marccodes/muscomp/sn#genre to be inappropriate.  In his post he claims it’s fine that we conflate the resource of William Shakespeare as a foaf:Person and a foaf:Document that was modified on “2010-06-28T17:02:41-04:00”.  On the other hand, he questions the appropriateness of <http://purl.org/NET/marccodes/muscomp/sn#genre> owl:sameAs <http://dbpedia.org/resource/Sonatas>, because that statement implies that <http://purl.org/NET/marccodes/muscomp/sn#genre> has a photo collection at <http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Sonata> (which, in fact, has little to do with the musical genre and actually has a lot of pictures of Hyundais, among other things).

This is a perfectly fair, valid and important point (and one that absolutely needs to be addressed), but doesn’t this also mean he actually cares that we say what we really mean?
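The owl:sameAs consequence Ed objects to can be sketched in a few lines of Python. This is a deliberately simplified stand-in for a reasoner: it only copies statements between subjects linked by owl:sameAs and ignores the rest of OWL semantics. The predicate name `ex:hasPhotoCollection` is a placeholder I’m assuming for illustration, not necessarily the property DBpedia uses.

```python
GENRE = "http://purl.org/NET/marccodes/muscomp/sn#genre"
SONATAS = "http://dbpedia.org/resource/Sonatas"
PHOTOS = "http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Sonata"

triples = {
    (GENRE, "owl:sameAs", SONATAS),
    (SONATAS, "ex:hasPhotoCollection", PHOTOS),  # hypothetical predicate name
}

def apply_same_as(triples):
    """Copy statements across owl:sameAs links (subject position only) until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = [(s, o) for s, p, o in inferred if p == "owl:sameAs"]
        for a, b in pairs:
            for s, p, o in list(inferred):
                if p == "owl:sameAs":
                    continue
                # sameAs is symmetric, so copy in both directions
                for x, y in ((a, b), (b, a)):
                    if s == x and (y, p, o) not in inferred:
                        inferred.add((y, p, o))
                        changed = True
    return inferred

inferred = apply_same_as(triples)
print((GENRE, "ex:hasPhotoCollection", PHOTOS) in inferred)
```

Nobody asserted that the MARC genre code has a photo collection full of Hyundais; the statement falls out of declaring the two identifiers to be the same thing.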

2 comments
  1. Here is the sad problem:

    “Resource” is an overloaded term that lies at the root of the main conflation issue.

    Imagine if we could just say: “Real World Object” (RWO) instead of “Non Information Resource” (NIR), with regards to the “Subjects” of Structured Descriptions.

    Imagine if we could say, with consistency: an “Object” on an HTTP Network is also known as a “Resource”.

    Then we would be able to say the following:

    1. A Web Resource carries (or bears) the Representation of a Real World or HTTP Network Object

    2. In the context of Linked Data – a Web Resource explicitly carries (or bears) a Structured Representation of the Description of a Real World or HTTP Network Object

    3. The Resource Type above is actually known as a Descriptor

    4. Every Descriptor has an Unambiguously Named Subject

    5. When giving Names to Descriptor Subjects the Linked Data meme requires the use of HTTP URIs i.e., “Name aspect” of the “Name/Address” duality inherent in the Generic URI abstraction

    6. When injecting Descriptor Resources into the Web do so using HTTP URLs (as per normal Web publishing practice i.e., using “Address” aspect of the Generic URI abstraction mentioned above).

    Links:

    1. http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1624 – Data 3.0 Manifesto

    Kingsley

  2. BTW – why isn’t the relation:

    owl:sameAs .

    Fixing DBpedia anomalies is as simple as putting together fixed triples into a Resource in one of the many RDF model formats and then just publishing somewhere on the Web (or send directly to me) and it will be loaded into a DBpedia-LINKs specific Named Graph within DBpedia host Virtuoso instance.

    Trivial :-)

    Kingsley
