Since announcing the preview release of 194 Million Open Linked Data Bibliographic Work descriptions from OCLC’s WorldCat last week, at the excellent OCLC EMEA Regional Council event in Cape Town, my in-box and Twitter stream have been a little busy with questions about what the team at OCLC are doing.
Instead of keeping the answers within individual email threads, I thought they might be of interest to a wider audience:
Q I don’t see anything that describes the criteria for “workness.”
The definition of “workness” is more the result of several interdependent algorithmic decision processes than a simple set of criteria. To a certain extent, publishing the results as linked data was the easy (huh!) bit. The efforts to produce these definitions and their relationships are the ongoing results of a research process by OCLC Research, in motion for several years, to investigate and benefit from FRBR. You can find more detail behind this research here: http://www.oclc.org/research/activities/frbr.html?urlm=159763
Q Defining what a “work” is has proven next to impossible in the commercial world, how will this be more successful?
Very true: for often commercial and/or political reasons, previous initiatives in this direction have not been very successful. OCLC make no broader claim to the definition of a WorldCat Work, other than that it is the result of applying the FRBR and associated algorithms, developed by OCLC Research, to the vast collection of bibliographic data contributed, maintained, and shared by the OCLC member libraries and partners.
Q Will there be links to individual ISBN/ISNI records?
- ISBN – ISBNs are attributes of manifestation [in FRBR terms] entities, and as such can be found in the already released WorldCat Linked Data. As each work is linked to its related manifestation entities [by schema:workExample] they are therefore already linked to ISBNs.
- ISNI – ISNI is an identifier for a person, and as such an ISNI URI is a candidate for use in linking Works to other entity types. VIAF URIs are another candidate for Person/Organisation entities which, as we have the data, we will be using. No final decisions have been made as to which URIs we use, or whether to use multiple URIs for the same relationship. Whether we use ISNI, VIAF, and DBpedia URIs for the same person, or just use one and rely on interconnection between the authoritative hubs, is a question still to be concluded.
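To make the Work-to-ISBN path concrete, here is a minimal sketch in plain Python. It assumes that ISBNs are carried on the manifestation as schema:isbn, and every example.org URI and ISBN value in it is a made-up placeholder rather than real WorldCat data.

```python
# Triples held as (subject, predicate, object) tuples.
# All example.org URIs and ISBN values are hypothetical placeholders.
SCHEMA = "http://schema.org/"

triples = [
    ("http://example.org/work/1", SCHEMA + "workExample", "http://example.org/manifestation/10"),
    ("http://example.org/work/1", SCHEMA + "workExample", "http://example.org/manifestation/11"),
    ("http://example.org/manifestation/10", SCHEMA + "isbn", "isbn-placeholder-a"),
    ("http://example.org/manifestation/11", SCHEMA + "isbn", "isbn-placeholder-b"),
    ("http://example.org/manifestation/99", SCHEMA + "isbn", "isbn-of-another-work"),
]


def isbns_for_work(work_uri, triples):
    """Follow schema:workExample from a Work, then collect schema:isbn values."""
    manifestations = {
        obj for subj, pred, obj in triples
        if subj == work_uri and pred == SCHEMA + "workExample"
    }
    return sorted(
        obj for subj, pred, obj in triples
        if subj in manifestations and pred == SCHEMA + "isbn"
    )


work_isbns = isbns_for_work("http://example.org/work/1", triples)
```

The Work itself never carries an ISBN here; it is reached in two hops, which is the sense in which Works are "already linked" to ISBNs.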
Q Can you say more about how the stable identifiers will be managed as the grouping of records that create a work change?
You correctly identify the issue of maintaining identifiers as work groups split & merge. This is one of the tasks the development team are currently working on as they move towards full release of this data over the coming weeks. As I indicated in my blog post, there is a significant data refresh due and from that point onwards any changes will be handled correctly.
Q Is there a bulk download available?
No, there is no bulk download available. This is a deliberate decision for several reasons.
Firstly, this is Linked Data – its main benefits accrue from its canonical persistent identifiers and the relationships it maintains between other identified entities within a stable, yet changing, web of data. WorldCat.org is a live data set actively maintained and updated by the thousands of member libraries, data partners, and OCLC staff and processes. I would discourage reliance on local storage of this data, as it will rapidly evolve and any local copy will fall out of synchronisation with the source. The whole point and value of persistent identifiers, which you would reference locally, is that they will always dereference to the current version of the data.
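As a rough sketch of what "reference locally, dereference when needed" could look like for a client, the snippet below keeps only the identifier and asks for the current description on demand. The URI is a hypothetical placeholder, and the choice of text/turtle as the requested media type is my assumption for illustration, not a statement of what the service supports.

```python
import urllib.request


def build_request(uri, media_type="text/turtle"):
    """Prepare a GET request asking for one RDF serialisation of an entity."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_current_description(uri, media_type="text/turtle"):
    """Dereference the identifier and return whatever is published right now."""
    with urllib.request.urlopen(build_request(uri, media_type)) as response:
        return response.read().decode("utf-8")


# Hypothetical identifier, stored locally instead of a copy of the data.
request = build_request("http://example.org/entity/work/id/123")
```

Only `build_request` is exercised above; calling `fetch_current_description` would make a live HTTP request.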
Q Where should bugs be reported?
Today, you can either use the comment link from the Linked Data Explorer or report them to firstname.lastname@example.org. We will be building on this as we move towards full release.
Q There appears to be something funky with the way non-existent IDs are handled.
You have spotted a defect! Accessing a non-established URI should return no triples with that URI as subject. How this is represented will differ between serialisations. You would also expect to receive an HTTP status of 404.
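The expected behaviour can be written down as a small client-side check. This is only a sketch: the function name and the triple representation (plain tuples, already parsed from whichever serialisation was returned) are my own assumptions.

```python
def check_missing_uri_response(uri, status, triples):
    """Return a list of problems; an empty list means the response
    matches what is expected for a non-established URI."""
    problems = []
    if status != 404:
        problems.append(f"expected HTTP 404, got {status}")
    if any(subject == uri for subject, _, _ in triples):
        problems.append("triples returned with the requested URI as subject")
    return problems


# Hypothetical URI and responses, for illustration only.
missing = "http://example.org/entity/work/id/0"
correct = check_missing_uri_response(missing, 404, [])
defective = check_missing_uri_response(
    missing, 200, [(missing, "http://schema.org/name", "?")]
)
```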
Q It’s wonderful to see that the data is being licensed ODC-BY, but maybe assertions to that effect should be there in the data as well?
The next release of data will be linked to a VoID document providing information, including licensing, for the dataset.
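For those unfamiliar with VoID, the sketch below generates the kind of Turtle such a document might contain, using the standard void: and dcterms: vocabularies. The dataset, entity, and licence URIs are all placeholders of mine; the real document may well be structured differently.

```python
def void_description(dataset_uri, license_uri, title):
    """Build a minimal VoID dataset description, including its licence, as Turtle."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    dcterms:license <{license_uri}> .",
    ])


def in_dataset_triple(entity_uri, dataset_uri):
    """Link a single entity description back to its dataset description."""
    return f"<{entity_uri}> <http://rdfs.org/ns/void#inDataset> <{dataset_uri}> ."


# Placeholder URIs: none of these are the real dataset or licence identifiers.
turtle = void_description(
    "http://example.org/void/works",
    "http://example.org/licences/odc-by",
    "Example Works dataset",
)
link = in_dataset_triple("http://example.org/entity/work/id/123",
                         "http://example.org/void/works")
```

A consumer holding any one entity description can then follow the `void:inDataset` link to find the licence statement.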
Q How might WorldCat Works intersect with the BIBFRAME model? – these work descriptions could be very useful as a bf:hasAuthority for a bf:Work.
The OCLC team monitor, participate in, and take account of many discussions – BIBFRAME, Schema.org, SchemaBibEx, Wikidata, etc. – where there are some obvious synergies in objectives, and differences in approach and/or levels of detail for different audiences. The potential for interconnection of datasets using sameAs, and other authoritative relationships such as you describe, is significant. As the WorldCat data matures and other datasets are published, one would expect many initiatives to start interlinking bibliographic resources from many sources.
Q Will your team be making use of ISTC?
Again, it is still early for decisions in this area. However, we would not expect to store the ISTC code as a property of Work. ISTC is one of many work-based data sets, from national libraries and others, for which it would be interesting to investigate processes for identifying sameAs relationships.
The answer to the above question stimulated a follow-on question based upon the fact that ISTC codes are allocated on a language basis. In FRBR terms, language of publication is associated with the Expression, not the Work-level description. As such, you would not expect to find ISTC on a ‘Work’. My response to this was:
Note that the Works published from WorldCat.org are defined as instances of schema:CreativeWork.
What you say may well be correct for FRBR, but the WorldCat data may not adhere strictly to the FRBR rules and levels. I say ‘may not’ as we are still working on the modelling behind this, and a language-specific Work may become just an example of a more general Work – there again, it may become more Expression-like. There is a balance to be struck between FRBR rules and a wider, non-library, understanding.
Q Which triplestore are you using?
We are not using a triplestore. Already, in this early stage of the journey to publish linked data about the resources within WorldCat, the descriptions of hundreds of millions of entities have been published. There is obvious potential for this to grow to many billions. The initial objective is to reliably publish this data in ways that make it easily consumed, linked to, and available in the de facto linked data serialisations. To achieve this we have put in place a simple, very scalable, flexible infrastructure currently based upon Apache Tomcat serving up individual RDF descriptions stored in Apache HBase (built on top of Apache Hadoop HDFS). No doubt future use cases will emerge, building upon this basic yet very valuable publishing of data, that will require additional tools, techniques, and technologies to become part of that infrastructure over time. I know the development team are looking forward to the challenges that the quantity, variety, and always changing nature of data within WorldCat will provide for some of the traditional [for smaller data sets] answers to such needs.
As an aside, you may be interested to know that significant use is made of the map/reduce capabilities of Apache Hadoop in the processing of data extracted from bibliographic records, the identification of entities within that data, and the creation of the RDF descriptions. I think it is safe to say that the creation and publication of this data would not have been feasible without Hadoop being part of the OCLC architecture.
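For readers new to map/reduce, here is a toy, single-machine sketch of the general pattern: a map step emits a key for each record, and a reduce step gathers records sharing a key. The grouping key used here (normalised author and title) and the record fields are invented for illustration; this is emphatically not the OCLC Research algorithm, and it does not use Hadoop itself.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit (grouping key, record id) for one bibliographic record.
    The key is a deliberately naive stand-in for real clustering logic."""
    key = (record["author"].strip().lower(), record["title"].strip().lower())
    return key, record["id"]


def reduce_groups(pairs):
    """Reduce step: collect the record ids that share a grouping key."""
    groups = defaultdict(list)
    for key, record_id in pairs:
        groups[key].append(record_id)
    return dict(groups)


# Hypothetical records, not real WorldCat data.
records = [
    {"id": "r1", "author": "Austen, Jane", "title": "Emma"},
    {"id": "r2", "author": "austen, jane ", "title": "EMMA"},
    {"id": "r3", "author": "Austen, Jane", "title": "Persuasion"},
]

clusters = reduce_groups(map(map_record, records))
```

In a Hadoop job the map calls would run in parallel across the cluster and the framework would deliver each key's values to a reducer; the shape of the two functions stays the same.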
Hopefully this background will help those interested in the process. When we move from preview to a fuller release I expect to see associated documentation and background information appear.