I have been banging on about Schema.org for a while. For those that have been lurking under a structured data rock for the last year, it is an initiative of cooperation between Google, Bing, Yahoo!, and Yandex to establish a vocabulary for embedding structured data in web pages to describe ‘things’ on the web. Apart from the simple significance of having those four names in the same sentence as the word cooperation, this initiative is starting to have some impact. As I reported back in June, the search engines are already seeing some 7%-10% of pages they crawl containing Schema.org markup. Like it or not, it is clear that Schema.org is rapidly becoming a de facto way of marking up your data if you want it to be shared on the web and have it recognised by the major search engines.
It is no coincidence then, at OCLC we chose Schema.org as the way to expose linked data in WorldCat. If you haven’t seen it, just search for any item at worldcat.org, scroll to the bottom of the page and open up the Linked Data tab and there you will see the [not very pretty, but hay it’s really designed for systems not humans] Schema.org marked up linked data for the item, with links out to other data sources such as VIAF, LCSH, FAST, and Dewey.
As with everything new it was not perfect from the start. We discovered some limitations in the vocabulary as my colleagues attempted to describe WorldCat resources. Leading to the creation of a Library vocabulary (as a potential extension to Schema.org) to help encode some of the stuff that Schema couldn’t. Fortunately, those at Schema.org are open to extension proposals and, with the help of the W3C, run a Group [WebSchemas]to propose and discuss them. Proposals that have already been accepted include those from news and ecommerce groups.
Things have moved on and, I have launched another W3C community Group – Schema Bib Extend to attempt to build a consensus, across a wide group of those concerned about things bibliographic, around proposing extensions to the Schema.org vocabulary. Addressing it’s capability for describing these types of resources – books, journals, articles, theses, etc., etc. in all forms and formats.
My personal hope being that the resulting proposals, if and when adopted by Schema.org, will enable libraries, publishers, interest groups, universities, retailers, OCLC, and others to not only publish data about their resources in a way that the search engines can understand, but also have a light weight way to interconnect them to each other and authoritative identifiers for place, name, subject, etc., that will help us begin to form a distributed web of bibliographic data. A bit of a grand ambition for a fairly simple vocabulary you may think, but things worth having are worth reaching for.
So focusing back on the short term for the moment. Extending Schema.org to better describe bib resources could have significant benefits anyway. What is in library catalogues, and other bibliographic sources, is mostly hidden to search engines – OPAC pages are almost impossible scrape intuitively, data formats used are only understood by the library and publisher worlds, and even if they ascertain the work a library is describing, there is little way to identify that it is, or is not, the same as one in another library. It is no accident that Google Book Search came into being utilising special data ingest processes and search techniques to help. Unfortunately there is a significant part of the population unaware of it’s existence and few who use it as part of their general search activities. By marking up your resources in their terms, your data should appear in the main search indexes and you may even get a better results listing (courtesy of Google Rich Snippets).
OK, that’s the pitch for Schema.org (and getting together to extend it a little in the bibliographic direction) over. Now on to the point of this post – the mindset we should adopt when approaching the generic, high level, course grained, broad but shallow, simplistic [choose your own phrase] Schema.org vocabulary to describe rich and [already] richly described resources we find in libraries. Although all my examples will be library/bibliographic ones, I believe that much of what I describe here will be of use and relevance to those in other industries with rich and established ways to describe their data and resources.
Initially let me get a few simple things out of the way. Firstly, the Schema.org vocabulary is not designed to, and will never, replace any rich industry specific vocabularies or ontologies. It’s prime benefits are that it is light-weight (understandable by non-experts) and cross-sectoral (data from many domains can be merged and mixed) and, oh yes becoming broadly adopted. Secondly nobody is advocating that anyone starts to use it instead of their currently used standards – either mix it with your domain specific standards and/or use it as ‘publicly understandable’ publishing format for web pages and the like. Finally, although initially conceived as a web page markup (Microdata) format, the schema.org vocabulary is equally applicable as Linked Data vocabulary that can be used in the creation of RDF data. The increasing use and reference to RDFa in Schema.being a reflection of this. This is also exemplified by the use of Schema.org in the RDF N-Triples dump file OCLC has published of a sub-set of WorldCat data.
So moving on. You have your resources already being described, following established practice, in domain specific format(s) and you want to attempt to describe them using the Schema.org vocabulary. In the library/publishing community we have more such standards than you can shake a stick at – MARC (of several mostly incompatible flavours), MODS, METS, ONIX, ISBD, RDA, to name just some. Each have their enthusiasts, and proponents, many being a great starting point for a process that might go something like this:
Working my way through all the elements of the [insert your favourite here] standard let me find an equivalent in Schema that I can map my data to.
This can become a bit of an involved operation. Take something as simple as the author of a book for instance. Bibliographic standards have concepts such as main author, corporate, creator, contributor, etc. Schema>Book only has the simple property ‘author’. How can I reflect the rich nuances and detail in my [library] format, in this simplistic Schema.org vocabulary? Simple answer – you can’t, so don’t try. The question you have to ask yourself at this point is: By adding all this detail will I confuse potential consumers of this data, or will the Googles of this world just want to know the people and organisations connected with [linked to] this book in a creative (text) way. Taking this approach of looking at the problem from the data/domain expert’s end of the telescope means that you have to go through a similar process for each and every element in your data forma/vocabulary/standard. An approach that will most probably lead to a long list of things missing from and recommendations for Schema.org that they (the group, not the vocabulary) would be unlikely to accept.
Let me propose an alternative approach by turning the telescope around and viewing the data, that you care about and want to publish, from the non-expert consumer’s point of view. Using my book example again it might go like this:
Schema has a Book class (great!) let me step through it’s properties and identify where in [insert your favourite standard here] I could get that from.
So for example, the ‘author’ property of Schema’s Book class comes from it being a sub-class of the generic CreativeWork class where it is defined as being a Person or Organization – The author of this content. You can now look into your own vocabulary or standard to find the elements which would contain author-ish data to map to Schema.
Hang on a moment though! The Book>author property is defined as being a instance of (or link to) Person or Organization classes. This means that when we start to publish our data in this form, it is not a matter of just extracting the text string of the author’s name from our data; we need to provide a link to a description of that author (preferably also in Schema.org format). WorldCat data does this by providing author links to VIAF – a pattern repeated with other properties such as ‘about’ (with links to Dewey and LCSH).
Taking this approach limits you to only thinking about the things Schema [currently] concerns itself with – a much simpler process.
If that was all there was to it, there would be no need for the Schema Bib Extend Group. As we did at OCLC with WorldCat, some gaps were identified in the result, making it unsatisfactory in some areas in providing a description for even a non-expert. Obvious candidates [for a Book] include a holding statement, and how to describe the type of book (ebook, talking book, etc.) and the format it is in (paper/hard back, large print, CD, Cassette, MP3, etc.) However, approaching it from this direction encourages you to firstly look across other areas of the Schema.org vocabulary and other extension proposals for solutions. GoodRelations, soon to be merged into Schema, offers some promising potential answers for holdings (describing them as Offers to hire/lease). A proposal from the Radio/TV community includes a PublicationEvent.
Finally it is only the gaps, or anomalies, apparent at a Schema.org level that should turn into proposals for extension. How they would map to elements of standards from our own domain would be down to us [as with what is already in Schema.org] to establish and share consensus driven good practice and lots, and lots, of examples.
We, especially in the library community, have invested much time and effort over many decades in describing [cataloguing] our resources so that people can discover and benefit from them. Long gone are the days when the way to find things was to visit the library and flick through draws full of catalogue cards. Libraries were quick to take advantage of the web, putting up their WebOPAC’s so that you could ‘search from home’. However, study after study has shown that people are now not visiting the library online either. The de facto [and often only] start point is now a search engine – increasingly as represented by a generic search prompt on your phone or tablet device.
This evolution in searching practice would be fine [from a library point of view] if library resources were identified and described to the search engines such that they can easily consume and understand it – so far it hasn’t been. Schema.org is a way to do that, and to be realistic at the moment is the only show in town that fits that particular bill. We realised decades, if not centuries ago, that for people to find our things we need to describe them, but the best descriptions in the world are about as much use as a chocolate teapot if they are not in places where those people are looking.
If you want to know more about bibliographic extension proposals to Schema.org, or help in creating them, join us at Schema Bib Extend.
And remember – when you are thinking about relating your favourite standard to Schema.org, check which end of the telescope you are using before you start.