Last modified: 2018-02-15
Abstract
Our research[1] involves comparing the terminology employed within the Linked Open Data (LOD) Cloud with terminology employed within two knowledge organization systems (KOSs): the Universal Decimal Classification (UDC) and the Basic Concepts Classification (BCC). In doing so we will connect two quite distinct literatures and communities of practice: the Semantic Web community, which has tended to be centered in computer science, and the knowledge organization (KO) community. In the Semantic Web community there have been increasing efforts to curate and preserve the machine-readable knowledge items published on the Web using Linked Data formats (Beek et al. 2014a,b). Controlled vocabularies play a prominent role in these efforts. They provide a way to index the knowledge graph, and they represent a semantically enriched layer in this graph. In KO, systematic studies of KOSs have already been proposed (Tennis 2012), and such studies have also been carried out for a number of small samples.
The promise of the web-based LOD Cloud is to free data, metadata and information to a large extent from what are often called ‘data silos’: isolated information systems that come with their own domain-specific knowledge organization systems and are often barely interoperable. The LOD Cloud promises to deliver machine-readable KOSs and their implementations in a way that enables easy cross-linking. For example, the platform GeoNames (http://www.geonames.org) publishes about 11 million place names in machine-readable form, and has been used by many other services to relate a term like “New York” to a specific geographic reference, which in turn enables other services to link other names to this location, e.g., “City of New York,” “New York City,” or the historic term “Nieuw Amsterdam.”
To be able to compare the different terminologies expressed in vocabularies, one first has to have an overview of them. Hence, our research involves the initial step of surveying the terminologies that are currently employed in Linked Open Data. This will result in an Atlas of Vocabularies.
The Semantic Web holds the promise that different information repositories can all be encoded in the flexible graph-based representation language of RDF. Atomic statements in RDF take the form of triples, which are composed of a subject, predicate, and object term. RDF relies on URIs in order to name/denote concepts and instances. Besides these syntactic properties, RDF also has a model-theoretic semantics that allows inferences to be drawn mechanically across different sets of information. If, for example, one website contains the RDF triple “Birds have wings” and another website contains the RDF triple “Penguins are birds,” a computer can infer that “Penguins have wings.” But this will only work if the same – or interoperable – terminology is employed. At present a wide variety of controlled vocabularies are employed across the LOD Cloud, but their formal semantics, including the inferences that follow from it, are not yet studied on a larger scale.
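The penguin example above can be illustrated with a minimal sketch in Python. It assumes triples are plain tuples and uses a single simplified inheritance rule; the "ex:" and "other:" names are hypothetical, and the rule is a stand-in for, not a rendering of, the formal RDF semantics.

```python
# Minimal sketch: RDF-style triples as plain tuples (subject, predicate, object).
# The "ex:" and "other:" names are hypothetical; SUBCLASS_OF is just a label here.
SUBCLASS_OF = "rdfs:subClassOf"


def infer(triples):
    """Copy every statement about a class down to its subclasses.

    This is a simplified rule that mirrors the informal example,
    not the official RDFS entailment rules.
    """
    known = set(triples)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for sub, pred, sup in list(known):
            if pred != SUBCLASS_OF:
                continue
            for s, p, o in list(known):
                if s == sup and (sub, p, o) not in known:
                    known.add((sub, p, o))
                    changed = True
    return known


site_a = {("ex:Bird", "ex:hasPart", "ex:Wings")}          # "Birds have wings"
site_b = {("ex:Penguin", SUBCLASS_OF, "ex:Bird")}         # "Penguins are birds"
site_c = {("other:Penguin", SUBCLASS_OF, "other:Bird")}   # same claim, other vocabulary

shared = infer(site_a | site_b)
mismatched = infer(site_a | site_c)

print(("ex:Penguin", "ex:hasPart", "ex:Wings") in shared)
print(("other:Penguin", "ex:hasPart", "ex:Wings") in mismatched)
```

The first check succeeds because both sites use the same name for birds; the second fails because the two sites name the same concept differently, which is the interoperability problem described above.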
This holds in particular for the humanities and the arts, and to some extent also for the social sciences. Mirroring the large variety of social and cultural phenomena in these fields, we find very specific, context-rich vocabularies developed by research communities as well as by curators of collections. Increasingly, big data projects in the social sciences and humanities embrace Semantic Web technology (e.g. Hyvönen 2012). The ultimate goal of the collaboration in this project is to enhance the findability of facts and vocabulary used in the LOD Cloud and to enable scholars in the social sciences and humanities to find the right points to connect to when publishing Linked Open Data.
This diverse terminology can itself only be surveyed mechanically: the LOD Laundromat, developed at VU University Amsterdam, allows us to identify the relationships among diverse terminology. The LOD Laundromat (http://lodlaundromat.org/) is a platform that scrapes, cleans, harmonizes and republishes a very large subset of the LOD Cloud. It currently serves more than 38 billion RDF triples, collected from over 650 thousand datasets. This vast collection of Linked Data is made available in a uniform and standards-compliant format for others to (re)use. The LOD Laundromat allows its entire data collection to be queried through open web services, and is currently the only framework that allows data to be searched for and browsed on such a large scale.
But the collection of web-based vocabularies is only a first step. We will then proceed to compare the terminology of the LOD Cloud with the controlled vocabulary of the UDC and BCC. Note here that the challenge of interoperability across the LOD Cloud is itself a KO challenge, yet there has been limited communication between the KO community and those active in developing the Semantic Web. We chose the UDC and BCC because these classifications have explicitly grappled with interdisciplinarity, and have pursued a faceted approach to classification (on BCC see Szostak 2013). The potential of the Semantic Web will best be realized if connections can be drawn across data repositories. We thus wonder whether KOSs that strive to facilitate interdisciplinarity can play a key role in encouraging interoperability in the LOD Cloud. Can the terminology employed in the LOD Cloud be connected to KOS controlled vocabulary? Can the hierarchies and other relationships recognized within KOSs be used to structure terminology in the LOD Cloud?
We will use these two generic classifications (UDC, BCC) as reference systems to develop generic principles of indexing. We will use high-level topical categorisations, similar to the UDC classes, together with facets such as place, time, person/organization (data publisher), form, and language, similar to the UDC Common Auxiliary numbers. We will contrast this with the phenomenon-based approach of the BCC, and ask the questions What (is studied)?, Why?, Who?, Where?, and When? These categorisations will be tested in the archived version of the LOD Laundromat and eventually implemented in the open web services of the ‘living’ LOD Laundromat. In particular, we will explore how general classification systems such as the UDC and BCC can be used to index Linked Data in a way that allows searching for concepts across domains, without becoming lost in the richness of the KOSs embedded in the LOD. In other words, we aim at a kind of union catalog for the LOD Laundromat snapshot, which will also be archived along with the LOD Laundromat data collection itself. One key question we hope to investigate is how interdisciplinarity is present/expressed or hidden/undiscovered.
At present, anyone wishing to code data for the Semantic Web has to choose among a bewildering array of sources of terminology. The choices made will determine which other data repositories a computer can connect your data to. Our research can potentially ease the choice facing those wishing to employ LOD and expand the degree of interoperability. We hope, in particular, to develop recommendations for LOD publication for communities in the Social Sciences and Humanities (SSH), with emphasis on the re-use of existing vocabularies (among which we will encourage interoperability). We will identify, evaluate and index SSH-relevant vocabularies by mapping clusters of similar meaning onto Knowledge Organization Systems (KOSs).
Though implications for the Semantic Web are perhaps most obvious, our research also has important implications for KO. If KOSs can play a critical role in encouraging interoperability across the Semantic Web, then the field of KO gains an important new audience for its work. Note that the premise of the Semantic Web is that data of all types needs to be explicitly coded in terms of RDF triples. In other words, the Semantic Web is grounded in the recognition that there are limits to what can be discovered by simply searching texts. The KO community’s longstanding efforts to develop structured controlled vocabulary at times seem to be overshadowed by search algorithms that search full texts rather than metadata, but the Semantic Web potentially places KO at the center of future developments in machine searching.
In addition, research has shown that classifications themselves form navigable knowledge networks among the resources to which they are linked (Suchecki et al. 2012; Smiraglia et al. 2013).
Much effort is undertaken in the KOS domain to bring KOSs into use in the LOD Cloud (e.g. Baca and Gill 2015). There is effort to link general controlled vocabularies, such as the Getty Vocabularies, to the LOD Cloud. Vocabulary mapping for people-centered properties (what librarians call “authority control” of names) offers definite advantages for LOD, alleviating the problem of property proliferation in LOD environments. The discourse concerning the Semantic Web reveals a research agenda for KOSs that includes direct linkage of domain-centric ontologies within the LOD Cloud and, most importantly for this project, vocabulary alignment. We hope to provide advice on how KOSs might be revised to reflect and serve the LOD Cloud (especially from the perspective of interdisciplinarity).
It should be stressed that the proposed research will provide a much-needed link between LOD and KOSs. By mapping one onto the other we can compare the structure of the two. KOSs always combine some sort of logical structure with “literary warrant”: the idea that a place must be found in the KOS for all works or ideas. Comparing LOD clusters with a KOS can indicate where a particular KOS needs to be amended. That is, LOD clusters provide literary warrant for clarifying the KOS. In turn the mapping can suggest how LOD can be better structured/indexed to facilitate the practice of actually linking data. We can thus harness the wisdom of the KO community to the important practice of achieving interoperability or even consensus on LOD terminology. To achieve this interplay with respect to the BCC we must render the BCC into LOD terminology and then compare the result with the clusters of LOD terminology we obtain.
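The comparison described above can be sketched, under strong simplifying assumptions, as a check for LOD clusters that have no counterpart in a KOS. The cluster names, terms and KOS labels below are hypothetical, and exact matching of normalized labels is only a stand-in for whatever alignment method the project eventually adopts.

```python
def normalize(label):
    """Lower-case a label and collapse whitespace so trivial variants match."""
    return " ".join(label.lower().split())


def unmatched_clusters(clusters, kos_labels):
    """Return the names of clusters in which no term matches any KOS label.

    Such clusters are candidates for amending the KOS, in the sense of
    literary warrant discussed above.
    """
    kos = {normalize(label) for label in kos_labels}
    return sorted(
        name
        for name, terms in clusters.items()
        if not any(normalize(term) in kos for term in terms)
    )


# Hypothetical data, for illustration only.
lod_clusters = {
    "cluster_1": ["Place", "location", "spatial  coverage"],
    "cluster_2": ["sensor reading", "observation value"],
}
kos_labels = ["place", "time", "language", "form"]

gaps = unmatched_clusters(lod_clusters, kos_labels)
print(gaps)
```

Here only the second cluster is reported, since none of its terms appears among the KOS labels; the first is matched through the shared label "place".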
References
Baca, Murtha and Melissa Gill. 2015. “Encoding Multilingual Knowledge Systems in the Digital Age: the Getty Vocabularies.” Knowledge Organization 42(4): 232-243.
Beek, W., L. Rietveld, H.R. Bazoobandi, J. Wielemaker, S. Schlobach (2014a). “LOD Laundromat: A Uniform Way of Publishing Other People’s Dirty Data.” In: International Semantic Web Conference (ISWC), pp. 213-228.
Beek, W., P. Groth, S. Schlobach, R. Hoekstra (2014b). “A Web Observatory for the Machine Processability of Structured Data on the Web.” In: Proceedings of the 2014 ACM Conference on Web Science (WebSci), pp. 249-250.
Hyvönen, E. (2012). Publishing and Using Cultural Heritage Linked Data on the Semantic Web. Synthesis Lectures on the Semantic Web: Theory and Technology (Vol. 2). doi:10.2200/S00452ED1V01Y201210WBE003
Smiraglia, Richard P., Andrea Scharnhorst, Almila Akdag Salah and Cheng Gao. 2013. “UDC in Action.” In Classification and Visualization: Interfaces to Knowledge, Proceedings of the International UDC Seminar, 24-25 October 2013, The Hague, The Netherlands, ed. Aïda Slavic, Almila Akdag Salah and Sylvie Davies. Würzburg: Ergon-Verlag, pp. 259-72.
Suchecki, Krzysztof, Alkim Almila Akdag Salah, Cheng Gao and Andrea Scharnhorst. 2012. “Evolution of Wikipedia's Category Structure.” Advances in Complex Systems 15 supp01: 1250068.
Szostak, Rick. 2013. Basic Concepts Classification. (regularly updated since 2013) https://sites.google.com/a/ualberta.ca/rick-szostak/research/basic-concepts-classification-web-version-2013
Tennis, Joseph T. 2012. “The Strange Case of Eugenics: A Subject’s Ontogeny in a Long-lived Classification Scheme and the Question of Collocative Integrity.” Journal of the American Society for Information Science and Technology 63: 1350–59. doi:10.1002/asi.22686
[1] Digging Into the Knowledge Graph, 2016 Digging Into Data Challenge https://diggingintodata.org/awards/2016/project/digging-knowledge-graph