Last modified: 2017-12-19
Abstract
Research data management (RDM) is increasingly part of routine work for many libraries and research institutions. Vocabularies for RDM have emerged to meet the needs of data management and curation. Examples include a taxonomy of data-related terms for geographical information systems (GIS) (Cole, 2005), the vocabulary for research data repository registration and description (Vierkant et al., 2012), and the DPCVocab framework (Chao et al., 2015). While these endeavors provide useful frameworks for describing data and curation practices, the complex and entangled relations between concepts in datasets and other research artifacts remain largely unexplored territory. The recent hype around artificial intelligence (AI) research and development, and the promise that knowledge organization systems (KOS) hold for AI, call for a re-examination of relations in KOS.
Concepts and the relations between them are the backbone of a knowledge organization system (KOS). The most common types of relations among terms in thesauri and subject heading lists are broader terms (BT), narrower terms (NT), and related terms (RT), which establish parent-child, part-whole, or associative relations between concepts. Such relations primarily represent the scope of concepts and are limited in representing the many concept relations that lie beyond scope. For example, the Unified Medical Language System (UMLS) defines about 60 relations in five broad categories: R1. physically_related_to, R2. spatially_related_to, R3. functionally_related_to, R4. temporally_related_to, and R5. conceptually_related_to (UTS, 2017). Another example is the set of relationships defined by Schema.org for Dataset, which are largely inherited from the CreativeWork class. Major relationships for Dataset (or CreativeWork) include hasPart, isBasedOn, isPartOf, and mentions (Schema.org, 2017).
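To illustrate the contrast between scope relations (BT/NT/RT) and richer typed relations such as those of Schema.org, the following is a minimal Python sketch that stores relations as subject-relation-object triples. The class name `RelationStore`, the `INVERSE` table, and the example terms are hypothetical and chosen only for illustration; they are not taken from UMLS, Schema.org, or the cases analyzed in this paper.

```python
from collections import defaultdict

# Assumed inverse pairs, for illustration only: BT and NT mirror each other,
# RT is symmetric, and hasPart/isPartOf are treated as inverses.
INVERSE = {
    "BT": "NT",
    "NT": "BT",
    "RT": "RT",
    "hasPart": "isPartOf",
    "isPartOf": "hasPart",
}


class RelationStore:
    """Hypothetical in-memory store of (subject, relation, object) triples."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, relation, obj):
        # Record the stated relation.
        self._by_subject[subject].add((relation, obj))
        # Record the inverse relation when one is defined in INVERSE.
        inverse = INVERSE.get(relation)
        if inverse is not None:
            self._by_subject[obj].add((inverse, subject))

    def related(self, subject, relation):
        # Return the objects linked to `subject` by `relation`, sorted by name.
        pairs = self._by_subject.get(subject, set())
        return sorted(o for r, o in pairs if r == relation)


store = RelationStore()
# Thesaurus-style scope relation: "astrophysics" is the broader term.
store.add("gravitational waves", "BT", "astrophysics")
# Schema.org-style relations between research artifacts.
store.add("survey dataset", "hasPart", "codebook")
store.add("journal article", "isBasedOn", "survey dataset")

print(store.related("astrophysics", "NT"))
print(store.related("codebook", "isPartOf"))
```

In this sketch the BT/NT pair only positions a term within a hierarchy, whereas relations such as hasPart and isBasedOn state how one artifact depends on or is composed of another, which is the kind of distinction the examples above point to.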
The content and formats of a KOS are largely determined by its purpose and by the technology available at the time. Thesauri and subject heading lists suited early computing technology and supported traditional methods of indexing, which use terms to represent the knowledge in publications. While this tradition remains the mainstream method of representing the subjects of publications, the methods of representing concepts and relations have gradually expanded as advances in technology made richer approaches to building KOS possible.
Ontologies, as one type of KOS, entail much richer relations among the concepts they cover. In both the UMLS and Schema.org examples, the relations between concepts go far beyond the traditional BT/NT/RT types, which can be seen as a sign that modern KOS are evolving from term-based representation of knowledge toward real-world-oriented representation.
Relations between concepts (and/or entities, events, and other things) vary depending on the criteria by which relations are defined or viewed. In the domain of research data management, it is known that different types of research (e.g., experimental, observational, simulation, and survey) generate different types of data, and that terminologies vary between practitioners and basic science researchers even within the same disciplinary domain. Datasets often come with documentation (some in the form of a user guide or manual) and computing code, and are associated with publications. Interactions between datasets, between datasets and documentation, and between datasets and computing code can result in different types of relations.
This paper extends the author's 2002 ISKO article on the evolving paradigm of knowledge representation and organization, which presented two opposite but crossover spectra in knowledge representation and organization, one for pragmatism vs. epistemologism and the other for integration vs. disintegration (Qin, 2002). Using two cases, the GenBank annotation records and the data and artifact collection from a gravitational wave search, this paper will demonstrate the types of relations existing in and between datasets, publications, computing code, and workflows. The analysis and generalization of these relations will draw on research in AI knowledge representation and in KOS, including both ad hoc subject categories and formal KOS. In the next AI era, relations, as one of the key components of AI applications, will be required to function not only as part of KOS for indexing data and publications but, more importantly, as codifiable knowledge for machine consumption. The paper will start with an introduction, followed by a literature review on theoretical support and practical evidence for various types of relations, a theoretical framework for relation typology and its role in KOS, and a discussion of the implications of this relation typology for KOS development.