Last modified: 2018-06-20
Abstract
An important social issue in the cultural heritage domain is the collection, analysis, publication and enhancement of the collective history and memory of stakeholders, whether spoken or written. Formalizing this kind of information is a real challenge because the data are diverse and incomplete. Moreover, data about cultural heritage are sparse and distributed across many sources: online, in databases, in libraries and museums, in newspapers, in the memories of stakeholders, etc. We are thus witnessing a sharp rise in the volume of digital and physical content describing this heritage, together with growing production capacity driven by dissemination techniques, at different scales and especially at the regional level. This diversity of resources raises many problems, such as data documentation, representation, integration and interoperability within a single knowledge base. Most attempts to resolve semantic interoperability problems focus on standardization and on the development of shared models such as FRBR, FRBRoo and CIDOC CRM. Among these, CIDOC CRM is an ontology specifically designed to model the cultural heritage domain. It offers a common metadata schema in which data become understandable and interrelated through implicit and explicit relationships. Such ontologies are meant to promote a shared understanding of cultural heritage data by providing a common and extensible semantic framework onto which all information can be mapped.
In our project, the objective is to provide a knowledge representation that interconnects all of these data using semantic web technologies, in order to assist domain experts in producing and providing digital content. The originality of the project lies in its multidisciplinary approach, which helps stakeholders, experts and non-experts alike, discover knowledge specific to their heritage through the extraction, structuring and visualization of knowledge from heterogeneous digital corpora. According to UNESCO, which has contributed significantly to the definition of heritage (UNESCO, 1954, 1970, 1982), and to The International Committee for the Conservation of Industrial Heritage (TICCIH, 2003), industrial heritage can be defined as:
- Material assets: buildings, machinery, equipment, workshops, factories, processing and refining sites, shops, production centers and social activities related to the textile industry;
- Immaterial assets: memories, events, festivals, collective images, intellectual production transmitted by know-how which can be a succession of gestures dictated and displayed in production centers.
In our work, the main efforts focus on modeling the domain stakeholders, the spatial entities and the themes, which belong to both types of assets.
In this paper, we first briefly describe existing studies that aim to build domain ontologies for several fields of the cultural heritage area using semantic web technologies.
Then, we present a three-step methodology for the semi-automatic building of a semantic representation of the studied domain from heterogeneous documents.
During the first step, we collect and formalize the history through interviews with stakeholders. In addition to the collected information, we also exploit a web mapping (visualization) of stakeholders organized by type (Kergosien et al., 2015).
During the second step, we describe our methodology for identifying and extracting information related to industrial cultural heritage from heterogeneous textual documents (interviews, digital documents from libraries, newspapers, etc.). The proposed approach combines lexicon projection with text mining methods to improve the identification of relevant data. The lexica of spatial entities initially cover the regional municipalities. The lexicon of the domain's stakeholders was built semi-automatically with experts. To create a thematic lexicon, existing specialized resources defined by experts (Joconde, created by French museums; Rameau, created by the National Library of France; Wiktionary; and others) were analyzed and filtered manually. The text mining approach is based on the Word2vec algorithm and is used to identify new terms in the processed corpus. The main purpose is to build a semantic representation of the studied domain that is as precise as possible.
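As an illustration only, the lexicon projection part of this step could look like the following minimal Python sketch. The lexicon entries, the function name `project_lexica` and the case-insensitive whole-word matching rule are assumptions made for the example, not the resources or the implementation used in the project; the Word2vec-based discovery of new terms is not shown.

```python
import re

# Hypothetical toy lexica; the real ones are built with domain experts.
LEXICA = {
    "spatial": ["Roubaix", "Tourcoing"],
    "stakeholder": ["textile museum"],
    "thematic": ["spinning mill", "loom"],
}


def project_lexica(text, lexica):
    """Return (start, end, category, term) tuples for every lexicon hit.

    Matching is case-insensitive and restricted to whole words, so that
    "loom" does not match inside "bloom". Results are sorted by position.
    """
    annotations = []
    for category, terms in lexica.items():
        for term in terms:
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append((match.start(), match.end(), category, term))
    return sorted(annotations)


sample = "The spinning mill in Roubaix closed; Tourcoing kept its loom workshops."
hits = project_lexica(sample, LEXICA)
for start, end, category, term in hits:
    print(f"{start:3d}-{end:3d}  {category:12s} {sample[start:end]}")
```

Terms absent from the lexica are simply not annotated, which is the gap that a corpus-driven method such as Word2vec is meant to reduce.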
The indexed documents are structured in the XML MODS format[1], a document indexing format created by the Library of Congress in the United States. This standard is a compromise between the complexity of the MARC format used by libraries and the extreme simplicity of Dublin Core metadata.
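To make the target format concrete, the sketch below builds a small MODS record with the Python standard library. The element names (`titleInfo`, `title`, `name`, `namePart`, `subject`, `geographic`, `topic`) and the namespace belong to MODS version 3; the helper `build_mods_record`, the choice of which annotations go into which element and the sample values are assumptions for illustration, not the indexing profile used in the project.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_mods_record(title, stakeholders, places, topics):
    """Build a minimal MODS record from a title and extracted annotations."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = title
    for stakeholder in stakeholders:
        name = ET.SubElement(mods, q("name"))
        ET.SubElement(name, q("namePart")).text = stakeholder
    for place in places:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("geographic")).text = place
    for topic in topics:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = topic
    return mods


# Hypothetical sample values, not taken from the corpus.
record = build_mods_record(
    title="Interview with a former weaver",
    stakeholders=["textile museum"],
    places=["Roubaix"],
    topics=["loom"],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```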
Then, during the third step, we present a first ontology, built automatically in the OWL CIDOC CRM format, that merges all our lexica. In this phase, it is important to filter the CIDOC CRM model in order to obtain a sub-model containing only the relevant concepts and properties.
The experiments were carried out on a corpus of thousands of heterogeneous documents (newspaper articles from La Voix du Nord, documents with metadata from libraries, and interviews) related to the Textile Industrial Heritage (TIH) of northern France. The resulting ontology was tested and validated by experts using the Protégé tool.
In the future, we propose to extend our work and design a generic, semi-automatic approach for building semantic representations related to industrial heritage. In addition, we propose to test our method on heterogeneous data related to industrial and mining heritage, collected within the framework of the MemoMines project.
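One simple way to picture this filtering is to keep only the properties whose domain and range both belong to a chosen set of classes. The sketch below does this over a small, hand-copied excerpt of CIDOC CRM property signatures, which should be checked against the official release of the model. The selection of relevant classes and the function name `filter_submodel` are assumptions for the example, and subclass inheritance is deliberately ignored, so this is not the filtering procedure used in the project.

```python
# Excerpt of CIDOC CRM properties as (property, domain, range) triples.
# Hand-copied for illustration; verify against the official specification.
PROPERTIES = [
    ("P7_took_place_at", "E4_Period", "E53_Place"),
    ("P11_had_participant", "E5_Event", "E39_Actor"),
    ("P74_has_current_or_former_residence", "E39_Actor", "E53_Place"),
    ("P107_has_current_or_former_member", "E74_Group", "E39_Actor"),
    ("P108_has_produced", "E12_Production", "E24_Physical_Man-Made_Thing"),
]

# Hypothetical selection of classes judged relevant for stakeholders,
# places and events.
RELEVANT_CLASSES = {"E5_Event", "E39_Actor", "E53_Place", "E74_Group"}


def filter_submodel(properties, relevant_classes):
    """Keep properties whose domain and range are both relevant classes.

    Simplification: a property declared on a superclass (for example
    P7 on E4_Period) is dropped even if a relevant subclass inherits it.
    """
    return [
        (name, domain, range_)
        for name, domain, range_ in properties
        if domain in relevant_classes and range_ in relevant_classes
    ]


submodel = filter_submodel(PROPERTIES, RELEVANT_CLASSES)
for name, domain, range_ in submodel:
    print(f"{domain} --{name}--> {range_}")
```

A fuller version would walk the class hierarchy so that properties inherited from superclasses are retained for the selected classes.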
[1] http://www.loc.gov/standards/mods/
E. Kergosien, B. Jacquemin, M. Severo and S. Chaudron. Vers l'interopérabilité des données hétérogènes liées au patrimoine industriel textile. In 18ème colloque international sur le document numérique (CIDE'18), pp.15, Montpellier, 2015.