Last modified: 2018-06-20
Abstract
The use of data is part of the decision-making process (Ikemoto & Marsh, 2007). Evaluating a decision requires receiving data inputs and integrating them with multiple data sources. One example is the integration of data about users' interactions with communication channels in order to give service holders new perceptions of that context.
However, treating a datum as a triple formed by entity, attribute, and value (Santos & Sant’Ana, 2015) implies the use of elements that assure a minimal semantics for understanding what is available, notably in data-obtaining processes (Sant’Ana, 2016), thereby minimizing semantic dissonance (Berg, 2015; Rathod, 2006; Parry, Poole, & Pratty, 2008) at the moment of collecting, which is the research problem of this study.
In this context, the goal of the study is to identify semantic characteristics of datasets at the moment of data collecting, based on the dataset structures found in the data export interfaces of user-interaction analysis tools, Internet communication channels, and statistical data access tools involved in a scientific journal management process, through the application of data analysis and data modeling techniques.
The research universe was delimited to exportable dataset structures found in journal publishing systems, online social network statistics, search engines, and web analytics tools. The sample analyzed was restricted to dataset structures available in reports from Open Journal Systems (OJS), Google Analytics, Google Search Console, Twitter Analytics, and Facebook Insights. None of these resources presented version numbering, except OJS (2.6). The data were collected in September 2017 from the accounts of the "Electronic Journal Digital Skills for Family Farming".
An exploratory analysis methodology was adopted to identify how data are made available and structured in those data resources, through a systematic description of the datasets, entities, and attributes related to the interaction between users and the communication channels of a scientific journal.
A total of 255 exportable datasets were found, distributed across 5 file formats: Comma-Separated Values (CSV) (82), Google Docs Spreadsheet File Format (69), Excel Microsoft Office Open XML Format Spreadsheet file (50), Portable Document Format (50), and Excel Binary File Format (3). All file formats except CSV were discarded, mainly because CSV is a machine-readable, open file format that is available in every data export interface analyzed. In total, 82 CSV datasets were collected, from Google Analytics (50), Google Search Console (20), Open Journal Systems (7), Facebook Insights (3), and Twitter Analytics (2).
To systematize the analysis, concepts from the Entity-Relationship (ER) Model (Silberschatz, Korth, & Sudarshan, 2010) were applied, with entities to store the data collected about i) services, ii) resources available in the services, iii) datasets available in the resources, and iv) attributes available in the datasets.
In addition, two auxiliary tables were developed: i) format, to store the file format types available for the datasets, and ii) data type, to store data types, where a data type is "a named (and in practice finite) set of values" (Date, 2016, p. 228).
The applied ER Model provides a structure to store data about the entities and attributes of each dataset. Applying this ER structure to the data collected in this study made it possible to identify 82 entities and 2280 attributes, with a subset of 1342 unique attribute labels.
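The four entities and two auxiliary tables described above could be laid out roughly as follows. This is a minimal sketch, not the schema used in the study: the table names, column names, foreign keys, and the choice of SQLite are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption,
# chosen only to mirror the entities named in the text.
SCHEMA = """
CREATE TABLE format    (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE data_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE service   (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE resource  (id INTEGER PRIMARY KEY,
                        service_id INTEGER NOT NULL REFERENCES service(id),
                        name TEXT NOT NULL);
CREATE TABLE dataset   (id INTEGER PRIMARY KEY,
                        resource_id INTEGER NOT NULL REFERENCES resource(id),
                        format_id INTEGER NOT NULL REFERENCES format(id),
                        name TEXT NOT NULL);
CREATE TABLE attribute (id INTEGER PRIMARY KEY,
                        dataset_id INTEGER NOT NULL REFERENCES dataset(id),
                        data_type_id INTEGER REFERENCES data_type(id),
                        label TEXT NOT NULL,
                        description TEXT);  -- NULL when the export gives none
"""


def build_catalog(path=":memory:"):
    """Create an empty catalog database with the sketch schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = build_catalog()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

Keeping the description column nullable makes it straightforward to count later how many attributes were exported without any description.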
The ER structure and data were stored in a Google Spreadsheet file. The file was then uploaded to a Database Management System (DBMS) for further data analysis. A Python script was developed to reorder the data stored in the DBMS into a new data structure, adopting the Online Analytical Processing (OLAP) cube as its representation, with Service (s), Entity (e), and Attribute (a) as dimensions (Gray, Bosworth, Layman, & Pirahesh, 1996; Inmon, 1996; Kimball & Ross, 2011). The collected data were reordered into the OLAP cube dimensions by a pivot table process (Cornell, 2005).
The intent was to observe, at the intersections of the OLAP cube, the characteristics shared internally and externally by services, entities, and attributes that can affect semantic aspects of data collecting.
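The reordering into Service, Entity, and Attribute dimensions can be illustrated with a small sketch. It is not the script used in the study: it replaces the pivot table process with a plain counter from the Python standard library, and the sample rows and the function names build_cube and roll_up are hypothetical.

```python
from collections import Counter

# Hypothetical rows (service, entity, attribute label); not data from the study.
rows = [
    ("Service A", "report_1", "Date"),
    ("Service A", "report_1", "Sessions"),
    ("Service A", "report_2", "Date"),
    ("Service B", "report_3", "Date"),
]


def build_cube(rows):
    """Count occurrences for every (service, entity, attribute) cell."""
    return Counter(rows)


def roll_up(cube, keep):
    """Aggregate over the dimensions not listed in `keep`.

    `keep` holds dimension indices: 0 = service, 1 = entity, 2 = attribute.
    """
    out = Counter()
    for cell, count in cube.items():
        out[tuple(cell[i] for i in keep)] += count
    return out


cube = build_cube(rows)
by_service_attribute = roll_up(cube, keep=(0, 2))
by_attribute = roll_up(cube, keep=(2,))
```

Rolling up to the attribute dimension alone shows which labels recur across services, while keeping the service dimension shows where each label occurs.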
The results show that 88.69% of the attributes have no description of their content. In addition, all attributes that share equal labels across distinct services came without a description at collection. This subset of attributes is of significant importance for the interoperability of those datasets: such attributes can distinguish the context of the collecting process and can belong to a group of potential primary keys or unique fields, helping to build relationships between data from these sources or to determine geographic, temporal, or linguistic scope.
In this scenario, several attributes came with filtering, grouping, or sorting specifications as part of their text labels, a pattern followed only by the statistical data export tools of online social networks. This can increase the complexity of interpreting those attributes and of determining the characteristics of their values by fully or semi-automated data collecting algorithms.
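The two measurements reported above, the share of attributes without a description and the set of labels shared across services, could be computed along these lines. This is a sketch on hypothetical records, with function names chosen for illustration; it does not reproduce the study's figures.

```python
from collections import defaultdict

# Hypothetical records: (service, attribute label, description or None).
attributes = [
    ("Service A", "Date", None),
    ("Service A", "Sessions", "Number of visits"),
    ("Service B", "Date", None),
    ("Service B", "Impressions", None),
]


def undescribed_share(attributes):
    """Percentage of attributes exported without any description."""
    missing = sum(1 for _, _, description in attributes if not description)
    return 100.0 * missing / len(attributes)


def shared_labels(attributes):
    """Labels that occur in more than one service, with those services."""
    services_by_label = defaultdict(set)
    for service, label, _ in attributes:
        services_by_label[label].add(service)
    return {
        label: services
        for label, services in services_by_label.items()
        if len(services) > 1
    }


share = undescribed_share(attributes)
shared = shared_labels(attributes)
```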
Therefore, based on the information from the exportable datasets found in the studied services, it is possible to confirm that an entity (ex) can have two attributes (ax and ay) that share the same semantics (S) even when the attributes have different labels, as expressed by the formula: S(ex, ax) = S(ex, ay)
For example, two attributes whose text labels include filtering specifications fit this formula.
In addition, when attributes from different entities (ex and ey) share the same label (ax), there is no guarantee that these attributes share the same formal semantic characteristics when collected by external agents, which forces external teams to interpret the semantics of these elements locally, as expressed by the formula: S(ex, ax) ≠ S(ey, ax)
For example, attributes that share equal labels in distinct services, without a proper description of their content, may require interpretation by external agents, increasing the risk of misinterpreting an attribute's meaning when it appears in another service.
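Both relations can be made concrete by treating S as a lookup from an (entity, attribute) pair to a meaning. The sketch below is purely illustrative: the meanings assigned to each pair are invented, and same_semantics is a hypothetical helper.

```python
# Hypothetical semantic map S: (entity, attribute label) -> meaning.
# The meanings are invented for illustration only.
S = {
    ("e_x", "a_x"): "daily page views",
    ("e_x", "a_y"): "daily page views",    # different label, same meaning
    ("e_y", "a_x"): "daily unique users",  # same label, different meaning
}


def same_semantics(first, second, semantics=S):
    """True when two (entity, attribute) pairs map to the same meaning."""
    return semantics[first] == semantics[second]


same_entity_case = same_semantics(("e_x", "a_x"), ("e_x", "a_y"))
same_label_case = same_semantics(("e_x", "a_x"), ("e_y", "a_x"))
```

In practice no such map is exported with the datasets, which is why external agents end up reconstructing it locally.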
To minimize this dissonance, export interfaces can add semantic information, which may be fundamental in helping external agents interpret data from different services. In this sense, semantic dissonances between entities and attributes in these datasets can be better represented in the collecting process through the use of controlled vocabularies in labeling rules.
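One way to picture a controlled vocabulary applied at export time is a mapping from each service-specific label to a shared term. The sketch below assumes a hypothetical vocabulary; the terms, labels, and the annotate function are illustrative and do not come from the studied services.

```python
# Hypothetical controlled vocabulary: (service, exported label) -> shared term.
VOCABULARY = {
    ("Service A", "Date"): "date:day",
    ("Service B", "Date"): "date:week_start",
    ("Service B", "Day"): "date:day",
}


def annotate(service, labels, vocabulary=VOCABULARY):
    """Pair each exported label with its controlled term, or None if unmapped."""
    return [(label, vocabulary.get((service, label))) for label in labels]


annotated = annotate("Service B", ["Date", "Day", "Clicks"])
```

Labels that map to None are the ones an external agent would still have to interpret locally.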
Keywords: Semantic dissonance. OLAP. Data analysis. Data collecting. Data.
References
Berg, O. (2015). Collaborating in a social era: ideas, insights and models that inspire new ways of thinking about collaboration. Göteborg: Intranätverk.
Cornell, P. (2005). A complete guide to PivotTables: a visual approach. Berkeley, CA; New York: Apress; distributed to the book trade in the United States by Springer-Verlag.
Date, C. J. (2016). The new relational database dictionary: a comprehensive glossary of concepts arising in connection with the relational model of data, with definitions and illustrative examples. Sebastopol, CA: O'Reilly.
Gray, J., Bosworth, A., Layman, A., & Pirahesh, H. (1996). Data cube: a relational aggregation operator generalizing GROUP-BY, CROSS-TAB, and SUB-TOTALS (pp. 152–159). IEEE Comput. Soc. Press. https://doi.org/10.1109/ICDE.1996.492099
Ikemoto, G. S., & Marsh, J. A. (2007). Cutting Through the “Data-Driven” Mantra: Different Conceptions of Data-Driven Decision Making. Yearbook of the National Society for the Study of Education, 106(1), 105–131. https://doi.org/10.1111/j.1744-7984.2007.00099.x
Inmon, W. H. (1996). Building the data warehouse (2nd ed.). New York: Wiley Computer Pub.
Kimball, R., & Ross, M. (2011). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. New York, United States of America: John Wiley & Sons. Retrieved from http://nbn-resolving.de/urn:nbn:de:101:1-2014122311140
Parry, R., Poole, N., & Pratty, J. (2008). Semantic dissonance: Do we need (and do we understand) the semantic web? In J. Trant & D. Bearman (Eds.), Museums and the Web 2008: Proceedings. Toronto: Archives & Museum Informatics. Published March 31, 2008. Consulted September 19, 2017. http://www.archimuse.com/mw2008/papers/parry/parry.html
Rathod, A. (2006). A messaging system to handle semantic dissonance (Thesis). Rochester Institute of Technology, New York. Retrieved from http://scholarworks.rit.edu/cgi/viewcontent.cgi?article=1668&context=theses
Sant’Ana, R. C. G. (2016). Ciclo de vida dos dados: uma perspectiva a partir da ciência da informação. Informação & Informação, 21(2), 116. https://doi.org/10.5433/1981-8920.2016v21n2p116
Santos, P. L. V. A. da C., & Sant’Ana, R. C. G. (2015). Dado e Granularidade na perspectiva da Informação e Tecnologia: uma interpretação pela Ciência da Informação. Ciência da Informação, 42(2), 11.
Silberschatz, A., Korth, H. F., & Sudarshan, S. (2010). Database system concepts (6th ed.). New York: McGraw-Hill.