Introduction

Words matter. In conservation we select the terms we use in our documentation with specificity. The ‘knocked corner’ on a pamphlet is different from a ‘dog eared corner’. The identification of an adhesive mix used on a spine lining will impact success for later removal. Communicating clearly and specifically about the words we use in our treatment documentation is critical to our work.

Our word choices are not always the same from lab to lab, from specialty to specialty, or across languages. Variation is due to many factors including different traditions, training systems and specialty practices. Restricting word choices is not desirable and will not result in meaningful sharing of information.

However, it is possible to explain the meaning behind terms and associate words used locally (in a lab or organisation) with a well-defined concept. Aligning local words with concepts allows searching across labs, specialties and languages. Records produced using terms and concepts from one lab/vocabulary can be searched using terms and concepts from another. Through this repository a conservator will be able to use familiar terminology to search records produced by other labs, in a different specialty, or even a different language.

Taking part by contributing your terms means that you will be able to search conservation records using your own terminology (even if they have been produced with a different terminology).

Machine readable records

The group of technologies collectively described as Linked Data requires that conservation documentation be machine readable. Specific terms are matched to a concept (identified by a URI, see Encoding records). For example, the Getty’s Art & Architecture Thesaurus (AAT) includes an entry for super (URI: http://vocab.getty.edu/aat/300263577), a bookbinding term with many variations. By matching a local word to this AAT entry, one’s record can connect to other records that use the words ‘crash’ or ‘mull’ or even ‘boekbindersgaas’ to describe the same thing. Doing this matching at scale leads to what is called the ‘alignment’ of different vocabularies.
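
As a rough illustration of this kind of matching, the sketch below maps four local words to the AAT URI for super and uses that mapping to find records written with a different word. Only the AAT URI and the four words come from the text above; the record identifiers, the field name `spine_lining` and the function `find_records` are hypothetical.

```python
# Hypothetical alignment table: local word -> concept URI.
# Only the AAT URI for 'super' and the four words are taken from the text.
SUPER = "http://vocab.getty.edu/aat/300263577"
ALIGNMENT = {
    "super": SUPER,
    "crash": SUPER,
    "mull": SUPER,
    "boekbindersgaas": SUPER,
}

# Hypothetical treatment records from different labs, each using its own word.
RECORDS = [
    {"id": "lab-a/001", "spine_lining": "mull"},
    {"id": "lab-b/417", "spine_lining": "boekbindersgaas"},
    {"id": "lab-c/023", "spine_lining": "paper"},
]


def find_records(search_word, records, alignment):
    """Return ids of records whose word shares a concept with search_word."""
    concept = alignment.get(search_word.lower())
    if concept is None:
        return []  # the search word has not been aligned to any concept
    return [
        record["id"]
        for record in records
        if alignment.get(record["spine_lining"].lower()) == concept
    ]


print(find_records("crash", RECORDS, ALIGNMENT))
```

A conservator searching with ‘crash’ retrieves the records written with ‘mull’ and ‘boekbindersgaas’, although neither record contains the search word.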

Linked Conservation Data repository

The Linked Conservation Data (LCD) repository is a location where data about the alignment of conservation vocabularies is stored so that it may be used to create Linked Data for conservation treatment records and documentation created by conservators, conservation scientists, and others who preserve cultural heritage.

License

Content submitted to the repository falls under copyright law. Contributions to the repository should be provided without any constraints to allow the material submitted to be reused without any license fees. The proposed license for contributions without constraints to the repository is explained in Creative Commons Zero. Optionally, constraints outlined in the Creative Commons Attribution 4.0 International License are allowed. Specifying these constraints should be done following the instructions in section Constraints of use and it should include a sentence with the exact attribution text, if required.

Scope

This document provides guidance for contributing data about conservation vocabularies to the Linked Conservation Data repository. The guidelines proposed in this document do not depend on the scale of the vocabularies. They apply to short word-lists used in local databases and also to full hierarchical thesauri of thousands of terms.

This document also serves as an introduction to the reasons for using a repository with aligned vocabularies. It will explain the benefits of maintaining and using consistent internal terminology in one's lab. It will also provide both non-technical and technical steps to contributing to the repository.

Audience

This document is targeted primarily at conservators who are responsible for maintaining vocabularies regardless of scale. It is also targeted at conservators who are interested in establishing consistent use of terminology in their documentation system. In practical terms this means conservators who use lookup fields or check-boxes in databases or similar tools.

Resources

Various steps of the processes explained in this document require different levels of technical expertise.

Compiling lists of the words used in one’s conservation documentation and matching those vocabularies to other vocabularies can be done by any conservator, technician or intern, although often this work is best done by the staff who are responsible for creating the documentation. Similarly, defining new terms where matches do not exist in other vocabularies is also achievable by conservation staff.

Encoding this data in specific file formats and validating and submitting the data to the repository may require greater technical expertise. Undertaking processes on a large scale may also require technical expertise to achieve automation. Such expertise in cultural organisations can be found in departments supporting digital infrastructure, collection metadata departments or academic departments in computer and information science. These guidelines are meant to be followed by conservators. The guidelines offer ‘exit points’ where required steps which pose technical challenges can be undertaken by the repository maintainers.

Structure and use of this document

This document should be read sequentially if read for the first time. The technical complexity and the ideas addressed increase after each section. The document can then be used as a reference document.

The document outlines the types of vocabularies considered in LCD. It then provides guidelines for using the LCD repository as a host for conservation vocabularies whose content is not offered online as structured or Linked Data. This typically includes vocabularies published as text documents either in print or PDF files. The document continues with guidelines on how to provide data about the relationships of concepts and terms from different vocabularies to the LCD repository. It concludes by presenting example files that are expected in the LCD repository.

Types of vocabularies

This section outlines the types of vocabularies considered in LCD. Each type requires different processes in order to be shared effectively through the LCD repository. In sections Hosting vocabularies in LCD and Aligning vocabularies for LCD, the described processes refer to one or more of these types. They are listed here in order of increasing complexity of structure. The structure of each type can be produced by building upon the previous one. A more complex vocabulary is not necessarily of better quality, but in general it makes vocabulary data easier to use. These types are illustrated in a figure at the end of this section (figure 1).

  • Lists are used for looking up terms when filling in conservation records.
  • Glossaries are used when audiences looking up the vocabularies have different contextual understandings and therefore a word on its own does not necessarily convey the required meaning.
  • Thesauri are used for large vocabularies which are difficult to browse alphabetically, especially when the sought term is not known. Thesauri provide extra ways of navigating terms in addition to alphabetical order, such as term hierarchies and related terms (“see also”).

For a thorough explanation please refer here.

Lists

This primarily includes plain lists of terms (word-lists), without definitions/descriptions, which are used as lookup lists or options in structured records, e.g. as database lookup fields or tick-boxes in survey forms. They do not always consist of terms, they could also include other sequences of symbols (for example, drawing patterns for marking condition on photographs of objects). These lists are often local in scope, i.e. they apply to institutions or conservation labs.

A word-list does not indicate whether some terms are more general than others. For example, the term ‘oil’ is more general than the term ‘linseed oil’ since linseed oil is a specific type of oil. In a word-list there is no way of indicating this relationship between the two terms.

Glossaries

This includes word-lists with unambiguous descriptions for the use of a term. Sometimes these descriptions are called ‘scope notes’. These word-lists may include ‘used for’ notes to document synonymous terms that are used elsewhere. To make this possible, a crucial distinction needs to be made between the concept that the scope note describes and the term used to refer to that concept. The same concept can be represented by multiple terms. For example, the concept (from the AAT) of “any greasy substance that is liquid at room temperature and insoluble in water” can be referred to by the terms ‘oil’, ‘huile’ and ‘έλαιο’. These terms are also called ‘labels’ of the concept. Separating the concept from its terms/labels allows control of synonyms and equivalent terms. Note that other documents discussing terminology may use ‘term’ where this document uses ‘concept’.
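
The separation of a concept from its labels can be pictured as a small data structure. The sketch below is only one possible way to hold a glossary entry: the class `Concept`, its field names and the local identifier "20" are assumptions made for illustration, while the scope note and the three labels are those quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A glossary entry: one concept, one scope note, many labels."""
    concept_id: str                               # hypothetical local identifier
    scope_note: str
    labels: dict = field(default_factory=dict)    # language code -> label


oil = Concept(
    concept_id="20",
    scope_note=("any greasy substance that is liquid at room temperature "
                "and insoluble in water"),
    labels={"en": "oil", "fr": "huile", "el": "έλαιο"},
)


def build_label_index(concepts):
    """Map every label, in any language, to the ids of the concepts it names."""
    index = {}
    for concept in concepts:
        for label in concept.labels.values():
            index.setdefault(label.casefold(), []).append(concept.concept_id)
    return index


index = build_label_index([oil])
print(index["huile"], index["έλαιο"])
```

Because the index points from labels to concept identifiers, a lookup with any of the three labels reaches the same concept and therefore the same scope note.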

Thesauri

This includes glossaries which also feature standardised relationships between concepts. There may be different types of relationships in a thesaurus. Those more relevant to conservation thesauri are:

  • hierarchical relationships
    • broader/narrower: the relationship between a parent and child concept which indicates that the parent concept is more general (broader) and that the child concept is more specific (narrower). Concepts in a thesaurus are more general near the top of the hierarchy and become more specific further down each branch. For example, in bookbinding a ‘Byzantine endband’ is a specific type of ‘endband’ (broader relationship).
    • whole-part: the relationship between a parent and child concept which indicates that things described with the child concept are parts of things described with the parent concept. For example, again in bookbinding an ‘endband core’ is a component/part of an ‘endband’ (broader partitive relationship).
  • associative relationships
    • related: the relationship between two concepts which indicates relevance. For example, the concept ‘endband core’ may be related to the concept ‘leather’ since leather is often a material used for endband cores. Note that ‘leather’ is not a specific type of ‘endband’ and it is also not a part of an endband - it only signifies the relevant concept of the material that can be used to produce the endband and other binding components.

Thesauri ideally keep the types of these relationships consistent, for example they do not mix broader with broader-partitive in the same hierarchy, or they do not use related and broader interchangeably.
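
One way to keep relationship types from being mixed is to store each type in its own structure. The following sketch does this for the three bookbinding examples above; the variable and function names are hypothetical and the data is limited to those examples.

```python
# Each relationship type lives in its own table so that types are never mixed.
BROADER = {"Byzantine endband": "endband"}        # child -> more general parent
BROADER_PARTITIVE = {"endband core": "endband"}   # part -> whole
RELATED = {("endband core", "leather")}           # unordered associative pairs


def ancestors(concept, parent_of):
    """Walk up ONE hierarchy type and return the chain of parents."""
    chain = []
    seen = {concept}
    while concept in parent_of:
        concept = parent_of[concept]
        if concept in seen:       # guard against accidental cycles in the data
            break
        seen.add(concept)
        chain.append(concept)
    return chain


def is_related(a, b):
    """Associative links are symmetric, so check both orders."""
    return (a, b) in RELATED or (b, a) in RELATED


print(ancestors("Byzantine endband", BROADER))        # generic hierarchy
print(ancestors("endband core", BROADER_PARTITIVE))   # whole-part hierarchy
print(ancestors("endband core", BROADER))             # not in the generic one
print(is_related("leather", "endband core"))
```

Asking for the generic ancestors of ‘endband core’ returns nothing, because that concept is linked to ‘endband’ only through the whole-part hierarchy.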

Figure 1: Increasing complexity of structure in vocabularies. Note the separation of concepts and labels in glossaries and the additional links between concepts in thesauri.

Vocabularies expressed in SKOS

The Simple Knowledge Organisation System (SKOS) is recommended by the W3C for publishing vocabularies online and has been widely adopted. Such adoption means that there is a wealth of tools able to handle SKOS data. SKOS vocabularies feature Uniform Resource Identifiers (URIs) for each concept. A URI provides a unique reference to a concept in a uniquely identified domain (referred to as a namespace). For example, the concept ‘endbands’ as defined in the Language of Bindings thesaurus is uniquely identified as concept/2370 in the namespace https://w3id.org/lob/ and therefore it can be identified globally here: https://w3id.org/lob/concept/2370. In practical terms, this means that every concept of the vocabulary can be mapped to a web-address unique to that concept. However, the existence of a webpage for a concept is not required as long as the URI is unique and reserved for the concept. URIs perform an identical function to the Internationalised Resource Identifiers (IRIs) but the latter also permit non-ASCII characters which allows Greek, Chinese, Cyrillic, etc. characters to be used in the identifier.

SKOS also formalises relationships between concepts and relationships between concepts and lexical labels, such as:

  • skos:prefLabel, which links a concept with its preferred label
  • skos:altLabel, which links a concept with additional non-preferred labels
  • skos:scopeNote, which links a concept with its description text
  • skos:broader and skos:narrower, which allow establishing structured hierarchies of concepts
  • skos:related, which links a concept to a related concept

A full list of relationships formalised by SKOS can be found here: https://www.w3.org/TR/2009/REC-skos-reference-20090818/#vocab.

Aim

As part of the LCD effort for conservation vocabularies, our aim is to express word-lists, glossaries and thesauri as SKOS data to enable interoperability. This means building or adopting software applications that allow conservators interested in a concept and its associated records to find the terms other conservators use to describe that concept, together with the records produced with those terms. For example, by matching ‘endbands’ (Language of Bindings thesaurus, https://w3id.org/lob/concept/2370) with ‘headbands’ (Getty Art & Architecture Thesaurus, http://vocab.getty.edu/aat/300195163), a conservator searching with the Language of Bindings concept ‘endbands’ will also find results for the AAT concept ‘headbands’. The individual records do not need to be altered as long as this matching data is available; providing it is the purpose of this effort. This is illustrated in the following diagram (figure 2).
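
The effect of matching data on a search can be sketched as follows. The two concept URIs are the ones discussed in this document (the Language of Bindings URI is written as in section Vocabularies expressed in SKOS); the record identifiers, the third record and the functions `expand` and `search` are hypothetical. The sketch follows direct matches only and makes no claim about how the repository software will work.

```python
LOB_ENDBANDS = "https://w3id.org/lob/concept/2370"
AAT_HEADBANDS = "http://vocab.getty.edu/aat/300195163"

# Matching data of the kind held in the repository: pairs of matched concepts.
MATCHES = [(LOB_ENDBANDS, AAT_HEADBANDS)]

# Hypothetical records from two databases, each tagged with its own vocabulary.
RECORDS = [
    {"id": "db1/12", "concept": LOB_ENDBANDS},
    {"id": "db2/98", "concept": AAT_HEADBANDS},
    {"id": "db2/99", "concept": "http://vocab.getty.edu/aat/300263577"},
]


def expand(concept, matches):
    """Return the concept plus every concept directly matched to it."""
    expanded = {concept}
    for first, second in matches:
        if first == concept:
            expanded.add(second)
        elif second == concept:
            expanded.add(first)
    return expanded


def search(concept, records, matches):
    """Find records tagged with the concept or with a matched concept."""
    wanted = expand(concept, matches)
    return [record["id"] for record in records if record["concept"] in wanted]


print(search(LOB_ENDBANDS, RECORDS, MATCHES))
```

The records themselves are never modified: only the search is widened by consulting the matching data.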

Figure 2: Illustration of the role of the LCD repository, primarily holding data about concept matching across vocabularies and also as a potential host of vocabularies where such capacity does not exist.

Hosting vocabularies in LCD

This section of the guidelines is for vocabulary maintainers who do not have the resources to publish their vocabularies online as SKOS / Linked Data. This could be maintainers who have produced a vocabulary for their organisation, e.g. in a text document, but do not have access to a web-server, or to the technical expertise and support needed to convert it into an encoding/structure that can be used for Linked Data. The following sections outline the tasks required to do so depending on the type of vocabulary processed.

Deciding on meaning

Applies to

This process applies to word-lists.

Purpose

To clarify the context within which terms should be used, especially when they are ambiguous.

How

This requires a survey of the records produced with the word-list to confirm how the terms are used in practice, followed by writing a (short) description of the meaning of each term. This description is known as a ‘scope note’. It is not meant to be an accurate definition, but it should be clear enough to encapsulate the overall concept. For a term that has been used to mean different things in different records (polysemy), a copy of the term is needed for each meaning, followed by a qualifier to avoid confusion. For example, ‘lining’ may refer both to the process of strengthening an object and to the component added. These two are given different entries in the AAT: lining (process) and lining (material) (see also section Terms with multiple concepts).
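
The handling of polysemous terms can be sketched as a small check that every term with more than one meaning carries a qualifier. The entries below are hypothetical survey results: the scope note texts are written for illustration and are not quoted from the AAT, and the function name `unique_labels` is an assumption.

```python
from collections import Counter

# Hypothetical survey result: (term, qualifier, scope note).
# The scope notes are illustrative wording, not quotations from any vocabulary.
ENTRIES = [
    ("lining", "process", "Strengthening an object by adding a support."),
    ("lining", "material", "The component added to strengthen an object."),
    ("drier", "", "Material added to a drying oil to accelerate hardening."),
]


def unique_labels(entries):
    """Append the qualifier to every term that has more than one meaning."""
    counts = Counter(term for term, _qualifier, _note in entries)
    labels = []
    for term, qualifier, _note in entries:
        if counts[term] > 1:
            if not qualifier:
                raise ValueError(f"'{term}' has several meanings and needs a qualifier")
            labels.append(f"{term} ({qualifier})")
        else:
            labels.append(term)
    return labels


print(unique_labels(ENTRIES))
```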

Output

A list of terms with unique labels and associated scope notes where necessary. This is broadly the case for the Painting Conservation Glossary from the Smithsonian Conservation Institute available here.

Encoding records

Applies to

This process is required for vocabularies held in formats which cannot be processed easily by software to separate labels, scope notes and relationships, to find relevant terms, or to distinguish conflicting uses. Typically this includes vocabularies in print or typeset in PDF files. It may also include vocabularies in text which is partially or inconsistently tagged in wiki-type websites. It may also include vocabularies in text which rely on the textual narrative to communicate labels, scope notes and relationships. Another obstacle of formats which cannot be processed easily may be the fact that terms and concepts are dispersed across documents or resources and their grouping cannot be done automatically.

Purpose

  1. To separate vocabulary information into labels, scope notes and relationships
  2. To produce a consistent list of relevant concepts with their associated labels, scope notes and related concepts

How

Methods depend on the format. A simple but time-consuming method is transcribing text into a spreadsheet or database form by hand.

More complex methods may require scraping websites and automatically identifying tagged text of interest. The process involves writing a script to load webpages holding vocabulary information, extracting it and storing it in a structured document. An example of doing this on the Smithsonian Painting Conservation Glossary using a script can be found in section: Encoding Python script. In other cases it may require transforming tagged text to a new structure, for example using XSLT to simplify an elaborate HTML page. Tools such as Tabula can help with extracting records from a PDF file.
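
A minimal sketch of such an extraction script is shown below, using only the Python standard library. It assumes, purely for illustration, that each term is marked up as a `dt` element followed by its description in a `dd` element; the real glossary page may be structured differently, and loading the page from the web is left out so that the example runs on an inline sample.

```python
from html.parser import HTMLParser

# Inline sample standing in for a downloaded page (the markup is an assumption).
SAMPLE_HTML = """
<dl>
  <dt>drier</dt>
  <dd>Any catalytic material which when added to a drying oil
      accelerates drying or hardening of the film.</dd>
  <dt>linseed oil</dt>
  <dd>The most popular drying oil used as paint medium.</dd>
</dl>
"""


class GlossaryParser(HTMLParser):
    """Collect the text of <dt> elements as terms and <dd> as scope notes."""

    def __init__(self):
        super().__init__()
        self.entries = []     # list of {"term": ..., "scope_note": ...}
        self._field = None    # field that the current text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.entries.append({"term": "", "scope_note": ""})
            self._field = "term"
        elif tag == "dd" and self.entries:
            self._field = "scope_note"

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.entries[-1][self._field] += data


def extract_entries(html):
    """Return structured entries with whitespace collapsed."""
    parser = GlossaryParser()
    parser.feed(html)
    parser.close()
    return [{key: " ".join(value.split()) for key, value in entry.items()}
            for entry in parser.entries]


for entry in extract_entries(SAMPLE_HTML):
    print(entry["term"], "|", entry["scope_note"])
```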

It is likely that the process of encoding is simplified when identifiers are used for concepts and possibly for labels. These identifiers would offer unambiguous references to concepts and labels at local level. Maintainers should consider the next section (Producing URIs) before establishing local identifiers during encoding.

Output

A computer file with structured data corresponding to the concepts, labels, scope notes and relationships of the vocabulary. For example, encoding/transforming the webpage of the Smithsonian Painting Conservation Glossary could result in this table:

concept id | term | scope note | broader concept | related concept
20 | oil | A general term for a water-insoluble viscous liquid. | |
5 | drier | Any catalytic material which when added to a drying oil accelerates drying or hardening of the film. | |
15 | linseed oil | The most popular drying oil used as paint medium. The medium hardens over several weeks as components of the oil polymerize to form an insoluble matrix. Driers can be added to accelerate this process. | 20 | 5

Or an XML file with XML elements:

<concept>
  <id>20</id>
  <term>oil</term>
  <scopeNote>A general term for a water-insoluble viscous liquid.</scopeNote>
</concept>
<concept>
  <id>5</id>
  <term>drier</term>
  <scopeNote>Any catalytic material which when added to a drying oil accelerates drying or hardening of the film.</scopeNote>
</concept>
<concept>
  <id>15</id>
  <term>linseed oil</term>
  <scopeNote>The most popular drying oil used as paint medium.  The medium hardens over several weeks as components of the oil polymerize to form an insoluble matrix. Driers can be added to accelerate this process.</scopeNote>
  <broader>20</broader>
  <related>5</related>
</concept>
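
The step from tabular rows to an XML file of this shape can be sketched with the standard library. The element names follow the example above; the wrapping root element `concepts` is an assumption (an XML file needs a single root element), and the scope note for linseed oil is shortened here to its first sentence.

```python
import xml.etree.ElementTree as ET

# Rows as they might come out of a spreadsheet export.
ROWS = [
    {"id": "20", "term": "oil",
     "scopeNote": "A general term for a water-insoluble viscous liquid."},
    {"id": "5", "term": "drier",
     "scopeNote": "Any catalytic material which when added to a drying oil "
                  "accelerates drying or hardening of the film."},
    {"id": "15", "term": "linseed oil",
     "scopeNote": "The most popular drying oil used as paint medium.",
     "broader": "20", "related": "5"},
]

FIELDS = ("id", "term", "scopeNote", "broader", "related")


def rows_to_xml(rows):
    """Build one <concept> element per row inside a hypothetical root element."""
    root = ET.Element("concepts")
    for row in rows:
        concept = ET.SubElement(root, "concept")
        for name in FIELDS:
            if row.get(name):               # empty cells produce no element
                ET.SubElement(concept, name).text = row[name]
    return ET.tostring(root, encoding="unicode")


print(rows_to_xml(ROWS))
```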

Producing URIs

Applies to

This process is required for all vocabularies which do not already provide URIs (see Vocabularies expressed in SKOS) for each of their concepts.

Ideally URIs should be created and maintained long-term as described in this section by the vocabulary maintainer. If the technicalities of creating the URIs make the process too resource intensive, then contact the LCD repository maintainers for suggestions.

Purpose

To provide unique identifiers and unambiguous reference points for concepts at a global scope.

How

LCD requires that a vocabulary concept has a single URI. Concepts that are updated in later versions of vocabularies should maintain the URIs from earlier versions. Updates to scope notes should not change the meaning of the concept but instead explain it in a better way. If the meaning does change, maintainers should consider creating new concepts while keeping the old ones. Using a different URI for a concept means that we are referring to a different concept. URIs used to refer to different versions of the whole vocabulary may change when the vocabulary is updated.
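
The requirement that concept URIs persist across versions can be checked mechanically before a new version is released. The sketch below compares two versions of a vocabulary; both versions and the function name `missing_uris` are hypothetical, and the URIs follow the pattern explained later in this section.

```python
def missing_uris(old_concepts, new_concepts):
    """Return concept URIs present in the old version but absent from the new."""
    return sorted(set(old_concepts) - set(new_concepts))


# Hypothetical versions of a vocabulary: concept URI -> preferred label.
VERSION_1 = {
    "https://w3id.org/spg/concept/20": "oil",
    "https://w3id.org/spg/concept/5": "drier",
}
VERSION_2 = {
    "https://w3id.org/spg/concept/20": "oil",
    "https://w3id.org/spg/concept/15": "linseed oil",
}

dropped = missing_uris(VERSION_1, VERSION_2)
if dropped:
    print("URIs that must be kept in the new version:", dropped)
```

Here the check reports that the URI for ‘drier’ has disappeared, which signals that the concept should be restored under its original URI rather than silently dropped or renumbered.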

If the host organisation for the vocabulary has an existing practice for producing and maintaining URIs, then it is recommended to follow that practice. If there is no such practice then URIs should be produced as explained next.

URIs for vocabularies

The following patterns for URI production can be applied to any namespace. Vocabulary maintainers or host organisations can use any namespace they are committed to manage. Users can be redirected from that namespace to a location presenting information about the vocabulary if this is at a different place. In practice this means that the URIs point to one server and each one of them is then passed to another server at request. This is seamless to the end user. This redirection is beneficial because it allows another host organisation to take over the management of the vocabulary without affecting the original URIs which remain the same (persistent). The cost for this flexibility is the requirement for managing the redirection server.

If you do not want to use your own namespace or cannot afford managing the redirection server, an alternative option is using the https://w3id.org namespace. This is managed by a consortium of partners who are committed to providing persistent redirection services. Other redirection services specific for conservation may become available.

The URIs for vocabularies should look like this:

https://w3id.org/vocabularyName

and for their individual versions:

https://w3id.org/vocabularyName/version/version

Where:

vocabularyName is the full name or abbreviation of the vocabulary. For example, the vocabularyName of the Language of Bindings thesaurus is lob. Choosing the vocabularyName will require reviewing the w3id repository for availability of the proposed name.

version is the identifier or name or number of the corresponding version of the vocabulary. Note that this follows /version/ in the URI to indicate to human agents that this corresponds to a dataset.

URIs for concepts

The URIs for concepts should look like this:

https://w3id.org/vocabularyName/concept/conceptId

Where:

vocabularyName as before.

conceptId is the local identifier of the concept as produced during encoding (see Encoding records). Note that this follows /concept/ in the URI to indicate to a human agent that this corresponds to a vocabulary entry.
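
The three URI patterns can be written as small helper functions. This is only a sketch: the function names are hypothetical, and percent-encoding the path segments is an added precaution rather than something these guidelines prescribe.

```python
from urllib.parse import quote

BASE = "https://w3id.org"


def vocabulary_uri(vocabulary_name):
    """URI of the vocabulary as a whole."""
    return f"{BASE}/{quote(str(vocabulary_name), safe='')}"


def version_uri(vocabulary_name, version):
    """URI of one version (dataset) of the vocabulary."""
    return f"{vocabulary_uri(vocabulary_name)}/version/{quote(str(version), safe='')}"


def concept_uri(vocabulary_name, concept_id):
    """URI of a single concept, built from its local identifier."""
    return f"{vocabulary_uri(vocabulary_name)}/concept/{quote(str(concept_id), safe='')}"


print(vocabulary_uri("lob"))
print(version_uri("lob", "1.0"))
print(concept_uri("lob", 2370))
```

The last call reproduces the Language of Bindings URI for ‘endbands’ used earlier in this document; the version "1.0" is a made-up value.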

Managing w3id.org URIs

The w3id.org project offers redirection for the URIs of a vocabulary. The w3id server accepts a request for a URI, follows the redirection rules set by the vocabulary maintainer and sends the request to the server named in the matching rule. This means that if the vocabulary concepts are held, for example, on LCD servers and at some point the maintainer decides to run their own server for the vocabulary, the URIs are not affected, since the redirection rule on w3id.org can be modified to send requests to the new server. The information at https://w3id.org/ explains how to set up redirection rules in the w3id.org GitHub repository. If individual web pages/requests for the concepts of the vocabulary cannot be served, it is advised that the name of the vocabulary is submitted to w3id.org with a simple redirection of every URI to a holding page, so that the address w3id.org/vocabularyName redirects to a valid page.

Output

Producing the URIs and associating them with the concepts improves the output described in section Encoding records by replacing the local identifiers of concepts with global ones (note that this also applies to links to broader and related concepts). For example, if we assign a vocabulary name ‘spg’ to the Smithsonian Painting Conservation Glossary, we have:

concept URI | term | scope note | broader concept | related concept
https://w3id.org/spg/concept/20 | oil | A general ... liquid. | |
https://w3id.org/spg/concept/5 | drier | Any ... the film. | |
https://w3id.org/spg/concept/15 | linseed oil | The most ... process. | https://w3id.org/spg/concept/20 | https://w3id.org/spg/concept/5

Or an XML file with tags:

<concept>
  <conceptUri>https://w3id.org/spg/concept/20</conceptUri>
  <term>oil</term>
  <scopeNote>A general term for a water-insoluble viscous liquid.</scopeNote>
</concept>
<concept>
  <conceptUri>https://w3id.org/spg/concept/5</conceptUri>
  <term>drier</term>
  <scopeNote>Any catalytic material which when added to a drying oil accelerates drying or hardening of the film.</scopeNote>
</concept>
<concept>
  <conceptUri>https://w3id.org/spg/concept/15</conceptUri>
  <term>linseed oil</term>
  <scopeNote>The most popular drying oil used as paint medium.  The medium hardens over several weeks as components of the oil polymerize to form an insoluble matrix. Driers can be added to accelerate this process.</scopeNote>
  <broader>https://w3id.org/spg/concept/20</broader>
  <related>https://w3id.org/spg/concept/5</related>
</concept>
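
Replacing local identifiers with global ones, including those used in links, can be sketched as follows. The record layout and the function name `globalise` are assumptions; the sketch also checks that every broader or related link points to a concept that exists in the vocabulary, which is a suggested safeguard rather than a stated requirement.

```python
def globalise(records, vocabulary_name):
    """Replace local concept ids, and links to them, with w3id.org URIs."""
    def uri(local_id):
        return f"https://w3id.org/{vocabulary_name}/concept/{local_id}"

    known = {record["id"] for record in records}
    result = []
    for record in records:
        links = record.get("broader", []) + record.get("related", [])
        unknown = [target for target in links if target not in known]
        if unknown:
            raise ValueError(f"concept {record['id']} links to unknown ids {unknown}")
        result.append({
            "conceptUri": uri(record["id"]),
            "term": record["term"],
            "broader": [uri(target) for target in record.get("broader", [])],
            "related": [uri(target) for target in record.get("related", [])],
        })
    return result


# Local records as produced in section Encoding records (scope notes omitted).
LOCAL = [
    {"id": "20", "term": "oil"},
    {"id": "5", "term": "drier"},
    {"id": "15", "term": "linseed oil", "broader": ["20"], "related": ["5"]},
]

for row in globalise(LOCAL, "spg"):
    print(row)
```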

Exporting to SKOS

Applies to

This process applies to all vocabularies submitted to the LCD repository.

Ideally the output from the previous section should be exported to SKOS as described in this section. If the technicalities of exporting to SKOS make the process too resource intensive, then LCD accepts submissions of tabular data from the last section’s output for upload to the LCD repository (see How to upload to the LCD repository).

Purpose

To convert vocabulary data into a consistent SKOS syntax.

How

This process involves taking the output as described in section Producing URIs and encoding it into a format widely recognised by relevant software (a good test is to check whether it is parsable by the OWLAPI library). This could result in the Turtle syntax for encoding SKOS data using the principles of the Resource Description Framework (RDF). Software like SKOS Play can assist with transforming CSV files into SKOS Turtle. Depending on the availability of tools and familiarity with them, maintainers may choose alternative tools such as 3M (for XML), Karma and STELETO. This resource may also be useful: https://www.w3.org/wiki/ConverterToRdf. A scenario illustrating conversion of CSV vocabulary data to SKOS uses this example spreadsheet with some of the Smithsonian Painting Conservation Glossary: here.

Output

To continue with the Smithsonian Painting Conservation Glossary example, the output of exporting to SKOS is the following file in turtle syntax:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix spgc: <https://w3id.org/spg/concept/> .

<https://w3id.org/spg/> a skos:ConceptScheme;
  dc:creator "Smithsonian Institution"@en;
  dct:rights <https://creativecommons.org/licenses/by/4.0/>;
  dct:title "Smithsonian Painting Conservation Glossary"@en, "Vocabulaire Smithsonien pour la restauration des peintures"@fr;
  skos:hasTopConcept spgc:20, spgc:5 .

spgc:20 a skos:Concept;
  skos:inScheme <https://w3id.org/spg/>;
  skos:narrower spgc:15;
  skos:prefLabel "huile"@fr, "oil"@en;
  skos:topConceptOf <https://w3id.org/spg/> .

spgc:15 a skos:Concept;
  skos:broader spgc:20;
  skos:inScheme <https://w3id.org/spg/>;
  skos:prefLabel "l'huile de lin"@fr, "linseed oil"@en;
  skos:related spgc:5 .

spgc:5 a skos:Concept;
  skos:inScheme <https://w3id.org/spg/>;
  skos:prefLabel "drier"@en, "siccatif"@fr;
  skos:topConceptOf <https://w3id.org/spg/> .
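
For maintainers who prefer scripting to the tools listed above, the conversion from tabular data to Turtle can be sketched with the Python standard library alone. The CSV column layout below is hypothetical (it is not the layout expected by SKOS Play), and the sketch writes only labels, skos:inScheme, skos:broader and skos:related; it does not derive skos:narrower or top-concept statements, and its output should still be checked with an RDF parser.

```python
import csv
import io

# Hypothetical CSV layout: one row per concept, one label column per language.
CSV_TEXT = """uri,prefLabel@en,prefLabel@fr,broader,related
https://w3id.org/spg/concept/20,oil,huile,,
https://w3id.org/spg/concept/5,drier,siccatif,,
https://w3id.org/spg/concept/15,linseed oil,l'huile de lin,https://w3id.org/spg/concept/20,https://w3id.org/spg/concept/5
"""
SCHEME = "https://w3id.org/spg/"


def literal(text, lang):
    """Write a language-tagged Turtle string, escaping special characters."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"@{lang}'


def csv_to_turtle(csv_text, scheme):
    """Convert the CSV rows into SKOS statements in Turtle syntax."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{scheme}> a skos:ConceptScheme .",
    ]
    for row in csv.DictReader(io.StringIO(csv_text)):
        statements = [f"skos:inScheme <{scheme}>"]
        for column, value in row.items():
            if column == "uri" or not value:
                continue
            if column.startswith("prefLabel@"):
                lang = column.split("@", 1)[1]
                statements.append(f"skos:prefLabel {literal(value, lang)}")
            elif column in ("broader", "related"):
                statements.append(f"skos:{column} <{value}>")
        lines.append("")
        lines.append(f"<{row['uri']}> a skos:Concept ;")
        lines.append("  " + " ;\n  ".join(statements) + " .")
    return "\n".join(lines) + "\n"


print(csv_to_turtle(CSV_TEXT, SCHEME))
```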

Packaging the dataset

Applies to

This process applies to all vocabularies submitted to the LCD repository.

Ideally the output from the previous section should be packaged with version information as explained in this section. If the technicalities of packaging make the process too resource intensive, then the repository maintainers can help. Alternatively, provide clear versioning information and a separate LICENSE file as explained in How to upload to the LCD repository.

Purpose

To format vocabulary data as versioned datasets. This is useful because versioned vocabularies allow keeping track of updates to concepts and their links.

How

This process involves taking the output as described in section Exporting to SKOS and assigning the version of the produced dataset. Versioning and provenance metadata should be included in the same dataset file as vocabulary data. We propose the following:

  • that the vocabulary version URI (see Producing URIs) is used as a dataset identifier
  • that the version (or other provenance) information is related to the dataset identifier using relationships provided by the Dublin Core (DC), RDF Schema (RDFS) and Web Ontology Language (OWL) schemas

A simple way to do that is to use the TriG syntax with minimal changes to the output shown in the Exporting to SKOS section.

Note that the automatic transformation of CSV to TriG using SKOSPlay does not produce a valid output.

Output

The TriG encoding of the provenance information for the Smithsonian Painting Conservation Glossary should look like this:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix spgc: <https://w3id.org/spg/concept/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://w3id.org/spg/version/1.0> {
  <https://w3id.org/spg/version/1.0> a void:Dataset ;
    owl:versionInfo "1.0" ;
    dct:issued "2020-01-01T12:00:00Z"^^xsd:dateTime ;
    dct:rights <https://creativecommons.org/licenses/by/4.0/> ;
    dc:creator "Smithsonian Institution" ;
    dct:created "2015-01-01T12:00:00Z"^^xsd:dateTime .

  <https://w3id.org/spg/> a skos:ConceptScheme;
    dc:creator "Smithsonian Institution"@en;
    dct:rights <https://creativecommons.org/licenses/by/4.0/>;
    dct:title "Smithsonian Painting Conservation Glossary"@en, "Vocabulaire Smithsonien pour la restauration des peintures"@fr;
    skos:hasTopConcept spgc:20, spgc:5 .

  spgc:20 a skos:Concept;
    skos:inScheme <https://w3id.org/spg/>;
    skos:narrower spgc:15;
    skos:prefLabel "huile"@fr, "oil"@en;
    skos:topConceptOf <https://w3id.org/spg/> .

  spgc:15 a skos:Concept;
    skos:broader spgc:20;
    skos:inScheme <https://w3id.org/spg/>;
    skos:prefLabel "l'huile de lin"@fr, "linseed oil"@en;
    skos:related spgc:5 .

  spgc:5 a skos:Concept;
    skos:inScheme <https://w3id.org/spg/>;
    skos:prefLabel "drier"@en, "siccatif"@fr;
    skos:topConceptOf <https://w3id.org/spg/> .
}
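
The 'minimal changes' needed to turn a Turtle file into a versioned TriG dataset can be sketched as a text transformation: keep the prefix declarations at the top, add the prefixes needed for the metadata, and place the metadata and the original statements inside a named graph. The function name and the sample input are hypothetical, the version URI in the usage example follows the pattern given in URIs for vocabularies, and the sketch assumes the Turtle contains no multi-line string literals. As with any generated file, the result should be validated with an RDF parser.

```python
def package_as_trig(turtle, dataset_uri, version, issued):
    """Wrap Turtle statements in a named graph and add version metadata."""
    all_lines = turtle.splitlines()
    prefixes = [line for line in all_lines if line.startswith("@prefix")]
    body = [line for line in all_lines if not line.startswith("@prefix")]
    body = "\n".join(body).strip("\n").splitlines()

    # Add the prefixes used by the metadata if they are not declared yet.
    declared = {line.split()[1] for line in prefixes}
    needed = {
        "dct:": "http://purl.org/dc/terms/",
        "owl:": "http://www.w3.org/2002/07/owl#",
        "void:": "http://rdfs.org/ns/void#",
        "xsd:": "http://www.w3.org/2001/XMLSchema#",
    }
    for prefix, namespace in needed.items():
        if prefix not in declared:
            prefixes.append(f"@prefix {prefix} <{namespace}> .")

    metadata = [
        f"<{dataset_uri}> a void:Dataset ;",
        f'  owl:versionInfo "{version}" ;',
        f'  dct:issued "{issued}"^^xsd:dateTime .',
        "",
    ]
    # Indent everything that goes inside the named graph.
    graph = [("  " + line if line.strip() else "") for line in metadata + body]
    return "\n".join(prefixes + ["", f"<{dataset_uri}> {{"] + graph + ["}", ""])


# Shortened, hypothetical Turtle input.
TURTLE = """@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix spgc: <https://w3id.org/spg/concept/> .

spgc:20 a skos:Concept ;
  skos:prefLabel "oil"@en .
"""

print(package_as_trig(TURTLE, "https://w3id.org/spg/version/1.0",
                      "1.0", "2020-01-01T12:00:00Z"))
```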

Aligning vocabularies for LCD

These guidelines are for vocabulary maintainers and researchers who wish to align conservation terminology across different vocabularies to facilitate joint searching of conservation records from different databases. A common scenario of alignment is the target-driven alignment where concepts are matched between two vocabularies where one is considered the source and the second the target. The first step is identifying the two vocabularies.

Target vocabularies

Backbone thesaurus

Vocabularies in conservation often cover terminology about the technology and condition of objects and, more rarely, terminology about treatment. From a knowledge organisation point of view it is good practice to separate concepts into broad categories so that records produced using these concepts can be automatically classified under those categories. These broad categories can be considered as top concepts in hierarchies of structured vocabularies. It is important for interoperability to ensure that conservation vocabularies can a) be semantically aggregated and organised for the conservators’ needs, and b) be part of a wider universe of terminologies in humanities and cultural heritage. LCD has chosen the Backbone thesaurus (BBT) as an overarching thesaurus that can accommodate vocabularies from any conservation source or other discipline. The BBT is a generic skeleton thesaurus providing the necessary broad categories, with generic universal concepts, under which the top-level concepts of the AAT (see below) can be placed, as well as those of other vocabularies that cannot be aligned to the AAT. By matching top concepts of vocabularies with the AAT and BBT categories we ensure interoperability with vocabularies in conservation and other fields.

Getty Art & Architecture Thesaurus

LCD has chosen the Getty Art & Architecture Thesaurus (AAT) as a hub for alignment of conservation terminology, and we encourage vocabulary maintainers to attempt to match their terms with it. By matching concepts of vocabularies to AAT concepts we ensure that the hub is a common reference point for terminology in conservation. This is also the most efficient way of matching concepts across different vocabularies.

Missing terms in AAT

While the AAT is the hub thesaurus for LCD, its coverage may not be adequate in some areas of conservation. In such cases new concepts will need to be submitted for inclusion in the AAT. The LCD consortium can submit terms to the AAT on behalf of vocabulary maintainers. To initiate this process, vocabulary maintainers can upload a CSV file (see How to upload to the LCD repository) with the concept id, preferred label, scope note and proposed AAT broader term, alongside a bibliographical reference showing the concept being used with that label. A template file for this can be found here. Independent submissions of new terms to the AAT can also be made here (requires an account). More information about this process can be found here. Please note that contributing concepts to the AAT requires agreeing to the Getty Data Contribution and License Agreement in addition to the recommended LCD licenses.
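As a rough illustration, a submission file with the fields listed above could be written with Python's csv module. The column headings and the placeholder values in angle brackets are assumptions made for this sketch; the LCD template defines the actual layout.

```python
import csv
import io

# Assumed column headings; the official LCD template may name them differently.
FIELDS = [
    "concept id",
    "preferred label",
    "scope note",
    "proposed AAT broader term",
    "bibliographic reference",
]


def aat_submission_csv(rows):
    """Return CSV text for a list of dicts keyed by FIELDS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row: only the concept id and label come from the SPG example.
example_rows = [
    {
        "concept id": "https://w3id.org/spg/concept/5",
        "preferred label": "drier",
        "scope note": "<scope note text>",
        "proposed AAT broader term": "<URI of the proposed AAT broader term>",
        "bibliographic reference": "<reference showing the label in use>",
    }
]

submission_text = aat_submission_csv(example_rows)
print(submission_text)
```

Quoting every field keeps commas and quotation marks inside scope notes from breaking the column structure.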

Choosing individual vocabularies

Matching two individual vocabularies directly with each other is discouraged. It can be done in exceptional cases, when a project requires direct comparison of terminologies. Where possible, effort for terminology matching should be directed to matching with the AAT.

SKOS matching properties

Matching two concepts involves producing a statement linking the two in a particular way. For example, a concept from one vocabulary may be broader than a concept from another, or a concept from one vocabulary (spgc:20 [oil]) may be very similar to, or exactly the same as, a concept from another (aat:300014254 [oil (organic material)]). SKOS formalises these links into a set of properties. In the last example this would be:

spgc:20 skos:exactMatch aat:300014254

The SKOS properties that vocabulary maintainers are encouraged to use for matching are described next.

Hierarchical properties

These are the properties skos:broadMatch and skos:narrowMatch, which serve the same purpose as skos:broader and skos:narrower discussed in the section SKOS vocabularies, except that they apply to concepts across different vocabularies.

Equivalence properties

These include the properties skos:closeMatch and skos:exactMatch. Official SKOS documentation provides the following description for these properties: “A skos:closeMatch link indicates that two concepts are sufficiently similar that they can be used interchangeably in some information retrieval applications. A skos:exactMatch link indicates a high degree of confidence that two concepts can be used interchangeably across a wide range of information retrieval applications.” For example, when ranking results from queries, two concepts from different vocabularies linked with skos:exactMatch may be ranked higher than those linked with skos:closeMatch. The choice of equivalence property depends on the confidence level of the conservator reading the two scope notes for the concepts.

Associative properties

This is the property skos:relatedMatch, which can be used by a conservator who sees value in redirecting colleagues from the current concept to another that they think will be of interest. Note that skos:related is used between concepts in the same vocabulary, whereas skos:relatedMatch is used between concepts of two different vocabularies.
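The three groups of properties can be summarised in a small lookup, sketched below in Python. The function name match_statement and the grouping dictionary are illustrative only; they are not part of SKOS or of any LCD tool.

```python
# The five SKOS mapping properties described above, grouped as in the text.
MAPPING_PROPERTIES = {
    "hierarchical": ("skos:broadMatch", "skos:narrowMatch"),
    "equivalence": ("skos:closeMatch", "skos:exactMatch"),
    "associative": ("skos:relatedMatch",),
}

ALLOWED = {prop for group in MAPPING_PROPERTIES.values() for prop in group}


def match_statement(source, prop, target):
    """Format one matching statement, rejecting non-mapping properties."""
    if prop not in ALLOWED:
        raise ValueError(f"{prop} is not a SKOS mapping property")
    return f"{source} {prop} {target} ."


print(match_statement("spgc:20", "skos:exactMatch", "aat:300014254"))
# prints: spgc:20 skos:exactMatch aat:300014254 .
```

Passing skos:related or skos:broader raises an error here, because those properties link concepts within one vocabulary rather than across two.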

Matching

Applies to

This task applies to vocabulary maintainers and conservators working with documentation records from multiple vocabularies.

Purpose

To identify the concepts from different vocabularies which are the same or similar.

How

The process can be undertaken manually in general-purpose software such as spreadsheet editors (for example, see the recommended template for SKOS Play). It can be more efficient to use specialised vocabulary matching software. Such software typically accesses the two vocabularies being matched and asks the user which SKOS property is appropriate for each match. The LCD consortium proposes the following tools for vocabulary matching:

  • VisTA allows matching individual concepts as well as hierarchies of concepts between two SKOS vocabularies. Based on versioning metadata in the vocabulary datasets, the tool allows verification of existing matches following vocabulary updates (including possible relocations of concepts in hierarchies). VisTA can handle SKOS data.
  • VMT allows matching individual concepts with concepts of the AAT thesaurus. VMT runs in a web browser window and can handle tabular data (CSV) as explained here.
  • OpenRefine is a general purpose tool for managing tabular data (CSV) which includes support for automatic matching of concepts based on labels. Matching with the AAT thesaurus can be done by following the instructions here. These require an instance of OpenRefine installed locally.

Output

The output of this process is a separate file with the matching statements between the two vocabularies. VisTA produces matching records in SKOS TriG files which are ready to be uploaded to the LCD repository. VMT produces CSV files which can be converted to SKOS as explained in the sections Exporting to SKOS and Packaging SKOS. Other tools and formats may be used for this process, including SSSOM (Simple Standard for Sharing Ontological Mappings), which provides a standard for mapping across vocabularies. Nevertheless, the resulting file should follow the same format.

A URI has to be created to identify the produced alignment result and may look like this: https://w3id.org/vocab/align/local_vocabularyName-to-remote_vocabularyName/version
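A minimal sketch of building such a URI in Python follows. The helper name alignment_uri and its parameters are hypothetical, and the base URI is assumed to be the vocabulary's own w3id.org namespace.

```python
def alignment_uri(base, local_name, remote_name, version):
    """Build an alignment URI of the form base/align/local-to-remote/version."""
    return f"{base.rstrip('/')}/align/{local_name}-to-{remote_name}/{version}"


print(alignment_uri("https://w3id.org/spg", "spg", "aat", 1))
# prints: https://w3id.org/spg/align/spg-to-aat/1
```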

Ideally the output from matching should be submitted as a SKOS TriG syntax file as explained next. If the technicalities of producing such a file make the process too resource intensive, then tabular data in the form of a CSV file can be submitted (see Uploading to LCD).

For the above example the TriG syntax may look like this:

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix spg: <https://w3id.org/spg/> .
@prefix spgc: <https://w3id.org/spg/concept/> .
@prefix aat: <http://vocab.getty.edu/aat/> .

<https://w3id.org/spg/align/spg-to-aat/1> {

  <https://w3id.org/spg/align/spg-to-aat/1> 
    dct:identifier "trig-example-1" ;
    dc:creator "Ceri";
    dc:subject "alignment";
    dct:created "2020-05-05T12:00:00.000+01:00"^^xsd:dateTime ;
    dct:description "Example mappings (1) in TriG format"@en .    

  spgc:20 skos:exactMatch aat:300014254 .
  spgc:15 skos:exactMatch aat:300014292 .
  spgc:5 skos:closeMatch aat:300014732 .

}
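For maintainers who keep their matches in a table, the conversion to a TriG named graph can be sketched in a few lines of Python. The CSV column names (source, property, target) are assumptions for this sketch and not the LCD template headings, and the metadata statements from the example above are omitted for brevity.

```python
import csv
import io

# Prefixes used by the matching statements in the example above.
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "spgc": "https://w3id.org/spg/concept/",
    "aat": "http://vocab.getty.edu/aat/",
}


def matches_to_trig(csv_text, graph_uri):
    """Wrap source/property/target rows in a TriG named graph."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines += ["", f"<{graph_uri}> {{"]
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.append(f"  {row['source']} {row['property']} {row['target']} .")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical CSV layout; the three rows repeat the matches shown above.
CSV_TEXT = """source,property,target
spgc:20,skos:exactMatch,aat:300014254
spgc:15,skos:exactMatch,aat:300014292
spgc:5,skos:closeMatch,aat:300014732
"""

trig_text = matches_to_trig(CSV_TEXT, "https://w3id.org/spg/align/spg-to-aat/1")
print(trig_text)
```

The sketch does no validation of the property names, so a check against the SKOS mapping properties would be a sensible addition before submitting the result.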

How to upload to the LCD repository

Accepted formats

LCD should ideally hold SKOS files (preferably using the TriG syntax). If producing such files is technically too demanding, then tabular data in CSV format can be used instead. LCD repository maintainers will undertake the task of converting CSV files to SKOS TriG files.

CSV templates

A good starting point for producing vocabularies in CSV is this template. A good starting point for suggesting concepts to be submitted to the AAT is this template. A good starting point for producing matches across vocabularies in CSV is this template.

SKOS Validation

If SKOS is submitted, validating the files with a suitable SKOS validator before uploading them to the LCD repository is advised, for example: http://labs.sparna.fr/skos-testing-tool/ or directly with https://github.com/cmader/qSKOS/. The uploaded data should pass the following tests:

LCD repository

If the technicalities of uploading data to the LCD repository make the process too resource intensive, then files can be emailed to one of the LCD repository maintainers.

Uploading files with encoded vocabularies or alignment data to the LCD repository requires the following procedure:

  1. Forking the LCD vocabulary repository on GitHub. More information on forking GitHub repositories can be found here. This will create a copy of the LCD repository.
  2. Adding contact information and brief context in a README file as explained here.
  3. Creating the required folders for the vocabulary and alignment files and uploading them to the forked repository.
  4. Submitting a pull request for changes from the forked repository to the LCD repository as explained here.
  5. The proposed changes (i.e. the newly uploaded files) are reviewed by the LCD repository maintainers and they are either approved or rejected with comments.

Possible filenames

The following types of files with their associated naming conventions and formats can be submitted to the LCD repository:

Vocabulary files:

  • ./vocabs/vocabularyName/vocabularyName-version.trig
  • ./vocabs/vocabularyName/vocabularyName-version.ttl
  • ./vocabs/vocabularyName/vocabularyName-version.csv

Terms to submit to the AAT:

./aat/vocabularyName/vocabularyName-version--aat-submit.csv

Matching files:

  • ./align/local_vocabularyName--remote_vocabularyName/local_vocabularyName-version--remote_vocabularyName-version.trig
  • ./align/local_vocabularyName--remote_vocabularyName/local_vocabularyName-version--remote_vocabularyName-version.ttl
  • ./align/local_vocabularyName--remote_vocabularyName/local_vocabularyName-version--remote_vocabularyName-version.csv
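The naming conventions above can be expressed as small helper functions, sketched here in Python. The function names are hypothetical, and the version strings in the usage lines are taken from the file names of the worked example later in this document.

```python
ALLOWED_EXTENSIONS = {"trig", "ttl", "csv"}


def _check(ext):
    """Reject file extensions outside the accepted formats."""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")


def vocabulary_path(name, version, ext="trig"):
    """Path of a vocabulary file."""
    _check(ext)
    return f"./vocabs/{name}/{name}-{version}.{ext}"


def aat_submission_path(name, version):
    """Path of a CSV file with terms to submit to the AAT."""
    return f"./aat/{name}/{name}-{version}--aat-submit.csv"


def alignment_path(local, local_version, remote, remote_version, ext="trig"):
    """Path of a matching file between a local and a remote vocabulary."""
    _check(ext)
    folder = f"./align/{local}--{remote}"
    return f"{folder}/{local}-{local_version}--{remote}-{remote_version}.{ext}"


print(vocabulary_path("spg", "20210325"))
# prints: ./vocabs/spg/spg-20210325.trig
print(alignment_path("spg", "20210325", "aat", "20200518"))
# prints: ./align/spg--aat/spg-20210325--aat-20200518.trig
```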

Constraints of use

Vocabularies should be available for use as explained in License. The way that license information is shared depends on the format used. For Turtle and CSV files a separate file should be uploaded at:

./vocabs/vocabularyName/LICENSE

or

./align/local_vocabularyName--remote_vocabularyName/LICENSE

For TriG files, the license information should be included using the dct:rights property while a separate file with this information can also be uploaded.

Quick start

Non-technical quick start

To share your vocabulary on the LCD repository, follow these steps:

  1. Ensure that you can share with an appropriate license as explained here.
  2. Ensure you have noted the meaning of each of your terms and that there are no ambiguities (for example do not use the same term to mean two different things).
  3. If you do not already have URIs for your terms, or if you are not sure what a URI is, contact the repository maintainers to help you produce them.
  4. Use this template to enter your vocabulary data as explained here.
  5. Match your terms to the Art & Architecture Thesaurus terms using this template as explained here.
  6. If you have never used GitHub before, email the resulting files to one of the repository maintainers. Otherwise follow the instructions here.

Technical quick start

The LCD vocabularies repository aims to collect individual conservation vocabularies in SKOS format to assist data integration. Please consult the sections about packaging SKOS data in TriG (for publishing and for matching) and about the file naming conventions when contributing data to the repository. We advise forking the LCD vocabulary repository and pushing changes to it. The hub for matching conservation vocabularies is the Art & Architecture Thesaurus. LCD can submit new terms to the AAT on behalf of a vocabulary maintainer.

Flowcharts

This document is based on work done by the LCD consortium during 2019. Please consult these flowcharts for easy reference.

How to publish as SKOS

How to match concepts

Example process

For the purposes of this document we have selected an example based on the Painting Conservation Glossary from the Smithsonian Museum Conservation Institute available here.

Note: the URIs used in this example have not been registered with the w3id.org repository and are only here for the purposes of the example.

Hosting

Encoding Python script

The following Python script performs web scraping - extracting terms and descriptions from the list on the Smithsonian Painting Conservation Glossary webpage:

import csv

import requests
from lxml import html

PAGE_URL = "https://www.si.edu/mci/english/learn_more/taking_care/painting_glossary.html"
LOCAL_FILE = "smithsonian.csv"
URI_BASE = "https://w3id.org/spg/concept/"


def clean(s):
    """Collapse tabs and runs of whitespace into single spaces and trim."""
    return " ".join(str(s).split())


# get the HTML page content from the web URL
page = requests.get(PAGE_URL, timeout=5)

# success?
if page.status_code == 200:
    # parse out the specific list items we are interested in
    tree = html.fromstring(page.content)
    items = tree.xpath('//div[@id="site_sections"]/ul/li/p')

    # newline="" lets the csv module control line endings
    with open(LOCAL_FILE, "w", encoding="utf-8", newline="") as output:
        # quote every field; the csv module escapes embedded quotation marks
        writer = csv.writer(output, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(["concept id", "concept", "description"])  # header
        for i, item in enumerate(items, start=1):
            # parse and clean terms and descriptions
            concept = clean(item.xpath("./strong/text()")[0])
            # drop the first 2 chars (colon and space) of the description
            desc = clean(item.xpath("./text()")[0][2:])
            writer.writerow([f"{URI_BASE}{i}", concept, desc])
else:
    print(f"Could not get data, status code {page.status_code} returned")

The script extracts the list of terms and descriptions and writes the results to a local CSV file which can be found here.

The above list includes terms which refer to more than one concept, such as https://w3id.org/spg/concept/29 glaze and https://w3id.org/spg/concept/41 light fastness. In the next step we manually split these terms into separate concepts.

Terms with multiple concepts (polysemes)

URI | label | scope note
https://w3id.org/spg/concept/86 | glaze (glass-like surface production) | To impart a glass-like surface. Aged glaze is very sensitive to solvents.
https://w3id.org/spg/concept/87 | light fastness (dimension) | The relative degree of change or lack of change in color of materials exposed to the same amount and character of light.
https://w3id.org/spg/concept/29 | glaze (coloring) | To cover paler under painting with a layer consisting of transparent pigments and excess medium. Traditionally used to add color to forms modeled in monochrome opaque paint.
https://w3id.org/spg/concept/41 | light fastness (color quality) | Ability to withstand color changes on exposure to light.

The resulting CSV file can be found here.
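The split itself is a manual, expert decision, but the bookkeeping can be sketched in Python: the original URI keeps one of the meanings and the other meaning receives the next free number. The function split_concept is hypothetical, and the row numbered 85 is a placeholder standing in for the rest of the list, on the assumption that the scraped list ended at 85.

```python
URI_BASE = "https://w3id.org/spg/concept/"


def split_concept(rows, uri, kept_label, new_label, new_scope_note):
    """Relabel the row at `uri` and append a new concept with the next free number."""
    next_number = max(int(row["uri"].rsplit("/", 1)[1]) for row in rows) + 1
    for row in rows:
        if row["uri"] == uri:
            row["label"] = kept_label
    new_uri = f"{URI_BASE}{next_number}"
    rows.append({"uri": new_uri, "label": new_label, "scope note": new_scope_note})
    return new_uri


# Placeholder rows; real scope notes come from the scraped CSV file.
rows = [
    {"uri": URI_BASE + "29", "label": "glaze", "scope note": "<scope note>"},
    {"uri": URI_BASE + "41", "label": "light fastness", "scope note": "<scope note>"},
    {"uri": URI_BASE + "85", "label": "<last term>", "scope note": "<scope note>"},
]

glaze_uri = split_concept(
    rows,
    URI_BASE + "29",
    "glaze (coloring)",
    "glaze (glass-like surface production)",
    "To impart a glass-like surface. Aged glaze is very sensitive to solvents.",
)
light_uri = split_concept(
    rows,
    URI_BASE + "41",
    "light fastness (color quality)",
    "light fastness (dimension)",
    "The relative degree of change or lack of change in color of materials "
    "exposed to the same amount and character of light.",
)
print(glaze_uri, light_uri)
```

The scope note of the concept that keeps its URI still has to be trimmed by hand so that it describes only the remaining meaning.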

Build hierarchies

The CSV file is then re-formatted and saved as .xlsx based on the template provided by the SKOS Play website. The file can be found here. Notice that in this file some concepts have been marked as broader than other concepts. Related links have also been established.

XLSX file for SKOS Play conversion

The SKOS Play conversion results in a Turtle file as shown here.

Packaging as TriG

The Turtle file from the last step can be packaged as a versioned TriG file for uploading to the repository. The complete TriG file can be seen online here: spg-20210325.trig.

Align with AAT

A similar .xlsx file is used to match vocabulary concepts to the AAT. This is done manually by domain experts, i.e. in this case a conservator with relevant experience. This template file does not include scope notes, broader concepts or related concepts. Instead it includes the required SKOS properties for matching. The file can be seen here.

XLSX file for SKOS Play conversion

The SKOS Play conversion results in a Turtle file as shown here.

Packaging as TriG

The Turtle file from the last step can be packaged as a versioned TriG file for uploading to the repository. The complete TriG file can be seen online here: conservation-vocabularies/spg-20210325--aat-20200518.trig

Repository maintainers