2: Creating LOD#

Learning objectives:

You learn how to

  • convert the data in a CSV file into RDF data

  • work with LDWizard and with the CLARIAH Data Legend tool CoW.

Introduction#

Cultural heritage institutions often provide access to heterogeneous data in a wide variety of formats, including relational databases, CSV files, XML documents and JSON. Given this complexity, it can be difficult to combine and integrate datasets from different GLAM organisations. As we saw in the previous module, we can overcome such difficulties by making use of RDF. The technology allows us to harmonise different datasets by representing these data in a generic format.

In this lesson, you will learn how to convert data that is not in the RDF format yet. The first section will explain the general steps that you follow. These general steps can be applied to data in any format, such as JSON, XML, CSV or an Excel Spreadsheet. After this, the explanation will become more concrete and more specific. The second section will illustrate these general steps by explaining how you can convert a CSV file into LOD. The module concludes with a brief explanation of how RDF data can be published.

General steps for creating RDF data#

To convert a given data set into the RDF format, we can follow the five basic steps below.

1. Create a conceptual model of the original data set.

Data in a tabular format consist of rows and of columns. The rows are usually instances of the general topics or entities that are being described (e.g. books, works of art or people). The columns typically describe properties of these entities. These properties specify the aspects or the characteristics of the entities that are captured in the data set (e.g. the title, the material or the year of birth).

As a first step in the conversion process, you need to make a clear and comprehensive list of all the entities in the dataset, together with all the properties of these entities. Such an overview of the entities and the properties (or attributes) is often referred to as the conceptual model of the database. You may describe this model in a text file, but, in the case of more complicated databases, you may also make use of a more formal notation technique, such as an Entity Relationship Diagram.

2. Find appropriate URIs for all the properties in the model.

As a second step, you can try to connect all the properties you have found to existing URIs. This will make the data more interoperable and more reusable. These properties generally correspond to the columns in a tabular data structure. You can add these URIs to the description of the model you created during the first step.

URIs for properties can be found in the following vocabularies:

More URIs can be found via the Linked Open Vocabularies website.

3. Establish the Resource class IRIs.

The dataset, as mentioned, describes certain topics or entities, such as people, books or organisations. As a third step, try to find a suitable URI for the general topics that are being described in the dataset. Such a URI for the general entity can be referred to as a Resource class IRI.

4. Produce a system for assigning unique identifiers to the individual items that are described in your dataset.

The rows in the tabular dataset typically represent instances of the entities that are being described. How can these individual instances be identified? If your CSV has row numbers, for example, you may choose to work with these row numbers as identifiers for the objects. Otherwise, you may choose to generate new identifiers for the individual records.

It is also advisable to publish the identifiers as HTTP URIs. This means that you also implement an infrastructure in which the URI can be resolved to a webpage that provides meaningful information about the resource that is being identified. In such an infrastructure, the identifier for the object is often appended to a base URI.
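The idea of appending an identifier to a base URI can be sketched in a few lines of Python. This is only an illustration: the base URI `https://example.org/resource/` and the function name `mint_uri` are hypothetical, and the sketch does not set up the infrastructure needed to make the resulting URIs resolve to a webpage.

```python
from urllib.parse import quote

# Hypothetical base URI, used for illustration only.
BASE_URI = "https://example.org/resource/"


def mint_uri(identifier, base_uri=BASE_URI):
    """Append a local identifier to a base URI to form an HTTP URI."""
    # Make sure a separator sits between the base and the identifier.
    if not base_uri.endswith(("/", "#")):
        base_uri += "/"
    # Percent-encode characters that are not safe in a URI path segment.
    return base_uri + quote(str(identifier).strip(), safe="")


print(mint_uri(1))        # identifier taken from a row number or ID column
print(mint_uri("MS 42"))  # the space is percent-encoded
```

Percent-encoding the identifier keeps the URI valid when identifiers contain spaces or other reserved characters.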

5. Transform the dataset.

Finally, you need to transform the dataset. The procedure is essentially that each value in the original data set (e.g. a single cell in a spreadsheet) will become a separate triple. The URI that you have assigned to the row (according to step 4) should become the subject. The URI of the property you associated with the column should be used as a predicate. The value itself will become an object. This transformation is usually carried out using a tool.
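The transformation described in step 5 might be sketched as follows. This is a minimal illustration rather than the way any particular tool works: the function names, the `example.org` URIs and the sample row are assumptions made for this sketch, and every value is written as a plain string literal in the N-Triples format.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape_literal(value):
    """Escape the characters that may not appear unescaped in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row, subject_uri, property_uris, class_uri):
    """Turn one row (a dict of column name -> cell value) into N-Triples lines."""
    # One triple states the class of the resource described by the row (step 3).
    lines = [f"<{subject_uri}> <{RDF_TYPE}> <{class_uri}> ."]
    for column, predicate_uri in property_uris.items():
        value = row.get(column, "").strip()
        if value:  # empty cells produce no triple
            lines.append(
                f'<{subject_uri}> <{predicate_uri}> "{escape_literal(value)}" .'
            )
    return lines


# Hypothetical row and URIs, for illustration only.
row = {"id": "7", "title": 'The "Odyssey"', "creator": ""}
property_uris = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}
lines = row_to_triples(
    row, "https://example.org/resource/7", property_uris, "https://example.org/vocab/Book"
)
for line in lines:
    print(line)
```

The subject comes from step 4, the predicates from step 2 and the class from step 3; the empty `creator` cell is skipped, so the row yields two triples.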

Illustration: Converting CSV into RDF#

This section of the lesson will explain how you can convert a CSV file into RDF. The demonstration will be based on a very small CSV file, consisting of four data rows only. The CSV file describes a small collection of four books.

ID,title,author,year_of_publication,language,publisher
1,Brave New World,Aldous Huxley,1932,English,Chatto & Windus
2,1984,George Orwell,1949,English,Secker & Warburg
3,Madame Bovary,Gustave Flaubert,1856,French,Michel Lévy Frères
4,Im Westen nichts Neues,Erich Maria Remarque,1929,German,Propyläen Verlag

The first row of this CSV file is the header, which specifies the column names. Each row contains six values, which are separated by commas.
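Before converting a CSV file, it can help to verify that every row really has as many values as the header has column names. The sketch below uses Python's standard `csv` module on a two-row excerpt of the sample data; it assumes a header that names six columns, including one for the language, and the function name `check_column_counts` is introduced here for illustration.

```python
import csv
import io

# Two-row excerpt of the sample data; the header is assumed to name six columns.
SAMPLE = """ID,title,author,year_of_publication,language,publisher
1,Brave New World,Aldous Huxley,1932,English,Chatto & Windus
2,1984,George Orwell,1949,English,Secker & Warburg
"""


def check_column_counts(csv_text):
    """Return the header and a list of (line number, field count) for mismatching rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    for line_number, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            problems.append((line_number, len(fields)))
    return header, problems


header, problems = check_column_counts(SAMPLE)
print(len(header), "columns:", header)
print("rows with a different number of values:", problems)
```

An empty list of problems means that each value can be matched unambiguously to a column, which is a precondition for the mapping in the next steps.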

1. Create a conceptual model of the original data set.

As explained above, it is necessary, first of all, to describe the underlying structure or the conceptual model of the data. The dataset describes books, so, for now, we can simply provide the term ‘BOOK’ as a description of the entity. In this example, it is also quite easy to find the properties, as these can be copied directly from the header of the CSV file.

ENTITY: BOOK
PROPERTIES:
ID
Title
Author
Year_of_publication
Language
Publisher

2. Find appropriate URIs for all the properties in the model.

Once you have developed a sufficiently clear understanding of the model, you can begin to link the various properties that you have identified to URIs. During this search, you can make use of the Linked Open Vocabularies website, as mentioned. On this site, you can type a property into the search bar at the top. The application will then attempt to find matching terms. Note that properties are distinguished visually from other types of resources by their dark blue colour.

We shall work with the following URIs for the properties in our sample data set:

Title: 
http://purl.org/dc/terms/title

Author: 
https://schema.org/author

Year of publication: 
http://purl.org/dc/terms/date

Language: 
http://purl.org/dc/terms/language

Publisher: 
http://purl.org/dc/terms/publisher

In many cases, multiple options are available for a single property. The column “year of publication” could also have been mapped to the URI https://d-nb.info/standards/elementset/gnd#dateOfPublication, defined by the German National Library. For this example, we have mapped the column to “dcterms:date”, following conventions that have emerged in the field. Libraries often describe the year of publication using dcterms:date.
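A short name such as dcterms:date is a prefixed name: the prefix stands for the first part of the URI. The sketch below shows how such names can be expanded into the full URIs listed above. The prefix label `schema` and the names `PREFIXES`, `COLUMN_PROPERTIES` and `expand` are choices made for this illustration.

```python
# Prefixes for the vocabularies used in this example.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "schema": "https://schema.org/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dcterms:date' into a full URI."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local_name


# Column-to-property mapping from step 2, written with prefixed names.
COLUMN_PROPERTIES = {
    "title": "dcterms:title",
    "author": "schema:author",
    "year_of_publication": "dcterms:date",
    "language": "dcterms:language",
    "publisher": "dcterms:publisher",
}

for column, name in COLUMN_PROPERTIES.items():
    print(column, "->", expand(name))
```

Keeping the mapping in one place makes it easy to copy the full URIs into a conversion tool later on.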

3. Establish the Resource class IRIs.

Books can be defined as instances of the class identified by the URI http://purl.org/dc/terms/BibliographicResource.

4. Produce a system for assigning unique identifiers to the individual items that are described in your dataset.

The column named “ID” has not been connected to a URI, as this column will be used to identify the individual books. The URIs for these books will consist of a combination of a base URI and an identifier. As a base URI, we will work with https://bookandbyte.universiteitleiden.nl/lod/resource/. The identifiers in the column “ID” from the CSV file will be appended to this base URI. In a robust LOD infrastructure, we should make sure that these URIs lead to landing pages offering information about the identified resources. In an experimental research setting, this is not strictly necessary.

5. Transform the dataset.

We are now ready to convert the CSV file. This tutorial will explain two tools that can be used for this purpose: LDWizard and CoW.

Converting CSV file to RDF using LDWizard#

  • Open LDWizard via https://ldwizard.netwerkdigitaalerfgoed.nl/#1

  • Upload your CSV file.

  • Under “Key column”, you need to explain how the various records in the data set can be identified. You can select one of the existing columns, but, if none of the columns are suitable, you can also select the row number that is generated automatically by LDWizard. In our example, we select the column named “ID”. As explained, this identifier value will eventually be appended to a base URI.

  • Under “Resource class IRI”, specify the type of resources that are being described in the data set. In our example, we provide the URI http://purl.org/dc/terms/BibliographicResource. Note that you can also search for terms in LDWizard. If you type in a term, the tool will offer several suggestions.

  • The base URI, which will be used to generate URIs for the individual rows, can be supplied under “Advanced”.

  • For each of the columns in your dataset, indicate the URI they should be connected to. At this point, you can work with the mappings between columns and URIs you developed in step 2. You can copy and paste the URIs you found yourself, or you can make use of the pull-down menu provided by the LDWizard tool.

  • When you have associated all the column names with URIs, you can click on “Next”. On the page that opens after this, you can download the result as RDF. The RDF file will be saved as a file with the .nt extension.

For this simple CSV file, the result looks as follows:

<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <http://purl.org/dc/terms/title> "Brave New World" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <https://schema.org/author> "Aldous Huxley" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <http://purl.org/dc/terms/date> "1932" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <http://purl.org/dc/terms/language> "English" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <http://purl.org/dc/terms/publisher> "Chatto & Windus" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/BibliographicResource> .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <http://purl.org/dc/terms/title> "1984" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <https://schema.org/author> "George Orwell" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <http://purl.org/dc/terms/date> "1949" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <http://purl.org/dc/terms/language> "English" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <http://purl.org/dc/terms/publisher> "Secker & Warburg" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/BibliographicResource> .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <http://purl.org/dc/terms/title> "Madame Bovary" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <https://schema.org/author> "Gustave Flaubert" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <http://purl.org/dc/terms/date> "1856" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <http://purl.org/dc/terms/language> "French" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <http://purl.org/dc/terms/publisher> "Michel Lévy Frères" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/BibliographicResource> .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <http://purl.org/dc/terms/title> "Im Westen nichts Neues" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <https://schema.org/author> "Erich Maria Remarque" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <http://purl.org/dc/terms/date> "1929" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <http://purl.org/dc/terms/language> "German" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <http://purl.org/dc/terms/publisher> "Propyläen Verlag" .
<https://bookandbyte.universiteitleiden.nl/lod/resource/4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/BibliographicResource> .

Converting CSV file to RDF using CoW#

A second tool that you can use to convert data in the CSV format to RDF is CoW. The name of the tool is an acronym for CSV on the Web converter. CoW was developed within the CLARIAH programme, and was made available within the Data Legend suite of tools. CoW can convert a CSV file into RDF data, based on specifications provided in a JSON file. The JSON file contains terms taken from the CSV on the Web (CSVW) vocabulary. You work with CoW on the command line and in Python 3. For this reason, the tool does demand some familiarity with command-line tools, Python and the JSON format.

If pip is available on your computer, you can install CoW by opening the Terminal (macOS) or the Command Prompt (Windows) and entering the following command:

pip3 install cow-csvw

Next, navigate in the Terminal or the Command Prompt to the directory that contains the CSV file that you want to transform. In this directory, enter the command below:

cow_tool build [name of csv file]

If your CSV file is named ‘glam-data.csv’, for example, the command should be as follows:

cow_tool build glam-data.csv

This first command creates a JSON file on your machine. The name of this new file is derived from the name of the CSV file: the tool appends the string ‘-metadata.json’ to the original file name. Next, open this newly created JSON file and make a number of changes.

  • In the @base field, under @context, specify the base URI that should be used for the identification of the individual entities in the data set.

  • In aboutUrl, underneath tableSchema, you need to indicate the column that can be used to identify individual rows. In our example, the column named ‘ID’ can be used for this purpose. The column name needs to be given in curly brackets (e.g. ‘{ID}’). If none of the available columns are suitable, you can provide the value ‘{_row}’. With this value, CoW will work with the row numbers that are generated automatically.

  • Underneath tableSchema, CoW generates a list of columns. Each column becomes a separate object in this list. In this part of the JSON file, you can enter the mapping between column names and URIs which you had developed in step 2 of the conversion process. The URIs for the properties can be given in a new field named propertyUrl.
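The three changes above can also be made with a short Python script instead of a text editor. The sketch below works on a simplified stand-in for the metadata, built only from the fields named in this section (@context, @base, tableSchema, aboutUrl, propertyUrl); the exact layout of the file that CoW generates is an assumption here, and a real file contains more fields. The function name `apply_mapping` is hypothetical.

```python
import json

# Simplified stand-in for the generated metadata. The layout is an assumption
# based on the fields described above; a real file contains more information.
metadata = {
    "@context": [{"@base": "https://example.org/", "@language": "en"}],
    "url": "glam-data.csv",
    "tableSchema": {
        "aboutUrl": "{_row}",
        "columns": [
            {"name": "ID", "titles": ["ID"]},
            {"name": "title", "titles": ["title"]},
            {"name": "author", "titles": ["author"]},
        ],
    },
}


def apply_mapping(metadata, base_uri, key_column, property_urls):
    """Fill in @base, aboutUrl and propertyUrl in a CoW-style metadata dict."""
    context = metadata["@context"]
    # @context may be a single object or a list that contains one.
    entries = context if isinstance(context, list) else [context]
    for entry in entries:
        if isinstance(entry, dict) and "@base" in entry:
            entry["@base"] = base_uri
    schema = metadata["tableSchema"]
    schema["aboutUrl"] = "{" + key_column + "}"
    for column in schema["columns"]:
        url = property_urls.get(column.get("name"))
        if url:  # columns without a mapping are left untouched
            column["propertyUrl"] = url
    return metadata


apply_mapping(
    metadata,
    base_uri="https://bookandbyte.universiteitleiden.nl/lod/resource/",
    key_column="ID",
    property_urls={
        "title": "http://purl.org/dc/terms/title",
        "author": "https://schema.org/author",
    },
)
print(json.dumps(metadata["tableSchema"], indent=2))
```

To apply the same idea to a real file, you could read it with `json.load`, call the function, and write the result back with `json.dump`, after checking that the file follows the layout assumed here.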

CoW offers many more possibilities, but the actions above need to be completed in any case to transform the CSV according to the procedure explained in this module. If you have made all the necessary changes, you can generate the RDF using the following command:

cow_tool convert glam-data.csv

This command creates a new file with the .nq extension. The extension refers to the serialisation of the RDF data: the tool creates statements in the N-Quads format. If you want to see the RDF data in another format, you can convert the data using the EasyRdf tool. To convert the data, first paste all the data into the field underneath “Input Data”. The type of output can be selected under “Output Data”. Options include RDF/XML, Turtle and JSON-LD. The result can be shown on the screen or downloaded to your computer.
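An N-Quads statement is a triple followed by an optional fourth term that names the graph to which the triple belongs. The sketch below illustrates the difference by dropping that fourth term, which turns N-Quads into N-Triples. It relies on a simple regular expression rather than a full parser, and the graph URI in the example is hypothetical.

```python
import re

# One RDF term: an IRI, a blank node, or a literal with an optional
# datatype or language tag.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[^\s]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def nquad_to_ntriple(line):
    """Drop the graph label (fourth term) from one N-Quads statement."""
    terms = TERM.findall(line)
    if len(terms) not in (3, 4):
        raise ValueError(f"not a valid statement: {line!r}")
    return " ".join(terms[:3]) + " ."


def nquads_to_ntriples(text):
    """Convert every statement in an N-Quads document, skipping comments."""
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            lines.append(nquad_to_ntriple(line))
    return "\n".join(lines)


# Hypothetical statement; the graph URI is made up for this example.
quad = (
    '<https://example.org/resource/1> <http://purl.org/dc/terms/title> '
    '"Brave New World" <https://example.org/graph/books> .'
)
print(nquads_to_ntriples(quad))
```

For anything beyond a quick inspection, a dedicated RDF library or a converter such as the one mentioned above is the safer choice.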

Publishing RDF data#

We can argue that RDF is not LOD yet if it has not been published openly. The RDF that you created using the steps explained in this tutorial can be published in a number of ways. The easiest way is simply to upload the RDF data to a server. You can place the file containing the RDF data in a GitHub repository, for instance. Such a basic upload helps to make the data available, but, when you do this, the data do not become part of the LOD cloud. Such a published data dump cannot actually be queried like other LOD.

The recommended way of publishing RDF data is by setting up a SPARQL endpoint. Such endpoints are usually provided by triple stores, the databases that hold the RDF data. A SPARQL endpoint is essentially an online application which can receive and process SPARQL queries. These types of queries will be explained in more detail in the next module. SPARQL endpoints can be created using OpenLink Virtuoso or Sesame (now Eclipse RDF4J), among other tools.
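To give an impression of what “receiving a SPARQL query” means in practice, the sketch below prepares an HTTP request that carries a query in the `query` parameter, as defined by the SPARQL protocol. The endpoint address `https://example.org/sparql` is hypothetical and the request is only built, not sent; the query itself is a small placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; replace it with the address of a real endpoint.
ENDPOINT = "https://example.org/sparql"

QUERY = """
SELECT ?book ?title
WHERE { ?book <http://purl.org/dc/terms/title> ?title }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Prepare an HTTP GET request that carries a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query.strip()})
    # Ask for the standard JSON serialisation of SPARQL query results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
# The request is only built here; urllib.request.urlopen(request) would send it.
```

A data dump on a file server cannot answer such a request, which is the practical difference between the two ways of publishing described above.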