Module 3: SPARQL queries
Level: Advanced (300)
Learning objectives:
Understanding the basic syntax of a SPARQL query.
Being able to run SPARQL queries on the endpoints of heritage institutions.
Introduction#
SPARQL (pronounced “sparkle”) is the query language for the Semantic Web. The name of the standard is an acronym standing for SPARQL Protocol and RDF Query Language. To be able to search Linked Open Data, you need to know how to build SPARQL queries. This is what you will learn in this module, by looking at several sample queries. You will also learn how to formulate SPARQL queries on Wikidata using the Wikidata Query Service.
RDF#
RDF statements, as you saw in the first lesson, consist of three components: subjects, predicates and objects. You can see an example of an RDF statement below:
<https://data.rkd.nl/artists/56854>
schema:name
'Piet Mondriaan' .
In this example, the base URI for schema.org is replaced with a prefix. This was done to make the URI shorter and more manageable.
The identifier that serves as the subject in this RDF statement (or triple) was assigned by the RKD, the Netherlands Institute for Art History. The RDF triple gives information about one of the artists described in their digital archive. More specifically, it states that the entity specified by the identifier (https://data.rkd.nl/artists/56854) has a name, namely “Piet Mondriaan”. The subject of an RDF triple is normally a URI, whereas the object can be either a URI or a literal value. In the example above, the URI of the subject is given in angle brackets, and the literal is given in quotes. This is the convention in serialisations such as Turtle and N-Triples.
SPARQL#
Now imagine a situation in which you have received a URI without any further context, such as https://data.rkd.nl/artists/32439. The identifiers created for the artists in the RKD do not reveal any details about the people that are being identified. In this situation, you may want to know which artist is being identified exactly. We can use SPARQL to find more information about these identifiers. The RDF triple (which reads like an affirmative statement) first needs to be transformed into a question. You can do this as follows.
PREFIX rkdo: <http://data.rkd.nl/def#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>
SELECT ?name
WHERE {
<https://data.rkd.nl/artists/32439>
schema:name
?name
}
SPARQL queries look quite similar to regular RDF triples. A significant difference, however, is that certain components are replaced with variables. If you compare the query contained within the curly brackets following WHERE to the full RDF statement that was discussed earlier, you can see that the literal value that was used as the object (i.e., the actual name associated with the subject) has been replaced with a variable named ?name.
You are free to choose the name of the variables yourself. In the SPARQL language, variables always start with a question mark. It is advisable to work with meaningful names, such as ?place for places or ?date for dates.
SELECT and WHERE#
To turn such a question into an actual SPARQL query, you also need to add the keywords SELECT and WHERE. The curly brackets that follow WHERE contain one or more ‘incomplete’ RDF statements, that is, statements in which one or more of the central components (subject, predicate and object) are replaced with variables. In the example above, the request is to ‘fill in the blanks’, or, in other words, to find the correct value for the variable.
In the SELECT clause, you specify the variables whose values you would like to see. In this example there is only one variable, namely ?name. The SPARQL query will return a table. The number of columns will be the same as the number of variables you mention after SELECT. Each combination of values that completes the RDF statements in the WHERE clause will generate a new row in this table.
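To make the ‘fill in the blanks’ idea more concrete, the Python sketch below mimics what a SPARQL engine does with a single triple pattern. It is only an illustration under simplifying assumptions: the two triples are the examples used in this module, they are stored in a plain Python list rather than in a real triple store, and the function name select is made up for this sketch.

```python
# A toy illustration of SELECT/WHERE with one triple pattern.
# Assumption: the "database" is a plain list of (subject, predicate, object) tuples.
TRIPLES = [
    ("https://data.rkd.nl/artists/56854", "schema:name", "Piet Mondriaan"),
    ("https://data.rkd.nl/artists/32439", "schema:name", "Vincent van Gogh"),
]


def select(variables, pattern, triples=TRIPLES):
    """Return one row per triple that matches the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    rows = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # a fixed term does not match this triple
        else:
            # One column per selected variable; unbound variables stay empty.
            rows.append(tuple(bindings.get(v) for v in variables))
    return rows


rows = select(
    ["?name"],
    ("https://data.rkd.nl/artists/32439", "schema:name", "?name"),
)
print(rows)
```

The pattern has one variable, so the result table has one column, and it has one row because only one triple in the toy data completes the pattern. Replacing the subject with a variable such as ?id would yield one row per artist.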
As was explained in module 2, SPARQL queries can be executed at a SPARQL endpoint. An endpoint is essentially a web application on which you can access an institution’s RDF data.
If you run the query above on the RKD SPARQL endpoint, available at https://data.netwerkdigitaalerfgoed.nl/rkd/rkdartists/sparql/rkdartists, the SPARQL engine will try to find values that complete the query and assign them to the variables. If everything goes well, you will see that the identifier that was supplied is associated with Vincent van Gogh. This is the basis of working with SPARQL.
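Besides using the web form of an endpoint, a query can also be sent from a script. The Python sketch below shows one possible way of doing this with the standard library only, following the SPARQL Protocol (the query text goes into a parameter named query, and the Accept header asks for results in the SPARQL JSON format). It is an assumption of this sketch that the RKD endpoint accepts such GET requests; the function names are invented for the example.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL as given in this module. That it accepts GET requests with a
# JSON Accept header is an assumption of this sketch.
ENDPOINT = "https://data.netwerkdigitaalerfgoed.nl/rkd/rkdartists/sparql/rkdartists"

QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?name
WHERE {
  <https://data.rkd.nl/artists/32439> schema:name ?name .
}
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request for the given query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_json(data):
    """Turn a SPARQL JSON result into a list of {variable: value} rows."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query to the endpoint and return the result rows."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return rows_from_json(json.load(response))


# Building the request does not contact the server yet.
print(build_request(ENDPOINT, QUERY).full_url)
```

Calling run_query(ENDPOINT, QUERY) requires network access and returns one dictionary per row of the result table, with the variable names (without the question mark) as keys.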
Multiple variables#
If you want to see more details for a specific identifier, you can simply add more RDF statements containing variables in the WHERE clause. According to the rules of the SPARQL syntax, each individual RDF triple needs to end in a full stop.
PREFIX rkdo: <http://data.rkd.nl/def#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>
SELECT ?name ?gender ?date_birth
WHERE {
<https://data.rkd.nl/artists/32439> schema:name ?name .
<https://data.rkd.nl/artists/32439> schema:gender ?gender .
<https://data.rkd.nl/artists/32439> rkdo:Birth ?birth .
?birth schema:startDate ?date_birth .
}
This query requests the name, the gender and the birth date of the artist identified by the URI that is mentioned. The way in which the variable ?date_birth is obtained may be slightly confusing at first sight. The property rkdo:Birth does not return a date directly, but a rkdo:Birth resource. This is a general resource that can describe the details of a specific birth. Such a birth event can be described using a schema:startDate property. To find the actual birth date of Vincent van Gogh in the RKD database, you therefore need to combine two separate triple patterns: one that finds the birth resource (?birth) and one that finds its start date (?date_birth).
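The two-step lookup via the intermediate ?birth resource can be illustrated with a small Python sketch. The data below is a toy stand-in, not the actual RKD data: the identifier of the birth resource ("_:birth1") is hypothetical, and the birth date of Vincent van Gogh is written as a plain ISO string for illustration.

```python
# Toy data: "_:birth1" is a hypothetical identifier for the birth resource.
TRIPLES = [
    ("https://data.rkd.nl/artists/32439", "rkdo:Birth", "_:birth1"),
    ("_:birth1", "schema:startDate", "1853-03-30"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object found for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


artist = "https://data.rkd.nl/artists/32439"

# First pattern:  <artist> rkdo:Birth ?birth .
# Second pattern: ?birth schema:startDate ?date_birth .
birth_dates = [
    date_birth
    for birth in objects(artist, "rkdo:Birth")
    for date_birth in objects(birth, "schema:startDate")
]
print(birth_dates)
```

The value found for ?birth in the first pattern is reused as the subject of the second pattern, which is exactly what happens when the same variable name appears in two lines of a WHERE clause.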
LIMIT#
Following the principles that have been discussed so far, you can start to build more complicated queries, or queries that produce longer lists of results. The query below, for example, requests the name and the gender of all the artists that are described in the RKD database. It firstly creates a variable named ?id, which represents all the identifiers that are assigned to people. In this example, the SPARQL endpoint will be able to find many potential matches for the ?id variable. For each of the identifiers, it tries to obtain the name and the gender.
PREFIX rkdo: <http://data.rkd.nl/def#>
PREFIX schema: <http://schema.org/>
SELECT ?id ?name ?gender
WHERE {
?id a schema:Person .
?id schema:name ?name .
?id schema:gender ?gender .
}
LIMIT 100
As you can imagine, this query would return quite an extensive list of results. When the result list consists of several hundreds of items, it may take some time to load. SPARQL endpoints may also produce time-out errors in the case of such long lists of results. To avoid such errors, we can work with the LIMIT keyword, which needs to be followed by a number. Adding such a LIMIT section will have the effect that the length of the result list will not exceed the number that is specified.
FILTER#
You can add a FILTER clause to specify criteria that the results must meet. You can do this, for instance, if you do not want to see all the artists, but only those artists that were born in a specific decade (between 1890 and 1900, for example). The FILTER keyword is followed by a set of parentheses. Within these parentheses, you can add Boolean expressions which define criteria for the variables you work with. To see the artists who were born in the last decade of the 19th century, you can work with the query below.
PREFIX rkdo: <http://data.rkd.nl/def#>
PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?id ?name ?gender ?date_birth
WHERE {
?id a schema:Person .
?id schema:name ?name .
?id schema:gender ?gender .
?id rkdo:Birth ?birth .
?birth schema:startDate ?date_birth .
FILTER ( ?date_birth >= "1890-01-01"^^xsd:date && ?date_birth < "1900-01-01"^^xsd:date)
}
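The Boolean expression in the FILTER clause combines two comparisons with && (logical AND): the start date is included (>=), the end date is excluded (<). As a rough illustration of which dates pass this test, the same condition can be written in Python; the function name born_in_1890s is made up for this sketch.

```python
from datetime import date


def born_in_1890s(date_birth):
    """Mirror the FILTER above: start date inclusive, end date exclusive."""
    return date(1890, 1, 1) <= date_birth < date(1900, 1, 1)


# Check the dates on either side of both boundaries.
for candidate in [
    date(1889, 12, 31),
    date(1890, 1, 1),
    date(1899, 12, 31),
    date(1900, 1, 1),
]:
    print(candidate.isoformat(), born_in_1890s(candidate))
```

Only the two middle dates satisfy the condition, so an artist born on 1 January 1900 would not appear in the results of the query.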
ORDER#
The ORDER BY keyword can be used to sort the query results. ORDER BY needs to be included in the query after the closing curly bracket of the WHERE clause. After ORDER BY, you can refer to any of the variables you have defined.
PREFIX rkdo: <http://data.rkd.nl/def#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX schema: <http://schema.org/>
SELECT ?id ?name ?gender ?date_birth
WHERE {
?id a schema:Person .
?id schema:name ?name .
?id schema:gender ?gender .
?id rkdo:Birth ?birth .
?birth schema:startDate ?date_birth .
FILTER ( ?date_birth >= "1890-01-01"^^xsd:date && ?date_birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name
Publishing SPARQL queries as APIs#
You can run SPARQL queries by entering them at a SPARQL endpoint. Alternatively, it can be very useful to publish your SPARQL queries and to make their results available directly on the web through a Web API. When you do this, you can share the data with others more easily: people interested in working with the data no longer need to learn SPARQL to access it. You can create such Web APIs using the grlc tool, which was developed as part of CLARIAH. grlc stands for git repository linked data API constructor, and is available at http://grlc.io/.
To work with grlc, you can follow the steps below.
First of all, you need to set up a GitHub repository containing your SPARQL queries. Each query needs to be saved in a separate file with the .rq extension. Before the actual SPARQL query, you need to add values for a number of parameters:
#+ description: Provide a brief description of the type of data generated by the SPARQL query.
#+ endpoint: Specify the SPARQL endpoint of the query, for example http://example.com/sparql.
Examples of such .rq files can be found in the GitHub repository named grlc-queries, for instance https://github.com/CLARIAH/grlc-queries/blob/master/description.rq
Next, you need to create a URL for the Web API, using the grlc tool. To the base URL http://grlc.io/, you need to add your GitHub user name and the name of the GitHub repository containing the queries, as follows: http://grlc.io/api/github_username/repository_name.
When you open this URL in a browser, you should see that the tool creates separate APIs for all the .rq files you have added to your GitHub repository. If everything went well, the data can be accessed via a URL with the following structure:
If you want more information, you can read the grlc tutorial.
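The URL for the Web API described in the second step can also be composed in a script. The Python sketch below simply follows the pattern given in this module (base URL, then api, then the GitHub user name and the repository name). The function name grlc_api_url is invented for this example, and the route actually used by the grlc service may differ, so it is worth checking the grlc documentation.

```python
def grlc_api_url(github_username, repository_name, base="http://grlc.io/"):
    """Join the grlc base URL, 'api', the user name and the repository name."""
    return f"{base.rstrip('/')}/api/{github_username}/{repository_name}"


# Placeholder values; replace them with your own account and repository.
print(grlc_api_url("github_username", "repository_name"))
```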