Previous Up Next
Validating RDF data

Chapter 2  The RDF Ecosystem

This chapter includes a short overview of the RDF data model and the Turtle notation, as well as some technologies like SPARQL, RDF Schema, and OWL that form part of the RDF ecosystem.

Readers that are already familiar with these technologies may skip this chapter and jump into the next chapter that describes the RDF validation problem.

2.1  RDF History

The first draft of RDF was published in 1997 [68] and became a W3C recommendation in 1999 along with an XML syntax [69].

A class hierarchy which allows to describe a reasoning like if Socrates is human and all humans are mortals, then Socrates is mortal and property domains and ranges followed a year later. It is perhaps unfortunate that this came under the name RDF Schema (RDFS) as it didn’t offer any of the data constraints available in other schema languages like SQL’s DDL or W3C XML Schema.

In hindsight, this development path was clearly in tension with the priorities of everyday programmers and systems architects who care primarily about creating and accessing well-structured data, and perhaps secondarily about inference. Four years after RDFS, OWL extended the facilities provided by RDFS into an expressive ontology language that could describe the information required for instances of classes.

However, once again, the language was oriented toward a healthy, distributed information infrastructure and not that last mile which permits developers to confidently produce and consume data. While OWL could detect errors when one used a literal with the wrong datatype, or infer a subclass relationship between two classes explicitly declared disjoint, it simply would not complain if one says that every vehicle registration has an owner’s first name and last name, and then fails to supply those values. OWL is designed for an open world semantics, which means that it won’t conclude anything (e.g., signaling missing data) based on the absence of provided data. The absence of evidence is not evidence of absence.

Another four years later in 2008, the RDF community assembled to deliver a query language (SPARQL) to meet the most elementary of application needs, accessing data. The language met immediately with overwhelming acceptance and adoption. This ability to query led to the development of many new applications, as well as databases and libraries designed to facilitate application development. This energy led to the expansion of SPARQL 1.1 into update (analogous to SQL DDL) and HTTP protocol. It did not, however, elegantly solve the problem of RDF data description and verification.

2.2  RDF Data Model

When RDF was created in 1997, XML was quickly becoming a popular format. It had a strong influence on the RDF syntax which was called RDF/XML. That format is quite verbose and there appeared several proposals to have a more human-readable syntax for RDF.

In 2014, RDF 1.1 [25] was published as a revised version which maintained most of the data model and added support for other serialization formats like Turtle [78], Trig [18], or JSON-LD [5].

In this section we give a short overview of the RDF data model following the RDF 1.1 definitions and using Turtle as the serialization format.

The RDF data model is based on the concept of triples. Each triple consists of a subject, a predicate and an object. RDF triples are usually depicted as a directed arc connecting two nodes (subject and object) by an edge (predicate) (see Figure 2.2).


Figure 2.1: Example of RDF triple.

An RDF triple asserted means that some relationship, indicated by the predicate, holds between the resources denoted by the subject and object. This is known as an RDF statement. The predicate is an IRI that denotes a property. An RDF statement can be thought of as a binary relation identified by the property between the subject and object.

There can be three kinds of nodes: IRIs, literals, and blank nodes.

An RDF graph is a set of RDF triples. Notice that the edges of RDF graphs can only be IRIs. This is an important feature of RDF that enables to globally identify the predicates asserted by triples. The subjects can only be IRIs or blank nodes, while the objects can be IRIs, blank nodes or literals.

Example 1  Simple RDF file in Turtle

The following code represents an RDF graph in Turtle. The first three lines are prefix declarations and the rest represent a sequence of RDF triples separated by dots.

prefix ex: <http://example.org/> prefix schema: <http://schema.org/> prefix dbr: <http://dbpedia.org/resource/> prefix xsd: <http://www.w3.org/2001/XMLSchema#> ex:alice schema:knows ex:bob . ex:bob schema:knows ex:carol . ex:bob schema:name "Robert" . ex:bob schema:birthDate "1980-03-10"^^xsd:date . ex:bob schema:birthPlace dbr:Oviedo . ex:carol schema:knows ex:alice . ex:carol schema:knows ex:bob . ex:carol schema:birthPlace dbr:Oviedo .

The corresponding RDF graph has been depicted in Figure 2.2. Rounded boxes represent IRIs while orange rectangles represent literals.


Figure 2.3: Example of an RDF graph.

Blank nodes can be used to make assertions about some elements whose IRIs are not known.

Example 2  Blank nodes in RDF

The following RDF Turtle code declares that ex:alice knows someone who knows ex:dave, and that ex:carol knows someone who was born in the same place as dave, whose age is 23. The graph is depicted in figure 2.

prefix ex: <http://example.org/> prefix schema: <http://schema.org/> prefix dbr: <http://dbpedia.org/resource/> ex:alice schema:knows _:x . _:x schema:knows ex:dave . ex:carol schema:knows _:y . _:y schema:birthPlace _:z ; schema:age "23"^^xsd:integer . ex:dave schema:birthPlace _:z .

Figure 2.4: Example of an RDF graph with blank nodes.

An important feature of RDF graphs is that two independent RDF graphs can automatically be merged to obtain a larger RDF graph formed by the union on their sets of triples. Given the global nature of IRIs, nodes with the same IRI are automatically unified. Using shared IRIs makes the powerful statement the entities and relationships in one graph carry the same intent as they do in the other graphs using the same identifiers. In a sense, the use of RDF gets rid of the data merging problem and lets us focus on the hard problems of establishing shared entities and vocabularies.

For example, the union of the RDF graphs from Figures 2.2 and 2 is depicted in Figure 2.2. Turtle contains several simplifications to facilitate readability.


Figure 2.5: Merged RDF graph.

The RDF data model is very simple. This simplicity if part of its power as it enables RDF to be used as a data representation language in a lot of scenarios.

2.3  Shared Entites and Vocabularies

One of RDF strengths is to promote the use of IRIs instead of plain strings to facilitate merging data from heterogeneous sources and to avoid ambiguity. This poses the challenge of agreeing on common entities and relationships. Usually, those sets of entities and relationships are grouped in vocabularies which can be general-purpose or domain specific.

There are several well-known vocabularies like schema.org which is a collaborative, community activity founded by Google, Microsoft, Yahoo, and Yandex that promotes the use of common structured data on the internet.

An interesting project is the LOV (Linked Open Vocabularies)1 project that collects open vocabularies and provides a vocabulary search engine.

Shared identifiers are frequently minted by some authority releasing data using those identifiers followed by community uptake of those identifiers. Services like http://identifiers.org/ publish these identifiers and, in the frequent case where multiple identifiers exist for the same entity, map between them. The property owl:sameIndividualAs can be used to assert that mapping.

Consensus on vocabularies is typically by communities producing human-readable specifications, which is accompanied by some descriptions of the terms in the vocabulary using RDF Schema (see Section 2.4.2). Ontologies take this a step further by providing much more powerful inference and can be used to detect some errors in the conceptual model (for instance, if a vehicle registration conflated a car with its owner).

As we share more models, we implicitly raise our expectations for the accuracy of these models. George Box stated in 1976 that all models are wrong but some are useful [15]. Raising the bar for these models means we expect them to be useful in more situations than they were originally designed for.

Something as apparently simple as schema.org’s schema:gender offers a simple model for a complex issue. For at least 90% of the population, the model’s terms schema:Male and schema:Female suffice. Extending that to 99% or 99.9% of the population we see these terms are insufficient for the many variations in both identity and biology. Schema.org extends the model for these cases by permitting a string value. FHIR HL7 (see Section 6.2) standards use a concept of administrative gender, which adds two other possibilities.

For simplicity in this chapter and the next, we will use a notion of gender which is constrained to male and female. In later chapters we will use this to show how RDF validation languages can use the extended value set to provide coverage for more use cases.

2.4  Technologies Related with RDF

RDF was created as a language on which other technologies could be based on. The semantic web stack (also called layer cake) illustrates a hierarchy of technologies where RDF plays a central role. Although that stack is still evolving, there are two concepts that are worth mentioning: SPARQL and inference systems.

2.4.1  SPARQL

SPARQL (SPARQL Protocol and RDF Query Language) is an RDF query language which is able to retrieve and manipulate data stored in RDF. SPARQL 1.0 became a recommendation in 2008 [79] and SPARQL 1.1 was published in 2013 [44].

SPARQL is based on the notion of Basic Graph Patterns which are sets of triple patterns. A triple pattern is an extension of an RDF triple where some of the elements can be variables which are denoted by a question mark.

A Basic Graph Pattern matches a subgraph of the RDF data when RDF terms from that subgraph may be substituted for the variables and the result is an RDF graph equivalent to the subgraph.

Example 6  Simple SPARQL query

The following SPARQL query retrieves the nodes ?x whose birth place is dbr:Oviedo and the nodes ?y that are known by them.

prefix : <http://example.org/> prefix schema: <http://schema.org/> prefix dbr: <http://dbpedia.org/resource/> SELECT ?x ?y WHERE { ?x schema:birthPlace dbr:Oviedo . ?x schema:knows ?y }

Applying the previous SPARQL query to the RDF data defined in Example 2.2, a SPARQL processor would return the results shown in Table 2.2.


Table 2.2: Results of SPARQL query
?x ?y
:carol :alice
:carol :bob
:bob :carol

SPARQL queries consist of three parts [73].

Example 7  SPARQL query with Filter and Counts

The following SPARQL query returns people who know only one value.

SELECT ?person ?known { ?person schema:knows ?known . { SELECT ?person (count(*) as ?countKnown) { ?person schema:knows ?known . } GROUP BY ?person } FILTER (?countKnown = 1) }

It contains a nested query (lines 3–5) which groups each element with the number of known entries and a filter (line 8) which removes those elements whose counter is different to one.

A full introduction to SPARQL is out of the scope of this book. For the interested reader, we recommend [33].

SPARQL is a very expressive language which can be used to describe very complex queries. It can also be employed to validate the structure of complex RDF graphs [55]. In Section 23, we describe how SPARQL can be used to validate RDF.

2.4.2  Inference Systems: RDF Schema and OWL

RDF was designed so it could be used as a central piece for knowledge representation in the Web. The goal is that agents can automatically infer new knowledge in the form of new RDF statements from existing RDF graphs. To that end, several technologies were proposed to increase RDF expressiveness. In this section we will briefly review two of the most popular: RDF Schema and OWL.

RDF Schema

RDF Schema was proposed as a data-modeling vocabulary for RDF data. The first public working draft of RDF Schema appeared in 1998 [16] and was accepted as a recommendation in 2004 [26].

It is a semantic extension of RDF which provides mechanisms to describe groups of resources and relationships between them. It defines a set of common classes and properties.

The main classes defined in RDFS are:

RDFS contains several properties like rdfs:label, rdfs:comment, rdfs:domain, rdfs:range, rdf:type, rdfs:subClassOf and rdfs:subPropertyOf.

Example 8  RDF Schema

The following snippet contains some description about teachers and people using RDF Schema terms. It declares that schema:Person is an rdfs:Class, as well as :Teacher. It also declares that the :Teacher class is a subclass of schema:Person which could be read as saying that every instance of :Teacher is also an instance of schema:Person.

Finally, it declares that :teaches is a property that relates instances of :Teacher with instances of :Course, i.e., any two elements related by the property :teaches will satisfy that the first is an :Teacher and the second a :Course.

schema:Person a rdfs:Class . :Teacher a rdfs:Class ; rdfs:subClassOf schema:Person . :teaches a rdfs:Property ; rdfs:domain :Teacher ; rdfs:range :Course .

RDF Schema processors contain several rules that enable them to infer new RDF data.

For example, for any C rdfs:subClassOf D and x a C they can infer x a D, and for any p rdfs:domain C and x p y they can infer x a C.

If we apply those rules to the following data:

:alice a :Person . :bob a :Teacher . :carol :teaches :algebra .

An RDFS processor could infer that :bob and :carol have rdf:type :Person and that :algebra has rdf:type :Course.

OWL

OWL (Web Ontology Language) defines a vocabulary for expressing ontologies based on description logics. It was published as a W3C recommendation in 2004 [29] and a new version, OWL 2, was accepted in 2009 [70]. OWL has several syntaxes: an RDF-based syntax, functional-style Syntax, manchester syntax, etc., and a formally defined meaning. We will use RDF syntax in the following examples with Turtle notation.

An ontology can be defined as a vocabulary of terms, usually about a specific domain and shared by a community of users. Ontologies specify the definitions of terms by describing their relationships with other terms in the ontology.

The main concepts in OWL are as follows.

Example 9  OWL example

In the following example we declare two classes :Man and :Woman that have a property :gender with the value :Male or :Female, respectively.

:Man a owl:Class ; owl:equivalentClass [ owl:intersectionOf ( :Person [ a owl:Restriction ; owl:onProperty :gender ; owl:hasValue :Male ] ) ] . :Woman a owl:Class ; owl:equivalentClass [ owl:intersectionOf ( :Person [ a owl:Restriction ; owl:onProperty :gender ; owl:hasValue :Female ] ) ] .

Now, we can define :Person as the union of the :Man and :Woman classes, and to declare that those classes are disjoint.

:Person owl:equivalentClass [ rdf:type owl:Class ; owl:unionOf ( :Woman :Man ) ] . [ a owl:AllDisjointClasses ; owl:members ( :Woman :Man ) ] .

Given the previous declarations, if we add the following instance data:

:alice a :Woman ; :gender :Female . :bob a :Man .

An OWL reasoner can infer the following triples:

:alice a :Person . :bob a :Person . :bob :gender :Male .

OWL can be used to define ontologies in several domains and there are several tools like the Protégé editor [66] which provide facilities for the creation and visualization of large ontologies.

2.4.3  Linked Data, JSON-LD, Microdata, and RDFa

As we mentioned in Section 1.1, one of the principles of linked data is to provide useful information when dereferencing a URI, using standards such as RDF. The goal is to return not only human-readable content like HTML that a machine can only represent in a browser, but also some machine understandable content in RDF which can be automatically processed.

There are two main possibilities: return different representations of the same resource using content negotiation, or return the same representation with RDF embedded.

The first approach can be easier to implement because developers have several mechanisms to transform a resource to different representations on the fly. A popular format nowadays is JSON-LD which is a JSON-based representation of RDF.

Example 10  JSON-LD example

The Turtle Example 1 can be represented in JSON-LD as:

{"@context": { "ex": "http://example.org/", "schema": "http://schema.org/", "dbr": "http://dbpedia.org/resource/", "xsd": "http://www.w3.org/2001/XMLSchema#", "name": { "@id": "schema:name" }, "birthDate": { "@id": "schema:birthDate", "@type": "xsd:date" }, "birthPlace": { "@id": "schema:birthPlace" }, "knows": { "@id": "schema:knows" } }, "@graph": [ { "@id": "ex:alice", "knows": {"@id": "ex:bob" } }, {"@id": "ex:bob", "name": "Robert", "knows": {"@id": "ex:carol"}, "birthDate": "1980-03-10", "birthPlace": {"@id": "dbr:Oviedo" } }, { "@id": "ex:carol", "knows": [{"@id": "ex:alice" }, {"@id": "ex:bob"}], "birthPlace": {"@id": "dbr:Oviedo" } } ] }

An alternative approach is to embed RDF content in HTML.

RDFa can be used to embed RDF in HTML attributes.

<div xmlns:schema="http://schema.org/" xmlns:ex="http://example.org/" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" typeof="schema:Person" about="[ex:alice]"> My name is <span property="schema:name">Alice</span>. <p>I was born on <span property="schema:birthDate" content="1974-12-01" datatype="xsd:date">a Sunday, some time ago</span>, and I am a professor at the <span about="[ex:uniovi]" typeof="schema:Organization" property="schema:name" rel="schema:member" resource="[ex:alice]">University of Oviedo</span> </p> </div>

An HTML browser visualizes the information:

My name is Alice. I was born on a Sunday, some time ago, and I am a professor at the University of Oviedo

while an RDFa processor obtains the following RDF data:

ex:alice a schema:Person; schema:birthDate "1974-12-01"^^xsd:date; schema:name "Alice" . ex:uniovi a schema:Organization; schema:member ex:alice; schema:name "University of Oviedo" .

Another alternative is to use microdata:

<div itemscope itemtype="http://schema.org/Person" itemid="http://example.org/alice"> Home page of <span itemprop="name">Alice</span>. <p>I was born on <time itemprop="birthDate" datetime="1974-12-01">a Sunday, some time ago</time>, and I am a <span itemprop="jobTitle">Professor</span> at the <span itemscope itemprop="affiliation" itemtype="http://schema.org/Organization"> itemid="http://example.org/uniovi" <span itemprop="name">University of Oviedo</span> </span> </p> </div>

Which represents the same information as the RDFa example.

2.5  Summary

2.6  Suggested Reading

Official online documents:

There are several books introducing the concepts of RDF and Semantic Web in general like:

And about particular topics:


1

Previous Up Next