These days, more and more devices generate data automatically, and it is relatively easy to develop applications in different domains that are backed by databases and expose data to the Web. The amount and diversity of the data produced clearly exceed our capacity to consume it.
The term big data has emerged to name data that is so large and complex that traditional data-processing applications cannot handle it. Big data has been characterized by at least three words starting with V: volume, velocity, and variety. Although volume and velocity are the most visible features, variety is a key concern that hinders data integration and generates many interoperability problems.
RDF was proposed as a graph-based data model that became part of the Semantic Web vision. Its reliance on the global nature of URIs offered a solution to the data integration problem, as RDF datasets produced by different means can seamlessly be integrated with other data. Data integration using RDF is faster and more robust than traditional solutions in the face of schema changes.
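As a minimal sketch of this idea (the URIs below are hypothetical), consider two Turtle fragments published independently; because both identify the same node with the same global URI, integrating them amounts to simply taking the union of their triples:

  # Fragment published by dataset A
  <http://example.org/alice> <http://schema.org/name> "Alice" .

  # Fragment published independently by dataset B
  <http://example.org/alice> <http://schema.org/knows> <http://example.org/bob> .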
RDF is also a key enabler of linked data. Linked data [46] is a set of best practices for publishing data on the Web, introduced by Tim Berners-Lee [8] and based on four main principles. RDF is mentioned in the third principle as one of the standards that provide useful information. The goal is that information be useful not only for humans navigating with browsers (for which HTML would be enough) but also for other agents that may automatically process that data.
The linked data principles became popular, and several initiatives were created to publish data portals. The amount of data on the Web has increased significantly in recent years. For example, the LODStats project [36] aggregates around 150 billion triples from 2,973 datasets.
RDF has been acknowledged as the language for the Web of Data, and it has several advantages, such as the following.
One of the biggest challenges in computer science today is solving the interoperability problem between applications that manipulate data coming from heterogeneous sources. RDF is a step toward partially solving this problem, as RDF data can be integrated automatically even when it has been produced by different parties.
RDF is at the core of the Semantic Web stack, or layer cake, and is mentioned in the linked data principles and in the five-star model.
The Closed World Assumption (CWA) is usually applied in systems that have complete information, while the Open World Assumption (OWA) is more natural for systems with incomplete information, like the Web.
Given that RDF was conceived for the Semantic Web, most applications based on RDF also adopt the OWA, adapting to the appearance of new data.
Although RDF and related technologies employ the OWA by default, this does not mean that every application must adopt that assumption. In some contexts, it may be necessary to take the opposite view and consider that a system contains all the information on some topic in order to operate.
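As an illustrative sketch (using prefixes from Table 1.1), suppose a dataset contains only the following triple:

  :alice schema:name "Alice" .

Under the OWA, we cannot conclude that :alice has no email address; that information may simply be stated elsewhere. Under the CWA, a system would treat the absent email as definitively nonexistent.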
In spite of all the advantages of RDF, its widespread adoption is not yet a reality. Some reasons for this can be conjectured.
We consider it necessary to separate the RDF data model from its more powerful and complex relatives. This is not to say that those technologies are not useful or practical, but the people who manage them are different from the people who develop applications. Web developers are not so interested in ontological discussions; they have more mundane concerns, like which arcs a given node is expected to have, what datatypes are allowed, or which data structures can be used to represent some nodes.
There were several attempts to define a more human-friendly syntax. Notation3 was proposed as a human-friendly language that extended RDF and could express other logical operations and rules. Turtle was later proposed as a subset of Notation3 restricted to expressing RDF. Turtle became popular in the Semantic Web community, although not so much among web developers: being a special-purpose format, it requires a dedicated parser and tools.
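As a quick taste of that syntax (anticipating Chapter 2, with hypothetical identifiers), a Turtle fragment reads almost like a sentence:

  :alice schema:name  "Alice" ;
         schema:knows :bob .

The semicolon lets several properties of the same subject (:alice) be grouped together.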
In 2013, RDF 1.1 also promoted JSON-LD, aimed at developers who are familiar with JSON, and RDFa, which enables RDF annotations to be embedded in HTML content.
Although these efforts can help popularize RDF among the developer community, some extra work is still needed to better understand the role of RDF in the web development and publishing pipeline.
The data that publishers have and want to transmit has some structure. For example, they may want to declare that some nodes have certain properties with specific values. Data consumers need to know that structure in order to develop applications that consume the data.
Although RDF is a very flexible, schema-less language, enterprise and industrial applications may require an extra level of validation before processing, for reasons such as security and performance.
Veteran users of RDF and SPARQL have confronted the problem of composing or consuming data with some expectations about the structure of that data. They may have described that structure in a schema or an ontology, or in some human-readable documentation, or they may have expected users to learn the structure by example. Ultimately, users of that application need to understand the graph structure that the application expects.
While it can be trivial to synchronize data production and consumption within a single application, consuming foreign data frequently involves a lot of defensive programming, usually in the form of SPARQL queries that search for data in different structures. Given the many potential representations of that data, it is difficult to be confident that we have addressed all of the intended ways our application may encounter its information.
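As a sketch of such defensive programming (the foaf: prefix below, for <http://xmlns.com/foaf/0.1/>, is an assumption not listed in Table 1.1), a query may have to try several vocabularies just to retrieve a name:

  # Try several possible representations of a person's name,
  # since we cannot be sure which one the publisher used.
  SELECT ?person ?name WHERE {
    { ?person schema:name ?name }
    UNION
    { ?person foaf:name ?name }
    UNION
    { ?person rdfs:label ?name }
  }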
Grammars are a common tool for defining data structures and the languages that convey them. Every data structure with sufficient complexity and precision relies on some formal convention for enumerating groups of properties and expressing data types, cardinalities, and relationships between structures. The need for such a representation grows with the complexity of the language.
To illustrate this, consider the specifications for RDF and SPARQL. RDF is a simple data model consisting of graphs made of triples composed from three types of nodes. Because of this simplicity, it does not need a defining grammar (though most academic papers about RDF include one). By contrast, the SPARQL language would be enormously complicated or impossible to define without a systematic grammar.
This book describes two languages for implementing constraints on RDF data. They can enumerate RDF properties and identify permissible data types, cardinalities, and groups of properties. These languages can be used for documentation, user interface generation, or validation during data production or consumption.
Shape Expressions (ShEx) was proposed as a user-friendly, high-level language for RDF validation. Initially conceived as a human-readable syntax for OSLC Resource Shapes [86], ShEx grew to embrace more complex user requirements coming from clinical and library use cases. ShEx now has a rigorous semantics and interchangeable representations: JSON-LD, RDF, and a compact syntax meant for human eyes.
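As a first taste (a minimal sketch anticipating Chapter 4, assuming the conventional xsd: prefix, <http://www.w3.org/2001/XMLSchema#>, for datatypes), a shape in the human-readable compact syntax might look like this:

  :Person {
    schema:name  xsd:string ;     # exactly one name
    schema:email xsd:string * ;   # zero or more email addresses
    schema:knows @:Person *       # zero or more links to other persons
  }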
Another technology, SPIN, was used for RDF validation, principally in TopQuadrant's TopBraid Composer. This technology, also influenced by OSLC Resource Shapes, evolved into both an implementation and a definition of the Shapes Constraint Language (SHACL), which was adopted by the W3C Data Shapes Working Group.
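For comparison (again a sketch, anticipating Chapter 5, with the same assumed xsd: prefix), a roughly analogous constraint in SHACL is itself written as RDF, for instance in Turtle:

  :PersonShape a sh:NodeShape ;
    sh:targetClass schema:Person ;   # applies to all instances of schema:Person
    sh:property [
      sh:path     schema:name ;
      sh:datatype xsd:string ;
      sh:minCount 1 ;                # at least one name...
      sh:maxCount 1                  # ...and at most one
    ] .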
Although ShEx and SHACL have similar goals and share some similarities, they solve the problem from different perspectives and with different formalisms. At the time of this writing, the W3C Data Shapes Working Group has been unable to reach a compromise solution that brings both proposals together, so it seems they will evolve as distinct technologies in the future.
This book describes the main features of both ShEx and SHACL from a user perspective and also offers a comparison of the technologies. Throughout this book, we develop a small number of examples that typify validation requirements and demonstrate how they can be met with ShEx and SHACL. The book is not intended as a formal specification of the languages, for which the interested reader can consult the corresponding documents, but as an introduction to the technologies, with some background about the rationale of their design and some comparison between them.
Chapter 2 presents a short overview of the RDF data model and RDF-related technologies. It can be skipped by readers who already know RDF and Turtle.
Chapter 3 helps us understand what to expect from data validation. It describes the problem of RDF validation and some approaches that have been proposed. In this book, we will further review two of them: Shape Expressions (ShEx) and SHACL.
The next two chapters focus on the two proposals: Shape Expressions (Chapter 4) and the Shapes Constraint Language (Chapter 5). The description of both languages is intended as a practical, example-based introduction rather than a formal specification. After presenting both languages, Chapter 6 presents some applications using ShEx, SHACL, or both. Finally, Chapter 7 compares ShEx and SHACL and presents some conclusions.
The goal of this book is to serve as a practical, example-driven introduction to ShEx and SHACL. We omit formal definitions and specifications, and instead add a section at the end of each chapter with references for further reading.
The intended audience is anyone interested in data representation and quality. We give a quick overview of some background and related technologies so that readers without RDF knowledge can follow the book's contents. It is also not necessary to have any prior knowledge of programming or ontologies to understand RDF validation technologies.
We provide a short introduction to RDF and Turtle in Chapter 2; from that point on, we use Turtle throughout the book.
Once a prefix declaration has been presented in Turtle or ShEx, it is omitted thereafter to simplify the examples, unless it is needed for clarity. The prefix declarations and namespaces used are shown in Table 1.1. Most examples in the book need to be prepended with these prefix declarations in order to run correctly.
Table 1.1: Prefix declarations and namespaces used in this book

Alias            Namespace
prefix :         <http://example.org/>
prefix cex:      <http://purl.org/weso/computex/ontology#>
prefix cdt:      <http://example.org/customDataTypes#>
prefix dbr:      <http://dbpedia.org/resource/>
prefix ex:       <http://example.org/>
prefix qb:       <http://purl.org/linked-data/cube#>
prefix org:      <http://www.w3.org/ns/org#>
prefix owl:      <http://www.w3.org/2002/07/owl#>
prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
prefix schema:   <http://schema.org/>
prefix sh:       <http://www.w3.org/ns/shacl#>
prefix sx:       <http://shex.io/ns/shex#>
RDF is being applied in many domains, some of them highly specialized. We opted to present examples using concepts, such as people, courses, and companies, that will be familiar to any reader. Most of the examples use properties borrowed from schema.org, which provides lots of concepts from familiar domains. The examples are just for illustration purposes and do not pretend to check schema.org rules. Nevertheless, validating schema.org using ShEx or SHACL can be an interesting exercise for readers.
For examples that involve validation of a node against a shape, we use the following notation:
:good schema:name "Valid node" .  # Passes as a :Shape
:bad  schema:name "Bad node" .    # Fails as a :Shape
which means that node :good validates against shape :Shape, while node :bad does not.
The examples have been tested using the different tools available. We maintain a public repository with the examples used in this book at https://github.com/labra/validatingRDFBookExamples.