Previous Up Next
Validating RDF data

Chapter 1  Introduction

1.1  RDF and the Web of Data

These days more and more devices generate data automatically and it is relatively easy to develop applications in different domains backed by databases and exposing data to the Web. The amount and diversity of data produced clearly exceeds our capacity to consume it.

The term big data has emerged to name data that is so large and complex that traditional data processing applications can’t handle it. Big data has been described by at least three words starting by V: volume, velocity, variety. Although volume and velocity are the most visible features, variety is a key concern which prevents data integration and generates lots of interoperability problems.

RDF was proposed as a graph-based data model which became part of the Semantic Web vision. Its reliance on the global nature of URIs offered a solution to the data integration problem as RDF datasets produced by different means can seamlessly be integrated with other data. Data integration using RDF is faster and more robust than traditional solutions in the face of schema changes.

RDF is also a key enabler of linked data. Linked data [46] was proposed as a set of best practices to publish data on the Web. It was introduced by Tim Berners-Lee [8] and was based on four main principles. RDF is mentioned in the third principle as one of the standards that provides useful information. The goal is that information must be useful not only for humans navigating through browsers (for which HTML would be enough) but also for other agents that may automatically process that data.

The linked data principles became popular and several initiatives were created to publish data portals. The size of data on the Web increased significantly in the last years. For example, the LODStats project [36] aggregates around 150 billion triples from 2,973 datasets.

1.2  RDF: The Good Parts

RDF has been acknowledged as the language for the Web of Data and it has several advantages like the following.

1.3  Challenges for RDF Adoption

In spite of all the advantages of RDF, its widespread adoption is not yet a reality. Some reasons for this can be guessed.

Veteran users of RDF and SPARQL have confronted the problem of composing or consuming data with some expectations about the structure of that data. They may have described that structure in a schema or ontology, or in some human-readable documentation, or maybe expected users to learn the structure by example. Ultimately, users of that application need to understand the graph structure that the application expects.

While it can be trivial to synchronize data production and consumption within a single application, consuming foreign data frequently involves a lot of defensive programming, usually in the form of SPARQL queries that search out data in different structures. Given lots of potential representations of that data, it is difficult to be confident that we have addressed all of the intended ways our application may encounter its information.

Grammars are a common tool for defining data structures and the languages that convey them. Every data structure with sufficient complexity and precision relies on some formal convention for enumerating groups of properties and expressing data types, cardinalities, and relationships between structures. The need for such a representation grows with the complexity of the language.

To illustrate this, consider the specifications for RDF and SPARQL. RDF is a simple data model consisting of graphs made of triples composed from three types of nodes. Because of this simplicity, it does not need a defining grammar (though most academic papers about RDF include one). By contrast, the SPARQL language would be enormously complicated or impossible to define without a systematic grammar.

This book describes two languages for implementing constraints on RDF data. They can enumerate RDF properties and identify permissible data types, cardinalities, and groups of properties. These languages can be used for documentation, user interface generation, or validation during data production or consumption.

Shape Expressions (ShEx) were proposed as a user-friendly and high-level language for RDF validation. Initially proposed as a human-readable syntax for OSLC Resource Shapes [86], ShEx grew to embrace more complex user requirements coming from clinical and library use cases. ShEx now has a rigorous semantics and interchangeable representations: JSON-LD, RDF, and the one meant for human eyes.

Another technology, SPIN, was used for RDF validation, principally in TopQuadrant’s TopBraid Composer. This technology, influenced from OSLC Resource Shapes as well, evolved into both an implementation and definition of the Shapes Constraint Language (SHACL), which was adopted by the W3C Data Shapes Working Group.

Although both ShEx and SHACL have similar goals and share some similarities they solve the problem from different perspectives and formalisms. At the time of this writing the W3C Data Shapes Working Group has been unable to obtain a compromise solution that brings together both proposals so it seems that they will evolve as different technologies in the future.

This book describes the main features of both ShEx and SHACL from a user perspective and also offers a comparison of the technologies. Throughout this book, we develop a small number of examples that typify validation requirements and demonstrate how they can be met with ShEx and SHACL. The book is not intended as a formal specification of the languages, for which the interested reader can consult the corresponding documents, but as an introduction to the technologies with some background about the rationale of their design and some comparison between them.

1.4  Structure of the Book

Chapter 2 presents a short overview of the RDF data model and RDF-related technologies. This chapter could be skipped by any reader who already knows RDF or Turtle.

Chapter 3 helps us understand what to expect from data validation. It describes the problem of RDF validation and some approaches that have been proposed. In this book, we will further review two of them: Shape Expressions (ShEx) and SHACL.

The next two chapters focus on two proposals: Shape Expressions (Chapter 4) and Shapes Constraint Language (Chapter 5). The description of both languages is more intended to be a practical introduction to them using examples than a formal specification. Once we present both languages, Chapter 6 presents some applications using either ShEx, SHACL or both. Finally, Chapter 7 compares ShEx and SHACL and presents some conclusions.

The goal of this book is to serve as a practical introduction to ShEx and SHACL using examples. We omitted formal definitions or specifications and just added a section at the end of each chapter with references to further reading.

The intended audience is anyone interested in data representation and quality. We give a quick overview of some background and related technologies so readers without RDF knowledge can follow the book contents. Also, it is not necessary to have any prior knowledge on programming or ontologies to understand RDF validation technologies.

1.5  Conventions and Notation

We provide a short introduction to RDF and Turtle in Chapter 2 and from that point on, we use Turtle for the rest of the book.

Once a prefix declaration is presented in Turtle and ShEx, it is omitted thereafter to simplify the examples unless needed for clarity. The prefix declarations and namespaces used are shown in Table 1.1. Most examples in the book will need to be prepended with prefix declarations in order to run correctly.


Table 1.1: Common prefix declarations
AliasNamespace
prefix : <http://example.org/>
prefix cex: <http://purl.org/weso/computex/ontology#>
prefix cdt: <http://example.org/customDataTypes#>
prefix dbr: <http://dbpedia.org/resource/>
prefix ex: <http://example.org/>
prefix qb: <http://purl.org/linked-data/cube#>
prefix org: <http://www.w3.org/ns/org#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix schema: <http://schema.org/>
prefix sh: <http://www.w3.org/ns/shacl#>
prefix sx: <http://shex.io/ns/shex#>

RDF is being applied to lots of domains, some of them highly specialized. We opted to present examples using concepts from familiar domains like people, courses, companies, etc. that we think will be familiar to any reader. Most of the examples use properties borrowed from schema.org,1 which provides lots of concepts from familiar domains. The examples are just for illustration purposes and do not pretend to check schema.org rules. Nevertheless, validating schema.org using ShEx or SHACL can be an interesting exercise for readers.

For examples that involve validation of a node against a shape, we use the following notation:

:good schema:name "Valid node" . #Passes as a :Shape :bad schema:name "Bad node" . #Fails as a :Shape

which means that node :good validates against shape :Shape, while node :bad does not.

The examples have been tested using the different tools available. We maintain a public repository where we keep the examples used in this book. The URL is: https://github.com/labra/validatingRDFBookExamples.


1

Previous Up Next