These days, more and more devices generate data automatically, and it is relatively easy to develop applications in different domains that are backed by databases and expose data to the Web. The amount and diversity of the data produced clearly exceed our capacity to consume it.
The term big data has emerged to name data that is so large and complex that traditional data-processing applications cannot handle it. Big data has been characterized by at least three words starting with V: volume, velocity, and variety. Although volume and velocity are the most visible features, variety is a key concern that hinders data integration and generates many interoperability problems.
RDF was proposed as a graph-based data model that became part of the Semantic Web vision. Its reliance on the global nature of URIs offered a solution to the data integration problem, as RDF datasets produced by different means can seamlessly be integrated with other data. Data integration using RDF is faster and more robust than traditional solutions in the face of schema changes.
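As a minimal sketch of this idea (the URIs below are hypothetical), consider two Turtle fragments published independently; because both identify the same node with the same global URI, integrating them amounts to simply taking the union of their triples:

  # Fragment published by dataset A
  <http://example.org/alice> <http://schema.org/name> "Alice" .

  # Fragment published independently by dataset B
  <http://example.org/alice> <http://schema.org/knows> <http://example.org/bob> .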
RDF is also a key enabler of linked data. Linked data [46] is a set of best practices for publishing data on the Web, introduced by Tim Berners-Lee [8] and based on four main principles. RDF is mentioned in the third principle as one of the standards that provide useful information. The goal is that information be useful not only for humans navigating with browsers (for which HTML would be enough) but also for other agents that may automatically process that data.
The linked data principles became popular, and several initiatives were created to publish data portals. The amount of data on the Web has increased significantly in recent years. For example, the LODStats project [36] aggregates around 150 billion triples from 2,973 datasets.
RDF has been acknowledged as the language for the Web of Data, and it has several advantages, such as the following.
One of the biggest challenges in computer science today is solving the interoperability problem between applications that manipulate data coming from heterogeneous sources. RDF is a step toward partially solving this problem, as RDF data can be integrated automatically even when it has been produced by different parties.
RDF is at the core of the Semantic Web stack, or layer cake, and is mentioned in the linked data principles and in the five-star model.
The Closed World Assumption (CWA) is usually applied in systems that have complete information, while the Open World Assumption (OWA) is more natural for systems with incomplete information, like the Web.
Given that RDF was conceived for the Semantic Web, most applications based on RDF also adopt the OWA, adapting to the appearance of new data.
Although RDF and related technologies employ the OWA by default, this does not mean that every application must adopt that assumption. In some contexts, it may be necessary to take the opposite view and consider that a system contains all the information on some topic in order to operate.
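As an illustrative sketch (using prefixes from Table 1.1), suppose a dataset contains only the following triple:

  :alice schema:name "Alice" .

Under the OWA, we cannot conclude that :alice has no email address; that information may simply be stated elsewhere. Under the CWA, a system would treat the absent email as definitively nonexistent.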
In spite of all the advantages of RDF, its widespread adoption is not yet a reality. Some reasons for this can be conjectured.
We consider it necessary to separate the RDF data model from its more powerful and complex relatives. This is not to say that those technologies are not useful or practical, but the people who manage them are different from the people who develop applications. Web developers are not so interested in ontological discussions; they have more mundane concerns, like which arcs a given node is expected to have, what datatypes are allowed, or which data structures can be used to represent some nodes.
There were several attempts to define a more human-friendly syntax. Notation3 was proposed as a human-friendly language that extended RDF and could express other logical operations and rules. Turtle was later proposed as a subset of Notation3 restricted to expressing RDF. Turtle became popular in the Semantic Web community, although not so much among web developers: being a special-purpose format, it requires a dedicated parser and tools.
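As a quick taste of that syntax (anticipating Chapter 2, with hypothetical identifiers), a Turtle fragment reads almost like a sentence:

  :alice schema:name  "Alice" ;
         schema:knows :bob .

The semicolon lets several properties of the same subject (:alice) be grouped together.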
In 2013, RDF 1.1 also promoted JSON-LD, aimed at developers who are familiar with JSON, and RDFa, which enables RDF annotations to be embedded in HTML content.
Although these efforts can help popularize RDF among the developer community, some extra work is still needed to better understand the role of RDF in the web development and publishing pipeline.
The data that publishers have and want to transmit has some structure. For example, they may want to declare that some nodes have certain properties with specific values. Data consumers need to know that structure in order to develop applications that consume the data.
Although RDF is a very flexible, schema-less language, enterprise and industrial applications may require an extra level of validation before processing, for reasons such as security and performance.
Veteran users of RDF and SPARQL have confronted the problem of composing or consuming data with some expectations about the structure of that data. They may have described that structure in a schema or an ontology, or in some human-readable documentation, or they may have expected users to learn the structure by example. Ultimately, users of that application need to understand the graph structure that the application expects.
While it can be trivial to synchronize data production and consumption within a single application, consuming foreign data frequently involves a lot of defensive programming, usually in the form of SPARQL queries that search for data in different structures. Given the many potential representations of that data, it is difficult to be confident that we have addressed all of the intended ways our application may encounter its information.
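As a sketch of such defensive programming (the foaf: prefix below, for <http://xmlns.com/foaf/0.1/>, is an assumption not listed in Table 1.1), a query may have to try several vocabularies just to retrieve a name:

  # Try several possible representations of a person's name,
  # since we cannot be sure which one the publisher used.
  SELECT ?person ?name WHERE {
    { ?person schema:name ?name }
    UNION
    { ?person foaf:name ?name }
    UNION
    { ?person rdfs:label ?name }
  }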
Grammars are a common tool for defining data structures and the languages that convey them. Every data structure with sufficient complexity and precision relies on some formal convention for enumerating groups of properties and expressing data types, cardinalities, and relationships between structures. The need for such a representation grows with the complexity of the language.
To illustrate this, consider the specifications for RDF and SPARQL. RDF is a simple data model consisting of graphs made of triples composed from three types of nodes. Because of this simplicity, it does not need a defining grammar (though most academic papers about RDF include one). By contrast, the SPARQL language would be enormously complicated or impossible to define without a systematic grammar.
This book describes two languages for implementing constraints on RDF data. They can enumerate RDF properties and identify permissible data types, cardinalities, and groups of properties. These languages can be used for documentation, user interface generation, or validation during data production or consumption.
Shape Expressions (ShEx) was proposed as a user-friendly, high-level language for RDF validation. Initially conceived as a human-readable syntax for OSLC Resource Shapes [86], ShEx grew to embrace more complex user requirements coming from clinical and library use cases. ShEx now has a rigorous semantics and interchangeable representations: JSON-LD, RDF, and a compact syntax meant for human eyes.
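As a first taste (a minimal sketch anticipating Chapter 4, assuming the conventional xsd: prefix, <http://www.w3.org/2001/XMLSchema#>, for datatypes), a shape in the human-readable compact syntax might look like this:

  :Person {
    schema:name  xsd:string ;     # exactly one name
    schema:email xsd:string * ;   # zero or more email addresses
    schema:knows @:Person *       # zero or more links to other persons
  }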
Another technology, SPIN, was used for RDF validation, principally in TopQuadrant's TopBraid Composer. This technology, also influenced by OSLC Resource Shapes, evolved into both an implementation and a definition of the Shapes Constraint Language (SHACL), which was adopted by the W3C Data Shapes Working Group.
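For comparison (again a sketch, anticipating Chapter 5, with the same assumed xsd: prefix), a roughly analogous constraint in SHACL is itself written as RDF, for instance in Turtle:

  :PersonShape a sh:NodeShape ;
    sh:targetClass schema:Person ;   # applies to all instances of schema:Person
    sh:property [
      sh:path     schema:name ;
      sh:datatype xsd:string ;
      sh:minCount 1 ;                # at least one name...
      sh:maxCount 1                  # ...and at most one
    ] .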
Although ShEx and SHACL have similar goals and share some similarities, they solve the problem from different perspectives and with different formalisms. At the time of this writing, the W3C Data Shapes Working Group has been unable to reach a compromise solution that brings both proposals together, so it seems they will evolve as distinct technologies in the future.
This book describes the main features of both ShEx and SHACL from a user perspective and also offers a comparison of the technologies. Throughout this book, we develop a small number of examples that typify validation requirements and demonstrate how they can be met with ShEx and SHACL. The book is not intended as a formal specification of the languages, for which the interested reader can consult the corresponding documents, but as an introduction to the technologies, with some background about the rationale of their design and some comparison between them.
Chapter 2 presents a short overview of the RDF data model and RDF-related technologies. It can be skipped by readers who already know RDF and Turtle.
Chapter 3 helps us understand what to expect from data validation. It describes the problem of RDF validation and some approaches that have been proposed. In this book, we will further review two of them: Shape Expressions (ShEx) and SHACL.
The next two chapters focus on the two proposals: Shape Expressions (Chapter 4) and the Shapes Constraint Language (Chapter 5). The description of both languages is intended as a practical, example-based introduction rather than a formal specification. After presenting both languages, Chapter 6 presents some applications using ShEx, SHACL, or both. Finally, Chapter 7 compares ShEx and SHACL and presents some conclusions.
The goal of this book is to serve as a practical, example-driven introduction to ShEx and SHACL. We omit formal definitions and specifications, and instead add a section at the end of each chapter with references for further reading.
The intended audience is anyone interested in data representation and quality. We give a quick overview of some background and related technologies so that readers without RDF knowledge can follow the book's contents. It is also not necessary to have any prior knowledge of programming or ontologies to understand RDF validation technologies.
We provide a short introduction to RDF and Turtle in Chapter 2; from that point on, we use Turtle throughout the book.
Once a prefix declaration has been presented in Turtle or ShEx, it is omitted thereafter to simplify the examples, unless it is needed for clarity. The prefix declarations and namespaces used are shown in Table 1.1. Most examples in the book need to be prepended with these prefix declarations in order to run correctly.
Table 1.1: Prefix declarations and namespaces used in this book

Alias            Namespace
prefix :         <http://example.org/>
prefix cex:      <http://purl.org/weso/computex/ontology#>
prefix cdt:      <http://example.org/customDataTypes#>
prefix dbr:      <http://dbpedia.org/resource/>
prefix ex:       <http://example.org/>
prefix qb:       <http://purl.org/linked-data/cube#>
prefix org:      <http://www.w3.org/ns/org#>
prefix owl:      <http://www.w3.org/2002/07/owl#>
prefix rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
prefix schema:   <http://schema.org/>
prefix sh:       <http://www.w3.org/ns/shacl#>
prefix sx:       <http://shex.io/ns/shex#>
RDF is being applied in many domains, some of them highly specialized. We opted to present examples using concepts, such as people, courses, and companies, that will be familiar to any reader. Most of the examples use properties borrowed from schema.org, which provides lots of concepts from familiar domains. The examples are just for illustration purposes and do not pretend to check schema.org rules. Nevertheless, validating schema.org using ShEx or SHACL can be an interesting exercise for readers.
For examples that involve validation of a node against a shape, we use the following notation:
:good schema:name "Valid node" .  # Passes as a :Shape
:bad  schema:name "Bad node" .    # Fails as a :Shape
which means that node :good validates against shape :Shape, while node :bad does not.
The examples have been tested using the different tools available. We maintain a public repository with the examples used in this book at https://github.com/labra/validatingRDFBookExamples.