Chapter 3 Data Quality

People have been using computers to record and reason about data for many decades. Typically, this reasoning is less esoteric than artificial intelligence tasks like classification.

A data modeler usually has some structure of the data that she is trying to model. That structure must be explicitly defined and communicated using some technology that can at the same time be understood by other people and also be processed by automatic systems that can check and enforce it. Using natural language for that is not enough as it can have ambiguities and is difficult to process by machines. On the other hand, enforcing that structure using some procedural programming language is difficult to maintain by other people. The right balance is usually to have some declarative language that can be readable by humans but at the same time parsed and checked by machines.

Rigorous data validation is like a contract that offers advantages to several different parties.

Consumers have an easier time understanding the semantics of data. For instance, a data structure that requires either a full name or a given and family name has a simple intuition while one that has optional full, given and family names leaves the consumer unsure about the many combinations she may encounter in the data.
Programmers have to do much less “defensive coding” when working with predictable data. A programmer need not write special cases for permutations like no name, a full name and a given name, etc. Introducing quality control into data workflows can reduce security exploits and catch systematic errors when they first occur rather than years later when someone stumbles across inconsistent data. For instance, a process may erroneously insert multiple primary addresses if no system enforces that a person should have no more than one primary address.
Producers can precisely define and validate their output. This allows them to test consistency with business processes, perform quality control, and unambiguously communicate their assets to other parties.
Queriers can tailor the sophistication of their queries to address a constrained set of possibilities. Queriers are a specific kind of consumers who are especially vulnerable to systematic data errors. Unexpected variations in data structures can result in missing query results. Possibly worse, a single accidental duplication of a record can result in it being counted many times, once for each combination of attributes in the original and duplicate record.

3.1 Non-RDF Schema Languages

While RDF is a relative newcomer to the data scene, most widely-used structured data languages have a way to describe and enforce some form of data consistency. Examining UML, SQL, XML, JSON, and CSV allows us to set expectations for RDF validation.

3.1.1 UML

The Unified Modeling Language (UML) is a general-purpose visual modeling language that can be used to provide a standard way to visualize the design of a system [85]. In 2005, the Object Management Group (OMG) published UML 2, a revision largely based on the same diagram notations, but using a modeling infrastructure specified using Meta-Object Facility (MOF). UML contains 14 types of diagrams, which are classified in three categories: structure, behavior and interaction. The most popular diagram is the UML class diagram, which defines the logical structure of a system in terms of classes and relationships between them. Given the Object Oriented tradition of UML, classes are usually defined in terms of sets of attributes and operations.

UML class diagrams are employed to visually represent data models.

Example 11 UML Class diagram

Figure 11 represents an example of a UML class diagram. In this case, there are two classes, User and Course with several attributes and two relationships. The relation enrolledIn establishes that a user can be enrolled in a course. The cardinalities 0..* means that a user may be enrolled in several courses while a cardinality 1..* means that a course must have at least one user enrolled. The other relationship is instructor which means that a course must have one instructor (cardinality 1) while a user can be the instructor of 0 or several courses. There is another relationship ( knows) between users.

Figure 3.1: Example of UML class diagram.

UML diagrams are typically not refined enough to provide all the relevant aspects of a specification. There is, among other things, a need to describe additional constraints about the objects in the model. OCL (Object Constraint Language)¹ has been proposed as a declarative language to define this kind of constraints. It can also be used to define well-formedness rules, pre- and post-conditions, model transformations, etc.

OCL contains a repertoire of primitive types (Integer, Real, Boolean, String) and several constructs to define compound datatypes like tuples, ordered sets, sequences, bag and sets.

Example 12 OCL constraints

The following code represents some constraints in OCL: that the gender must be 'Male' or 'Female', that a user does not know itself and that the start date of a course must be bigger that the end date. Notice that we are using a hypothetical operator < to compare dates while in OCL dates are not primitive types.

course User inv: self.gender->forAll(g | Set{'Male','Female'}->includes(g) ) self.knows->forAll(k | k <> self) context Course inv: self.startDate < self.endDate

3.1.2 SQL and Relational Databases

Probably the largest deployment of machine-actionable data is in relational databases, and certainly the most popular access to relational data is by Structured Query Language (SQL). One challenge in describing SQL is the difference between the ISO standard and deployed implementations.

SQL is designed to capture tabular data, with some implementations enforcing referential integrity constraints for consistent linking between tables. SQL’s Data Definition Language (DDL) is used to lay out a table structure; SQL is used to populate and query those tables. The SQL implementations that do enforce integrity constraints do so when data is inserted into tables.

The concept of DDL was introduced in the Codasyl database model to write the schema of a database describing the records, fields and sets of the user data model. It was later used to refer to a subset of SQL for creating tables and constraints. DDL statements list the properties in a particular table, their associated primitive datatypes, and list uniqueness and referential constraints.

Example 13 DDL

CREATE TABLE User ( id INTEGER PRIMARY KEY NOT NULL, name VARCHAR(40) NOT NULL, birthDate DATE, birthPlace VARCHAR(50), gender ENUM('male','female') ); CREATE TABLE Course ( id INTEGER PRIMARY KEY, StartDate DATE not null, EndDate DATE not null, Instructor INTEGER FOREIGN KEY REFERENCES User(id) ) CREATE TABLE EnrolledIn ( studendId INTEGER FOREIGN KEY REFERENCES User(id), courseId INTEGER FOREIGN KEY REFERENCES Course(id), )

While implementation support for constraints and datatypes varies, popular datatypes include numerics like various precisions of integer or float, characters, dates and strings.

Two popular constraints in DDL are for primary and foreign keys. In SQL and DDL, attribute values are primitive types, which is to say that a user’s course is not a course record, but instead typically an integer that is unique in some table of courses.

Figure 3.2: Example of two tables.

Because RDF is a graph, one would typically bypass this reference convention and create a graph where a user’s course is a course instead of a reference.

3.1.3 XML

XML was proposed by the W3C as an extensible markup language for the Web around 1996 [98]. XML derives from SGML [42], a meta-language that provides a common syntax for textual markup systems and from which the first versions of HTML were also derived. Given its origins in typesetting, the XML model is adapted to represent textual information that contains mixed text and markup elements.

The XML model is known as the XML Information Set (XML InfoSet) and consists of a tree structure, where each node of the tree is defined to be an information item of a particular type. Each item has a set of type-specific properties associated with it. At the root there is a document item, which has exactly one element as its child. An element has a set of attribute items and a list of child elements or text nodes. Attribute items may contain character items or they may contain typed data such as name tokens, identifiers and references. Element identifiers and references may be used to connect nodes transforming the underlying tree into a graph.

Example 14 XML example

An example of a course representation in XML can be:

<course name="Algebra"> <student id="alice"> <name>Alice</name> <gender>Female</gender> <comments>Friend of <person ref="bob">Robert</person></comments> </student> <student id="bob"> <name>Robert</name> <gender>Male</gender> <birthDate>1981-09-24</birthDate> </student> </course>

Figure 3.3: Tree structure of an XML document.

XML became very popular in industry and a lot of technologies were developed to query and transform XML. Among them, XPath was a simple language to select parts of XML documents that was embedded in other technologies like XSLT or XQuery.

The next XPath snippet finds the names of all students whose gender is "Female":

//student[gender = "Female"]/name

XML defines the notion of well-formed documents and valid documents. Well-formed documents are XML documents with a correct syntax while valid documents are documents that in addition of being well-formed, conform to some schema definition.

If one decides to define a schema, there are several possibilities.

Document Type Definition (DTD). The XML specification [98] declares a basic mechanism to define the schema of XML documents, which was inherited from SGML and is called DTD. It allows to define the structure of a family of XML documents

Example 15 DTD example

A DTD to validate the XML file in Example 14 could be:

<!ELEMENT course (student*)> <!ELEMENT student (name,gender,birthDate?)> <!ELEMENT name (#PCDATA)> <!ELEMENT gender (#PCDATA)> <!ELEMENT birthDate (#PCDATA)> <!ATTLIST student id ID #REQUIRED> <!ATTLIST course name CDATA #IMPLIED>

DTD defines the structure of XML using a basic form of regular expressions. However, DTDs have a limited support for datatypes. For example, it is not possible to validate that the birth date of a student has the shape of a date.

XML Schema. This specification was divided in two parts. The first part specifies the structure of XML documents [89] and the second part a repertoire of XML Schema datatypes [9].

Example 16 XML Schema example

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'> <xs:element name="course"> <xs:complexType> <xs:sequence> <xs:element name="student" minOccurs='1' maxOccurs='100' type="Student"/> </xs:sequence> <xs:attribute name="name" type="xs:string" /> </xs:complexType> </xs:element> <xs:complexType name="Student"> <xs:sequence> <xs:element name="name" type="xs:string" /> <xs:element name="gender" type="Gender" /> <xs:element name="birthDate" type="xs:date" minOccurs='0'/> </xs:sequence> <xs:attribute name="id" type="xs:ID" use='required'/> </xs:complexType> <xs:simpleType name="Gender"> <xs:restriction base="xs:token"> <xs:enumeration value="Male"/> <xs:enumeration value="Female"/> </xs:restriction> </xs:simpleType> </xs:schema>

An XML Schema validator decorates each structure of the XML document with additional information called the Post-Schema Validation Infoset, or PSVI. This structure contains information about the validation process that can be later employed by other XML tools.

RelaxNG [20] was developed within the Organization for the Advancement of Structured Information Standards (OASIS) as an alternative for XML Schema. RelaxNG has two syntaxes: an XML-based one and a compact one. RelaxNG is grammar based and its semantics is formally defined by means of axioms and inference rules.

Example 17 RelaxNG example

The following code contains a RelaxNG schema to validate Example 14 using the RelaxNG compact syntax.

element course { element student { element name { xsd:string }, element gender { "Male" | "Female" }, element birthDate { xsd:date }?, attribute id { xsd:ID } }* , attribute name { xsd:string } }

The same example can be expressed in XML as:

<element name="course" xmlns="http://relaxng.org/ns/structure/1.0" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"> <zeroOrMore> <element name="student"> <element name="name"> <data type="string"/> </element> <element name="gender"> <choice> <value>Female</value> <value>Male</value> </choice> </element> <optional> <element name="birthDate"> <data type="date"/> </element> </optional> <attribute name="id"> <data type="ID"/> </attribute> </element> </zeroOrMore> <attribute name="name"> <data type="string"/> </attribute> </element>

Schematron [50] is a rule-based language based on patterns, rules, and assertions. An assertion contains an XPath expression and an error message. The error message is displayed when the XPath expression fails. A rule groups various assertions together and defines a context in which assertions are evaluated using an XPath expression. Finally, patterns group various rules together.

Schematron has more expressive power than other schema languages like DTDs, RelaxNG or XML Schema as it can express complex constraints that are impossible with them. In fact, it is often used to define business rules.

Although Schematron can be used as a stand-alone, it is commonly used in cooperation with other schema languages which define the document structure.

Example 18 Schematron example

If we have XML documents containing course grades like the following:

<course name="Algebra"> <student id="S234"> <name>Alice</name> <grade>8</grade> </student> <student id="B476"> <name>Robert</name> <grade>5</grade> </student> <average>9</average> </course>

We can define the following Schematron file to validate.

That student IDs start by S (lines 4–8).
That the value of <average> is the mean of the grades.

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"> <sch:pattern name="Check Ids"> <sch:rule context="student"> <sch:assert test="starts-with(@id,'S')" >IDs must start by S</sch:assert> </sch:rule> </sch:pattern> <sch:pattern name="Check mean"> <sch:rule context="average"> <sch:assert test="sum(//student/grade) div count(//student/grade) = ." >Value of <sch:name/> does not match mean </sch:assert> </sch:rule> </sch:pattern> </sch:schema>

Schematron is more expressive than other schema languages like DTDs, XML Schema, or RelaxNG as it can define business rules and co-occurrence constraints at the same time that it can also define structural constraints like the other ones. Nevertheless, Schematron rules can become complex to define and debug. A popular approach is to combine both approaches, defining the XML document structure with a traditional schema language and complementing it with schematron rules.

Other schema languages for XML has been SchemaPath was proposed as a simple extension of XML Schema with conditional constraints [22]. Bonxai [62] has been recently proposed. It also contains a readable syntax inspired by RelaxNG.

Invoking validation in XML.

Different approaches have been proposed to indicate how an XML document has to be validated against a schema. Some of those approaches are the following.

Embedded schema. DTDs can directly be embedded in XML documents:
<!DOCTYPE course [ <!ELEMENT course (student*) > <!ELEMENT student (name,grade)> <!ATTLIST student id CDATA #REQUIRED> ]> <course name="Algebra"> ... </course>
Directly associate instance data with XML Schema. It can be done, for example, using the xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes.
For example, the following XML document directly declares that it follows the schema identified by http://example.org/ns/Course which is located at http://example.org/course.xsd:
<course xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://example.org/ns/Course http://example.org/course.xsd"> ... </course>
The XML processing instruction <?xml-model ?> has been proposed to associate an XML document with a schema [43].
<?xml-model href="http://example.org/course.rng" ?> <?xml-model href="http://example.org/course.xsd" ?> <course name="Algebra"> ... </course>

Note that the XML model processing instruction enables to use multiple schemas for the same document.
In WSDL [19] it is possible to associate documents or predetermined nodes in a document with arbitrary XML Schema types.

As can be seen XML provides several ways to associate XML data with schemas for their validation.

3.1.4 JSON

JSON was proposed by Douglas Crockford around 2001 as a subset of Javascript (the original acronym was Javascript Object Notation). It has evolved as an independent data-interchange format with its own ECMA specification [35].

A JSON value, or JSON document, can be defined recursively as follows.

true, false and null are JSON values.
Any decimal number is also a JSON value.
Any string of Unicode characters enclosed by " is also a JSON value, called a string value.
If k₁, k₂, …, k_n are distinct string values and v₁, v₂, …, v_n are JSON values, then {k₁: v₁, k₂: v₂, …, k_n: v_n} are JSON values, called objects. In this case, each k_i: v_i is a key-value pair. The order of the key-value pairs is not significant.
If v₁, v₂, …, v_n are JSON values, then [v1,v2,…,v_n] are JSON values, called arrays. The order of the array elements is significant.

Note that in the case of arrays and objects the values v_i can again be objects or arrays, thus allowing the documents an arbitrary level of nesting. In this way, the JSON data model can be represented as a tree [14].

Example 19 JSON example

The following example contains a JSON object with two keys: name and students. The value of name is a string while the value of students is an array of two objects.

{ "name": "Algebra" , "students": [ { "name": "Alice", "gender": "Female", "age": 18 }, { "name": "Robert", "gender": "Male", "birthDate": "1981-09-24" } ] }

Figure 19 shows a tree representation of the previous JSON value.

Figure 3.4: Tree structure of JSON.

JSON Schema [101] was proposed as an Schema language for JSON with a role similar to XML Schema for XML. It is written itself using JSON syntax and is programming language agnostic. It contains the following predefined datatypes: null, Boolean, object, array, number and string, and allows to define constraints on each of them.

In JSON Schema, it is possible to have reusable definitions which can later be referenced. Recursion is not allowed between references [74].

Example 20 JSON Schema example

The following example contains a JSON schema that can be used to validate Example 19. It declares student as an object type with four properties: name, gender, birthDate and age. The first two are required and some constraints can be added on their values.

The JSON value has type object and contains two properties: name, which must be a string value, and students which must be an array, whose items conform to the student definition.

{ "$schema": "http://json-schema.org/draft-04/schema#", "definitions": { "student": { "type": "object", "properties": { "name": {"type": "string" }, "gender": {"type": "string", "enum":["Male","Female"]}, "birthDate": {"type": "string", "format": "date" }, "age": {"type": "integer","minimum": 1 } }, "required": ["name","gender"] } }, "type": "object", "properties": { "name": { "type": "string" }, "students" : { "type": "array", "items": { "$ref": "#/definitions/student" } } }, "required": ["name","students"] }

3.1.5 CSV

Comma-Separated Values (CSV) and Tab-Separated Values (TSV) files have historically had no format-specific schema language. A common use case for CSV (and TSV) is to import it into a relational database, where it is subject to the same integrity constraints as any other SQL data. However, wide-ranging practices for documenting table structure and semantics have historically made it hard for consumers of CSV to consume published CSV data with confidence. Column headings and meanings may appear as rows in the CSV file, columns in an auxiliary CSV or flat file, or be omitted entirely.

Spreadsheets are another common generator and consumer of CSV data. Some spreadsheets may have hand-tooled integrity constraints but they offer no standard schema language.

While traditionally schema-less, a recent standard, CSV on the Web (CSVW) attempts to describe the majority of deployed CSV data. This includes semantics (e.g., mapping to an ontology), provenance, XML Schema length and numeric value facets (e.g., minimum length, max exclusive value), and format and structural constraints like foreign keys and datatypes.

CSVW describes a wide corpus of existing practice for publishing CSV documents. Because of it’s World Wide Web orientation, it includes internationalization and localization features not found in other schema languages. Where most data languages standardize the lexical representation of datatypes like dateTime or integer, CSVW describes a wide range of region or domain-specific datatypes. For instance, the following can all be representations of the same numeric value: 12345.67, 12,345.67, 12.345,67, 1,23,45.67.

CSVW is also unusual in that it can be used to describe denormalized data. Because of this, it includes separator specifiers to aid in micro-parsing individual data cells into sequences of atomic datatypes.

CSVW is a very new specification and applies to a domain with historically no standard schema language. Tools like CSVLint² are adopting CSVW as a way to offer interoperable schema declarations to enable data quality tests.

3.2 Understanding the RDF Validation Problem

As we can see in Table 3.1, most data technologies have some description and validation technology which enables users to describe the desired schema of the data and to check if some existing data conforms with that schema.

Table 3.1: Data validation approaches

Data format Validation technology

Relational databases DDL

XML DTD, XML Schema, RelaxNG, Schematron

CSV CSV on the Web

JSON JSON Schema

RDF ShEx/SHACL

Although there have been several previous attempts to define RDF validation technologies (see Section 3.3) this book focuses on ShEx and SHACL.

In this section we describe what are the particular concepts of RDF that have to be taken into account for its validation:

Graph data model

RDF is composed of triples, which have arcs (predicates) between nodes. We can describe:

the form of a node (the mechanisms for doing this will be called “node constraints”);
the number of possible arcs incoming/outgoing from a node; and
the possible values associated with those arcs.

Figure 3.2 presents an RDF node and its corresponding Shape.

Figure 3.5: RDF node and its shape.

Unordered arcs

A difference between RDF and XML with regards to their data model is that while in RDF, the arcs are unordered, in XML, the sub-elements form an ordered sequence. RDF validation languages must not assume any order on how the arcs of a node will be treated, while in XML, the order of the elements affect the validation process.

From a theoretical point of view, the arcs related with a node in RDF can be represented as a bag or multiset, i.e., a set which allows duplicate elements.

RDF Validation ≠ Ontology ≠ Instance data

Notice that RDF validation is different from ontology definition and also different from instance data.

Ontologies are usually focused on real-world things or at least objects from some domain. The semantic web community has put a lot of emphasis on defining ontologies for different domains and there are several vocabularies like OWL, RDFS, etc. that can be used to that end. People concerned with this level are ontology engineers which must have skills to understand how to represent the knowledge of some domain.
Instance data refers to the data of some situation or problem at any given point. That data can be obtained from different sources and is materialized in some data representation language. In our case, instance data refers to RDF graphs that are created by developers and programmers, or generated automatically from other sources like sensors.
RDF validation is an intermediate process that can check if that instance data conforms to some desired schema. In the case of RDF, it is focused on RDF graph features which are at a lower level than ontology features. The people interested in RDF data description and validation are data engineers and have concerns that are different from those of ontology engineers. Data engineers are more worried about how to model data so the developers can effectively and efficiently produce or consume it.

Figure 3.2 represents the difference between instance data, ontology definitions, and RDF validation.

Figure 3.6: RDF validation vs. ontology definition.

Shapes ≠ Types

Given the open and flexible nature of RDF, nodes in RDF graphs can have zero, one or many rdf:type arcs.

Some application can use nodes of type schema:Person with some properties while another application can use nodes with the same type but different properties. For example, schema:Person can represent friend, invitee, patient,...in different applications or even in different contexts of the same application. The same types can have different meanings and different structure depending on the context.

While from an ontology point of view a concept has a single meaning, applications that are using that same concept may select different properties and values and thus, the corresponding representations may differ.

Nodes in RDF graphs are not necessarily annotated with fully discriminating types. This implies that it is not possible to validate the shape of a node by just looking at its rdf:type arc.

We should be able to define specific validation constraints in different contexts.

Inference

Validation can be performed before or after inference. Validation after inference (or validation on a backward-chaining store that does inference on the fly) checks the correctness of the implications. An inference testing service could use an input schema describing the contents of the input RDF graph and an output schema describing the contents of the expected inferred RDF graph. The service can check that instance data conforms to the input schema before inference and that after applying a reasoner, the resulting RDF graph with inferred triples, conforms to the output schema.

Example 21 Suppose we have a schema with two shapes, each with one requirement:

PersonShape requires an rdf:type of :Person
TeacherShape requires an rdf:type of :Teacher

If we validate the following RDF graph without inference, only :alice would match PersonShape. However, if we validate the RDF graph that results of applying RDF Schema inference, then both :bob and :carol would also match PersonShape.

:teaches rdfs:domain :Teacher . :Teacher rdfs:subClassOf :Person . :alice a :Person . :bob a :Teacher . :carol :teaches :algebra .

Validation workflows will likely perform validation both before and after validation. Systems which perform possibly incomplete inference can use this to verify that their light-weight, partial inference is producing the required triples.

RDF flexibility

RDF was born as a schema-less language, a feature which provided a series of advantages in terms of flexibility and adaptation of RDF data to different scenarios.

The same property, can have different types of values. For example, a property like schema:creator can have as value a string literal or a more complex resource.

:angie schema:creator "Keith Richards" , [ a schema:Person ; schema:givenName "Mick" ; schema:familyName "Jagger" ] .

Repeated properties

Sometimes, the same property is used for different purposes in the same data. For example, a book can have two codes with different structure.

:book schema:name "Moby Dick"; schema:productID "ISBN-10:1503280780"; schema:productID "ISBN-13:978-1503280786" .

This is a natural consequence of the re-use of general properties,³ which is especially common in domains where many kinds of data are represented in the same structure.

Example 22 Repeated properties example in clinical records

Repeated properties which require different model for each value appear frequently in real-life scenarios. For example, FHIR (see Section 6.2 for a more detailed description) represents clinical records using a generic observation object. This means that a blood pressure measurement is recorded using the same data structure as a temperature. The challenge is that while a temperature observation has one value:⁴

:Obs1 a fhir:Observation ; fhir:Observation.code fhir:LOINC8310-5 ; fhir:Observation.valueQuantity 36.5 ; fhir:Observation.valueUnit "Cel" .

a blood pressure observation has two:⁵

:Obs2 a fhir:Observation ; fhir:Observation.code fhir:LOINC55284-4 ; fhir:Observation.component [ fhir:Observation.component.code fhir:LOINC8480-6 ; fhir:Observation.component.valueQuantity 107 ; fhir:Observation.component.valueUnit "mm[Hg]" ]; fhir:Observation.component [ fhir:Observation.component.code fhir:LOINC8462-4 ; fhir:Observation.component.valueQuantity 60 ; fhir:Observation.component.valueUnit "mm[Hg]" ] .

We can see that a blood pressure observation must have two instances of the fhir:Observation.component property, one with a code for a systolic measurement and the other with a code for a diastolic measurement.

Treating these two constraints on the property fhir:Observation.component individually would cause the systolic constraint to reject the diastolic measurement and the diastolic constraint to reject the systolic measurement—both constraints must be considered as being satisfied if one of the components satisfies one and the other component satisfies the other.

Closed Shapes

The RDF dictum of anyone can say anything about anything is in tension with conventional data practices which reject data with any assertions that are not recognized by the schema. For SQL schemas, this is enforced by the data storage itself; there’s simply no place to record assertions that does not correspond to some attribute in a table specified by the DDL. XML Schema offers some flexibility with constructs like <xs:any processContents="skip"> but these are rare in formats for the exchange of machine-processable data. Typically the edict is if you pass me something I do not understand fully, I will reject it.

For shapes-based schema languages, a shape is a collection of constraints to be applied to some node in an RDF graph and if it is closed, every property attached to that node must be included in the shape.

Even if the receiver of the data permits extra triples, it may not be able to store or return them. For instance, a Linked Data container may accept arbitrary data, search for sub-graph which it recognizes, and ignore the rest. A user expecting to put data in such a container and retrieve it will have a rude surprise when he gets back only a subset of the submitted data. Even if the receiver does not validate with closed shapes, the user may wish to pre-emptively validate their data against the receiver’s schema, flagging any triples not recognized by the schema.

Another value of closed shapes is that it can be used to detect spelling mistakes. If a shape in a schema includes an optional rdfs:label and a user has accidentally included an rdf:label, the schema has no way to detect that mistake unless all unknown properties are reported.

Like with repeated properties, the validation of closed shapes must consider property constraints as a whole, rather than examining each individually.

3.3 Previous RDF Validation Approaches

In this section we review some previous approaches that have already been proposed to validate RDF.

3.3.1 Query-based Validation

Query-based approaches use a query Language to express validation constraints. One of the earliest attempts in this category was Schemarama [63], by Libby Miller and Dan Brickley, which applied Schematron to RDF using the Squish query language. That approach was later adapted to use TreeHuger which reinterpreted XPath syntax to describe paths in the RDF model [95].

Once SPARQL appeared in scene, it was also adopted for RDF validation. SPARQL has a lot of expressiveness and can be used to validate numerical and statistical computations [55].

Example 23 Using SPARQL to validate RDF

If we want to validate that an RDF node has a schema:name property with a xsd:string value and a schema:gender property whose value must be one of schema:Male or schema:Female in SPARQL, we can do the following query:

ASK { { SELECT ?Person { ?Person schema:name ?o . } GROUP BY ?Person HAVING (COUNT(*)=1) } { SELECT ?Person { ?Person schema:name ?o . FILTER ( isLiteral(?o) && datatype(?o) = xsd:string ) } GROUP BY ?Person HAVING (COUNT(*)=1) } { SELECT ?Person (COUNT(*) AS ?c1) { ?Person schema:gender ?o . } GROUP BY ?Person HAVING (COUNT(*)=1) } { SELECT ?Person (COUNT(*) AS ?c2) { ?Person schema:gender ?o . FILTER ((?o = schema:Female || ?o = schema:Male)) } GROUP BY ?Person HAVING (COUNT(*)=1) } FILTER (?c1 = ?c2) }

Using plain-SPARQL queries for RDF validation has the following benefits.

It is very expressive and can handle most RDF validation needs.
SPARQL is ubiquitous: most of RDF products already have support for SPARQL.

But it also has the following problems.

Being very expressive, it is also very verbose. SPARQL queries can be difficult to write and debug by non-experts.
It can be idiomatic in the sense that there can be more than one way to encode the same constraint.
For all but the simplest data structures, it is complex to exhaustively write SPARQL queries which accept all valid permutations and reject all incorrect structures. This exhaustive enumeration is essentially the job of the approaches described below.

SPARQL Inferencing Notation (SPIN)[51] was introduced by TopQuadrant as a mechanism to attach SPARQL-based constraints and rules to classes. SPIN also contained templates, user-defined functions and template libraries. SPIN rules are expressed as SPARQL ASK queries where true indicates an error or CONSTRUCT queries that produce violations. SPIN uses the expressiveness of SPARQL plus the semantics of the variable ?this standing for the current focus node (the subject being validated).

SPIN has heavily influenced the design of SHACL. The Working Group has decided to offer a SPARQL based semantics and the second part of the working draft also contains a SPIN-like mechanism for defining SPARQL native constraints, templates and user-defined functions. There are some differences like the renaming of some terms and the addition of more core constraints like disjunction, negation or closed shapes. The following document describes how SHACL and SPIN relate (http://spinrdf.org/spin-shacl.html).

There have been other proposals using SPARQL combined with other technologies. Fürber and Hepp [39] proposed a combination between SPARQL and SPIN as a semantic data quality framework, Simister and Brickley [90] propose a combination between SPARQL queries and property paths which is used by Google and Kontokostas et al. [53] proposed RDFUnit a Test-driven framework which employs SPARQL query templates that are instantiated into concrete quality test queries.

3.3.2 Inference-based Approaches

Inference based approaches adapt RDF Schema or OWL to express validation semantics. The use of Open World and Non-unique name assumption limits the validation possibilities. In fact, what triggers constraint violations in closed world systems leads to new inferences in standard OWL systems. Motik, Horrocks, and Sattler [64] proposed the notion of extended description logics knowledge bases, in which a certain subset of axioms were designated as constraints.

In [72], Peter F. Pater-Schneider, separates the validation problem in two parts: integrity constraint and closed-world recognition. He shows that description logics can be implemented for both by translation to SPARQL queries.

In 2010, Tao et al. [96] had already proposed the use of OWL expressions with Closed World Assumption and a weak variant of Unique Name Assumption to express integrity constraints.

Their work forms the bases of Stardog ICV [21] (Integrity Constraint Validation), which is part of the Stardog database. It allows to write constraints using OWL syntax but with a different semantics based on a closed world and unique name assumption. The constraints are translated to SPARQL queries. As an example, a User could be specified as follows.

Example 24 Validation constraints using Stardog ICV

The following code declares several integrity constraints in Stardog ICV. It declares that nodes that are instances of schema:Person must have at exactly one value of schema:name (it is a functional property) which must be a xsd:string, an optional value of schema:gender which must be either schema:Male or schema:Female, and zero or more values of schema:knows which must be instances of schema:Person.

schema:Person a owl:Class ; rdfs:subClassOf [ owl:onProperty schema:name ; owl:minCardinality 1 ] , [ owl:onProperty schema:gender; owl:minCardinality 0 ] [ owl:onProperty schema:knows ; owl:minCardinality 0 ] . schema:name a owl:DatatypeProperty , owl:FunctionalProperty; rdfs:domain schema:Person ; rdfs:range xsd:string . schema:gender a owl:ObjectProperty , owl:FunctionalProperty; rdfs:domain schema:Person ; rdfs:range :Gender . schema:knows a owl:ObjectProperty ; rdfs:domain schema:Person ; rdfs:range schema:Person . schema:Female a :Gender . schema:Male a :Gender .

Instance nodes are required to have an rdf:type declaration whose value is schema:Person.

3.3.3 Structural Languages

While SPARQL and OWL Closed World were existing languages which were applied to RDF validation, some novel languages have been designed specifically to that task.

OSLC Resource Shapes [86] have been proposed as a high level and declarative description of the expected contents of an RDF graph expressing constraints on RDF terms.

Example 25 OSLC example

Example 23 can be represented in OSLC as:

:user a rs:ResourceShape ; rs:property [ rs:name "name" ; rs:propertyDefinition schema:name ; rs:valueType xsd:string ; rs:occurs rs:Exactly-one ; ] ; rs:property [ rs:name "gender" ; rs:propertyDefinition schema:gender ; rs:allowedValue schema:Male, schema:Female ; rs:occurs rs:Zero-or-one ; ].

Dublin Core Application Profiles [23] also define a set of validation constraints using Description Templates

Fischer et al. [38] proposed RDF Data Descriptions as another domain specific language that is compiled to SPARQL. The validation is class based in the sense that RDF nodes are validated against a class C whenever they contain an rdf:type C declaration. This restriction enables the authors to handle the validation of large datasets and to define some optimization techniques which could be applied to shape implementations.

3.4 Validation Requirements

In this section we collect the different validation requirements that we have identified for an RDF validation language.

Some of this requirements have been borrowed from the SHACL Use Cases and Requirements document [91]. Other collections of validation requirements have also been proposed [13].

3.4.1 General Requirements

VR 1. High-level language: The schema must be defined using a high-level language that uses concepts familiar to the users that intend to validate RDF.
VR 2. Concise: Schemas must be easy to understand, read, and write by humans. Verbose languages tend to be neglected by their users.
VR 3. Formal: It must be based on a formal language that can be automatically processed by machines without ambiguity. The schemas must be parsed and processed by automatic means and the semantics of the different terms must be defined in a non-ambiguous way.
VR 4. Implementation independence: The schema definition must be implementation independent so processors can be implemented using different programming languages and technologies
VR 5. Feasibility: The validation algorithm that a schema processor has to implement must be feasibly computed. It is necessary to check that suitable algorithms are available to check if RDF datasets comply with some schema. Otherwise, if the validation requires too many computational resources, there will not be interest in its application in practical scenarios.
VR 6. Least power: The schema language must be able to do its job well but no more than that. Although one could use whole procedural languages like Java or Python to validate RDF, doing it in this way will be cumbersome as the validation rules will be interspersed with the code [97]. This principle states that a declarative language should be preferred over a procedural one.

3.4.2 Graph-based Requirements

Given that the RDF data model is a graph model. An RDF validation language must be able to describe graph structures. The following set of requirements could be applied to any validation language related with graphs.

VR 7. Focus identification: A validation process must identify the graph nodes that are expected match constraints. Unlike tree structures like XML or JSON, graphs like RDF have no “root” node. For RDF, the focii would be IRIs, literals and blank nodes which are subject to validation.
VR 8. Properties: A schema language must be able to describe which arcs relate with which nodes. In the case of RDF, arcs between nodes are called properties or predicates and are IRIs. The schema language must be able to describe the properties that depart from some nodes.
VR 9. Repeated properties: Some of the arcs that depart from a node may be repeated and the nodes that they point to could have different structure. The schema language must be able to declare that some properties can appear repeated but with different contents.
VR 10. Inverse properties: It must be possible to describe the incoming arcs of a node, which are also called inverse properties.
VR 11. Paths: The schema language must be able to describe the paths that relate two given nodes in a graph. SPARQL 1.1 contains a language to describe paths in an RDF graph. For example, the transitive traversal of the rdfs:subClassOf property can be expressed as rdfs:subClassOf*.

3.4.3 RDF Data Model Requirements

The schema language must be able to check the different types of contents that appear in the RDF data model.

VR 12. Node kinds: The RDF data model contains three kinds of nodes: IRIs, Literals, and BNodes. The schema language must be able to describe the kind of some specific nodes
VR 13. Datatypes: The schema language must be able to describe which are the datatypes that some nodes have.
VR 14. Datatype facets: The XML Schema datatypes are the most popular datatypes employed in RDF datasets. Those datatypes can be qualified with facets which constrain the possible values. For example, one can say that a value is an xsd:integer between 10 and 20.
VR 15. Language tags: The schema language can describe the language tag associated with literals of type rdf:langString.

3.4.4 Data-modeling-based Requirements

This set of requirements are common to technologies that model data.

VR 16. Conjunction: It must be possible to declare that some content must satisfy all the constraints in a set.
VR 17. Disjunction: It must be possible to declare that some content must satisfy some of the constraints in a set.
VR 18. Addition: It must be possible to declare that some content must be the addition of some content. In the case of RDF graphs, one may want to declare that a node must have some content and some other content.
VR 19. Regular cardinalities: . The schema must support regular cardinalities like optional, zero or more, one or more.
VR 20. Numerical cardinalities: . The schema must support numerical cardinalities like repetitions between m and n, or at least m repetitions.
VR 21. Negation: It must be possible to declare that some content must not satisfy some constraint.
VR 22. Recursion: It must be possible to declare that some group of constraints that depend on another group in a recursive way.
VR 23. OneOf: It must be possible to declare that some content can have one of several structures. For example, a person can have either a full name or a combination of first name and last name, but not both.
VR 24. Open/Closed models: The schema language must be able to define that some content is open and admits other features apart from the declared structure or closed and does not admit other features.
VR 25. Co-occurrence constraints: The schema language must be able to declare that the appearance of some content affects other content.

3.4.5 Expressiveness of Schema Language

VR 26. Comparisons: The schema language must describe comparisons between values like declaring that a value is less than or equal to another one.
VR 27. Arithmetic: The schema language can perform arithmetic expressions for constraint checking. For example, to describe the area of a rectangle as the product of its declared base by its declared height it must perform that multiplication.
VR 28. Expressions: The schema language can define complex expressions to enable further constraint checking. This requirement can contradict VR1 so it is necessary to find a balance between both requirements.
VR 29. Composition: The schema language provides mechanisms to define constraints that are composed of other constraints.
VR 30. Abstraction: The schema language provides mechanisms to define abstractions with parameters that can later be reused. This feature is usually implemented by functions, macros, or templates.
VR 31. Modularity: The schema definitions can be done in a modular way so they can be reused and imported from external sources.
VR 32. Specialization: The schema language can define a group of constraints that extends another group of constraints with some further refinements.

3.4.6 Validation Invocation Requirements

The following requirements refer to the relationship between schema and instance data, and to the mechanism by which the validation process is triggered.

VR 33. Whole dataset: The schema language can define constraints that must be satisfied by a whole RDF dataset.
VR 34. Single node: It must be possible to validate a single node in an RDF graph against a set of constraints.
VR 35. Selection: There are mechanisms to select which nodes in an RDF graph are selected for validation against which sets of constraints.
VR 36. Reuse: It should be possible to reuse a set of constraints in different contexts.

3.4.7 Usability Requirements

The following set of requirements refer to the usability of the schema language.

VR 37. Error reporting: Validation processors complying to the schema language can generate a report of the different violation errors that appeared during validation.
VR 38. Validation report: The schema language can generate a report of the nodes that have been validated and the set of constraints they satisfy.
VR 39. Annotations: It is possible to provide annotations with some extra information that does not affect validation but can be used for different purposes such as searching, browsing, UI generation, etc.
VR 40. Familiar syntax: The schema language supports a syntax that is familiar to its intended audience. In the case of RDF validation, a familiar syntax could be RDF.
VR 41. Profiles: The schema language can include the notion of profiles with different expressiveness so that certain processors implement a subset of the validation functionalities.

3.5 Summary

In this chapter we learned which are the main motivations for validating RDF. We started describing what do other technologies do for validation with an overview of UML, SQL, XML, JSON, and so on. This section was aimed to present those technologies and to gather some list of validation requirements that are common to all of them.

We also described some of the previous RDF validation approaches and collected a list of validation requirements that a good schema language for RDF validation must fulfil. Notice that some of them contradict each other, so it is necessary to reach some compromise solution.

3.6 Suggested Reading

Non-RDF schema languages

The following book contains a good overview of non-RDF validation approaches: Abiteboul, Manolescu, Rigaux, Rousset, and Senellart,2012 [2]
Glushko,2013 [41]
Murata, Lee, Mani, and Kawaguchi,2005 [65]
Overview of JSON Schema: Bourhis, Reutter, Suárez, and Vrgoč,2017 [14]

RDF validation approaches

Tao, Sirin, Bao, and McGuinness,2010 [96]
Bosch, Acar, Nolle, and Eckert,2015 [13]
SHACL use cases and requirements: Simon Steyskal and Karen Coyle,2016 [91]

1: See: http://www.omg.org/spec/OCL/
2: See: https://csvlint.io/
3: Those familiar with the Protégé Pizza Tutorial will recall that it uses a has topping property rather than a has pizza topping property.
4: Simplified from http://build.fhir.org/observation-example-body-temperature.ttl.html.
5: Simplified from http://build.fhir.org/observation-example-bloodpressure.ttl.html.

Data format	Validation technology
Relational databases	DDL
XML	DTD, XML Schema, RelaxNG, Schematron
CSV	CSV on the Web
JSON	JSON Schema
RDF	ShEx/SHACL