
🗎 Apache Avro

A.L. Kleijngeld edited this page Dec 2, 2022 · 11 revisions

This document informally describes how an Apache Avro schema is generated from a SHACL shapes graph.

Roughly speaking, SHACL node shapes are mapped onto Avro record or enum schemas. One node shape must be designated as the root; the transformation starts there, and that shape corresponds to the root of the Avro schema.

Used standards

Mappings

Primitive types

| XSD | Avro | Notes |
|---|---|---|
| xsd:boolean | boolean | |
| xsd:int, xsd:integer | int | |
| xsd:float | float | |
| xsd:long | long | |
| xsd:double | double | |
| xsd:decimal | bytes | annotated with logical type decimal |
| xsd:string | string | |
| xsd:duration | fixed | annotated with logical type duration |
| xsd:dateTime, xsd:date, xsd:time[1] | string | conforming to ISO 8601 |
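
As a rough illustration, the mapping above could be captured as a lookup from XSD datatype to an Avro schema fragment. This is only a sketch, not Metamorph's actual code: the names XSD_TO_AVRO and avro_type_for, the decimal precision and scale, and the name of the fixed type are all assumptions made for the example.

```python
import copy

# Hypothetical lookup table mirroring the primitive type mapping.
XSD_TO_AVRO = {
    "xsd:boolean": "boolean",
    "xsd:int": "int",
    "xsd:integer": "int",
    "xsd:float": "float",
    "xsd:long": "long",
    "xsd:double": "double",
    # Avro's decimal logical type needs a precision; 18 and 4 are placeholders.
    "xsd:decimal": {"type": "bytes", "logicalType": "decimal",
                    "precision": 18, "scale": 4},
    "xsd:string": "string",
    # Avro's duration logical type annotates a fixed of size 12.
    # The type name "Duration" is an assumption.
    "xsd:duration": {"type": "fixed", "name": "Duration", "size": 12,
                     "logicalType": "duration"},
    # Date and time values are carried as ISO 8601 strings.
    "xsd:dateTime": "string",
    "xsd:date": "string",
    "xsd:time": "string",
}


def avro_type_for(xsd_datatype):
    """Return a copy of the Avro schema fragment for an XSD datatype."""
    if xsd_datatype not in XSD_TO_AVRO:
        raise ValueError(f"unsupported datatype: {xsd_datatype}")
    return copy.deepcopy(XSD_TO_AVRO[xsd_datatype])
```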

Node shapes

| SHACL | Avro |
|---|---|
| sh:NodeShape | |
|   sh:in | enum |
|   sh:property | record |

Note that a node shape has either exactly one statement with predicate sh:in, or at least one with predicate sh:property, but not both.

Shape conjunction

sh:and

Using sh:and one can specify a list of shapes, all of which must be conformed to. Currently, our implementation supports only node shapes in this list.

For Avro schema generation, sh:and is interpreted as recursively combining the properties of all the listed node shapes. The recursion bottoms out as soon as there is no sh:and left to follow.

Furthermore:

  • At most one sh:and statement is expected.
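
The recursive combination described above can be sketched as follows. The in-memory representation (a dict of shapes with "properties" and "and" keys) and the function name collect_properties are assumptions for illustration only; the real implementation works on an RDF graph.

```python
def collect_properties(shape_name, shapes):
    """Gather the properties of a node shape and of every shape
    reachable through its (single) sh:and list."""
    shape = shapes[shape_name]
    properties = list(shape.get("properties", []))
    # Follow sh:and; a shape without an "and" entry ends the recursion.
    for member in shape.get("and", []):
        properties.extend(collect_properties(member, shapes))
    return properties


# Hypothetical shapes: Person combines Agent and Located, Agent combines Thing.
shapes = {
    "Person": {"properties": ["name"], "and": ["Agent", "Located"]},
    "Agent": {"properties": ["id"], "and": ["Thing"]},
    "Thing": {"properties": ["label"]},
    "Located": {"properties": ["address"]},
}

print(collect_properties("Person", shapes))
# ['name', 'id', 'label', 'address']
```

Note that this sketch does not guard against sh:and lists that refer back to an earlier shape.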

Enumerations

Each stated individual in the list value of the sh:in statement becomes an enum symbol.

| SHACL | Avro |
|---|---|
| sh:NodeShape | enum |
|   sh:targetClass |   name |
|   sh:in |   symbols |
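
A minimal sketch of this mapping, using plain Python values in place of RDF terms. The helper local_name and the choice to use the local part of each IRI as the Avro name or symbol are assumptions; the source does not state how names are derived.

```python
def local_name(iri):
    """Return the part of an IRI after the last '#' or '/' (assumed naming rule)."""
    return iri.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]


def enum_schema(target_class, individuals):
    """Build an Avro enum schema from sh:targetClass and the sh:in list."""
    return {
        "type": "enum",
        "name": local_name(target_class),
        "symbols": [local_name(individual) for individual in individuals],
    }


schema = enum_schema(
    "http://example.org/Color",
    ["http://example.org/Red", "http://example.org#Green"],
)
print(schema)
# {'type': 'enum', 'name': 'Color', 'symbols': ['Red', 'Green']}
```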

Records

If a node shape has sh:property statements, it is mapped onto the Avro record type. Each of its property shapes is in turn mapped onto an Avro record field.

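| SHACL | Avro | | | |
|---|---|---|---|---|
| sh:NodeShape | record | | | |
|   sh:targetClass |   name | | | |
|   sh:property |   field | | | |
|     sh:path |     name | | | |
|     sh:minCount, sh:maxCount | 1, 1 | 0, 1 | 1, > 1 | 0, > 1 |
|     sh:node | node shape | union(null, …) | array(…) | union(null, array(…)) |
|     sh:datatype | primitive | | | |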
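
The record mapping can be sketched as below, leaving cardinality aside (every field is treated as 1, 1). The dict-based property shapes, the small PRIMITIVES table and the function name record_schema are assumptions for illustration.

```python
# Small excerpt of the primitive type mapping, enough for the example.
PRIMITIVES = {"xsd:string": "string", "xsd:int": "int", "xsd:boolean": "boolean"}


def record_schema(name, property_shapes):
    """Build an Avro record; each property shape becomes one field."""
    fields = []
    for prop in property_shapes:
        if "node" in prop:
            # sh:node: the field type is the schema generated for that node shape.
            field_type = prop["node"]
        else:
            # sh:datatype: the field type is the mapped primitive.
            field_type = PRIMITIVES[prop["datatype"]]
        fields.append({"name": prop["path"], "type": field_type})
    return {"type": "record", "name": name, "fields": fields}


address = record_schema("Address", [{"path": "street", "datatype": "xsd:string"}])
person = record_schema("Person", [
    {"path": "name", "datatype": "xsd:string"},
    {"path": "address", "node": address},
])
```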

Limitations and notes

Mapping a SHACL shapes graph onto an Avro schema means transforming a graph structure into a tree. In addition, each of the two has its own peculiarities. The consequences are described in the sections below.

Root node shape

Since Avro schemas are trees, they have a root. It is therefore necessary to indicate what node shape represents this root.

There must be exactly one designated root node shape. A node shape is currently designated as the root by giving it an rdfs:comment with the value "RootObject".

Ignored statements

All node shapes, property shapes and other statements that are not part of the subgraph reachable from the root node shape are ignored.
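
Finding the root and determining which shapes take part in the transformation can be sketched as follows. The dict representation (a "comment" value and a list of referenced shapes under "refs") and both function names are hypothetical; the real implementation queries the shapes graph.

```python
def find_root(shapes):
    """Return the single shape whose rdfs:comment is "RootObject"."""
    roots = [name for name, shape in shapes.items()
             if shape.get("comment") == "RootObject"]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root node shape, found {len(roots)}")
    return roots[0]


def reachable_from(root, shapes):
    """Collect every shape reachable from the root; all others are ignored."""
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(shapes[name].get("refs", []))
    return seen


shapes = {
    "Order": {"comment": "RootObject", "refs": ["Customer"]},
    "Customer": {"refs": ["Address"]},
    "Address": {},
    "Unused": {"refs": ["Address"]},  # not reachable from Order, so ignored
}
root = find_root(shapes)
used = reachable_from(root, shapes)
```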

Structural loops

Named types in Avro allow referring to an earlier defined type by its name. So, if a record D occurs more than once in the schema, only the first time will it be defined, and all subsequent times it is referred to by its name (D).

Note, however, that Avro does not support forward referencing: there is no way to use the name D in advance, the record must already be defined. A particular consequence of this is that during the definition of the D record schema - i.e. prior to having finished that definition - no reference to it can be made.

Now imagine the following example case where there's a structural loop in the shapes graph.

Node shape A has a property that refers to node shape B, which in turn has a property that refers to node shape A again. When we generate a record schema for A, at some point we'll generate the record B with a field that refers to the A record again. However, since we haven't finished defining A yet, we can't reference it. In practice this leads to the application redefining the A record, which for the same reasons causes B to also be redefined (assuming it too wasn't defined earlier), which causes another redefinition of A, and so on. The program hangs and probably runs into a stack overflow at some point.

To eliminate this issue, properties that cause it are simply ignored in the transformation.
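
One way to implement this is sketched below: while a record is still being defined, any property that points back to it is skipped, and a record that has already been fully defined is referred to by name. This is an illustrative sketch under assumed data structures (each shape is a list of (path, target shape) pairs), not the actual Metamorph code.

```python
def build_record(name, shapes, in_progress=None, defined=None):
    """Generate a record schema, skipping properties that would loop back."""
    in_progress = set() if in_progress is None else in_progress
    defined = set() if defined is None else defined
    in_progress.add(name)
    fields = []
    for path, target in shapes[name]:
        if target in in_progress:
            # The target is still being defined: ignore this property.
            continue
        if target in defined:
            # Already defined earlier: refer to it by name.
            field_type = target
        else:
            field_type = build_record(target, shapes, in_progress, defined)
        fields.append({"name": path, "type": field_type})
    in_progress.discard(name)
    defined.add(name)
    return {"type": "record", "name": name, "fields": fields}


# A refers to B, and B refers back to A (and to C).
looping = {"A": [("b", "B")], "B": [("a", "A"), ("c", "C")], "C": []}
schema_a = build_record("A", looping)

# D occurs twice: defined the first time, referenced by name the second time.
repeated = {"R": [("x", "D"), ("y", "D")], "D": []}
schema_r = build_record("R", repeated)
```

In schema_a, record B keeps only its field c; the field a that points back to A is dropped.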

Limited cardinality support

Avro schemas only support cardinalities of 0, 1 and * (more than 1). The mapping table shows how to deal with SHACL's finer grained cardinalities.
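
The collapse of SHACL cardinalities onto the four cases of the mapping table can be sketched like this. The function name apply_cardinality is hypothetical, and treating an absent sh:maxCount as unbounded and an absent sh:minCount as 0 follows SHACL's defaults but is an assumption about this implementation.

```python
def apply_cardinality(item_type, min_count=0, max_count=None):
    """Wrap an Avro type according to sh:minCount and sh:maxCount."""
    # Any upper bound above 1 (or none at all) becomes an array.
    many = max_count is None or max_count > 1
    wrapped = {"type": "array", "items": item_type} if many else item_type
    # A lower bound of 0 makes the value nullable.
    if min_count == 0:
        return ["null", wrapped]
    # Any lower bound of 1 or more is treated as "required".
    return wrapped


print(apply_cardinality("string", 1, 1))  # string
print(apply_cardinality("string", 0, 1))  # ['null', 'string']
print(apply_cardinality("string", 2, 5))  # {'type': 'array', 'items': 'string'}
```

A constraint such as minCount 2, maxCount 5 cannot be expressed in Avro and ends up as a plain array.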

Aliases

Avro aliases can be used to rename fields and improve compatibility between schemas. For this reason, Metamorph never maps anything onto the alias concept.

Easy schema evolution

Metamorph chooses to make all fields optional to make schema evolution fully compatible and easy. See Schema Evolution Considerations for more information.
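
Making a field optional in Avro typically means giving it a union type with null as the first branch and a default of null. The sketch below shows that pattern; whether Metamorph emits exactly this form, including the default, is an assumption, and make_optional is a hypothetical helper.

```python
import json


def make_optional(field):
    """Return a copy of an Avro field that accepts null and defaults to null."""
    field_type = field["type"]
    if isinstance(field_type, list):
        # Already a union: make sure "null" is present exactly once, in front.
        branches = ["null"] + [b for b in field_type if b != "null"]
    else:
        branches = ["null", field_type]
    # None is serialized as JSON null; null comes first so it matches the default.
    return {**field, "type": branches, "default": None}


optional = make_optional({"name": "street", "type": "string"})
print(json.dumps(optional))
# {"name": "street", "type": ["null", "string"], "default": null}
```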

Footnotes

  1. It is also possible to map the XSD datetime fields to Avro int or long types, annotated with logical type timestamp-millis or timestamp-micros, but this has been found to be confusing to developers.
