🗎 Apache Avro

This document gives an informal description of how an Apache Avro schema is generated from a SHACL shapes graph.

Roughly speaking, SHACL node shapes are mapped onto Avro record or enum schemas. The transformation starts from a designated root node shape, which corresponds to the root of the Avro schema.

Used standards

Mappings

Primitive types

XSD	Avro	Notes
`xsd:boolean`	`boolean`
`xsd:int` `xsd:integer`	`int`
`xsd:float`	`float`
`xsd:long`	`long`
`xsd:double`	`double`
`xsd:decimal`	`bytes`	annotated with logical type `decimal`
`xsd:string`	`string`
`xsd:duration`	`fixed`	annotated with logical type `duration`
`xsd:dateTime` `xsd:date` `xsd:time`^[1]	`string`	conforming to ISO 8601
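
The table above can be read as a simple lookup. The following Python sketch is only an illustration of it: the names `XSD_TO_AVRO` and `avro_type_for` are hypothetical, and the decimal `precision` and `scale` values as well as the fixed type name `Duration` are assumptions, since this document does not state them. The size of 12 bytes is what the Avro specification prescribes for the `duration` logical type.

```python
import json

# Hypothetical lookup table mirroring the primitive type mapping above.
XSD_TO_AVRO = {
    "xsd:boolean": "boolean",
    "xsd:int": "int",
    "xsd:integer": "int",
    "xsd:float": "float",
    "xsd:long": "long",
    "xsd:double": "double",
    "xsd:string": "string",
    # Assumption: precision and scale are placeholders; Avro requires a
    # precision for the decimal logical type, but this page does not give one.
    "xsd:decimal": {"type": "bytes", "logicalType": "decimal",
                    "precision": 18, "scale": 4},
    # Avro's duration logical type annotates a fixed of size 12.
    # The name "Duration" is an assumption (fixed types must be named).
    "xsd:duration": {"type": "fixed", "name": "Duration", "size": 12,
                     "logicalType": "duration"},
    # Date and time values are kept as ISO 8601 strings.
    "xsd:dateTime": "string",
    "xsd:date": "string",
    "xsd:time": "string",
}


def avro_type_for(datatype: str):
    """Return the Avro type for an XSD datatype, or raise if it is unmapped."""
    try:
        return XSD_TO_AVRO[datatype]
    except KeyError:
        raise ValueError(f"no Avro mapping for {datatype}") from None


print(json.dumps(avro_type_for("xsd:decimal")))
```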

Node shapes

SHACL	Avro
`sh:NodeShape` with `sh:in`	`enum`
`sh:NodeShape` with `sh:property`	`record`

Note that a node shape has either exactly one statement with predicate sh:in, or at least one statement with predicate sh:property, but never both.
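
As an illustration of this rule, the sketch below decides which Avro schema kind a node shape maps onto. It assumes a simplified, hypothetical representation in which a node shape is a Python dict keyed by SHACL predicate; it is not taken from the actual implementation and it does not consider properties inherited through sh:and.

```python
def avro_kind(node_shape: dict) -> str:
    """Return "enum" or "record" for a node shape given as a dict.

    Assumption: the dict maps SHACL predicates to their values, for example
    {"sh:in": [...]} or {"sh:property": [...]}.
    """
    has_in = "sh:in" in node_shape
    has_properties = bool(node_shape.get("sh:property"))
    if has_in and has_properties:
        raise ValueError("a node shape cannot have both sh:in and sh:property")
    if not has_in and not has_properties:
        raise ValueError("a node shape needs sh:in or at least one sh:property")
    return "enum" if has_in else "record"


print(avro_kind({"sh:in": ["ex:Red", "ex:Green"]}))
print(avro_kind({"sh:property": ["ex:name"]}))
```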

Shape conjunction

`sh:and`

Using sh:and one can specify a list of shapes that must all be conformed to. Currently, our implementation only supports node shapes in this list.

For the purpose of Avro schema generation, sh:and is interpreted as recursively combining all the properties of all the specified node shapes. The recursion bottoms out as soon as there is no sh:and left to follow.

Furthermore:

At most one sh:and statement is expected.
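
The recursive interpretation of sh:and can be sketched as follows. This is a minimal illustration under assumptions, not the actual implementation: shapes are modelled as a hypothetical dict from shape name to a dict of SHACL predicates, property shapes are represented by plain names, and the sh:and references are assumed to be free of cycles.

```python
def collect_properties(shape_name: str, shapes: dict) -> list:
    """Collect a shape's own properties plus those of every shape it
    references through sh:and, recursively.

    Assumption: sh:and lists do not form a cycle; otherwise this sketch
    would recurse forever.
    """
    shape = shapes[shape_name]
    properties = list(shape.get("sh:property", []))
    # The recursion bottoms out when a shape has no sh:and list.
    for member in shape.get("sh:and", []):
        properties.extend(collect_properties(member, shapes))
    return properties


# Hypothetical example shapes.
SHAPES = {
    "ex:Person": {"sh:property": ["ex:name"], "sh:and": ["ex:Agent", "ex:Located"]},
    "ex:Agent": {"sh:property": ["ex:id"], "sh:and": ["ex:Thing"]},
    "ex:Thing": {"sh:property": ["ex:label"]},
    "ex:Located": {"sh:property": ["ex:address"]},
}

print(collect_properties("ex:Person", SHAPES))
```

In this example the record for ex:Person would receive the fields of ex:Agent, ex:Thing and ex:Located in addition to its own.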

Enumerations

Each stated individual in the list value of the sh:in statement becomes an enum symbol.

SHACL	Avro
`sh:NodeShape`	`enum`
`sh:targetClass`	`name`
`sh:in`	`symbols`
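
A minimal sketch of this mapping is shown below. The dict representation of the node shape and the `local_name` rule for turning an IRI into an Avro name are assumptions made for illustration; this document does not describe how names are derived. Note that Avro names and symbols must match `[A-Za-z_][A-Za-z0-9_]*`, which this sketch does not enforce.

```python
import json


def local_name(iri: str) -> str:
    """Hypothetical naming rule: keep the part after the last '#', '/' or ':'."""
    for separator in ("#", "/", ":"):
        iri = iri.rsplit(separator, 1)[-1]
    return iri


def enum_schema(node_shape: dict) -> dict:
    """Build an Avro enum schema from a node shape that has an sh:in list."""
    return {
        "type": "enum",
        "name": local_name(node_shape["sh:targetClass"]),
        "symbols": [local_name(individual) for individual in node_shape["sh:in"]],
    }


# Hypothetical example shape.
COLOR_SHAPE = {
    "sh:targetClass": "http://example.org/ns#Color",
    "sh:in": ["ex:Red", "ex:Green", "ex:Blue"],
}

print(json.dumps(enum_schema(COLOR_SHAPE), indent=2))
```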

Records

If there are sh:property statements about a node shape, it is mapped onto the Avro record type. Each of these property shapes is itself mapped onto an Avro record field.

SHACL	Avro
`sh:NodeShape`	`record`
`sh:targetClass`	`name`
`sh:property`	`field`
`sh:path`	`name`

The field type depends on the cardinality (`sh:minCount, sh:maxCount`):

SHACL	`1, 1`	`0, 1`	`1, > 1`	`0, > 1`
`sh:node`	node shape	`union(null, …)`	`array(…)`	`union(null, array(…))`
`sh:datatype`	primitive	`union(null, …)`	`array(…)`	`union(null, array(…))`
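
The cardinality columns of the table can be expressed as a small function. The sketch below is an illustration under assumptions: `field_type` is a hypothetical name, `max_count=None` stands for an absent sh:maxCount (unbounded), and any sh:minCount of 1 or more is treated like 1.

```python
def field_type(base_type, min_count: int, max_count=None):
    """Wrap a base Avro type according to sh:minCount and sh:maxCount.

    base_type is the Avro type of a single value: a primitive type name or
    the schema generated for the referenced node shape.
    """
    many = max_count is None or max_count > 1
    avro_type = {"type": "array", "items": base_type} if many else base_type
    if min_count == 0:
        # In Avro JSON a union is written as a list of its branches.
        return ["null", avro_type]
    return avro_type


print(field_type("string", 1, 1))
print(field_type("string", 0, 1))
print(field_type("string", 1, None))
print(field_type("string", 0, 5))
```

The four calls correspond to the four cardinality columns, from left to right.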

Limitations and notes

Mapping a SHACL shapes graph onto an Avro schema means transforming a graph structure into a tree. In addition, both formats have their own peculiarities. Together these have the implications described below.

Root node shape

Since Avro schemas are trees, they have a root. It is therefore necessary to indicate what node shape represents this root.

There must be exactly one designated root node shape.

Ignored statements

Any node shapes, property shapes, and indeed any statements that are not part of a subgraph starting at the root node shape are ignored.

A node shape is currently designated as the root node shape by stating an rdfs:comment with the value "RootObject" about it.
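
The root lookup and the resulting set of statements that are taken into account can be sketched as follows. This is an illustration only: the shapes graph is modelled as a plain list of (subject, predicate, object) string triples instead of a real RDF library, and the function names and example shapes are hypothetical.

```python
def find_root(triples):
    """Return the single node shape marked with rdfs:comment "RootObject"."""
    roots = {s for (s, p, o) in triples
             if p == "rdfs:comment" and o == "RootObject"}
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root node shape, found {len(roots)}")
    return next(iter(roots))


def reachable(root, triples):
    """Return every subject reachable from the root by following statements.

    Statements about any other subject are the ones that get ignored.
    """
    subjects = {s for (s, _, _) in triples}
    seen, todo = {root}, [root]
    while todo:
        node = todo.pop()
        for (s, _, o) in triples:
            if s == node and o in subjects and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


# Hypothetical example graph.
TRIPLES = [
    ("ex:OrderShape", "rdfs:comment", "RootObject"),
    ("ex:OrderShape", "sh:property", "ex:OrderShape-customer"),
    ("ex:OrderShape-customer", "sh:node", "ex:CustomerShape"),
    ("ex:CustomerShape", "sh:property", "ex:CustomerShape-name"),
    ("ex:CustomerShape-name", "sh:datatype", "xsd:string"),
    ("ex:UnusedShape", "sh:property", "ex:UnusedShape-note"),
    ("ex:UnusedShape-note", "sh:datatype", "xsd:string"),
]

root = find_root(TRIPLES)
print(root, sorted(reachable(root, TRIPLES)))
```

Here ex:UnusedShape and its property shape are not reachable from the root, so they would not appear in the generated schema.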

Structural loops

Named types in Avro allow referring to an earlier defined type by its name. So, if a record D occurs more than once in the schema, only the first time will it be defined, and all subsequent times it is referred to by its name (D).

Note, however, that Avro does not support forward referencing: there is no way to use the name D in advance, the record must already be defined. A particular consequence of this is that during the definition of the D record schema - i.e. prior to having finished that definition - no reference to it can be made.

Now imagine the following example case where there's a structural loop in the shapes graph.

Node shape A has a property that refers to node shape B, which in turn has a property that refers to node shape A again. When we generate a record schema for A, at some point we'll generate the record B with a field that refers to the A record again. However, since we haven't finished defining A yet, we can't reference it. In practice this leads to the application redefining the A record, which for the same reasons causes B to also be redefined (assuming it too wasn't defined earlier), which causes another redefinition of A, and so on. The program hangs and probably runs into a stack overflow at some point.

To avoid this issue, the transformation simply ignores any property that would cause such a loop.
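
One way to picture this behaviour is a generator that keeps track of the records it is currently defining. The sketch below is an assumption about how such a rule could work, not the actual implementation: shapes are reduced to a hypothetical dict from record name to a dict of field name and target, where a target is either another record name or a primitive type.

```python
def record_schema(name, shapes, in_progress=None, defined=None):
    """Generate a record schema, skipping fields that would close a loop."""
    in_progress = set() if in_progress is None else in_progress
    defined = set() if defined is None else defined
    in_progress.add(name)
    fields = []
    for field_name, target in shapes[name].items():
        if target in in_progress:
            # The target record is still being defined: ignore this property.
            continue
        if target in defined:
            # Already fully defined earlier: refer to it by name.
            field_type = target
        elif target in shapes:
            field_type = record_schema(target, shapes, in_progress, defined)
        else:
            field_type = target  # primitive type name
        fields.append({"name": field_name, "type": field_type})
    in_progress.discard(name)
    defined.add(name)
    return {"type": "record", "name": name, "fields": fields}


# A refers to B, and B refers back to A.
LOOP = {"A": {"b": "B", "label": "string"}, "B": {"a": "A", "code": "int"}}
# D occurs twice without any loop.
TWICE = {"R": {"x": "D", "y": "D"}, "D": {"v": "string"}}

print(record_schema("A", LOOP))
print(record_schema("R", TWICE))
```

In the first example the field a of record B is dropped, while in the second example the second occurrence of D becomes a reference by name.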

Limited cardinality support

Avro schemas only support cardinalities of 0, 1 and * (more than 1). The mapping table under Records shows how SHACL's finer-grained cardinalities are dealt with.

Aliases

Avro aliases can be used to rename fields and improve compatibility between schemas. For this reason, Metamorph never maps anything onto the alias concept.

Easy schema evolution

Metamorph makes all fields optional so that schema evolution is fully compatible and easy. See Schema Evolution Considerations for more information.
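
In Avro, an optional field is commonly written as a union with null as its first branch and a default of null. The sketch below shows that pattern; whether Metamorph emits exactly this form, including the default, is an assumption, and `make_optional` is a hypothetical helper.

```python
def make_optional(field: dict) -> dict:
    """Return a copy of an Avro field whose type is a union starting with null.

    Assumption: a null default is added so that readers can fill in the
    field when it is missing from data written with an older schema.
    """
    field_type = field["type"]
    if isinstance(field_type, list):
        # Already a union: move null to the front, keep the other branches.
        branches = ["null"] + [b for b in field_type if b != "null"]
    else:
        branches = ["null", field_type]
    return {**field, "type": branches, "default": None}


print(make_optional({"name": "age", "type": "int"}))
```

The null branch comes first because Avro field defaults are matched against the first branch of a union.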

Footnotes

[1] It is also possible to map the XSD date and time types onto Avro int or long types, annotated with logical type timestamp-millis or timestamp-micros, but this has been found to be confusing to developers.
