Apache Avro
This document informally describes how an Apache Avro schema is generated from a SHACL shapes graph.
Roughly speaking, SHACL node shapes are mapped onto Avro record or enum schemas. A designated root node shape is needed as the starting point of the transformation; it corresponds to the root of the Avro schema.
| XSD | Avro | Notes |
|---|---|---|
| xsd:boolean | boolean | |
| xsd:int, xsd:integer | int | |
| xsd:float | float | |
| xsd:long | long | |
| xsd:double | double | |
| xsd:decimal | bytes | annotated with logical type decimal |
| xsd:string | string | |
| xsd:duration | fixed | annotated with logical type duration |
| xsd:dateTime, xsd:date, xsd:time [1] | string | conforming to ISO 8601 |
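The datatype table above can be read as a simple lookup. The following is a minimal sketch, not Metamorph's actual code: the names `XSD_TO_AVRO` and `avro_type_for`, the decimal precision and scale, and the name of the fixed type are all assumptions made for illustration.

```python
# Hypothetical lookup table mirroring the XSD-to-Avro datatype mapping.
XSD_TO_AVRO = {
    "xsd:boolean": "boolean",
    "xsd:int": "int",
    "xsd:integer": "int",
    "xsd:float": "float",
    "xsd:long": "long",
    "xsd:double": "double",
    # Avro's decimal logical type needs a precision; 38 and 10 are placeholders.
    "xsd:decimal": {"type": "bytes", "logicalType": "decimal",
                    "precision": 38, "scale": 10},
    "xsd:string": "string",
    # Avro's duration logical type annotates a fixed of size 12;
    # the type name "Duration" is made up for this example.
    "xsd:duration": {"type": "fixed", "name": "Duration", "size": 12,
                     "logicalType": "duration"},
    # Date and time values are carried as ISO 8601 strings.
    "xsd:dateTime": "string",
    "xsd:date": "string",
    "xsd:time": "string",
}


def avro_type_for(datatype: str):
    """Return the Avro type for an XSD datatype, or raise if it is unmapped."""
    try:
        return XSD_TO_AVRO[datatype]
    except KeyError:
        raise ValueError(f"no Avro mapping for {datatype}") from None
```

Raising on unmapped datatypes is a design choice of this sketch; an implementation could equally fall back to `string`.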
| SHACL | Avro |
|---|---|
| sh:NodeShape | |
| &emsp;sh:in | enum |
| &emsp;sh:property | record |
Note that a node shape has either exactly one statement with predicate sh:in, or at least one with predicate sh:property, but not both.
With sh:and one can specify a list of shapes, all of which must be conformed to. Currently, our implementation supports node shapes only.
For the purposes of Avro schema generation, sh:and is interpreted as recursively combining all the properties of all the specified node shapes. The recursion bottoms out as soon as there is no sh:and left to follow.
Furthermore:
- At most one sh:and statement is expected.
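The recursive reading of sh:and can be sketched as follows. This is an illustration only: shapes are modelled as plain dictionaries with hypothetical `properties` and `and` keys, and the sketch assumes that sh:and chains contain no cycles.

```python
def collect_properties(shape_id, shapes):
    """Gather a node shape's property shapes, following sh:and recursively."""
    shape = shapes[shape_id]
    properties = list(shape.get("properties", []))
    # At most one sh:and list is expected per node shape; when it is absent,
    # the loop body never runs and the recursion bottoms out.
    for member_id in shape.get("and", []):
        properties.extend(collect_properties(member_id, shapes))
    return properties


# Hypothetical shapes: Person combines Agent, which in turn combines Thing.
shapes = {
    "ex:Person": {"properties": ["ex:name"], "and": ["ex:Agent"]},
    "ex:Agent": {"properties": ["ex:id"], "and": ["ex:Thing"]},
    "ex:Thing": {"properties": ["ex:label"]},
}
print(collect_properties("ex:Person", shapes))
# prints ['ex:name', 'ex:id', 'ex:label']
```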
Each stated individual in the list value of the sh:in statement becomes an enum symbol.
| SHACL | Avro |
|---|---|
| sh:NodeShape | enum |
| &emsp;sh:targetClass | &emsp;name |
| &emsp;sh:in | &emsp;symbols |
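As an illustration of the enum mapping, the sketch below builds an Avro enum schema from a target class and the individuals listed under sh:in. Deriving the Avro names from the local part of each IRI is an assumption of this example, as are the function names `local_name` and `enum_schema`.

```python
import json


def local_name(iri: str) -> str:
    """Take the part of an IRI after the last '#' or '/'."""
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def enum_schema(target_class: str, individuals: list) -> dict:
    """Map sh:targetClass onto the enum name and sh:in onto its symbols."""
    return {
        "type": "enum",
        "name": local_name(target_class),
        "symbols": [local_name(individual) for individual in individuals],
    }


# Hypothetical example IRIs.
schema = enum_schema(
    "http://example.org/ns#Color",
    ["http://example.org/ns#Red", "http://example.org/ns#Green"],
)
print(json.dumps(schema))
```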
If there are sh:property statements about a node shape, it is mapped onto the Avro record type. Each of these property shapes is itself mapped onto an Avro record field.
| SHACL | Avro | | | |
|---|---|---|---|---|
| sh:NodeShape | record | | | |
| &emsp;sh:targetClass | &emsp;name | | | |
| &emsp;sh:property | &emsp;field | | | |
| &emsp;&emsp;sh:path | &emsp;&emsp;name | | | |
| sh:minCount, sh:maxCount | 1, 1 | 0, 1 | 1, > 1 | 0, > 1 |
| &emsp;&emsp;sh:node | node shape | union(null, …) | array(…) | union(null, array(…)) |
| &emsp;&emsp;sh:datatype | primitive | | | |
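The cardinality columns of the table can be expressed as a small wrapping function. This is a sketch under assumptions: the function name is hypothetical, a missing sh:maxCount is treated as unbounded, and any sh:minCount of 1 or more is treated as required.

```python
def apply_cardinality(item_type, min_count: int, max_count=None):
    """Wrap an Avro type according to sh:minCount and sh:maxCount.

    max_count=None stands for "no upper bound" (an assumption of this sketch).
    """
    many = max_count is None or max_count > 1
    optional = min_count == 0
    wrapped = {"type": "array", "items": item_type} if many else item_type
    # An optional value becomes a union with null.
    return ["null", wrapped] if optional else wrapped


print(apply_cardinality("string", 1, 1))
print(apply_cardinality("string", 0, 1))
print(apply_cardinality("string", 1, 5))
print(apply_cardinality("string", 0, None))
```

The four calls correspond to the four cardinality columns, in order.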
Mapping a SHACL shapes graph onto an Avro schema means transforming a graph structure into a tree. In addition, SHACL and Avro each have their own peculiarities. Together, these have a number of implications.
Since Avro schemas are trees, they have a root. It is therefore necessary to indicate what node shape represents this root.
There must be exactly one designated root node shape.
Any node shapes, property shapes and in fact all statements that do not belong to any subgraph of the root node shape, will be ignored.
Designating a node shape as the root node shape is currently done by stating an rdfs:comment with value "RootObject" for it.
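Locating the root can be sketched as below. To keep the example free of dependencies, the shapes graph is modelled as a set of (subject, predicate, object) string tuples; this representation and the function name are assumptions, not Metamorph's API.

```python
def find_root_shape(triples):
    """Return the single node shape marked with rdfs:comment "RootObject"."""
    roots = {
        subject
        for (subject, predicate, obj) in triples
        if predicate == "rdfs:comment" and obj == "RootObject"
    }
    # There must be exactly one designated root node shape.
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root node shape, found {len(roots)}")
    return next(iter(roots))


# Hypothetical shapes graph with one designated root.
triples = {
    ("ex:OrderShape", "rdf:type", "sh:NodeShape"),
    ("ex:OrderShape", "rdfs:comment", "RootObject"),
    ("ex:ItemShape", "rdf:type", "sh:NodeShape"),
}
print(find_root_shape(triples))
```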
Named types in Avro allow referring to an earlier defined type by its name. So, if a record D occurs more than once in the schema, only the first time will it be defined, and all subsequent times it is referred to by its name (D).
Note, however, that Avro does not support forward referencing: there is no way to use the name D in advance; the record must already be defined. A particular consequence of this is that during the definition of the D record schema, i.e. before that definition is finished, no reference to it can be made.
Now imagine the following example case where there's a structural loop in the shapes graph.
Node shape A has a property that refers to node shape B, which in turn has a property that refers to node shape A again. When we generate a record schema for A, at some point we'll generate the record B with a field that refers to the A record again. However, since we haven't finished defining A yet, we can't reference it. In practice this leads to the application redefining the A record, which for the same reasons causes B to also be redefined (assuming it too wasn't defined earlier), which causes another redefinition of A, and so on. The program hangs and probably runs into a stack overflow at some point.
To eliminate this issue, properties that cause it are simply ignored in the transformation.
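The loop handling described above can be sketched as a depth-first traversal that tracks which records are still being defined. The data model (each shape maps to a list of `(path, target)` pairs) and the function name are hypothetical; cardinalities and datatypes are left out to keep the focus on the loop.

```python
def build_record(shape_id, shapes, defined=None, in_progress=None):
    """Build a nested Avro record, ignoring properties that close a loop."""
    defined = set() if defined is None else defined
    in_progress = set() if in_progress is None else in_progress
    in_progress.add(shape_id)
    fields = []
    for path, target in shapes[shape_id]:
        if target in in_progress:
            # The target is still being defined: ignore this property.
            continue
        if target in defined:
            # Already fully defined earlier: refer to it by name.
            field_type = target
        else:
            field_type = build_record(target, shapes, defined, in_progress)
        fields.append({"name": path, "type": field_type})
    in_progress.discard(shape_id)
    defined.add(shape_id)
    return {"type": "record", "name": shape_id, "fields": fields}


# A refers to B, and B refers back to A; B also refers to C.
looped = {"A": [("b", "B")], "B": [("a", "A"), ("c", "C")], "C": []}
record_a = build_record("A", looped)
print([field["name"] for field in record_a["fields"][0]["type"]["fields"]])
# prints ['c']
```

In the result, record B keeps its field `c` but loses the field `a` that pointed back to A, so the traversal terminates.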
Avro schemas support only the cardinalities 0, 1 and * (more than 1). The mapping table shows how SHACL's finer-grained cardinalities are dealt with.
Avro aliases can be used to rename fields and improve compatibility between schemas. For this reason, Metamorph will never map something onto the alias concept.
Metamorph chooses to make all fields optional to make schema evolution fully compatible and easy. See Schema Evolution Considerations for more information.
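An optional field in Avro is typically a union with null. The sketch below shows one way such a field could look; the function name and the choice to give the field a null default are assumptions of this example and are not confirmed by the description above.

```python
def optional_field(name, avro_type):
    """Build a record field that accepts null (assumes avro_type is not a union)."""
    # "null" is listed first, the usual convention when the default is null.
    return {"name": name, "type": ["null", avro_type], "default": None}


print(optional_field("label", "string"))
# prints {'name': 'label', 'type': ['null', 'string'], 'default': None}
```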