Copyright (C) 2025 The Open Library Foundation
This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.
This module provides bulk import functionality for RDF data graphs into the mod-linked-data application.
It reads RDF subgraphs in Bibframe 2 format, transforms them into the Builde vocabulary, and delivers them to mod-linked-data via Kafka.
This software uses the following Weak Copyleft (Eclipse Public License 1.0 / 2.0) licensed software libraries:
- Upload the RDF file to the S3 bucket specified by the `S3_BUCKET` environment variable.
- Inside that bucket, place the file within the subdirectory corresponding to the target tenant ID.
- Trigger the import by calling the following API (see the request sketch after this list):

  ```
  POST /linked-data-import/start?fileUrl={fileNameInS3}&contentType=application/ld+json
  x-okapi-tenant: {tenantId}
  x-okapi-token: {token}
  ```
- The file must be in JSON Lines (jsonl) format.
- Each line must contain a complete subgraph of a Bibframe Instance resource, as defined by the Bibframe 2 ontology.
For an example of a valid import file containing two RDF instances, see docs/example-import.jsonl.
- Only RDF data serialized as `application/ld+json` is supported. Support for additional formats (e.g., XML, N-Triples) may be added in the future.
- Only Bibframe Instances and their connected resources can be imported. Standalone resources, such as a Person not linked to any Instance, cannot be processed.
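For reference, the request shown above can also be issued programmatically. The following is a minimal sketch using the JDK HTTP client; the Okapi URL, tenant ID, token, and file name are placeholder values for illustration, not values defined by this module.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StartImportSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder values: replace with your Okapi gateway, tenant, token, and uploaded file name.
    String okapiUrl = "http://localhost:9130";
    String fileName = "example-import.jsonl";

    HttpRequest request = HttpRequest.newBuilder()
        // The "+" in application/ld+json is percent-encoded so it survives the query string.
        .uri(URI.create(okapiUrl + "/linked-data-import/start?fileUrl=" + fileName
            + "&contentType=application/ld%2Bjson"))
        .header("x-okapi-tenant", "diku")
        .header("x-okapi-token", "<token>")
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```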
File contents are processed in batches. Batch processing can be configured using the following environment variables:
- CHUNK_SIZE: Number of lines read from the input file per chunk
- OUTPUT_CHUNK_SIZE: Number of Graph resources sent to Kafka per chunk
- PROCESS_FILE_MAX_POOL_SIZE: Maximum threads used for parallel chunk processing
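Roughly speaking, with the default values from the table at the end of this document (CHUNK_SIZE=1000, OUTPUT_CHUNK_SIZE=100), each chunk of 1000 input lines is transformed and then published to Kafka in batches of up to 100 Graph resources, with PROCESS_FILE_MAX_POOL_SIZE limiting how many chunks are processed in parallel.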
mod-linked-data uses the Builde vocabulary for representing graph data.
During import:
- This module transforms Bibframe 2 subgraphs into equivalent Builde subgraphs using the `lib-linked-data-rdf4ld` library.
- The transformed subgraphs are published to the Kafka topic specified by the `KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC` environment variable (a producer sketch follows this list). `mod-linked-data` consumes messages from this topic, performs additional processing, and persists the graph to its database.
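The actual message payload and headers are defined by the module and its consumers. Purely as an illustration of the topic wiring, the following sketch publishes a placeholder payload to the topic named by `KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC` using the plain Kafka client; the broker address and payload are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ImportOutputPublisherSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    // Topic name taken from the environment, falling back to the default from the table below.
    String topic = System.getenv().getOrDefault(
        "KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC", "linked_data_import.output");

    // Placeholder for a serialized Builde subgraph; the real payload format is produced by the module.
    String transformedSubgraph = "{\"resource\": \"...\"}";

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>(topic, transformedSubgraph));
      producer.flush();
    }
  }
}
```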
This module is dependent on the following libraries:
To build the module, run `mvn clean install`. To skip tests, run `mvn clean install -DskipTests`.

This module uses S3-compatible storage for files; both AWS S3 and MinIO Server are supported.
The `S3_IS_AWS` environment variable specifies whether AWS S3 is used as the file storage. It defaults to `false`, which means a MinIO server is used; set it to `true` when AWS S3 is used.
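As an illustration of the upload step described in the usage section above, the following sketch puts a file into the bucket using the AWS SDK v2 S3 client, which works with both AWS S3 and MinIO. The endpoint, credentials, bucket name, tenant ID, and file path are assumptions for a local MinIO setup, not defaults of this module.

```java
import java.net.URI;
import java.nio.file.Path;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadImportFileSketch {
  public static void main(String[] args) {
    // Assumed local MinIO endpoint and example credentials (S3_URL, S3_ACCESS_KEY_ID, S3_SECRET_ACCESS_KEY).
    S3Client s3 = S3Client.builder()
        .endpointOverride(URI.create("http://127.0.0.1:9000/"))
        .region(Region.US_EAST_1)
        .forcePathStyle(true) // MinIO is usually addressed with path-style URLs
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create("minioadmin", "minioadmin")))
        .build();

    // The object key starts with the tenant ID subdirectory, as described in the usage steps above.
    s3.putObject(PutObjectRequest.builder()
            .bucket("linked-data-import")     // hypothetical S3_BUCKET value
            .key("diku/example-import.jsonl") // {tenantId}/{fileNameInS3}
            .build(),
        RequestBody.fromFile(Path.of("docs/example-import.jsonl")));
  }
}
```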
| Name | Default value | Description |
|---|---|---|
| S3_URL | http://127.0.0.1:9000/ | S3 URL |
| S3_REGION | - | S3 region |
| S3_BUCKET | - | S3 bucket |
| S3_ACCESS_KEY_ID | - | S3 access key |
| S3_SECRET_ACCESS_KEY | - | S3 secret key |
| S3_IS_AWS | false | Whether AWS S3 is used as file storage |
| CHUNK_SIZE | 1000 | Number of lines read from the input file per chunk |
| OUTPUT_CHUNK_SIZE | 100 | Number of Graph resources sent to Kafka per chunk |
| JOB_POOL_SIZE | 1 | Number of concurrent Import Jobs |
| PROCESS_FILE_MAX_POOL_SIZE | 1000 | Maximum threads used for parallel chunk processing |
| KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC | linked_data_import.output | Kafka topic where the transformed subgraph is published for mod-linked-data |
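For a local development setup against MinIO, the configuration might look like the following; the bucket name, region, and credentials are illustrative, and the remaining values are the defaults listed above.

```
S3_URL=http://127.0.0.1:9000/
S3_REGION=us-east-1
S3_BUCKET=linked-data-import
S3_ACCESS_KEY_ID=minioadmin
S3_SECRET_ACCESS_KEY=minioadmin
S3_IS_AWS=false
CHUNK_SIZE=1000
OUTPUT_CHUNK_SIZE=100
JOB_POOL_SIZE=1
PROCESS_FILE_MAX_POOL_SIZE=1000
KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC=linked_data_import.output
```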