
mod-linked-data-import

Copyright (C) 2025 The Open Library Foundation

This software is distributed under the terms of the Apache License, Version 2.0. See the file "LICENSE" for more information.

Introduction

This module provides bulk import functionality for RDF data graphs into the mod-linked-data application. It reads RDF subgraphs in Bibframe 2 format, transforms them into the Builde vocabulary, and delivers them to mod-linked-data via Kafka.

Third party libraries used in this software

This software uses the following Weak Copyleft (Eclipse Public License 1.0 / 2.0) licensed software libraries:

How to Import Data

  1. Upload the RDF file to the S3 bucket specified by the S3_BUCKET environment variable.
  2. Inside that bucket, place the file within the subdirectory corresponding to the target tenant ID.
  3. Trigger the import by calling the following API:
POST /linked-data-import/start?fileUrl={fileNameInS3}&contentType=application/ld+json
x-okapi-tenant: {tenantId}
x-okapi-token: {token}
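
For example, the import could be triggered with curl; the Okapi URL, file name, tenant, and token below are placeholder values:

curl -X POST \
  "https://okapi.example.org/linked-data-import/start?fileUrl=instances.jsonl&contentType=application/ld+json" \
  -H "x-okapi-tenant: diku" \
  -H "x-okapi-token: ${OKAPI_TOKEN}"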

File Format & Contents

  1. The file must be in JSON Lines (jsonl) format.
  2. Each line must contain a complete subgraph of a Bibframe Instance resource, as defined by the Bibframe 2 ontology.

For an example of a valid import file containing two RDF instances, see docs/example-import.jsonl.
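
As a simplified, hypothetical illustration (not taken from that file), a single line could contain one Instance subgraph serialized as JSON-LD, for example:

{"@context": {"bf": "http://id.loc.gov/ontologies/bibframe/"}, "@id": "http://example.org/instance/1", "@type": "bf:Instance", "bf:title": {"@type": "bf:Title", "bf:mainTitle": "Example title"}}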

Limitations

  1. Only RDF data serialized as application/ld+json is supported. Support for additional formats (e.g., XML, N-Triples) may be added in the future.
  2. Only Bibframe Instances and their connected resources can be imported. Standalone resources—such as a Person not linked to any Instance—cannot be processed.

Batch processing

File contents are processed in batches. Batch processing can be configured with the following environment variables (see the example after this list):

  1. CHUNK_SIZE: Number of lines read from the input file per chunk
  2. OUTPUT_CHUNK_SIZE: Number of Graph resources sent to Kafka per chunk
  3. PROCESS_FILE_MAX_POOL_SIZE: Maximum threads used for parallel chunk processing
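
For example, batch behavior could be tuned at deployment time; the values below are illustrative, not recommendations:

# Read 500 lines from the input file per chunk
export CHUNK_SIZE=500
# Send 50 transformed Graph resources to Kafka per chunk
export OUTPUT_CHUNK_SIZE=50
# Process up to 4 chunks in parallel
export PROCESS_FILE_MAX_POOL_SIZE=4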

Interaction with mod-linked-data

mod-linked-data uses the Builde vocabulary for representing graph data.

During import:

  1. This module transforms Bibframe 2 subgraphs into the equivalent Builde subgraph using the lib-linked-data-rdf4ld library.
  2. The transformed subgraphs are published to the Kafka topic specified by the KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC environment variable.
  3. mod-linked-data consumes messages from this topic, performs additional processing, and persists the graph to its database.
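
For troubleshooting, the messages this module publishes can be inspected with the standard Kafka console consumer; the broker address below is an assumption, and the topic name is the module's default (see the Environment variables section):

kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic linked_data_import.output \
  --from-beginning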

Dependencies on libraries

This module is dependent on the following libraries:

Compiling

mvn clean install

Skip tests:

mvn clean install -DskipTests

Environment variables

This module uses S3-compatible storage for import files; both AWS S3 and MinIO Server are supported. The S3_IS_AWS variable specifies which backend is in use: it defaults to false, meaning a MinIO server is used, and must be set to true when AWS S3 is used.

Name                                    Default value               Description
S3_URL                                  http://127.0.0.1:9000/      S3 URL
S3_REGION                               -                           S3 region
S3_BUCKET                               -                           S3 bucket
S3_ACCESS_KEY_ID                        -                           S3 access key
S3_SECRET_ACCESS_KEY                    -                           S3 secret key
S3_IS_AWS                               false                       Whether AWS S3 is used as file storage
CHUNK_SIZE                              1000                        Number of lines read from the input file per chunk
OUTPUT_CHUNK_SIZE                       100                         Number of Graph resources sent to Kafka per chunk
JOB_POOL_SIZE                           1                           Number of concurrent import jobs
PROCESS_FILE_MAX_POOL_SIZE              1000                        Maximum threads used for parallel chunk processing
KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC   linked_data_import.output   Kafka topic where the transformed subgraphs are published for mod-linked-data
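
As a minimal local-deployment sketch using MinIO as file storage (the image name, MinIO endpoint, and credentials are assumptions for illustration, not module defaults):

# Run the module with MinIO as file storage (S3_IS_AWS=false)
docker run -d --name mod-linked-data-import \
  -e S3_URL=http://minio:9000/ \
  -e S3_REGION=us-east-1 \
  -e S3_BUCKET=linked-data-import \
  -e S3_ACCESS_KEY_ID=minioadmin \
  -e S3_SECRET_ACCESS_KEY=minioadmin \
  -e S3_IS_AWS=false \
  -e KAFKA_LINKED_DATA_IMPORT_OUTPUT_TOPIC=linked_data_import.output \
  folioorg/mod-linked-data-import:latest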
