Bachelor Thesis: Automated Knowledge Extraction for Large-Scale Generative AI Models Catalog

CI | Python 3.9+ | License: MIT

Overview

This project implements a Python-based NLP workflow that automatically extracts key information from Large Language Model (LLM) research papers and populates the ORKG comparison "Generative AI Model Landscape".

The pipeline supports both API-based extraction (using KISSKI Chat AI API) and HPC cluster deployment (using GPU-based transformers on GWDG Grete).

Features

  • Automated paper fetching from ArXiv
  • PDF parsing and text extraction with fallback mechanisms
  • Dual extraction modes:
    • API-based using KISSKI Chat AI API (free for thesis work)
    • GPU-based using Hugging Face transformers on HPC cluster
  • Structured data mapping to ORKG templates
  • Automatic ORKG updates with duplicate detection
  • Multi-model extraction from single papers
  • Batch processing support for HPC environments

Project Structure

Bachelor-Arbeit-NLP/
├── .github/
│   └── workflows/
│       └── ci.yml                    # GitHub Actions CI/CD
├── docs/
│   ├── deployment/
│   │   ├── grete-setup.md           # HPC cluster setup guide
│   │   └── verify-jobs.md           # Job verification guide
│   ├── troubleshooting/
│   │   ├── phi3-cache-issues.md     # Phi-3 troubleshooting
│   │   └── distilgpt2-issues.md     # DistilGPT2 troubleshooting
│   └── archive/
│       └── cleanup-summary-2025-12.md
├── grete/                            # HPC cluster deployment
│   ├── extraction/
│   │   ├── grete_extract_paper.py
│   │   ├── grete_extract_from_url.py
│   │   └── grete_extract_paper_distilgpt2.py
│   ├── jobs/
│   │   ├── grete_extract_job.sh     # SLURM single job
│   │   ├── grete_extract_batch.sh   # SLURM batch array
│   │   └── grete_extract_url_job.sh
│   └── README.md                     # Grete-specific documentation
├── src/                              # Core pipeline
│   ├── __init__.py
│   ├── comparison_updater.py         # Update ORKG comparisons
│   ├── llm_extractor.py              # KISSKI API extraction
│   ├── llm_extractor_transformers.py # GPU-based extraction
│   ├── orkg_client.py                # ORKG API wrapper
│   ├── orkg_manager.py               # ORKG management
│   ├── paper_fetcher.py              # ArXiv paper fetching
│   ├── pdf_parser.py                 # PDF text extraction
│   ├── pipeline.py                   # Main orchestration
│   └── template_mapper.py            # ORKG template mapping
├── scripts/
│   ├── append_to_paper.py            # Production utilities
│   ├── sandbox_upload.py
│   └── debug/                        # Debug/testing scripts
│       ├── add_to_orkg_manual.py
│       ├── export_to_csv.py
│       └── force_new_paper_upload.py
├── tests/                            # Unit tests
│   ├── __init__.py
│   ├── test_llm_extractor.py
│   ├── test_orkg_append.py
│   ├── test_orkg_client.py
│   ├── test_pdf_parser.py
│   └── test_pipeline.py
├── examples/
│   └── example_usage.py              # Usage examples
├── data/
│   ├── papers/                       # Downloaded PDFs
│   ├── extracted/                    # Extracted JSON data
│   └── logs/                         # Processing logs
├── notebooks/
│   └── validate_extraction.ipynb     # Validation notebook
├── config/
│   └── config.yaml                   # Configuration
├── .env.example                      # Environment variables template
├── .gitignore
├── LICENSE                           # MIT License
├── pyproject.toml                    # Modern Python packaging
├── README.md
├── requirements.txt                  # Production dependencies
└── requirements-dev.txt              # Development dependencies

Installation

  1. Clone the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
    pip install -r requirements.txt
  4. Copy .env.example to .env and fill in your credentials:
    cp .env.example .env

Configuration

Environment Variables (.env)

Required variables:

ORKG_EMAIL=your.email@example.com
ORKG_PASSWORD=your_password_here
ORKG_HOST=sandbox  # or 'production'
KISSKI_API_KEY=your_kisski_api_key_here
KISSKI_API_ENDPOINT=https://kisski.de/api/chat
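
These variables can be read at runtime with python-dotenv, for example (a minimal sketch; the variable names match .env.example above, but this loading code is illustrative rather than the pipeline's actual startup code):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the working directory

orkg_email = os.getenv("ORKG_EMAIL")
orkg_host = os.getenv("ORKG_HOST", "sandbox")  # default to the sandbox
kisski_key = os.getenv("KISSKI_API_KEY")

if not (orkg_email and kisski_key):
    raise RuntimeError("Missing credentials: copy .env.example to .env first")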

Pipeline Configuration (config/config.yaml)

Customize the extraction pipeline:

  • ORKG settings: host, template ID, comparison ID
  • KISSKI API: endpoint, model, temperature, max tokens, rate limiting
  • ArXiv settings: categories, download directory
  • Extraction fields: which fields to extract
  • Logging: level, format, output files
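
For orientation, a hypothetical config.yaml covering these settings might look like the sketch below; the key names and field list are illustrative, and the shipped config/config.yaml is authoritative:

orkg:
  host: sandbox
  template_id: R123456          # illustrative placeholder
  comparison_id: R654321        # illustrative placeholder
kisski:
  endpoint: https://kisski.de/api/chat
  model: chat-model-name        # placeholder; set the model you were assigned
  temperature: 0.1
  max_tokens: 2048
  rate_limit_per_minute: 10     # illustrative
arxiv:
  categories: [cs.CL, cs.LG]
  download_dir: data/papers
extraction:
  fields: [model_name, parameters, architecture, training_data, license]  # illustrative
logging:
  level: INFO
  file: data/logs/pipeline.log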

Usage

Basic Usage

from src.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
pipeline.process_paper("2302.13971")  # ArXiv ID for Llama paper
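
The same entry point can be looped over several ArXiv IDs for small local batches (a sketch; the error handling here is illustrative):

from src.pipeline import ExtractionPipeline

pipeline = ExtractionPipeline()
for arxiv_id in ["2302.13971", "2307.09288"]:  # Llama, Llama 2
    try:
        pipeline.process_paper(arxiv_id)
    except Exception as exc:  # keep going if one paper fails
        print(f"Skipping {arxiv_id}: {exc}")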

Command Line

python -m src.pipeline --arxiv-id 2302.13971

Testing

Run All Tests

pytest tests/ -v

Run with Coverage

pytest tests/ -v --cov=src --cov-report=html

Run Specific Test File

pytest tests/test_pipeline.py -v
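
New tests follow the standard pytest pattern. A minimal, hypothetical smoke test (not one of the shipped tests; assumes .env and config/config.yaml are in place):

# tests/test_smoke.py (hypothetical example)
from src.pipeline import ExtractionPipeline

def test_pipeline_constructs():
    # verifies the pipeline can be instantiated with the local configuration
    pipeline = ExtractionPipeline()
    assert pipeline is not None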

Continuous Integration

The project uses GitHub Actions for automated testing. See .github/workflows/ci.yml for the CI configuration.

Tests run automatically on:

  • Push to main or develop branches
  • Pull requests to main or develop
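
The workflow is roughly of the following shape (a sketch of a typical GitHub Actions setup matching the triggers above; the shipped .github/workflows/ci.yml is authoritative):

name: CI
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main, develop]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest tests/ -v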

Deployment

Local Extraction (API-based)

Uses the KISSKI Chat AI API for extraction. Fast, reliable, and free for thesis work.

Prerequisites: KISSKI API key from your professor

See Usage section above for examples.

HPC Cluster (GPU-based)

Runs Hugging Face transformers locally on GPUs. No API costs and no rate limits, but requires the one-time HPC setup.

Full deployment guide: docs/deployment/grete-setup.md

Quick start:

  1. Upload code to Grete cluster
  2. Set up conda environment with PyTorch + transformers
  3. Submit SLURM jobs:
    sbatch grete/jobs/grete_extract_job.sh 2302.13971
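
For orientation, a generic SLURM script of the kind grete/jobs/grete_extract_job.sh implements might look like this; the partition, GPU request, and environment names are placeholders, and docs/deployment/grete-setup.md has the real values:

#!/bin/bash
#SBATCH --job-name=llm-extract
#SBATCH --partition=grete          # placeholder; use the partition from the setup guide
#SBATCH --gpus=1
#SBATCH --time=01:00:00
#SBATCH --output=data/logs/%x-%j.out

source activate extraction-env     # placeholder conda environment name
python grete/extraction/grete_extract_paper.py "$1"   # ArXiv ID passed via sbatch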

Monitoring: docs/deployment/verify-jobs.md

Troubleshooting

Common issues and solutions are documented in docs/troubleshooting/:

  • docs/troubleshooting/phi3-cache-issues.md: Phi-3 cache issues on the cluster
  • docs/troubleshooting/distilgpt2-issues.md: DistilGPT2 extraction issues

Development

Code Quality

# Format code
black src/

# Lint code
flake8 src/

# Type checking
mypy src/

# Sort imports
isort src/

Project Structure

  • src/: Core pipeline code
  • tests/: Unit tests
  • scripts/: Utility scripts
  • grete/: HPC cluster deployment files
  • docs/: Documentation
  • examples/: Usage examples
  • notebooks/: Jupyter notebooks for exploration

Contributing

  1. Create a feature branch
  2. Make your changes
  3. Run tests: pytest tests/
  4. Run linters: black src/ && flake8 src/
  5. Submit a pull request

Citation

If you use this code for your research, please cite:

@thesis{bachelor2025llm,
  title={Automated Knowledge Extraction for Large-Scale Generative AI Models Catalog},
  author={Your Name},
  year={2025},
  school={Your University},
  type={Bachelor's Thesis}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Developed as part of a Bachelor thesis project
  • ORKG platform for knowledge graph infrastructure
  • GWDG for providing HPC resources and KISSKI API access
  • SAIA platform for Chat AI API services

Contact

For questions or issues, please open an issue on GitHub or contact your.email@example.com.
