This project implements a Python-based NLP workflow that automatically extracts key information from Large Language Model (LLM) research papers and populates the ORKG comparison "Generative AI Model Landscape".
The pipeline supports both API-based extraction (using KISSKI Chat AI API) and HPC cluster deployment (using GPU-based transformers on GWDG Grete).
- Automated paper fetching from ArXiv
- PDF parsing and text extraction with fallback mechanisms
- Dual extraction modes:
  - API-based using KISSKI Chat AI API (free for thesis work)
  - GPU-based using Hugging Face transformers on HPC cluster
- Structured data mapping to ORKG templates
- Automatic ORKG updates with duplicate detection
- Multi-model extraction from single papers
- Batch processing support for HPC environments
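These features correspond to the stages orchestrated by src/pipeline.py (see the project structure below). The snippet that follows is a purely illustrative, self-contained sketch of that flow with stubbed stages; none of the function names or bodies reflect the project's actual API, whose documented entry point is ExtractionPipeline.process_paper (see the Usage section).

```python
# Illustrative sketch of the extraction flow; every stage is a stub.
# The real implementations live in the src/ modules listed below.

def fetch_from_arxiv(arxiv_id: str) -> str:
    # paper_fetcher.py: download the PDF and return its local path (stubbed)
    return f"data/papers/{arxiv_id}.pdf"

def parse_pdf(pdf_path: str) -> str:
    # pdf_parser.py: extract text, with fallback mechanisms (stubbed)
    return f"(full text of {pdf_path})"

def extract_models(text: str) -> list[dict]:
    # llm_extractor*.py: KISSKI API or GPU-based transformers (stubbed)
    return [{"name": "LLaMA", "parameters": "7B-65B"}]

def map_and_upload(model: dict) -> None:
    # template_mapper.py + comparison_updater.py: map to the ORKG template
    # and append to the comparison, skipping duplicates (stubbed)
    print(f"would upload: {model}")

def process_paper(arxiv_id: str) -> None:
    text = parse_pdf(fetch_from_arxiv(arxiv_id))
    for model in extract_models(text):  # a single paper can describe several models
        map_and_upload(model)

if __name__ == "__main__":
    process_paper("2302.13971")
```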
Bachelor-Arbeit-NLP/
├── .github/
│   └── workflows/
│       └── ci.yml                         # GitHub Actions CI/CD
├── docs/
│   ├── deployment/
│   │   ├── grete-setup.md                 # HPC cluster setup guide
│   │   └── verify-jobs.md                 # Job verification guide
│   ├── troubleshooting/
│   │   ├── phi3-cache-issues.md           # Phi-3 troubleshooting
│   │   └── distilgpt2-issues.md           # DistilGPT2 troubleshooting
│   └── archive/
│       └── cleanup-summary-2025-12.md
├── grete/                                 # HPC cluster deployment
│   ├── extraction/
│   │   ├── grete_extract_paper.py
│   │   ├── grete_extract_from_url.py
│   │   └── grete_extract_paper_distilgpt2.py
│   ├── jobs/
│   │   ├── grete_extract_job.sh           # SLURM single job
│   │   ├── grete_extract_batch.sh         # SLURM batch array
│   │   └── grete_extract_url_job.sh
│   └── README.md                          # Grete-specific documentation
├── src/                                   # Core pipeline
│   ├── __init__.py
│   ├── comparison_updater.py              # Update ORKG comparisons
│   ├── llm_extractor.py                   # KISSKI API extraction
│   ├── llm_extractor_transformers.py      # GPU-based extraction
│   ├── orkg_client.py                     # ORKG API wrapper
│   ├── orkg_manager.py                    # ORKG management
│   ├── paper_fetcher.py                   # ArXiv paper fetching
│   ├── pdf_parser.py                      # PDF text extraction
│   ├── pipeline.py                        # Main orchestration
│   └── template_mapper.py                 # ORKG template mapping
├── scripts/
│   ├── append_to_paper.py                 # Production utilities
│   ├── sandbox_upload.py
│   └── debug/                             # Debug/testing scripts
│       ├── add_to_orkg_manual.py
│       ├── export_to_csv.py
│       └── force_new_paper_upload.py
├── tests/                                 # Unit tests
│   ├── __init__.py
│   ├── test_llm_extractor.py
│   ├── test_orkg_append.py
│   ├── test_orkg_client.py
│   ├── test_pdf_parser.py
│   └── test_pipeline.py
├── examples/
│   └── example_usage.py                   # Usage examples
├── data/
│   ├── papers/                            # Downloaded PDFs
│   ├── extracted/                         # Extracted JSON data
│   └── logs/                              # Processing logs
├── notebooks/
│   └── validate_extraction.ipynb          # Validation notebook
├── config/
│   └── config.yaml                        # Configuration
├── .env.example                           # Environment variables template
├── .gitignore
├── LICENSE                                # MIT License
├── pyproject.toml                         # Modern Python packaging
├── README.md
├── requirements.txt                       # Production dependencies
└── requirements-dev.txt                   # Development dependencies
- Clone the repository
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Copy .env.example to .env and fill in your credentials:

cp .env.example .env
Required variables:
ORKG_EMAIL=your.email@example.com
ORKG_PASSWORD=your_password_here
ORKG_HOST=sandbox # or 'production'
KISSKI_API_KEY=your_kisski_api_key_here
KISSKI_API_ENDPOINT=https://kisski.de/api/chat

Customize the extraction pipeline in config/config.yaml:
- ORKG settings: host, template ID, comparison ID
- KISSKI API: endpoint, model, temperature, max tokens, rate limiting
- ArXiv settings: categories, download directory
- Extraction fields: which fields to extract
- Logging: level, format, output files
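As a rough illustration of how these two configuration sources could be read together, here is a minimal sketch assuming python-dotenv and PyYAML (check requirements.txt for the actual dependencies); the YAML key names are assumptions, not the real layout of config.yaml.

```python
# Illustrative only: read credentials from .env and settings from config/config.yaml.
# Assumes python-dotenv and PyYAML; the project's actual loading code lives in src/.
import os

import yaml
from dotenv import load_dotenv

load_dotenv()  # picks up .env in the working directory

orkg_email = os.environ["ORKG_EMAIL"]
orkg_password = os.environ["ORKG_PASSWORD"]
kisski_api_key = os.environ["KISSKI_API_KEY"]

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Hypothetical keys, e.g. config["orkg"]["comparison_id"] or
# config["kisski"]["temperature"]; see config/config.yaml for the real structure.
```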
from src.pipeline import ExtractionPipeline
pipeline = ExtractionPipeline()
pipeline.process_paper("2302.13971")  # ArXiv ID for Llama paper

Or from the command line:

python -m src.pipeline --arxiv-id 2302.13971

Run the tests:

pytest tests/ -v

Run tests with coverage:

pytest tests/ -v --cov=src --cov-report=html

Run a single test file:

pytest tests/test_pipeline.py -v

The project uses GitHub Actions for automated testing. See .github/workflows/ci.yml for the CI configuration.
Tests run automatically on:
- Push to main or develop branches
- Pull requests to main or develop
Uses the KISSKI Chat AI API for extraction. Fast and reliable, and free for thesis work.
Prerequisites: a KISSKI API key from your professor.
See Usage section above for examples.
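Because batch processing is listed as a feature, one simple way to process several papers in API mode is to loop over the documented ExtractionPipeline.process_paper call; the example IDs and the sleep interval below are illustrative, and the real rate limiting belongs in config.yaml.

```python
# Illustrative batch loop for API mode, using only the documented
# ExtractionPipeline.process_paper call from the Usage section.
import time

from src.pipeline import ExtractionPipeline

arxiv_ids = ["2302.13971", "2303.08774"]  # example IDs; replace with your own

pipeline = ExtractionPipeline()
for arxiv_id in arxiv_ids:
    pipeline.process_paper(arxiv_id)
    time.sleep(5)  # crude pause between papers; the real limit is configured in config.yaml
```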
Uses local Hugging Face transformers on the GPU. No API costs and no processing limits, but it requires HPC setup.
Full deployment guide: docs/deployment/grete-setup.md
Quick start:
- Upload code to Grete cluster
- Set up conda environment with PyTorch + transformers
- Submit SLURM jobs:
sbatch grete/jobs/grete_extract_job.sh 2302.13971
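To give a feel for what the GPU path does, here is a minimal transformers sketch; the model name, prompt, and generation settings are assumptions chosen for illustration (Phi-3 and DistilGPT2 are the models mentioned in the troubleshooting docs), and the project's actual GPU code lives in src/llm_extractor_transformers.py and grete/extraction/.

```python
# Illustrative GPU-mode sketch with Hugging Face transformers; not the
# project's actual extraction code (see src/llm_extractor_transformers.py).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",  # placeholder model choice
    device_map="auto",                          # run on the GPU allocated by SLURM
    trust_remote_code=True,
)

paper_text = "..."  # text extracted from the PDF by the parsing stage
prompt = (
    "Extract the model name and parameter count from this paper as JSON:\n"
    + paper_text[:4000]  # keep within the model's context window
)
result = generator(prompt, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"])
```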
Monitoring: docs/deployment/verify-jobs.md
Common issues and solutions:
- Phi-3 cache compatibility: docs/troubleshooting/phi3-cache-issues.md
- DistilGPT2 JSON errors: docs/troubleshooting/distilgpt2-issues.md
- Job verification: docs/deployment/verify-jobs.md
# Format code
black src/
# Lint code
flake8 src/
# Type checking
mypy src/
# Sort imports
isort src/

Key directories:

- src/: Core pipeline code
- tests/: Unit tests
- scripts/: Utility scripts
- grete/: HPC cluster deployment files
- docs/: Documentation
- examples/: Usage examples
- notebooks/: Jupyter notebooks for exploration
- Create a feature branch
- Make your changes
- Run tests: pytest tests/
- Run linters: black src/ && flake8 src/
- Submit a pull request
- Platform: https://orkg.org/
- Sandbox: https://sandbox.orkg.org/
- Python Client: https://orkg.readthedocs.io/en/latest/
- LLM Template: https://orkg.org/templates/R609825
- Target Comparison: https://orkg.org/comparisons/R1364660
- KISSKI Chat AI API: Provided by university for thesis work
- ArXiv API: https://info.arxiv.org/help/api/
- GWDG Grete: https://info.gwdg.de/docs/doku.php?id=en:services:application_services:high_performance_computing:grete
If you use this code for your research, please cite:
@thesis{bachelor2025llm,
title={Automated Knowledge Extraction for Large-Scale Generative AI Models Catalog},
author={Your Name},
year={2025},
school={Your University},
type={Bachelor's Thesis}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Developed as part of a Bachelor thesis project
- ORKG platform for knowledge graph infrastructure
- GWDG for providing HPC resources and KISSKI API access
- SAIA platform for Chat AI API services
For questions or issues, please open an issue on GitHub or contact your.email@example.com.