GenEC (Generic Extraction & Comparison) is a Python-based tool for extracting structured data from files or folders. It provides a single, general-purpose extraction framework that you tailor through configuration parameters.
With presets and preset lists, you can easily repeat your extraction methods on single files or entire directories. Beyond extraction, GenEC can also compare the extracted data against reference files or folders to highlight differences.
Designed for users of all technical levels, GenEC supports both manual workflows and automated pipelines, making data analysis straightforward and accessible.
| Getting Started | Workflows | Text Filters | Configuration |
|---|---|---|---|
| Documentation | Basic | Regex | Output Formats |
| Setup Guide | Preset | Regex-list | Presets |
| | Preset-list | Positional | Preset-lists |

GenEC requires Python 3.9 or higher.

For execution:

    pip install uv
    uv sync
    uv run python GenEC/main.py --help

For developing:

    uv sync --group dev   # include dev packages
    uv sync --group dist  # include distribution packages

GenEC supports three workflow commands with different automation levels and use cases, as well as three filter types:
Workflows
- basic - Interactive configuration at runtime. Perfect for learning and experimentation.
- preset - YAML-based automation for single files. Ideal for repeated analysis tasks.
- preset-list - Batch processing with multiple presets. Best for comprehensive analysis workflows.
Filter types
- Regex - Pattern-based matching using a single regex.
- Regex-list - Complex pattern-based matching using more than one regex.
- Positional - Position-based matching using line numbers and positions on the line itself.
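
To make the difference between pattern-based and position-based matching concrete, here is a minimal Python sketch. It is not GenEC's implementation: the function names, the sample text, and the reading of "positions on the line" as character offsets are all assumptions made for illustration. Regex-list is left out because its exact semantics are not described in this section.

```python
import re

# Hypothetical sample input, one table-like row per line.
SAMPLE = """| Alpha | 10 |
| Beta | 20 |
| Gamma | 30 |"""


def regex_filter(lines, pattern):
    """Return the first capture group of every line that matches `pattern`."""
    compiled = re.compile(pattern)
    results = []
    for line in lines:
        match = compiled.search(line)
        if match:
            results.append(match.group(1))
    return results


def positional_filter(lines, line_number, start, end):
    """Return characters [start, end) of the 1-based `line_number`.

    Assumption: a position is a character offset on the line.
    """
    return lines[line_number - 1][start:end]


lines = SAMPLE.splitlines()
names = regex_filter(lines, r"\| ([A-Za-z]+) \|")
second_value = positional_filter(lines, line_number=2, start=9, end=11)
print(names)
print(second_value)
```

The regex variant finds a value wherever it appears on a line, while the positional variant relies on the layout staying fixed. That is the usual trade-off between the two approaches.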

| Argument | Short | Required | Description |
|---|---|---|---|
| --source | -s | Yes | Path to the source for data extraction. |
| --reference | -r | No | Path to the reference for comparison. |
| --output-directory | -o | No | Directory to save output files (terminal-only by default). |
| --output-types | -t | No | List of output file types to generate. Choices: csv, json, txt, yaml. Note that multiple can be selected. |
| --only-show-differences | | No | When comparing source and reference, only show elements with non-zero differences. |

The --source and --reference arguments accept file paths for the basic and preset workflows, and directory paths for the preset-list workflow.
--output-directory and --output-types must be used together. Without them, results are displayed in the terminal only.
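
The rule that the two output options only work as a pair can be sketched with the standard `argparse` module. This is an illustrative reconstruction from the table above, not GenEC's actual parser: the program name `genec-sketch` and the helper names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the common arguments listed in the table."""
    parser = argparse.ArgumentParser(prog="genec-sketch")  # hypothetical name
    parser.add_argument("-s", "--source", required=True)
    parser.add_argument("-r", "--reference")
    parser.add_argument("-o", "--output-directory")
    parser.add_argument("-t", "--output-types", nargs="+",
                        choices=["csv", "json", "txt", "yaml"])
    parser.add_argument("--only-show-differences", action="store_true")
    return parser


def parse_common_args(argv):
    """Parse `argv` and reject -o without -t (or -t without -o)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if bool(args.output_directory) != bool(args.output_types):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--output-directory and --output-types must be used together")
    return args


args = parse_common_args(["-s", "file1.txt", "-o", "out", "-t", "csv", "json"])
print(args.output_types)
```

Passing only `-o out` to `parse_common_args` would end in a usage error, which mirrors the pairing requirement described above.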

| Workflow | Argument | Short | Required | Description |
|---|---|---|---|---|
| basic | (none additional) | | | Interactive configuration. |
| preset | --preset | -p | Yes | YAML preset reference. |
| | --presets-directory | -d | No | Directory containing preset YAML files (default: GenEC/presets/). |
| preset-list | --preset-list | -l | Yes | Batch processing configuration. |
| | --presets-directory | -d | No | Directory containing preset YAML files (default: GenEC/presets/). |
| | --target-variables | -v | No | Key-value pairs (key=value) for variable substitution. Can be specified multiple times. |
| | --print-results | | No | Print results to CLI (disabled by default when output files are specified for performance). |

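
As a rough illustration of how `key=value` pairs passed via --target-variables could be turned into substitutions, consider the sketch below. The function names and the `{key}` placeholder syntax are assumptions for this example. The placeholder format GenEC actually expects in preset-list files is not shown in this section.

```python
def parse_target_variables(pairs):
    """Turn ['key=value', ...] into a dict, rejecting malformed entries."""
    variables = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        variables[key] = value
    return variables


def substitute(template, variables):
    """Replace each {key} placeholder (assumed syntax) with its value."""
    return template.format(**variables)


variables = parse_target_variables(["var1=file1", "var2=file2"])
target = substitute("data/{var1}.txt", variables)
print(target)
```

Using `str.partition` rather than `str.split` keeps any additional `=` characters inside the value intact.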

    uv run python GenEC/main.py basic -s <source_file> [options]
    uv run python GenEC/main.py basic -s <source_file> -r <reference_file> -o <output_directory> -t txt csv json yaml

    uv run python GenEC/main.py preset -s <source_file> -p <file_name_without_extension/preset_name> -d <presets_directory> [options]

    uv run python GenEC/main.py preset-list -s <source_directory> -l <preset_list_file> -d <presets_directory> [options]
    uv run python GenEC/main.py preset-list -s <source_directory> -l <preset_list_file> -d <presets_directory> -v myvar1=value1 myvar2=value2

    uv run python GenEC/main.py basic -s docs/demos/quick_start/source/data/file1.txt -r docs/demos/quick_start/source/data/file2.txt

Output files (-o docs/demos/quick_start/basic_output -t txt csv json yaml)

    uv run python GenEC/main.py preset -s docs/demos/quick_start/source/data/file1.txt -r docs/demos/quick_start/source/data/file2.txt -p preset_config_B/preset_code_value -d docs/demos/quick_start/presets/

Output files (-o docs/demos/quick_start/preset_output -t txt csv json yaml)

    uv run python GenEC/main.py preset-list -s docs/demos/quick_start/source -r docs/demos/quick_start/reference/ -l preset-list_config -d docs/demos/quick_start/presets/ -v var1=file1 var2=file2 var3=file3

Output files (-o docs/demos/quick_start/preset-list_output -t txt csv json yaml)

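
When a reference is passed with `-r`, the extracted elements of source and reference are compared. One plausible way to picture this, including the effect of --only-show-differences, is a count-based comparison. The sketch below is an assumption about the general idea, not GenEC's code. The function name, the result layout, and the sample data are hypothetical.

```python
from collections import Counter


def compare(source_items, reference_items, only_show_differences=False):
    """Count each extracted element in both inputs and report the difference."""
    source_counts = Counter(source_items)
    reference_counts = Counter(reference_items)
    rows = {}
    for element in sorted(set(source_counts) | set(reference_counts)):
        # A Counter returns 0 for elements it has never seen.
        difference = source_counts[element] - reference_counts[element]
        if only_show_differences and difference == 0:
            continue
        rows[element] = {
            "source": source_counts[element],
            "reference": reference_counts[element],
            "difference": difference,
        }
    return rows


source = ["Alpha", "Beta", "Beta", "Gamma"]
reference = ["Alpha", "Beta", "Delta"]
diff = compare(source, reference, only_show_differences=True)
for element, row in diff.items():
    print(element, row)
```

Here `Alpha` appears once on each side, so it is dropped when only differences are requested. The other three elements remain.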
GenEC offers powerful configuration through YAML preset files that define extraction and comparison strategies. This enables consistent, repeatable analysis workflows.
Basic preset structure:

    preset_name:
      cluster_filter: '\n'                 # How to split text into processable chunks
      text_filter_type: 'Regex'            # Filter method: 'Regex', 'Regex-list', 'Positional'
      text_filter: '\| ([A-Za-z]+) \|'     # Extraction pattern (varies by filter type)
      should_slice_clusters: false         # Enable/disable cluster range selection

GenEC supports advanced features including:
- Variable substitution - Dynamic target paths in preset-lists
- Cluster slicing - Process specific sections of large files
- YAML inheritance - Reusable configuration templates
- Multiple output formats - JSON, CSV, TXT, YAML
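
To show how the preset fields above might interact, the following sketch applies a preset-like dictionary to a short text: split into clusters, optionally slice the cluster range, then extract with the regex. It is a simplified illustration rather than GenEC's implementation. It assumes the `'\n'` in the preset denotes a newline, and the `slice_start` and `slice_end` keys are hypothetical names invented for this example.

```python
import re

# Mirrors the fields of the YAML preset above; the two slice keys are hypothetical.
preset = {
    "cluster_filter": "\n",
    "text_filter_type": "Regex",
    "text_filter": r"\| ([A-Za-z]+) \|",
    "should_slice_clusters": True,
    "slice_start": 1,  # hypothetical key: first cluster to keep (0-based)
    "slice_end": 3,    # hypothetical key: stop before this cluster
}


def apply_preset(text, preset):
    """Split `text` into clusters, optionally slice them, then extract matches."""
    clusters = text.split(preset["cluster_filter"])
    if preset["should_slice_clusters"]:
        clusters = clusters[preset["slice_start"]:preset["slice_end"]]
    pattern = re.compile(preset["text_filter"])
    extracted = []
    for cluster in clusters:
        # With one capture group, findall returns the captured strings.
        extracted.extend(pattern.findall(cluster))
    return extracted


text = "| Header |\n| Alpha |\n| Beta |\n| Footer |"
result = apply_preset(text, preset)
print(result)
```

Slicing before extraction is what lets a preset skip headers or footers in large files without complicating the pattern itself.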
Prerequisites: Ensure GenEC is properly installed (see Setup and Installation).
Run the test suite from the root directory. This requires the dev packages to be installed.

    uv run pytest
    uv run pytest --cov=. --cov-branch
    uv run pytest -m system  # Runs system-level tests
    uv run pytest -m unit    # Runs unit tests
    uv run pytest --count 10

    uv run flake8                                             # Code style and formatting
    uv run mypy .                                             # Type checking
    uv run pylint GenEC --score=yes                           # Production code linting (strict)
    uv run pylint tests --rcfile=tests/.pylintrc --score=yes  # Test code linting (relaxed)

    uv run pre-commit                  # Run pre-commit hooks on staged files
    uv run pre-commit run --all-files  # Run pre-commit hooks on all files

Copyright [2025] [Remy Kroese]
Licensed under the Apache License, Version 2.0. See the LICENSE file for details.




