This repository hosts an ETL pipeline designed for a toy dataset that mimics genomics data.
It transforms the dataset into an analysable format that enables easy querying for the discovery of overlaps between sequences.
When cleaning this data, a number of assumptions are made:
- The `id` column has no significance beyond being a database id
- Duplicate entries have no significance
- Sequence-type pairs are unique
- When there are two `start` events, the earliest `location` is true
- Events of type `unclear_read` signal that this sequence-type exists at this location
- If the start event is less than the end event, treat them in reverse
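
The cleaning assumptions above can be illustrated with a minimal sketch. This is not the repository's implementation: the row layout (dicts with `id`, `sequence`, `type`, `event`, `location` keys), the `clean_events` function, and the sample values are all hypothetical, chosen only to show how the rules might combine. The final rule about reversing start and end is left out, since how to apply it depends on the dataset's coordinate convention.

```python
from collections import defaultdict


def clean_events(rows):
    """Collapse raw event rows into one record per (sequence, type) pair.

    Hypothetical helper: the field names are assumptions, not the real schema.
    """
    # The id carries no meaning, so drop it; a set then removes duplicates.
    unique = {(r["sequence"], r["type"], r["event"], r["location"]) for r in rows}

    grouped = defaultdict(lambda: {"start": None, "end": None, "seen_at": set()})
    for sequence, type_, event, location in sorted(unique):
        record = grouped[(sequence, type_)]
        if event == "start":
            # With two start events, the earliest location wins.
            if record["start"] is None or location < record["start"]:
                record["start"] = location
        elif event == "end":
            # Assumes one end event per pair; if several exist, the largest
            # location is kept because rows are visited in sorted order.
            record["end"] = location
        elif event == "unclear_read":
            # An unclear read still shows the pair exists at this location.
            record["seen_at"].add(location)
    return dict(grouped)


# Made-up sample rows for illustration only.
rows = [
    {"id": 1, "sequence": "seqA", "type": "gene", "event": "start", "location": 120},
    {"id": 2, "sequence": "seqA", "type": "gene", "event": "start", "location": 100},
    {"id": 3, "sequence": "seqA", "type": "gene", "event": "end", "location": 200},
    {"id": 4, "sequence": "seqA", "type": "gene", "event": "end", "location": 200},
    {"id": 5, "sequence": "seqB", "type": "gene", "event": "unclear_read", "location": 150},
]
cleaned = clean_events(rows)
```

Here `cleaned[("seqA", "gene")]` ends up with start 100 and end 200, while `("seqB", "gene")` has no start or end but records a sighting at location 150.
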

To set up the project:
- Install Miniconda
- Create an environment: `conda create -n genomics_etl python=3.9 -y`
- Activate your new environment: `conda activate genomics_etl`
- Install this package in editable mode: `pip install -e .`

To run the ETL pipeline, including the queries that show overlaps between sequences, execute `python genomics_etl/__init__.py` from a terminal.
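
For readers unfamiliar with overlap queries, the idea can be sketched in a few lines of Python. This is an illustration only, not the query the pipeline runs: the `find_overlaps` function, the closed-interval convention, and the sample coordinates are assumptions made for the example.

```python
from itertools import combinations


def find_overlaps(intervals):
    """Return pairs of names whose closed intervals [start, end] share a location."""
    overlaps = []
    # Compare every pair once, in a stable (sorted) order.
    for (name_a, (start_a, end_a)), (name_b, (start_b, end_b)) in combinations(
        sorted(intervals.items()), 2
    ):
        # Two intervals overlap when each one starts before the other ends.
        if start_a <= end_b and start_b <= end_a:
            overlaps.append((name_a, name_b))
    return overlaps


# Made-up coordinates for illustration only.
intervals = {"seqA": (100, 200), "seqB": (150, 250), "seqC": (300, 400)}
overlapping_pairs = find_overlaps(intervals)
print(overlapping_pairs)  # [('seqA', 'seqB')]
```

The pairwise comparison is quadratic in the number of sequences, which is fine for a toy dataset; a larger dataset would usually sort by start position or use an interval index instead.
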
