pepti_map is a tool for mapping peptide sequences to their possible genomic loci. It does so based only on the sequence information, so that no further peptide information besides the amino acid sequence and no annotation of the genome is required, enabling a mapping to personal genomes. pepti_map utilizes RNA-seq reads of the same sample to facilitate the mapping: Each peptide is first matched to RNA-seq reads, which are then assembled into longer contigs and aligned onto the genome.
To set up pepti_map, first install all dependencies listed in the environment_<os>.yml file for your operating system (environment_macos.yml or environment_linux.yml). We recommend using Mamba. For example:
$ mamba env create -f environment_linux.yml
pepti_map relies on the following tools being installed: Trinity, GMAP, and PoGo.
If you install the dependencies via the given environment_<os>.yml, both Trinity and GMAP should already be installed. To install PoGo, download the latest release from its GitHub page.
You then need to set the POGO_PATH environment variable in a .env file so that it points to the directory containing your PoGo installation, e.g.:
POGO_PATH=/Users/me/Tools/PoGo_v1.2.3/Linux
This is the only environment variable that needs to be set for pepti_map to work. You can, however, set additional environment variables. Below you will find a table listing all environment variables.
| Environment Variable Name | Usage |
|---|---|
| POGO_PATH | The path to the directory in which the PoGo installation is located. |
| IO_N_PROCESSES | The number of processes to use when generating the input files for PoGo. If not set, defaults to multiprocessing.cpu_count(). |
| TRINITY_USE_DOCKER | Whether to run a dockerized version of Trinity. Value must be True or False. If not set, defaults to False. If set to True, a dockerized version of Trinity must be installed on the system. |
| TRINITY_PATH | The path to the Trinity installation. If not given, expects Trinity to be executable from the working directory (e.g. by using an installation via a Mamba environment). |
| TRINITY_N_PROCESSES | The number of processes with which to run Trinity in parallel. If not set, defaults to multiprocessing.cpu_count() // TRINITY_N_CPUS (floor division). |
| TRINITY_N_CPUS | The number of CPUs to use for one Trinity process. Corresponds to the --CPU option of Trinity. If not set, defaults to 2. |
| TRINITY_MAX_MEM | The max memory to use for one Trinity process. Corresponds to the --max_memory option of Trinity. If not set, defaults to "1G". |
| GMAP_N_THREADS | The number of threads with which to run GMAP during the alignment. Corresponds to the -t option of gmap. If not set, defaults to multiprocessing.cpu_count(). |
| GMAP_BATCH_MODE | The batch mode in which to run GMAP during the alignment. Corresponds to the -B option of gmap. If not set, defaults to 2. |
| TEMP_DIR_PATH | The path to the folder in which temporary results are saved. If not set, defaults to ./temp. |
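To make the documented defaults concrete, the following sketch resolves them from a mapping of environment variables. It is not pepti_map's actual configuration code: the function name `resolve_settings` and the returned dictionary layout are assumptions made for illustration, and only the variable names and default values are taken from the table.

```python
import multiprocessing
import os


def resolve_settings(env=None):
    """Resolve the documented defaults (hypothetical helper, not part of pepti_map)."""
    env = os.environ if env is None else env
    cpu_count = multiprocessing.cpu_count()

    # TRINITY_N_CPUS is resolved first because TRINITY_N_PROCESSES depends on it.
    trinity_n_cpus = int(env.get("TRINITY_N_CPUS", 2))

    settings = {
        "POGO_PATH": env.get("POGO_PATH"),  # the only required variable
        "IO_N_PROCESSES": int(env.get("IO_N_PROCESSES", cpu_count)),
        "TRINITY_USE_DOCKER": env.get("TRINITY_USE_DOCKER", "False") == "True",
        "TRINITY_PATH": env.get("TRINITY_PATH"),  # None: Trinity expected to be executable as-is
        "TRINITY_N_CPUS": trinity_n_cpus,
        # Floor division, as stated in the table.
        "TRINITY_N_PROCESSES": int(
            env.get("TRINITY_N_PROCESSES", cpu_count // trinity_n_cpus)
        ),
        "TRINITY_MAX_MEM": env.get("TRINITY_MAX_MEM", "1G"),
        "GMAP_N_THREADS": int(env.get("GMAP_N_THREADS", cpu_count)),
        "GMAP_BATCH_MODE": int(env.get("GMAP_BATCH_MODE", 2)),
        "TEMP_DIR_PATH": env.get("TEMP_DIR_PATH", "./temp"),
    }
    if settings["POGO_PATH"] is None:
        raise ValueError("POGO_PATH must be set")
    return settings
```

Note that with floor division, a machine with fewer cores than TRINITY_N_CPUS would yield a default of 0 Trinity processes in this sketch, so setting TRINITY_N_PROCESSES explicitly may be advisable on small machines.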
To run pepti_map, first activate your Mamba environment.
The general usage of pepti_map is as follows:
$ (peptimap-env) python -m pepti_map.main [OPTIONS]
A minimal command would look like this:
$ (peptimap-env) python -m pepti_map.main -p path/to/peptide/file -r path/to/rna/reads/file -g path/to/genome/fasta
To run, pepti_map requires at least the -p and -r options, as well as one of the -g or -x options.
For details on these and other options, see the table below or run
$ (peptimap-env) python -m pepti_map.main --help
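If you want to launch pepti_map from a Python script, the required-option rule can be checked before the process is started. The sketch below is an illustration only: `build_command` is a hypothetical helper, not part of pepti_map, and it uses only the flags documented here (-p, -r, -g, -x).

```python
import sys


def build_command(peptide_file, rna_file, genome=None, gmap_index=None, extra_options=()):
    """Build the argument list for pepti_map (hypothetical helper for illustration)."""
    # At least one of -g / -x must be given in addition to -p and -r.
    if genome is None and gmap_index is None:
        raise ValueError("one of genome (-g) or gmap_index (-x) is required")

    command = [sys.executable, "-m", "pepti_map.main", "-p", peptide_file, "-r", rna_file]
    if gmap_index is not None:
        # The documentation states that -g is ignored when -x is set,
        # so this sketch passes only -x in that case.
        command += ["-x", gmap_index]
    else:
        command += ["-g", genome]
    command += list(extra_options)
    return command
```

The resulting list can be handed to `subprocess.run(command, check=True)` from within the activated environment.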
Overview of the options for running pepti_map:
| Option | Usage |
|---|---|
| -p / --peptide-file | The path to the peptide file (for format, see below). |
| -r / --rna-file | The path to the RNA-seq file. In case of paired-end sequencing, this file is expected to be in forward orientation. |
| -pa / --paired-end-file | The path to the second RNA-seq file in case of paired-end sequencing. This file is expected to be in reverse orientation. If none is given, the RNA-seq file given with the -r option is assumed to result from single-end sequencing. |
| -c / --cutoff | The position of the last base in the reads after which a cutoff should be performed (starting at 1). The cutoff is applied to all reads. If the value is equal to or smaller than 0, no cutoff is performed. To define different cutoff values for the RNA-seq files in case of paired-end sequencing, you can supply two cutoff values by using the -c option twice (e.g. -c 80 -c 60). The first value is used for the file supplied with -r, whereas the second value is used for the file supplied with -pa. (Default: -1) |
| -k / --kmer-length | The k-mer size used during the mapping of peptides to RNA-seq reads. As the RNA-seq reads are 3-frame translated for the mapping, the k-mer size refers to amino acids. (Default: 7) |
| -o / --output-dir | The path to the output directory for all generated files. (Default: ./) |
| -pi / --precompute-intersections | If used, the intersection sizes for the Jaccard Index calculation are precomputed during the matching phase. |
| -j / --jaccard-index-threshold | Sets of matched RNA-seq reads per peptide will only be merged together if their Jaccard Index has a value above the given threshold. (Default: 0.5) |
| -m / --merging-method | Which merging method to use for sets of matched RNA-seq reads. Must be one of agglomerative-clustering, full-matrix. (Default: full-matrix) |
| -cl / --min-contig-length | Sets the --min_contig_length option for Trinity during assembly. A value below 100 is not possible. (Default: 100) |
| -g / --genome | The path to the genome file(s) to align to. In case of multiple files, the paths must be separated by comma. |
| -x / --gmap-index | The path to an existing GMAP index that should be used instead of building a new one. If this option is set, the -g / --genome option is ignored. |
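The -j option relies on the Jaccard Index, which for two sets is the size of their intersection divided by the size of their union. The sketch below illustrates that criterion for two sets of read identifiers. It is not pepti_map's implementation: the function names are hypothetical, and treating "above the threshold" as a strict comparison is an assumption based on the wording of the option description.

```python
def jaccard_index(reads_a, reads_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of read identifiers."""
    a, b = set(reads_a), set(reads_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; define their index as 0 in this sketch.
        return 0.0
    return len(a & b) / len(union)


def should_merge(reads_a, reads_b, threshold=0.5):
    """Merge only if the index is strictly above the threshold (assumed semantics)."""
    return jaccard_index(reads_a, reads_b) > threshold
```

For example, the read sets {1, 2, 3} and {2, 3, 4} share two of four distinct reads, giving an index of 0.5, which under the strict comparison assumed here would not trigger a merge at the default threshold.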
The peptide file should contain a list of peptides as amino acid sequences, with one peptide sequence per line. Optionally, a line may also carry protein group information for its peptide, separated from the sequence by a tab. A header is not needed. A line may thus look as follows:
<peptide sequence>
or
<peptide sequence> <protein group information>
If the protein group information is given, peptides with the same protein group will be grouped together, with matches to the RNA-seq reads being allocated per group. If not given, each peptide is treated as a separate group.
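The peptide file format described above can be read with a few lines of Python. This is a minimal sketch, not pepti_map's own parser: `parse_peptide_lines` is a hypothetical name, and skipping blank lines is an assumption.

```python
def parse_peptide_lines(lines):
    """Parse peptide lines into (peptide, group) tuples.

    group is None when no protein group information is given, in which
    case the peptide would be treated as its own group.
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no peptide
        # Split on the first tab only; everything after it is group information.
        peptide, _, group = line.partition("\t")
        records.append((peptide.strip(), group.strip() or None))
    return records
```

A typical call would be `parse_peptide_lines(open("path/to/peptide/file"))`, since a file object iterates over its lines.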
The RNA-seq files should contain the RNA-seq reads of the same sample/subject as the peptide data in FASTQ or gzipped FASTQ format.
The genome files should contain the genomic sequences to align to in FASTA format.
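As an illustration of the accepted read formats, the following sketch iterates over the sequences of a plain or gzipped FASTQ file and optionally trims each read to the first `cutoff` bases, mirroring the description of the -c option. It is not pepti_map's reader: the function name is hypothetical, gzip detection by the `.gz` suffix is an assumption, and the sketch assumes standard four-line FASTQ records.

```python
import gzip


def read_fastq_sequences(path, cutoff=-1):
    """Yield read sequences from a FASTQ or gzipped FASTQ file (illustrative sketch)."""
    # Assumption: gzipped files are recognized by their ".gz" suffix.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a four-line FASTQ record, the second line holds the sequence.
            if line_number % 4 == 1:
                sequence = line.strip()
                # A cutoff of 0 or less means no trimming.
                yield sequence[:cutoff] if cutoff > 0 else sequence
```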
pepti_map will output a GTF (pepti_map_output.gtf) and a BED (pepti_map_output.bed) file containing the mappings of the peptides to genomic loci. Peptides that were not matched will not appear in these files. For further information on these formats, please see the PoGo documentation, as these outputs correspond directly to the respective PoGo output formats.
Additionally, pepti_map outputs a quantification file (peptide_read_quant.tsv), containing the group id assigned to each peptide, the number of reads that were matched to the peptide, and the number of reads that were matched to the group the peptide belongs to. This means that, for all peptides belonging to the same group, their individual read counts add up exactly to the read count for the group. If a peptide is too short (shorter than the k-mer length), it is assigned a group id of -1 and excluded from further calculations. The format per line is as follows, with the individual values being tab-separated:
<peptide sequence> <numeric group id> <number of matched reads for peptide> <number of matched reads per group>
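The quantification file can be summarized per group with the standard `csv` module. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the file is assumed to have no header line, and rows with a group id of -1 are skipped because those peptides are excluded from further calculations.

```python
import csv
from collections import defaultdict


def summarize_quantification(lines):
    """Return {group_id: (sum of peptide read counts, reported group read count)}."""
    summed = defaultdict(int)
    reported = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        _peptide, group_id, peptide_reads, group_reads = row
        group_id = int(group_id)
        if group_id == -1:
            continue  # peptide shorter than the k-mer length
        summed[group_id] += int(peptide_reads)
        reported[group_id] = int(group_reads)
    return {group: (summed[group], reported[group]) for group in summed}
```

Comparing the two numbers in each tuple is a quick way to check that the per-peptide counts add up to the per-group count in a given output file.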