
gobics/cocopye-database


CoCoPyE Database

This is the database repository for CoCoPyE. It is mainly used as a download source. If you are looking for the tool itself, see https://github.com/gobics/cocopye.

Database Structure

CoCoPyE can be used with Pfam versions 24 and 28, and it has to use a database that was built with the same Pfam version. The repository contains:

  - 24 and 28: the databases for the corresponding Pfam versions.
  - testdata: several sequences from the reference set that can be used with cocopye toolbox testrun.
  - version.txt: the current database version. It has to be incremented manually when a new release is published.
  - info.txt: information about the dataset source.
  - scripts: Python files that are used to build or update this database. They are not required for running CoCoPyE and are included here for completeness.

How to update this database

These are step-by-step instructions (for myself or future maintainers of this repository) on how to update the database.

  1. Create a new environment and install the requirements.
# Assuming you are in the repository root
cd scripts

# With venv...
python -m venv .venv
source .venv/bin/activate

# ...or conda (only use one)
conda create -n cocopye-database python
conda activate cocopye-database

pip install -r requirements.txt
  2. Download required metadata.
python metadata.py

Important: metadata.py uses ncbitax2lin. On the systems I have used, this tool crashes quite often. It can also fill up your RAM, so don't run it while other important processes are running. Restarting the system or switching to Python 3.8 (for example in a conda environment, if 3.8 isn't your system installation) seemed to help sometimes, but not always. You may simply have to retry until it works.

  3. Download reference sequences from RefSeq and filter them. (This step can take a long time, up to several days.)
python references.py

Note: references.py stores some intermediate results in intermediate_files. If the script crashes, you might be able to reuse some of them. (Try to determine which steps completed successfully and comment them out to continue from the point where it stopped.)

  4. Build the databases for Pfam versions 24 and 28.
cocopye database -i output/28/fasta -m output/28/metadata.csv -o output/28_db
cocopye --pfam24 database -i output/24/fasta -m output/24/metadata.csv -o output/24_db
  5. Copy all files from output/28_db into <repository_root>/28 and all files from output/24_db into <repository_root>/24, replacing the old files. (The build does not produce new versions of model_comp.pickle and model_cont.pickle. This is intended; there is no need to replace them.)

  6. Change the download date in info.txt to the current date (or the date when you started references.py, if it ran for multiple days) and increase the version number in version.txt (in most cases, increasing the minor version should be sufficient).

  7. Delete the intermediate files that were generated by the scripts and push the repository.

# Run from the repository root
rm -r scripts/intermediate_files scripts/output
git add .
git commit -m "..."
git push
  8. On GitHub: Create a new release with the new version number. Attach the repository contents as database.zip. Mark the new release as latest.

Important: CoCoPyE expects the database files and folders (such as 28, 24, version.txt, ...) to be at the top level of database.zip. After creating the archive, check that this is indeed the case.

  9. Run cocopye setup update-database to make sure everything went right.
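
The top-level layout requirement for database.zip mentioned in the release step can be checked before uploading the archive. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and the list of required entries only covers the examples named above (24, 28 and version.txt), so extend it if you want to check more files.

```python
import zipfile

# Entries that this README names as examples of what must sit at the archive root.
# This list is an assumption for illustration; extend it as needed.
REQUIRED_TOP_LEVEL = ("24", "28", "version.txt")


def top_level_entries(names):
    """Return the set of first path components of the given archive member names."""
    return {name.split("/", 1)[0] for name in names if name.strip("/")}


def missing_top_level(names, required=REQUIRED_TOP_LEVEL):
    """Return the required entries that are not present at the archive root."""
    present = top_level_entries(names)
    return [entry for entry in required if entry not in present]


def check_database_zip(path):
    """Open a zip archive and report which required top-level entries are missing."""
    with zipfile.ZipFile(path) as archive:
        return missing_top_level(archive.namelist())
```

Calling check_database_zip("database.zip") returns an empty list if the layout looks right. If the archive was created from the parent directory, so that everything is nested inside one extra folder, all required entries are reported as missing.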

General Database Build Process

The following section contains general information on how to build the database for a given Pfam version (the result corresponds to the contents of the 24 and 28 folders). It is primarily intended for people who want to build the database from their own data. If you just want to update this repository, it should be sufficient to follow the steps in the section above.

Input

You need a folder containing the sequences you want to use for the database, with one FASTA file per sequence. Depending on how you choose them, the sequences can be the same or different for the two Pfam versions. FASTA headers in the files are ignored. You also need a CSV file with metadata, with one row per input sequence. The required columns are sequence, superkingdom, phylum, class, order, family, genus and species, where sequence is the name of the corresponding FASTA file and the other columns are taxonomy data. All cells except sequence are allowed to be empty. However, the quality of the taxonomy prediction is directly related to the quality of the metadata, so it is recommended to avoid empty cells if possible.
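
Before starting a build, it can be useful to check the metadata file against the requirements above. The following is a rough sketch under some assumptions: the function name and the shape of the returned report are hypothetical, the CSV is assumed to be comma-separated with a header row, and the sequence column is compared against the FASTA file names exactly as given (including any extension). CoCoPyE itself may be stricter or more lenient.

```python
import csv
import io

# Column names as listed in this README; "sequence" must come first in this list
# because the remaining entries are treated as taxonomy columns below.
REQUIRED_COLUMNS = [
    "sequence", "superkingdom", "phylum", "class",
    "order", "family", "genus", "species",
]


def check_metadata(csv_text, fasta_names):
    """Compare metadata rows with a collection of FASTA file names.

    Returns a dict reporting missing columns, rows without a FASTA file,
    FASTA files without a row, and the number of empty taxonomy cells.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    report = {
        "missing_columns": [c for c in REQUIRED_COLUMNS if c not in header],
        "rows_without_file": [],
        "files_without_row": [],
        "empty_taxonomy_cells": 0,
    }
    if report["missing_columns"]:
        return report  # The rows cannot be checked without all columns.

    seen = set()
    for row in reader:
        sequence = (row["sequence"] or "").strip()
        seen.add(sequence)
        if sequence not in fasta_names:
            report["rows_without_file"].append(sequence)
        # Empty taxonomy cells are allowed, but they reduce prediction quality.
        report["empty_taxonomy_cells"] += sum(
            1 for column in REQUIRED_COLUMNS[1:] if not (row[column] or "").strip()
        )
    report["files_without_row"] = sorted(set(fasta_names) - seen)
    return report
```

In practice, csv_text would be the contents of your metadata file and fasta_names the result of listing the FASTA folder, for example with os.listdir.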

Building the database

The build process of the database is part of the CoCoPyE CLI. Just run cocopye database -i <folder with FASTA files> -m <metadata-file> -o <output folder>. If you want to build with Pfam 24, use cocopye --pfam24 database -i <folder with FASTA files> -m <metadata-file> -o <output folder>.
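
If you build for both Pfam versions from a script, the two invocations above can be wrapped in a small helper. This is only a sketch: the helper names are hypothetical, it uses nothing beyond the command line shown above, and it assumes that cocopye is installed and available on your PATH.

```python
import subprocess


def build_command(fasta_dir, metadata_csv, output_dir, pfam24=False):
    """Assemble the cocopye database command line described in this README."""
    command = ["cocopye"]
    if pfam24:
        # The global --pfam24 flag goes before the subcommand, as shown above.
        command.append("--pfam24")
    command += ["database", "-i", str(fasta_dir), "-m", str(metadata_csv), "-o", str(output_dir)]
    return command


def build_database(fasta_dir, metadata_csv, output_dir, pfam24=False):
    """Run the build and raise CalledProcessError if cocopye exits with an error."""
    subprocess.run(build_command(fasta_dir, metadata_csv, output_dir, pfam24), check=True)
```

For example, build_database("output/24/fasta", "output/24/metadata.csv", "output/24_db", pfam24=True) runs the Pfam 24 build used in the update instructions.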

Additional files

The build command automatically creates most of the required files. However, you still have to manually add the files for the machine learning method (most likely a neural network). They have to be named model_comp.pickle for the completeness predictor and model_cont.pickle for the contamination predictor. Additionally, you have to copy some FASTA files into a directory named testdata for use with the cocopye toolbox testrun command (I recommend simply using some sequences from your input set). You should also include a version.txt whose content can be parsed as a SemVer version, because CoCoPyE checks this file on startup.
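
To make sure that version.txt contains something that looks like a SemVer version, and to increase it for a new release, a small helper can be used. This is a sketch under assumptions: the function names are hypothetical, and the check only accepts the plain MAJOR.MINOR.PATCH form, which is stricter than full SemVer (no pre-release or build metadata). The parser that CoCoPyE actually uses may behave differently.

```python
import re

# Plain MAJOR.MINOR.PATCH without leading zeros; pre-release and build
# suffixes allowed by full SemVer are deliberately not handled here.
_VERSION_PATTERN = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def parse_version(text):
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = _VERSION_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(part) for part in match.groups())


def bump_minor(text):
    """Increase the minor version and reset the patch version to zero."""
    major, minor, _patch = parse_version(text)
    return f"{major}.{minor + 1}.0"
```

For example, bump_minor applied to the contents of a version.txt containing 1.2.3 gives 1.3.0, which can then be written back to the file.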
