MotvizPy - a protein motif discovery & visualization tool

This tool is a Python rendition of the Java-based tool Motviz1[1], published in 2011. At its core, MotvizPy uses Shannon entropy to calculate conservation from a multiple sequence alignment (MSA) file built from PSI-BLAST searches of the reference protein database. The Shannon entropy part of the algorithm is new and is an improvement on the earlier version of the tool.

Motivation

The Java-based tool relied only on the * marker at a given column of the alignment file to infer conservation. This tool introduces information theory and statistics to infer probabilistically which regions of the amino acid sequences are conserved. Shannon entropy is known[2] to be a good measure of conservation. This cuts down on the large number of false positives observed with the earlier tool.
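
As a minimal sketch of the idea (not the tool's actual code), the Shannon entropy of each alignment column can be computed as shown here. The function name `column_entropies` and the toy alignment are illustrative assumptions, and the standard library is used for brevity rather than numpy. A fully conserved column scores 0, and more variable columns score higher.

```python
import math
from collections import Counter


def column_entropies(aligned_seqs):
    """Return the Shannon entropy (in bits) of each alignment column.

    Hypothetical helper for illustration; assumes all sequences have the
    same length, as they would in a multiple sequence alignment.
    """
    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column)
        # H = sum(p * log2(1 / p)) over the residues seen in this column
        h = sum((n / total) * math.log2(total / n) for n in counts.values())
        entropies.append(h)
    return entropies


# Toy alignment: the first two columns are fully conserved, the last two are not.
toy_alignment = ["ACDE", "ACDF", "ACEG", "ACFH"]
print(column_entropies(toy_alignment))
```

Unlike a check for the * marker, which only distinguishes fully conserved columns from everything else, the entropy score also ranks partially conserved columns.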

After calculating the normalized data, a graph like the following is generated:

Goals

Here is the workflow for the tool:

Pass the PDB ID of a structure on the command line.
Use that PDB ID to retrieve both the structure and the sequence, and write both to a specified directory.
Take the sequence file and run a PSI-BLAST search against the refseq_protein.00 NCBI database stored locally on the computer. The output is an XML file.
The XML file is parsed and converted into a FASTA file, ready to be run through a Clustal Omega multiple sequence alignment.
This is the novel step in this tool: it introduces information theory into the algorithm. All the sequences are read into a 2D list, which is then converted to a numpy array. From there, Shannon entropy scores are calculated column by column, one for each amino acid position. The data are then normalized between -1 and 1. Local minima are found whose values are less than the mean of the distribution. Stretches of consecutive positions whose values stay below 1/4 of the mean are then isolated. Each such stretch is proposed as a potential motif and passed back as the return value.
Finally, the positions are written to a txt file that can then be executed in PyMOL. When run in PyMOL, the script automatically fetches the PDB file and highlights the regions of interest in red.
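
The last two steps of the workflow can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: the helper names (`normalize`, `low_entropy_stretches`, `pymol_commands`), the min-max scaling used to reach the range -1 to 1, the minimum stretch length, the use of the mean as the cutoff, and the placeholder PDB ID `1abc` are all hypothetical. `fetch` and `color` are standard PyMOL commands.

```python
def normalize(scores):
    """Min-max scale scores into [-1, 1] (an assumed normalization scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [2 * (s - lo) / (hi - lo) - 1 for s in scores]


def low_entropy_stretches(values, cutoff, min_len=3):
    """Return 1-based (start, end) runs where values stay below the cutoff."""
    stretches, start = [], None
    for i, v in enumerate(values):
        if v < cutoff and start is None:
            start = i  # a low-entropy run begins here
        elif v >= cutoff and start is not None:
            if i - start >= min_len:
                stretches.append((start + 1, i))
            start = None
    # Close a run that extends to the end of the sequence
    if start is not None and len(values) - start >= min_len:
        stretches.append((start + 1, len(values)))
    return stretches


def pymol_commands(pdb_id, stretches):
    """Build PyMOL commands that fetch a structure and color stretches red."""
    lines = [f"fetch {pdb_id}"]
    lines += [f"color red, resi {a}-{b}" for a, b in stretches]
    return lines


# Made-up per-position entropy scores, for illustration only
entropy = [2.0, 1.8, 0.1, 0.0, 0.2, 1.9, 2.1, 0.1, 0.0, 0.1, 0.0]
scaled = normalize(entropy)
cutoff = sum(scaled) / len(scaled)  # mean used as the cutoff in this sketch
motifs = low_entropy_stretches(scaled, cutoff)
print("\n".join(pymol_commands("1abc", motifs)))
```

The sketch assumes that alignment columns map one-to-one onto residue numbers in the structure. A real alignment contains gaps, so the positions would need to be mapped back to the reference sequence before they are written out.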

Footnotes:

[1]. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5054496/

[2]. https://www.sciencedirect.com/science/article/pii/S1567134811003236

This project will be developed further in the future, hopefully in Rust or C++.

