TXpredict: predicting microbial transcriptome using genome sequence

We present TXpredict, a transcriptome prediction tool that generalizes to novel microbial genomes. By leveraging information learned by a large protein language model (ESM2), TXpredict achieves average Spearman correlations of 0.53 and 0.62 when predicting gene expression for new bacterial and fungal genomes, respectively. We further extend this framework to predict transcriptomes for 2,685 additional microbial genomes spanning 1,744 genera, a large proportion of which remain uncharacterized at the transcriptional level. Our analysis highlights conserved and divergent transcriptional programs across understudied genera, providing a resource for uncovering microbial adaptation strategies and metabolic potential across the tree of life.

Installation

Python package dependencies:

torch 2.0.1
pandas 2.2.0
seaborn 0.13.2
biopython 1.81

We recommend using Conda to install the dependencies. For convenience, we provide a Conda environment file (env.yml) with package versions that are compatible with the current version of the program. The Conda environment can be set up with the following commands:

Clone this repository:

  git clone https://github.com/lingxusb/TXpredict.git
  cd TXpredict

Create and activate the Conda environment:

  conda env create -f env.yml
  conda activate TXpredict
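
After activating the environment, it can be useful to confirm that the installed packages match the versions listed above. The following is a minimal sketch using only the Python standard library; the helper name check_versions is hypothetical and is not part of TXpredict.

```python
from importlib import metadata

# Versions listed in the "Python package dependencies" section of this README.
EXPECTED = {
    "torch": "2.0.1",
    "pandas": "2.2.0",
    "seaborn": "0.13.2",
    "biopython": "1.81",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        # Local build tags such as "2.0.1+cu118" still count as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions().items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package is installed at the expected version.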

Command lines

Embedding generation

The embeddings.py script:

Traverses subfolders looking for .fna/.fasta + .gff/.gtf pairs with matching base names
Extracts protein sequences from these pairs
Computes ESM-2 embeddings for proteins below the specified length threshold
Generates metadata (normalized length + amino acid proportions)
Saves the results while preserving the input folder structure
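
The pairing step described above can be sketched as follows. This is an illustration of matching files by base name within each subfolder, not the code used in embeddings.py; the function name find_pairs and the exact matching rules (case-insensitive extensions, one pair per base name) are assumptions.

```python
from pathlib import Path

FASTA_EXTS = {".fna", ".fasta"}
ANNOT_EXTS = {".gff", ".gtf"}


def find_pairs(input_dir):
    """Return (fasta_path, annotation_path) pairs that share a folder and base name."""
    fastas, annots = {}, {}
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        key = (path.parent, path.stem)  # same folder + same base name
        ext = path.suffix.lower()
        if ext in FASTA_EXTS:
            fastas[key] = path
        elif ext in ANNOT_EXTS:
            annots[key] = path
    # Keep only base names that have both a sequence and an annotation file.
    return [(fastas[k], annots[k]) for k in sorted(fastas) if k in annots]
```

A sequence file without a matching annotation file (or vice versa) is simply skipped in this sketch.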

Usage

python embeddings.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-l LENGTH_THRESHOLD]

Arguments

Argument	Description
`-i`, `--input_dir`	Path to input directory containing subfolders with FASTA/GFF pairs (required)
`-o`, `--output_dir`	Path where output files will be saved (required)
`-l`, `--length_threshold`	Maximum protein length to include (default: 1500)

Output Files

For each processed FASTA/GFF pair, the script generates:

*_embeddings.txt: 2D array of protein embeddings
*_metadata.txt: 2D array of metadata (normalized length + 20 AA proportions)
*_genename.txt: 1D list of gene names
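
To make the metadata layout concrete, here is a minimal sketch of one 21-value metadata row (normalized length followed by 20 amino acid proportions). Two details are assumptions, since the README does not state them: the length is normalized by dividing by the length threshold, and the amino acids are ordered alphabetically by one-letter code. The function name protein_metadata is hypothetical.

```python
# The 20 standard amino acids, alphabetical by one-letter code (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def protein_metadata(seq, length_threshold=1500):
    """Return [normalized_length, 20 amino acid proportions] for one protein."""
    seq = seq.upper().rstrip("*")  # drop a trailing stop-codon marker if present
    n = len(seq)
    if n == 0:
        raise ValueError("empty protein sequence")
    # Assumption: length is scaled by the threshold so values fall in (0, 1].
    normalized_length = n / length_threshold
    proportions = [seq.count(aa) / n for aa in AMINO_ACIDS]
    return [normalized_length] + proportions
```

Non-standard residues such as X are counted in the length but contribute to none of the 20 proportions in this sketch.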

Model prediction

The predict.py script:

Finds matching sets of embedding, metadata, and gene name files in the input directory
Loads and combines the data (embeddings + metadata)
Applies a pre-trained model to make predictions
Saves the results as CSV files with gene names and their corresponding predictions
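
The "loads and combines" step can be sketched as a row-wise concatenation of each protein's embedding with its metadata. The file format is an assumption here (whitespace-delimited numbers, one protein per row), and plain Python lists are used only to keep the sketch dependency-free; with NumPy, numpy.loadtxt and numpy.hstack would be the natural equivalents. The function names are hypothetical.

```python
def load_matrix(path):
    """Read a whitespace-delimited 2D numeric text file into a list of rows."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]


def combine_features(embeddings, metadata):
    """Append each protein's metadata row to its embedding row."""
    if len(embeddings) != len(metadata):
        raise ValueError(
            f"row mismatch: {len(embeddings)} embeddings vs {len(metadata)} metadata rows"
        )
    return [emb + meta for emb, meta in zip(embeddings, metadata)]
```

Checking the row counts before concatenating catches the case where the embedding and metadata files come from different runs.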

Usage

python predict.py --input_dir INPUT_DIRECTORY --model_dir MODEL_DIRECTORY --output_dir OUTPUT_DIRECTORY

Arguments

Argument	Description
`-i`, `--input_dir`	Directory containing the input files (required)
`-m`, `--model_dir`	Directory containing the trained model file (required)
`-o`, `--output_dir`	Directory where prediction results will be saved (required)

Input Files

The script expects sets of three files with matching names in the input directory:

{NAME}_embeddings.txt: File containing protein embeddings
{NAME}_metadata.txt: File containing metadata features
{NAME}_genename.txt: File containing gene names

Output

For each processed set of input files, the script generates:

{NAME}_predictions.csv: A CSV file with two columns:
- Gene_Name: The name of the gene
- Prediction: The model's prediction value
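
As a usage illustration, a predictions file with the two columns above can be read back and ranked with the standard library. The function name top_predictions and the gene names in the usage note are hypothetical; treating larger values as higher predicted expression is also an assumption.

```python
import csv


def top_predictions(handle, n=10):
    """Return the n (gene_name, prediction) pairs with the largest predictions."""
    rows = [
        (row["Gene_Name"], float(row["Prediction"]))
        for row in csv.DictReader(handle)
    ]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)[:n]
```

For example, open a {NAME}_predictions.csv file with open(path, newline="") and pass the handle to top_predictions to list the highest-scoring genes.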

Jupyter notebooks

Data preprocessing

01_data_preprocessing.ipynb handles the preprocessing steps needed to prepare data for the TXpredict model. The notebook performs three main tasks:

Calculating normalized gene expression - Processes RNA-seq count data to compute TPM (Transcripts Per Million) values with z-score normalization.
Generating ESM embeddings - Uses the ESM-2 model to create protein embeddings from genome annotation files.
Preparing training data - Combines embedding data with expression data to create the final training dataset.
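
The expression normalization in the first task can be sketched as below. TPM follows its standard definition (length-normalized counts rescaled to sum to one million). The log transform (log2 with a pseudocount of 1) and the choice to z-score across genes within one sample are assumptions suggested by the output file name, not details confirmed by the notebook.

```python
import math
from statistics import mean, pstdev


def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalized counts scaled to sum to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


def zscore_log_tpm(tpm_values, pseudocount=1.0):
    """Log-transform TPM values (assumed log2(TPM + 1)), then z-score across genes."""
    logs = [math.log2(value + pseudocount) for value in tpm_values]
    mu, sigma = mean(logs), pstdev(logs)
    if sigma == 0:
        raise ValueError("all genes have identical expression; z-score undefined")
    return [(x - mu) / sigma for x in logs]
```

After this transformation the values for one sample have mean 0 and unit standard deviation, which puts genomes with different sequencing depths on a comparable scale.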

The notebook produces several output files including:

{strain}_filtered_log_tpm.csv - Normalized expression data
{strain}_filtered_embeddings.txt - Protein embeddings for model input
{strain}_filtered_meta.txt - Metadata for each protein

Model training

02_model_training.ipynb demonstrates how to train the TXpredict model for gene expression prediction. The notebook covers the complete workflow:

Data loading - Loads preprocessed embeddings and metadata from the previous preprocessing step
Model definition - Implements the model to learn from sequence embeddings
Training process - Trains the model with evaluation metrics
Model saving - Saves the trained model for later use in prediction
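
The headline metric reported for TXpredict is the Spearman correlation between predicted and measured expression. The following dependency-free sketch shows how that metric is defined (Pearson correlation of ranks, with ties given their average rank); it is not taken from the notebook, and in practice scipy.stats.spearmanr computes the same quantity.

```python
import math
from statistics import mean


def rank(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, observed):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(predicted), rank(observed))
```

Because it depends only on ranks, this metric rewards getting the ordering of genes right rather than matching absolute expression values.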

Colab notebooks

We provide Colab notebooks for transcriptome prediction in the web browser, including a separate notebook for fungal transcriptome prediction. Please also see our Colab instructions (Colab_instruction.md).

The only required inputs are a genome sequence file (.fna or .fasta) and an annotation file (.gtf, .gff or .gff3). Please see our example data (example_data) for reference.
Please connect to a GPU instance (e.g. T4: Runtime -> Change runtime type -> T4 GPU).
Predicting the transcriptome for a genome with about 4,000 genes takes roughly 20 minutes.

Trained models and datasets

Our transcriptome prediction models are available from Hugging Face.

TXpredictDB can be accessed from Hugging Face.

Acknowledgement

We deeply appreciate the experimental work and datasets that made this project possible (see Acknowledgement.md).
