We present TXpredict, a transcriptome prediction tool that generalizes to novel microbial genomes. By leveraging information learned from a large protein language model (ESM2), TXpredict achieves average Spearman correlations of 0.53 and 0.62 in predicting gene expression for new bacterial and fungal genomes, respectively. We further extend this framework to predict transcriptomes for 2,685 additional microbial genomes spanning 1,744 genera, a large proportion of which remain uncharacterized at the transcriptional level. Our analysis highlights conserved and divergent transcriptional programs across understudied genera, providing a powerful resource for uncovering microbial adaptation strategies and metabolic potential across the tree of life.
Python package dependencies:
- torch 2.0.1
- pandas 2.2.0
- seaborn 0.13.2
- biopython 1.81
We recommend using Conda to install the required packages. For convenience, we provide a Conda environment file with package versions that are compatible with the current version of the program. The Conda environment can be set up with the following commands:
- Clone this repository:

      git clone https://github.com/lingxusb/TXpredict.git
      cd TXpredict

- Create and activate the Conda environment:

      conda env create -f env.yml
      conda activate TXpredict

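After activating the environment, it can be useful to confirm that the installed packages match the versions listed above. The snippet below is a minimal sketch using only the Python standard library. The `check_versions` helper and the `PINNED` table are hypothetical names introduced for illustration and are not part of the TXpredict repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the dependency list above.
PINNED = {
    "torch": "2.0.1",
    "pandas": "2.2.0",
    "seaborn": "0.13.2",
    "biopython": "1.81",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Local build tags such as "2.0.1+cu118" are treated as a match.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every listed package is present at the expected version.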
The embeddings.py script:
- Traverses subfolders looking for `.fna`/`.fasta` + `.gff`/`.gtf` pairs with matching base names
- Extracts protein sequences from these pairs
- Computes ESM-2 embeddings for proteins below the specified length threshold
- Generates metadata (normalized length + amino acid proportions)
- Saves the results while preserving the input folder structure
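The pairing step in the list above can be sketched as follows. This is an illustrative approximation, not the code in `embeddings.py`: the `find_pairs` function is a hypothetical helper, and it assumes that "matching base names" means identical file stems within the same subfolder.

```python
from pathlib import Path

SEQ_EXTS = {".fna", ".fasta"}  # genome sequence files
ANN_EXTS = {".gff", ".gtf"}    # annotation files


def find_pairs(input_dir):
    """Yield (sequence_path, annotation_path) for files sharing a base name.

    Assumption: only subfolders of input_dir are scanned, not input_dir itself.
    """
    root = Path(input_dir)
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        seqs = {f.stem: f for f in files if f.suffix.lower() in SEQ_EXTS}
        anns = {f.stem: f for f in files if f.suffix.lower() in ANN_EXTS}
        # Keep only base names that have both a sequence and an annotation.
        for base in sorted(seqs.keys() & anns.keys()):
            yield seqs[base], anns[base]
```

Files without a partner (for example a `.fasta` with no annotation) are silently skipped in this sketch.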
`python embeddings.py -i INPUT_DIRECTORY -o OUTPUT_DIRECTORY [-l LENGTH_THRESHOLD]`

| Argument | Description |
|---|---|
| `-i`, `--input_dir` | Path to input directory containing subfolders with FASTA/GFF pairs (required) |
| `-o`, `--output_dir` | Path where output files will be saved (required) |
| `-l`, `--length_threshold` | Maximum protein length to include (default: 1500) |

For each processed FASTA/GFF pair, the script generates:
- `*_embeddings.txt`: 2D array of protein embeddings
- `*_metadata.txt`: 2D array of metadata (normalized length + 20 AA proportions)
- `*_genename.txt`: 1D list of gene names
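To make the metadata layout concrete, the sketch below builds one 21-value row (normalized length followed by 20 amino acid proportions) for a single protein. The `metadata_features` function is hypothetical, and dividing the length by the length threshold is an assumption. The script may normalize differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def metadata_features(protein, length_threshold=1500):
    """Return [normalized_length, 20 amino acid proportions] for one protein."""
    seq = protein.upper()
    if not seq:
        raise ValueError("empty protein sequence")
    # Assumption: length is scaled by the threshold so values fall in (0, 1].
    norm_len = len(seq) / length_threshold
    # Fraction of the sequence made up by each standard amino acid.
    proportions = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    return [norm_len] + proportions


row = metadata_features("MKKA")
print(len(row))  # 21 values: 1 length feature + 20 proportions
```

Non-standard residues (such as `X`) are counted in the length but not in any proportion, so the proportions may sum to slightly less than 1 for such sequences.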
The predict.py script:
- Finds matching sets of embedding, metadata, and gene name files in the input directory
- Loads and combines the data (embeddings + metadata)
- Applies a pre-trained model to make predictions
- Saves the results as CSV files with gene names and their corresponding predictions
`python predict.py --input_dir INPUT_DIRECTORY --model_dir MODEL_DIRECTORY --output_dir OUTPUT_DIRECTORY`

| Argument | Description |
|---|---|
| `-i`, `--input_dir` | Directory containing the input files (required) |
| `-m`, `--model_dir` | Directory containing the trained model file (required) |
| `-o`, `--output_dir` | Directory where prediction results will be saved (required) |

The script expects sets of three files with matching names in the input directory:
- `{NAME}_embeddings.txt`: File containing protein embeddings
- `{NAME}_metadata.txt`: File containing metadata features
- `{NAME}_genename.txt`: File containing gene names
For each processed set of input files, the script generates:
- `{NAME}_predictions.csv`: A CSV file with two columns:
  - `Gene_Name`: The name of the gene
  - `Prediction`: The model's prediction value
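The data flow around the model can be sketched with the standard library alone. Both functions below are hypothetical helpers, not the code in `predict.py`, and the sketch assumes that "combining" embeddings and metadata means concatenating the two rows for each protein. The model call itself is omitted.

```python
import csv


def combine_features(embeddings, metadata):
    """Concatenate each protein's embedding row with its metadata row."""
    if len(embeddings) != len(metadata):
        raise ValueError("embeddings and metadata must have the same number of rows")
    return [list(e) + list(m) for e, m in zip(embeddings, metadata)]


def write_predictions(path, gene_names, predictions):
    """Write a two-column CSV with the header Gene_Name, Prediction."""
    if len(gene_names) != len(predictions):
        raise ValueError("need exactly one prediction per gene name")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Gene_Name", "Prediction"])
        writer.writerows(zip(gene_names, predictions))
```

Checking row counts up front catches mismatched input files before any predictions are written.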
01_data_preprocessing.ipynb handles the preprocessing steps needed to prepare data for the TXpredict model.
The notebook performs three main tasks:
- Calculating normalized gene expression - Processes RNA-seq count data to compute TPM (Transcripts Per Million) values with z-score normalization.
- Generating ESM embeddings - Uses the ESM-2 model to create protein embeddings from genome annotation files.
- Preparing training data - Combines embedding data with expression data to create the final training dataset.
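The expression normalization in the first task can be sketched as below. TPM is computed in the standard way (length-normalized counts rescaled to sum to one million). The log base, the pseudocount, and the use of the population standard deviation are assumptions made for illustration, and both function names are hypothetical. See the notebook for the exact procedure.

```python
import math
import statistics


def tpm(counts, lengths_bp):
    """Transcripts Per Million from raw read counts and gene lengths (bp)."""
    # Reads per kilobase for each gene.
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        raise ValueError("all counts are zero")
    return [rate / total * 1e6 for rate in rates]


def zscore_log_tpm(tpm_values, pseudocount=1.0):
    """Log-transform TPM values, then scale to zero mean and unit variance."""
    # Assumption: log10 with a pseudocount of 1 to avoid log(0).
    logged = [math.log10(v + pseudocount) for v in tpm_values]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        raise ValueError("expression values are constant")
    return [(v - mean) / sd for v in logged]
```

Because z-scoring is applied per sample, the resulting values describe relative expression within one transcriptome rather than absolute abundance.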
The notebook produces several output files including:
- `{strain}_filtered_log_tpm.csv` - Normalized expression data
- `{strain}_filtered_embeddings.txt` - Protein embeddings for model input
- `{strain}_filtered_meta.txt` - Metadata for each protein
02_model_training.ipynb demonstrates how to train the TXpredict model for gene expression prediction. The notebook covers the complete workflow:
- Data loading - Loads preprocessed embeddings and metadata from the previous preprocessing step
- Model definition - Implements the model to learn from sequence embeddings
- Training process - Trains the model with evaluation metrics
- Model saving - Saves the trained model for later use in prediction
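Model performance is reported as a Spearman correlation between predicted and measured expression. As a reference for how that metric behaves, here is a dependency-free sketch that computes it as the Pearson correlation of average ranks. The function names are hypothetical, and the notebook may rely on a library routine instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

Since only ranks matter, the metric is unaffected by any monotonic rescaling of the predictions, which suits z-score-normalized expression targets.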
We provide Colab notebooks for running transcriptome prediction in a web browser, including a separate notebook for fungal transcriptome prediction. Please also check our Colab instructions.
- The only required inputs are a genome sequence file (`.fna` or `.fasta`) and an annotation file (`.gtf`, `.gff`, or `.gff3`). Please check our example data.
- Please connect to a GPU instance (e.g. T4: Runtime -> Change runtime type -> T4 GPU).
- Predicting the transcriptome of a genome with ~4,000 genes takes about 20 minutes.
Our transcriptome prediction models are available from Hugging Face.
TXpredictDB can be accessed from Hugging Face.
We deeply appreciate the experimental work and datasets that made our work possible.