ExP Heatmap

A powerful Python package and command-line tool for visualizing multidimensional population genetics data through intuitive heatmaps.

ExP Heatmap specializes in displaying cross-population data, including differences, similarities, p-values, and other statistical parameters between multiple groups or populations. This tool enables efficient evaluation of millions of statistical values in a single, comprehensive visualization.

[Figure: ExP heatmap of LCT gene]

ExP heatmap of the human lactase (LCT) gene showing differences between 26 populations from the 1000 Genomes Project, displaying adjusted rank p-values for the cross-population extended haplotype homozygosity (XPEHH) selection test. To create your own LCT heatmap, follow the Quick Start guide.

Developed by the Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics of the Academy of Sciences of the Czech Republic

Features

  • Multiple Statistical Tests: Support for XPEHH, XP-NSL, Delta Tajima's D, and Hudson's Fst
  • Flexible Input Formats: Work with VCF files, pre-computed statistics, or ready-to-plot p-values
  • Command-Line Interface: Easy-to-use CLI for standard workflows
  • Python API: Full programmatic control for custom analyses
  • Efficient Processing: Zarr-based data storage for fast computation
  • Customizable Visualization: Multiple color schemes and display options

Installation

Requirements

  • Python ≥ 3.8
  • vcftools (for genomic data preparation - optional if using preprocessed data)

Install from PyPI

pip install exp_heatmap

Install from GitHub (latest version)

pip install git+https://github.com/bioinfocz/exp_heatmap.git

Quick Start

Get started with ExP Heatmap in three simple steps:

Step 1: Download the precomputed results of the cross-population extended haplotype homozygosity (XPEHH) selection test for part of human chromosome 2 (1000 Genomes Project data), either directly from Zenodo or with this command:

wget "https://zenodo.org/records/16364351/files/chr2_output.tar.gz"

Step 2: Extract the downloaded archive into your working directory:

tar -xzf chr2_output.tar.gz

Step 3: Run the exp_heatmap plot command:

exp_heatmap plot chr2_output/ --start 136108646 --end 137108646 --title "LCT gene" --output LCT_xpehh

exp_heatmap reads the files from the chr2_output/ folder, creates the ExP heatmap, and saves it as LCT_xpehh.png.


Usage

ExP Heatmap follows a simple three-step workflow: prepare → compute → plot. Each step can be used independently, depending on your data format.

Command-Line Interface

1. Data Preparation - prepare

Convert VCF files to efficient Zarr format for faster computation.

exp_heatmap prepare [OPTIONS] <vcf_file>
  • <vcf_file> [PATH]: Recoded VCF file
  • -o, --output [PATH]: Directory for output files

2. Statistical Analysis - compute

Calculate population genetic statistics across all genomic positions.

exp_heatmap compute [OPTIONS] <zarr_dir> <panel_file>

  • <zarr_dir> [PATH]: Directory with Zarr files from the prepare step
  • <panel_file> [PATH]: Population panel file
  • -o, --output: Directory for output files
  • -t, --test: Statistical test to compute
    • xpehh: Cross-population Extended Haplotype Homozygosity (default)
    • xpnsl: Cross-population Number of Segregating sites by Length
    • delta_tajima_d: Delta Tajima's D
    • hudson_fst: Hudson's Fst genetic distance
  • -c, --chunked: Use chunked array to avoid memory exhaustion
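
Before running compute, it can help to check what the panel file contains and how many population comparisons to expect. The sketch below is illustrative only: it assumes the tab-separated layout with sample, pop, super_pop, and gender columns used by the 1000 Genomes panel file, the sample rows are made up, and it counts unordered population pairs, which may not match how exp_heatmap organizes its output.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Tiny stand-in for a 1000 Genomes style panel file (tab-separated).
# The sample IDs and rows are made up for illustration.
PANEL_TEXT = (
    "sample\tpop\tsuper_pop\tgender\n"
    "S1\tGBR\tEUR\tmale\n"
    "S2\tGBR\tEUR\tfemale\n"
    "S3\tYRI\tAFR\tfemale\n"
    "S4\tCHB\tEAS\tmale\n"
)


def samples_per_population(panel_text):
    """Count samples per population label in the text of a panel file."""
    reader = csv.DictReader(io.StringIO(panel_text), delimiter="\t")
    return Counter(row["pop"] for row in reader)


def population_pairs(populations):
    """Return every unordered pair of distinct populations, in sorted order."""
    return list(combinations(sorted(populations), 2))


counts = samples_per_population(PANEL_TEXT)
pairs = population_pairs(counts)
print(counts["GBR"], len(pairs))         # 2 3
print(len(population_pairs(range(26))))  # 325 unordered pairs for 26 populations
```

With 26 populations there are 325 unordered pairs (650 if each direction is counted separately), which is why a single region can hold a very large number of values.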

3. Visualization - plot

Generate heatmap visualizations from computed statistics.

exp_heatmap plot [OPTIONS] <input_dir>
  • <input_dir>: Directory with TSV files from compute step
  • -s, --start & -e, --end: Genomic coordinates for the region to display. Uses nearest available position if exact match not found in the input data.
  • -m, --mid: Alternative way to specify region. The start and end positions will be calculated (mid ± 500 kb)
  • -t, --title: Title of the heatmap
  • -o, --output: Output filename (without .png extension)
  • -c, --cmap: Matplotlib colormap name (see the Matplotlib list of colormaps)
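
The region options above can be illustrated with a short sketch. This is not the package's implementation: the function names are hypothetical, clamping the start at 0 and the tie-breaking rule are assumptions of this sketch, and the positions list is made up. It only shows the arithmetic of mid ± 500 kb and one way to snap a coordinate to the nearest available position.

```python
from bisect import bisect_left

HALF_WINDOW = 500_000  # 500 kb on each side of the midpoint, as described for --mid


def region_from_mid(mid, half_window=HALF_WINDOW):
    """Return (start, end) for a window centred on `mid`.

    Clamping the start at 0 is an assumption of this sketch.
    """
    return max(0, mid - half_window), mid + half_window


def nearest_position(sorted_positions, target):
    """Return the entry of `sorted_positions` closest to `target`.

    Ties go to the lower position (an arbitrary choice for this sketch).
    """
    if not sorted_positions:
        raise ValueError("no positions available")
    i = bisect_left(sorted_positions, target)
    if i == 0:
        return sorted_positions[0]
    if i == len(sorted_positions):
        return sorted_positions[-1]
    before, after = sorted_positions[i - 1], sorted_positions[i]
    return before if target - before <= after - target else after


start, end = region_from_mid(136_608_646)
print(start, end)  # 136108646 137108646

positions = [136_108_600, 136_108_700, 137_108_640]  # made-up positions
print(nearest_position(positions, start))  # 136108600
print(nearest_position(positions, end))    # 137108640
```

Under these assumptions, a midpoint of 136608646 yields the same window as the --start 136108646 --end 137108646 pair used in the Quick Start.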

Python Package

The Python API offers more flexibility and customization options. Choose the appropriate scenario based on your data format:

Scenario A: Ready-to-Plot Data

Use when: You have pre-computed p-values in a TSV file.

Data format: TSV file with columns: CHROM, POS, followed by pairwise p-value columns for population comparisons.

from exp_heatmap.plot import plot_exp_heatmap
import pandas as pd

# Load your p-values data
data = pd.read_csv("pvalues.tsv", sep="\t")

# Create heatmap
plot_exp_heatmap(
    data,
    begin=135287850,
    end=136287850,
    title="Population Differences in LCT Gene",
    cmap="Blues",
    output="lct_analysis",
    populations="1000Genomes"  # Predefined population set
)
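
The data format described for this scenario can be sketched as follows. The population labels, the "<pop1>_<pop2>" column naming, and the values are all placeholders invented for illustration; check the column names in your own files (or in the output of the compute step) before relying on this layout.

```python
import csv
import io
from itertools import combinations

populations = ["POP_A", "POP_B", "POP_C"]  # placeholder population labels

# Hypothetical naming scheme for the pairwise columns: "<pop1>_<pop2>"
pair_columns = [f"{a}_{b}" for a, b in combinations(populations, 2)]
header = ["CHROM", "POS"] + pair_columns

# Made-up values, one per pairwise column
rows = [
    ["2", 135287850, 0.42, 0.11, 0.05],
    ["2", 135287912, 0.38, 0.23, 0.07],
]

# Write the table as tab-separated text (to memory here, instead of a file)
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

The first two columns identify the genomic position, and every remaining column holds the values for one population comparison.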

Scenario B: Statistical Results to P-values

Use when: You have computed statistical test results that need conversion to p-values.

from exp_heatmap.plot import plot_exp_heatmap, create_plot_input

# Convert statistical results to ranked p-values
data_to_plot = create_plot_input(
    "results_directory/",      # Directory with test results
    begin=135287850, 
    end=136287850, 
    populations="1000Genomes",
    rank_pvalues="2-tailed"    # Options: "2-tailed", "ascending", "descending"
)

# Create heatmap
plot_exp_heatmap(
    data_to_plot,
    begin=135287850,
    end=136287850,
    title="XP-NSL Test Results",
    cmap="expheatmap",         # Custom ExP colormap
    output="xpnsl_results"
)
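
To give a feel for what the three rank_pvalues options might mean, here is a minimal sketch of empirical rank p-values. It is an assumption-laden illustration, not the package's code: the function below is a hypothetical stand-in, the mapping of each mode name to a ranking direction is a guess, and exp_heatmap may rank, handle ties, or transform the result differently.

```python
def rank_pvalues(values, mode="descending"):
    """Empirical rank p-values: rank / n, where rank 1 is the most extreme value.

    Assumed meaning of each mode (for this sketch only):
      "descending": the largest values are the most extreme
      "ascending":  the smallest values are the most extreme
      "2-tailed":   the largest absolute values are the most extreme
    Ties are broken by input order to keep the sketch short.
    """
    if mode == "descending":
        keys = [-v for v in values]
    elif mode == "ascending":
        keys = list(values)
    elif mode == "2-tailed":
        keys = [-abs(v) for v in values]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Indices of `values`, ordered from most to least extreme
    order = sorted(range(len(values)), key=keys.__getitem__)
    pvalues = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        pvalues[index] = rank / len(values)
    return pvalues


scores = [-3.0, 0.5, 2.0, -0.1]  # made-up test statistics
print(rank_pvalues(scores, "descending"))  # [1.0, 0.5, 0.25, 0.75]
print(rank_pvalues(scores, "ascending"))   # [0.25, 0.75, 1.0, 0.5]
print(rank_pvalues(scores, "2-tailed"))    # [0.25, 0.75, 0.5, 1.0]
```

The same scores give different p-values depending on which tail is treated as extreme, which is why the three options produce visibly different heatmaps.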

Scenario C: Complete VCF Workflow

Use when: Starting from raw VCF files. Combine CLI commands with Python plotting:

import subprocess
from exp_heatmap.plot import plot_exp_heatmap, create_plot_input

# 1. Prepare data (using CLI); check=True raises an error if the command fails
subprocess.run(["exp_heatmap", "prepare", "data_snps.recode.vcf", "-o", "data.zarr"], check=True)

# 2. Compute statistics (using CLI)
subprocess.run(["exp_heatmap", "compute", "data.zarr", "populations.panel", "-o", "results/"], check=True)

# 3. Create custom plots (using Python)
data_to_plot = create_plot_input("results/", begin=47000000, end=49000000)
plot_exp_heatmap(data_to_plot, begin=47000000, end=49000000, 
                 title="Custom Analysis", output="custom_plot")

Advanced Customization

Fine-tune your visualizations with advanced options:

from exp_heatmap.plot import plot_exp_heatmap, prepare_cbar_params, superpopulations

# Custom colorbar settings
cmin, cmax, cbar_ticks = prepare_cbar_params(data_to_plot, n_cbar_ticks=6)

# Advanced plot with multiple customizations
plot_exp_heatmap(
    data_to_plot,
    begin=135000000,
    end=137000000,
    title="Selection Signals in African Populations",
    
    # Population filtering
    populations=superpopulations["AFR"],  # Focus on African populations
    # Available: ["AFR", "AMR", "EAS", "EUR", "SAS"] or custom list
    
    # Visual customizations
    cmap="expheatmap",                    # Custom ExP colormap
    display_limit=1.60,                   # Filter noise (values below limit = white)
    display_values="higher",              # Show values above display_limit
    
    # Annotations
    vertical_line=[                       # Mark important SNPs
        [135851073, "rs41525747"],        # [position, label]
        [135851081, "rs41380347"]
    ],
    
    # Colorbar customization
    cbar_vmin=cmin,
    cbar_vmax=cmax,
    cbar_ticks=cbar_ticks,
    
    # Output
    output="african_populations_analysis",
    xlabel="Custom region description"
)
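
The effect of display_limit and display_values can be pictured with a small sketch. The helper below is hypothetical and only mirrors the behavior described in the comments above (values below the limit are left blank); the "lower" option and the inclusive boundary are assumptions of this sketch, not confirmed package behavior.

```python
def apply_display_limit(values, display_limit, display_values="higher"):
    """Replace values on the hidden side of `display_limit` with None.

    "higher" keeps values >= display_limit; "lower" (assumed counterpart)
    keeps values <= display_limit. None stands for a blank (white) cell.
    """
    if display_values == "higher":
        keep = lambda v: v >= display_limit
    elif display_values == "lower":
        keep = lambda v: v <= display_limit
    else:
        raise ValueError(f"unknown display_values: {display_values!r}")
    return [v if keep(v) else None for v in values]


row = [0.3, 1.6, 2.4, 1.2]  # made-up heatmap values
print(apply_display_limit(row, 1.60))           # [None, 1.6, 2.4, None]
print(apply_display_limit(row, 1.60, "lower"))  # [0.3, 1.6, None, 1.2]
```

Masking weak values this way leaves only the cells on the kept side of the limit colored, so strong signals stand out against a blank background.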

Workflow Examples

Complete Analysis: SLC24A5 Gene

[Figure: ExP heatmap of SLC24A5 gene]

This example demonstrates a full workflow, using 1000 Genomes Project data, for the SLC24A5 gene, which is known for its role in human skin pigmentation. SLC24A5 also shows strong selection signals, which makes it a suitable example.

#!/bin/bash

# Download 1000 Genomes data
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr15.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz" -O chr15.vcf.gz
wget "ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel" -O genotypes.panel

# Filter to SNPs only
vcftools --gzvcf chr15.vcf.gz \
    --remove-indels \
    --recode \
    --recode-INFO-all \
    --out chr15_snps

# Prepare data (convert the recoded VCF to Zarr)
exp_heatmap prepare chr15_snps.recode.vcf -o chr15_snps.recode.zarr

# Compute statistics (XPEHH by default)
exp_heatmap compute chr15_snps.recode.zarr genotypes.panel -o chr15_snps_output

# Generate heatmap for SLC24A5 region
exp_heatmap plot chr15_snps_output \
    --start 47924019 \
    --end 48924019 \
    --title "SLC24A5" \
    --cmap gist_heat \
    --output SLC24A5_heatmap

Gallery

Different P-value Computations

The same XP-EHH test data for the ADM2 gene region, showing different p-value calculation methods:

[Figure: Two-tailed p-values]

[Figure: Ascending p-values]

[Figure: Descending p-values]

Noise Filtering

Using display_limit and display_values parameters to filter noisy data and highlight significant regions:

[Figure: Filtered display]

Same data as above, but with display_limit=1.60 to filter noise and highlight significant signals.

Contributing

We welcome contributions! Feel free to contact us or submit issues or pull requests.

Development Setup

git clone https://github.com/bioinfocz/exp_heatmap.git
cd exp_heatmap
pip install -e .

License

This project is licensed under a custom non-commercial license based on the MIT License. See the LICENSE file for details.

For commercial licensing under different terms, please contact: [email protected]

Acknowledgments

GenoMat, IMG CAS, ELIXIR

If you use ExP Heatmap in your research, please cite our paper [citation details will be added upon publication].