HistoPath is a toolkit for analyzing histopathology images: it extracts quantitative features, performs feature selection, and clusters patient samples by tissue characteristics. It relies on hierarchical clustering and dimensionality reduction (PCA, UMAP) to identify patterns and relationships in histopathology data.
The goal of this project is to analyze histopathology images through a pipeline that includes:
- Sampling tissue patches from whole slide images
- Extracting morphological and textural features from these patches
- Aggregating features across multiple patches per patient
- Selecting the most discriminative features
- Clustering patients based on selected features
- Visualizing and analyzing the clustering results
- Integrating with MRI radiomics features
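The aggregation and integration steps in the list above can be sketched with pandas. This is a minimal illustration, not the code in `mean_features.py` or `all_features.py`: the column names (`patient_id`, `nuclei_area`, `texture_entropy`, `mr_volume`), the use of the mean as the aggregate, and the inner join are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical patch-level table: one row per tissue patch.
patches = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2", "P2", "P3"],
        "nuclei_area": [10.0, 14.0, 20.0, 22.0, 30.0],
        "texture_entropy": [0.5, 0.7, 0.2, 0.4, 0.9],
    }
)

# Hypothetical MRI radiomics table: one row per patient.
mri = pd.DataFrame(
    {
        "patient_id": ["P1", "P2", "P4"],
        "mr_volume": [1.5, 2.5, 3.5],
    }
)


def aggregate_per_patient(patch_features: pd.DataFrame) -> pd.DataFrame:
    """Average every numeric feature over the patches of each patient."""
    return patch_features.groupby("patient_id").mean(numeric_only=True).reset_index()


def integrate(histo: pd.DataFrame, mr: pd.DataFrame) -> pd.DataFrame:
    """Keep only patients that have both histopathology and MRI features."""
    return histo.merge(mr, on="patient_id", how="inner")


patient_features = aggregate_per_patient(patches)
combined = integrate(patient_features, mri)
print(combined)
```

An inner join silently drops patients missing from either table (here `P3` and `P4`); a left join would keep them with empty MRI columns instead, which may or may not suit the downstream clustering.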
The following Python libraries are required to run this project:
numpy>=1.20.0
pandas>=1.3.0
scikit-learn>=0.24.0
scikit-image>=0.18.0
matplotlib>=3.4.0
seaborn>=0.11.0
openslide-python>=1.1.0
histomicstk>=1.0.0
umap-learn>=0.5.0
scipy>=1.7.0
Pillow>=8.0.0
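To see which of these requirements an environment already satisfies, a small standard-library check along the following lines may help. It is a sketch, not part of the repository: the requirement strings are copied from the list above, and comparing only the first three numeric components of each version is a simplifying assumption (the `packaging` library handles versions more robustly).

```python
import re
from importlib import metadata
from typing import Dict, List, Tuple

REQUIREMENTS = [
    "numpy>=1.20.0",
    "pandas>=1.3.0",
    "scikit-learn>=0.24.0",
    "scikit-image>=0.18.0",
    "matplotlib>=3.4.0",
    "seaborn>=0.11.0",
    "openslide-python>=1.1.0",
    "histomicstk>=1.0.0",
    "umap-learn>=0.5.0",
    "scipy>=1.7.0",
    "Pillow>=8.0.0",
]


def parse_requirement(line: str) -> Tuple[str, str]:
    """Split 'name>=version' into its package name and minimum version."""
    name, minimum = line.split(">=")
    return name.strip(), minimum.strip()


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '1.20.0' into (1, 20, 0), keeping at most three numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def check(requirements: List[str]) -> Dict[str, str]:
    """Map each package to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for line in requirements:
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        is_ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if is_ok else f"too old ({installed} < {minimum})"
    return report


for package, status in check(REQUIREMENTS).items():
    print(f"{package}: {status}")
```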
- Clone the repository:
git clone https://github.com/BMGLab/HistoPath.git
cd HistoPath
- Install the required packages:
pip install -r requirements.txt
- Install OpenSlide for whole slide image processing:
  - For macOS:
      brew install openslide
  - For Ubuntu:
      sudo apt-get install openslide-tools
- Install HistomicsTK:
  HistomicsTK requires a specific installation procedure due to its dependencies:
      # Create a conda environment (recommended)
      conda create -n histopath python=3.8
      conda activate histopath
      # Install HistomicsTK
      pip install histomicstk
      # If issues occur with dependencies, install them individually:
      pip install large_image
      pip install girder-client
For more detailed installation instructions and troubleshooting, see the HistomicsTK documentation.
The project is organized into several modules, each handling a specific step in the analysis pipeline:
HistoPath/
│
├── Sampling/ # Extract tissue patches from whole slide images
│ └── sampling.py
│
├── Feature_Extraction/ # Extract and aggregate features from images
│ ├── feature_extraction_density.py
│ └── mean_features.py
│
├── Feature_selection/ # Select most discriminative features
│ └── feature_selection.py
│
├── Clustering/ # Cluster samples based on selected features
│ └── clustering_selected_features.py
│
├── MR/ # MRI radiomic features processing
│ └── mr_feature_list.py
│
├── HistoandMR/ # Integration of histopathology and MRI features
│ └── all_features.py
│
├── Cluster_result_analysis/ # Analyze clustering results
│
├── Outputspdf/ # Output visualizations and results
│
└── README.md
The modules should be executed in the following sequence:
- Tissue Sampling:
      python Sampling/sampling.py
  This extracts tissue patches from whole slide images and saves them to the slides/ directory.
- Feature Extraction:
      python Feature_Extraction/feature_extraction_density.py
  This processes the tissue patches, extracts features, and saves them to the output_density/ directory.
- Feature Aggregation:
      python Feature_Extraction/mean_features.py
  This aggregates features across multiple patches and creates a combined feature file.
- Feature Selection:
      python Feature_selection/feature_selection.py
  This selects the most discriminative features using multiple methods and saves them to selected_features.csv.
- Clustering and Visualization:
      python Clustering/clustering_selected_features.py
  This performs hierarchical clustering on the selected features and generates visualizations.
- MRI Feature Processing:
      python MR/mr_feature_list.py
  This processes MRI radiomic features from CaPTk output files.
- Histopathology-MRI Integration:
      python HistoandMR/all_features.py
  This integrates histopathology features with MRI features for multimodal analysis.
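Because each step reads the output of the one before it, the sequence above could be wrapped in a small driver script. The sketch below is hypothetical (the repository does not ship such a runner); the script paths are the ones listed above, while `PIPELINE_STEPS` and `run_pipeline` are names introduced for illustration.

```python
import subprocess
import sys
from typing import List

# Script paths in the execution order given above.
PIPELINE_STEPS = [
    "Sampling/sampling.py",
    "Feature_Extraction/feature_extraction_density.py",
    "Feature_Extraction/mean_features.py",
    "Feature_selection/feature_selection.py",
    "Clustering/clustering_selected_features.py",
    "MR/mr_feature_list.py",
    "HistoandMR/all_features.py",
]


def run_pipeline(dry_run: bool = True) -> List[List[str]]:
    """Run each step in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    commands = [[sys.executable, script] for script in PIPELINE_STEPS]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on incomplete outputs.
            subprocess.run(command, check=True)
    return commands


# Dry run by default; call run_pipeline(dry_run=False) from the repository root.
planned = run_pipeline()
```

Using `sys.executable` rather than a bare `python` keeps every step inside the same interpreter and environment (for example the `histopath` conda environment) that launched the runner.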
The toolkit is designed to work with standard histopathology whole slide images (WSIs) in formats supported by OpenSlide (e.g., .svs, .ndpi, .tiff). Sample images are not included in the repository due to size constraints.
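Since the slides must be supplied by the user, a helper for collecting them from a local folder may be useful. The following is a minimal sketch using only the standard library; the extension set is limited to the three formats named above and is an assumption that can be extended, and `find_slides` is a hypothetical helper rather than a function from the repository.

```python
from pathlib import Path
from typing import List

# Extensions named in the text above; OpenSlide supports further formats.
WSI_EXTENSIONS = {".svs", ".ndpi", ".tiff"}


def find_slides(root: str) -> List[Path]:
    """Return slide files under root, sorted by path.

    Subdirectories are searched recursively and extensions are matched
    case-insensitively.
    """
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in WSI_EXTENSIONS
    )


# Example: list every slide below a (hypothetical) data directory.
# for slide_path in find_slides("data/wsi"):
#     print(slide_path)
```

Each returned path could then be opened with `openslide.OpenSlide(str(path))` once OpenSlide is installed.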
The project generates various output files:
- Extracted image patches in PNG format
- Feature CSV files for each image and patient
- Combined feature matrices
- Selected feature lists
- Clustering visualizations (dendrogram, PCA, UMAP plots)
- Heatmaps of feature patterns
- Statistical analysis of discriminative features
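To make the clustering outputs listed above more concrete, here is a minimal sketch of hierarchical clustering followed by a PCA projection. It runs on synthetic data and is not the code in `clustering_selected_features.py`: the Ward linkage, the standardization step, and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a patient-by-feature matrix: two groups of
# 10 patients, 5 features each, with well-separated means.
group_a = rng.normal(loc=0.0, scale=0.5, size=(10, 5))
group_b = rng.normal(loc=5.0, scale=0.5, size=(10, 5))
features = np.vstack([group_a, group_b])

# Standardize so that no single feature dominates the Euclidean distances.
scaled = StandardScaler().fit_transform(features)

# Ward linkage builds the merge tree that a dendrogram would display.
tree = linkage(scaled, method="ward")

# Cut the tree into two flat clusters (labels are 1 and 2).
labels = fcluster(tree, t=2, criterion="maxclust")

# Project onto two principal components for a scatter plot.
coords = PCA(n_components=2).fit_transform(scaled)

print(labels)
print(coords.shape)
```

The `tree` array can be passed to `scipy.cluster.hierarchy.dendrogram` for plotting, and `umap.UMAP(n_components=2).fit_transform(scaled)` from umap-learn could replace the PCA step for a nonlinear embedding.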
This project makes use of several open-source libraries and tools that deserve acknowledgement:
- HistomicsTK for histopathology image analysis
- OpenSlide for reading whole slide image formats
- scikit-learn for machine learning and feature selection algorithms
- UMAP for dimensionality reduction
- CaPTk (Cancer Imaging Phenomics Toolkit) for MRI feature extraction
- pandas and NumPy for data manipulation
- Matplotlib and seaborn for visualization
Special thanks to the developers and contributors of these libraries for making this research possible.