With the rapid advancement of multi-omics technologies, healthcare institutions now generate vast amounts of genomic, proteomic, and metabolomic data. While these datasets hold the key to personalized medicine, they typically reside in disconnected systems - from sequencing machines to EHRs to research databases. Traditionally, integrating this data required teams of bioinformaticians to build complex analysis pipelines, slowing down critical treatment decisions.
By leveraging mutation Data has led to discovery of novel disease variants. For example breast cancer patient management system serves as a powerful platform for physicians to discover and analyze novel disease variants by integrating clinical and genomic data. By systematically recording patient demographics, treatment outcomes, and mutation profiles, the system enables:
Pattern Recognition - Identifies recurring mutation clusters across specific age groups, ethnicities, and geographic locations, revealing potential founder mutations or environmental risk factors.
Variant Alert System - Automatically flags novel variants and compares them against global databases (COSMIC, ClinVar), highlighting mutations with predicted clinical significance.
Treatment Response Analysis - Correlates specific mutations with drug efficacy, helping physicians identify biomarkers for treatment resistance or sensitivity.
Collaborative Research - Facilitates secure data sharing across institutions, creating a crowdsourced knowledge base for rare variants and atypical presentations.
Predictive Modeling - Uses accumulated data to forecast disease progression patterns and suggest personalized therapeutic approaches based on mutation profiles.
The system transforms routine clinical documentation into a dynamic discovery tool, where each new patient record contributes to our understanding of breast cancer heterogeneity. Physicians gain real-time insights into how specific mutations influence: -
1. Metastatic patterns
2. Disease progression timelines
3. Survival outcomes
4. Therapeutic vulnerabilities
By making these correlations visible at the point of care, the platform accelerates the identification of novel disease variants and enables more precise, personalized treatment strategies - bridging the gap between genomic research and clinical practice.
Unified Data Lakehouse for harmonizing sequencing data, clinical records, and research repositories
Low-Code Transformations to clean and standardize omics data without extensive coding
Built-In ML Capabilities for running predictive models directly on Fabric notebooks
 1. Daniel Muthama (ML and Backend)
 2. Eunice Nduku (Data)
 3. Daniel Muruthi (Frontend)
This AI-powered oncology platform analyzes integrated genomic, proteomic, and metabolomic data to predict disease outcomes and generate personalized treatment recommendations. The system utilizes Microsoft Fabric for multimodal data orchestration, Azure AI Search for biomedical evidence retrieval, and Azure OpenAI for clinical insights generation. Specifically designed for breast cancer research, it identifies pathogenic mutation patterns, detects clinically significant genomic variants, and synthesizes comprehensive reports highlighting therapeutic implications derived from multi-omics analysis.
Precision in Terminology: Changed "disease recovery" to "disease outcomes" (more clinically accurate)
Oncology Focus: Added "specifically designed for breast cancer research"
Technical Clarity: Specified "pathogenic mutation patterns" and "clinically significant genomic variants"
Flow: Improved logical progression from data types → analysis → clinical outputs
Professional Tone: Used terms like "therapeutic implications" and "multi-omics analysis
   
  
  Figure 1: High-level architecture of the bioscience platform
1. ML Model - Training/ Classification/ Fine-tuning
2. RAG System - Bioscince RAG system
3. Data Collector - breast_cancer_recorder
4. Visualizations
GenomicAnalysisWorkspace/
│
├── BioEventHouse/                     # Eventhouse and KQL Database for genomic events
│   ├── (Eventhouse data)
│   └── (KQL Database)
│
├── BioEventHouse_queryset/            # KQL Queryset for querying genomic events
│
├── Biospecimen_RAG_System/            # Notebook for biospecimen RAG (Retrieval-Augmented Generation) system
│
├── Biospecimen_Report_Generator/      # Notebook for generating biospecimen reports
│
├── BiospecimenClassifier/             # Machine learning model for biospecimen classification
│
├── Data_Engineering/                  # Notebook for data engineering tasks
│
├── Genomel_H/                         # Lakehouse for genomic data
│   ├── (Lakehouse data)
│   ├── Semantic model
│   └── SQL analytics endpoint
│
├── GenomicAnalysisPipeline/           # Notebook and experiment for genomic analysis
│   ├── (Notebook)
│   └── (Experiment)
│
├── GenomicDataProcessing/             # Notebook for genomic data processing
│
└── model_deployment/                  # Notebook and experiment for model deployment
    ├── (Notebook)
    └── (Experiment)
   
  
  Figure 4: MS Fabric - GenomicAnalysisWorkspace
   
  
  Figure 5: MS Fabric - GenomicAnalysisWorkspace2
   
  
  Figure 2: Breast Cancer Recorder
   
  
  Figure 3: Breast Cancer Recorder with Records
breast-cancer-rag-app/
├── backend/
│   ├── __pycache__/          # Python cached bytecode
│   ├── venv/                 # Python virtual environment
│   ├── .env                  # Environment variables
│   ├── biospecimen_rag.py    # Core RAG implementation
│   ├── Dockerfile            # Backend container configuration
│   ├── main.py               # FastAPI entry point
│   └── requirements.txt      # Python dependencies
│
└── frontend/
    ├── node_modules/         # NPM packages
    ├── public/               # Static assets
    ├── src/
    │   ├── components/
    │   │   ├── QueryInterface.js  # Main query component
    │   │   ├── ResultsViewer.js   # Results display
    │   │   └── StatusIndicator.js # System status UI
    │   ├── services/
    │   │   └── apiService.js      # API communication
    │   ├── App.js           # Root React component
    │   ├── index.js         # React entry point
    │   ├── reportWebVitals.js # Performance tracking
    │   ├── styles.css       # Global styles
    │   └── .gitignore       # Frontend ignore rules
    ├── Dockerfile           # Frontend container config
    ├── package-lock.json    # Exact dependency tree
    └── package.json         # Project metadata (implied)
   
  
  Figure 10: MS Fabric - Frontend
   
  
  Figure 11: MS Fabric - Backend
   
  
  Figure 18: MS Fabric - Frontend for a Backend(service working)
		OPENAI_GPT4_DEPLOYMENT="gpt-4"
		OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
		OPENAI_API_KEY="<your-key>"
		OPENAI_ADA_DEPLOYMENT="text-embedding-ada-002"
		KUSTO_URI="https://trd-zdxwqrcu1znbqygpxg.z2.kusto.fabric.microsoft.com"
		KUSTO_DATABASE="BioEventHouse"
		KUSTO_TABLE="biospecimen_embeddings"
		AZURE_TENANT_ID="<tenant-id>"
		AZURE_CLIENT_ID="<client-id>"
		AZURE_CLIENT_SECRET="<client-secret>"
		APP_INSIGHTS_KEY=your-instrumentation-key
		SECRET_KEY=your-secret-key-for-flask
breast-cancer-recorder/
├── build/
├── static/
├── asset-manifest.json
├── index.html
├── manifest.json
├── robots.txt
├── node_modules/
├── public/
│   ├── index.html
│   ├── manifest.json
│   ├── robots.txt
│   └── reports/                      # New directory for research reports
│       └── breast_cancer_report.md   # Added markdown report
├── src/
│   ├── components/
│   │   ├── DataTable.js
│   │   └── PatientForm.js
│   ├── utils/
│   ├── App.js
│   ├── index.js
│   ├── reportWebVitals.js
│   ├── styles.css
│   └── report-components/            # New directory for report components
│       ├── ReportViewer.js           # Component to display the markdown
│       └── ReportGenerator.js        # Component to generate dynamic reports
├── .env
├── .gitignore
├── package.json
└── vercel.json
The above React application collects breast cancer patient data including location, age, cancer stage, weight, and email. It features form validation to ensure accurate data entry, with age restricted to 18-120 years, weight validated as positive numbers, and proper email formatting. . Collected data can be downloaded as CSV for analysis. The app includes error handling and success notifications. The clean interface prioritizes usability while maintaining data integrity, making it suitable for medical professionals to record and manage patient information efficiently.
This Medallion pipeline ingests raw breast cancer patient data (bronze), cleans/validates it (silver), and enriches with analytical features (gold) in Fabric's Lakehouse. The process includes data type conversion, email validation, weight normalization, and risk categorization. PowerBI connects via Direct Lake for real-time visualization of age groups, cancer stages, and geographic distributions. For long-term storage, gold data exports to AWS S3 in Parquet format. The pipeline enables specialists to identify patterns and generate personalized prevention advice based on historical correlations between patient demographics and cancer progression. Microsoft Fabric streamlines this workflow with integrated Spark processing, Delta Lake storage, and PowerBI analytics in one platform.
   
  
  Figure 6: MS Fabric - Breast_Cancer_LakeHouse
   
  
  Figure 7: MS Fabric - Breast_Cancer_LakeHouse
   
  
  Figure 8: MS Fabric - Breast_Cancer_LakeHouse
   
  
  Figure 9: MS Fabric - Breast_Cancer_LakeHouse
Generated from Lakehouse Pipeline – {{date}}
- 
Total Patients: {{gold_df.count()}}
- 
Average Age: {{stage_analysis_df.select(avg("avg_age")).first()[0]}}years
- 
Weight Distribution: - Mean: {{stage_analysis_df.select(avg("avg_weight")).first()[0]}}kg
- Std Dev: {{stage_analysis_df.select(stddev("avg_weight")).first()[0]}}kg
 
- Mean: 
| Stage | Patients (%) | Avg Age | Top Location | 
|---|---|---|---|
| 1 | {{count_stage1/total*100}}% | {{age_stage1}} | {{top_loc_stage1}} | 
| 2 | {{count_stage2/total*100}}% | {{age_stage2}} | {{top_loc_stage2}} | 
| ... | ... | ... | ... | 
Insight: Early-stage (1-2) diagnoses are most prevalent in {{top_location}}.
![Diagnosis Over Time]
- Peak Diagnoses: {{year_with_max_cases}}
- Recent Change: {{last_3_years_trend}}(↑/↓)
For Stage {{X}} Patients (Age {{Y}}):
"Patients at this stage should prioritize {{GPT-4_advice}}..."
- Complete Records: {{valid_records/total*100}}%
- Missing Data:
- {{null_cancer_stage}}missing stage labels
- {{null_weight}}missing weight entries
 
- Data Source: breast_cancer_patients.csv
- Pipeline:
- Bronze: Raw ingestion
- Silver: PII pseudonymization + cleaning
- Gold: Analytics + AI enrichment (GPT-4)
 
- Tools: Microsoft Fabric, Power BI, PySpark
- BioEventHouse: Eventhouse and KQL Database for genomic event data
- Genomel_H: Lakehouse for genomic data with semantic model and SQL analytics
- Multiple Jupyter notebooks for various genomic analysis tasks
- Experiments tracking for machine learning workflows
The provided observations outline a genomic machine learning pipeline leveraging MLflow for model management and reproducibility. Key aspects include:
Artifact Tracking: Model files (model.pkl), environment specifications (conda.yaml, python_env.yaml), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.
Runtime Metrics: Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.
   
  
  Figure 16: MS Fabric - GenomicAnalysisWorkspace
   
  
  Figure 17: MS Fabric - GenomicAnalysisWorkspace
The MLmodel file defines metadata for model deployment, including:
Dependencies: Conda/virtualenv environments to replicate training conditions.
Model Specifications: Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.
Version Control: Explicit library versions (sklearn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.
Unique run_id and experiment IDs enable traceability across genomic analyses.
Implications: This setup ensures reproducibility (via environment isolation), scalability (through MLflow’s tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.
   
  
  Figure 12: MS Fabric - Genomic Analysis Pipeline
   
  
  Figure 13: MS Fabric - Genomic Analysis Pipeline
   
  
  Figure 14: MS Fabric - Genomic Analysi.s Pipeline
   
  
  Figure 15: MS Fabric - Genomic Analysis Pipeline
- BiospecimenClassifier: ML model for biospecimen classification
- Model deployment experiments
- Biospecimen_RAG_System: Retrieval-Augmented Generation system
- Biospecimen_Report_Generator: Automated report generation
- Azure Account: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
- Python 3.8+: Install Python and required libraries.
- Power BI Desktop: For creating visualizations.
- Microsoft Fabric Workspace: With contributor permissions.
- Genomic Datasets: Access to required genomic data sources.
Install the required Python libraries:
Create a Fabric workspace and set up OneLake.
Set up an OpenAI resource and deploy a GPT-4 model.
Replace placeholders (e.g., <api_key>, <connection_string>) in the code with your Azure resource details.
Use Power BI to create interactive dashboards for visualizing:
Molecular phenotyping profiles.
Top biomarkers for disease recovery.
Trends in gene expression.
Contributions are welcome! Please follow these steps:
Repository: https://github.com/danielmuthama23/Genomic_Analysis.git
This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.
Analysis Methods:
- Mutation frequency analysis of key cancer genes
- Protein-protein interaction networks to identify functional clusters
- Metabolic pathway mapping to detect dysregulated processes
Key Datasets:
- PDC_biospecimen_manifest_03272025_214257.csv
- Embedded mock genomic data for test and validation
![Mutation Counts]
Top Pathogenic Mutations:
| Gene | Mutation Count | Disease-Associated | Percentage | 
|---|---|---|---|
| TP53 | 8 | 8 | 100% | 
| PIK3CA | 5 | 5 | 100% | 
| BRCA1 | 4 | 4 | 100% | 
Insights:
- TP53 mutations were ubiquitous (100% disease-linked), indicating its role as a primary driver.
- PIK3CA and BRCA1/2 mutations showed strong disease associations.
  
  
  Figure 19: MS Fabric - Mutation
![Protein Network]
Critical Hubs (High Connectivity):
- TP53 (4 interactions)
- BRCA1 (3 interactions)
- PIK3CA (3 interactions)
Key Observations:
- Red nodes (PDC-identified proteins) formed central hubs.
- Green edges (activation) dominated oncogenic pathways (e.g., PIK3CA→AKT1).
  
  
  Figure 21: MS Fabric - Mutation
![Metabolic Pathways]
Most Dysregulated Pathways:
- Glycolysis (↑ Glucose-6-P, Fructose-1,6-BP)
- TCA Cycle (↓ Succinyl-CoA, ↑ Acetyl-CoA)
- Fatty Acid Synthesis (↑ Malonyl-CoA)
Top Dysregulated Metabolite:
- Acetyl-CoA (2.1-fold change, linked to PTEN mutations).
  
  
  Figure 20: MS Fabric - Metabolic Pathway
- Thresholds: Genes with >70% disease-associated mutations flagged as high-risk.
- Validation: Cross-referenced with COSMIC database.
- Prioritized hub genes (e.g., TP53) as biomarkers.
- Inhibition edges (red) highlighted drug targets (e.g., PTEN→AKT1).
- Glycolysis/TCA cycle disruptions correlated with TP53/PIK3CA mutations.
- High Acetyl-CoA suggests vulnerability to metabolic inhibitors.
From the analysis we can conclude:-
Diagnostic Markers:
- TP53 mutations as universal biomarkers.
- PIK3CA activation signals aggressive subtypes.
Therapeutic Targets:
- Target PIK3CA-AKT1 interactions.
- Explore metabolic inhibitors for Acetyl-CoA-overproducing tumors.
Future Works:
- Validate with clinical outcomes data.
- Expand analysis to RNA-seq.
| File | Description | 
|---|---|
| mutation_disease_counts.png | Top mutated genes with disease associations. | 
| protein_network.png | Protein interaction network with PDC hubs. | 
| metabolic_pathways.png | Dysregulated metabolic pathways. | 
Prepared by: Daniel Muthama
Date: April 2, 2025
Contact: (mailto:[email protected])
- Clinicians: Focus on TP53/PIK3CA status for patient stratification.
- Researchers: Explore metabolic pathways for novel drug combinations.
- Data Teams: Replicate pipeline using DataEngineering.tex.
This project is licensed under the MIT License. See the LICENSE file for details.
For questions or feedback, please contact:
Microsoft Fabric for data orchestration.
Azure AI Search for retrieval.
Azure OpenAI for natural language generation.