sbabyanusha/GU-Testing


🧬 Curation Assistant Tool

A Streamlit app for bioinformatics curation that combines:

  • 📄 Document summarization & Q&A (PDF, DOCX, TXT, XLSX)
  • 🖼️ Figure extraction from PDFs (with OCR text search)
  • 📊 Gene frequency calculation from supplementary tables
  • 🔎 Gene lookup across cBioPortal-style files (data_mutations.txt, data_cna.txt, data_sv.txt)

Quickstart

1. Clone this repo

git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>

2. Create environment & install dependencies

python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Set your OpenAI API key

You'll need an API key from OpenAI.

export OPENAI_API_KEY="sk-..."

(On Windows PowerShell:)

$env:OPENAI_API_KEY="sk-..."
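To confirm the key is visible to Python before launching the app, a small check like the one below can help. This is a minimal sketch: the helper name `get_openai_key` is hypothetical, and how new_app.py itself reads the key is not shown in this README.

```python
import os


def get_openai_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` defaults to the process environment; pass a plain dict when testing.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in this shell before running the app."
        )
    return key


if __name__ == "__main__":
    # Print only the length so the secret never lands in logs or screenshots.
    print(f"Key found ({len(get_openai_key())} characters).")
```

Run it in the same terminal session where you exported the variable, since the setting does not carry over to new shells.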

4. Run the app

streamlit run new_app.py

The app will open in your browser at http://localhost:8501.


Deployment on Streamlit Cloud

  1. Push this repo to GitHub.
  2. Go to Streamlit Cloud.
  3. Connect your repo and select new_app.py as the app file.
  4. Add your OpenAI API key as a Secret in Streamlit Cloud (OPENAI_API_KEY).
  5. Deploy. You'll get a shareable public link like:
https://your-username-your-repo.streamlit.app

📂 File Structure

your-repo/
├── new_app.py          # Main Streamlit app
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── runtime.txt         # (optional) Python version for Streamlit Cloud
├── apt.txt             # (optional) system packages, e.g. for OCR
└── .streamlit/
    └── config.toml     # (optional) custom Streamlit theme

Features

  • Summarization & Q&A

    • Grounded strictly in uploaded files
    • Supports PDFs, Word, TXT, Excel
    • Cites file name, sheet, and page numbers
  • Figure Extraction

    • Extracts embedded images from PDFs
    • OCR text indexing for searchable captions
  • Gene Frequencies

    • Upload supplementary tables (.csv/.tsv/.txt/.xlsx)
    • Compute frequencies automatically or with a custom denominator
    • Download results as CSV
  • Gene Lookup

    • Search for single or multiple genes (e.g., TP53, EGFR, KRAS)
    • Supports Hugo_Symbol in mutation/CNA files
    • Supports Site1_Hugo_Symbol / Site2_Hugo_Symbol in SV files
    • Reports per-file sample counts and percentages
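The Gene Frequencies feature above can be illustrated with a short standard-library sketch. It assumes that a gene's frequency is the number of distinct samples carrying that gene divided by a denominator, which defaults to the number of distinct samples in the table. The function names and the default column names (`Hugo_Symbol`, `Tumor_Sample_Barcode`) are illustrative assumptions, and the app's actual calculation may differ.

```python
import csv
import io
from collections import defaultdict


def read_table(text, delimiter="\t"):
    """Parse delimited text with a header row into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def gene_frequencies(rows, gene_col="Hugo_Symbol",
                     sample_col="Tumor_Sample_Barcode", denominator=None):
    """Return {gene: (n_samples, percent)}, counting each sample once per gene.

    With denominator=None, the number of distinct samples seen in the table
    is used (assumed meaning of "automatic"); otherwise the given value is used.
    """
    samples_by_gene = defaultdict(set)
    all_samples = set()
    for row in rows:
        gene = (row.get(gene_col) or "").strip()
        sample = (row.get(sample_col) or "").strip()
        if not gene or not sample:
            continue  # skip rows missing either field
        samples_by_gene[gene].add(sample)
        all_samples.add(sample)
    total = denominator if denominator is not None else len(all_samples)
    if total <= 0:
        raise ValueError("denominator must be positive")
    return {
        gene: (len(samples), 100.0 * len(samples) / total)
        for gene, samples in sorted(samples_by_gene.items())
    }


def to_csv(freqs):
    """Serialize the frequency dict as CSV text for download."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["gene", "n_samples", "percent"])
    for gene, (n, pct) in freqs.items():
        writer.writerow([gene, n, f"{pct:.1f}"])
    return out.getvalue()


if __name__ == "__main__":
    table = (
        "Hugo_Symbol\tTumor_Sample_Barcode\n"
        "TP53\tS1\nTP53\tS2\nTP53\tS1\nKRAS\tS2\nEGFR\tS3\n"
    )
    rows = read_table(table)
    print(to_csv(gene_frequencies(rows)))                  # automatic denominator
    print(to_csv(gene_frequencies(rows, denominator=10)))  # custom denominator
```

A custom denominator is useful when the table lists only altered samples, so the cohort size has to come from elsewhere, such as the clinical sample file.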

Requirements

  • Python 3.10+
  • Internet access (for OpenAI API calls)

Install from requirements.txt:

pip install -r requirements.txt

🙋 Usage Example

  1. Upload your cBioPortal supplementary files:

    • data_mutations.txt
    • data_cna.txt
    • data_sv.txt
    • data_clinical_sample.txt
  2. Query genes:

    TP53, EGFR, KRAS
    
  3. See sample counts & % frequencies directly in the app.
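The lookup in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's implementation. The helper names are hypothetical, and the sample-ID columns (`Tumor_Sample_Barcode` for mutations, `Sample_Id` for SVs) are assumed from common cBioPortal layouts. The cohort size is passed in directly rather than derived from data_clinical_sample.txt, and data_cna.txt is left out because it uses a gene-by-sample matrix rather than one row per event.

```python
import csv


def parse_gene_query(query):
    """Split 'TP53, EGFR, KRAS' into an ordered, de-duplicated list of symbols."""
    genes = []
    for token in query.replace(";", ",").split(","):
        symbol = token.strip().upper()
        if symbol and symbol not in genes:
            genes.append(symbol)
    return genes


def read_tsv(text):
    """Parse tab-separated text into dict rows, skipping '#' comment lines."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def samples_with_gene(rows, gene, gene_cols, sample_col):
    """Return the set of sample IDs whose row names `gene` in any of gene_cols."""
    hits = set()
    for row in rows:
        symbols = {(row.get(col) or "").strip().upper() for col in gene_cols}
        sample = (row.get(sample_col) or "").strip()
        if gene in symbols and sample:
            hits.add(sample)
    return hits


def lookup(genes, mutation_rows, sv_rows, total_samples):
    """Return {gene: {file_name: (n_samples, percent)}} for each queried gene."""
    report = {}
    for gene in genes:
        per_file = {
            "data_mutations.txt": samples_with_gene(
                mutation_rows, gene, ["Hugo_Symbol"], "Tumor_Sample_Barcode"),
            "data_sv.txt": samples_with_gene(
                sv_rows, gene, ["Site1_Hugo_Symbol", "Site2_Hugo_Symbol"], "Sample_Id"),
        }
        report[gene] = {
            name: (len(s), round(100.0 * len(s) / total_samples, 1))
            for name, s in per_file.items()
        }
    return report


if __name__ == "__main__":
    mutations = read_tsv(
        "Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tS1\nTP53\tS2\nKRAS\tS2\n")
    svs = read_tsv(
        "Sample_Id\tSite1_Hugo_Symbol\tSite2_Hugo_Symbol\nS3\tEML4\tALK\nS1\tTP53\t\n")
    for gene, counts in lookup(parse_gene_query("TP53, EGFR, KRAS"),
                               mutations, svs, total_samples=4).items():
        print(gene, counts)
```

Symbols are compared case-insensitively, and each sample counts once per file even if it has several events in the same gene.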


👨‍💻 Author

Built by Baby Anusha Satravada · Bioinformatics Software Engineer
Memorial Sloan Kettering Cancer Center (MSK)

