🧬 Curation Assistant Tool

A Streamlit app for bioinformatics curation that combines:

📄 Document summarization & Q&A (PDF, DOCX, TXT, XLSX)
🖼️ Figure extraction from PDFs (with OCR text search)
📊 Gene frequency calculation from supplementary tables
🔎 Gene lookup across cBioPortal-style files (data_mutations.txt, data_cna.txt, data_sv.txt)

Quickstart

1. Clone this repo

git clone https://github.com/<your-username>/<your-repo>.git
cd <your-repo>

2. Create environment & install dependencies

python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Set your OpenAI API key

You’ll need an API key from OpenAI.

export OPENAI_API_KEY="sk-..."

On Windows PowerShell, use this instead:

$env:OPENAI_API_KEY="sk-..."
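
To confirm the key is visible to Python before launching the app, a small check like the one below can help. This is a minimal sketch, not code from new_app.py; the helper name `require_api_key` is hypothetical.

```python
import os


def require_api_key(environ=os.environ):
    """Return the OpenAI key from the environment, or fail with a clear message.

    Hypothetical helper for illustration; the app may read the key differently.
    """
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in this shell before running the app"
        )
    return key
```

Calling `require_api_key()` in the same shell where you ran the export should return the key without raising.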

4. Run the app

streamlit run new_app.py

The app will open in your browser at http://localhost:8501.

Deployment on Streamlit Cloud

1. Push this repo to GitHub.
2. Go to Streamlit Cloud.
3. Connect your repo and select new_app.py as the app file.
4. Add your OpenAI API key as a Secret in Streamlit Cloud (OPENAI_API_KEY).
5. Deploy. You'll get a shareable public link like:

https://your-username-your-repo.streamlit.app
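
Because the key lives in a Secret on Streamlit Cloud and in an environment variable locally, one way to support both is to check the secrets store first and fall back to the environment. The sketch below assumes this lookup order; `resolve_api_key` is a hypothetical helper, and whether new_app.py does anything similar is not confirmed here.

```python
import os


def resolve_api_key(secrets=None, environ=None, name="OPENAI_API_KEY"):
    """Look up the key in a secrets mapping first, then in the environment.

    `secrets` is any mapping-like object; passing Streamlit's `st.secrets`
    here is an assumption about how the app would be wired.
    """
    environ = os.environ if environ is None else environ
    for source in (secrets, environ):
        if source is not None and name in source and source[name]:
            return source[name]
    return None  # Caller decides how to report a missing key
```

Inside the app this could be called as `resolve_api_key(st.secrets)`. Depending on the Streamlit version, touching `st.secrets` without a secrets file may raise an error, so a local run may need to pass `None` instead.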

📂 File Structure

your-repo/
├── new_app.py          # Main Streamlit app
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── runtime.txt         # (optional) Python version for Streamlit Cloud
├── apt.txt             # (optional) system packages, e.g. for OCR
└── .streamlit/
    └── config.toml     # (optional) custom Streamlit theme

Features

Summarization & Q&A
- Grounded strictly in uploaded files
- Supports PDF, DOCX, TXT, and XLSX files
- Cites file name, sheet, and page numbers
Figure Extraction
- Extracts embedded images from PDFs
- OCR text indexing for searchable captions
Gene Frequencies
- Upload supplementary tables (.csv/.tsv/.txt/.xlsx)
- Compute frequencies automatically or with a custom denominator
- Download results as CSV
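
The frequency calculation can be sketched as follows. This is an illustration under stated assumptions, not the app's actual code: it assumes a tab-delimited table with `Hugo_Symbol` and `Tumor_Sample_Barcode` columns, counts each sample once per gene, and uses the number of distinct samples in the table when no denominator is given. The function name `gene_frequencies` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def gene_frequencies(handle, gene_col="Hugo_Symbol",
                     sample_col="Tumor_Sample_Barcode",
                     denominator=None, delimiter="\t"):
    """Return {gene: (sample_count, percent)} for a long-format table.

    Column names are assumptions; adjust them to match the uploaded table.
    """
    samples_by_gene = defaultdict(set)
    all_samples = set()
    for row in csv.DictReader(handle, delimiter=delimiter):
        gene = (row.get(gene_col) or "").strip()
        sample = (row.get(sample_col) or "").strip()
        if not gene or not sample:
            continue  # Skip incomplete rows
        samples_by_gene[gene].add(sample)  # A set counts each sample once
        all_samples.add(sample)
    # Default denominator: distinct samples seen in this table
    total = len(all_samples) if denominator is None else denominator
    if total <= 0:
        raise ValueError("denominator must be positive")
    return {gene: (len(s), 100.0 * len(s) / total)
            for gene, s in sorted(samples_by_gene.items())}


table = "Hugo_Symbol\tTumor_Sample_Barcode\nTP53\tS1\nTP53\tS2\nKRAS\tS1\nTP53\tS1\n"
print(gene_frequencies(io.StringIO(table)))
print(gene_frequencies(io.StringIO(table), denominator=4))
```

The default denominator only sees samples that appear in the table, so samples with no alterations are not counted. Passing the full cohort size as `denominator` avoids inflated percentages.
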
Gene Lookup
- Search for single or multiple genes (e.g., TP53, EGFR, KRAS)
- Supports Hugo_Symbol in mutation/CNA files
- Supports Site1_Hugo_Symbol / Site2_Hugo_Symbol in SV files
- Reports per-file sample counts and percentages
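
A gene lookup over the long-format files might look like the sketch below. It is an illustration, not the app's implementation: the sample-ID column names (`Tumor_Sample_Barcode`, `Sample_Id`) and the helper names are assumptions, and matching is made case-insensitive by choice. data_cna.txt is typically a gene-by-sample matrix and would need separate handling that this sketch omits.

```python
import csv
import io

GENE_COLUMNS = ("Hugo_Symbol", "Site1_Hugo_Symbol", "Site2_Hugo_Symbol")
SAMPLE_COLUMNS = ("Tumor_Sample_Barcode", "Sample_Id")  # Assumed column names


def samples_with_gene(handle, genes):
    """Return {GENE: set of sample IDs} for a tab-delimited, long-format file."""
    wanted = {g.strip().upper() for g in genes}
    hits = {g: set() for g in wanted}
    # Skip leading comment lines such as "#version 2.4"
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    fields = reader.fieldnames or []
    gene_cols = [c for c in GENE_COLUMNS if c in fields]
    sample_col = next((c for c in SAMPLE_COLUMNS if c in fields), None)
    if not gene_cols or sample_col is None:
        raise ValueError("no recognised gene or sample column")
    for row in reader:
        for col in gene_cols:
            gene = (row[col] or "").strip().upper()
            if gene in wanted:
                hits[gene].add(row[sample_col])
    return hits


def summarize(hits, total_samples):
    """Turn sample sets into (count, percent) pairs for one file."""
    return {g: (len(s), 100.0 * len(s) / total_samples) for g, s in hits.items()}
```

For SV files a sample counts as a hit if the gene appears at either breakpoint site. The `total_samples` value is supplied by the caller, for example the number of samples in the study.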

Requirements

Python 3.10+
Internet access (for OpenAI API calls)

Install from requirements.txt:

pip install -r requirements.txt

🙋 Usage Example

Upload your cBioPortal supplementary files:
- data_mutations.txt
- data_cna.txt
- data_sv.txt
- data_clinical_sample.txt
Query genes:
```
TP53, EGFR, KRAS
```
See sample counts & % frequencies directly in the app.
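
A query string like the one above has to be split into individual symbols before lookup. The sketch below shows one plausible way to do that; `parse_gene_query` is a hypothetical name, and upper-casing the symbols assumes the lookup is case-insensitive.

```python
import re


def parse_gene_query(text):
    """Split a free-text query into unique gene symbols, keeping input order."""
    seen, genes = set(), []
    # Accept commas, semicolons, and any whitespace as separators
    for token in re.split(r"[,;\s]+", text):
        symbol = token.upper()  # Assumption: matching ignores case
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes
```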

👨‍💻 Author

Built by Baby Anusha Satravada · Bioinformatics Software Engineer
Memorial Sloan Kettering Cancer Center (MSK)
