-
Notifications
You must be signed in to change notification settings - Fork 1
Parse kraken by indentation #23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors the Kraken2 report parsing logic to use indentation-based depth calculation instead of rank-based hierarchical tracking. The change simplifies the lineage tracking by inferring taxonomic depth from leading spaces in the scientific name field.
Key Changes:
- Replaced rank-based parsing with indentation-based depth calculation
- Simplified lineage stack management using depth instead of rank order
- Removed rank filtering and global rank order dictionaries
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
scripts/summarize_kraken2_reports.py |
Refactored parsing to calculate depth from indentation, simplified lineage tracking logic, and moved Snakemake execution code into conditional block |
scripts/test_summarize_kraken2_reports.py |
Updated import path and added assertion for debugging |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| reports, _ = reports | ||
| report = reports[0] | ||
| parsed_report = parse_kraken2_tsv_report(open(report)) | ||
| assert False |
Copilot
AI
Oct 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assert False statement will cause the test to always fail. Remove this debug assertion before merging.
| assert False |
|
|
||
| def consensus_lineage_str(lineage_stack: list[str]) -> str: | ||
| # Fill missing levels with empty placeholders if needed | ||
| full_lineage = lineage_stack + [f"__"] * (max(7 - len(lineage_stack), 0)) |
Copilot
AI
Oct 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 7 represents the expected taxonomic levels but is unexplained. Consider defining it as a named constant (e.g., EXPECTED_TAXONOMIC_LEVELS = 7) to improve code clarity.
| if len(lineage_stack) <= depth: | ||
| lineage_stack.extend(["__"] * (depth - len(lineage_stack) + 1)) | ||
| lineage_stack = lineage_stack[: depth + 1] | ||
| lineage_stack[depth] = f"__{name}" |
Copilot
AI
Oct 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lineage format is inconsistent with the placeholder format. Placeholders use '__' but actual entries use '__{name}', creating strings like '__Species_name'. This should likely be f'{rank}__{name}' to match standard taxonomic notation (e.g., 's__Species_name'), but the rank information is no longer available in the current implementation.
No description provided.