diff --git a/.github/workflows/draft-pdf.yml b/.github/workflows/draft-pdf.yml new file mode 100644 index 00000000..4c9be2a5 --- /dev/null +++ b/.github/workflows/draft-pdf.yml @@ -0,0 +1,28 @@ +name: Draft PDF +on: + push: + paths: + - joss/** + - .github/workflows/draft-pdf.yml + +jobs: + paper: + runs-on: ubuntu-latest + name: Paper Draft + steps: + - name: Checkout + uses: actions/checkout@v4 + - name: Build draft PDF + uses: openjournals/openjournals-draft-action@master + with: + journal: joss + # This should be the path to the paper within your repo. + paper-path: joss/paper.md + - name: Upload + uses: actions/upload-artifact@v4 + with: + name: paper + # This is the output path where Pandoc will write the compiled + # PDF. Note, this should be the same directory as the input + # paper.md + path: joss/paper.pdf \ No newline at end of file diff --git a/README.md b/README.md index 838e10b8..06337d79 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ [![CodeQL](https://github.com/MichaelCurrin/badge-generator/workflows/CodeQL/badge.svg)](https://github.com/KoslickiLab/YACHT/actions?query=workflow%3ACodeQL "Code quality workflow status") [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/KoslickiLab/YACHT/blob/main/LICENSE.txt) -YACHT is a mathematically rigorous hypothesis test for the presence or absence of organisms in a metagenomic sample, based on average nucleotide identity (ANI). +YACHT is a mathematically rigorous hypothesis test for the presence or absence of organisms in a metagenomic sample, based on Average Nucleotide Identity (ANI). Identifying whether a specific microbe is actually present in a metagenomic sample is often complicated by sequencing noise, low-abundance organisms, and high genomic similarity between species. Traditional profiling tools rely on simple thresholds that can lead to high false-positive rates. 
Various cohorts can utilize YACHT: microbiome researchers dealing with low-biomass samples, synthetic biologists needing to validate the composition of mock communities, and genomics researchers identifying specific metagenome-assembled genomes (MAGs) of interest within vast sequencing datasets. The associated publication can be found here: https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873 @@ -17,7 +17,7 @@ Please cite via:
-## Quick start +## Quick demonstration We provide a demo to show how to use YACHT. Please follow the command lines below to try it out: ```bash @@ -89,8 +89,7 @@ conda install -c conda-forge -c bioconda yacht ``` ### Manual installation -YACHT requires Python 3.6 or higher and Conda. We recommend using a virtual environment to ensure a clean and isolated workspace. This can be accomplished using either [Conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [Mamba](https://github.com/mamba-org/mamba) (a faster alternative to Conda). - +YACHT requires **Python >3.6** (and <3.12) with the following core genomics dependencies: `sourmash` (>=4.8.3), `sourmash_plugin_branchwater`, and `pytaxonkit`. The full list of dependencies can be found in the [environment configuration](https://github.com/KoslickiLab/YACHT/blob/main/env/yacht_env.yml). To ensure a clean and isolated workspace, we recommend using a virtual environment. This can be accomplished using either [Conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [Mamba](https://github.com/mamba-org/mamba), a faster alternative to Conda. 
#### Using Conda To create your Conda environment and install YACHT, follow these steps: diff --git a/joss/paper.bib b/joss/paper.bib new file mode 100644 index 00000000..01ea5e65 --- /dev/null +++ b/joss/paper.bib @@ -0,0 +1,358 @@ +@article{tian2025designed, + title={A designed synthetic microbiota provides insight to community function in Clostridioides difficile resistance}, + author={Tian, Shuchang and Kim, Min Soo and Zhao, Jingcheng and Heber, Kerim and Hao, Fuhua and Koslicki, David and Tian, Sangshan and Singh, Vishal and Patterson, Andrew D and Bisanz, Jordan E}, + journal={Cell Host \& Microbe}, + year={2025}, + doi={10.1016/j.chom.2025.02.007}, + publisher={Elsevier} +} +@article{hera2023deriving, + title={Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash}, + author={Hera, Mahmudur Rahman and Pierce-Ward, N Tessa and Koslicki, David}, + journal={Genome research}, + volume={33}, + number={7}, + pages={1061--1068}, + year={2023}, + doi={10.1101/gr.277651.123}, + publisher={Cold Spring Harbor Lab} +} +@article{blanca2022statistics, + title={The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches}, + author={Blanca, Antonio and Harris, Robert S and Koslicki, David and Medvedev, Paul}, + journal={Journal of Computational Biology}, + volume={29}, + number={2}, + pages={155--168}, + year={2022}, + doi={10.1089/cmb.2021.0431}, + publisher={Mary Ann Liebert, Inc., publishers 140 Huguenot Street, 3rd Floor New~…} +} +@article{irber2024sourmash, + title={sourmash v4: A multitool to quickly search, compare, and analyze genomic and metagenomic data sets}, + author={Irber, Luiz and Pierce-Ward, N Tessa and Abuelanin, Mohamed and Alexander, Harriet and Anant, Abhishek and Barve, Keya and Baumler, Colton and Botvinnik, Olga and Brooks, Phillip and Dsouza, Daniel and others}, + journal={Journal of Open Source Software}, + volume={9}, + number={98}, + 
pages={6830}, + year={2024}, + doi={10.21105/joss.06830} +} + +@article{koslicki2024yacht, + title={YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample}, + author={Koslicki, David and White, Stephen and Ma, Chunyu and Novikov, Alexei}, + journal={Bioinformatics}, + volume={40}, + number={2}, + pages={btae047}, + year={2024}, + doi={10.1093/bioinformatics/btae047}, + publisher={Oxford University Press} +} + +@article{ward2018metapoap, + title={MetaPOAP: presence or absence of metabolic pathways in metagenome-assembled genomes}, + author={Ward, Lewis M and Shih, Patrick M and Fischer, Woodward W}, + journal={Bioinformatics}, + volume={34}, + number={24}, + pages={4284--4286}, + year={2018}, + doi={10.1093/bioinformatics/bty510}, + publisher={Oxford University Press} +} + + +@article{marcelino2019metatranscriptomics, + title={Metatranscriptomics as a tool to identify fungal species and subspecies in mixed communities--a proof of concept under laboratory conditions}, + author={Marcelino, Vanesa R and Irinyi, Laszlo and Eden, John-Sebastian and Meyer, Wieland and Holmes, Edward C and Sorrell, Tania C}, + journal={IMA fungus}, + volume={10}, + pages={1--10}, + year={2019}, + doi={10.1186/s43008-019-0012-8}, + publisher={Springer} +} + +@article{pereira2024metatranscriptomics, + title={A metatranscriptomics strategy for efficient characterization of the microbiome in human tissues with low microbial biomass}, + author={Pereira-Marques, Joana and Ferreira, Rui M and Figueiredo, Ceu}, + journal={Gut Microbes}, + volume={16}, + number={1}, + pages={2323235}, + year={2024}, + doi={10.1080/19490976.2024.2323235}, + publisher={Taylor \& Francis} +} + +@article{godlewska2020metagenomic, + title={Metagenomic studies in inflammatory skin diseases}, + author={Godlewska, Urszula and Brzoza, Piotr and Kwiecie{\'n}, Kamila and Kwitniewski, Mateusz and Cichy, Joanna}, + journal={Current Microbiology}, + volume={77}, + pages={3201--3212}, + 
year={2020}, + doi={10.1007/s00284-020-02163-4}, + publisher={Springer} +} + +@article{mande2012classification, + title={Classification of metagenomic sequences: methods and challenges}, + author={Mande, Sharmila S and Mohammed, Monzoorul Haque and Ghosh, Tarini Shankar}, + journal={Briefings in bioinformatics}, + volume={13}, + number={6}, + pages={669--681}, + year={2012}, + doi={10.1093/bib/bbs054}, + publisher={Oxford University Press} +} + +@article{shakya2013comparative, + title={Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities}, + author={Shakya, Migun and Quince, Christopher and Campbell, James H and Yang, Zamin K and Schadt, Christopher W and Podar, Mircea}, + journal={Environmental microbiology}, + volume={15}, + number={6}, + pages={1882--1899}, + year={2013}, + doi={10.1111/1462-2920.12086}, + publisher={Wiley Online Library} +} + +@article{sczyrba2017critical, + title={Critical assessment of metagenome interpretation—a benchmark of metagenomics software}, + author={Sczyrba, Alexander and Hofmann, Peter and Belmann, Peter and Koslicki, David and Janssen, Stefan and Dr{\"o}ge, Johannes and Gregor, Ivan and Majda, Stephan and Fiedler, Jessika and Dahms, Eik and others}, + journal={Nature methods}, + volume={14}, + number={11}, + pages={1063--1071}, + year={2017}, + doi={10.1038/nmeth.4458}, + publisher={Nature Publishing Group US New York} +} + +@article{meyer2022critical, + title={Critical assessment of metagenome interpretation: the second round of challenges}, + author={Meyer, Fernando and Fritz, Adrian and Deng, Zhi-Luo and Koslicki, David and Lesker, Till Robin and Gurevich, Alexey and Robertson, Gary and Alser, Mohammed and Antipov, Dmitry and Beghini, Francesco and others}, + journal={Nature methods}, + volume={19}, + number={4}, + pages={429--440}, + year={2022}, + doi={10.1038/s41592-022-01431-4}, + publisher={Nature Publishing Group US New York} +} + +%USE CASE EXAMPLE 
REFERENCES +@article{hayden2022genome, + title={Genome capture sequencing selectively enriches bacterial DNA and enables genome-wide measurement of intrastrain genetic diversity in human infections}, + author={Hayden, Hillary S and Joshi, Snehal and Radey, Matthew C and Vo, Anh T and Forsberg, Cara and Morgan, Sarah J and Waalkes, Adam and Holmes, Elizabeth A and Klee, Sara M and Emond, Mary J and others}, + journal={Mbio}, + volume={13}, + number={5}, + pages={e01424--22}, + year={2022}, + doi={10.1128/mbio.01424-22}, + publisher={Am Soc Microbiol} +} + +@article{rajeev2023metagenome, + title={Metagenome sequencing and recovery of 444 metagenome-assembled genomes from the biofloc aquaculture system}, + author={Rajeev, Meora and Jung, Ilsuk and Lim, Yeonjung and Kim, Suhyun and Kang, Ilnam and Cho, Jang-Cheon}, + journal={Scientific data}, + volume={10}, + number={1}, + pages={707}, + year={2023}, + doi={10.1038/s41597-023-02622-0}, + publisher={Nature Publishing Group UK London} +} + +@article{zhang2022cultivation, + title={Cultivation and functional characterization of a deep-sea Lentisphaerae representative reveals its unique physiology and ecology}, + author={Zhang, Tianhang and Zheng, Rikuan and Liu, Rui and Li, Ronggui and Sun, Chaomin}, + journal={Frontiers in Marine Science}, + volume={9}, + pages={848136}, + year={2022}, + doi={10.3389/fmars.2022.848136}, + publisher={Frontiers Media SA} +} + +@article{schloss2020removal, + title={Removal of rare amplicon sequence variants from 16S rRNA gene sequence surveys biases the interpretation of community structure data}, + author={Schloss, Patrick D}, + journal={bioRxiv}, + year={2020}, + doi={10.1101/2020.12.11.422279}, + publisher={Cold Spring Harbor Laboratory} +} + +@article{jia2022sequencing, + title={Sequencing introduced false positive rare taxa lead to biased microbial community diversity, assembly, and interaction interpretation in amplicon studies}, + author={Jia, Yangyang and Zhao, Shengguo and Guo, 
Wenjie and Peng, Ling and Zhao, Fang and Wang, Lushan and Fan, Guangyi and Zhu, Yuanfang and Xu, Dayou and Liu, Guilin and others}, + journal={Environmental Microbiome}, + volume={17}, + number={1}, + pages={1--18}, + year={2022}, + doi={10.1186/s40793-022-00436-y}, + publisher={Springer} +} + +@article{kunin2008bioinformatician, + title={A bioinformatician's guide to metagenomics}, + author={Kunin, Victor and Copeland, Alex and Lapidus, Alla and Mavromatis, Konstantinos and Hugenholtz, Philip}, + journal={Microbiology and molecular biology reviews}, + volume={72}, + number={4}, + pages={557--578}, + year={2008}, + doi={10.1128/MMBR.00009-08}, + publisher={Am Soc Microbiol} +} + +@article{schlaberg2017validation, + title={Validation of metagenomic next-generation sequencing tests for universal pathogen detection}, + author={Schlaberg, Robert and Chiu, Charles Y and Miller, Steve and Procop, Gary W and Weinstock, George and Professional Practice Committee and Committee on Laboratory Practices of the American Society for Microbiology and Microbiology Resource Committee of the College of American Pathologists}, + journal={Archives of Pathology and Laboratory Medicine}, + volume={141}, + number={6}, + pages={776--786}, + year={2017}, + doi={10.5858/arpa.2016-0539-RA}, + publisher={the College of American Pathologists} +} + +@article{loeffler2020improving, + title={Improving the usability and comprehensiveness of microbial databases}, + author={Loeffler, Caitlin and Karlsberg, Aaron and Martin, Lana S and Eskin, Eleazar and Koslicki, David and Mangul, Serghei}, + journal={BMC biology}, + volume={18}, + pages={1--6}, + year={2020}, + doi={10.1186/s12915-020-0756-z}, + publisher={Springer} +} + +@article{marcelino2020ccmetagen, + title={CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data}, + author={Marcelino, Vanessa R and Clausen, Philip TLC and Buchmann, Jan P and Wille, Michelle and Iredell, Jonathan R and Meyer, 
Wieland and Lund, Ole and Sorrell, Tania C and Holmes, Edward C}, + journal={Genome biology}, + volume={21}, + pages={1--15}, + year={2020}, + doi={10.1186/s13059-020-02014-2}, + publisher={Springer} +} + +%Schloss PD. Removal of rare amplicon sequence variants from 16s rrna gene sequence surveys biases the interpretation of community structure data. bioRxiv, 2020, preprint: not peer reviewed. https://doi.org/10.1101/2020.12.11.422279. + +@article{Schloss, + title={Removal of rare amplicon sequence variants from 16s rrna gene sequence surveys biases the interpretation of community structure data}, + author={Patrick D. Schloss}, + journal={bioRxiv}, + doi={10.1101/2020.12.11.422279}, + year={2020} +} + +@article{jia2022sequencing, + title={Sequencing introduced false positive rare taxa lead to biased microbial community diversity, assembly, and interaction interpretation in amplicon studies}, + author={Jia, Yangyang and Zhao, Shengguo and Guo, Wenjie and Peng, Ling and Zhao, Fang and Wang, Lushan and Fan, Guangyi and Zhu, Yuanfang and Xu, Dayou and Liu, Guilin and others}, + journal={Environmental Microbiome}, + volume={17}, + number={1}, + pages={43}, + year={2022}, + doi={10.1186/s40793-022-00436-y}, + publisher={Springer} +} + +@article{jousset2017less, + title={Where less may be more: how the rare biosphere pulls ecosystems strings}, + author={Jousset, Alexandre and Bienhold, Christina and Chatzinotas, Antonis and Gallien, Laure and Gobet, Ang{\'e}lique and Kurm, Viola and K{\"u}sel, Kirsten and Rillig, Matthias C and Rivett, Damian W and Salles, Joana F and others}, + journal={The ISME journal}, + volume={11}, + number={4}, + pages={853--862}, + year={2017}, + doi={10.1038/ismej.2016.174}, + publisher={Oxford University Press} +} + +@article{hu2022tenebrionibacter, + title={Tenebrionibacter intestinalis gen. nov., sp. 
nov., a member of a novel genus of the family Enterobacteriaceae, isolated from the gut of the plastic-eating mealworm Tenebrio molitor L.}, + author={Hu, Lin and Yang, Yu}, + journal={International Journal of Systematic and Evolutionary Microbiology}, + volume={72}, + number={2}, + pages={005246}, + year={2022}, + doi={10.1099/ijsem.0.005246}, + publisher={Microbiology Society} +} + +@article{hardwick2018synthetic, + title={Synthetic microbe communities provide internal reference standards for metagenome sequencing and analysis}, + author={Hardwick, Simon A and Chen, Wendy Y and Wong, Ted and Kanakamedala, Bindu S and Deveson, Ira W and Ongley, Sarah E and Santini, Nadia S and Marcellin, Esteban and Smith, Martin A and Nielsen, Lars K and others}, + journal={Nature communications}, + volume={9}, + number={1}, + pages={3096}, + year={2018}, + doi={10.1038/s41467-018-05555-0}, + publisher={Nature Publishing Group UK London} +} + +@article{singer2016next, + title={Next generation sequencing data of a defined microbial mock community}, + author={Singer, Esther and Andreopoulos, Bill and Bowers, Robert M and Lee, Janey and Deshpande, Shweta and Chiniquy, Jennifer and Ciobanu, Doina and Klenk, Hans-Peter and Zane, Matthew and Daum, Christopher and others}, + journal={Scientific data}, + volume={3}, + number={1}, + pages={1--8}, + year={2016}, + doi={10.1038/sdata.2016.81}, + publisher={Nature Publishing Group} +} + +@article{van2023synthetic, + title={Synthetic microbial communities (SynComs) of the human gut: design, assembly, and applications}, + author={van Leeuwen, Pim T and Brul, Stanley and Zhang, Jianbo and Wortel, Meike T}, + journal={FEMS Microbiology Reviews}, + volume={47}, + number={2}, + pages={fuad012}, + year={2023}, + doi={10.1093/femsre/fuad012}, + publisher={Oxford University Press} +} + +@article{Irber2022FracMinHash, + author = {Irber, Luiz C. and Brooks, Patrick T. and Reiter, Travis E. and Pierce-Ward, Nathan T. and Hera, M. R. 
and Koslicki, David and Brown, C. Titus}, + title = {Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers}, + journal = {bioRxiv}, + year = {2022}, + doi = {10.1101/2022.01.11.475838}, + url = {https://doi.org/10.1101/2022.01.11.475838} +} + +@article{Singer2016MockCommunity, + author = {Singer, Esther and Bushnell, Brian and Coleman-Derr, Devin and Douglas, Gina M. and Bowman, Benjamin and Bowers, Robert M. and Levy, Adi and Gies, Esther and Cheng, Jan-Fang and Copeland, Alex and others}, + title = {Next generation sequencing data of a defined microbial mock community}, + journal = {Scientific Data}, + volume = {3}, + number = {1}, + pages = {1--8}, + year = {2016}, + publisher = {Nature Publishing Group} +} +@book{irber2020decentralizing, + title={Decentralizing indices for genomic data}, + author={Irber Jr, Luiz Carlos}, + year={2020}, + publisher={University of California, Davis} +} diff --git a/joss/paper.md b/joss/paper.md new file mode 100644 index 00000000..0ce5a430 --- /dev/null +++ b/joss/paper.md @@ -0,0 +1,123 @@ +--- +title: 'YACHT: Software for an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample' +tags: + - Python + - c++ + - metagenomics + - microbial +authors: + - name: Maksym Lupei + orcid: 0000-0003-3440-3919 + equal-contrib: true + affiliation: 1 + - name: Shaopeng Liu + orcid: 0000-0003-3112-4068 + equal-contrib: true + affiliation: 2 + - name: Chunyu Ma + orcid: 0000-0001-9731-5153 + equal-contrib: true + affiliation: 1 + - name: Adam Park + orcid: 0009-0003-4104-2960 + equal-contrib: true + affiliation: 1 + - name: Omar Hesham Rady + orcid: 0009-0005-1819-7643 + equal-contrib: true + affiliation: 1 + - name: Mahmudur Rahman Hera + orcid: 0000-0002-5992-9012 + equal-contrib: true + affiliation: 1 + - name: Judith S. Rodriguez + orcid: 0000-0002-5109-3054 + equal-contrib: true + affiliation: 2 + - name: Stephanie J. 
Won + orcid: 0000-0001-7288-5395 + equal-contrib: true + affiliation: 3 + - name: David Koslicki + orcid: 0000-0002-0640-954X + corresponding: true + affiliation: "1, 2, 3" +affiliations: + - name: School of Electrical Engineering and Computer Science, Pennsylvania State University, USA + index: 1 + ror: 04p491231 + - name: Huck Institutes of the Life Sciences, Pennsylvania State University, USA + index: 2 + ror: 04p491231 + - name: Department of Biology, Pennsylvania State University, USA + index: 3 + ror: 04p491231 +date: 10 February 2025 +bibliography: paper.bib +--- + +# Summary + +In metagenomics, identifying genomes present in a sample is an important initial task, but is complicated by taxonomic profiling tools lacking uncertainty quantification and using incomplete reference databases missing exact genome matches. YACHT (**Y**es/No **A**nswers to **C**ommunity membership via **H**ypothesis **T**esting) [@koslicki2024yacht] is a command-line tool for taxonomic profiling that uses binomial hypothesis testing on exclusive k-mers to confidently determine genome presence/absence in a metagenomic sample. YACHT assists in discovering rare microbiomes by identifying low-abundant species missed in other taxonomic profiling approaches while also controlling the false negative rate. Its statistical model overcomes challenges in sequencing coverage and incomplete genomes, making it ideal for diverse metagenomic applications, including functional profiling, metatranscriptomics, and clinical microbiome analysis. + +YACHT presents a robust, $k$-mer sketching-based statistical framework for accurately detecting genetic similarity between the reference database and the metagenomic sample by incorporating evolutionary sequence divergence through the average nucleotide identity (ANI) and sequencing coverage to enable efficient detection of sampled genomes. The workflow for YACHT includes the following commands. 
To begin, `yacht sketch` creates reduced representation "sketches" of the reference and sample datasets enabling swift comparisons. Then, `yacht train` is used to find a representative of closely related reference genomes using ANI. Lastly, `yacht run` uses the YACHT algorithm to perform hypothesis testing and identify the presence or absence of species. YACHT is developed with C++ and Python and depends on `sourmash` [@irber2024sourmash], a program for extracting and managing $k$-mers. + +# Statement of need + +Accurately identifying and characterizing microbial communities with low relative abundance is a significant challenge in metagenomics. The current profiling-based practice involves setting arbitrary filter thresholds or discarding low-abundance data without robust justification, which can compromise profiling accuracy and lead to misinterpretations [@schloss2020removal; @jia2022sequencing]. Even with such filtering, the results remain inherently arbitrary because they are influenced by biological complexities such as sequencing errors and evolutionary processes. The lack of a systematic approach to establishing credibility in these results diminishes researchers' confidence in biologically informed methods for identifying rare microorganisms, thereby undermining metagenomic studies. Moreover, these difficulties are exacerbated by the incompleteness of reference databases and the variability in sequencing coverage depth, underscoring the need for statistically credible approaches. + +Metagenomic methods rely on existing genome references to detect and classify microbial organisms. However, these reference databases are often incomplete, and conventional metrics may not always align with traditional taxonomic frameworks that account for genomic changes. 
Consequently, microbes that carry mutations or have diverged evolutionarily can remain undetected, causing inaccuracies in microbial community profiling and misinterpretation of data [@kunin2008bioinformatician; @schlaberg2017validation; @loeffler2020improving; @marcelino2020ccmetagen]. Hence, analytical frameworks need to incorporate genome similarity metrics to capture the full breadth of microbial diversity and to provide an accurate, interpretable picture of microbiome dynamics. However, incomplete databases alone do not account for all metagenomic challenges; sequence coverage depth also contributes to the resolution and reliability of microbial detection and characterization. + +Sequence coverage depth, defined as the portion of a microbe’s genome detected in a sample, is crucial for detecting low-abundance microbes. However, sequencing processes often fail to achieve complete coverage of all genomes in a sample due to limited sequencing depth. As a result, rare or low-abundance taxa may exhibit low sequence coverage, leading to their misinterpretation as noise rather than genuine observations [@mande2012classification; @shakya2013comparative; @sczyrba2017critical; @meyer2022critical]. Furthermore, the lack of guidelines for establishing a biologically meaningful coverage depth threshold introduces subjectivity and inconsistency into metagenomic analyses. Therefore, implementing dynamic coverage depth thresholds tailored to varying abundance levels is essential for delivering accurate metagenomic studies. Yet, even if coverage depth and incomplete reference databases are addressed, ensuring proper control over statistical errors remains another major challenge. + +Existing metagenomic methods lack the statistical rigor to control false positives and false negatives effectively. High false positive rates misrepresent microbial composition and lead to biased conclusions, undermining research reliability. 
Conversely, high false negative rates cause researchers to overlook important taxa, especially those in low abundance that often carry significant biological importance [@jousset2017less]. Incomplete reference databases, sequencing errors, and evolutionary divergence between reference and sample genomes further complicate these challenges. Therefore, maintaining appropriate control over these statistical error rates is critical to ensure more confident, reliable biological inferences and minimize the risk of misinterpretation. Because limitations in reference databases, sequence coverage depth, and statistical error control each pose significant challenges, metagenomic analysis demands a multifaceted approach to profile microbial communities accurately. + +To address these challenges, YACHT offers a statistical framework that can accurately determine the presence or absence of a microbial genome in a sample through hypothesis testing. The algorithm’s mathematical model accounts for evolutionary sequence divergence and incomplete sequencing depth by utilizing genome similarity and minimum sequencing depth parameters. It employs the FracMinHash sketching technique [@irber2020decentralizing; @Irber2022FracMinHash], an alignment-free $k$-mer approach, facilitating fast and accurate genome detection that can efficiently process large datasets. YACHT ensures precise detection of low-abundance taxa with a user-defined false negative rate, minimizing the risk of misinterpreting the result. Our approach can be used for other metagenomic applications such as functional profiling, metatranscriptomic studies [@marcelino2019metatranscriptomics], metabolic potential analyses [@ward2018metapoap; @pereira2024metatranscriptomics], and the characterization of low-abundance clinical metagenomic samples such as skin [@godlewska2020metagenomic]. 
YACHT enhances metagenomic analysis by reducing reliance on arbitrary thresholds, improving the interpretability of the result without compromising biological relevance, and allowing researchers to differentiate genuine observations from “noise” with statistical confidence. + +# Workflow + +The YACHT workflow involves four primary steps. First, `yacht sketch` creates compact representations ("sketches") of the reference genomes and the sample using `sourmash`. Second, `yacht train` preprocesses the reference genomes, merging those with high average nucleotide identity (ANI) into a single representative. Third, `yacht run` executes the core YACHT algorithm to perform hypothesis testing and determine the presence or absence of organisms. Finally, `yacht convert` transforms the results into popular output formats such as CAMI, BIOM, and GraPhlAn. + +![The YACHT workflow illustrated with the four primary stages: sketching, training, running, and converting. \label{fig:workflow}](workflow.png) + +As outlined in the workflow in **Figure 1**, YACHT requires two primary inputs: a pre-trained reference configuration (JSON) and a sketched sample signature. See the [repository](https://github.com/KoslickiLab/YACHT/) for a detailed step-by-step workflow. + +### Output examples +The `yacht run` output provides a presence or absence decision for each organism, as shown in **Table 1** below. For each organism, columns such as `num_matches` and `acceptance_threshold` are reported, indicating the number of $k$-mers found and the minimum number required for the organism to be considered present, respectively. The `Presence` column then reports `TRUE` or `FALSE` based on this comparison. 
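Reviewer note on the "Output examples" paragraph: the comparison it describes can be sketched in a few lines of Python. This is only an illustration built from the three rows of Table 1. The `Row` class and `is_present` helper are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption read from the phrase "minimum required", not something taken from the YACHT source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One line of `yacht run` output (hypothetical container, subset of columns)."""
    organism: str
    num_matches: int           # exclusive k-mers of this organism found in the sample
    acceptance_threshold: int  # minimum matches needed to report the organism as present


def is_present(row: Row) -> bool:
    # Assumption: the threshold is inclusive, i.e. meeting it exactly counts as present.
    return row.num_matches >= row.acceptance_threshold


# Values copied from Table 1 of the paper draft.
TABLE_1 = [
    Row("Sediminispirochaeta", 2572, 895),
    Row("Natronobacterium", 700, 638),
    Row("Echinicola", 244, 978),
]

calls = {row.organism: is_present(row) for row in TABLE_1}
for organism, present in calls.items():
    print(f"{organism}: {'TRUE' if present else 'FALSE'}")
```

Under that assumption the sketch reproduces the `Presence` column of Table 1: the first two organisms exceed their thresholds and Echinicola does not.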
+ +\begin{table}[ht] +\centering +\small % Reduce font size for this table +\setlength{\tabcolsep}{4pt} % Shrink column padding +\begin{tabular}{lccp{2.6cm}p{3.2cm}} +\toprule +\textbf{Organism} & \textbf{Presence} & \textbf{num\_matches} & \textbf{acceptance\_threshold} & \textbf{alt\_confidence\_mut\_rate} \\ +\midrule +Sediminispirochaeta & TRUE & 2572 & 895 & 0.053008659 \\ +Natronobacterium & TRUE & 700 & 638 & 0.053534755 \\ +Echinicola & FALSE & 244 & 978 & 0.052885411 \\ +\bottomrule +\end{tabular} +\caption{YACHT results for Sediminispirochaeta, Natronobacterium, and Echinicola. For each organism, a subset of the output is shown: whether the organism passed the presence threshold (Presence), the number of exclusive $k$-mer matches (num\_matches), the expected minimum number of matches (acceptance\_threshold), and an alternative confidence estimate for the mutation rate (alt\_confidence\_mut\_rate). Echinicola is not reported as present, while Sediminispirochaeta and Natronobacterium are reported as present because their match counts meet the acceptance threshold. Results were generated using the MBARC-26 dataset (SRA: SRR6394747 by @Singer2016MockCommunity) with YACHT parameters: $k$-size of 31, minimum coverage of 0.05, and ANI threshold of 0.95. Please refer to the use case examples for more information.} +\end{table} + + + +| | +|--| +| | + +# Use case examples + +We present three use case examples to demonstrate the application of YACHT for identifying taxa in microbiome studies: (i) analyzing low-abundance metagenomic samples that are common in clinical settings, (ii) performing MAG fishing to detect specific metagenome-assembled genomes, and (iii) evaluating synthetic microbial communities to identify the presence of specific organisms. + +**Low abundance samples:** YACHT can analyze metagenomic samples with low microbial DNA concentrations, which are common in clinical and environmental studies. 
In this use case example, we adjust the ANI threshold and $k$-size to balance sensitivity and specificity, with higher values increasing stringency and refining species resolution. Using a human skin metagenomic sample, we show that these parameters markedly influence species reporting, highlighting the need for careful threshold selection. For more information, refer to [Low abundance samples](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/low_abundance_samples). + +**Metagenome-assembled genome (MAG) fishing:** YACHT can be employed to search for specific MAGs of interest within a sample by using a single MAG as the training reference database. Applying this approach to two skin metagenomic samples shows that detection strength varies with sequencing depth and coverage. This use case example illustrates how MAG fishing with YACHT is sensitive to coverage and parameter choice, emphasizing the importance of sequencing depth when assessing MAG presence. Find further detail in [MAG fishing](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/MAG_fishing). + +**Synthetic metagenomes:** YACHT can assess the construction of mock or synthetic microbial communities to verify that the designed microbes are present. Using a synthetic community from the literature, we show that the ANI threshold influences accuracy: higher ANI thresholds recover most expected genomes, while lower ones can introduce false positives. This further highlights how parameter choice, particularly ANI and minimum coverage, affects sensitivity and specificity when validating synthetic community composition. For additional information, refer to [Synthetic metagenomes](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/synthetic_metagenome). + +# Acknowledgements +We thank the contributors and collaborators who supported the development of YACHT. This work was supported in part by the National Institutes of Health (NIH) under grant number 5R01GM146462-03. 
+ +# References diff --git a/joss/workflow.png b/joss/workflow.png new file mode 100644 index 00000000..a7b195e4 Binary files /dev/null and b/joss/workflow.png differ
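Reviewer note on the FracMinHash sketching mentioned in the "Statement of need" of `joss/paper.md`: for readers unfamiliar with the technique, the idea can be illustrated roughly as below. This is a minimal sketch under stated assumptions and not how `sourmash` or YACHT implement it. It hashes with `hashlib.blake2b` rather than the hash function `sourmash` uses, it skips canonical (reverse-complement aware) $k$-mers, and the helper names `kmers`, `hash_kmer`, `frac_minhash`, and `containment` are hypothetical.

```python
import hashlib
from typing import Iterator, Set

MAX_HASH = 2**64  # size of the 64-bit hash space used in this illustration


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (blake2b is an assumption made for this sketch)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, scaled: int) -> Set[int]:
    """Keep only hashes in the lowest 1/scaled fraction of the hash space."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query: Set[int], reference: Set[int]) -> float:
    """Fraction of the query sketch that also appears in the reference sketch."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


genome = "ACGTACGTTGCAACGTTAGCCGATCGATCGGATCCA"  # toy sequence, not real data
full = frac_minhash(genome, k=4, scaled=1)    # scaled=1 keeps every distinct k-mer hash
sparse = frac_minhash(genome, k=4, scaled=4)  # keeps roughly a quarter of them
print(len(full), len(sparse), containment(sparse, full))
```

Because the same hash threshold is applied to every sequence, sketches built with the same `k` and `scaled` can be compared directly, which is what lets a small sketch stand in for a full set of $k$-mers.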