
Conversation


@jtnystrom jtnystrom commented Jul 29, 2025

Addresses issue #14.

This forgoes the FASTA and FASTQ readers from Fastdoop (although the IndexedFastaReader is kept) and instead uses Spark's text file reader together with window functions. This simplifies the code base and lets us use Spark's built-in gz and bz2 support.
Compression is detected by file name suffix; supported names match a pattern like *.(fq|fastq|fa|fasta)(.gz|.bz2)?, e.g. sample.fasta.bz2.
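To make the suffix rule concrete, here is a minimal sketch (hypothetical code, not the project's actual implementation) of the detection described above: a regex over the sequence-format extension plus an optional compression extension.

```python
import re

# Hypothetical sketch of the suffix-based detection described in the PR:
# a sequence format extension, optionally followed by a compression
# extension. Compression suffixes must be lowercase, matching Spark's
# suffix-based codec detection.
PATTERN = re.compile(r".*\.(fq|fastq|fa|fasta)(\.gz|\.bz2)?$")

def detect(name: str):
    """Return (format, compression) for a matching file name, else None."""
    m = PATTERN.match(name)
    if m is None:
        return None
    fmt = "fastq" if m.group(1) in ("fq", "fastq") else "fasta"
    comp = m.group(2).lstrip(".") if m.group(2) else None
    return fmt, comp
```

Note that a name like sample.fasta.GZ would not match, which is the case-sensitivity caveat raised below in this thread.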

Reading compressed files is slightly slower (especially .gz, which, unlike .bz2, is not splittable), but this is a small cost in the bigger picture of sample classification, where traversing the index is the most expensive step.

This also generates compressed .fa and .fq files in InputReaderProps to test these file format readers.

@jtnystrom

One potential problem is that this (Spark's compression detection mechanism) is sensitive to file suffixes (.gz, .bz2, etc.), which are also case-sensitive (they must be lowercase). We may need a way to override them in the future.
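One workaround along the lines suggested above could be to normalize suffixes before handing paths to Spark. The sketch below is purely illustrative (the function name and approach are assumptions, not anything in this PR), and only rewrites the path string; the actual files would still need to be renamed or linked to match.

```python
# Hypothetical sketch: map uppercase or mixed-case compression suffixes
# to their lowercase forms, so that Spark's suffix-based codec detection
# recognizes them. Illustration only; does not touch the file on disk.
def normalize_suffix(path: str) -> str:
    for ext in (".gz", ".bz2"):
        if path.lower().endswith(ext) and not path.endswith(ext):
            return path[: -len(ext)] + ext
    return path
```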

@jtnystrom jtnystrom merged commit 0abfd9c into master Aug 15, 2025
6 checks passed
@jtnystrom jtnystrom deleted the jtnystrom/compressed_input branch August 15, 2025 05:20