
Conversation


@jtnystrom jtnystrom commented Jul 29, 2025

Addresses issue #14.

This forgoes the FASTA and FASTQ readers from Fastdoop (although the IndexedFastaReader is kept) and instead uses Spark's text file reader together with window functions. This simplifies the code base and lets us use Spark's built-in gz and bz2 support.
Compression is detected by file name suffix; supported names match a pattern like *.(fq|fastq|fa|fasta)(.gz|.bz2)?, e.g. sample.fasta.bz2.
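To make the suffix rule concrete, here is a minimal sketch (hypothetical code, not the project's actual implementation) of the detection described above: a regex over the sequence-format extension plus an optional compression extension.

```python
import re

# Hypothetical sketch of the suffix-based detection described in the PR:
# a sequence format extension, optionally followed by a compression
# extension. Compression suffixes must be lowercase, matching Spark's
# suffix-based codec detection.
PATTERN = re.compile(r".*\.(fq|fastq|fa|fasta)(\.gz|\.bz2)?$")

def detect(name: str):
    """Return (format, compression) for a matching file name, else None."""
    m = PATTERN.match(name)
    if m is None:
        return None
    fmt = "fastq" if m.group(1) in ("fq", "fastq") else "fasta"
    comp = m.group(2).lstrip(".") if m.group(2) else None
    return fmt, comp
```

Note that a name like sample.fasta.GZ would not match, which is the case-sensitivity caveat raised below in this thread.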

Reading compressed files is slightly slower (especially .gz, which, unlike .bz2, is not splittable), but this is a small cost in the bigger picture of sample classification, where traversing the index is the most expensive step.

This also generates compressed .fa and .fq files in InputReaderProps to test these file format readers.

@jtnystrom

One potential problem is that this (Spark's compression detection mechanism) is sensitive to file suffixes (.gz, .bz2, etc.), which are also case-sensitive (they must be lowercase). We may need a way to override them in the future.
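One workaround along the lines suggested above could be to normalize suffixes before handing paths to Spark. The sketch below is purely illustrative (the function name and approach are assumptions, not anything in this PR), and only rewrites the path string; the actual files would still need to be renamed or linked to match.

```python
# Hypothetical sketch: map uppercase or mixed-case compression suffixes
# to their lowercase forms, so that Spark's suffix-based codec detection
# recognizes them. Illustration only; does not touch the file on disk.
def normalize_suffix(path: str) -> str:
    for ext in (".gz", ".bz2"):
        if path.lower().endswith(ext) and not path.endswith(ext):
            return path[: -len(ext)] + ext
    return path
```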

@jtnystrom jtnystrom merged commit 0abfd9c into master Aug 15, 2025
6 checks passed
@jtnystrom jtnystrom deleted the jtnystrom/compressed_input branch August 15, 2025 05:20