Spliced Transcripts Alignment to a Reference
© Alexander Dobin, 2009-2024
https://www.ncbi.nlm.nih.gov/pubmed/23104886
This is a fork of the upstream STAR repository with extended STARsolo functionality for unsorted BAM workflows, tag table export, and enhanced gene annotation.
Alex Dobin, [email protected]
https://github.com/alexdobin/STAR/issues
https://groups.google.com/d/forum/rna-star
This fork adds the following major enhancements to STAR 2.7.11b:
Enables corrected cell barcode (CB) and UMI (UB) tags in unsorted BAM output. When combined with --outSAMtype BAM Unsorted, STAR now captures alignments during pass 1, then replays them in pass 2 after Solo correction to inject accurate barcode tags.
Quick Usage:
STAR --outSAMtype BAM Unsorted \
--soloAddTagsToUnsorted yes \
--soloType CB_UMI_Simple \
--soloFeatures Gene GeneFull \
[other parameters...]

Exports corrected CB/UB assignments to a compact binary sidecar file without rewriting the entire BAM. Useful for audit trails, secondary analysis, or lightweight barcode extraction.
Quick Usage:
STAR --soloWriteTagTable Default \
--soloType CB_UMI_Simple \
--soloFeatures Gene \
[other parameters...]

Two new BAM tags provide more comprehensive gene annotation than the standard GX/GN tags:
- ZG: Comma-separated list of Ensembl gene IDs for all overlapping genes
- ZX: Genomic overlap classification (exonic, intronic, none, spanning)
Quick Usage:
STAR --outSAMattributes NH HI AS nM NM CR CY UR UY GX GN gx gn ZG ZX \
--soloFeatures Gene GeneFull \
--soloStrand Unstranded \
[other parameters...]

Key Benefits:
- 28% more gene annotations than standard GX tags
- 99.8% read coverage vs 79.2% with GX tags
- 100% concordance with existing GX tags
- Production validated on large-scale datasets
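
As a rough illustration of how the ZG and ZX tags described above might be consumed downstream, the sketch below pulls both tags out of one text SAM record (for example, a line produced by samtools view on the BAM). The `ZG:Z:` and `ZX:Z:` prefixes are the ones used elsewhere in this README; the function name and the sample record, including its gene IDs, are made up for the example.

```python
def parse_zg_zx(sam_line):
    """Return (gene_ids, overlap_class) from one tab-separated SAM record."""
    gene_ids, overlap = [], None
    # Optional tags start at the 12th column of a SAM record.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("ZG:Z:"):
            # ZG holds a comma-separated list of gene IDs.
            gene_ids = [g for g in field[5:].split(",") if g]
        elif field.startswith("ZX:Z:"):
            overlap = field[5:]
    return gene_ids, overlap


# Hypothetical record; the gene IDs are placeholders, not real annotations.
record = (
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\t"
    "NH:i:1\tZG:Z:ENSG00000000001,ENSG00000000002\tZX:Z:exonic"
)
genes, overlap = parse_zg_zx(record)
print(genes, overlap)
```

Records without either tag simply come back as an empty list and None, so the function can be mapped over a whole stream of records.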
Allows running the full mapping pipeline while skipping Solo counting/matrix generation. Per-read artifacts (e.g., CB/UB tag injection into unsorted BAM and optional binary tag table) still finalize, while Solo.out/GeneFull remains an empty directory skeleton.
Quick Usage:
STAR --runThreadN 24 \
--genomeDir /path/to/genome \
--readFilesIn R2.fastq.gz R1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outSAMattributes NH HI AS nM NM CR CY UR UY CB UB ZG ZX \
--soloType CB_UMI_Simple \
--soloCBwhitelist /path/to/whitelist.txt \
--soloFeatures Gene GeneFull \
--soloAddTagsToUnsorted yes \
--soloWriteTagTable Default \
--soloSkipProcessing yes \
--outFileNamePrefix output/

Notes:
- Default is --soloSkipProcessing no
- Expect identical Aligned.out.bam and Aligned.out.cb_ub.bin vs a full Solo run, with Solo.out/GeneFull empty in skip mode
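
One way to check the "identical outputs" expectation above is to hash the files from a skip-mode run and a full run. This is a minimal sketch, not part of the fork; the helper names and the two run directories in the usage note are assumptions.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large BAMs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True when both files have identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

With hypothetical output prefixes full_run/ and skip_run/, the check would be `same_bytes("full_run/Aligned.out.bam", "skip_run/Aligned.out.bam")`, and likewise for Aligned.out.cb_ub.bin. If the BAM header records the command line, the hashes may differ even when the alignment records agree, in which case comparing the records alone is the safer test.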
This fork includes opt-in instrumentation and helper scripts to validate STAR’s key stream and replay parity against downstream tools.
- STAR stash logging (env-gated; no overhead unless enabled):
  - STAR_STASH_DEBUG=/path/to/star_stash.tsv writes one TSV row per alignment staged by Solo (phase, qname, CB/UB, sample index/tag, ZG, MAPQ/NH/NM).
  - STAR_SAMPLE_DEBUG=/path/to/star_sample_debug.tsv captures per-emitted key details (CB, sample label/index, dense ID, UMI, gene) at flushGroup().
  - See tools/flex_debug/README_stash_debug.md for headers and usage.
- End-to-end comparison wrapper:
  - tools/flex_debug/scripts/compare_star_bam.sh can regenerate STAR+bam_to_counts stash TSVs, replay STAR’s keys.bin with cr_key_replayer, and diff MEX outputs.
  - Useful flags: --run, --force-align, --force-keys, --replay, --force-replay.
  - To compare replayed MEX vs bam_to_counts while ignoring raw CB16-only columns, add --filter-b2c-mex (keeps 24bp CB+TAG8 only).
  - Internally uses tools/flex_debug/scripts/filter_mex_columns.py to rewrite a filtered MatrixMarket triplet.
- Quantitative MEX comparison:
  - tools/flex_debug/flex/src/python/common/compare_counts.py aligns barcodes/genes and reports gene/cell correlations and MARE. Supports optional probe-set filtering.
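
To make the --filter-b2c-mex idea above concrete, here is a small sketch of dropping raw CB16-only columns from a MEX triplet list and renumbering the survivors. It is not the code in filter_mex_columns.py; the function name and in-memory layout are assumptions, and it relies only on the README's statement that the kept barcodes are the 24bp CB+TAG8 entries.

```python
def filter_columns_by_barcode_length(barcodes, triplets, keep_len=24):
    """Keep only columns whose barcode has keep_len characters.

    barcodes: list of barcode strings, one per matrix column (in order).
    triplets: iterable of (row, col, value) with 1-based indices,
              rows being features and columns being barcodes.
    """
    new_index = {}       # old 1-based column -> new 1-based column
    kept_barcodes = []
    for old_col, barcode in enumerate(barcodes, start=1):
        if len(barcode) == keep_len:
            kept_barcodes.append(barcode)
            new_index[old_col] = len(kept_barcodes)
    kept_triplets = [
        (row, new_index[col], value)
        for row, col, value in triplets
        if col in new_index
    ]
    return kept_barcodes, kept_triplets


# Toy input: one 16bp barcode and two 24bp composites (placeholder sequences).
toy_barcodes = ["A" * 16, "C" * 24, "G" * 24]
toy_triplets = [(1, 1, 5), (2, 2, 3), (1, 3, 7)]
print(filter_columns_by_barcode_length(toy_barcodes, toy_triplets))
```

Renumbering the columns matters because MatrixMarket indices refer to positions in barcodes.tsv, which shrinks when columns are dropped.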
Validates that STAR’s embedded inline replayer (enabled via --soloSkipProcessing yes --soloWriteKeysBin yes --soloUMICorrection clique --soloUseInlineReplayer auto) matches the Stage 1 baseline MEX. Everything lives in new/tests/test_inline_replayer_parity.sh, new/tests/run_star_inline_replayer.sh, and new/tests/generate_baseline.sh.
Quick repo-local sanity run (uses the synthetic fixture under new/tests/fixtures/):
cd /mnt/pikachu/STAR
./new/tests/test_inline_replayer_parity.sh \
--skip-align \
--compare-script "" \
--star-binary "$(command -v true)"

- --compare-script "" skips the optional pandas/scipy dependency wall while still checking byte-for-byte equality.
- --star-binary "$(command -v true)" is a placeholder so you can exercise the script without compiling STAR; point it at source/STAR once built.
- Latest fixture run (Nov 2025) with the command above: PASS. Inline _matrix/_barcodes/_features matched the baseline exactly, outputs under new/tests/fixtures/storage/100K/SC2300771/results/inline_replayer_parity*.
STAR now includes an inline EmptyDrops pipeline that runs OrdMag filtering and the EmptyDrops multinomial test internally, eliminating the need for external Python/R scripts. The pipeline includes tag dominance filtering to handle multi-tag CB16 conflicts.
Quick Usage:
STAR --runMode soloCellFiltering \
<matrix_dir> <output_dir> \
--soloCellFilter EmptyDrops_CR 12000 0.99 10 500 0.01 20000 45000 90000 0.001 2000 \
--soloFeatures Gene \
--soloType CB_UMI_Simple \
--soloCBwhitelist <whitelist> \
--soloUMIlen 12 --soloCBlen 16 \
--soloCBposition 0_0_0_16 --soloUMIposition 0_0_16_12 \
--soloOrdMagMode auto \
--soloDumpOrdmag yes \
--flexEmptyDrops yes \
--soloDominanceRatio 10.0 \
--soloRawPvalueThreshold 0.999

Complete Example:
# Run inline EmptyDrops with dominance filtering on SC2300771 dataset
STAR --runMode soloCellFiltering \
/mnt/pikachu/STAR/tests/100K/SC2300771/results/replay/matrix \
/tmp/star_inline_test \
--soloCellFilter EmptyDrops_CR 12000 0.99 10 500 0.01 20000 45000 90000 0.001 2000 \
--soloFeatures Gene \
--soloType CB_UMI_Simple \
--soloCBwhitelist /mnt/pikachu/storage/scRNAseq_output/whitelists/737K-fixed-rna-profiling.txt \
--soloUMIlen 12 --soloCBlen 16 \
--soloCBposition 0_0_0_16 --soloUMIposition 0_0_16_12 \
--soloOrdMagMode auto \
--soloDumpOrdmag yes \
--flexEmptyDrops yes \
--soloDominanceRatio 10.0 \
--soloRawPvalueThreshold 0.999
# Check outputs
ls -lh /tmp/star_inline_test/Solo.out/Gene/filtered/OrdMag/
ls -lh /tmp/star_inline_test/Solo.out/Gene/filtered/InlineDrops/
# View dominance statistics in log
grep "Dominance check" /tmp/star_inline_test/Log.out
# Compare final passing barcodes
wc -l /tmp/star_inline_test/Solo.out/Gene/filtered/InlineDrops/passing_barcodes.txt

Key Flags:
- --flexEmptyDrops yes: REQUIRED - Enables Flex mode and dominance filtering
- --soloDominanceRatio 10.0: Enable dominance filtering (defaults to 10.0 when flexEmptyDrops enabled)
- --soloDominanceRatio 0: Disable dominance filtering
- --soloOrdMagMode auto: Enable OrdMag filtering
- --soloDumpOrdmag yes: Write OrdMag outputs
- --soloRawPvalueThreshold 0.999: EmptyDrops p-value threshold
Pipeline Flow:
- OrdMag filtering (per-tag) → Simple cells + candidates
- Dominance check (per-tag) → Filters ambiguous CB16s (multi-tag conflicts)
- EmptyDropsMultinomial (per-tag) → Multinomial test with p-value threshold
Outputs:
- Solo.out/Gene/filtered/OrdMag/: OrdMag filtering outputs
- Solo.out/Gene/filtered/InlineDrops/: EmptyDrops outputs (passing_barcodes.txt, pvalues.csv)
Test Results: Validated against Python/R reference with 99.97% overlap (12,412 / 12,416 barcodes). See new/docs/dominance_filtering_test_results.md for details.
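
The overlap figure above can be reproduced for your own runs with a few lines of Python. This sketch assumes the percentage is the number of shared barcodes divided by the size of the reference set; the README does not spell out the denominator, and the function name is made up for the example.

```python
def barcode_overlap(candidate, reference):
    """Return (shared_count, shared_count / len(reference))."""
    candidate, reference = set(candidate), set(reference)
    shared = len(candidate & reference)
    fraction = shared / len(reference) if reference else 0.0
    return shared, fraction


# Arithmetic check of the reported figure: 12,412 of 12,416 barcodes.
print(f"{12412 / 12416:.2%}")
```

In practice `candidate` would be the lines of Solo.out/Gene/filtered/InlineDrops/passing_barcodes.txt and `reference` the barcode list from the Python/R run.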
The STAR fork is typically paired with ordmag-style preprocessing when handling 10x Flex datasets. That pipeline does not invoke Cell Ranger's JIBES EM model; instead it runs a lightweight Poisson–multinomial dominance test independently for each sample/TAG:
- Reads are partitioned by canonical TAG8. For each sample branch we tally per-CB16 UMI counts split into “on-tag” (matches the branch TAG) and “off-tag” observations.
- Routing into each branch is gated by a simple dominance test: for sample S, its candidate UMI count must exceed the weakest TAG still under consideration by at least 10× (--cr-max-min-ratio default = 10). Only then do we assign the read to S; otherwise it remains unassigned.
- Given the branch’s expected contamination rate and recovered-cell target, we evaluate the Poisson–multinomial tail probability that the observed off-tag mass is due to cross-talk. If the on-tag dominance exceeds the fixed threshold the CB16 is marked as a singlet for that sample; otherwise it is labeled multiplet/ambiguous and dropped for that branch only.
- Because branches are independent, a CB16 can legitimately pass multiple samples when it shows strong on-tag signal for more than one TAG8. Only after this per-sample gating do we run per-sample EmptyDrops and merge the resulting CB/TAG composites into the final allow-list consumed by bam_to_counts.
There is no EM loop, tie ratio heuristic, or additional guard rail—just the closed-form Poisson–multinomial test implemented in tools/flex_debug/scripts/run_ordmag_nonambient.py. Downstream, bam_to_counts now accepts that composite allow-list (24 bp CB+TAG8 entries) and re-materializes barcodes one-to-one with the filtered CB/TAG pairs, running optional clique-based UR correction before writing a new MEX.
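
For intuition about the tail probability mentioned above, the sketch below uses a plain Poisson upper tail as a simplified stand-in for the Poisson–multinomial test. It is not the code in run_ordmag_nonambient.py: the function names, the way the expected off-tag count is derived from a contamination rate, and the example numbers are all assumptions.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Accumulate P(X = k) for k = 0 .. observed - 1, then take the complement.
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def off_tag_tail(on_tag, off_tag, contamination_rate):
    """Tail probability of the off-tag count if it were pure cross-talk.

    Assumption: cross-talk contributes contamination_rate * total UMIs
    on average for this CB16 in this branch.
    """
    expected_off = contamination_rate * (on_tag + off_tag)
    return poisson_upper_tail(off_tag, expected_off)


# Hypothetical counts: 10 off-tag UMIs are plausible cross-talk at a 1% rate,
# 100 off-tag UMIs are not.
print(off_tag_tail(990, 10, 0.01), off_tag_tail(900, 100, 0.01))
```

A small tail probability means the off-tag mass is larger than cross-talk alone would explain. The direct summation is fine for the modest expected counts in this example but is not meant for very large means.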
Why duplicates remain: Because the ratio gate and Poisson test are applied per branch, it is perfectly legal (and expected) for a CB16 to appear in both BC004 and BC006 if each branch sees a strong on-tag majority in its own routed reads. The filters are intentionally tolerant of this scenario so we do not throw away valid multiplet-free data; deduplication happens later when we aggregate CB/TAG composites during UMI correction.
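
The per-branch behaviour described above can be illustrated with a toy ratio gate. This is a simplified sketch, assuming the gate compares on-tag against off-tag UMIs within each branch's own routed reads; the function name, barcode, and counts are hypothetical.

```python
def passes_ratio_gate(on_tag, off_tag, ratio=10.0):
    """True when on-tag UMIs are at least `ratio` times the off-tag UMIs."""
    return on_tag >= ratio * off_tag


# Each branch tallies its own (on_tag, off_tag) counts for the same CB16.
branch_counts = {
    "BC004": {"AAACCCAAGAAACACT": (120, 4)},
    "BC006": {"AAACCCAAGAAACACT": (95, 2)},
}

cb16 = "AAACCCAAGAAACACT"
passing = sorted(
    branch
    for branch, counts in branch_counts.items()
    if passes_ratio_gate(*counts[cb16])
)
print(passing)  # ['BC004', 'BC006']
```

Because the gate is evaluated independently per branch, nothing stops the same CB16 from passing in both. With a ratio of 0 the comparison always holds, so every CB16 passes the gate.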
Testing workflow (typical):
- Run and replay
  STAR_STASH_DEBUG=/tmp/star.tsv STAR_SAMPLE_DEBUG=/tmp/sample.tsv \
  tools/flex_debug/scripts/compare_star_bam.sh --run --force-align --force-keys --replay --force-replay --filter-b2c-mex
- Inspect MEX diffs and metrics
  - barcodes.tsv/features.tsv should match; matrix.mtx may differ only by triplet ordering (content-equal).
  - For metrics:
    python3 tools/flex_debug/flex/src/python/common/compare_counts.py --cr-gene /storage/.../bam_to_counts_out --fm-gene /storage/.../alignment/replay_mex --outprefix /tmp/replay_vs_b2c
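
Since matrix.mtx files may differ only in triplet ordering, a plain diff can report false mismatches. The sketch below compares two MatrixMarket coordinate files as sets of entries instead. It is an illustration rather than one of the fork's scripts, and it assumes the first three columns of each data line are row, column, and value.

```python
def read_triplets(path):
    """Return (dims, entries) for a MatrixMarket coordinate file.

    dims is the size line as a tuple of strings; entries is a set of
    (row, col, value) tuples, so ordering in the file does not matter.
    """
    dims = None
    entries = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip header/comment lines and blanks
            fields = line.split()
            if dims is None:
                dims = tuple(fields)  # first non-comment line: rows cols nnz
                continue
            entries.add((int(fields[0]), int(fields[1]), float(fields[2])))
    return dims, entries


def content_equal(path_a, path_b):
    """True when both files have the same size line and the same entries."""
    return read_triplets(path_a) == read_triplets(path_b)
```

Calling `content_equal` on the replayed and bam_to_counts matrix.mtx files then answers the "content-equal" question directly, provided barcodes.tsv and features.tsv already match.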
Detailed notes and findings:
- new/docs/star_bam_stash_comparison.md — STAR vs bam_to_counts stash findings and index harmonization.
- tools/flex_debug/README_stash_debug.md — stash/sample debug TSVs and flags.
- new/docs/debug_instrumentation.md — broader debug instrumentation overview.
Located in new/docs/:
- TECHNICAL_NOTES.md - Complete technical implementation details for all fork features
- ZG_ZX_Implementation_Summary.md - ZG/ZX tag technical documentation with code references
- two_pass_unsorted_usage.md - Usage guide for unsorted BAM workflows
- CHANGES_FORK.md - Detailed changelog for fork modifications
- RELEASEnotes.md - Fork release notes and validation results
- memory_testing_guide.md - AddressSanitizer testing procedures
- debug_instrumentation.md - Debug logging and validation
- STARmanual.pdf - Original STAR manual
- CHANGES.md - Upstream STAR changes and release history
Located in new/scripts/:
- runSTAR.sh - Production script with comprehensive ZG/ZX configuration
- runSTAR_debug.sh - Debug-enabled run with monitoring
- validate_zg_zx.py - Validation script for ZG/ZX tag correctness
Located in new/tests/:
- emit_test.sh - Binary tag stream and unsorted BAM validation
- integration_test.sh - End-to-end CB/UB patching tests
- mem_test_tags.sh - AddressSanitizer memory safety tests
- build_probe_txome_fixture.sh - Build a tiny probe-only transcriptome for tests
- probe_lines_test.sh - Requires a BAM path argument; analyzes probe-specific lines
To enable transcriptome-dependent tests, build a small genomeDir and set STAR_TXOME_DIR:
./new/tests/build_probe_txome_fixture.sh \
--fasta /path/to/probes_only.fa \
--gtf /path/to/probes_only.gtf \
--out /tmp/probe_txome --n 2
export STAR_TXOME_DIR=/tmp/probe_txome
unset STAR_TEST_SKIP_TXOME
make -C source test

To skip them in CI:
export STAR_TEST_SKIP_TXOME=1
make -C source test

- x86-64 compatible processors
- 64 bit Linux or Mac OS X
- At least 16GB RAM for mammalian genomes (32GB recommended)
/
├── source/ # All source files required for compilation
├── bin/ # Pre-compiled executables for Linux and Mac OS X
├── doc/ # Upstream STAR documentation
├── extras/ # Miscellaneous files and scripts
├── tools/ # Binary tag decoder and utilities
└── new/ # Fork-only material
├── docs/ # Fork documentation (see TECHNICAL_NOTES.md)
├── scripts/ # Production and validation scripts
├── tests/ # Test harnesses
├── plans/ # Development planning documents
└── testing/ # Test data and outputs
cd source
make STAR

For processors without AVX support:
make STAR CXXFLAGS_SIMD=sse

# Install brew and gcc
brew install gcc
# Build STAR (adjust gcc version path as needed)
cd source
make STARforMacStatic CXX=/usr/local/Cellar/gcc/8.2.0/bin/g++-8
# Install to system path
cp STAR /usr/local/bin

If g++ is not on the path:
cd source
make STAR CXX=/path/to/g++

# Platform-specific optimization
make CXXFLAGSextra=-march=native
# With link-time optimization
make LDFLAGSextra=-flto CXXFLAGSextra="-flto -march=native"

cd source
ASAN=1 make clean
ASAN=1 make STAR
# Run memory tests
cd ..
ASAN_OPTIONS="detect_leaks=1" ./new/tests/mem_test_tags.sh

Note: ASan builds require ~3x more RAM and run 2-5x slower than production builds.
STAR --runThreadN 24 \
--genomeDir /path/to/genome \
--readFilesIn R2.fastq.gz R1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outSAMattributes NH HI AS nM NM CR CY UR UY \
--soloType CB_UMI_Simple \
--soloCBwhitelist /path/to/whitelist.txt \
--soloFeatures Gene GeneFull \
--soloAddTagsToUnsorted yes \
--outFileNamePrefix output/

STAR --runThreadN 24 \
--genomeDir /path/to/genome \
--readFilesIn R2.fastq.gz R1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--soloType CB_UMI_Simple \
--soloCBwhitelist /path/to/whitelist.txt \
--soloFeatures Gene \
--soloWriteTagTable Default \
--outFileNamePrefix output/

STAR --runThreadN 24 \
--genomeDir /path/to/genome \
--readFilesIn R2.fastq.gz R1.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM Unsorted \
--outSAMattributes NH HI AS nM NM CR CY UR UY GX GN gx gn ZG ZX \
--soloType CB_UMI_Simple \
--soloCBwhitelist /path/to/whitelist.txt \
--soloFeatures Gene GeneFull \
--soloStrand Unstranded \
--soloAddTagsToUnsorted yes \
--soloWriteTagTable Default \
--quantMode GeneCounts \
--outFileNamePrefix output/

# Use production debug script
./new/scripts/runSTAR_debug.sh
# Or enable debug logging manually
STAR_DEBUG_TAG=1 STAR [parameters...]

# View first 10 reads with ZG/ZX tags
samtools view output/Aligned.out.bam | grep -E "ZG:Z:|ZX:Z:" | head -10
# Extract ZG/ZX tags for analysis
samtools view output/Aligned.out.bam | awk '{
for(i=1;i<=NF;i++)
if($i~/^ZG:Z:/ || $i~/^ZX:Z:/)
print $1"\t"$i
}' > zg_zx_tags.txt
# Validate ZG/ZX tags
python3 new/scripts/validate_zg_zx.py output/Aligned.out.bam allowed_genes.txt

# Compile decoder
cd tools
make
# Decode tag table
./decode_tag_binary ../output/Aligned.out.cb_ub.bin > decoded_tags.txt

This release has been tested with default parameters for human and mouse genomes. Mammalian genomes require at least 16GB of RAM, ideally 32GB.
Fork-Specific Notes:
- Two-pass unsorted BAM requires sufficient disk space for temporary files
- Tag table binary format is specific to this fork; use provided decoder tool
- ZG/ZX tags are BAM-only (not emitted in SAM format)
- GeneFull must be in --soloFeatures for ZG/ZX tags to populate
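
The last two notes above lend themselves to a quick pre-flight check on a STAR command line before launching a long run. The sketch below is a hypothetical helper, not something shipped with the fork; it only inspects the --outSAMattributes, --soloFeatures, and --outSAMtype options that appear in this README.

```python
def option_values(argv, flag):
    """Collect the values that follow `flag`, up to the next --option."""
    if flag not in argv:
        return []
    values = []
    for token in argv[argv.index(flag) + 1:]:
        if token.startswith("--"):
            break
        values.append(token)
    return values


def zg_zx_preflight(argv):
    """Return a list of problems with ZG/ZX-related options (empty if none)."""
    problems = []
    attrs = set(option_values(argv, "--outSAMattributes"))
    if not attrs & {"ZG", "ZX"}:
        return problems  # ZG/ZX not requested, nothing to check
    if "GeneFull" not in option_values(argv, "--soloFeatures"):
        problems.append("ZG/ZX requested but GeneFull is missing from --soloFeatures")
    if option_values(argv, "--outSAMtype")[:1] == ["SAM"]:
        problems.append("ZG/ZX are BAM-only; --outSAMtype SAM will not emit them")
    return problems


# Example command line (split into tokens); it is missing GeneFull.
cmd = "STAR --outSAMtype BAM Unsorted --outSAMattributes NH HI ZG ZX --soloFeatures Gene".split()
print(zg_zx_preflight(cmd))
```

An empty list means none of the checked constraints were violated; it does not validate the rest of the command.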
STAR can be installed on FreeBSD via the FreeBSD ports system:
pkg install star

Note: FreeBSD ports may not include fork-specific features. Compile from source for full functionality.
Docker build scripts are available in new/scripts/:
./new/scripts/docker_build.sh
./new/scripts/runSTAR_docker.sh

For issues or contributions related to:
- Upstream STAR: Use https://github.com/alexdobin/STAR/issues
- Fork-specific features: Contact the fork maintainer or create issues in the fork repository
See LICENSE file for details.
The development of STAR is supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG009318. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.