STAR 2.7.11b (Fork)

Spliced Transcripts Alignment to a Reference
© Alexander Dobin, 2009-2024
https://www.ncbi.nlm.nih.gov/pubmed/23104886

This is a fork of the upstream STAR repository with extended STARsolo functionality for unsorted BAM workflows, tag table export, and enhanced gene annotation.

UPSTREAM AUTHOR/SUPPORT

Alex Dobin, [email protected]
https://github.com/alexdobin/STAR/issues
https://groups.google.com/d/forum/rna-star

FORK ENHANCEMENTS

This fork adds four major enhancements to STAR 2.7.11b:

1. Two-Pass Unsorted BAM with CB/UB Tags (`--soloAddTagsToUnsorted`)

Enables corrected cell barcode (CB) and UMI (UB) tags in unsorted BAM output. When combined with --outSAMtype BAM Unsorted, STAR now captures alignments during pass 1, then replays them in pass 2 after Solo correction to inject accurate barcode tags.

Quick Usage:

STAR --outSAMtype BAM Unsorted \
     --soloAddTagsToUnsorted yes \
     --soloType CB_UMI_Simple \
     --soloFeatures Gene GeneFull \
     [other parameters...]
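
The capture-and-replay idea described above can be pictured with a small Python sketch. It is only an illustration under assumptions: the record layout, the `corrected` lookup keyed by read name, and the function name `replay_with_tags` are hypothetical and do not mirror the fork's C++ internals.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (read_name, remaining SAM fields without CB/UB)
Record = Tuple[str, List[str]]


def replay_with_tags(
    captured: Iterable[Record],
    corrected: Dict[str, Tuple[str, str]],
) -> List[str]:
    """Pass 2: re-emit captured alignments, appending CB/UB when known."""
    out = []
    for read_name, fields in captured:
        tags = list(fields)
        if read_name in corrected:
            cb, ub = corrected[read_name]
            tags += [f"CB:Z:{cb}", f"UB:Z:{ub}"]
        out.append("\t".join([read_name] + tags))
    return out


# Pass 1: alignments are captured before barcode correction has finished.
captured = [("read1", ["0", "chr1", "100"]), ("read2", ["16", "chr2", "200"])]
# After Solo correction: only read1 received a corrected barcode/UMI.
corrected = {"read1": ("AAACCCAAGAAACACT", "TTTGGGCCCAAA")}

lines = replay_with_tags(captured, corrected)
```

The point of the toy is the ordering: tags can only be attached once correction has seen all reads, which is why the alignments have to be held back and replayed.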

2. Binary Tag Table Export (`--soloWriteTagTable`)

Exports corrected CB/UB assignments to a compact binary sidecar file without rewriting the entire BAM. Useful for audit trails, secondary analysis, or lightweight barcode extraction.

Quick Usage:

STAR --soloWriteTagTable Default \
     --soloType CB_UMI_Simple \
     --soloFeatures Gene \
     [other parameters...]

3. Custom Gene Annotation Tags (ZG/ZX)

Two new BAM tags provide gene annotation beyond the standard GX/GN tags:

ZG: Comma-separated list of Ensembl gene IDs for all overlapping genes
ZX: Genomic overlap classification (exonic, intronic, none, spanning)
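
As a rough illustration of how these two tags might be consumed downstream, the following Python sketch parses them from one SAM-formatted record (for example a line produced by `samtools view`). The `Z` tag type matches the grep patterns used elsewhere in this README; the gene IDs in the example are made-up placeholders, and how the fork encodes a read with no overlapping gene is not covered here, so empty entries are simply dropped.

```python
from typing import Dict, List, Optional

# Overlap classes as listed in this README.
ZX_CLASSES = {"exonic", "intronic", "none", "spanning"}


def parse_zg_zx(sam_line: str) -> Dict[str, object]:
    """Extract ZG (gene ID list) and ZX (overlap class) from one SAM record."""
    genes: List[str] = []
    overlap: Optional[str] = None
    # Optional fields start at column 12 of a SAM record.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("ZG:Z:"):
            genes = [g for g in field[5:].split(",") if g]
        elif field.startswith("ZX:Z:"):
            overlap = field[5:]
    return {"genes": genes, "overlap": overlap, "known_class": overlap in ZX_CLASSES}


# Placeholder record; the gene IDs are illustrative only.
example = "\t".join(
    ["r1", "0", "chr1", "100", "255", "4M", "*", "0", "0", "ACGT", "FFFF",
     "NH:i:1", "ZG:Z:ENSG00000000001,ENSG00000000002", "ZX:Z:exonic"]
)
parsed = parse_zg_zx(example)
```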

Quick Usage:

STAR --outSAMattributes NH HI AS nM NM CR CY UR UY GX GN gx gn ZG ZX \
     --soloFeatures Gene GeneFull \
     --soloStrand Unstranded \
     [other parameters...]

Key Benefits:

28% more gene annotations than standard GX tags
99.8% read coverage vs 79.2% with GX tags
100% concordance with existing GX tags
Validated in production on large-scale datasets

4. Solo Skip Processing (`--soloSkipProcessing`)

Allows running the full mapping pipeline while skipping Solo counting/matrix generation. Per-read artifacts (e.g., CB/UB tag injection into unsorted BAM and optional binary tag table) still finalize, while Solo.out/GeneFull remains an empty directory skeleton.

Quick Usage:

STAR --runThreadN 24 \
     --genomeDir /path/to/genome \
     --readFilesIn R2.fastq.gz R1.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM Unsorted \
     --outSAMattributes NH HI AS nM NM CR CY UR UY CB UB ZG ZX \
     --soloType CB_UMI_Simple \
     --soloCBwhitelist /path/to/whitelist.txt \
     --soloFeatures Gene GeneFull \
     --soloAddTagsToUnsorted yes \
     --soloWriteTagTable Default \
     --soloSkipProcessing yes \
     --outFileNamePrefix output/

Notes:

The default is --soloSkipProcessing no.
Aligned.out.bam and Aligned.out.cb_ub.bin are expected to be identical to those from a full Solo run; in skip mode Solo.out/GeneFull is left empty.
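
One way to check that expectation is to compare file digests from a skip-mode run and a full run. The Python sketch below uses only the standard library; the helper names are hypothetical and the directory layout in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large BAMs need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    return a.stat().st_size == b.stat().st_size and sha256_of(a) == sha256_of(b)
```

With two hypothetical output prefixes `full/` and `skip/`, a call such as `same_bytes(Path("full/Aligned.out.cb_ub.bin"), Path("skip/Aligned.out.cb_ub.bin"))` would report whether the tag tables match. This sketch cannot confirm whether the BAM headers are byte-stable across the two modes; if only header lines differ, comparing the alignment records would be the next step.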

DOCUMENTATION

Instrumentation + Testing (keys.bin, stash, replay)

This fork includes opt-in instrumentation and helper scripts to validate STAR’s key stream and replay parity against downstream tools.

STAR stash logging (env-gated; no overhead unless enabled):
- STAR_STASH_DEBUG=/path/to/star_stash.tsv writes one TSV row per alignment staged by Solo (phase, qname, CB/UB, sample index/tag, ZG, MAPQ/NH/NM).
- STAR_SAMPLE_DEBUG=/path/to/star_sample_debug.tsv captures per-emitted key details (CB, sample label/index, dense ID, UMI, gene) at flushGroup().
- See tools/flex_debug/README_stash_debug.md for headers and usage.
End-to-end comparison wrapper:
- tools/flex_debug/scripts/compare_star_bam.sh can regenerate STAR+bam_to_counts stash TSVs, replay STAR’s keys.bin with cr_key_replayer, and diff MEX outputs.
- Useful flags: --run, --force-align, --force-keys, --replay, --force-replay.
- To compare replayed MEX vs bam_to_counts while ignoring raw CB16-only columns, add --filter-b2c-mex (keeps 24bp CB+TAG8 only).
- Internally uses tools/flex_debug/scripts/filter_mex_columns.py to rewrite a filtered MatrixMarket triplet.
Quantitative MEX comparison:
- tools/flex_debug/flex/src/python/common/compare_counts.py aligns barcodes/genes and reports gene/cell correlations and MARE. Supports optional probe-set filtering.
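
To make the `--filter-b2c-mex` step more concrete, here is a minimal Python sketch of column filtering on a MatrixMarket-style triplet list. It is not the contents of filter_mex_columns.py; the function name and in-memory representation are assumptions, and only the "keep 24bp barcodes" rule is taken from the description above.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (row, column, count), 1-based as in MatrixMarket


def filter_columns_by_barcode_length(
    barcodes: List[str], triplets: List[Triplet], keep_len: int = 24
) -> Tuple[List[str], List[Triplet]]:
    """Keep only columns whose barcode has keep_len bases and renumber them."""
    new_index = {}
    kept = []
    for old_col, barcode in enumerate(barcodes, start=1):
        if len(barcode) == keep_len:
            kept.append(barcode)
            new_index[old_col] = len(kept)  # new 1-based column index
    filtered = [
        (row, new_index[col], count)
        for row, col, count in triplets
        if col in new_index
    ]
    return kept, filtered


# One raw CB16-only column followed by two CB16+TAG8 columns.
barcodes = ["A" * 16, "A" * 16 + "C" * 8, "G" * 16 + "T" * 8]
triplets = [(1, 1, 5), (1, 2, 3), (2, 3, 7)]
kept, filtered = filter_columns_by_barcode_length(barcodes, triplets)
```

When writing the result back to disk, the size line of the .mtx file (rows, columns, non-zero entries) would also need to be updated to match the filtered matrix.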

Inline Replayer Parity Test (Inline Solo Mode)

Validates that STAR’s embedded inline replayer (enabled via --soloSkipProcessing yes --soloWriteKeysBin yes --soloUMICorrection clique --soloUseInlineReplayer auto) matches the Stage 1 baseline MEX. Everything lives in new/tests/test_inline_replayer_parity.sh, new/tests/run_star_inline_replayer.sh, and new/tests/generate_baseline.sh.

Quick repo-local sanity run (uses the synthetic fixture under new/tests/fixtures/):

cd /mnt/pikachu/STAR
./new/tests/test_inline_replayer_parity.sh \
  --skip-align \
  --compare-script "" \
  --star-binary "$(command -v true)"

--compare-script "" skips the optional pandas/scipy dependency wall while still checking byte-for-byte equality.
--star-binary "$(command -v true)" is a placeholder so you can exercise the script without compiling STAR; point it at source/STAR once built.
Latest fixture run (Nov 2025) with the command above: PASS. Inline _matrix/_barcodes/_features matched the baseline exactly; outputs are under new/tests/fixtures/storage/100K/SC2300771/results/inline_replayer_parity*.

Inline EmptyDrops with Dominance Filtering

STAR now includes an inline EmptyDrops pipeline that runs OrdMag filtering and the EmptyDrops multinomial test internally, eliminating the need for external Python/R scripts. The pipeline includes tag dominance filtering to handle multi-tag CB16 conflicts.

Quick Usage:

STAR --runMode soloCellFiltering \
    <matrix_dir> <output_dir> \
    --soloCellFilter EmptyDrops_CR 12000 0.99 10 500 0.01 20000 45000 90000 0.001 2000 \
    --soloFeatures Gene \
    --soloType CB_UMI_Simple \
    --soloCBwhitelist <whitelist> \
    --soloUMIlen 12 --soloCBlen 16 \
    --soloCBposition 0_0_0_16 --soloUMIposition 0_0_16_12 \
    --soloOrdMagMode auto \
    --soloDumpOrdmag yes \
    --flexEmptyDrops yes \
    --soloDominanceRatio 10.0 \
    --soloRawPvalueThreshold 0.999

Complete Example:

# Run inline EmptyDrops with dominance filtering on SC2300771 dataset
STAR --runMode soloCellFiltering \
    /mnt/pikachu/STAR/tests/100K/SC2300771/results/replay/matrix \
    /tmp/star_inline_test \
    --soloCellFilter EmptyDrops_CR 12000 0.99 10 500 0.01 20000 45000 90000 0.001 2000 \
    --soloFeatures Gene \
    --soloType CB_UMI_Simple \
    --soloCBwhitelist /mnt/pikachu/storage/scRNAseq_output/whitelists/737K-fixed-rna-profiling.txt \
    --soloUMIlen 12 --soloCBlen 16 \
    --soloCBposition 0_0_0_16 --soloUMIposition 0_0_16_12 \
    --soloOrdMagMode auto \
    --soloDumpOrdmag yes \
    --flexEmptyDrops yes \
    --soloDominanceRatio 10.0 \
    --soloRawPvalueThreshold 0.999

# Check outputs
ls -lh /tmp/star_inline_test/Solo.out/Gene/filtered/OrdMag/
ls -lh /tmp/star_inline_test/Solo.out/Gene/filtered/InlineDrops/

# View dominance statistics in log
grep "Dominance check" /tmp/star_inline_test/Log.out

# Compare final passing barcodes
wc -l /tmp/star_inline_test/Solo.out/Gene/filtered/InlineDrops/passing_barcodes.txt

Key Flags:

--flexEmptyDrops yes: Required. Enables Flex mode and dominance filtering
--soloDominanceRatio 10.0: Enable dominance filtering (defaults to 10.0 when --flexEmptyDrops is enabled)
--soloDominanceRatio 0: Disable dominance filtering
--soloOrdMagMode auto: Enable OrdMag filtering
--soloDumpOrdmag yes: Write OrdMag outputs
--soloRawPvalueThreshold 0.999: EmptyDrops p-value threshold

Pipeline Flow:

OrdMag filtering (per-tag) → Simple cells + candidates
Dominance check (per-tag) → Filters ambiguous CB16s (multi-tag conflicts)
EmptyDropsMultinomial (per-tag) → Multinomial test with p-value threshold
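
The dominance step in the flow above can be sketched in Python. This is a minimal illustration, not the fork's implementation: it assumes that a CB16 is kept for its top tag only when that tag's UMI count is at least `ratio` times the runner-up's count, and that a ratio of 0 disables the check (as the `--soloDominanceRatio 0` flag description suggests). The exact comparison STAR performs may differ.

```python
from typing import Dict, Optional


def dominant_tag(tag_counts: Dict[str, int], ratio: float = 10.0) -> Optional[str]:
    """Return the dominant tag for one CB16, or None if it is ambiguous.

    Assumed rule: top count must be >= ratio * runner-up count.
    ratio <= 0 disables the check and returns the top tag.
    """
    if not tag_counts:
        return None
    ranked = sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_tag, top_count = ranked[0]
    if ratio <= 0 or len(ranked) == 1:
        return top_tag
    runner_up = ranked[1][1]
    return top_tag if top_count >= ratio * runner_up else None


clear_case = dominant_tag({"BC004": 500, "BC006": 20})    # 500 >= 10 * 20
conflict = dominant_tag({"BC004": 500, "BC006": 120})     # 500 < 10 * 120
unchecked = dominant_tag({"BC004": 500, "BC006": 120}, ratio=0)
```

The tag labels BC004 and BC006 are reused from this README purely as example keys.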

Outputs:

Solo.out/Gene/filtered/OrdMag/: OrdMag filtering outputs
Solo.out/Gene/filtered/InlineDrops/: EmptyDrops outputs (passing_barcodes.txt, pvalues.csv)

Test Results: Validated against Python/R reference with 99.97% overlap (12,412 / 12,416 barcodes). See new/docs/dominance_filtering_test_results.md for details.
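
The overlap figure can be reproduced from its two counts, and the same calculation applies to any pair of barcode lists. In this hedged Python sketch, treating the reference set (12,416 barcodes) as the denominator is an assumption about how the percentage was defined, and `overlap_percent` is a hypothetical helper name.

```python
from typing import Iterable


def overlap_percent(candidate: Iterable[str], reference: Iterable[str]) -> float:
    """Percent of reference barcodes that also appear in candidate."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference barcode set is empty")
    return 100.0 * len(ref & set(candidate)) / len(ref)


# Reproduce the reported figure from its two counts.
reported = round(100.0 * 12412 / 12416, 2)
```

To apply it to real outputs, each argument could be the stripped lines of a passing_barcodes.txt file from the inline run and from the reference run.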

Flex Sample Filtering (Ordmag Pipeline Overview)

The STAR fork is typically paired with ordmag-style preprocessing when handling 10x Flex datasets. That pipeline does not invoke Cell Ranger's JIBES EM model; instead it runs a lightweight Poisson–multinomial dominance test independently for each sample/TAG:

Reads are partitioned by canonical TAG8. For each sample branch we tally per-CB16 UMI counts split into “on-tag” (matches the branch TAG) and “off-tag” observations.
Routing into each branch is gated by a simple dominance test: for sample S, its candidate UMI count must exceed the weakest TAG still under consideration by at least 10× (--cr-max-min-ratio default = 10). Only then do we assign the read to S; otherwise it remains unassigned.
Given the branch’s expected contamination rate and recovered-cell target, we evaluate the Poisson–multinomial tail probability that the observed off-tag mass is due to cross-talk. If the on-tag dominance exceeds the fixed threshold the CB16 is marked as a singlet for that sample; otherwise it is labeled multiplet/ambiguous and dropped for that branch only.
Because branches are independent, a CB16 can legitimately pass multiple samples when it shows strong on-tag signal for more than one TAG8. Only after this per-sample gating do we run per-sample EmptyDrops and merge the resulting CB/TAG composites into the final allow-list consumed by bam_to_counts.
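
For intuition about the tail-probability step, the Python sketch below evaluates a plain Poisson upper tail for the off-tag count. It is a deliberate simplification, not the Poisson–multinomial test in run_ordmag_nonambient.py: the single-rate model, the `contamination_rate` parameter, and both function names are assumptions made for illustration.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the CDF at k - 1."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def off_tag_tail(on_tag: int, off_tag: int, contamination_rate: float) -> float:
    """Chance of seeing at least off_tag cross-talk UMIs given the total count."""
    expected = contamination_rate * (on_tag + off_tag)
    return poisson_upper_tail(off_tag, expected)


# 10 off-tag UMIs out of 1000 at a 1% assumed contamination rate is unremarkable,
# while 100 off-tag UMIs out of 1000 is very unlikely to be cross-talk alone.
plausible = off_tag_tail(990, 10, 0.01)
implausible = off_tag_tail(900, 100, 0.01)
```

A small tail probability means the off-tag mass is hard to explain as cross-talk; how that value is thresholded in the real pipeline is not specified here.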

There is no EM loop, tie ratio heuristic, or additional guard rail, only the closed-form Poisson–multinomial test implemented in tools/flex_debug/scripts/run_ordmag_nonambient.py. Downstream, bam_to_counts now accepts that composite allow-list (24 bp CB+TAG8 entries) and re-materializes barcodes one-to-one with the filtered CB/TAG pairs, running optional clique-based UR correction before writing a new MEX.

Why duplicates remain: Because the ratio gate and Poisson test are applied per branch, it is perfectly legal (and expected) for a CB16 to appear in both BC004 and BC006 if each branch sees a strong on-tag majority in its own routed reads. The filters are intentionally tolerant of this scenario so we do not throw away valid multiplet-free data; deduplication happens later when we aggregate CB/TAG composites during UMI correction.
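
To see which CB16s appear under more than one tag in a composite allow-list, the entries can be split and grouped. The Python sketch below assumes each entry is the 16-base cell barcode followed directly by the 8-base tag, which is how "CB+TAG8" reads here; the sequences and the helper name are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def tags_per_cb16(
    composites: Iterable[str], cb_len: int = 16, tag_len: int = 8
) -> Dict[str, List[str]]:
    """Split CB16+TAG8 composites and collect the tags seen for each CB16."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in composites:
        if len(entry) != cb_len + tag_len:
            raise ValueError(f"expected {cb_len + tag_len} bases, got {len(entry)}")
        grouped[entry[:cb_len]].append(entry[cb_len:])
    return dict(grouped)


# Placeholder allow-list: the first CB16 passed two sample branches.
allow_list = [
    "AAACCCAAGAAACACT" + "ACGTACGT",
    "AAACCCAAGAAACACT" + "TTGGCCAA",
    "GGGTTTCCCAAAGGGT" + "ACGTACGT",
]
grouped = tags_per_cb16(allow_list)
shared = {cb: tags for cb, tags in grouped.items() if len(tags) > 1}
```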

Testing workflow (typical):

Run and replay:

STAR_STASH_DEBUG=/tmp/star.tsv STAR_SAMPLE_DEBUG=/tmp/sample.tsv \
  tools/flex_debug/scripts/compare_star_bam.sh --run --force-align --force-keys --replay --force-replay --filter-b2c-mex

Inspect MEX diffs and metrics:
- barcodes.tsv/features.tsv should match; matrix.mtx may differ only by triplet ordering (content-equal).
- For metrics: python3 tools/flex_debug/flex/src/python/common/compare_counts.py --cr-gene /storage/.../bam_to_counts_out --fm-gene /storage/.../alignment/replay_mex --outprefix /tmp/replay_vs_b2c.
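
Because matrix.mtx files may differ only in triplet ordering, a plain `diff` can report differences that do not matter. A small order-insensitive comparison, sketched here in Python with hypothetical helper names, treats the entries as a multiset. It compares fields as text, so `3` and `3.0` would count as different; that is an assumption suited to integer count matrices.

```python
from collections import Counter
from typing import Iterable, Tuple


def read_triplets(lines: Iterable[str]) -> Tuple[Tuple[str, ...], Counter]:
    """Return (size line fields, multiset of entry lines) for a coordinate .mtx."""
    size = None
    entries: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        fields = tuple(line.split())
        if size is None:
            size = fields  # first data line: rows, columns, non-zeros
        else:
            entries[fields] += 1
    if size is None:
        raise ValueError("no size line found")
    return size, entries


def content_equal(a: Iterable[str], b: Iterable[str]) -> bool:
    """True when both matrices have the same size line and the same entries."""
    return read_triplets(a) == read_triplets(b)


banner = "%%MatrixMarket matrix coordinate integer general"
mtx_a = [banner, "2 2 2", "1 1 5", "2 2 3"]
mtx_b = [banner, "2 2 2", "2 2 3", "1 1 5"]  # same entries, different order
mtx_c = [banner, "2 2 2", "1 1 5", "2 2 4"]  # one count differs
```

Open file handles can be passed directly, for example `content_equal(open(path_a), open(path_b))`, since the functions only iterate over lines.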

Detailed notes and findings:

new/docs/star_bam_stash_comparison.md - STAR vs bam_to_counts stash findings and index harmonization.
tools/flex_debug/README_stash_debug.md - stash/sample debug TSVs and flags.
new/docs/debug_instrumentation.md - broader debug instrumentation overview.

Fork-Specific Documentation

Located in new/docs/:

TECHNICAL_NOTES.md - Complete technical implementation details for all fork features
ZG_ZX_Implementation_Summary.md - ZG/ZX tag technical documentation with code references
two_pass_unsorted_usage.md - Usage guide for unsorted BAM workflows
CHANGES_FORK.md - Detailed changelog for fork modifications
RELEASEnotes.md - Fork release notes and validation results
memory_testing_guide.md - AddressSanitizer testing procedures
debug_instrumentation.md - Debug logging and validation

Upstream STAR Documentation

STARmanual.pdf - Original STAR manual
CHANGES.md - Upstream STAR changes and release history

Production Scripts

Located in new/scripts/:

runSTAR.sh - Production script with comprehensive ZG/ZX configuration
runSTAR_debug.sh - Debug-enabled run with monitoring
validate_zg_zx.py - Validation script for ZG/ZX tag correctness

Test Suites

Located in new/tests/:

emit_test.sh - Binary tag stream and unsorted BAM validation
integration_test.sh - End-to-end CB/UB patching tests
mem_test_tags.sh - AddressSanitizer memory safety tests
build_probe_txome_fixture.sh - Build a tiny probe-only transcriptome for tests
probe_lines_test.sh - Requires a BAM path argument; analyzes probe-specific lines

To enable transcriptome-dependent tests, build a small genomeDir and set STAR_TXOME_DIR:

./new/tests/build_probe_txome_fixture.sh \
  --fasta /path/to/probes_only.fa \
  --gtf /path/to/probes_only.gtf \
  --out /tmp/probe_txome --n 2
export STAR_TXOME_DIR=/tmp/probe_txome
unset STAR_TEST_SKIP_TXOME
make -C source test

To skip them in CI:

export STAR_TEST_SKIP_TXOME=1
make -C source test

HARDWARE/SOFTWARE REQUIREMENTS

x86-64 compatible processors
64-bit Linux or Mac OS X
At least 16GB RAM for mammalian genomes (32GB recommended)

DIRECTORY CONTENTS

/
├── source/          # All source files required for compilation
├── bin/             # Pre-compiled executables for Linux and Mac OS X
├── doc/             # Upstream STAR documentation
├── extras/          # Miscellaneous files and scripts
├── tools/           # Binary tag decoder and utilities
└── new/             # Fork-only material
    ├── docs/        # Fork documentation (see TECHNICAL_NOTES.md)
    ├── scripts/     # Production and validation scripts
    ├── tests/       # Test harnesses
    ├── plans/       # Development planning documents
    └── testing/     # Test data and outputs

COMPILING FROM SOURCE

Standard Compilation (Linux)

cd source
make STAR

For processors without AVX support:

make STAR CXXFLAGS_SIMD=sse

Mac OS X Compilation

# Install gcc with Homebrew (install Homebrew itself first, see https://brew.sh)
brew install gcc

# Build STAR (adjust gcc version path as needed)
cd source
make STARforMacStatic CXX=/usr/local/Cellar/gcc/8.2.0/bin/g++-8

# Install to system path
cp STAR /usr/local/bin

Non-Standard GCC

If g++ is not on the path:

cd source
make STAR CXX=/path/to/g++

Platform-Specific Optimization

# Platform-specific optimization
make CXXFLAGSextra=-march=native

# With link-time optimization
make LDFLAGSextra=-flto CXXFLAGSextra="-flto -march=native"

AddressSanitizer Build (Memory Testing)

cd source
ASAN=1 make clean
ASAN=1 make STAR

# Run memory tests
cd ..
ASAN_OPTIONS="detect_leaks=1" ./new/tests/mem_test_tags.sh

Note: ASan builds require ~3x more RAM and run 2-5x slower than production builds.

USAGE EXAMPLES

Example 1: Unsorted BAM with CB/UB Tags

STAR --runThreadN 24 \
     --genomeDir /path/to/genome \
     --readFilesIn R2.fastq.gz R1.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM Unsorted \
     --outSAMattributes NH HI AS nM NM CR CY UR UY \
     --soloType CB_UMI_Simple \
     --soloCBwhitelist /path/to/whitelist.txt \
     --soloFeatures Gene GeneFull \
     --soloAddTagsToUnsorted yes \
     --outFileNamePrefix output/

Example 2: Tag Table Export Only

STAR --runThreadN 24 \
     --genomeDir /path/to/genome \
     --readFilesIn R2.fastq.gz R1.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM Unsorted \
     --soloType CB_UMI_Simple \
     --soloCBwhitelist /path/to/whitelist.txt \
     --soloFeatures Gene \
     --soloWriteTagTable Default \
     --outFileNamePrefix output/

Example 3: Full Feature Set (Unsorted BAM + Tag Table + ZG/ZX)

STAR --runThreadN 24 \
     --genomeDir /path/to/genome \
     --readFilesIn R2.fastq.gz R1.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM Unsorted \
     --outSAMattributes NH HI AS nM NM CR CY UR UY GX GN gx gn ZG ZX \
     --soloType CB_UMI_Simple \
     --soloCBwhitelist /path/to/whitelist.txt \
     --soloFeatures Gene GeneFull \
     --soloStrand Unstranded \
     --soloAddTagsToUnsorted yes \
     --soloWriteTagTable Default \
     --quantMode GeneCounts \
     --outFileNamePrefix output/

Example 4: Debug Mode with Full Instrumentation

# Use production debug script
./new/scripts/runSTAR_debug.sh

# Or enable debug logging manually
STAR_DEBUG_TAG=1 STAR [parameters...]

INSPECTING OUTPUT

Examine ZG/ZX Tags

# View first 10 reads with ZG/ZX tags
samtools view output/Aligned.out.bam | grep -E "ZG:Z:|ZX:Z:" | head -10

# Extract ZG/ZX tags for analysis
samtools view output/Aligned.out.bam | awk '{
    for(i=1;i<=NF;i++) 
        if($i~/^ZG:Z:/ || $i~/^ZX:Z:/) 
            print $1"\t"$i
}' > zg_zx_tags.txt

# Validate ZG/ZX tags
python3 new/scripts/validate_zg_zx.py output/Aligned.out.bam allowed_genes.txt

Decode Binary Tag Table

# Compile decoder
cd tools
make

# Decode tag table
./decode_tag_binary ../output/Aligned.out.cb_ub.bin > decoded_tags.txt

LIMITATIONS

This release has been tested with default parameters for human and mouse genomes. Mammalian genomes require at least 16GB of RAM, ideally 32GB.

Fork-Specific Notes:

Two-pass unsorted BAM requires sufficient disk space for temporary files
Tag table binary format is specific to this fork; use provided decoder tool
ZG/ZX tags are BAM-only (not emitted in SAM format)
GeneFull must be in --soloFeatures for ZG/ZX tags to populate

FREEBSD PORTS

STAR is available through the FreeBSD ports system. To install the binary package:

pkg install star

Note: FreeBSD ports may not include fork-specific features. Compile from source for full functionality.

DOCKER

Docker build scripts are available in new/scripts/:

./new/scripts/docker_build.sh
./new/scripts/runSTAR_docker.sh

CONTRIBUTING

For issues or contributions related to:

Upstream STAR: Use https://github.com/alexdobin/STAR/issues
Fork-specific features: Contact the fork maintainer or create issues in the fork repository

LICENSE

See LICENSE file for details.

FUNDING

The development of STAR is supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R01HG009318. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
