@fhennig (Contributor) commented Jan 19, 2026

resolves #1064

Summary

In some cases we do want X to be rejected, so rather than accepting it everywhere I made parsing of ambiguous symbols opt-in. The mutations-over-time component now opts in, which fixes the problem we had before.

Screenshot

The previously failing plot now works:

image

Claude code plan

See the Claude Code plan below for details on the bug investigation and on how the solution was chosen.

Details

Bug Investigation: X234T Mutation Validation Error

Problem Summary

Mutations like "X234T" supplied in includeMutations are rejected as invalid in queryMutationsOverTime.ts, even though 'X' is the conventional one-letter code for an unknown amino acid (IUPAC nucleotide notation uses 'N' for an unknown base).

Data Flow Trace

From includeMutations to Error

gs-mutations-over-time.tsx
  ↓ displayMutations prop
mutations-over-time.tsx:86
  ↓ passed to queryMutationsOverTimeData() as 6th parameter
queryOverallMutationData() in queryMutationsOverTime.ts:165-172
  ↓ renamed to includeMutations
codeToEmptyEntry(code) for each mutation code (line 111)
  ↓ calls
SubstitutionClass.parse(code) or DeletionClass.parse(code) (lines 70-71, 61-62)
  ↓ parsing fails, returns null
parseMutationCode(code) throws error (line 269)
  ↓
ERROR: "Given code is not valid: X234T"

Key Files

  • Error location: components/src/query/queryMutationsOverTime.ts:269
  • Bug location: components/src/utils/mutations.ts:17-18
  • Parsing logic: components/src/utils/mutations.ts:82-98

Why We're Parsing Mutations

The code parses mutations in two scenarios:

1. User-Supplied Mutations (includeMutations)

In codeToEmptyEntry() (queryMutationsOverTime.ts:70-78):

function codeToEmptyEntry(code: string): Entry {
    const maybeDeletion = DeletionClass.parse(code);
    if (maybeDeletion) {
        return { count: 0, mutation: maybeDeletion };
    }
    const maybeSubstitution = SubstitutionClass.parse(code);
    // ...
}

Purpose: Create empty entry objects (with count: 0) for mutations that users want to display, even if they don't appear in the API response. This ensures the mutations-over-time grid shows these mutations with zero prevalence rather than omitting them entirely.
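The zero-fill step described above can be sketched roughly as follows. This is a simplified illustration, not the code from queryMutationsOverTime.ts: `SimpleEntry` and `fillMissingEntries` are hypothetical names, and the real Entry holds a parsed mutation object rather than a plain code string.

```typescript
// Hypothetical, simplified entry type (assumption: the real Entry stores a parsed mutation).
type SimpleEntry = { code: string; count: number };

// Returns the API entries plus a zero-count entry for every requested code
// that the API response did not contain.
function fillMissingEntries(apiEntries: SimpleEntry[], includeMutations: string[]): SimpleEntry[] {
    const knownCodes = apiEntries.map((entry) => entry.code);
    const result = apiEntries.slice();
    for (const code of includeMutations) {
        if (knownCodes.indexOf(code) === -1) {
            knownCodes.push(code); // guards against duplicate codes in includeMutations
            result.push({ code, count: 0 });
        }
    }
    return result;
}

// 'A123T' is already present, so only 'X234T' is added with count 0.
const filled = fillMissingEntries([{ code: 'A123T', count: 5 }], ['A123T', 'X234T']);
```

In this sketch the requested code only has to be a string. In the real function it has to be parsed first, which is why a parse failure on "X234T" surfaces at this point.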

2. API Response Mutations

In parseMutationCode() (queryMutationsOverTime.ts:261-270):

function parseMutationCode(code: string): SubstitutionClass | DeletionClass {
    const maybeDeletion = DeletionClass.parse(code);
    if (maybeDeletion) {
        return maybeDeletion;
    }
    const maybeSubstitution = SubstitutionClass.parse(code);
    if (maybeSubstitution) {
        return maybeSubstitution;
    }
    throw Error(`Given code is not valid: ${code}`);
}

Purpose: Convert mutation code strings from the API into typed SubstitutionClass or DeletionClass objects. These objects provide:

  • Type safety (substitution vs deletion)
  • Parsed components (segment, position, reference value, substitution value)
  • Helper methods for working with mutations

Root Cause of Bug

Incomplete IUPAC Character Sets

In components/src/utils/mutations.ts:17-18:

const nucleotideChars = 'ACGTRYKMSWBDHVN'; // MISSING: X, U
const aminoAcidChars = 'ACDEFGHIKLMNPQRSTVWY'; // MISSING: X (and others)

Per IUPAC standards (https://www.bioinformatics.org/sms/iupac.html):

  • 'X' = unknown/any amino acid; for nucleotides, the IUPAC code for an unknown base is 'N'
  • These character sets are incomplete

Regex Validation Failure

The regex in buildSubstitutionRegex() (mutations.ts:28-37):

`(?<valueAtReference>[${chars}*])?` + // Optional reference value; when present it must be in chars or '*'
    `(?<position>\\d+)` +
    `(?<substitutionValue>[${chars}.*])?$`;

For "X234T":

  • ✗ 'X' not in nucleotideChars → nucleotide regex fails
  • ✗ 'X' not in aminoAcidChars → amino acid regex fails
  • ✗ Not a deletion (no '-') → deletion regex fails
  • → All parsing returns null → error thrown
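The failure can be reproduced with a reduced version of the regex. The sketch below is an approximation under stated assumptions: it drops the segment prefix, uses numbered instead of named groups, and `buildSimpleSubstitutionRegex` is a hypothetical helper, not the function in mutations.ts.

```typescript
// Character set copied from the snippet above; 'X' is not part of it.
const aminoAcidCharsWithoutX = 'ACDEFGHIKLMNPQRSTVWY';

// Simplified regex: optional reference value, position digits, optional substitution value.
function buildSimpleSubstitutionRegex(chars: string): RegExp {
    return new RegExp('^([' + chars + '*])?(\\d+)([' + chars + '.*])?$', 'i');
}

const strictRegex = buildSimpleSubstitutionRegex(aminoAcidCharsWithoutX);
const permissiveRegex = buildSimpleSubstitutionRegex(aminoAcidCharsWithoutX + 'X');

const strictMatchesX = strictRegex.test('X234T'); // false: 'X' is neither in the set nor a digit
const strictMatchesA = strictRegex.test('A234T'); // true
const permissiveMatchesX = permissiveRegex.test('X234T'); // true once 'X' is in the set
```

Because the reference value group is optional, the regex tries to read the position directly at 'X' and fails there, so no match is produced.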

Broader Impact Analysis

Where Mutation Parsing is Used

The mutation parsing logic (SubstitutionClass.parse() and DeletionClass.parse()) is used in 8 files:

  1. queryMutationsOverTime.ts (lines 60-80, 260-269)

    • codeToEmptyEntry() - Creates empty entries for user-supplied mutations
    • parseMutationCode() - Primary entry point for parsing API response mutations
  2. queryWastewaterMutationsOverTime.ts (line 60)

    • transformMutations() - Parses wastewater mutation frequency data
  3. parseAndValidateMutation.ts (lines 35-76)

    • Validates user input in mutation filter UI
    • CRITICAL: This is where users enter mutations in the UI
  4. mutation-comparison-venn.tsx (line 147)

    • Parses mutation codes for Venn diagram display
  5. getMutationsGridData.ts (lines 3-7)

    • Uses bases export to initialize grid columns
    • Creates UI columns for each supported base type
  6. mutations-grid.tsx (line 46)

    • Dynamically creates table headers from bases array

7-8. Test files: mutations.spec.ts, queryWastewaterMutationsOverTime.spec.ts, etc.

Documentation Inconsistency Found

Critical finding: The documentation explicitly states that 'X' is supported:

From mutation-filter-info.tsx:140-142:

A <b>&lt;base&gt;</b> can be one of the 20 amino acid codes.
It can also be <b>*</b> for a stop codon, <b>-</b> for deletion and <b>X</b> for unknown.

However, the implementation does not support 'X' - this is an inconsistency between docs and code.

Is the Exclusion Intentional?

Evidence suggests it may be intentional for domain-specific reasons:

  1. Coverage Calculations (queryMutationsOverTime.ts:226-228):

    // 'coverage' in the API resp. is the number of seqs. that have a non-ambiguous symbol at position

    The system explicitly distinguishes between ambiguous and non-ambiguous symbols for statistical calculations.

  2. Proportion Calculations (mutation-info.tsx):

    • Examples show ambiguous symbols like Y (meaning T or C) are excluded from proportion calculations
    • This suggests the system is designed around concrete mutations only
  3. UI Display (bases array in mutations.ts:262-287):

    nucleotide: ['A', 'C', 'G', 'T', '-'],  // No 'N' (unknown)
    'amino acid': ['I', 'L', 'V', ...],     // No 'X' (unknown)

    The UI grid only shows concrete bases, not ambiguous symbols.

  4. LAPIS Backend Design:

    • Backend likely only reports concrete mutations
    • Ambiguous positions are tracked as "coverage" rather than mutations
    • This is standard practice in genomics: don't treat unknown bases as mutations
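A small worked example may make the coverage argument concrete. It assumes that a proportion is computed as count divided by coverage, with coverage excluding sequences that have an ambiguous symbol at the position. The formula, the zero-coverage fallback, and the function name are assumptions for illustration, not code from the repository.

```typescript
// Assumed formula: proportion = count / coverage,
// where coverage = sequences with a non-ambiguous symbol at the position.
function proportionAtPosition(count: number, totalSequences: number, ambiguousAtPosition: number): number {
    const coverage = totalSequences - ambiguousAtPosition;
    // Assumption: report 0 when no sequence covers the position.
    return coverage === 0 ? 0 : count / coverage;
}

// 100 sequences, 10 of them ambiguous at the position, 30 carrying the mutation:
// coverage is 90, so the proportion is 30 / 90.
const exampleProportion = proportionAtPosition(30, 100, 10);
```

If an ambiguous symbol such as 'X' were itself counted as a mutation, it is unclear which denominator should apply, which is the statistical ambiguity the list above points to.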

Potential Impact if 'X' Were Added

Area                  | Impact                       | Severity
----------------------|------------------------------|---------
Regex parsing         | 'X' would match successfully | Low
UI grid columns       | New 'X' column would appear  | Medium
Coverage calculations | Ambiguity in statistics      | High
Backend integration   | 'X' queries sent to API      | High
Mutation filter UI    | Users can enter 'X'          | Medium
Tests                 | Need new test coverage       | Medium

Critical dependencies:

  • Backend API must support 'X' in mutation queries
  • Coverage/proportion calculations must handle ambiguous symbols correctly
  • The bases array would need updating (creates new grid column)

Recent Related Changes

Stop Codon Support (PR #987, Sept 2025):

  • Added * (stop codon) support to character sets
  • Shows the system actively adds special characters when needed
  • But 'X' was not added at that time (possibly intentional)

Files That Would Need Changes

If 'X' support were to be added:

  1. mutations.ts:17-18 - Add 'X' to character sets
  2. mutations.ts:262-287 - Add 'X' to bases array (creates UI column)
  3. mutations.spec.ts - Add test cases for 'X' parsing
  4. getMutationsGridData.ts - Verify grid initialization handles 'X'
  5. Backend verification - Confirm LAPIS API supports 'X' in queries

Recommended Solution

Add an optional parseAmbiguousSymbols parameter to the parsing logic that allows ambiguous symbols (like 'X') to be parsed when explicitly enabled.

Design Approach

Key Principle: Maintain backward compatibility by defaulting to current behavior (rejecting ambiguous symbols), but allow specific use cases to opt-in.

Benefits:

  • Preserves existing behavior for coverage calculations and UI grids
  • Allows includeMutations to accept user-specified ambiguous mutations
  • Explicit opt-in prevents unintended consequences
  • No breaking changes to existing code

Implementation Plan

1. Update Character Sets (mutations.ts:16-18)

Add ambiguous character definitions:

const nucleotideChars = 'ACGTRYKMSWBDHVN';
const aminoAcidChars = 'ACDEFGHIKLMNPQRSTVWY';
// NEW: Ambiguous symbols
const ambiguousNucleotideChars = 'X'; // Unknown nucleotide
const ambiguousAminoAcidChars = 'X'; // Unknown amino acid

2. Update Regex Builders (mutations.ts:28-38, 101-108, 163-172)

Modify buildSubstitutionRegex, buildDeletionRegex, and buildInsertionRegex to accept a parameter:

function buildSubstitutionRegex(
    type: 'nucleotide' | 'aminoAcid',
    segmentPartIsOptional: boolean,
    parseAmbiguousSymbols: boolean = false, // NEW parameter
) {
    const baseChars = type === 'nucleotide' ? nucleotideChars : aminoAcidChars;
    const ambiguousChars = type === 'nucleotide' ? ambiguousNucleotideChars : ambiguousAminoAcidChars;
    const chars = parseAmbiguousSymbols ? baseChars + ambiguousChars : baseChars;

    return new RegExp(
        `^${segmentPart(segmentPartIsOptional)}` +
            `(?<valueAtReference>[${chars}*])?` +
            `(?<position>\\d+)` +
            `(?<substitutionValue>[${chars}.*])?$`,
        'i',
    );
}

Apply similar changes to buildDeletionRegex and buildInsertionRegex.

3. Update Regex Initialization (mutations.ts:40-42, 110-111, 174-175)

Create two sets of regexes:

// Standard regexes (current behavior)
const nucleotideSubstitutionRegex = buildSubstitutionRegex('nucleotide', false, false);
const aminoAcidSubstitutionRegex = buildSubstitutionRegex('aminoAcid', false, false);

// NEW: Regexes that allow ambiguous symbols
const nucleotideSubstitutionRegexWithAmbiguous = buildSubstitutionRegex('nucleotide', false, true);
const aminoAcidSubstitutionRegexWithAmbiguous = buildSubstitutionRegex('aminoAcid', false, true);

Do the same for deletion and insertion regexes.

4. Update Parse Methods (mutations.ts:82-98)

Add parseAmbiguousSymbols parameter to SubstitutionClass.parse(), DeletionClass.parse(), and InsertionClass.parse():

class SubstitutionClass {
    static parse(
        mutationStr: string,
        segmentIsOptional: boolean = false,
        parseAmbiguousSymbols: boolean = false, // NEW parameter
    ): SubstitutionClass | null {
        // Select appropriate regexes based on flag
        const matchNucleotide = parseAmbiguousSymbols
            ? nucleotideSubstitutionRegexWithAmbiguous.exec(mutationStr)
            : nucleotideSubstitutionRegex.exec(mutationStr);
        const matchAminoAcid = parseAmbiguousSymbols
            ? aminoAcidSubstitutionRegexWithAmbiguous.exec(mutationStr)
            : aminoAcidSubstitutionRegex.exec(mutationStr);
        // ... rest of implementation
    }
}

5. Update Call Sites

Enable for includeMutations (queryMutationsOverTime.ts:70-78):

function codeToEmptyEntry(code: string): Entry {
    const maybeDeletion = DeletionClass.parse(code, false, true); // parseAmbiguousSymbols=true
    if (maybeDeletion) {
        return { count: 0, mutation: maybeDeletion };
    }
    const maybeSubstitution = SubstitutionClass.parse(code, false, true); // parseAmbiguousSymbols=true
    // ...
}

Keep disabled (default) for all other call sites:

  • parseMutationCode() in queryMutationsOverTime.ts:260-269
  • transformMutations() in queryWastewaterMutationsOverTime.ts:60
  • parseAndValidateMutation() in parseAndValidateMutation.ts:35-76
  • Venn diagram parsing in mutation-comparison-venn.tsx:147

6. Add Tests (mutations.spec.ts)

Add test cases for ambiguous symbol parsing:

describe('SubstitutionClass.parse with ambiguous symbols', () => {
    it('should reject X when parseAmbiguousSymbols=false (default)', () => {
        expect(SubstitutionClass.parse('gene1:X234T')).toEqual(null);
        expect(SubstitutionClass.parse('gene1:X234T', false, false)).toEqual(null);
    });

    it('should accept X when parseAmbiguousSymbols=true', () => {
        const result = SubstitutionClass.parse('gene1:X234T', false, true);
        expect(result).not.toEqual(null);
        expect(result?.position).toEqual(234);
        expect(result?.valueAtReference).toEqual('X');
        expect(result?.substitutionValue).toEqual('T');
    });

    it('should accept X in substitutionValue when parseAmbiguousSymbols=true', () => {
        const result = SubstitutionClass.parse('gene1:A234X', false, true);
        expect(result).not.toEqual(null);
        expect(result?.substitutionValue).toEqual('X');
    });
});

Add similar tests for DeletionClass and InsertionClass.

7. Add Integration Tests (queryMutationsOverTime.spec.ts)

Test that includeMutations works with 'X':

it('should accept X in includeMutations', async () => {
    const result = await queryMutationsOverTimeData(
        // ... parameters
        ['X234T'], // includeMutations with X
        // ... rest
    );
    // Verify X234T appears in results with count: 0
});

8. Update Documentation

Update mutation-filter-info.tsx to clarify when 'X' is supported:

// Clarify that X is only supported in certain contexts
// Or keep as-is if we want to eventually enable it in the filter UI

Add code comments in mutations.ts:

/**
 * Parse a mutation code string into a SubstitutionClass object
 * @param mutationStr - The mutation code to parse (e.g., "gene1:A234T")
 * @param segmentIsOptional - Whether segment prefix is optional
 * @param parseAmbiguousSymbols - Whether to allow ambiguous IUPAC symbols like 'X' (unknown)
 *                                Defaults to false to maintain existing behavior for coverage calculations
 * @returns SubstitutionClass object or null if parsing fails
 */

Files to Modify

  1. components/src/utils/mutations.ts

    • Lines 16-18: Add ambiguous character constants
    • Lines 28-38: Update buildSubstitutionRegex signature and implementation
    • Lines 40-42: Add ambiguous regex variants
    • Lines 82-98: Update SubstitutionClass.parse() signature
    • Lines 101-111: Update deletion regex builder and class
    • Lines 163-175: Update insertion regex builder and class
  2. components/src/query/queryMutationsOverTime.ts

    • Lines 70-78: Update codeToEmptyEntry() to pass parseAmbiguousSymbols=true
  3. components/src/utils/mutations.spec.ts

    • Add new test suite for ambiguous symbol parsing
  4. components/src/query/queryMutationsOverTime.spec.ts

    • Add integration test for includeMutations with 'X'

Summary

This solution:

  • ✅ Fixes the bug for includeMutations (allows X234T)
  • ✅ Maintains backward compatibility (default behavior unchanged)
  • ✅ Preserves existing UI grid columns (bases array unchanged)
  • ✅ Keeps coverage calculations unchanged (no ambiguous symbols by default)
  • ✅ Explicit opt-in prevents unintended consequences
  • ✅ Well-tested and documented
  • ✅ No breaking changes

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@fhennig changed the title from "fix" to "fix(dashboard-components): sequence code X is now accepted as valid when parseAmbiguousSymbols=true" Jan 19, 2026
@fhennig fhennig self-assigned this Jan 19, 2026
@fhennig fhennig marked this pull request as ready for review January 19, 2026 12:43
const nucleotideChars = 'ACGTRYKMSWBDHVN';
const aminoAcidChars = 'ACDEFGHIKLMNPQRSTVWY';
// Ambiguous IUPAC symbols (excluded from standard parsing but can be enabled via parseAmbiguousSymbols flag)
const ambiguousNucleotideChars = 'X'; // Unknown nucleotide
Collaborator

Here I think we need all the nucleotideChars that are not ATCG:

Suggested change
- const ambiguousNucleotideChars = 'X'; // Unknown nucleotide
+ const ambiguousNucleotideChars = 'RYKMSWBDHVN';

Collaborator

Also I guess we shouldn't have them in nucleotideChars then anymore.
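The split proposed in these two comments could look roughly like the sketch below. The constant and function names are hypothetical, and the sketch leaves open whether 'X' should also stay in the ambiguous set.

```typescript
// Concrete bases only (assumption: this is what "ATCG" in the comment refers to).
const concreteNucleotideChars = 'ACGT';
// IUPAC ambiguity codes, moved out of the default set as suggested.
const ambiguousNucleotideCodes = 'RYKMSWBDHVN';

// Returns the characters the regex builder would accept for nucleotides.
function allowedNucleotideChars(parseAmbiguousSymbols: boolean): string {
    return parseAmbiguousSymbols
        ? concreteNucleotideChars + ambiguousNucleotideCodes
        : concreteNucleotideChars;
}

const defaultChars = allowedNucleotideChars(false); // 'ACGT'
const permissiveChars = allowedNucleotideChars(true); // 'ACGTRYKMSWBDHVN'
```

Note that this would change the default behavior: codes such as 'R' or 'N', which the current nucleotideChars accepts without a flag, would then require parseAmbiguousSymbols=true.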

}
}

export function toMutation(
Collaborator

Is this method unused?

Comment on lines +102 to +103
expect(SubstitutionClass.parse('gene1:A234X')).to.equal(null);
expect(SubstitutionClass.parse('gene1:A234X', false, false)).to.equal(null);
Collaborator

Is that the desired behavior? I thought that ambiguity codes should always be allowed in the substitutionValue?
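If ambiguity codes were meant to be allowed in the substitutionValue regardless of the flag, one possible shape is an asymmetric regex. The sketch below is not what the PR implements. The segment prefix pattern, the numbered groups, and all names are assumptions made for illustration.

```typescript
const aminoAcidCodes = 'ACDEFGHIKLMNPQRSTVWY';
const ambiguousAminoAcidCode = 'X';

// Reference value honors the flag; substitution value always accepts 'X'.
function buildAsymmetricRegex(parseAmbiguousSymbols: boolean): RegExp {
    const referenceChars = parseAmbiguousSymbols ? aminoAcidCodes + ambiguousAminoAcidCode : aminoAcidCodes;
    const substitutionChars = aminoAcidCodes + ambiguousAminoAcidCode;
    // Assumed, simplified segment prefix such as "gene1:".
    const segment = '(?:[A-Za-z0-9_-]+:)?';
    return new RegExp(
        '^' + segment + '([' + referenceChars + '*])?(\\d+)([' + substitutionChars + '.*])?$',
        'i',
    );
}

const defaultRegex = buildAsymmetricRegex(false);
const acceptsAmbiguousTarget = defaultRegex.test('gene1:A234X'); // true: 'X' allowed as substitution value
const acceptsAmbiguousReference = defaultRegex.test('gene1:X234T'); // false: reference still strict
```

With such a regex, the two assertions quoted above would have to expect a parsed result instead of null.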

};
}
- const maybeSubstitution = SubstitutionClass.parse(code);
+ const maybeSubstitution = SubstitutionClass.parse(code, false, true);
Collaborator

Suggested change
- const maybeSubstitution = SubstitutionClass.parse(code, false, true);
+ const maybeSubstitution = SubstitutionClass.parse(code, { segmentIsOptional: false, parseAmbiguousSymbols: true });

I don't really like unnamed boolean arguments. Having them named with this "options object workaround" also helps with making the order of optional arguments irrelevant.
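A self-contained sketch of the options-object signature suggested here follows. `parseSubstitution`, `ParseOptions`, and `ParsedSubstitution` are hypothetical stand-ins for SubstitutionClass.parse and its return type, and the segment handling is a simplified assumption rather than the library's own segment logic.

```typescript
interface ParseOptions {
    segmentIsOptional?: boolean;
    parseAmbiguousSymbols?: boolean;
}

type ParsedSubstitution = {
    segment: string | undefined;
    valueAtReference: string | undefined;
    position: number;
    substitutionValue: string | undefined;
};

// Both options default to false, so existing single-argument calls keep their behavior.
function parseSubstitution(mutationStr: string, options: ParseOptions = {}): ParsedSubstitution | null {
    const { segmentIsOptional = false, parseAmbiguousSymbols = false } = options;
    const chars = parseAmbiguousSymbols ? 'ACDEFGHIKLMNPQRSTVWYX' : 'ACDEFGHIKLMNPQRSTVWY';
    // Assumed segment syntax: "<name>:" in front of the mutation.
    const segment = segmentIsOptional ? '(?:([A-Za-z0-9_-]+):)?' : '([A-Za-z0-9_-]+):';
    const regex = new RegExp('^' + segment + '([' + chars + '*])?(\\d+)([' + chars + '.*])?$', 'i');
    const match = regex.exec(mutationStr);
    if (match === null) {
        return null;
    }
    return {
        segment: match[1],
        valueAtReference: match[2],
        position: parseInt(match[3], 10),
        substitutionValue: match[4],
    };
}

// Named options read unambiguously and can be given in any order.
const strictResult = parseSubstitution('gene1:X234T');
const ambiguousResult = parseSubstitution('gene1:X234T', { parseAmbiguousSymbols: true });
const noSegmentResult = parseSubstitution('X234T', { parseAmbiguousSymbols: true, segmentIsOptional: true });
```

Compared with `parse(code, false, true)`, a caller that only wants ambiguous symbols does not have to restate the default for segmentIsOptional.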

Development

Successfully merging this pull request may close these issues.

Amino acid mutations over time: invalid mutation code
