@gerau
Contributor

@gerau gerau commented Dec 18, 2025

No description provided.

@apoelstra
Contributor

cc @canndrew may want to keep an eye on progress here

@gerau
Contributor Author

gerau commented Jan 12, 2026

Right now there is a working parser using the chumsky crate which replicates the behavior of the pest parser in terms of building a correct parse tree -- it should produce the same Simplicity program. This implementation also fixes #79.

Error reporting is currently broken: the logic of parse::ParseFromStr has to be replaced so that it returns multiple errors or handles recoverable errors differently, and error recovery is proving to be harder than I estimated.

The code will be refactored because some parts are only half-finished (such as adding Spanned for certain names) and there are better ways to use parser combinators. However, I want to show this progress before implementing error recovery.

@gerau
Contributor Author

gerau commented Jan 12, 2026

cc @canndrew

}

#[test]
#[ignore]
Collaborator

1b1e751 It's nice to see that chumsky seems to be faster than pest here.

@gerau gerau force-pushed the simc/chumsky-migration branch from 1b1e751 to 1e7c61b Compare January 14, 2026 15:10
src/error.rs Outdated
})
.map_or(0, |ts| u32::from(ts) as usize);

let start_col = file[line_start_byte..self.span.start].chars().count();
Contributor

Do we want to count columns as the number of Unicode code points (which is what `chars().count()` gives)? There's no good way to define "number of columns" in general for non-ASCII text, but LSP defines it as the number of UTF-16 code units, and that's the closest thing to a standard that I'm aware of.

Contributor

Actually I just checked and LSP now allows you to choose between utf{8,16,32} at your leisure. But it's moot anyway since this is just deciding how long an underline to print and that's going to depend on the terminal.
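
For reference, the three candidate column definitions can be compared with a small std-only sketch. The function names are hypothetical and not part of the PR; each takes the text between the start of the line and the start of the span, like `file[line_start_byte..self.span.start]` in the snippet under review.

```rust
/// Column measured in UTF-8 code units, i.e. bytes.
fn col_utf8(prefix: &str) -> usize {
    prefix.len()
}

/// Column measured in UTF-16 code units (the historical LSP default).
fn col_utf16(prefix: &str) -> usize {
    prefix.encode_utf16().count()
}

/// Column measured in Unicode scalar values, which is what
/// `chars().count()` computes and matches a UTF-32 code unit count.
fn col_chars(prefix: &str) -> usize {
    prefix.chars().count()
}
```

For the prefix `"a\u{e9}\u{1F600}"` (an ASCII letter, an accented letter, and an emoji) these return 7, 4, and 3 respectively. None of them is guaranteed to match the width a terminal actually renders.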

Contributor Author

We should consider switching to ariadne for error pretty-printing, as it's the "sister-crate" for chumsky.

@canndrew
Contributor

It's weird that the lexer is treating all our built-in macro/function/etc names as being keywords. I realize that's how the compiler currently works, so it's okay to land this PR as-is to keep the changes small. But obviously we'd want to eventually treat these as just being identifiers.
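
A minimal sketch of the alternative, assuming a toy whitespace-separated lexer: only structural words are keywords, and built-in names come out as ordinary identifiers whose meaning is decided later. The `Token` type, the `KEYWORDS` list, and the function names are illustrative assumptions, not the PR's actual lexer.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    /// A word reserved by the grammar itself.
    Keyword(&'static str),
    /// Any other name, including built-in types and functions.
    Ident(String),
}

/// Illustrative keyword list (an assumption, not the real grammar).
const KEYWORDS: [&str; 3] = ["fn", "let", "match"];

/// Classify a single word as a keyword or an identifier.
fn classify(word: &str) -> Token {
    for kw in KEYWORDS {
        if kw == word {
            return Token::Keyword(kw);
        }
    }
    Token::Ident(word.to_string())
}

/// Toy lexer: split on whitespace and classify each word.
fn lex(src: &str) -> Vec<Token> {
    src.split_whitespace().map(classify).collect()
}
```

With this split, `lex("let u1")` yields a `let` keyword followed by the identifier `u1`; whether `u1` names a built-in type or a user variable is left to a later stage that has the surrounding context.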

@gerau gerau force-pushed the simc/chumsky-migration branch 2 times, most recently from c03241c to bd5c30f Compare January 20, 2026 16:18
gerau added 5 commits January 21, 2026 14:19
The lexer parses incoming code into tokens, which makes it simpler to
process using `chumsky`.
This commit introduces multiple changes because it is a full rewrite of
parsing and error handling.

Changes in `error.rs`:
- Change `Span` to use byte offsets in place of old `Position`
- Add `line-index` crate to calculate line and column of byte offset
- Change `RichError` implementation to use new `Span` structure
- Implement `chumsky` error traits, so it can be used in error reporting
  of parsers
- Add `expected..found` error
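
As an illustration of what a byte-offset `Span` needs for display, here is a std-only sketch that maps a byte offset to a zero-based line and column. The commit uses the `line-index` crate for this; the helper names below are hypothetical and do not reflect that crate's API.

```rust
/// Byte offsets at which each line starts; line 0 always starts at 0.
fn line_starts(src: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in src.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// Map a byte offset to a zero-based (line, column) pair.
/// The column is counted in `char`s. `offset` must lie on a char
/// boundary and be at most `src.len()`, otherwise slicing panics.
fn line_col(src: &str, starts: &[usize], offset: usize) -> (usize, usize) {
    // Index of the last line start that is <= offset.
    let line = starts.partition_point(|&s| s <= offset) - 1;
    let col = src[starts[line]..offset].chars().count();
    (line, col)
}
```

The line-start table is built once per file, so each lookup afterwards is a binary search rather than a rescan of the source.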

Changes in `parse.rs`:
- Fully rewrite the `pest` parsers as `chumsky` parsers.
- Change the `ParseFromStr` trait to use the new parsers.
This adds `ParseFromStrWithErrors`, which takes an `ErrorCollector`
and returns an `Option` of the AST.
It also changes `TemplateProgram` to use the new trait with the collector.
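
A rough sketch of the shape this describes, under stated assumptions: only the names `ParseFromStrWithErrors` and `ErrorCollector` come from the commit message. The method name, the collector's fields, and the toy `Numbers` type are hypothetical, and the real signatures in the PR may differ.

```rust
/// Accumulates every error found during parsing (fields are assumed).
#[derive(Debug, Default)]
struct ErrorCollector {
    errors: Vec<String>,
}

impl ErrorCollector {
    fn push(&mut self, msg: impl Into<String>) {
        self.errors.push(msg.into());
    }

    fn is_empty(&self) -> bool {
        self.errors.is_empty()
    }
}

/// Parse a value, reporting all errors to the collector instead of
/// returning only the first one. `None` means parsing failed.
trait ParseFromStrWithErrors: Sized {
    fn parse_from_str_with_errors(s: &str, collector: &mut ErrorCollector) -> Option<Self>;
}

/// Toy "AST": a comma-separated list of integers.
struct Numbers(Vec<u32>);

impl ParseFromStrWithErrors for Numbers {
    fn parse_from_str_with_errors(s: &str, collector: &mut ErrorCollector) -> Option<Self> {
        let mut values = Vec::new();
        let mut ok = true;
        for (i, item) in s.split(',').enumerate() {
            match item.trim().parse::<u32>() {
                Ok(v) => values.push(v),
                Err(e) => {
                    // Record the error and keep going to find more.
                    collector.push(format!("item {i}: {e}"));
                    ok = false;
                }
            }
        }
        ok.then(|| Numbers(values))
    }
}
```

Parsing `"1, x, y"` with this toy implementation returns `None` and leaves two errors in the collector, whereas a `Result`-returning parser would stop at the first.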
@gerau gerau force-pushed the simc/chumsky-migration branch from bd5c30f to 24a6bc6 Compare January 21, 2026 13:08
@gerau
Contributor Author

gerau commented Jan 21, 2026

I would like to provide more context on a few points:

  1. Some of the parsers try to recover to "default" values so that parsing can continue and report further errors. If I understand correctly, most parsers implement this by adding error states to the parsed structures, so that the analysis stage of the compiler can handle these cases correctly. I haven't done this in this PR because it requires changing the analysis code as well. Right now, compilation does not progress to the analysis stage if there is a parsing error.

  2. I changed the lexer to not treat built-in types and functions as keywords, because doing so creates behavior that was not in the original pest parser (e.g. u1 was considered an UnsignedType even when it was defined as a variable). This also does not require significant changes to the parser itself, so I think we should keep this change here.

  3. I didn't change the errors or their printing much, but I think we should consider refactoring the errors and using ariadne to collect and print them. It seems to pair fairly well with chumsky, and it would provide prettier errors than we currently have.
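
To illustrate the "error states" idea from point 1, here is a std-only sketch that does not use chumsky. The `Expr` type and `parse_all` function are hypothetical: on failure the parser records an error, emits a placeholder node, and keeps going, so a later stage could still walk the tree.

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Number(u64),
    /// Placeholder emitted where parsing failed; a later analysis
    /// stage would have to skip or reject these nodes explicitly.
    Error,
}

/// Parse whitespace-separated numbers, recovering from bad tokens.
/// Returns the (possibly partial) tree together with all errors.
fn parse_all(src: &str) -> (Vec<Expr>, Vec<String>) {
    let mut exprs = Vec::new();
    let mut errors = Vec::new();
    for (i, tok) in src.split_whitespace().enumerate() {
        match tok.parse::<u64>() {
            Ok(n) => exprs.push(Expr::Number(n)),
            Err(_) => {
                errors.push(format!("token {i}: expected number, found `{tok}`"));
                exprs.push(Expr::Error);
            }
        }
    }
    (exprs, errors)
}
```

The cost noted in point 1 shows up here as the extra `Expr::Error` variant: every consumer of the tree has to decide what to do with it, which is why the analysis code would need to change too.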

@gerau
Contributor Author

gerau commented Jan 21, 2026

Also a note about performance: chumsky seems to be faster than the pest parser in general. For example, on my machine, for a large .simf file generated by simplicity-bn254, chumsky parses 10 times faster than pest. The trade-off is slower compilation and lag in rust-analyzer, because chumsky is type-driven.

It would be nice if we could move the parser to a different crate, so it would not affect compile time too much, and the SimplicityHL parser could be used separately from the compiler.

@gerau
Contributor Author

gerau commented Jan 21, 2026

cc @canndrew @KyrylR
