
Initial draft of a public API to navigate and manipulate VHDL syntax#1

Closed
Schottkyc137 wants to merge 106 commits into master from syntax-crate

Conversation

@Schottkyc137
Owner

No description provided.

Schottkyc137 and others added 27 commits January 3, 2025 14:54
This is opposed to the previous implementation, where the leading trivia would
only return the immediate leading trivia stored in the token, which could
be confusing. To get the trivia between two tokens, the user would have to
manually extract the leading and trailing trivia of the current token and of
the previous or next token, respectively.
Implement greedy name parsing
@Schottkyc137
Owner Author

Finally had some more time to work on this and I think it's looking quite good. The current status is that the new vhdl_syntax crate is now capable of parsing and re-emitting every VHDL file in the example project (except for one file that uses PSL syntax which is still not supported).

There are also a bunch of other changes - like a new snapshot-based testing system which simplifies writing tests a bunch, optional serde integration for AST dumping (there is now also the vhdl-dump-ast crate + binary that uses these capabilities) and a working parse -> syntax tree -> dump roundtrip that preserves all nodes.

I think that this might be ready to be merged into the main branch quite soon. There is still a bit of clean-up to do and of course lots of testing.

@skaupper do you have any further comments on this? I think requesting a full review from you is unreasonable as this is a gigantic PR with > 300 files changed - but if you have any high-level comments I'd appreciate them!

@skaupper

skaupper commented Jan 9, 2026

Being able to parse the example project is already pretty good.

If I see that correctly, the parser still uses the hand-coded productions, with the generated syntax nodes (together with the AstNode trait) being the interface to the user. I like how these user-facing nodes are generated from a far more readable format.

One thing I am curious about: The generated nodes do not distinguish between mandatory and optional tokens or am I missing something? E.g. ContextDeclarationSyntax::context_token returns an Option<SyntaxToken> which does not feel right. I think a return type of SyntaxToken with a panic (and a proper error message) if the token is not found feels better. In the end the underlying productions must assure that the mandatory tokens are there, otherwise it would be a bug anyways.
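To make the contrast concrete, here is a minimal sketch of the two accessor shapes being discussed. The types are simplified stand-ins, not the crate's actual definitions, and `context_token_required` is a hypothetical name for the panicking variant:

```rust
// Simplified stand-ins for the real crate's types.
#[derive(Debug, Clone, PartialEq)]
struct SyntaxToken(&'static str);

struct ContextDeclarationSyntax {
    tokens: Vec<SyntaxToken>,
}

impl ContextDeclarationSyntax {
    // Current shape: every caller must handle absence, even for
    // tokens the grammar says are mandatory.
    fn context_token(&self) -> Option<SyntaxToken> {
        self.tokens.iter().find(|t| t.0 == "context").cloned()
    }

    // Suggested alternative: panic with a clear message, since the
    // underlying production is supposed to guarantee the token exists.
    fn context_token_required(&self) -> SyntaxToken {
        self.context_token()
            .expect("ContextDeclarationSyntax is missing its mandatory `context` token; parser bug")
    }
}

fn main() {
    let node = ContextDeclarationSyntax {
        tokens: vec![SyntaxToken("context"), SyntaxToken("ident")],
    };
    assert_eq!(node.context_token(), Some(SyntaxToken("context")));
    assert_eq!(node.context_token_required().0, "context");
}
```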

As far as I see error reporting is supported in a basic form. Do you have plans for error recovery as well?

In some of the examples you mention that the parser API is not finalized. Do you want to further improve that after merging it or is that part of the cleanup?

@Schottkyc137
Owner Author

Thanks for the feedback! I really appreciate it!

The fact that all AstNodes return an Option is on purpose, and it's related to error-tolerant parsing.
One of the goals of this library is to represent source trees in a lossless form - no matter the errors they contain. As a consequence, the AST must support this as well. The fact that the vhdl_lang implementation is not capable of doing this to the extent that this crate can is a pain point for lots of applications, like completions or fault-tolerant analysis.

That being said, I agree that the current design makes certain applications very un-ergonomic, especially if a consumer only wants to accept correct source code.
I have thought about how to handle this and have come up with two solutions (that other, similar frameworks also use):

  1. Have a secondary "Correct" API (called CorrectAst for the sake of simplicity). You could transform an AstNode into a Result<CorrectAstNode, ErrorKind> where all fields are checked recursively.

  2. Make all Tokens and non-optional fields actually non-optional and insert fake tokens instead. This is what SwiftSyntax uses if I'm not mistaken and it has some advantages. Each "virtual token" could be tagged whether it is actually present in the source code or synthetic and consumers can deal with that information however they like.
    The disadvantage imo is from an API perspective. Conceptually, there is not much of a difference between a "missing token" (i.e., Option<SyntaxToken>) and a "token that is declared missing" (i.e., a token where the presence is tagged), except that the former declares the intent more directly. If you don't know anything about the API, the functions will at least tell you that a token or node can be missing. A tagged token has a high chance of misuse.

So I think I prefer option 1, if any. It's a you-get-what-you-pay-for solution, can easily be generated from the YAML files and has an intuitive API.
But imo it still remains to be seen whether it's even needed. With the current node API, lots of use-cases that I can think of can already be accomplished (linter, analyzer, formatter). Other use-cases are a bit more annoying, but also feasible.
I'm absolutely not sure about this point, so I'm happy to hear your opinion! Do you prefer Option 1, 2, or any other option?
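For reference, option 1 could look roughly like this. Everything here (the node shapes, the field names, the error type) is hypothetical and only illustrates the recursive check-once conversion:

```rust
#[derive(Debug, Clone, PartialEq)]
struct SyntaxToken(String);

// Error-tolerant node: every field may be absent.
struct ContextDeclarationSyntax {
    context_token: Option<SyntaxToken>,
    identifier: Option<SyntaxToken>,
}

// Hypothetical "correct" counterpart: all mandatory fields are plain values.
#[derive(Debug)]
struct CorrectContextDeclaration {
    context_token: SyntaxToken,
    identifier: SyntaxToken,
}

#[derive(Debug, PartialEq)]
enum ErrorKind {
    MissingToken(&'static str),
}

impl TryFrom<ContextDeclarationSyntax> for CorrectContextDeclaration {
    type Error = ErrorKind;

    // Check every mandatory field once, up front; downstream code
    // then never sees an Option again.
    fn try_from(node: ContextDeclarationSyntax) -> Result<Self, ErrorKind> {
        Ok(CorrectContextDeclaration {
            context_token: node
                .context_token
                .ok_or(ErrorKind::MissingToken("context"))?,
            identifier: node.identifier.ok_or(ErrorKind::MissingToken("identifier"))?,
        })
    }
}

fn main() {
    let ok = ContextDeclarationSyntax {
        context_token: Some(SyntaxToken("context".into())),
        identifier: Some(SyntaxToken("ctx".into())),
    };
    assert!(CorrectContextDeclaration::try_from(ok).is_ok());

    let broken = ContextDeclarationSyntax {
        context_token: None,
        identifier: Some(SyntaxToken("ctx".into())),
    };
    assert_eq!(
        CorrectContextDeclaration::try_from(broken).unwrap_err(),
        ErrorKind::MissingToken("context")
    );
}
```

Consumers that only accept correct code pay the conversion cost once and get an Option-free API afterwards; error-tolerant consumers keep using the original nodes.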

Concerning the finalization of the parser API: This is what I plan to do after merging. I think that at some point done is good enough and a 40k merge request is already tons of changes. The only thing I want to avoid is merging something this big and after a couple of weeks / months decide that it needs to be refactored. That would defeat the whole point.

The only thing I'm currently considering is whether the actual AST nodes need a little refactoring. I have started to build a linter as a use-case for the library because it is something I had not thought about in the first place (and I'd also appreciate a good VHDL linter;). Refactoring ArchitectureBody from Keyword(Architecture), Identifier, Keyword(Of), Name, Keyword(Is), ... into ArchitectureHeader, ArchitectureDeclarativePart, ... has made the job easier and I hypothesize other tools would appreciate that as well. But I'm still not sure if it's worth the effort. If you're interested I can push the branch - but it's quite bare-bones at the moment.
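The refactoring could roughly turn the flat token sequence into grouped child nodes, something like the following. All types here are sketches for illustration, not the crate's actual definitions:

```rust
// Flat shape: the node exposes its raw token sequence directly,
// e.g. Keyword(Architecture), Identifier, Keyword(Of), Name, Keyword(Is), ...
struct FlatArchitectureBody {
    tokens: Vec<String>,
}

// Grouped shape: intermediate nodes give tools natural anchor points.
struct ArchitectureHeader {
    identifier: String,
    entity_name: String,
}

struct ArchitectureDeclarativePart {
    declarations: Vec<String>,
}

struct ArchitectureBody {
    header: ArchitectureHeader,
    declarative_part: ArchitectureDeclarativePart,
    statements: Vec<String>,
}

fn main() {
    let flat = FlatArchitectureBody {
        tokens: vec![
            "architecture".into(), "rtl".into(), "of".into(),
            "counter".into(), "is".into(),
        ],
    };
    assert_eq!(flat.tokens.len(), 5);

    // A linter can now ask for "the declarative part" directly instead of
    // scanning the token list for the region between `is` and `begin`.
    let body = ArchitectureBody {
        header: ArchitectureHeader {
            identifier: "rtl".into(),
            entity_name: "counter".into(),
        },
        declarative_part: ArchitectureDeclarativePart {
            declarations: vec!["signal count : natural;".into()],
        },
        statements: vec!["count <= count + 1;".into()],
    };
    assert_eq!(body.header.identifier, "rtl");
    assert_eq!(body.declarative_part.declarations.len(), 1);
}
```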

@skaupper

Okay, I get where you want to go with the AST now. But even if you would like to accept incorrect code, how should that work? I've toyed around with the example given in linting.rs, and by changing the VHDL code slightly the parser fell apart (notice the missing is keyword in the first line):

entity baz
    port (
        a: in std_ulogic
    );
end entity baz;

entity foo is
end entity bar;

To be fair, error recovery is still a TODO according to Parser::expect_token.

Let's say the parser can recover from an error by synchronizing to an end; or something. You'd have at least one SyntaxNode which is at best incomplete and at worst completely erroneous. Isn't it sufficient to flag the node itself as erroneous instead of the tokens returned by the query functions? The information about errors shouldn't be too hard to add, since Parser::expect_token already generates diagnostics. Users can then either ignore the node altogether or try making sense of it with .raw().tokens() etc. I would even consider calling query functions on an erroneous node an error in itself, because the results can be misleading in the worst case. This approach would probably also benefit from the introduction of further intermediate nodes, like the example you mentioned with ArchitectureHeader, ArchitectureDeclarativePart, ....
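The node-level flag being suggested could be sketched like this. The types and method names are invented for illustration and do not match the crate's actual API:

```rust
#[derive(Debug, Clone, PartialEq)]
struct SyntaxToken(String);

// Hypothetical node that records whether its production had to
// recover from an error while building it.
struct SyntaxNode {
    tokens: Vec<SyntaxToken>,
    erroneous: bool,
}

impl SyntaxNode {
    fn is_erroneous(&self) -> bool {
        self.erroneous
    }

    // Query functions could refuse to answer on erroneous nodes,
    // since their results can be misleading.
    fn first_identifier(&self) -> Option<&SyntaxToken> {
        assert!(
            !self.erroneous,
            "called a query function on an erroneous node"
        );
        self.tokens.first()
    }

    // Escape hatch: raw token access always works, mirroring the
    // `.raw().tokens()` idea from the comment above.
    fn raw_tokens(&self) -> &[SyntaxToken] {
        &self.tokens
    }
}

fn main() {
    let good = SyntaxNode {
        tokens: vec![SyntaxToken("baz".into())],
        erroneous: false,
    };
    assert_eq!(good.first_identifier().unwrap().0, "baz");

    let bad = SyntaxNode {
        tokens: vec![SyntaxToken("baz".into())],
        erroneous: true,
    };
    assert!(bad.is_erroneous());
    assert_eq!(bad.raw_tokens().len(), 1); // still inspectable raw
}
```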

@Schottkyc137
Owner Author

Even if you would like to accept incorrect code, how should that work?

Currently, the scope of this library is to parse correct code (tbh that's hard enough :D).
But eventually, the parser should be capable of error handling (I'd like to say "better error handling", but at the moment errors like the one you mention will break the entire parse). I will attempt to lay out what that could look like:
For your example, say that the is token is missing. One possible error recovery strategy is to check the next token; if it is the port or generic token – good, just do not emit the is token. If not, eat the token, hoping that the user just inserted some bogus input.
There are other, simple cases that the parser currently is not capable of dealing with, like seeing a : token instead of a ; token, an identifier archtecture instead of architecture, etc.
In general, the idea is that the parser should attempt to find the best match for the user's input.
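The recovery strategy for the missing is token can be sketched roughly like this. The token kinds and parser shape are invented for illustration; the real Parser::expect_token differs:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Is,
    Port,
    Generic,
    Identifier,
}

struct Parser {
    tokens: Vec<TokenKind>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> Option<TokenKind> {
        self.tokens.get(self.pos).copied()
    }

    fn bump(&mut self) {
        self.pos += 1;
    }

    // Expect `is`. If it is missing but the next token already starts the
    // entity header (`port`/`generic`), report the missing token and keep
    // going without consuming anything. Otherwise, assume bogus input and
    // skip one token.
    fn expect_is(&mut self) -> Result<(), &'static str> {
        match self.peek() {
            Some(TokenKind::Is) => {
                self.bump();
                Ok(())
            }
            Some(TokenKind::Port) | Some(TokenKind::Generic) => {
                Err("missing `is`; continuing at entity header")
            }
            _ => {
                self.bump(); // eat the bogus token and retry from there
                Err("unexpected token; skipped")
            }
        }
    }
}

fn main() {
    // `entity baz is port (...)`: `is` present, consumed normally.
    let mut p = Parser { tokens: vec![TokenKind::Is, TokenKind::Port], pos: 0 };
    assert!(p.expect_is().is_ok());
    assert_eq!(p.pos, 1);

    // `entity baz port (...)`: `is` missing; recover without consuming `port`.
    let mut p2 = Parser { tokens: vec![TokenKind::Port], pos: 0 };
    assert!(p2.expect_is().is_err());
    assert_eq!(p2.pos, 0);

    // Bogus token before `is`: skip it.
    let mut p3 = Parser { tokens: vec![TokenKind::Identifier, TokenKind::Is], pos: 0 };
    assert!(p3.expect_is().is_err());
    assert_eq!(p3.pos, 1);
}
```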

You'd have at least one SyntaxNode which is at best incomplete and at worst completely erroneous

I will try to motivate the need for a structure (AST / CST / SyntaxNodes) that supports incomplete or erroneous nodes and tokens; the motivation lies in the applications:

  1. A language server must be able to deal with broken trees. At the point where you are typing code, the tree is broken:
    a <= b and c.
    From the standpoint of the language, the statement above is horribly malformed. There is at least a semicolon missing, and nothing follows the dot (one would expect a selected name). But this is how you usually type code, and users expect language servers to generate suggestions at this position. To enable this use-case, it's extremely valuable to have a parser that can parse most of this statement into some kind of SignalAssignment.
  2. Good formatters can also deal with incorrect and incomplete code. This can be valuable to pin-point where an error is when the issue is obscured, e.g., in copy-pasted code.
  3. Linters should also have certain error recovery capabilities. You wouldn't want a linter that stops at the first error because you would need to run it again and again, once for every error in the source file.

Users then can either ignore the node altogether or try making sense of it with .raw().tokens() etc.

This raises the issue that consumers would have to re-implement complex recovery logic instead of having it in one central place.
Even if one provides a secondary API to do error recovery after emitting an erroneous token, the parser itself will still need to implement some kind of error recovery if it is to be used for any of the previous use-cases, so you can't really outsource that entirely to some other API.

Isn't it sufficient to flag the node itself as erroneous instead of the tokens returned by the query functions?

I believe that this is the critical point: At what granularity does the AST allow erroneous nodes? I believe that the answer should be either at every possible position or at none. The case "at every possible position" is used for the examples above and the case "none" is used for documentation generators, test bench extractors, and in general code that expects to work with correct VHDL code.
What you describe seems to me like a chimera between the two worlds: Some kind of errors are allowed in the AST, but not everywhere.

Currently implemented is the "errors are possible everywhere" API whereas what's still missing is the "everything is correct (or there are diagnostics)" API.

@Schottkyc137
Owner Author

Now part of the main repo
