Skip to content

Implement greedy name parsing#2

Merged
Schottkyc137 merged 3 commits intoSchottkyc137:syntax-cratefrom
skaupper:name-parsing
Feb 13, 2025
Merged

Implement greedy name parsing#2
Schottkyc137 merged 3 commits intoSchottkyc137:syntax-cratefrom
skaupper:name-parsing

Conversation

@skaupper
Copy link
Copy Markdown

I implemented a first version of name parsing.

As already discussed through Matrix chat, this first implementation cannot fully differentiate some constructs and therefore clumps some tokens together in a generic Internal(...Tokens) node. This can be improved upon, but for simple names (especially selected names) it should work just fine.

Also, I added some test to show how the resulting structure may look like. The resulting SyntaxNode structure and naming is up for debate.

I implemented some (but definitely not all) of the name related production rules, like association_list, range and signature.


The big TODO is parsing of expression and subtype_indication. If there is a way to differentiate between these two on the fly, we could maybe avoid doing a second pass to parse Internal(...) nodes.

Comment thread vhdl_syntax/src/parser/productions/names.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/names.rs
@skaupper skaupper mentioned this pull request Jan 24, 2025
Copy link
Copy Markdown
Owner

@Schottkyc137 Schottkyc137 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! I feel like we're moving big steps into the right direction with this PR.
I have left some comments about what I think about some stuff you implemented, but most is up for discussion. Just let me know what you think.

Comment thread vhdl_syntax/src/parser/productions/design.rs
Comment thread vhdl_syntax/src/parser/productions/design.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/design.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/interface.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/interface.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/names.rs
Comment thread vhdl_syntax/src/parser/productions/names.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/scalar_types.rs Outdated
Comment thread vhdl_syntax/src/parser/productions/scalar_types.rs Outdated
Comment thread vhdl_syntax/src/syntax/node_kind.rs Outdated
@skaupper
Copy link
Copy Markdown
Author

I think I got all your comments. To better support bounded lookaheads I introduced a token_index field to the parser.

Calling the unbounded Parser::actual_part now expects the whole token stream to be part of the production, which is not the case if it is followed by a token like RightPar. But I guess that is to be expected, when you call a normally bounded function with usize::MAX 🤷

Let me know what you think.


(We can also move the token_index field to the NodeBuilder. Other than saving a few self.token_index += 1; I don't see the benefit, though.)

@skaupper
Copy link
Copy Markdown
Author

skaupper commented Feb 3, 2025

@Schottkyc137 Not sure if you missed it, or just didn't have time yet but here is a ping anyway :)

@Schottkyc137
Copy link
Copy Markdown
Owner

Sorry about the delay, it's the latter :/
I'll try to review your changes today.

@skaupper
Copy link
Copy Markdown
Author

skaupper commented Feb 3, 2025

Oh okay. Sorry for the ping then 😅
Take your time.

@skaupper
Copy link
Copy Markdown
Author

skaupper commented Feb 4, 2025

I haven't really considered it until now, but providing a bounded variant of Parser::name is probably a good idea as well. That way we could work around its greediness in some situations.

I scanned through the LRM and found a few grammar rules which might benefit from it: entity_aspect, function_call, procedure_call, instantiated_unit, actual_part, formal_part, group_declaration

While extracting the trailing parenthesized list from a name might not be a lot of trouble during analyzation, Parser::name wouldn't really parse it, but pack it in a RawTokens node and that wouldn't be convenient at all.

Do you see anything against it?

Comment thread vhdl_syntax/src/parser/productions/names.rs
Comment thread vhdl_syntax/src/parser/productions/names.rs Outdated
@Schottkyc137
Copy link
Copy Markdown
Owner

I think I got all your comments. To better support bounded lookaheads I introduced a token_index field to the parser.

Calling the unbounded Parser::actual_part now expects the whole token stream to be part of the production, which is not the case if it is followed by a token like RightPar. But I guess that is to be expected, when you call a normally bounded function with usize::MAX 🤷

Let me know what you think.

(We can also move the token_index field to the NodeBuilder. Other than saving a few self.token_index += 1; I don't see the benefit, though.)

Could you motivate for me why a separate token_index is needed and cannot be replaced with the amount of lookahead needed? I.e., why can't you just say 'lookahead for N tokens' instead of saying 'lookahead until index x'?
I think I'm sure I'm missing something obvious here, but couldn't find it.

@Schottkyc137
Copy link
Copy Markdown
Owner

I haven't really considered it until now, but providing a bounded variant of Parser::name is probably a good idea as well. That way we could work around its greediness in some situations.

I scanned through the LRM and found a few grammar rules which might benefit from it: entity_aspect, function_call, procedure_call, instantiated_unit, actual_part, formal_part, group_declaration

While extracting the trailing parenthesized list from a name might not be a lot of trouble during analyzation, Parser::name wouldn't really parse it, but pack it in a RawTokens node and that wouldn't be convenient at all.

Do you see anything against it?

So far I don't see anything against it. It should also be easy enough to revert, if the benefit isn't there. But for the examples you provided, this could make sense.

@skaupper
Copy link
Copy Markdown
Author

skaupper commented Feb 5, 2025

Could you motivate for me why a separate token_index is needed and cannot be replaced with the amount of lookahead needed?

It could be done either way, really. Using an absolute token index as bound instead of a relative token length is less housekeeping if you need to use the bound more than once.

While implementing association_list_bounded, association_element_bounded and actual_part_bounded I ran into the problem, that the caller of association_list_bounded determines the bound (the index/distance to a trailing RightPar or something) and every time a token is parsed (Parser::expect_token etc.) this bound has to be corrected.

If that's not clear enough, take a look at association_element_bounded. If there is a formal_part, the bound for the subsequent actual_part_bounded call will have to be corrected by the amount of tokens, this formal_part consists of (+1 because of the right arrow).

I figured using absolute indices avoids a lot of headache :)

So far I don't see anything against it.

Perfect. I will give it a go then.

@Schottkyc137
Copy link
Copy Markdown
Owner

Could you motivate for me why a separate token_index is needed and cannot be replaced with the amount of lookahead needed?

It could be done either way, really. Using an absolute token index as bound instead of a relative token length is less housekeeping if you need to use the bound more than once.

I see, this makes sense to me now. Generally, I prefer not adding more fields to the Parser, but in this case it should be fine. Still, I'd prefer if this was in the builder, since this is also where the rel_offset and text_offset reside, which serve a similar purpose. The interface is also a lot clearer, so the chance of accidentally forgetting to add a token_index + 1 or similar in some function is reduced.

@skaupper
Copy link
Copy Markdown
Author

I think I got all your comments worked in and I am happy with it.
Do you see anything blocking a merge?

@Schottkyc137
Copy link
Copy Markdown
Owner

I think I got all your comments worked in and I am happy with it. Do you see anything blocking a merge?

Nope, I think this is in a pretty good state now. This truly brings this project forward; name parsing was one of the things I was a bit worried about. Thanks a lot for putting in the work!

@Schottkyc137 Schottkyc137 merged commit 7034996 into Schottkyc137:syntax-crate Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants