Tokenizer/PHP: bug fix for double quoted strings using ${#3604
Merged
gsherwood merged 2 commits intosquizlabs:masterfrom Jun 13, 2022
Merged
Conversation
While creating a sniff for PHPCompatibility to detect the PHP 8.2 deprecation of two of the four syntaxes to embed variables/expressions within text strings, I realized that for select examples of the "type 4" syntax - "Variable variables (“${expr}”, equivalent to (string) ${expr})" -, PHPCS did not, and probably never did, tokenize those correctly.
Double quoted strings in PHPCS are normally tokenized as one token (per line), including any and all embedded variables and expressions to prevent problems in the scope map caused by braces within the string/expression.
However, for some double quoted strings using `${` syntax, this was not the case and these were tokenized as outlined below, which would still lead to the problems in the scope map.
Luckily, these syntax variants aren't used that frequently. Though I suspect if they were, this bug would have been discovered a lot sooner.
Either way, fixed now.
Including adding a dedicated test file based on examples related to the PHP 8.2 RFC.
Note: this test file tests that these embedding syntaxes are tokenized correctly, it doesn't test the other parts of the related `Tokenizers\PHP` code block (re-tokenizing to one token per line, handling of binary casts).
Example of the code for which the tokenizer was buggy:
```php
"${foo["${bar}"]}";
```
Was originally tokenized like so:
```
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
2 | L3 | C 1 | CC 0 | ( 0) | T_ECHO | [ 4]: echo
3 | L3 | C 5 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
4 | L3 | C 6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING | [ 8]: "${foo["
5 | L3 | C 14 | CC 0 | ( 0) | T_DOLLAR_OPEN_CURLY_BRACES | [ 2]: ${
6 | L3 | C 16 | CC 0 | ( 0) | T_STRING_VARNAME | [ 3]: bar
7 | L3 | C 19 | CC 0 | ( 0) | T_CLOSE_CURLY_BRACKET | [ 1]: }
8 | L3 | C 20 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING | [ 4]: "]}"
9 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON | [ 1]: ;
10 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
```
Will be tokenized (correctly) like so with the fix included in this PR:
```
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
2 | L3 | C 1 | CC 0 | ( 0) | T_ECHO | [ 4]: echo
3 | L3 | C 5 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
4 | L3 | C 6 | CC 0 | ( 0) | T_DOUBLE_QUOTED_STRING | [ 18]: "${foo["${bar}"]}"
5 | L3 | C 24 | CC 0 | ( 0) | T_SEMICOLON | [ 1]: ;
6 | L3 | C 25 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
```
Refs:
* https://www.php.net/manual/en/language.types.string.php#language.types.string.parsing
* https://wiki.php.net/rfc/deprecate_dollar_brace_string_interpolation
* https://gist.github.com/iluuu1994/72e2154fc4150f2258316b0255b698f2
b9751e8 to
150c8c4
Compare
Member
|
I had an old version of 1.3.2 lying around so I checked it on that - didn't tokenize correctly. So I think you're right; it's never worked. But it does now. Thanks a lot for fixing this. |
Contributor
Author
|
Thanks for getting this merged. This was an unexpected bonus bug find from the PHP 8.2 deprecation 😆 Also interesting to have it confirmed that my suspicion was correct (and even more interesting that apparently nobody had noticed in all that time). Thanks for checking! |
jrfnl
added a commit
to jrfnl/PHP_CodeSniffer
that referenced
this pull request
Oct 26, 2022
…terpolated strings Similar to previous PR squizlabs#3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
jrfnl
added a commit
to jrfnl/PHP_CodeSniffer
that referenced
this pull request
Oct 26, 2022
…terpolated strings Similar to previous PR squizlabs#3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
gsherwood
pushed a commit
that referenced
this pull request
Dec 22, 2022
…terpolated strings Similar to previous PR #3604 which fixed the tokenization for complex double quoted strings with interpolated variables/expressions and added tests for it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
While creating a sniff for PHPCompatibility to detect the PHP 8.2 deprecation of two of the four syntaxes to embed variables/expressions within text strings, I realized that for select examples of the "type 4" syntax - "Variable variables (
“${expr}”, equivalent to (string)${expr})" -, PHPCS did not, and probably never did, tokenize those correctly.Double quoted strings in PHPCS are normally tokenized as one token (per line), including any and all embedded variables and expressions to prevent problems in the scope map caused by braces within the string/expression.
However, for some double quoted strings using
${syntax, this was not the case and these were tokenized as outlined below, which would still lead to the problems in the scope map.Luckily, these syntax variants aren't used that frequently. Though I suspect if they were, this bug would have been discovered a lot sooner.
Either way, fixed now.
Including adding a dedicated test file based on examples related to the PHP 8.2 RFC.
Note: this test file tests that these embedding syntaxes are tokenized correctly, it doesn't test the other parts of the related
Tokenizers\PHPcode block (re-tokenizing to one token per line, handling of binary casts).Refs:
Example of some code for which the tokenizer was buggy:
Was originally tokenized like so:
Will be tokenized (correctly) like so with the fix included in this PR: