feat(web): add classes for use in multi-token correction 🚂 #15818
jahorton wants to merge 11 commits into change/web/suggestion-range from
Conversation
User Test Results
Test specification and instructions: User tests are not required
Test Artifacts
Build-bot: skip build:web
Test-bot: skip

…ly for queue

This, plus the upcoming QuotientNodeFinalizer class, will support result forwarding in multi-tokenization contexts during multi-token & multi-tokenization correction.
  totalCost: number
}

export class TokenizationCorrector implements CorrectionSearchable<ReadonlyArray<TokenResult>, TokenizationResultMapping> {
Might be helpful to add a jsdoc comment to the class even though you already have a comment at the top. That way it would show up in vscode and similar tools.
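For instance, something along these lines could work (the wording here is only a sketch based on the PR description, not text from the PR itself):

    /**
     * Searches for corrections across all tokens of a context tokenization at once,
     * rather than one token at a time, combining per-token correction costs with
     * the cost of the tokenization pattern itself.
     */
    export class TokenizationCorrector implements CorrectionSearchable<ReadonlyArray<TokenResult>, TokenizationResultMapping> {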
// - a result for correcting a single Token - "TokenResult"?
// - a result for completing correction for a full Tokenization - "TokenizationResult"?

export type TokenResult = {
I'm surprised that this type is new - I thought I had come across it in previous PRs.
  public readonly tokenization: ContextTokenization;
  private readonly tailCorrectionLength: number

  private readonly _uncorrectables: SearchQuotientNode[];
Why do we have some private fields beginning with underscore and others without?
export class TokenizationCorrector implements CorrectionSearchable<ReadonlyArray<TokenResult>, TokenizationResultMapping> {
  public readonly tokenization: ContextTokenization;
  private readonly tailCorrectionLength: number
nit:
Suggested change:
-  private readonly tailCorrectionLength: number
+  private readonly tailCorrectionLength: number;
    this.tailCorrectionLength = tailCorrectionLength;

    if(tailCorrectionLength < 1) {
      throw new Error(`Length for correction near tail may not be 0.`);
- This will also throw for negative numbers, which isn't reflected in the error message.
- I'm wondering if it would be helpful to output the actual value? (similar to the error two lines below)
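For example, one possible rewording - just a sketch of the suggestion, not code from the PR:

    if(tailCorrectionLength < 1) {
      // Also covers negative values and reports what was actually received.
      throw new Error(`Length for correction near tail must be at least 1; received ${tailCorrectionLength}.`);
    }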
    if(!filterClosure(token)) {
      this._uncorrectables.push(searchModule);
    } else if(index == tailCorrectionLength - 1) {
      this._predictable = searchModule;
Do we have to check that this._predictable does not yet have a value?
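If so, a minimal sketch of such a guard (hypothetical wording, assuming at most one token should ever become the predictable one):

    } else if(index == tailCorrectionLength - 1) {
      // Hypothetical guard: fail loudly if a predictable token was already assigned.
      if(this._predictable) {
        throw new Error(`Only one token may be marked as predictable.`);
      }
      this._predictable = searchModule;
    }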
    // Compute a weighting for each token's search space based the increase in
    // tokenization cost that it represents.
Is there a word missing?
Suggested change:
-  // Compute a weighting for each token's search space based the increase in
-  // tokenization cost that it represents.
+  // Compute a weighting for each token's search space based on the increase in
+  // tokenization cost that it represents.
      }
    });

    this._lockedTokenResults = new Map();
I don't understand the name _lockedTokenResults. Why are the TokenResults locked? What does it mean?
Also, the name is a bit misleading - it sounds like they are the results of locked tokens, i.e. locked tokens that this class will return somewhere. It may be better to rename it to _lockedTokenResultsMap or something similar?
      this._lockedTokenResults.set(correctableToUpdate, tokenResult.mapping);
    }

    const resultCost = tokenResult.type != 'none' ? tokenResult.cost : this._lockedTokenResults.get(correctableToUpdate).totalCost
nit:
Suggested change:
-  const resultCost = tokenResult.type != 'none' ? tokenResult.cost : this._lockedTokenResults.get(correctableToUpdate).totalCost
+  const resultCost = tokenResult.type != 'none' ? tokenResult.cost : this._lockedTokenResults.get(correctableToUpdate).totalCost;
As mentioned in the description of #15814, up until now, we've always searched for corrections for just one token at a time - the current token. To better support whitespace fat-fingering, we'll need the ability to answer the following question: "Which tokenization pattern is the closest to matching the current text after corrections?" In essence, we need the ability to correct word boundaries, or perhaps to correct the engine's tokenization of the context.
One notable way forward is to consider what valid corrections we can find for each token in the tokenized context. This allows us to search for phrase-level corrections not focused on a specific word, prioritizing the most likely combination of tokenization-pattern and correction-cost for one of the pattern's tokens.
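As a rough illustration of that idea - the names and shapes below are hypothetical and not the PR's actual API - each candidate tokenization carries the cost of its word-boundary pattern plus the best correction cost found so far for each of its tokens, and the search expands the cheapest combination first:

    // Hypothetical sketch of the cost-combination idea; not the PR's actual types.
    interface CandidateTokenization {
      tokenizationCost: number;             // cost of this word-boundary pattern itself
      bestTokenCorrectionCosts: number[];   // cheapest correction found so far, per token
    }

    function combinedCost(candidate: CandidateTokenization): number {
      return candidate.tokenizationCost
        + candidate.bestTokenCorrectionCosts.reduce((sum, cost) => sum + cost, 0);
    }

    // A search would expand the candidate with the lowest combined cost first.
    function pickNextCandidate(candidates: CandidateTokenization[]): CandidateTokenization {
      return candidates.reduce((best, c) => combinedCost(c) < combinedCost(best) ? c : best);
    }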
Now, as to the implementation of the new type...
Predictive text generally should not extend words not adjacent to the caret unnecessarily; predictive text exists to predict what is yet to be typed, rather than adjusting what was typed. Corrections validly may adjust what was typed, but should not aim to add to it. To use a simple example, for text such as "the quicl brown fox", correcting "quicl" to "quick" may well be valid, but correcting the same word to "quickly" would not make sense - those two extra characters would appear out of nowhere.

So, when doing correction across a range of context, rather than a single token, we generally will want to correct all words not adjacent to the caret, only allowing prediction for the caret-adjacent token.
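In sketch form (hypothetical names, not the PR's code), that split amounts to allowing prediction only for the caret-adjacent token and restricting every other token to corrections of what was actually typed:

    // Hypothetical illustration of the correct-vs-predict split described above.
    type TokenRole = 'correct-only' | 'correct-and-predict';

    function roleForToken(index: number, tokenCount: number): TokenRole {
      // Only the final (caret-adjacent) token may be extended by prediction;
      // a mid-phrase "quicl" may become "quick", but never "quickly".
      return index === tokenCount - 1 ? 'correct-and-predict' : 'correct-only';
    }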
Build-bot: skip build:web
Test-bot: skip