Implement [self unread] when using streams for tokenization. #25
Implement parsing of UTF-8 characters when using streams for tokenization.
[self unread] is used by many of the PKTokenizerState subclasses, so without
it, tokenization of streams is basically useless. This change adds a circular
buffer that holds the data read from the stream, and rewinds through that
buffer to handle unreads. This limits how far back the reader can rewind (256
unichars by default), but that should be OK for practical purposes.
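
As a rough illustration of the rewind idea described above (not the code in this pull request), here is a minimal C sketch of a reader that keeps the last 256 units in a ring and replays them after an unread. The names `Reader`, `reader_init`, `reader_read`, `reader_unread` and `READER_EOF` are hypothetical, and a plain in-memory array stands in for the stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_CAPACITY 256   /* mirrors the default limit mentioned above */
#define READER_EOF (-1)

/* Hypothetical reader: an array stands in for the real input stream. */
typedef struct {
    uint16_t history[HISTORY_CAPACITY]; /* ring of the most recent units */
    size_t total;            /* units pulled from the source so far */
    size_t rewound;          /* how many units we are currently rewound by */
    const uint16_t *src;     /* assumed source data */
    size_t src_len;
    size_t src_pos;
} Reader;

static void reader_init(Reader *r, const uint16_t *src, size_t len) {
    r->total = 0;
    r->rewound = 0;
    r->src = src;
    r->src_len = len;
    r->src_pos = 0;
}

/* Returns the next unit, replaying from the ring if we are rewound. */
static int reader_read(Reader *r) {
    if (r->rewound > 0) {
        size_t idx = (r->total - r->rewound) % HISTORY_CAPACITY;
        r->rewound--;
        return r->history[idx];
    }
    if (r->src_pos >= r->src_len) {
        return READER_EOF;
    }
    uint16_t c = r->src[r->src_pos++];
    r->history[r->total % HISTORY_CAPACITY] = c; /* overwrite oldest slot */
    r->total++;
    return c;
}

/* Steps back one unit. Returns 1 on success, 0 if the ring has no
   older data left to rewind into. */
static int reader_unread(Reader *r) {
    size_t available = r->total < HISTORY_CAPACITY ? r->total : HISTORY_CAPACITY;
    if (r->rewound >= available) {
        return 0;
    }
    r->rewound++;
    return 1;
}
```

In this sketch an unread past the ring's capacity is reported through the return value rather than silently ignored; how the actual change reports that case is not stated here.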
The UTF-8 support brings stream tokenization up to the same level of support
as strings. The latter uses -[NSString characterAtIndex:] to get UTF-16 code
units, and returns those from [self read]. For streams the parsing is not as
simple, but the result is now the same.
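
To show why the stream case takes more work, here is a minimal C sketch, independent of the JavaScriptCore-derived code this change includes, that decodes one UTF-8 sequence to a code point and then encodes it as one or two UTF-16 code units. The function names `utf8_decode` and `utf16_encode` are hypothetical. The sketch assumes the whole sequence is already in memory, whereas a real stream reader also has to cope with a sequence split across two reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from s[0..len). Stores the code point in
   *out and returns the number of bytes consumed, or 0 if the input is
   malformed or truncated. */
static size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *out) {
    if (len == 0) {
        return 0;
    }
    uint8_t b0 = s[0];
    size_t need;       /* number of continuation bytes expected */
    uint32_t cp, min;  /* min rejects overlong encodings */
    if (b0 < 0x80) {
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        need = 1; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        need = 2; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        need = 3; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;      /* stray continuation byte or invalid lead byte */
    }
    if (len < need + 1) {
        return 0;      /* sequence is cut off */
    }
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;  /* not a continuation byte */
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;      /* overlong, out of range, or a surrogate */
    }
    *out = cp;
    return need + 1;
}

/* Encodes a code point as UTF-16. Returns the number of code units
   written to out (1, or 2 for a surrogate pair). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

Code points above U+FFFF come out as two code units, so a reader that hands back one unichar per read would need to hold the second half of the pair until the next call.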
This adds a new field called isStreamInUTF8, which enables UTF-8 parsing for
streams. When it is not set, the code behaves as before (returning data
byte by byte) for backwards compatibility.
This includes code derived from http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7534.57.3/wtf/unicode/UTF8.cpp
That code has a BSD-style license and is marked as follows: