@ewanmellor

Implement parsing of UTF-8 characters when using streams for tokenization.

[self unread] is used by many of the PKTokenizerState subclasses, so without
unread support, tokenization of streams is basically useless. This change adds
a circular buffer that holds the data most recently read from the stream, and
rewinds through this buffer to handle unreads. This limits how far back an
unread can go (256 unichars by default), but that should be OK for practical
purposes.
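
As a rough illustration of the circular-buffer idea described above, here is a minimal sketch in plain C. It is not the code in this PR: the `Reader` type, its fields, and the `reader_*` functions are hypothetical names, and an in-memory array stands in for the stream. It only shows how a fixed-size history allows a bounded number of unreads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HISTORY_CAPACITY 256  /* maximum rewind distance, in 16-bit units */

/* Hypothetical reader; names and layout are illustrative only. */
typedef struct {
    const uint16_t *src;                 /* stand-in for the stream */
    size_t src_len;
    size_t src_pos;                      /* units pulled from the source so far */
    uint16_t history[HISTORY_CAPACITY];  /* circular buffer of recent units */
    size_t head;                         /* slot where the next fresh unit goes */
    size_t stored;                       /* valid units in history (<= capacity) */
    size_t rewound;                      /* units unread and waiting to be replayed */
} Reader;

static void reader_init(Reader *r, const uint16_t *src, size_t len) {
    r->src = src;
    r->src_len = len;
    r->src_pos = 0;
    r->head = 0;
    r->stored = 0;
    r->rewound = 0;
}

/* Returns the next unit, or -1 at end of input. */
static int32_t reader_read(Reader *r) {
    assert(r->rewound <= r->stored);
    if (r->rewound > 0) {
        /* Replay from history: the unit `rewound` slots behind head. */
        size_t idx = (r->head + HISTORY_CAPACITY - r->rewound) % HISTORY_CAPACITY;
        r->rewound--;
        return r->history[idx];
    }
    if (r->src_pos >= r->src_len) {
        return -1;
    }
    uint16_t c = r->src[r->src_pos++];
    r->history[r->head] = c;             /* overwrites the oldest unit once full */
    r->head = (r->head + 1) % HISTORY_CAPACITY;
    if (r->stored < HISTORY_CAPACITY) {
        r->stored++;
    }
    return c;
}

/* Steps back n units; fails if that exceeds what the history still holds. */
static bool reader_unread(Reader *r, size_t n) {
    if (n > r->stored - r->rewound) {
        return false;
    }
    r->rewound += n;
    return true;
}
```

In this sketch an unread past the retained history is reported as a failure rather than silently returning stale data; whether the PR does the same is not stated here.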

The UTF-8 support brings stream tokenization up to parity with string
tokenization. The latter uses -[NSString characterAtIndex:] to get UTF-16 code
units, and returns those from [self read]. For streams the parsing is not as
simple, but the result is now the same.

This adds a new field called isStreamInUTF8, to enable the UTF-8 parsing for
streams. When it is not set, the code behaves as before (returning data
byte-by-byte) for backwards compatibility.
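
To make the "same result as strings" point concrete, here is a simplified sketch of decoding one UTF-8 sequence into the UTF-16 code units that a string-based read would yield. It is written from the UTF-8 and UTF-16 encoding rules, not copied from this PR or from the WTF source it derives from, and `utf8_to_utf16` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from bytes[0..len). On success, writes one or
   two UTF-16 code units to out, stores their count in *out_units, and
   returns the number of bytes consumed. Returns 0 on malformed or
   truncated input. */
static size_t utf8_to_utf16(const uint8_t *bytes, size_t len,
                            uint16_t out[2], size_t *out_units) {
    assert(bytes != NULL && out != NULL && out_units != NULL);
    if (len == 0) {
        return 0;
    }
    uint8_t b0 = bytes[0];
    size_t need;       /* total bytes in this sequence */
    uint32_t cp;       /* code point being assembled */
    uint32_t min;      /* smallest code point allowed for this length */
    if (b0 < 0x80)                { need = 1; cp = b0;        min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { need = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;     /* stray continuation byte or invalid lead byte */
    if (len < need) {
        return 0;      /* sequence is cut off */
    }
    for (size_t i = 1; i < need; i++) {
        if ((bytes[i] & 0xC0) != 0x80) {
            return 0;  /* not a continuation byte */
        }
        cp = (cp << 6) | (bytes[i] & 0x3F);
    }
    /* Reject overlong forms, out-of-range values, and encoded surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        *out_units = 1;
    } else {
        /* Outside the BMP: split into a surrogate pair. */
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (cp >> 10));
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
        *out_units = 2;
    }
    return need;
}
```

A character outside the Basic Multilingual Plane produces two units, which is one reason a rewind limit counted in unichars is not the same as a limit counted in characters.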

This includes code derived from http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7534.57.3/wtf/unicode/UTF8.cpp

That code has a BSD-style license and is marked as follows:

  • Copyright (C) 2007 Apple Inc. All rights reserved.
  • Copyright (C) 2010 Patrick Gansterer [email protected]

Ewan Mellor and others added 2 commits June 14, 2015 18:52
Implement parsing of UTF-8 characters when using streams for tokenization.

The unread functionality added in 9b79874 used self.offset as
the offset into self.buffer.  That's no good though, because
self.offset is externally expected to be the offset into the
whole input (i.e. the stream) and not the offset into an internal
buffer.

Fix this by adding a separate self.bufOffset.
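
The distinction between the two offsets can be sketched as follows. This is an illustration only, assuming that the public offset counts positions in the whole input while the buffer offset wraps around the circular buffer; the `Cursor` type and `cursor_*` functions are hypothetical, and the capacity is kept tiny so the wrap is easy to see.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAPACITY 4  /* tiny on purpose, so wrapping shows up quickly */

typedef struct {
    size_t offset;      /* position in the whole input, visible to callers */
    size_t buf_offset;  /* index into the internal circular buffer */
} Cursor;

/* Called once per unit read. */
static void cursor_advance(Cursor *c) {
    c->offset++;
    c->buf_offset = (c->buf_offset + 1) % BUF_CAPACITY;
}

/* Called once per unit unread. */
static void cursor_rewind(Cursor *c) {
    assert(c->offset > 0);
    c->offset--;
    c->buf_offset = (c->buf_offset + BUF_CAPACITY - 1) % BUF_CAPACITY;
}
```

After six reads with a capacity of four, `offset` is 6 while `buf_offset` has wrapped to 2, so using one value for both purposes would give callers the wrong position in the input.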