Implement [self unread] when using streams for tokenization. #25

ewanmellor · 2015-06-15T01:55:40Z

Implement parsing of UTF-8 characters when using streams for tokenization.

[self unread] is used by many of the PKTokenizerState subclasses, so without
this feature, tokenization of streams is basically useless. This change adds
a circular buffer for all the data read from the stream, and rewinds through
this buffer to handle unreads. This places a limit on the amount of rewinding
that can be done (defaults to 256 unichars) but that should be OK for practical
purposes.
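
The rewind scheme described above can be sketched in plain C, which also compiles inside an Objective-C file. This is a minimal illustration and not the code from this change: `Rewinder`, `rewinder_read`, `rewinder_unread`, `StrSource` and `str_next` are hypothetical names, and only the 256-unit default comes from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REWIND_CAPACITY 256  /* default rewind limit quoted in the description */

/* Hypothetical input source: returns the next unit, or -1 at end of input. */
typedef int (*read_fn)(void *ctx);

typedef struct {
    uint16_t ring[REWIND_CAPACITY]; /* most recent units pulled from the source */
    size_t   written;               /* total units ever pulled from the source */
    size_t   pending;               /* units stepped back over, waiting to be re-read */
    read_fn  source;
    void    *ctx;
} Rewinder;

static void rewinder_init(Rewinder *r, read_fn source, void *ctx) {
    assert(source != NULL);
    r->written = 0;
    r->pending = 0;
    r->source = source;
    r->ctx = ctx;
}

/* Return the next unit, replaying from the ring if anything was unread. */
static int rewinder_read(Rewinder *r) {
    if (r->pending > 0) {
        size_t idx = (r->written - r->pending) % REWIND_CAPACITY;
        r->pending--;
        return r->ring[idx];
    }
    int c = r->source(r->ctx);
    if (c < 0) return -1;
    r->ring[r->written % REWIND_CAPACITY] = (uint16_t)c;
    r->written++;
    return c;
}

/* Step back one unit. Returns 0 on success, -1 once the rewind limit is hit. */
static int rewinder_unread(Rewinder *r) {
    if (r->pending >= r->written || r->pending >= REWIND_CAPACITY) return -1;
    r->pending++;
    return 0;
}

/* Hypothetical source that walks a NUL-terminated string. */
typedef struct { const char *s; size_t i; } StrSource;

static int str_next(void *ctx) {
    StrSource *src = (StrSource *)ctx;
    unsigned char c = (unsigned char)src->s[src->i];
    if (c == 0) return -1;
    src->i++;
    return c;
}
```

Reading "ab" from a `StrSource`, calling `rewinder_unread` once, and reading again yields "b" a second time. Unread fails once more units have been stepped back over than were ever read, or than the ring can hold.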

The UTF-8 support brings stream tokenization to parity with string
tokenization. The latter uses NSString.characterAtIndex to get UTF-16 code
units, and returns those from [self read]. For streams the parsing is not as
simple, but the result is now the same.
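
To illustrate why the stream case takes more work, here is a rough sketch of turning UTF-8 bytes into the UTF-16 code units that the string path returns. It is written independently for illustration and is not the code derived from UTF8.cpp; `utf8_decode` and `utf16_encode` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s[0..len). On success, store the code point
 * in *cp and return the number of bytes consumed (1 to 4). Return 0 if the
 * input is truncated or malformed. */
static size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *cp) {
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    uint8_t b0 = s[0];
    size_t n;
    uint32_t v, min;
    if (b0 < 0x80)                { *cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { n = 2; v = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; v = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; v = b0 & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < n) return 0;  /* sequence is cut off */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    /* Reject overlong forms, values past U+10FFFF, and encoded surrogates. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) return 0;
    *cp = v;
    return n;
}

/* Encode a valid code point as 1 or 2 UTF-16 code units; returns the count. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

A stream reader built this way has to buffer up to four bytes before it can hand back a unit, and a character outside the Basic Multilingual Plane produces two units from one sequence. A string reader gets both for free from NSString.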

This adds a new field called isStreamInUTF8, to enable the UTF-8 parsing for
streams. Otherwise, the code behaves as before (returning data byte-by-byte)
for backwards compatibility.

This includes code derived from http://opensource.apple.com/source/JavaScriptCore/JavaScriptCore-7534.57.3/wtf/unicode/UTF8.cpp

That code has a BSD-style license and is marked as follows:

 * Copyright (C) 2007 Apple Inc. All rights reserved.
 * Copyright (C) 2010 Patrick Gansterer <[email protected]>

The unread functionality added in 9b79874 used self.offset as the offset into self.buffer. That's no good though, because self.offset is externally expected to be the offset into the whole input (i.e. the stream) and not the offset into an internal buffer. Fix this by adding a separate self.bufOffset.
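
The distinction between the two offsets can be pictured with a small C sketch. It is only an illustration under assumptions: the `Cursor` struct, its helper functions, and the wrap-around arithmetic are hypothetical, and only the names `offset` and `bufOffset` come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAPACITY 256  /* assumed to match the default rewind limit */

typedef struct {
    size_t offset;     /* position in the whole input; what callers observe */
    size_t bufOffset;  /* index into the internal buffer only */
} Cursor;

/* Consume one unit: both counters move, but only bufOffset wraps. */
static void cursor_advance(Cursor *c) {
    c->offset++;
    c->bufOffset = (c->bufOffset + 1) % BUF_CAPACITY;
}

/* Step back one unit (assumes unread also moves the external offset back). */
static void cursor_rewind(Cursor *c) {
    assert(c->offset > 0);
    c->offset--;
    c->bufOffset = (c->bufOffset + BUF_CAPACITY - 1) % BUF_CAPACITY;
}
```

After 300 calls to `cursor_advance`, `offset` is 300 while `bufOffset` has wrapped to 44. A single shared counter cannot hold both values once the input is longer than the buffer.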

Ewan Mellor and others added 2 commits June 14, 2015 18:52
