
Tok

This repository contains the code of the first version of my own tokenizer, called Tok, based on Byte Pair Encoding (BPE).

What is a Tokenizer?

A tokenizer is a fundamental component in natural language processing that breaks down raw text into smaller units called tokens. These tokens can be words, subwords, or characters, and are converted into numerical representations that machine learning models can process. The quality of tokenization significantly impacts model performance, affecting everything from training efficiency to the model's ability to understand context and generate coherent text.
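To make the idea concrete, here is a minimal, self-contained sketch of the core BPE step: count adjacent token pairs and merge the most frequent one into a new token. This is an illustration of the general algorithm, not Tok's actual implementation; the function names and the sample text are made up for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from single characters and apply one merge step.
tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)      # ('l', 'o') is among the most frequent
tokens = merge(tokens, pair, "lo")
print(tokens)                          # 'l' + 'o' now appears as one token "lo"
```

Repeating this merge step builds up a vocabulary of progressively longer subword tokens, which are then mapped to integer IDs for the model.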

Version History

  • Tok-1: Currently the latest (and only) version of Tok, based on BPE with byte-level encoding and custom <|EOS|> token.

How to use Tok?

Using Tok in your project is super simple: just download the JSON file of Tok-1 directly from GitHub, or type in the terminal

curl -O https://raw.githubusercontent.com/gianndev/Tok/master/tok-1/tok1.json

and then you can use it in your project with Hugging Face's tokenizers library by adding to your code

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tok1.json")

License

This project is licensed under the terms of the MIT License. See the LICENSE file for details.