
Tok

This repository contains the code of the first version of my own tokenizer, called Tok, based on Byte Pair Encoding (BPE).

What is a Tokenizer?

A tokenizer is a fundamental component in natural language processing that breaks down raw text into smaller units called tokens. These tokens can be words, subwords, or characters, and are converted into numerical representations that machine learning models can process. The quality of tokenization significantly impacts model performance, affecting everything from training efficiency to the model's ability to understand context and generate coherent text.
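To make the idea concrete, here is a minimal, self-contained sketch of the core BPE step: count adjacent token pairs and merge the most frequent one into a new token. This is an illustration of the general algorithm, not Tok's actual implementation; the function names and the sample text are made up for the example.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from single characters and apply one merge step.
tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)      # ('l', 'o') is among the most frequent
tokens = merge(tokens, pair, "lo")
print(tokens)                          # 'l' + 'o' now appears as one token "lo"
```

Repeating this merge step builds up a vocabulary of progressively longer subword tokens, which are then mapped to integer IDs for the model.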

Version History

  • Tok-1: Currently the latest (and only) version of Tok, based on BPE with byte-level encoding and custom <|EOS|> token.

How to use Tok?

Using Tok in your project is super simple: just download the JSON file of Tok-1 directly from GitHub, or type in the terminal

curl -O https://raw.githubusercontent.com/gianndev/Tok/master/tok-1/tok1.json

and then you can use it in your project with Hugging Face's tokenizers library by adding to your code

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("path/to/tok1.json")

License

This project is licensed under the terms of the MIT License. See the LICENSE file for details.