RWKV tokenizer should generate uint16 indices, instead of int32

The RWKV tokenizer has a vocabulary size of 65525. Even after adding dummy tokens, the vocabulary size only grows to 65536. Its token indices therefore fit in the `uint16` dtype, which can represent 65536 distinct values (0 through 65535).

However, because of the function below, `preprocess_data.py` picks the `int32` dtype instead: the `vocab_size < 65500` check rejects any vocabulary of 65500 tokens or more.

```
def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32
```
Source: https://github.com/Abel2076/json2binidx_tool/blob/9051dad73f9ef84c45cfe8bb0736f2edfe228619/tools/indexed_dataset.py#L29C7-L29C7
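
One possible fix, as a minimal sketch rather than a tested patch: derive the threshold from the dtype itself instead of hard-coding 65500. The name `best_fitting_dtype` (without the leading underscores) and the `max_uint16_vocab` variable are my own for illustration, and I am assuming nothing else in the tool relies on the lower 65500 cutoff.

```python
import numpy as np


def best_fitting_dtype(vocab_size=None):
    # uint16 stores indices 0..65535, so a vocabulary of up to
    # 65536 tokens fits without overflow.
    max_uint16_vocab = np.iinfo(np.uint16).max + 1  # 65536
    if vocab_size is not None and vocab_size <= max_uint16_vocab:
        return np.uint16
    # Fall back to the wider dtype for larger or unknown vocabularies.
    return np.int32
```

With this version, both the raw vocabulary (65525) and the padded one (65536) would select `np.uint16`, while 65537 or an unknown size would still fall back to `np.int32`.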


