RWKV tokenizer should generate uint16 indices, instead of int32 #6

@desktable

Description

The RWKV tokenizer has a vocabulary size of 65525. Even after adding dummy tokens, the vocabulary only grows to 65536 entries. Its token indices therefore range from 0 to 65535 and fit in the "uint16" dtype, which can represent 65536 distinct values.

However, because the function below returns uint16 only when vocab_size is less than 65500, preprocess_data.py picks the "int32" dtype instead.

def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32

Source: https://github.com/Abel2076/json2binidx_tool/blob/9051dad73f9ef84c45cfe8bb0736f2edfe228619/tools/indexed_dataset.py#L29C7-L29C7
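
One possible fix, sketched under the assumption that nothing else in the tool relies on the 65500 threshold: compare against the full uint16 range instead. The name best_fitting_dtype and the helper variable max_uint16_vocab are hypothetical and used only for illustration; this is not the repository's code.

```python
import numpy as np


# Hypothetical replacement for __best_fitting_dtype (illustration only).
def best_fitting_dtype(vocab_size=None):
    # uint16 stores indices 0..65535, so it covers any vocabulary
    # of at most np.iinfo(np.uint16).max + 1 = 65536 tokens.
    max_uint16_vocab = np.iinfo(np.uint16).max + 1
    if vocab_size is not None and vocab_size <= max_uint16_vocab:
        return np.uint16
    # Unknown or larger vocabularies fall back to int32, as before.
    return np.int32
```

With this change, both the raw vocabulary size (65525) and the padded size (65536) would select uint16, while 65537 or an unknown size would still select int32.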
