The RWKV tokenizer has a vocabulary size of 65525, and even after adding dummy tokens the vocabulary only grows to 65536. Its token indices therefore fit into the uint16 dtype, which can represent 65536 distinct values (0 through 65535).
However, because 65525 is not below the hard-coded 65500 threshold in the function below, preprocess_data.py picks the int32 dtype instead, doubling the size of the resulting binidx file.
def __best_fitting_dtype(vocab_size=None):
    if vocab_size is not None and vocab_size < 65500:
        return np.uint16
    else:
        return np.int32
Source: https://github.com/Abel2076/json2binidx_tool/blob/9051dad73f9ef84c45cfe8bb0736f2edfe228619/tools/indexed_dataset.py#L29C7-L29C7
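One possible fix is sketched below: compare the vocabulary size against the actual capacity of uint16 rather than the arbitrary 65500 cutoff. This is an illustrative rewrite, not the upstream function; the name `best_fitting_dtype` and the use of `np.iinfo` are my own choices.

```python
import numpy as np

def best_fitting_dtype(vocab_size=None):
    # uint16 can hold token ids 0..65535, i.e. up to 65536 distinct tokens.
    if vocab_size is not None and vocab_size <= np.iinfo(np.uint16).max + 1:
        return np.uint16
    return np.int32

# RWKV vocabulary: 65525 real tokens, padded up to 65536 -- both fit in uint16.
print(best_fitting_dtype(65525))  # uint16
print(best_fitting_dtype(65536))  # uint16
print(best_fitting_dtype(65537))  # int32
```

With this check, the RWKV vocabulary of 65525 (or the padded 65536) selects uint16, halving the on-disk size of the tokenized dataset compared to int32.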