Could you please open-source the dataset used for pre-training? Currently, only the model and code are available.