diff --git a/README.md b/README.md index 0cab42493..d49b6d772 100644 --- a/README.md +++ b/README.md @@ -41,37 +41,13 @@ pip install ai2-olmo ## Models overview The core models in the OLMo family released so far are (all trained on the [Dolma dataset](https://huggingface.co/datasets/allenai/dolma)): -| Model | Training Tokens | Context Length | W&B Logs | -|-------|-----------------|----------------|----------| -| [OLMo 1B](https://huggingface.co/allenai/OLMo-1B) | 3 Trillion | 2048 | | -| [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) | 2.5 Trillion | 2048 | [wandb.ai/ai2-llm/OLMo-7B](https://wandb.ai/ai2-llm/OLMo-7B/reports/OLMo-7B--Vmlldzo2NzQyMzk5) | -| [OLMo 7B Twin 2T](https://huggingface.co/allenai/OLMo-7B-Twin-2T) | 2 Trillion | 2048 | | - - -## Fine-tuning - -To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the tokens IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets. - -Next, prepare your training config. There are many examples in the [`configs/`](./configs) directory that you can use as a starting point. The most important thing is to make sure the model parameters (the `model` field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config or provide the corresponding overrides from the command line: - -- Update `load_path` to point to the checkpoint you want to start from. -- Set `reset_trainer_state` to `true`. -- Update `data.paths` to point to the `token_ids.npy` file you generated. -- Optionally update `data.label_mask_paths` to point to the `label_mask.npy` file you generated, unless you don't need special masking for the loss. 
-- Update `evaluators` to add/remove in-loop evaluations. - -Once you're satisfied with your training config, you can launch the training job via `torchrun`. For example: - -``` -torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \ - --data.paths=[{path_to_data}/input_ids.npy] \ - --data.label_mask_paths=[{path_to_data}/label_mask.npy] \ - --load_path={path_to_checkpoint} \ - --reset_trainer_state -``` - -Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config. +| Model | Training Tokens | Context Length | Training Config | W&B Logs | Data Order File(s) ☨ | +|-------|-----------------|:--------------:|-----------------|----------|--------------------| +| [OLMo 1B](https://huggingface.co/allenai/OLMo-1B) | 3 Trillion | 2048 | [configs/official/OLMo-1B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-1B.yaml) | | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/train_data/global_indices.npy) | +| [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) | 2.5 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | [wandb.ai/ai2-llm/OLMo-7B](https://wandb.ai/ai2-llm/OLMo-7B/reports/OLMo-7B--Vmlldzo2NzQyMzk5) | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy), [Epoch 2](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wd2gxrza/train_data/global_indices.npy) | +| [OLMo 7B Twin 2T](https://huggingface.co/allenai/OLMo-7B-Twin-2T) | 2 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy) | +> ☨ *See [Inspecting training data](#inspecting-training-data) below for usage.* ## Inference @@ -99,7 +75,6 @@ olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B") 
print(olmo_pipe("Language modeling is")) ``` - ### Inference on finetuned checkpoints If you finetune the model using the code above, you can use the conversion script to convert a native OLMo checkpoint to a HuggingFace-compatible checkpoint @@ -116,7 +91,111 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", torch_dtype=torch The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as inputs.input_ids.to('cuda') to avoid potential issues. +## Reproducibility + +### Training + +The configs used to train the official OLMo models are provided in the [`configs/official/`](https://github.com/allenai/OLMo/blob/main/configs/official) directory. + +Note that while the training and validation data is public and free to download, the data paths within those configs point to a Cloudflare R2 bucket, which requires an API key for programmatic access. +So to use any of these configs to reproduce a training run, you'll first have to download the corresponding data to a location of your choosing and then update the paths in the config accordingly. + +You can derive the public HTTP URL from an R2 URL by replacing `r2://olmo-data` with `https://olmo-data.org`. +For example, if the R2 data URL is: + +`r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy` + +then the corresponding public URL is: + +`https://olmo-data.org/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy` + +Once you've updated the data paths in the config you can launch a training run via `torchrun`. For example, to launch the 1B model training on a single 8x GPU node, you would run: + +```bash +torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml +``` + +You can use the same method to launch multi-node jobs as well. 
See [the documentation](https://pytorch.org/docs/stable/elastic/run.html) for `torchrun` to understand the additional arguments you'll need to configure the rendezvous backend / endpoint. + +### Inspecting training data + +You may be interested in inspecting the exact tokens that composed a particular batch during the training of one of the OLMo models. +We provide tools to do this, but first you'll need to download the data as above (unless you have an R2 API key) and update the corresponding config accordingly. + +Then take note of the URL of the data order file you want, which can be found in the [Models Overview](#models-overview) table. For example, the data order file for the first epoch of the OLMo-7B model is [https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy). + +Once you have that, you can use this snippet to inspect the data within a particular batch: + +```python +import numpy as np +from cached_path import cached_path + +from olmo.config import TrainConfig +from olmo.data import build_memmap_dataset + +# Update these paths to what you want: +data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy") +train_config_path = "configs/official/OLMo-7B.yaml" + + +cfg = TrainConfig.load(train_config_path) +dataset = build_memmap_dataset(cfg, cfg.data) +batch_size = cfg.global_train_batch_size +global_indices = np.memmap(data_order_file_path, mode="r+", dtype=np.uint32) + + +def get_batch_instances(batch_idx: int) -> list[list[int]]: + batch_start = batch_idx * batch_size + batch_end = (batch_idx + 1) * batch_size + batch_indices = global_indices[batch_start:batch_end] + batch_instances = [] + for index in batch_indices: + token_ids = dataset[index]["input_ids"].tolist() + batch_instances.append(token_ids) + return batch_instances + + +# Get all 2048 x 2048 token IDs in 
the first batch. +get_batch_instances(0) +``` + + +## Fine-tuning + +To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the token IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets. + +Next, prepare your training config. There are many examples in the [`configs/`](https://github.com/allenai/OLMo/blob/main/configs) directory that you can use as a starting point. The most important thing is to make sure the model parameters (the `model` field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config or provide the corresponding overrides from the command line: + +- Update `load_path` to point to the checkpoint you want to start from. +- Set `reset_trainer_state` to `true`. +- Update `data.paths` to point to the `input_ids.npy` file you generated. +- Optionally update `data.label_mask_paths` to point to the `label_mask.npy` file you generated, unless you don't need special masking for the loss. +- Update `evaluators` to add/remove in-loop evaluations. + +Once you're satisfied with your training config, you can launch the training job via `torchrun`. For example: + +``` +torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \ + --data.paths=[{path_to_data}/input_ids.npy] \ + --data.label_mask_paths=[{path_to_data}/label_mask.npy] \ + --load_path={path_to_checkpoint} \ + --reset_trainer_state +``` + +Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config. ## Evaluation Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo. 
+ +## Citing + +```bibtex +@article{OLMo, + title={OLMo: Accelerating the Science of Language Models}, + author={Dirk Groeneveld and Iz Beltagy and Pete Walsh and Akshita Bhagia and Rodney Kinney and Oyvind Tafjord and A. Jha and Hamish Ivison and Ian Magnusson and Yizhong Wang and Shane Arora and David Atkinson and Russell Authur and Khyathi Raghavi Chandu and Arman Cohan and Jennifer Dumas and Yanai Elazar and Yuling Gu and Jack Hessel and Tushar Khot and William Merrill and Jacob Daniel Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Valentina Pyatkin and Abhilasha Ravichander and Dustin Schwenk and Saurabh Shah and Will Smith and Emma Strubell and Nishant Subramani and Mitchell Wortsman and Pradeep Dasigi and Nathan Lambert and Kyle Richardson and Luke Zettlemoyer and Jesse Dodge and Kyle Lo and Luca Soldaini and Noah A. Smith and Hanna Hajishirzi}, + year={2024}, + url={https://api.semanticscholar.org/CorpusID:267365485}, + journal={arXiv preprint}, +} +``` diff --git a/configs/official/OLMo-1B.yaml b/configs/official/OLMo-1B.yaml new file mode 100644 index 000000000..6f0d9c95a --- /dev/null +++ b/configs/official/OLMo-1B.yaml @@ -0,0 +1,446 @@ +run_name: OLMo-1B +seed: 6198 +dry_run: false + +wandb: + name: ${run_name} + project: olmo-small + +model: + d_model: 2048 + n_heads: 16 + n_layers: 16 + mlp_ratio: 8 + weight_tying: true + alibi: false + rope: true + flash_attention: false # not available on AMD + attention_dropout: 0.0 + attention_layer_norm: false + multi_query_attention: false + include_bias: false + block_type: sequential + layer_norm_type: default + layer_norm_with_affine: false + bias_for_layer_norm: false + attention_layer_norm_with_affine: false + activation_type: swiglu + residual_dropout: 0.0 + embedding_dropout: 0.0 + max_sequence_length: 2048 + vocab_size: 50280 + embedding_size: 50304 + eos_token_id: 50279 + pad_token_id: 1 + init_device: meta + init_fn: mitchell + +compile: null # causes 
instability on AMD GPUs + +optimizer: + name: adamw + learning_rate: 4.0e-4 + weight_decay: 0.1 + betas: + - 0.9 + - 0.95 + metrics_log_interval: 10 + +scheduler: + name: cosine_with_warmup + t_warmup: 2000 + alpha_f: 0.1 + +tokenizer: + identifier: tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json + truncate_direction: right + +save_folder: ${path.choose:${oc.env:SCRATCH_DIR,no_exist}/checkpoints,/results}/${oc.env:SLURM_JOB_ID,${run_name}} +save_overwrite: false +# Sharded checkpoints (best for restarts) +save_interval: 1000 +save_num_checkpoints_to_keep: 9 +# Unsharded checkpoints (for final storage) +save_interval_unsharded: 10000 +save_num_unsharded_checkpoints_to_keep: -1 + +load_path: null + +max_duration: 739_328 # 3.1T tokens +global_train_batch_size: 2048 +device_train_microbatch_size: 8 + +precision: amp_bf16 + +fsdp: + wrapping_strategy: null + precision: mixed + +max_grad_norm: 1.0 +max_grad_norm_ratio: null + +speed_monitor: + window_size: 20 + +eval_interval: ${save_interval} +eval_subset_num_batches: -1 +device_eval_batch_size: ${device_train_microbatch_size} +evaluators: + # lump all the small datasets together (we still get separate metrics). 
+ - label: v3-small-ppl-validation + data: + num_workers: 0 + drop_last: true + datasets: + v3-small-c4_en-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/c4_en/val/part-0-00000.npy + v3-small-dolma_books-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_books/val/part-0-00000.npy + v3-small-dolma_common-crawl-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_common-crawl/val/part-0-00000.npy + v3-small-dolma_pes2o-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_pes2o/val/part-0-00000.npy + v3-small-dolma_reddit-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_reddit/val/part-0-00000.npy + v3-small-dolma_stack-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_stack/val/part-0-00000.npy + v3-small-dolma_wiki-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_wiki/val/part-0-00000.npy + v3-small-ice-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/ice/val/part-0-00000.npy + v3-small-m2d2_s2orc-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/m2d2_s2orc/val/part-0-00000.npy + v3-small-pile-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/pile/val/part-0-00000.npy + v3-small-wikitext_103-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/wikitext_103/val/part-0-00000.npy + + - label: v2-small-ppl-validation + data: + num_workers: 0 + drop_last: true + datasets: + v2-small-4chan-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/4chan/val.npy + v2-small-c4_100_domains-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_100_domains/val.npy + v2-small-c4_en-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_en/val.npy + v2-small-gab-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/gab/val.npy + 
v2-small-ice-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ice/val.npy + v2-small-m2d2_s2orc-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_s2orc/val.npy + v2-small-m2d2_wiki-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_wiki/val.npy + v2-small-manosphere-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/manosphere/val.npy + v2-small-mc4_en-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/mc4_en/val.npy + v2-small-pile-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/pile/val.npy + v2-small-ptb-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ptb/val.npy + v2-small-twitterAEE-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/twitterAEE/val.npy + v2-small-wikitext_103-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/wikitext_103/val.npy + + - label: piqa + type: downstream + + - label: hellaswag + type: downstream + + - label: winogrande + type: downstream + + - label: openbook_qa + type: downstream + + # - label: boolq # requires implementation of the pmi_dc matrix + # type: downstream + + - label: sciq + type: downstream + + - label: arc_easy + type: downstream + + # - label: arc_challenge # requires implementation of the pmi_dc matrix + # type: downstream + + - label: copa + type: downstream + + - label: rte + type: downstream + + - label: commitment_bank + type: downstream + + - label: mrpc + type: downstream + + - label: sst2 + type: downstream + +data: + pad_direction: right + num_workers: 0 + drop_last: true + pin_memory: true + prefetch_factor: 16 + persistent_workers: true + timeout: 0 + paths: + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-001-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-002-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-003-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-004-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-004-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-005-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-005-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-006-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-006-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-007-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-008-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-008-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-009-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-009-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-010-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-010-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-011-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-012-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-013-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-014-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-015-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-016-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-017-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-017-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-018-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-018-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-019-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-020-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-020-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-021-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-022-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-023-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-024-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-025-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-025-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-026-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-026-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-027-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-027-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-028-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-029-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-030-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-031-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-032-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-033-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-033-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-034-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-034-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-035-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-035-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-036-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-036-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-037-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-038-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-039-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-039-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-040-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-041-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-042-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-043-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-044-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-045-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-045-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-046-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-047-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-047-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-048-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-049-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-050-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-051-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-052-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-053-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-054-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-055-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-056-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-057-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-058-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-059-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-060-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-061-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-062-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-063-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-064-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-064-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-065-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-065-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-066-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-066-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-067-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-067-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-068-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-068-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-069-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-069-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-070-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-071-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-072-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-073-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-074-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-074-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-075-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-075-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-076-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-076-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-077-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-078-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-078-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-079-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-079-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-080-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-081-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-082-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-083-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-083-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-084-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-085-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-086-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-087-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-088-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-088-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-089-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-089-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-090-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-090-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-091-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-092-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-093-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-094-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-095-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-096-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-096-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-097-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-098-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-099-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-100-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-101-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-102-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-102-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-103-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-104-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-105-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-105-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-106-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-107-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-108-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-109-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-110-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-111-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-112-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-112-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-113-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-114-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-115-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-116-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-117-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-118-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-118-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-119-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-120-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-120-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-121-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-122-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-123-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-124-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-125-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-126-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-126-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-127-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-128-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-129-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-130-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-131-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-132-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-133-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-134-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-135-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-136-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-137-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-138-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-139-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-139-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-140-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-141-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-142-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-143-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-143-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-144-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-145-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-145-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-146-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-147-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-147-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-148-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-149-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-149-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-150-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-151-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-151-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-152-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-152-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-153-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-153-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-154-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-155-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-156-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-156-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-157-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-158-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-158-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-159-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-160-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-160-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-161-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-161-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-162-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-163-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-163-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-164-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-165-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-165-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-166-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-166-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-167-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-167-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-168-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-169-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-170-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-171-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-172-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-173-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-174-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-174-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-175-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-176-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-177-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-178-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-179-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-179-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-180-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-181-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-182-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-183-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-184-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-185-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-185-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-186-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-187-00000.npy diff --git a/configs/official/OLMo-7B.yaml b/configs/official/OLMo-7B.yaml new file mode 100644 index 000000000..1f6aec03d --- /dev/null +++ b/configs/official/OLMo-7B.yaml @@ -0,0 +1,648 @@ +run_name: OLMo-7B +seed: 6198 +dry_run: false + +wandb: + name: ${run_name} + project: olmo-medium + group: OLMo-7B + +model: + d_model: 4096 + n_heads: 32 + n_layers: 32 + mlp_hidden_size: 22016 + weight_tying: false + alibi: false + rope: true + flash_attention: true + attention_dropout: 0.0 + attention_layer_norm: false + multi_query_attention: false + include_bias: false + block_type: sequential + layer_norm_type: default + layer_norm_with_affine: false + bias_for_layer_norm: false + attention_layer_norm_with_affine: false + activation_type: swiglu + residual_dropout: 0.0 + embedding_dropout: 0.0 + max_sequence_length: 2048 + vocab_size: 50280 + embedding_size: 50304 + eos_token_id: 50279 + pad_token_id: 1 + init_device: meta + init_fn: mitchell + +compile: + fullgraph: false + +optimizer: + name: adamw + learning_rate: 3.0e-4 + weight_decay: 0.1 + betas: + - 0.9 + - 0.95 + metrics_log_interval: 10 + +scheduler: + name: linear_with_warmup + t_warmup: 5000 + alpha_f: 0.1 + grad_clip_warmup_steps: 1000 + grad_clip_warmup_factor: 10.0 + +tokenizer: + identifier: tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json + truncate_direction: right + +save_folder: runs/${run_name} +remote_save_folder: null +save_overwrite: true +# Sharded checkpoints (best for restarts) +save_interval: 1000 +save_num_checkpoints_to_keep: -1 +# Unsharded 
checkpoints (for final storage) +save_interval_unsharded: null +save_num_unsharded_checkpoints_to_keep: -1 + +load_path: null + +max_duration: 2e12T # 2T tokens +global_train_batch_size: 2048 +device_train_microbatch_size: 2 +time_limit: null + +precision: amp_bf16 + +fsdp: + wrapping_strategy: by_block + precision: mixed + +max_grad_norm: 1.0 +max_grad_norm_ratio: null + +speed_monitor: + window_size: 20 + +eval_interval: ${save_interval} +eval_subset_num_batches: -1 +device_eval_batch_size: ${device_train_microbatch_size} +evaluators: + - label: v3-small-ppl-validation + data: + num_workers: 0 + drop_last: true + datasets: + v3-small-c4_en-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/c4_en/val/part-0-00000.npy + v3-small-dolma_books-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_books/val/part-0-00000.npy + v3-small-dolma_common-crawl-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_common-crawl/val/part-0-00000.npy + v3-small-dolma_pes2o-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_pes2o/val/part-0-00000.npy + v3-small-dolma_reddit-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_reddit/val/part-0-00000.npy + v3-small-dolma_stack-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_stack/val/part-0-00000.npy + v3-small-dolma_wiki-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/dolma_wiki/val/part-0-00000.npy + v3-small-ice-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/ice/val/part-0-00000.npy + v3-small-m2d2_s2orc-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/m2d2_s2orc/val/part-0-00000.npy + v3-small-pile-validation: + - r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/pile/val/part-0-00000.npy + v3-small-wikitext_103-validation: + - 
r2://olmo-data/eval-data/perplexity/v3_small_gptneox20b/wikitext_103/val/part-0-00000.npy + + - label: v2-small-ppl-validation + data: + num_workers: 0 + drop_last: true + datasets: + v2-small-4chan-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/4chan/val.npy + v2-small-c4_100_domains-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_100_domains/val.npy + v2-small-c4_en-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/c4_en/val.npy + v2-small-gab-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/gab/val.npy + v2-small-ice-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ice/val.npy + v2-small-m2d2_s2orc-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_s2orc/val.npy + v2-small-m2d2_wiki-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/m2d2_wiki/val.npy + v2-small-manosphere-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/manosphere/val.npy + v2-small-mc4_en-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/mc4_en/val.npy + v2-small-pile-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/pile/val.npy + v2-small-ptb-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/ptb/val.npy + v2-small-twitterAEE-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/twitterAEE/val.npy + v2-small-wikitext_103-validation: + - r2://olmo-data/eval-data/perplexity/v2_small_gptneox20b/wikitext_103/val.npy + + ########################## + # Downstream evaluations # + ########################## + - label: piqa + type: downstream + + - label: hellaswag + type: downstream + + - label: winogrande + type: downstream + + - label: openbook_qa + type: downstream + + # - label: boolq # requires implementation of the pmi_dc matrix + # type: downstream + + - label: sciq + type: downstream + + - label: arc_easy + type: downstream + 
+ # - label: arc_challenge # requires implementation of the pmi_dc matrix + # type: downstream + + - label: copa + type: downstream + + - label: rte + type: downstream + + - label: commitment_bank + type: downstream + + - label: mrpc + type: downstream + + - label: sst2 + type: downstream + +data: + pad_direction: right + num_workers: 16 + drop_last: true + pin_memory: true + prefetch_factor: 1 + persistent_workers: true + timeout: 0 + paths: + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-000-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-001-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-001-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-002-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-002-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-003-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-003-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-004-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-004-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-005-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-005-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-006-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-006-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-006-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-007-00000.npy + 
- r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-007-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-008-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-008-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-008-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-009-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-009-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-010-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-010-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-010-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-011-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-011-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-012-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-012-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-013-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-013-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-013-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-014-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-014-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-014-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-015-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-015-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-016-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-016-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-017-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-017-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-018-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-018-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-019-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-019-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-020-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-020-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-021-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-021-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-022-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-022-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-023-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-023-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-024-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-024-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-025-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-025-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-025-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-026-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-026-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-027-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-027-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-027-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-028-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-028-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-028-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-029-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-029-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-030-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-030-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-031-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-031-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-032-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-032-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-033-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-033-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-033-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-034-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-034-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-034-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-035-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-035-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-036-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-036-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-037-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-037-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-038-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-038-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-039-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-039-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-040-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-040-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-041-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-041-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-042-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-042-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-042-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-043-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-043-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-043-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-044-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-044-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-044-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-045-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-045-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-046-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-046-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-046-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-046-00003.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-047-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-047-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-048-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-048-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-049-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-049-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-050-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-050-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-051-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-051-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-052-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-052-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-052-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-053-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-053-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-053-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-054-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-054-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-055-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-055-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-055-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-056-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-056-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-056-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-057-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-057-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-057-00002.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-058-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-058-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-059-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-059-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-060-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-060-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-061-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-061-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-062-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-062-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-062-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-063-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-063-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-063-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-064-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-064-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-064-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-065-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-065-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-065-00002.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-066-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-066-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-067-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-067-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-068-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-068-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-069-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-069-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-070-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-070-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-071-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-071-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-072-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-072-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-073-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-073-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-074-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-074-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-075-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-075-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-076-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-076-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-077-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-077-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-078-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-078-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-079-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-079-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-080-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-080-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-081-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-081-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-082-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-082-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-083-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-083-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-084-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-084-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-085-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-085-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-086-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-086-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-087-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-087-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-088-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-088-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-089-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-089-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-089-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-090-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-090-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-091-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-091-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-091-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-092-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-092-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-093-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-093-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-093-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-094-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-094-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-094-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-095-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-095-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-096-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-096-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-097-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-097-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-097-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-098-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-098-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-099-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-099-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-100-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-100-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-100-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-101-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-101-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-102-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-102-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-103-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-103-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-104-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-104-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-105-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-105-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-106-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-106-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-106-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-107-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-107-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-108-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-108-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-109-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-109-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-109-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-110-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-110-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-110-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-111-00000.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-111-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-112-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-112-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-113-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-113-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-114-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-114-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-114-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-115-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-115-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-116-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-116-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-117-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-117-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-118-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-118-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-119-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-119-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-120-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-120-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-120-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-121-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-121-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-122-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-122-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-122-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-123-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-123-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-123-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-124-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-124-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-125-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-125-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-126-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-126-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-127-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-127-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-127-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-128-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-128-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-129-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-129-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-129-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-130-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-130-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-131-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-131-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-132-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-132-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-133-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-133-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-133-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-134-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-134-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-134-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-135-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-135-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-135-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-136-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-136-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-137-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-137-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-137-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-138-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-138-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-139-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-139-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-140-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-140-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-141-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-141-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-141-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-142-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-142-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-142-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-143-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-143-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-144-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-144-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-144-00002.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-145-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-145-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-145-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-146-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-146-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-146-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-147-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-147-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-147-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-148-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-148-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-149-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-149-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-149-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-150-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-150-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-150-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-150-00003.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-151-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-151-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-152-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-152-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-153-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-153-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-154-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-154-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-155-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-155-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-155-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-156-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-156-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-157-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-157-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-157-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-158-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-158-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-159-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-159-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-160-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-160-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-161-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-161-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-161-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-162-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-162-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-163-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-163-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-164-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-164-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-165-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-165-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-165-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-166-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-166-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-166-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-167-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-167-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-167-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-168-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-168-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-169-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-169-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-170-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-170-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-171-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-171-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-172-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-172-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-173-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-173-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-173-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-174-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-174-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-174-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-175-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-175-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-175-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-176-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-176-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-176-00002.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-177-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-177-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-178-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-178-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-179-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-179-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-180-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-180-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-181-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-181-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-182-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-182-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-182-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-183-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-183-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-183-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-184-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-184-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-185-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-185-00001.npy + - 
r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-185-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-186-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-186-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-186-00002.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-187-00000.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-187-00001.npy + - r2://olmo-data/preprocessed/olmo-mix/v1_5-sample/gpt-neox-20b-pii-special/part-187-00002.npy