Expose memmap dtype in data config#594
Expose memmap dtype in data config#5942015aroras merged 8 commits intoallenai:mainfrom NeuralFabricAI:lx/expose-data-dtype
Conversation
2015aroras
left a comment
There was a problem hiding this comment.
Thank you for the change! Just left some small comments.
olmo/data/__init__.py
Outdated
| return MemMapDataset( | ||
| *paths, | ||
| chunk_size=train_config.model.max_sequence_length, | ||
| memmap_dtype=train_config.data.effective_memmap_dtype, |
There was a problem hiding this comment.
This method is also used to setup the memmaps for evaluation. In that case, the data_config is not the same as train_config.data. We should respect the setting in data_config setting.
| memmap_dtype=train_config.data.effective_memmap_dtype, | |
| memmap_dtype=data_config.effective_memmap_dtype, |
olmo/config.py
Outdated
| @dataclass | ||
| class DataConfig(BaseConfig): | ||
| paths: Optional[List[str]] = None | ||
| memmap_dtype: Optional[str] = "uint16" |
There was a problem hiding this comment.
Could you make this non-optional? I don't think None is useful here.
| memmap_dtype: Optional[str] = "uint16" | |
| memmap_dtype: str = "uint16" |
|
@2015aroras Thanks for the review. Updated the PR to address the comment |
2015aroras
left a comment
There was a problem hiding this comment.
There are style issue due to imports. Make sure to follow the steps here so that required automatic checks pass.
@2015aroras Went though instructions and added all the missing things. Can I get another approval so that it kicks off all the auto checks? |
This is expose dtype in the data config so that we can support reading memmap files with different dtypes