Expose memmap dtype in data config #594

Merged
2015aroras merged 8 commits into allenai:main from NeuralFabricAI:lx/expose-data-dtype
Jun 7, 2024

Contributor

leon-g-xu commented May 24, 2024

This exposes the memmap dtype in the data config so that we can support reading memmap files with different dtypes.
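
To illustrate why the dtype needs to be configurable, here is a minimal sketch (not code from this PR) of what happens when a raw token file is read with a dtype other than the one it was written with. The file name `tokens.bin` and the token IDs are assumptions made up for the example; only `np.memmap` and `ndarray.tofile` are real NumPy calls.

```python
import os
import tempfile

import numpy as np

# Hypothetical token IDs for illustration; 70000 exceeds the uint16 maximum of 65535.
token_ids = np.array([1, 2, 70000, 4], dtype=np.uint32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokens.bin")  # hypothetical file name
    token_ids.tofile(path)  # raw bytes with no header, so the dtype is not recorded

    # Matching dtype: the original IDs come back unchanged.
    mm32 = np.memmap(path, dtype=np.uint32, mode="r")
    as_uint32 = np.array(mm32)  # copy the data out of the mapping
    del mm32  # release the mapping before the directory is removed

    # Mismatched dtype: the same 16 bytes are reinterpreted as two-byte values.
    mm16 = np.memmap(path, dtype=np.uint16, mode="r")
    as_uint16 = np.array(mm16)
    del mm16

print(as_uint32.tolist())
print(len(as_uint16))
```

Because the file carries no header, the reader has to be told the dtype. With the wrong one, the four uint32 tokens are silently read as eight uint16 values, which is the kind of mismatch a configurable `memmap_dtype` lets users avoid.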

2015aroras self-requested a review June 6, 2024 22:53
Collaborator

2015aroras left a comment

Thank you for the change! Just left some small comments.

return MemMapDataset(
    *paths,
    chunk_size=train_config.model.max_sequence_length,
    memmap_dtype=train_config.data.effective_memmap_dtype,
Collaborator

This method is also used to set up the memmaps for evaluation. In that case, data_config is not the same as train_config.data, so we should respect the setting in data_config.

Suggested change
- memmap_dtype=train_config.data.effective_memmap_dtype,
+ memmap_dtype=data_config.effective_memmap_dtype,

Contributor Author

Updated
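
The point raised in this thread can be sketched as follows. This is a simplified, hypothetical model rather than the actual OLMo code: `DataConfigSketch`, `TrainConfigSketch`, `eval_data`, and `dataset_kwargs` are names invented for the example, and the dtype string is passed through directly instead of going through the `effective_memmap_dtype` attribute seen in the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DataConfigSketch:
    """Hypothetical stand-in for a per-dataset config."""

    paths: List[str] = field(default_factory=list)
    memmap_dtype: str = "uint16"


@dataclass
class TrainConfigSketch:
    """Hypothetical stand-in holding separate training and evaluation data configs."""

    data: DataConfigSketch
    eval_data: DataConfigSketch
    max_sequence_length: int = 2048


def dataset_kwargs(train_config: TrainConfigSketch, data_config: DataConfigSketch) -> Dict[str, Any]:
    """Collect dataset arguments for whichever data config is passed in."""
    return {
        "paths": list(data_config.paths),
        # The sequence length is a model-wide setting, so it comes from train_config.
        "chunk_size": train_config.max_sequence_length,
        # The dtype belongs to the files being read, so it comes from data_config,
        # not from train_config.data.
        "memmap_dtype": data_config.memmap_dtype,
    }


config = TrainConfigSketch(
    data=DataConfigSketch(paths=["train.bin"], memmap_dtype="uint16"),
    eval_data=DataConfigSketch(paths=["eval.bin"], memmap_dtype="uint32"),
)
train_kwargs = dataset_kwargs(config, config.data)
eval_kwargs = dataset_kwargs(config, config.eval_data)
print(train_kwargs["memmap_dtype"], eval_kwargs["memmap_dtype"])
```

Reading the dtype from `train_config.data` would have applied the training dtype to the evaluation files as well, which is the mismatch the suggested change avoids.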

olmo/config.py Outdated
@dataclass
class DataConfig(BaseConfig):
    paths: Optional[List[str]] = None
    memmap_dtype: Optional[str] = "uint16"
Collaborator

Could you make this non-optional? I don't think None is useful here.

Suggested change
- memmap_dtype: Optional[str] = "uint16"
+ memmap_dtype: str = "uint16"

Contributor Author

Updated
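
As a rough illustration of the non-optional field, the sketch below shows one way a config could reject `None` and unknown dtype names up front. The class name `DataConfigSketch`, the `ALLOWED_MEMMAP_DTYPES` table, and the `bytes_per_token` property are all assumptions for this example; the PR does not say which dtypes are accepted or how they are validated.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical whitelist mapping dtype names to item sizes in bytes.
ALLOWED_MEMMAP_DTYPES = {"uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8}


@dataclass
class DataConfigSketch:
    """Hypothetical config with a non-optional dtype field."""

    paths: Optional[List[str]] = None
    memmap_dtype: str = "uint16"  # always a string; None carries no useful meaning

    def __post_init__(self) -> None:
        # Fail early on None or on names outside the (assumed) whitelist.
        if self.memmap_dtype not in ALLOWED_MEMMAP_DTYPES:
            raise ValueError(f"Unsupported memmap dtype: {self.memmap_dtype!r}")

    @property
    def bytes_per_token(self) -> int:
        """Size of one stored token in bytes for the configured dtype."""
        return ALLOWED_MEMMAP_DTYPES[self.memmap_dtype]


default_config = DataConfigSketch()
wide_config = DataConfigSketch(memmap_dtype="uint32")
print(default_config.bytes_per_token, wide_config.bytes_per_token)
```

With a plain `str` field and a default, callers never need to handle a missing dtype, and an invalid value surfaces when the config is built rather than when the data is read.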

@leon-g-xu
Contributor Author

@2015aroras Thanks for the review. I updated the PR to address the comments.

Collaborator

2015aroras left a comment

There are style issues due to imports. Make sure to follow the steps here so that the required automatic checks pass.

Contributor Author

leon-g-xu commented Jun 7, 2024

> There are style issues due to imports. Make sure to follow the steps here so that the required automatic checks pass.

@2015aroras Went through the instructions and added everything that was missing. Can I get another approval so that it kicks off all the automatic checks?

2015aroras merged commit 2639279 into allenai:main Jun 7, 2024
leon-g-xu deleted the lx/expose-data-dtype branch June 18, 2024 17:48