
[BUG] seed is unsafe in TF parallel training #4440

@njzjz


Bug summary

Per https://numpy.org/doc/stable/reference/random/parallel.html#sequence-of-integer-seeds:

> For example, it is common to see users add the worker ID to the root seed, especially with the legacy RandomState code.
>
> ```python
> # UNSAFE! Do not do this!
> worker_seed = root_seed + worker_id
> rng = np.random.RandomState(worker_seed)
> ```
>
> It is true that for any one run of a parallel program constructed this way, each worker will have distinct streams. However, it is quite likely that multiple invocations of the program with different seeds will get overlapping sets of worker seeds. It is not uncommon (in the author’s self-experience) to change the root seed merely by an increment or two when doing these repeat runs. If the worker seeds are also derived by small increments of the worker ID, then subsets of the workers will return identical results, causing a bias in the overall ensemble of results.
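The same NumPy page recommends supplying the worker ID and the root seed together as a sequence of integers, so the pair is hashed into an independent stream rather than summed. A minimal sketch of that pattern (the `root_seed` and `worker_id` values here are placeholders):

```python
import numpy as np

root_seed = 42  # the user-chosen seed
worker_id = 3   # e.g. the rank of this worker

# Safe: pass the worker ID and root seed as a *sequence* of integers.
# default_rng feeds the list to SeedSequence, which hashes the pair,
# so bumping the root seed between runs no longer collides with
# another worker's stream.
rng = np.random.default_rng([worker_id, root_seed])
```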

Unluckily, our TF code uses exactly this logic, as found in #4435 (comment):

```python
seed = jdata["training"].get("seed", None)
if seed is not None:
    # avoid the same batch sequence among workers
    seed += run_opt.my_rank
    seed = seed % (2**32)
dp_random.seed(seed)
```
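One possible fix, following the NumPy guidance quoted above, is to mix the rank into the seed through `SeedSequence` instead of adding it. This is only a sketch; it assumes `dp_random.seed()` keeps accepting a 32-bit integer as in the snippet above (`jdata`, `run_opt`, and `dp_random` are the names from that snippet):

```python
import numpy as np

seed = jdata["training"].get("seed", None)
if seed is not None:
    # Hash the rank together with the root seed via SeedSequence
    # instead of adding them, so repeated runs with nearby root
    # seeds do not reuse each other's per-worker streams.
    ss = np.random.SeedSequence([run_opt.my_rank, seed])
    seed = int(ss.generate_state(1)[0])  # one derived uint32 seed
dp_random.seed(seed)
```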

DeePMD-kit Version

devel

Backend and its version

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

See above

Steps to Reproduce

See above

Further Information, Files, and Links

No response
