Skip to content

Diffusivity uncertainty estimates are invalid or systematically underestimated #449

@bjmorgan

Description

@bjmorgan

Summary

The DiffusionAnalyzer class returns uncertainty estimates (diffusivity_std_dev, etc.) that may significantly underestimate the true statistical uncertainty in the estimated diffusion coefficient. This affects both smoothed (default) and non-smoothed analyses, though for different reasons.

Issue 1: Smoothed analysis (default)

The default analysis (smoothed="max") returns uncertainty values that the docstring notes "make sense only for non-smoothed analyses." However, since these values are returned without warning, users may use them without reading the docstring carefully.

Issue 2: Non-smoothed analysis

For non-smoothed analysis (smoothed=False), the uncertainty is calculated using the standard OLS formula for variance in slope (lines 969-979), which assumes independent, identically distributed residuals. However, MSD data points are serially correlated, as the displacement at time $t_2$ is not independent of the displacement at $t_1$ < $t_2$. This correlation means the OLS standard error formula systematically underestimates the true uncertainty. See e.g., McCluskey et al., J. Chem. Theory Comput. 2024, 21, 1, 79–87 (https://doi.org/10.1021/acs.jctc.4c01249) for discussion of this issue.

Issue 3: WLS weighting

The smoothed analysis uses weighted least squares with w = 1/t (implemented by multiplying both sides of the regression by 1/\sqrt{t}). This accounts for the approximate reduction in the number of time origins at longer lag times, but does not account for the intrinsic scaling of MSD variance, which for an ideal random walk is \propto t^2. This means the current weighting is not formal WLS (which would require w \propto 1 / Var(MSD)). The code comment on line 953 describes this as "weight by variance", which may be misleading.

Impact

Users relying on the reported uncertainties may underestimate the true statistical uncertainty in their diffusion coefficients, potentially leading to over-confident conclusions.

Possible remediation

One relatively simple-to-implement option would be to replace the uncertainty attributes (diffusivity_std_dev, etc.) with @property methods that raise informative warnings or errors when accessed, explaining the limitation and directing users to alternative approaches, such as:

  • Estimating uncertainties from multiple independent simulations
  • Using packages designed for accurate estimation of uncertainties from MSD regression, such as kinisi

This would be a breaking change, but would prevent users from unknowingly using unreliable uncertainty estimates.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions