Summary
The DiffusionAnalyzer class returns uncertainty estimates (diffusivity_std_dev, etc.) that may significantly underestimate the true statistical uncertainty in the estimated diffusion coefficient. This affects both smoothed (default) and non-smoothed analyses, though for different reasons.
Issue 1: Smoothed analysis (default)
The default analysis (smoothed="max") returns uncertainty values that the docstring notes "make sense only for non-smoothed analyses." However, since these values are returned without warning, users may use them without reading the docstring carefully.
Issue 2: Non-smoothed analysis
For non-smoothed analysis (smoothed=False), the uncertainty is calculated using the standard OLS formula for variance in slope (lines 969-979), which assumes independent, identically distributed residuals. However, MSD data points are serially correlated, as the displacement at time $t_2$ is not independent of the displacement at $t_1$ < $t_2$. This correlation means the OLS standard error formula systematically underestimates the true uncertainty. See e.g., McCluskey et al., J. Chem. Theory Comput. 2024, 21, 1, 79–87 (https://doi.org/10.1021/acs.jctc.4c01249) for discussion of this issue.
Issue 3: WLS weighting
The smoothed analysis uses weighted least squares with w = 1/t (implemented by multiplying both sides of the regression by 1/\sqrt{t}). This accounts for the approximate reduction in the number of time origins at longer lag times, but does not account for the intrinsic scaling of MSD variance, which for an ideal random walk is \propto t^2. This means the current weighting is not formal WLS (which would require w \propto 1 / Var(MSD)). The code comment on line 953 describes this as "weight by variance", which may be misleading.
Impact
Users relying on the reported uncertainties may underestimate the true statistical uncertainty in their diffusion coefficients, potentially leading to over-confident conclusions.
Possible remediation
One relatively simple-to-implement option would be to replace the uncertainty attributes (diffusivity_std_dev, etc.) with @property methods that raise informative warnings or errors when accessed, explaining the limitation and directing users to alternative approaches, such as:
- Estimating uncertainties from multiple independent simulations
- Using packages designed for accurate estimation of uncertainties from MSD regression, such as kinisi
This would be a breaking change, but would prevent users from unknowingly using unreliable uncertainty estimates.
Summary
The
DiffusionAnalyzerclass returns uncertainty estimates (diffusivity_std_dev, etc.) that may significantly underestimate the true statistical uncertainty in the estimated diffusion coefficient. This affects both smoothed (default) and non-smoothed analyses, though for different reasons.Issue 1: Smoothed analysis (default)
The default analysis (
smoothed="max") returns uncertainty values that the docstring notes "make sense only for non-smoothed analyses." However, since these values are returned without warning, users may use them without reading the docstring carefully.Issue 2: Non-smoothed analysis
For non-smoothed analysis ($t_2$ is not independent of the displacement at $t_1$ < $t_2$ . This correlation means the OLS standard error formula systematically underestimates the true uncertainty. See e.g., McCluskey et al., J. Chem. Theory Comput. 2024, 21, 1, 79–87 (https://doi.org/10.1021/acs.jctc.4c01249) for discussion of this issue.
smoothed=False), the uncertainty is calculated using the standard OLS formula for variance in slope (lines 969-979), which assumes independent, identically distributed residuals. However, MSD data points are serially correlated, as the displacement at timeIssue 3: WLS weighting
The smoothed analysis uses weighted least squares with w = 1/t (implemented by multiplying both sides of the regression by 1/\sqrt{t}). This accounts for the approximate reduction in the number of time origins at longer lag times, but does not account for the intrinsic scaling of MSD variance, which for an ideal random walk is \propto t^2. This means the current weighting is not formal WLS (which would require w \propto 1 / Var(MSD)). The code comment on line 953 describes this as "weight by variance", which may be misleading.
Impact
Users relying on the reported uncertainties may underestimate the true statistical uncertainty in their diffusion coefficients, potentially leading to over-confident conclusions.
Possible remediation
One relatively simple-to-implement option would be to replace the uncertainty attributes (
diffusivity_std_dev, etc.) with@propertymethods that raise informative warnings or errors when accessed, explaining the limitation and directing users to alternative approaches, such as:This would be a breaking change, but would prevent users from unknowingly using unreliable uncertainty estimates.