Reintroduce set_atom_data with sparse by_species and by_index forms #65

@bjmorgan

Summary

StructureScene had a set_atom_data(key, values) method until commit 363d493, including a sparse form that accepted dict[int, value] and filled missing atoms with NaN (numeric) or "" (string). The sparse form was dropped when AtomData was introduced; users now have to build full-length arrays by hand, including NaN-filling atoms they do not care about.

Reintroduce set_atom_data() as a method on StructureScene, with sparse forms keyed by species label and/or atom index.

Proposed API

scene.set_atom_data(
    key,
    values=None,
    *,
    by_species=None,   # dict[str, scalar | 1-D | 2-D array]
    by_index=None,     # dict[int, scalar | 1-D array]
)
  • values (positional): a full-length array-like. No longer accepts dict[int, value] -- use by_index= instead.
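  • by_species: maps species labels to values. Scalars broadcast across all atoms of that species; 1-D arrays give explicit per-atom values (length = count of atoms of that species); 2-D arrays of shape (n_frames, count) give per-frame, per-atom values.
  • by_index: maps specific atom indices to values. A scalar gives a static value for that atom; a 1-D array of shape (n_frames,) gives a per-frame value.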
  • Pass exactly one of: values, or a non-empty combination of by_species and by_index. Mixing values with either sparse keyword is an error, as is omitting all three.
  • Returns None.
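
The argument rules above can be sketched as a small standalone check. This is only an illustration: resolve_mode is a hypothetical helper, not part of the proposal, and the issue does not say which exception type these errors should use or how empty dicts should be treated. ValueError and "empty dicts count as omitted" are assumptions here.

```python
def resolve_mode(values=None, *, by_species=None, by_index=None):
    """Return "dense" or "sparse", or raise for an invalid combination.

    Hypothetical helper; ValueError is an assumed exception type.
    """
    if values is not None and (by_species is not None or by_index is not None):
        raise ValueError("values cannot be combined with by_species or by_index")
    if values is not None:
        return "dense"
    # Assumption: an empty dict is treated the same as an omitted argument.
    if not by_species and not by_index:
        raise ValueError("pass values, by_species, or by_index")
    return "sparse"


print(resolve_mode([0.1, 0.2, 0.3]))
print(resolve_mode(by_species={"Mn": 2.0}, by_index={3: 1.9}))
```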

Unspecified atoms

  • Numeric data: fill with NaN.
  • Categorical (string) data: fill with None, stored as an object-dtype array. A unicode (<U...) array cannot hold None (it coerces to the literal string "None"), so the implementation must explicitly build object-dtype arrays for categorical input. _is_categorical_missing already treats None as missing.
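
The difference between the two categorical storage choices can be seen with plain NumPy. The array names and the "<U8" width below are illustrative only, not taken from the codebase.

```python
import numpy as np

n_atoms = 4

# Numeric data: start from an all-NaN float array, then set the known atoms.
charges = np.full(n_atoms, np.nan)
charges[1] = 2.0

# Fixed-width unicode array: assigning None stores the text "None".
labels_unicode = np.full(n_atoms, "", dtype="<U8")
labels_unicode[0] = None

# Object array: None survives as a genuine missing marker.
labels_object = np.full(n_atoms, None, dtype=object)
labels_object[1] = "defect"

print(repr(labels_unicode[0]))
print(labels_object[0] is None)
```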

1-D vs 2-D inference

  • Output is 1-D (n_atoms,) unless something promotes it to 2-D (n_frames, n_atoms).
  • A by_species value with shape (n_frames, n_species_atoms) promotes.
  • A by_index value with shape (n_frames,) promotes.
  • When promoted, scalar and 1-D by_species values broadcast across the frame axis.

Ambiguous case: if n_frames == n_species_atoms, a 1-D by_species value of that length could mean "static per-atom" or "per-frame trajectory shared across the species". Rule: 1-D by_species is always interpreted as per-atom static. Users wanting a shared per-frame trajectory across a species must pass an explicit 2-D array (for example via np.broadcast_to).
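
To make the ambiguous case concrete, here is a small NumPy sketch with n_frames deliberately equal to n_species_atoms. The variable names are illustrative. It shows how the same three numbers land differently depending on whether they are read as per-atom static values or as a shared per-frame trajectory.

```python
import numpy as np

n_frames, n_species_atoms = 3, 3  # deliberately equal: the ambiguous case

numbers = np.array([0.0, 0.5, 1.0])

# Proposed rule: a 1-D by_species value is per-atom and static. If the output
# is promoted to 2-D, every frame (row) repeats the same per-atom values.
static = np.broadcast_to(numbers, (n_frames, n_species_atoms))

# A trajectory shared across the species must be passed as an explicit 2-D
# array: each frame (row) holds one value, repeated for every atom.
shared = np.broadcast_to(numbers[:, np.newaxis], (n_frames, n_species_atoms))

print(static[1])  # frame 1, per-atom reading
print(shared[1])  # frame 1, shared-trajectory reading
```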

Precedence

If a species appears in by_species and an atom of that species also appears in by_index, the by_index value wins. That is the point of allowing both: "all Mn atoms get charge 2.0, except atom 3 which is a defect site at 1.9".
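
One way to get this precedence is to apply by_species first and by_index second, so the index-level value overwrites the species-level one. The sketch below uses plain Python lists and a made-up five-atom species list. It illustrates the ordering only, not the proposed implementation.

```python
import math

species = ["Mn", "O", "Mn", "Mn", "O"]  # hypothetical species label per atom
by_species = {"Mn": 2.0}
by_index = {3: 1.9}                     # atom 3 is an Mn defect site

# Start with every atom unspecified (NaN).
charges = [math.nan] * len(species)

# Apply species-level values first ...
for i, label in enumerate(species):
    if label in by_species:
        charges[i] = by_species[label]

# ... then index-level values, which overwrite them.
for i, value in by_index.items():
    charges[i] = value

print(charges)
```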

Validation

  • Unknown species labels in by_species raise ValueError.
  • Out-of-range or negative indices in by_index raise ValueError.
  • Shape mismatches raise ValueError with the expected shape in the message.
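
A minimal sketch of these checks, assuming the species labels are available as a list with one entry per atom. validate_sparse is a hypothetical name, the error messages are invented for illustration, and only scalar and 1-D list values are handled to keep it short.

```python
def validate_sparse(species, by_species=None, by_index=None):
    """Raise ValueError for unknown labels, bad indices, or wrong lengths."""
    n_atoms = len(species)
    for label, value in (by_species or {}).items():
        count = species.count(label)
        if count == 0:
            raise ValueError(f"unknown species label {label!r}")
        # Only the 1-D per-atom case is checked in this sketch.
        if isinstance(value, (list, tuple)) and len(value) != count:
            raise ValueError(
                f"by_species[{label!r}] has shape ({len(value)},), "
                f"expected ({count},)"
            )
    for index in (by_index or {}):
        # Negative indices are rejected rather than wrapping around.
        if not 0 <= index < n_atoms:
            raise ValueError(
                f"atom index {index} is out of range for {n_atoms} atoms"
            )


species = ["Mn", "O", "Mn"]
validate_sparse(species, by_species={"Mn": [2.0, 1.9]}, by_index={1: -2.0})
```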

Constructor symmetry -- descoped

Original proposal: widen the constructor's atom_data parameter to accept the same sparse forms.

Descoped during design review: the two-step flow (StructureScene(...) then scene.set_atom_data(..., by_species=...)) is cleaner than any of the constructor-widening options considered (magic-key dicts, a new AtomDataSpec type, or dict-key-type sniffing). The constructor stays at dict[str, ArrayLike].

Changelog

This reintroduces a public API that existed until commit 363d493, so it wants a user-facing changelog entry.

Dependencies

Best landed after #64. The "build a full array and assign" pattern is cleaner when the resulting array is guaranteed immutable.

Labels: enhancement (new feature or request)