
polars quantify_mutations issue #630

@wlason

Description of the bug

ddl.pp.quantify_mutations works fine when split_locus=False, but errors when split_locus=True.
I think this again has to do with polars data types. This is my vdj after running ddl.pp.quantify_mutations(vdj); it has those strange-looking _right columns, which I think come from an incorrect join:

Lazy Dandelion object with n_obs = 3665 and n_contigs = 6998
    data: sequence_id, sequence, rev_comp, productive, v_call, d_call, j_call, sequence_alignment, germline_alignment, junction, junction_aa, v_cigar, d_cigar, j_cigar, stop_codon, vj_in_frame, locus, c_call, junction_length, np1_length, np2_length, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, v_score, v_identity, v_support, d_score, d_identity, d_support, j_score, j_identity, j_support, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3, cell_id, consensus_count, umi_count, c_sequence_alignment, c_germline_alignment, c_sequence_start, c_sequence_end, c_score, c_identity, junction_aa_length, fwr1_aa, fwr2_aa, fwr3_aa, fwr4_aa, cdr1_aa, cdr2_aa, cdr3_aa, sequence_alignment_aa, v_sequence_alignment_aa, d_sequence_alignment_aa, j_sequence_alignment_aa, clone_id, v_call_10x, d_call_10x, j_call_10x, junction_10x, junction_10x_aa, j_support_igblastn, j_score_igblastn, j_call_igblastn, j_call_blastn, j_identity_blastn, j_alignment_length_blastn, j_number_of_mismatches_blastn, j_number_of_gap_openings_blastn, j_sequence_start_blastn, j_sequence_end_blastn, j_germline_start_blastn, j_germline_end_blastn, j_support_blastn, j_score_blastn, j_sequence_alignment_blastn, j_germline_alignment_blastn, d_support_igblastn, d_score_igblastn, d_call_igblastn, d_call_blastn, d_identity_blastn, d_alignment_length_blastn, d_number_of_mismatches_blastn, d_number_of_gap_openings_blastn, d_sequence_start_blastn, d_sequence_end_blastn, d_germline_start_blastn, d_germline_end_blastn, d_support_blastn, d_score_blastn, d_sequence_alignment_blastn, d_germline_alignment_blastn, d_source, v_call_genotyped, germline_alignment_d_mask, sample_id, c_call_10x, j_call_multimappers, j_call_multiplicity, j_call_sequence_start_multimappers, j_call_sequence_end_multimappers, j_call_support_multimappers, mu_count, mu_freq, rearrangement_status, extra, ambiguous, 
mu_freq_right, mu_count_right
    metadata: cell_id, mu_freq, mu_count
    layout: layout for 3666 vertices, layout for 523 vertices
    graph: networkx graph of 3666 vertices, networkx graph of 523 vertices 
    distances: distance matrix of shape (3666, 3666)

Minimal reproducible example

vdj.store_germline_reference(
    corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
    germline=".../database/germlines/imgt/human/vdj",
    org="human",
)

ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
ddl.pp.quantify_mutations(vdj, split_locus=True)

The error message produced by the code above

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[146], line 8
      1 vdj.store_germline_reference(
      2     corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
      3     germline=".../database/germlines/imgt/human/vdj",
      4     org="human",
      5 )
      7 ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
----> 8 ddl.pp.quantify_mutations(vdj, split_locus=True)
      9 ddl.tl.transfer(adata, vdj)

File .../dandelion/lib/python3.14/site-packages/dandelion/external/immcantation/polars/shazam_polars.py:196, in quantify_mutations(data, split_locus, sequence_column, germline_column, region_definition, mutation_definition, frequency, combine, **kwargs)
    182     if pd_df[col].dtype == object:
    183         pd_df[col] = pd_df[col].apply(
    184             lambda x: (
    185                 None
   (...)    194             )
    195         )
--> 196 r_out_pl = pl.from_pandas(pd_df)
    198 if isinstance(data, DandelionPolars):
    199     # Append new columns to data._data via sequence_id join
    200     base_df = data._data

File .../dandelion/lib/python3.14/site-packages/polars/convert/general.py:707, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
    704     return wrap_s(pandas_to_pyseries("", data, nan_to_null=nan_to_null))
    705 elif isinstance(data, pd.DataFrame):
    706     return wrap_df(
--> 707         pandas_to_pydf(
    708             data,
    709             schema_overrides=schema_overrides,
    710             rechunk=rechunk,
    711             nan_to_null=nan_to_null,
    712             include_index=include_index,
    713         )
    714     )
    715 else:
    716     msg = f"expected pandas DataFrame or Series, got {qualified_type_name(data)!r}"

File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/dataframe.py:1138, in pandas_to_pydf(data, schema, schema_overrides, strict, rechunk, nan_to_null, include_index)
   1129         arrow_dict[str(idxcol)] = plc.pandas_series_to_arrow(
   1130             # get_level_values accepts `int | str`
   1131             # but `index.names` returns `Hashable`
   (...)   1134             length=length,
   1135         )
   1137 for col_idx, col_data in data.items():
-> 1138     arrow_dict[str(col_idx)] = plc.pandas_series_to_arrow(
   1139         col_data, nan_to_null=nan_to_null, length=length
   1140     )
   1142 arrow_table = pa.table(arrow_dict)
   1143 return arrow_to_pydf(
   1144     arrow_table,
   1145     schema=schema,
   (...)   1148     rechunk=rechunk,
   1149 )

File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/other.py:39, in pandas_series_to_arrow(values, length, nan_to_null)
     37 first_non_none = get_first_non_none(values.values)  # type: ignore[arg-type]
     38 if isinstance(first_non_none, str):
---> 39     return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
     40 elif first_non_none is None:
     41     return pa.nulls(length or len(values), pa.large_utf8())

File .../dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:365, in pyarrow.lib.array()

File .../dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:91, in pyarrow.lib._ndarray_to_array()

File .../dandelion/lib/python3.14/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: Expected bytes, got a 'int' object

OS information

HPC

Version information

dandelion==1.0.0a1.dev8 pandas==2.3.3 numpy==2.3.5 matplotlib==3.10.8 networkx==3.6.1 scipy==1.17.1

Additional context

No response
