
polars quantify_mutations issue #630

@wlason

Description of the bug

ddl.pp.quantify_mutations works fine when split_locus=False, but errors when split_locus=True.
I think this again has to do with polars data types. This is my vdj after running ddl.pp.quantify_mutations(vdj); it has those strange-looking _right columns, which I think come from an incorrect join:

Lazy Dandelion object with n_obs = 3665 and n_contigs = 6998
    data: sequence_id, sequence, rev_comp, productive, v_call, d_call, j_call, sequence_alignment, germline_alignment, junction, junction_aa, v_cigar, d_cigar, j_cigar, stop_codon, vj_in_frame, locus, c_call, junction_length, np1_length, np2_length, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, v_score, v_identity, v_support, d_score, d_identity, d_support, j_score, j_identity, j_support, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3, cell_id, consensus_count, umi_count, c_sequence_alignment, c_germline_alignment, c_sequence_start, c_sequence_end, c_score, c_identity, junction_aa_length, fwr1_aa, fwr2_aa, fwr3_aa, fwr4_aa, cdr1_aa, cdr2_aa, cdr3_aa, sequence_alignment_aa, v_sequence_alignment_aa, d_sequence_alignment_aa, j_sequence_alignment_aa, clone_id, v_call_10x, d_call_10x, j_call_10x, junction_10x, junction_10x_aa, j_support_igblastn, j_score_igblastn, j_call_igblastn, j_call_blastn, j_identity_blastn, j_alignment_length_blastn, j_number_of_mismatches_blastn, j_number_of_gap_openings_blastn, j_sequence_start_blastn, j_sequence_end_blastn, j_germline_start_blastn, j_germline_end_blastn, j_support_blastn, j_score_blastn, j_sequence_alignment_blastn, j_germline_alignment_blastn, d_support_igblastn, d_score_igblastn, d_call_igblastn, d_call_blastn, d_identity_blastn, d_alignment_length_blastn, d_number_of_mismatches_blastn, d_number_of_gap_openings_blastn, d_sequence_start_blastn, d_sequence_end_blastn, d_germline_start_blastn, d_germline_end_blastn, d_support_blastn, d_score_blastn, d_sequence_alignment_blastn, d_germline_alignment_blastn, d_source, v_call_genotyped, germline_alignment_d_mask, sample_id, c_call_10x, j_call_multimappers, j_call_multiplicity, j_call_sequence_start_multimappers, j_call_sequence_end_multimappers, j_call_support_multimappers, mu_count, mu_freq, rearrangement_status, extra, ambiguous, 
mu_freq_right, mu_count_right
    metadata: cell_id, mu_freq, mu_count
    layout: layout for 3666 vertices, layout for 523 vertices
    graph: networkx graph of 3666 vertices, networkx graph of 523 vertices 
    distances: distance matrix of shape (3666, 3666)

Minimal reproducible example

vdj.store_germline_reference(
    corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
    germline=".../database/germlines/imgt/human/vdj",
    org="human",
)

ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
ddl.pp.quantify_mutations(vdj, split_locus=True)

The error message produced by the code above

---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[146], line 8
      1 vdj.store_germline_reference(
      2     corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
      3     germline=".../database/germlines/imgt/human/vdj",
      4     org="human",
      5 )
      7 ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
----> 8 ddl.pp.quantify_mutations(vdj, split_locus=True)
      9 ddl.tl.transfer(adata, vdj)

File .../dandelion/lib/python3.14/site-packages/dandelion/external/immcantation/polars/shazam_polars.py:196, in quantify_mutations(data, split_locus, sequence_column, germline_column, region_definition, mutation_definition, frequency, combine, **kwargs)
    182     if pd_df[col].dtype == object:
    183         pd_df[col] = pd_df[col].apply(
    184             lambda x: (
    185                 None
   (...)    194             )
    195         )
--> 196 r_out_pl = pl.from_pandas(pd_df)
    198 if isinstance(data, DandelionPolars):
    199     # Append new columns to data._data via sequence_id join
    200     base_df = data._data

File .../dandelion/lib/python3.14/site-packages/polars/convert/general.py:707, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
    704     return wrap_s(pandas_to_pyseries("", data, nan_to_null=nan_to_null))
    705 elif isinstance(data, pd.DataFrame):
    706     return wrap_df(
--> 707         pandas_to_pydf(
    708             data,
    709             schema_overrides=schema_overrides,
    710             rechunk=rechunk,
    711             nan_to_null=nan_to_null,
    712             include_index=include_index,
    713         )
    714     )
    715 else:
    716     msg = f"expected pandas DataFrame or Series, got {qualified_type_name(data)!r}"

File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/dataframe.py:1138, in pandas_to_pydf(data, schema, schema_overrides, strict, rechunk, nan_to_null, include_index)
   1129         arrow_dict[str(idxcol)] = plc.pandas_series_to_arrow(
   1130             # get_level_values accepts `int | str`
   1131             # but `index.names` returns `Hashable`
   (...)   1134             length=length,
   1135         )
   1137 for col_idx, col_data in data.items():
-> 1138     arrow_dict[str(col_idx)] = plc.pandas_series_to_arrow(
   1139         col_data, nan_to_null=nan_to_null, length=length
   1140     )
   1142 arrow_table = pa.table(arrow_dict)
   1143 return arrow_to_pydf(
   1144     arrow_table,
   1145     schema=schema,
   (...)   1148     rechunk=rechunk,
   1149 )

File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/other.py:39, in pandas_series_to_arrow(values, length, nan_to_null)
     37 first_non_none = get_first_non_none(values.values)  # type: ignore[arg-type]
     38 if isinstance(first_non_none, str):
---> 39     return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
     40 elif first_non_none is None:
     41     return pa.nulls(length or len(values), pa.large_utf8())

File .../dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:365, in pyarrow.lib.array()

File .../dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:91, in pyarrow.lib._ndarray_to_array()

File .../dandelion/lib/python3.14/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: Expected bytes, got a 'int' object

OS information

HPC

Version information

dandelion==1.0.0a1.dev8 pandas==2.3.3 numpy==2.3.5 matplotlib==3.10.8 networkx==3.6.1 scipy==1.17.1

Additional context

No response
