Open
Labels
bug: Something isn't working
Description of the bug
ddl.pp.quantify_mutations works fine when split_locus=False, but raises an error when split_locus=True.
I think this again has to do with polars data types. This is my vdj after running ddl.pp.quantify_mutations(vdj). It has those strange-looking _right columns (mu_freq_right, mu_count_right), which I think come from an incorrect join:
Lazy Dandelion object with n_obs = 3665 and n_contigs = 6998
data: sequence_id, sequence, rev_comp, productive, v_call, d_call, j_call, sequence_alignment, germline_alignment, junction, junction_aa, v_cigar, d_cigar, j_cigar, stop_codon, vj_in_frame, locus, c_call, junction_length, np1_length, np2_length, v_sequence_start, v_sequence_end, v_germline_start, v_germline_end, d_sequence_start, d_sequence_end, d_germline_start, d_germline_end, j_sequence_start, j_sequence_end, j_germline_start, j_germline_end, v_score, v_identity, v_support, d_score, d_identity, d_support, j_score, j_identity, j_support, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3, cell_id, consensus_count, umi_count, c_sequence_alignment, c_germline_alignment, c_sequence_start, c_sequence_end, c_score, c_identity, junction_aa_length, fwr1_aa, fwr2_aa, fwr3_aa, fwr4_aa, cdr1_aa, cdr2_aa, cdr3_aa, sequence_alignment_aa, v_sequence_alignment_aa, d_sequence_alignment_aa, j_sequence_alignment_aa, clone_id, v_call_10x, d_call_10x, j_call_10x, junction_10x, junction_10x_aa, j_support_igblastn, j_score_igblastn, j_call_igblastn, j_call_blastn, j_identity_blastn, j_alignment_length_blastn, j_number_of_mismatches_blastn, j_number_of_gap_openings_blastn, j_sequence_start_blastn, j_sequence_end_blastn, j_germline_start_blastn, j_germline_end_blastn, j_support_blastn, j_score_blastn, j_sequence_alignment_blastn, j_germline_alignment_blastn, d_support_igblastn, d_score_igblastn, d_call_igblastn, d_call_blastn, d_identity_blastn, d_alignment_length_blastn, d_number_of_mismatches_blastn, d_number_of_gap_openings_blastn, d_sequence_start_blastn, d_sequence_end_blastn, d_germline_start_blastn, d_germline_end_blastn, d_support_blastn, d_score_blastn, d_sequence_alignment_blastn, d_germline_alignment_blastn, d_source, v_call_genotyped, germline_alignment_d_mask, sample_id, c_call_10x, j_call_multimappers, j_call_multiplicity, j_call_sequence_start_multimappers, j_call_sequence_end_multimappers, j_call_support_multimappers, mu_count, mu_freq, rearrangement_status, extra, ambiguous, 
mu_freq_right, mu_count_right
metadata: cell_id, mu_freq, mu_count
layout: layout for 3666 vertices, layout for 523 vertices
graph: networkx graph of 3666 vertices, networkx graph of 523 vertices
distances: distance matrix of shape (3666, 3666)
Minimal reproducible example
vdj.store_germline_reference(
    corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
    germline=".../database/germlines/imgt/human/vdj",
    org="human",
)
ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
ddl.pp.quantify_mutations(vdj, split_locus=True)
The error message produced by the code above
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
Cell In[146], line 8
1 vdj.store_germline_reference(
2 corrected=".../tigger/tigger_heavy_igblast_db-pass_genotype.fasta",
3 germline=".../database/germlines/imgt/human/vdj",
4 org="human",
5 )
7 ddl.pp.create_germlines(vdj, additional_args=["--vf", "v_call_genotyped"])
----> 8 ddl.pp.quantify_mutations(vdj, split_locus=True)
9 ddl.tl.transfer(adata, vdj)
File .../dandelion/lib/python3.14/site-packages/dandelion/external/immcantation/polars/shazam_polars.py:196, in quantify_mutations(data, split_locus, sequence_column, germline_column, region_definition, mutation_definition, frequency, combine, **kwargs)
182 if pd_df[col].dtype == object:
183 pd_df[col] = pd_df[col].apply(
184 lambda x: (
185 None
(...) 194 )
195 )
--> 196 r_out_pl = pl.from_pandas(pd_df)
198 if isinstance(data, DandelionPolars):
199 # Append new columns to data._data via sequence_id join
200 base_df = data._data
File .../dandelion/lib/python3.14/site-packages/polars/convert/general.py:707, in from_pandas(data, schema_overrides, rechunk, nan_to_null, include_index)
704 return wrap_s(pandas_to_pyseries("", data, nan_to_null=nan_to_null))
705 elif isinstance(data, pd.DataFrame):
706 return wrap_df(
--> 707 pandas_to_pydf(
708 data,
709 schema_overrides=schema_overrides,
710 rechunk=rechunk,
711 nan_to_null=nan_to_null,
712 include_index=include_index,
713 )
714 )
715 else:
716 msg = f"expected pandas DataFrame or Series, got {qualified_type_name(data)!r}"
File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/dataframe.py:1138, in pandas_to_pydf(data, schema, schema_overrides, strict, rechunk, nan_to_null, include_index)
1129 arrow_dict[str(idxcol)] = plc.pandas_series_to_arrow(
1130 # get_level_values accepts `int | str`
1131 # but `index.names` returns `Hashable`
(...) 1134 length=length,
1135 )
1137 for col_idx, col_data in data.items():
-> 1138 arrow_dict[str(col_idx)] = plc.pandas_series_to_arrow(
1139 col_data, nan_to_null=nan_to_null, length=length
1140 )
1142 arrow_table = pa.table(arrow_dict)
1143 return arrow_to_pydf(
1144 arrow_table,
1145 schema=schema,
(...) 1148 rechunk=rechunk,
1149 )
File .../dandelion/lib/python3.14/site-packages/polars/_utils/construction/other.py:39, in pandas_series_to_arrow(values, length, nan_to_null)
37 first_non_none = get_first_non_none(values.values) # type: ignore[arg-type]
38 if isinstance(first_non_none, str):
---> 39 return pa.array(values, pa.large_utf8(), from_pandas=nan_to_null)
40 elif first_non_none is None:
41 return pa.nulls(length or len(values), pa.large_utf8())
File /well/jknight-hinks/users/lwn344/conda/skylake/envs/dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:365, in pyarrow.lib.array()
File .../dandelion/lib/python3.14/site-packages/pyarrow/array.pxi:91, in pyarrow.lib._ndarray_to_array()
File .../dandelion/lib/python3.14/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowTypeError: Expected bytes, got a 'int' object
OS information
HPC
Version information
dandelion==1.0.0a1.dev8 pandas==2.3.3 numpy==2.3.5 matplotlib==3.10.8 networkx==3.6.1 scipy==1.17.1
Additional context
No response
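One way to narrow this down, in case it helps: the traceback suggests polars picked a string type for a column because its first non-null value is a str, and pyarrow then hit an int further down the same column. The sketch below is not part of dandelion. It is a plain-Python helper (the names mixed_type_columns and toy are made up for illustration) that reports which columns hold more than one Python type, ignoring None and NaN.

```python
import math
from collections import Counter


def _is_missing(value):
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mixed_type_columns(columns):
    """Return {column_name: Counter of type names} for columns mixing types.

    `columns` is any mapping of column name -> iterable of values.
    """
    report = {}
    for name, values in columns.items():
        # Count the Python type of every non-missing value in the column.
        kinds = Counter(type(v).__name__ for v in values if not _is_missing(v))
        if len(kinds) > 1:
            report[name] = kinds
    return report


# Toy data, invented for illustration: only "mixed_col" mixes str and int.
toy = {
    "sequence_id": ["contig_1", "contig_2", "contig_3"],
    "mu_count": [3, 0, 7],
    "mixed_col": ["IGH", 5, None],
}
print(mixed_type_columns(toy))
```

Assuming pd_df is the pandas DataFrame from the traceback, calling mixed_type_columns(dict(pd_df.items())) just before pl.from_pandas(pd_df) should list the columns that are candidates for the ArrowTypeError.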