Use KDTree instead of mdtraj.compute_contacts in _filter_unphysical_traj_masks to mitigate MemoryError and improve performance by ahmedselim2017 · Pull Request #158 · microsoft/bioemu

ahmedselim2017 · 2025-07-30T14:44:47Z

Hi, first of all thanks for the project!

I was running bioemu for a structure with 994 residues for 10,000 samples. The model and side chain generation ran successfully but I got MemoryError: Unable to allocate 445. GiB for an array with shape (10000, 11954402) and data type float32 in the save_pdb_and_xtc function.

After some digging, I found that the error was caused by the mdtraj.compute_contacts function that is used to find clashes in the _filter_unphysical_traj_masks function. As the mdtraj.compute_contacts function works by first calculating the distances between atoms for each frame of the trajectory (ignoring i+1th and i+2th residues for residue i), it tries to create a huge array with a shape (10000, 11954402). To avoid creating this huge array, I processed each frame one by one.

Also, since we are not directly interested in the distances between all atom pairs but whether if there are any clashes, I have used KDTree to check if there are any clashes without directly computing the distance matrices, which improved the performance of finding clashes dramatically as you can see from the figure below where I found the clashes for the first N frames of the original trajectory. The mdtraj.compute_contacts can only find clashes up to 4000 frames as it gaves memory errors after that.

ahmedselim2017 · 2025-07-30T14:47:07Z

@microsoft-github-policy-service agree

franknoe · 2025-07-30T15:02:13Z

Thank you for your contribution @ahmedselim2017 ! We'll review this asap.

src/bioemu/convert_chemgraph.py

franknoe · 2025-07-30T20:24:13Z

I am guessing that for small systems, computing all pairs is more efficient and this approach wins beyond a certain system size. Can we test for a couple of systems of different sizes, and implement a switchover if this assumption is correct?

ahmedselim2017 · 2025-07-31T06:36:03Z

I have performed the same benchmark for another protein with 214 residues and KDTree was still faster (although the difference was smaller compared to the larger structure). If you have any other proteins in mind I can test them too, but I would need to run bioemu & side-chain generation first which may take some time.

franknoe · 2025-07-31T09:00:05Z

Thanks! For a simple benchmark you can just run repeats of "GGS", which is extremely flexible, and will also produce a lot of clashes for longer chains. Maybe 50 and 100 residues will be informative - looking at these plots I'm getting the impression that 100 residues may be a good switchover point.

Which GPU are you using?

ahmedselim2017 · 2025-07-31T14:26:32Z

No problem! For the other proteins, I have used A6000 but as it has some other queued jobs, I am using RTX4080 for 50 and 100 residue test runs.

ahmedselim2017 · 2025-08-04T15:04:07Z

The MD equilibrium phase of the GGS repeated sequences did not progressed 1 step in >~10 hours probably because of the unnatural sequence. I will restart the runs with real proteins but it will probably take a similar time.

franknoe · 2025-08-04T21:42:28Z

Oh, that's weird. Any idea why that may happen @josejimenezluna ?

I think the code shouldn't just get stuck when encountering a weird sequence. In case if this is a problem of not finding an alignment at least an error message should be returned. Can we create a separate issue for this?

ahmedselim2017 · 2025-08-06T11:16:01Z

The sidechain generation was crashed because of some issues with the mdtraj.join call in reconstruct_sidechains function. I have solved the issue in #160 but now need to regenerate sidechains, which will again take some time...

ahmedselim2017 · 2025-08-14T11:22:09Z

Hi again, I have ran the same benchmark with 50 and 100 residue proteins and as you have predicted for 100 residues or less KDTree was slower so we can use 100 residues as a cutoff.

Oh, that's weird. Any idea why that may happen @josejimenezluna ?

I think the code shouldn't just get stuck when encountering a weird sequence. In case if this is a problem of not finding an alignment at least an error message should be returned. Can we create a separate issue for this?

By the way I have came across the same issue in other runs that uses --filter_samples False too. Maybe the MD runs do not converge on the nonphysical samples?

franknoe · 2025-08-14T16:49:05Z

Sounds good! Leaving this to reviewed by @josejimenezluna after his return to office.

ahmedselim2017 · 2025-09-15T13:14:19Z

Hi, hope you are doing well! Are there any updates about the review? If you like, I can run some more tests but 100 residues looks like a good lower bound cutoff as KDTree is nearly the same with mdtraj.compute_contacts.

sarahnlewis · 2025-09-22T16:04:17Z

Thank you for the great work @ahmedselim2017. I wanted to add a test so I have made a branch from yours, where I added a test checking that the kdtree and mdtraj versions give the same answers for some junk coordinates. The branch is here: https://github.com/microsoft/bioemu/tree/sarahlewis/kdtree-tests. Could you please check if you're happy with those changes, if so, incorporate them in your branch, and I'll approve and merge your PR? I don't want to just make a PR directly from my branch because it feels like taking credit for your work.

One small thing I noticed (because of the hacky way I constructed the test trajectory) is that if a trajectory's 'time' attribute is too short, 'enumerate(traj)' does not iterate through all the frames. So I have changed your kdtree implementation to use 'range(len(traj))' instead.

ahmedselim2017 · 2025-09-22T19:36:24Z

Of course! Thanks for the feedback, added your changes.

sarahnlewis

Thanks for the nice work

ahmedselim2017 added 2 commits July 30, 2025 04:58

Use KDTree to compute contacts in _filter_unphysical_traj_masks

1e45c33

small fixes

1a0752b

josejimenezluna reviewed Jul 30, 2025

View reviewed changes

src/bioemu/convert_chemgraph.py Outdated Show resolved Hide resolved

ahmedselim2017 added 2 commits July 30, 2025 18:10

use mdtraj.utils.in_units_of to change units

b132873

style changes

99fcbc8

use KDTree only for more than 100 residues when finding clashes

a837744

refactoring, tests and trajectory enumeration fix

cffe7dd

sarahnlewis approved these changes Sep 22, 2025

View reviewed changes

sarahnlewis enabled auto-merge (squash) September 22, 2025 20:02

Merge branch 'main' into main

dd60dfd

sarahnlewis merged commit ccba8db into microsoft:main Sep 26, 2025
4 checks passed

Conversation

ahmedselim2017 commented Jul 30, 2025

Uh oh!

ahmedselim2017 commented Jul 30, 2025

Uh oh!

franknoe commented Jul 30, 2025

Uh oh!

Uh oh!

franknoe commented Jul 30, 2025

Uh oh!

ahmedselim2017 commented Jul 31, 2025

Uh oh!

franknoe commented Jul 31, 2025

Uh oh!

ahmedselim2017 commented Jul 31, 2025

Uh oh!

ahmedselim2017 commented Aug 4, 2025

Uh oh!

franknoe commented Aug 4, 2025

Uh oh!

ahmedselim2017 commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahmedselim2017 commented Aug 14, 2025

Uh oh!

franknoe commented Aug 14, 2025

Uh oh!

ahmedselim2017 commented Sep 15, 2025

Uh oh!

sarahnlewis commented Sep 22, 2025

Uh oh!

ahmedselim2017 commented Sep 22, 2025

Uh oh!

sarahnlewis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ahmedselim2017 commented Aug 6, 2025 •

edited

Loading