Add mmCIF -> PDB conversion by a-r-j · Pull Request #107 · BioPandas/biopandas

a-r-j · 2022-08-06T05:00:58Z

Description

Adds support for writing PDB files to filestreams.

I started adding the conversion method proposed by @mrauha.

I ran into an issue writing the test:

from pandas.testing import assert_frame_equal

def test_mmcif_pdb_conversion():
    pdb =PandasPdb().fetch_pdb("3EIY")
    mmcif = PandasMmcif().fetch_mmcif("3EIY")
    mmcif_pdb = mmcif.get_pandas_pdb()
    
    assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
    assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

Which gives an error relating to atom numbering in the HETATM dataframes:

AssertionError                            Traceback (most recent call last)
/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in <cell line: 11>()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7)     assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
      [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8)     assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))
---> [11](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=10) test_mmcif_pdb_conversion()

/Users/arianjamasb/github/biopandas/dev.ipynb Cell 7' in test_mmcif_pdb_conversion()
      [6](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=5) mmcif_pdb = mmcif.get_pandas_pdb()
      [8](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=7) assert_frame_equal(pdb.df["ATOM"].drop(columns=["line_idx"]), mmcif_pdb.df["ATOM"].drop(columns=["line_idx"]))
----> [9](vscode-notebook-cell:/Users/arianjamasb/github/biopandas/dev.ipynb#ch0000003?line=8) assert_frame_equal(pdb.df["HETATM"].drop(columns=["line_idx"]), mmcif_pdb.df["HETATM"].drop(columns=["line_idx"]).reset_index(drop=True))

    [... skipping hidden 2 frame]

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:52, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_libs/testing.pyx:167, in pandas._libs.testing.assert_almost_equal()

File ~/opt/anaconda3/envs/biopandas/lib/python3.8/site-packages/pandas/_testing/asserters.py:682, in raise_assert_detail(obj, message, left, right, diff, index_values)
    679 if diff is not None:
    680     msg += f"\n[diff]: {diff}"
--> 682 raise AssertionError(msg)

AssertionError: DataFrame.iloc[:, 1] (column name="atom_number") are different

DataFrame.iloc[:, 1] (column name="atom_number") values are different (100.0 %)
[index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
[left]:  [1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, ...]
[right]: [1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1346, 1347, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1356, 1357, 1358, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1366, 1367, 1368, 1369, 1370, 1371, 1372, 1373, 1374, 1375, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1392, 1393, 1394, 1395, 1396, 1397, 1398, 1399, 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, ...]

Any idea of the source of the discrepancy?

Update: This seems to be fixed now! 🥳

Related issues or pull requests

#97

Pull Request Checklist

Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
Added appropriate unit test functions in the ./biopandas/*/tests directories (if applicable)
Modify documentation in the corresponding Jupyter Notebook under biopandas/docs/sources/ (if applicable)
Ran PYTHONPATH='.' pytest ./biopandas -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./biopandas/classifier/tests/test_stacking_cv_classifier.py -sv)
Checked for style issues by running flake8 ./biopandas

merge main

rasbt · 2022-08-06T15:15:10Z

Thanks for adding that! For the test, just curious, does this happen with other structures as well?

a-r-j · 2022-08-06T16:19:01Z

It's consistent across a few examples. I narrowed it down to TER records being the problem.

E.g.

# 3eiy.pdb
ATOM   1329  NZ  LYS A 175       8.861  36.709 -13.496  1.00 63.20           N  
ATOM   1330  OXT LYS A 175      10.451  35.432 -10.086  1.00 55.71           O  
TER    1331      LYS A 175                                                      
HETATM 1332  K     K A 176      24.990  43.276   0.005  0.50 24.45           K  
HETATM 1333 NA    NA A 177       1.633  34.181  11.897  1.00 26.73          NA

# 3eiy.cif
ATOM   1329 N  NZ  . LYS A 1 196 ? 8.861   36.709 -13.496 1.00 63.20 ? 175 LYS A NZ  1 
ATOM   1330 O  OXT . LYS A 1 196 ? 10.451  35.432 -10.086 1.00 55.71 ? 175 LYS A OXT 1 
HETATM 1331 K  K   . K   B 2 .   ? 24.990  43.276 0.005   0.50 24.45 ? 176 K   A K   1 
HETATM 1332 NA NA  . NA  C 3 .   ? 1.633   34.181 11.897  1.00 26.73 ? 177 NA  A NA  1

Now, I'm not sure how reliably TER records are placed in PDB files (some info here ) and if we can be super comfortable just ignoring it. I imagine it would cause a bit of a headache trying to make selections consistent between the same structure in PDB & .cif formats.

rasbt · 2022-08-06T21:16:22Z

Hmm... I think that we could have it without the TER records and just a note about that in the docs for now. So that people are aware. Otherwise, inferring the TER records would be some additional work (I guess that could be done based on the CIF structure index? But I don't have that much experience with TER). My guess is the primary usecase of BioPandas is data analysis, and the TER records are not super necessary anyways.

So, for now I'd say let's warn people about the TER entries but not worry about it.

pep8speaks · 2022-08-07T15:08:18Z

Hello @a-r-j! Thanks for updating this PR.

In the file biopandas/mmcif/pandas_mmcif.py:

Line 534:89: E501 line too long (95 > 88 characters)
Line 536:89: E501 line too long (106 > 88 characters)

Comment last updated at 2022-09-03 21:28:28 UTC

a-r-j · 2022-08-07T15:09:33Z

Sure, I added an offset arg so people at least have an access point. Hopefully that's a sufficient remedy for most usecases.

rasbt · 2022-08-07T17:25:08Z

I think that works! Do you have a unit test handy? Otherwise, I can help looking into this!

a-r-j · 2022-08-08T12:30:44Z

Think I managed to infer it automatically based on the number of chains. There was also an issue in ATOM atom numbering for multichain proteins (again, TER records). So I figured we can infer the appropriate offsets for the ATOM and HETATM dfs from the number of chains.

Added appropriate unittest and seems to check out.

rasbt · 2022-08-13T21:32:34Z

Wow this looks good! Currently traveling and will take one more close look in a calm moment, when I am not out and about. Thanks a lot!

rasbt · 2022-08-18T07:31:52Z

Arg totally missed this and am currently traveling (on mobile). I will make a reminder to check this out as soon as I am back on my computer again!

rasbt

Looks great, there is just this one little thing regarding the keys I noticed.

biopandas/mmcif/engines.py

rasbt · 2022-09-03T21:10:41Z

biopandas/mmcif/pandas_mmcif.py

+        """
+        pandaspdb = PandasPdb()
+        # for a in ["ATOM", "HETATM"]:
+        for a in self.df.keys():


Hm, but doesn't that include the OTHERS key as well? Maybe that could be more specific to the keys we actually care about

Hmm, yes, good point. We could add an arg that defaults to ["ATOM", "HETATM"] to allow some more control. What do you think?

Yes, I think that's the "safer" choice

review-notebook-app · 2022-09-03T21:28:28Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

rasbt · 2022-09-03T21:29:02Z

Just did a little documentation update (sorry for the delay, I was traveling during the last couple of weeks, and there's tons of stuff to catch up on)

a-r-j · 2023-01-04T18:01:24Z

Happy New Year @rasbt ! Anything I can do to get this PR merged? :)

rasbt

That looks awesome. Thanks so much. And sorry for the long delay. Somehow it slipped through the cracks and I totally missed this PR.

There's just a small comment about the naming above.

rasbt · 2023-01-04T19:16:07Z

biopandas/mmcif/pandas_mmcif.py

        self.code = self.data["entry"]["id"][0].lower()
        return self
+
+    def get_pandas_pdb(self, offset_chains: bool = True, records: List[str] = ["ATOM", "HETATM"]) -> PandasPdb:


should this perhaps be called convert_to_pandas_pdb? Just so that it doesn't indicate it's fetching the PDB

rasbt · 2023-01-04T19:17:42Z

docs/CHANGELOG.md

+
+##### New Features
+
+- Added a new `PandasMmcif.get_pandas_pdb()` class that converts the mmCIF file into a PDB structure. (Via [Arian Jamasb](https://github.com/a-r-j), PR #[107](https://github.com/rasbt/biopandas/pull/107/files))


see above about the method naming

rasbt · 2023-01-04T20:36:46Z

biopandas/__init__.py

 #

-__version__ = "0.4.2"
+__version__ = "0.5.0"


Ah sorry to come back to this, but

__version__ = "0.5.0" should probably be __version__ = "0.5.0dev" so that we can collect some of the other things in the PR, and then we can change it to 0.5.0 when the the next release version goes out

Yep good point, I'll try to wrap the other two PRs tonight too :)

a-r-j added 2 commits August 6, 2022 06:55

add mmcif->pdb conversion BioPandas#97

76e207c

Merge pull request #2 from rasbt/main

e3d985b

merge main

Add offset param to deal with atom numbering

e61edba

a-r-j added 2 commits August 8, 2022 14:26

improve atom number offsetting and add unit test

ad806a4

Merge branch 'main' into main

25bb4c9

mrauha mentioned this pull request Aug 29, 2022

Support for mmCIF Files a-r-j/graphein#127

Open

rasbt reviewed Sep 3, 2022

View reviewed changes

add get_pandas_pdb to the docs

7e4c2fc

a-r-j added 3 commits September 27, 2022 19:36

add mutable record selection to conversion

32bfc59

Added to PDB stream method

6f9aef5

Add List import

1f120f2

rasbt requested changes Jan 4, 2023

View reviewed changes

Rename conversion method

5ee7f99

rasbt approved these changes Jan 4, 2023

View reviewed changes

rasbt requested changes Jan 4, 2023

View reviewed changes

bump version string to 0.5.0dev

c45cf7c

rasbt approved these changes Jan 4, 2023

View reviewed changes

rasbt merged commit da45886 into BioPandas:main Jan 4, 2023


		##### New Features

		- Added a new `PandasMmcif.get_pandas_pdb()` class that converts the mmCIF file into a PDB structure. (Via [Arian Jamasb](https://github.com/a-r-j), PR #[107](https://github.com/rasbt/biopandas/pull/107/files))

Conversation

a-r-j commented Aug 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issues or pull requests

Pull Request Checklist

Uh oh!

rasbt commented Aug 6, 2022

Uh oh!

a-r-j commented Aug 6, 2022

Uh oh!

rasbt commented Aug 6, 2022

Uh oh!

pep8speaks commented Aug 7, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2022-09-03 21:28:28 UTC

Uh oh!

a-r-j commented Aug 7, 2022

Uh oh!

rasbt commented Aug 7, 2022

Uh oh!

a-r-j commented Aug 8, 2022

Uh oh!

rasbt commented Aug 13, 2022

Uh oh!

rasbt commented Aug 18, 2022

Uh oh!

rasbt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rasbt Sep 3, 2022

Choose a reason for hiding this comment

Uh oh!

a-r-j Sep 5, 2022

Choose a reason for hiding this comment

Uh oh!

rasbt Sep 5, 2022

Choose a reason for hiding this comment

Uh oh!

review-notebook-app bot commented Sep 3, 2022

Uh oh!

rasbt commented Sep 3, 2022

Uh oh!

a-r-j commented Jan 4, 2023

Uh oh!

rasbt left a comment

Choose a reason for hiding this comment

Uh oh!

rasbt Jan 4, 2023

Choose a reason for hiding this comment

Uh oh!

rasbt Jan 4, 2023

Choose a reason for hiding this comment

Uh oh!

rasbt Jan 4, 2023

Choose a reason for hiding this comment

Uh oh!

a-r-j Jan 4, 2023

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

a-r-j commented Aug 6, 2022 •

edited

Loading

pep8speaks commented Aug 7, 2022 •

edited

Loading