mmCIF coordinate parser

Edoardo Sarti edited this page Mar 11, 2022 · 1 revision

This function is intended for parsing the coordinate section of mmCIF files coming from the Protein Data Bank resources.

The function returns a data structure with all the collected information, and a data structure with some statistics on the parsed file.

Parsing the mmCIF file coordinates

Remark: this parser contains filters that select and modify the information of the original record.

Here is a list of the operations performed by the parser:

  • In case of multiple models, it keeps only the first model
  • It only keeps ATOM and HETATM records (note: TER records do not exist in mmCIF)
  • Eliminates duplicate records. Duplicate records are infrequent errors in PDB files and should be deleted.
  • Classifies AltLoc records. AltLoc records are alternative locations of some residues: to obtain a single, physically meaningful structure, one AltLoc must be chosen and the others discarded. The AltLoc ID resets at each chain, but not at each residue: this means that, within a chain, a residue may carry only a subset of the alternative locations. Thus, for each chain, AltLoc residues are recorded together with their atoms' occupancy. The AltLoc with the highest per-residue occupancy is chosen. If there is a tie, the AltLoc covering the highest number of residues is chosen. If there is still a tie, the one recorded first is chosen.
  • Checks whether the file can be converted to PDB format. For this, the number of atoms must be below 100,000 and all chain IDs must be one character long.
  • Keeps UNK residues, but only records backbone atoms and the C_beta, if present
  • Residues that are not recognized by the residue parser (i.e., they are neither among the 20 standard residues nor UNK) are treated as UNK if the chain's entity is a polypeptide, and are discarded otherwise. To perform this check, the "entity" loop in the header of the mmCIF file has to be accessed: if the parser does not find this information, all these residues are discarded, regardless of the nature of the chain.
  • If the line is a HETATM record and the residue is MSE (selenomethionine), it is converted into a MET residue (ATOM record). The conversion includes replacing the SE (selenium) atom with the sulfur atom of the MET residue, placed at the same coordinates. All other HETATM records are discarded.
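
Three of the rules above (AltLoc choice, the PDB-format check, and the MSE to MET conversion) can be sketched in Python as follows. This is only an illustrative sketch, not the parser's actual code: the function names, the tuple layout `(altloc_id, residue_key, occupancy)` and the dictionary keys `group`, `resname`, `atom_name` and `element` are all hypothetical. It also assumes that "per-residue occupancy" means the occupancy averaged over the atoms of each residue and then over the residues of the AltLoc, and that the sulfur atom of MET carries its standard name `SD`.

```python
from collections import defaultdict


def choose_altloc(atoms):
    """Pick one AltLoc ID for a single chain.

    `atoms` is an iterable of (altloc_id, residue_key, occupancy) tuples in
    file order (hypothetical layout). Returns None if no AltLoc is present.
    """
    occupancies = defaultdict(lambda: defaultdict(list))  # altloc -> residue -> values
    order = []  # AltLoc IDs in order of first appearance
    for altloc, residue, occupancy in atoms:
        if altloc in (".", "?", ""):  # mmCIF marks "no AltLoc" with "."
            continue
        if altloc not in occupancies:
            order.append(altloc)
        occupancies[altloc][residue].append(occupancy)
    if not order:
        return None

    def score(altloc):
        residues = occupancies[altloc]
        per_residue = [sum(v) / len(v) for v in residues.values()]
        mean_occupancy = sum(per_residue) / len(per_residue)
        # Sort key: occupancy first, then residue count, then earliest record.
        return (round(mean_occupancy, 6), len(residues), -order.index(altloc))

    return max(order, key=score)


def can_convert_to_pdb(n_atoms, chain_ids):
    """True if the structure satisfies the two limits stated in the list above."""
    return n_atoms < 100_000 and all(len(c) == 1 for c in chain_ids)


def mse_to_met(record):
    """Return a MET/ATOM copy of an MSE/HETATM record; other records pass through.

    Discarding the remaining HETATM records is a separate step, not shown here.
    """
    if record["group"] != "HETATM" or record["resname"] != "MSE":
        return record
    converted = dict(record, group="ATOM", resname="MET")
    if record["atom_name"] == "SE":
        converted["atom_name"] = "SD"  # sulfur replaces selenium, same coordinates
        converted["element"] = "S"
    return converted
```

For example, `choose_altloc([("A", 10, 0.6), ("B", 10, 0.4)])` selects `"A"`, because it has the higher occupancy; with equal occupancies and equal residue counts, the AltLoc that appears first in the file is selected.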
