Sunday, January 04, 2009

Antithrombotic graph visualization - Chemistry 2.0 and the ChemSpider journal

"It’s not Information Overload. It’s Filter Failure. If you have the same problem for a long time, maybe it is not a problem, maybe it is a fact ... some of it is rethinking social norm ... we have continuously to improve filters, that just broke !" [Clay Shirky]

"We intend to demonstrate how modern web technologies can be used to dramatically enhance the type of information that can be communicated using web-based tools over standard online publishing approaches." [ChemSpider journal (editorial board)]
Clay Shirky's chemical-study-group example (after minute 15) reminded me once again of the information overload problem. I fully agree with Clay that a long-persisting problem encourages a desire to change it. As Tony from ChemSpider has shown, the life science community has the (technical) ability to enrich chemical data.

So, are there just (social) barriers left? According to von Hippel, individual users rarely invest enough in an innovation if they think the problem is theirs alone and not the problem of a larger community (see chapter four). But wait, I had always assumed that the community shares this problem, not just a few individuals living with their data silos.

Again, and in line with Clay and von Hippel, there are good reasons to rethink social norms in order to improve information filtering, e.g. for buried, wrong, hidden, or shrouded data.

Now, let's have a look at an antithrombotic drug design data example, based on community web communication.
I used the Fontaine et al. factor Xa data set (EC 3.4.21.6) from cheminformatics.org and enriched it with PDB ligand structures (final structure file). Factor Xa inhibitors block this trypsin-like serine protease, which is involved in the blood coagulation process.

What does this data have to do with drug design? One very reasonable question could be:
Which XRay ligands are closest to the Fontaine et al. structure-activity relationship data, so that they allow structure-based drug design?
In short, there might be many ways to answer this. Here, we are interested in checking the functionality of ChemSpider and Blue Obelisk tools, e.g. Pybel. The idea is also to move towards web-based scientific collaboration models, e.g. via Google Docs.

First, since I strongly believe in the benefits of curated data and community efforts, I checked how much factor Xa data had already been contributed and curated within ChemSpider. Honestly, I faced a couple of challenges (or filter failures) when trying to answer this question. It is already easy to upload data to ChemSpider, but it is much harder to find out whether those molecules are already registered there. Please keep in mind that people in industry, in particular, have to protect intellectual property and cannot always upload data immediately, but they may be able to contribute on a technical level or after patent expiration. Based on the InChIKey (1, 2, 3) it was possible to create an InChIKey2CID script. However, this web-based retrieval is still extremely slow and suboptimal, and certainly not usable for thousands of molecules. In any case, none of the factor Xa molecules had been uploaded to ChemSpider, not even the PDB ligands (see flat file result). So, still some work to do in the future ;-)
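The registration check described above can be sketched without any web service, assuming a local dump of already registered InChIKeys is available. This is not the InChIKey2CID script itself: the function names, the record ids, and the idea of a local dump are assumptions made for illustration, and the keys are used purely as example strings. The sketch also assumes that the first hyphen-separated block of an InChIKey hashes the connectivity, so that matching on this block alone ignores stereo information.

```python
from collections import defaultdict


def skeleton(inchikey):
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.strip().upper().split("-")[0]


def build_index(registered):
    """Index (inchikey, record_id) pairs by full key and by first block."""
    exact = {}
    by_skeleton = defaultdict(list)
    for key, record_id in registered:
        key = key.strip().upper()
        exact[key] = record_id
        by_skeleton[skeleton(key)].append(record_id)
    return exact, by_skeleton


def lookup(query_keys, exact, by_skeleton):
    """Map each query key to (status, record ids).

    status is 'exact', 'skeleton' (same first block only) or 'missing'.
    """
    result = {}
    for key in query_keys:
        key = key.strip().upper()
        if key in exact:
            result[key] = ("exact", [exact[key]])
        elif skeleton(key) in by_skeleton:
            result[key] = ("skeleton", list(by_skeleton[skeleton(key)]))
        else:
            result[key] = ("missing", [])
    return result


# Hypothetical local dump: the record ids are placeholders, not real identifiers.
registered = [
    ("QNAYBMKLOCPYGJ-REOHZSCRSA-N", "rec-1"),
    ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "rec-2"),
]
exact, by_skeleton = build_index(registered)

queries = [
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # same first block as rec-1, different second block
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",  # identical to rec-2
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # made-up key that is not in the dump
]
hits = lookup(queries, exact, by_skeleton)
for key, (status, ids) in hits.items():
    print(key, status, ids)
```

Building the index once and then answering each query from memory avoids one web request per molecule, which is the part that does not scale to thousands of structures.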

Second, for now ChemSpider mainly supports a quick similarity check on InChIKeys that neglects stereo information. Strictly speaking, this is only one very specific kind of similarity search. In other words, users are left with the option of a PubChem search, which also covers the ChemSpider data. Maybe there already exists a web service that returns, for a single query molecule, the most similar compound or the five most similar compounds, but I have not found out how.
In the end, it was easier to follow Noel's Pybel examples (general, fp, sd) and create a molsim2xgml script. The script not only calculates fingerprint similarities between molecules, but also writes XGML output for the data visualization.
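The idea behind such a script can be sketched in plain Python. This is a minimal illustration and not the original script: fingerprints are assumed to be sets of on-bit positions (with Pybel these could come from `mol.calcfp().bits`), the molecule names and bit sets are toy values, and the output is plain GML rather than yEd's XGML.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def top_k_edges(fingerprints, k=5):
    """Link every molecule to its k most similar other molecules."""
    edges = []
    for name, bits in fingerprints.items():
        scored = [
            (tanimoto(bits, other_bits), other)
            for other, other_bits in fingerprints.items()
            if other != name
        ]
        # Highest similarity first; ties are broken alphabetically by name.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        edges.extend((name, other, score) for score, other in scored[:k])
    return edges


def to_gml(fingerprints, edges):
    """Serialize nodes and similarity edges as a directed GML graph."""
    ids = {name: i for i, name in enumerate(fingerprints)}
    lines = ["graph [", "  directed 1"]
    for name, i in ids.items():
        lines.append('  node [ id %d label "%s" ]' % (i, name))
    for source, target, score in edges:
        lines.append(
            '  edge [ source %d target %d label "%.2f" ]'
            % (ids[source], ids[target], score)
        )
    lines.append("]")
    return "\n".join(lines)


# Toy fingerprints (hypothetical bit positions, not real molecules).
fingerprints = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {7, 8},
}
edges = top_k_edges(fingerprints, k=1)
gml = to_gml(fingerprints, edges)
print(gml)
```

With `k=1` this gives the "most similar compound" graph, with `k=5` the "top 5 compounds" graph. Note that every molecule gets its k edges even when the best similarity is very low, so a similarity cutoff may be worth adding for real data.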

Xa ligand similarity (top 5 compounds).
Original graph files (yEd): gml, graphml, xgml, svg.
Tulip also accepts the GML format.




Xa ligand similarity (best compound, most similar).
Original graph files (yEd): gml, graphml, xgml, svg.
Tulip also accepts the GML format.

Finally, we obtain two graphs showing the similarity (connectivity) of the factor Xa SAR training data (blue), test data (green), and the XRay data (red). Each node represents one molecule, and each edge links a molecule to either its most similar compound (lower graph) or its five most similar compounds (upper graph).
In the lower graph, we see that only two XRay structures (indicated in red) are close to the factor Xa SAR data in the training set (indicated in blue). One is a zinc cation, which in this case is not interesting for ligand design; the other is the PDB ligand ZEN. This ligand can be found in the PDB structures 1j17, 1ql7, 1ql8, 1ql9, and 1v2k, which could be a good starting point for structure-based drug design (SBDD) and structure-property-relationship (SPR) transfer. On the other hand, the large overall distance between the Fontaine SAR data (blue and green) and the XRay data (red) is easy to spot. So, be warned when trying any SPR transfer, and we have not even started talking about 3D ligands and protein flexibility! ;-)
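The reading of the lower graph can also be done programmatically. The following is a minimal sketch, assuming the nearest-neighbour edges are available as (source, target, similarity) tuples and each molecule carries a data set label; all names, labels, and scores below are hypothetical placeholders, not values from the factor Xa data.

```python
def cross_set_links(edges, dataset, set_a="xray", set_b="train"):
    """Return the edges that connect a molecule of set_a with one of set_b."""
    links = []
    for source, target, score in edges:
        if {dataset[source], dataset[target]} == {set_a, set_b}:
            links.append((source, target, score))
    return links


# Hypothetical data set membership (train = blue, test = green, xray = red).
dataset = {
    "train_1": "train",
    "train_2": "train",
    "test_1": "test",
    "xray_1": "xray",
    "xray_2": "xray",
}

# Hypothetical nearest-neighbour edges with made-up similarity scores.
edges = [
    ("train_1", "train_2", 0.9),
    ("xray_1", "train_1", 0.7),
    ("xray_2", "xray_1", 0.8),
    ("test_1", "train_2", 0.6),
]

links = cross_set_links(edges, dataset)
xray_near_sar = sorted(
    {name for s, t, _ in links for name in (s, t) if dataset[name] == "xray"}
)
print(links)
print(xray_near_sar)
```

Listing the cross-set edges together with their similarity scores makes it easier to judge how far the XRay ligands really are from the SAR data before attempting any SPR transfer.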

So, yes, some of the public SAR data can be linked to XRay information and could support structure-based design. Still, what should have been a fast answer to the original question took some time. And could this result now be uploaded to support the drug design community and spare other community members this kind of filter failure? I am not aware that the EBI or NCBI services allow data upload.
On the other hand, ChemSpider and the ChemSpider journal (editorial board) would benefit from additional services, such as fast InChIKey checks or similarity searches.

References
  • molsim2xgml Python script
  • inchikey2cid Python script
  • yWorks - the diagramming company
  • F. Fontaine, M. Pastor, I. Zamora, F. Sanz
    Anchor-GRIND: Filling the Gap between Standard 3D QSAR and the GRid-INdependent Descriptors
    J. Med. Chem., 2005, 48, 2687-2694. DOI 10.1021/jm049113+
  • N. M. O'Boyle, C. Morley and G. R. Hutchison
    Pybel: a Python wrapper for the OpenBabel cheminformatics toolkit
    Chem. Cent. J. 2008, 2, 5.
    DOI 10.1186/1752-153X-2-5
  • R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Steinbeck, J. Wegner, E. L. Willighagen
    The Blue Obelisk - Interoperability in Chemical Informatics
    J. Chem. Inf. Model., 2006, 46, 991-998.
    DOI 10.1021/ci050400b
  • D. Rauh, G. Klebe, M. T. Stubbs
    Understanding Protein–Ligand Interactions: The Price of Protein Flexibility
    Journal of Molecular Biology, 2004, 335, 1325-1341.
    DOI 10.1016/j.jmb.2003.11.041

1 comment:

Joerg Kurt Wegner said...

Rajarshi has done time complexity testing for the REST service and comes to the same conclusion. Direct database queries are much faster than one-by-one web-service requests, so we will need something in between.

I also agree that a fast server-based similarity cartridge would be really useful, e.g. the Swamidass-Baldi approach, which is perfect for databases, or LINGO, which is already part of the OpenEye tools.