{"id":40,"date":"2024-04-20T07:54:00","date_gmt":"2024-04-20T07:54:00","guid":{"rendered":"http:\/\/rarathemesdemo.com\/perfect-portfolio\/?post_type=rara-portfolio&#038;p=40"},"modified":"2024-08-06T00:04:38","modified_gmt":"2024-08-06T00:04:38","slug":"molecular-graph-generation-vae","status":"publish","type":"rara-portfolio","link":"https:\/\/alex-jimenez.com\/?rara-portfolio=molecular-graph-generation-vae","title":{"rendered":"Molecular Graph Decomposition"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">Source Code: <a href=\"https:\/\/github.com\/alexjimenez99\/Junction-Tree-Chem\/tree\/main\" data-type=\"link\" data-id=\"https:\/\/github.com\/alexjimenez99\/Junction-Tree-Chem\/tree\/main\">GitHub<\/a><\/h3>\n\n\n\n<p><strong>Motivation Behind the Project<\/strong><\/p>\n\n\n\n<p>The inspiration for this project comes from the following paper:&nbsp;<em>Junction Tree Variational Autoencoder for Molecular Graph Generation by Jin et al.<\/em>&nbsp;I wanted to recreate the work done by&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/1802.04364\" target=\"_blank\" rel=\"noreferrer noopener\">Jin et al.<\/a>&nbsp;to explore junction trees and cheminformatics and to test my coding abilities. The outcome of the decomposition is not identical to Jin et al.&#8217;s representation, most likely because I built this from scratch and some design decisions differed. The output of this package could be used&nbsp;in the future&nbsp;as a feature-rich input for machine learning. 
I see some major advantages to this type of decomposition for Transformers.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Math Decomposition<\/h2>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter\"><img fetchpriority=\"high\" decoding=\"async\" width=\"300\" height=\"186\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Smiles-WP-2-300x186.png\" alt=\"\" class=\"wp-image-737\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Smiles-WP-2-300x186.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Smiles-WP-2-97x60.png 97w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Smiles-WP-2.png 589w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 1<\/strong>: SMILES Decomposition<\/figcaption><\/figure><\/div>\n\n\n<p>This project is a Python package that takes chemical SMILES strings and encodes them into a graph structure using a variety of chemical properties and preprocessing techniques. Figure 1 shows an example of what the SMILES representation of a molecule might look like. Background on SMILES can be found&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Simplified_molecular-input_line-entry_system\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.&nbsp;<\/p>\n\n\n\n<p>The target format of our data looks very different from the SMILES representation. Before showing what this format looks like, here is a list of the properties used to decompose a molecule into its unique aspects.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Chemical Properties<\/h4>\n\n\n\n<ul>\n<li>Aromaticity\/Rings<\/li>\n\n\n\n<li>Bond Type (Single, Double, Triple)<\/li>\n\n\n\n<li>Chirality<\/li>\n\n\n\n<li>Tertiary\/Quaternary Atoms<\/li>\n\n\n\n<li>Bond Isomerism (Cis\/Trans\/None)<\/li>\n\n\n\n<li>Valence<\/li>\n<\/ul>\n\n\n\n<p>The molecule was then decomposed into a junction tree using the methodology of Jin et al. 
The following image shows the algorithm as written by Jin et al.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"333\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM-1024x333.png\" alt=\"\" class=\"wp-image-738\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM-1024x333.png 1024w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM-300x97.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM-768x249.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM-185x60.png 185w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-25-at-7.51.19-AM.png 1472w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 2<\/strong>: Tree Decomposition Algorithm for Chemical Structures<\/figcaption><\/figure><\/div>\n\n\n<p>A junction tree groups the relevant parts of a molecule together. In this case, we define V2 as the ring structures of a molecule, which are important because they can be rigid and provide a lot of information about the molecule&#8217;s steric properties. V1 is defined as the bonds that do not belong to any ring. V0 is the set of atoms at the intersection of three or more clusters, which, in the case of a carbon atom, represents tertiary and quaternary carbons. This is important because chirality, functional groups, and resonance structures can occur around these points. These clusters are then composed into a junction tree. An example of a junction tree can be seen below, where image (a) could represent the original connectivity of our molecule and image (b) represents the maximum spanning tree of our molecule. 
More information on maximum-spanning trees can be found&nbsp;<a href=\"https:\/\/mathworld.wolfram.com\/MaximumSpanningTree.html#:~:text=A%20maximum%20spanning%20tree%20is,the%20command%20FindSpanningTree%5Bg%5D.\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.&nbsp;<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img decoding=\"async\" width=\"431\" height=\"247\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Junction-Tree-Example.png\" alt=\"\" class=\"wp-image-739\" style=\"width:591px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Junction-Tree-Example.png 431w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Junction-Tree-Example-300x172.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Junction-Tree-Example-330x190.png 330w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/04\/Junction-Tree-Example-105x60.png 105w\" sizes=\"(max-width: 431px) 100vw, 431px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 3<\/strong>: Maximum Spanning Tree Decomposition Conceptual Visualization<\/figcaption><\/figure><\/div>\n\n\n<p>Applying this methodology and doing some additional preprocessing eventually leads to our molecules being represented as an N X N sparse matrix.&nbsp;<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"304\" height=\"263\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Adjacency-Matrix.png\" alt=\"\" class=\"wp-image-765\" style=\"width:402px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Adjacency-Matrix.png 304w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Adjacency-Matrix-300x260.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Adjacency-Matrix-69x60.png 69w\" sizes=\"(max-width: 304px) 100vw, 304px\" \/><figcaption 
class=\"wp-element-caption\"><strong>Figure 4<\/strong>: Adjacency Matrix Representing Connections of Atoms in Chemical Structure<\/figcaption><\/figure><\/div>\n\n\n<p>Self-loops were added to all adjacency matrices. Self-loops&nbsp;have been shown to be&nbsp;useful in deep learning for&nbsp;<a href=\"https:\/\/www.pyg.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">graph neural networks<\/a>; they ensure that each node&#8217;s&nbsp;own&nbsp;properties are included in its own update when training a model. An example of self-loops can be seen in the following figure, where n1 and n3 have self-loops.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"254\" height=\"347\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Self-Looping.png\" alt=\"\" class=\"wp-image-766\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Self-Looping.png 254w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Self-Looping-220x300.png 220w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Self-Looping-44x60.png 44w\" sizes=\"(max-width: 254px) 100vw, 254px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 5<\/strong>: Self Looping in Graph Structures<\/figcaption><\/figure><\/div>\n\n\n<p>Since we&#8217;re interested in more than just the connectivity of our molecules, we must incorporate chemical information into these decompositions to obtain meaningful features, which leads us to the next section.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Vocabulary Building<\/h2>\n\n\n\n<p>Now that we&#8217;ve covered the math decomposition, we&#8217;ll dig into how the vocabulary was built. 
We mentioned earlier that the following properties were tracked for each molecule during the decomposition.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Chemical Properties<\/h4>\n\n\n\n<ul>\n<li>Aromaticity\/Rings<\/li>\n\n\n\n<li>Bond Type (Single, Double, Triple)<\/li>\n\n\n\n<li>Chirality<\/li>\n\n\n\n<li>Tertiary\/Quaternary Atoms<\/li>\n\n\n\n<li>Bond Isomerism (Cis\/Trans\/None)<\/li>\n\n\n\n<li>Valence<\/li>\n<\/ul>\n\n\n\n<p>Based on prior knowledge, I decided this list would be sufficient for uniquely identifying molecules and substructures. As a result, a dictionary was created for each SMILES string using these properties. Each dictionary holds crucial information such as atom-to-index key-value pairs, bond connectivity, and numerical indices assigned by RDKit to indicate all of the other properties stated above.&nbsp;<\/p>\n\n\n\n<p>All the indices for these properties were then converted into strings that summarize the atom, bond type, chirality, atom degree, bond isomerism, and valence. An example of some extracted dictionary entries can be seen below. 
<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"340\" height=\"370\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-1.02.33-PM.png\" alt=\"\" class=\"wp-image-767\" style=\"width:422px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-1.02.33-PM.png 340w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-1.02.33-PM-276x300.png 276w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-1.02.33-PM-55x60.png 55w\" sizes=\"(max-width: 340px) 100vw, 340px\" \/><figcaption class=\"wp-element-caption\"><strong>Figure 6<\/strong>: Vocabulary Obtained From Molecular Decomposition<\/figcaption><\/figure><\/div>\n\n\n<p>The original dataset used to create the first embeddings in this dictionary was the&nbsp;<a href=\"https:\/\/www.kaggle.com\/datasets\/basu369victor\/zinc250k\" target=\"_blank\" rel=\"noreferrer noopener\">Zinc250K<\/a>&nbsp;dataset from Kaggle. From the 250,000 entries, 369 unique atomic structures were identified. Since I knew these embeddings wouldn&#8217;t cover the full chemical space, I exported this vocabulary as a JSON file. This would serve as the starting point for a tokenizing algorithm.&nbsp;&nbsp;<\/p>\n\n\n\n<p>After copying the original embeddings into a separate file, functionality was added to dynamically update the token dictionary. To ensure that the algorithm was correctly identifying structures, I made a couple of functions that would draw the RDKit structures based on the vocabulary token given in XX. 
The following is an example of a token and the structure returned for it; in this case, I requested a carbon-oxygen double bond with &#8220;none&#8221; isomerism and valences of 4 and 2 for the carbon and oxygen, respectively.&nbsp;<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"740\" height=\"734\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-7.16.53-PM.png\" alt=\"\" class=\"wp-image-769\" style=\"width:778px;height:auto\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-7.16.53-PM.png 740w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-7.16.53-PM-300x298.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-7.16.53-PM-150x150.png 150w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Screenshot-2024-05-09-at-7.16.53-PM-60x60.png 60w\" sizes=\"(max-width: 740px) 100vw, 740px\" \/><figcaption class=\"wp-element-caption\">Figure 7: Visual Certification of Vocabulary and Molecular Decomposition Algorithm<\/figcaption><\/figure><\/div>\n\n\n<p>Another verification of this process was to visualize the generated junction tree, more specifically the&nbsp;<a href=\"https:\/\/mathworld.wolfram.com\/MaximumSpanningTree.html#:~:text=A%20maximum%20spanning%20tree%20is,the%20command%20FindSpanningTree%5Bg%5D.\" target=\"_blank\" rel=\"noreferrer noopener\">maximum spanning tree<\/a>&nbsp;over the junction tree structure. A maximum spanning tree has no cycles, so the absence of cycles is the first visual check of the structure. The next check is&nbsp;ensuring that&nbsp;all nodes are connected and that&nbsp;no&nbsp;errors occurred in creating these structures. 
To do this, we&nbsp;ensure&nbsp;that adjacent nodes share at least one atom and that&nbsp;adjacent&nbsp;ring structures share two atoms.&nbsp;<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"950\" height=\"944\" src=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees.png\" alt=\"\" class=\"wp-image-772\" srcset=\"https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees.png 950w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees-300x298.png 300w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees-150x150.png 150w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees-768x763.png 768w, https:\/\/alex-jimenez.com\/wp-content\/uploads\/2024\/05\/Maximum-Spanning-Trees-60x60.png 60w\" sizes=\"(max-width: 950px) 100vw, 950px\" \/><figcaption class=\"wp-element-caption\">Figure 8: Visual Verification of Molecular Decomposition Analyzing Maximum Spanning Tree in NetworkX<\/figcaption><\/figure><\/div>\n\n\n<p>After these two visual checks on multiple structures, I can say with confidence that the tree decomposition algorithm works.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Disclaimer<\/h3>\n\n\n\n<p><a href=\"https:\/\/deepchem.io\" data-type=\"link\" data-id=\"https:\/\/deepchem.io\">DeepChem<\/a> offers a variety of embeddings that are compatible with graph neural networks. I learned about this after creating this algorithm. I still think it was invaluable to go through this process myself to understand what&#8217;s going on under the hood. 
<\/p>\n","protected":false},"excerpt":{"rendered":"<p>*Image credit to Ryu et from &#8220;Deeply learning molecular structure-property relation- ships using attention- and gate-augmented graph con- volutional network&#8221;<\/p>\n","protected":false},"author":1,"featured_media":795,"comment_status":"closed","ping_status":"closed","template":"","rara_portfolio_categories":[10,3],"_links":{"self":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/rara-portfolio\/40"}],"collection":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/rara-portfolio"}],"about":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/types\/rara-portfolio"}],"author":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=40"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=\/wp\/v2\/media\/795"}],"wp:attachment":[{"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=40"}],"wp:term":[{"taxonomy":"rara_portfolio_categories","embeddable":true,"href":"https:\/\/alex-jimenez.com\/index.php?rest_route=%2Fwp%2Fv2%2Frara_portfolio_categories&post=40"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}