Priority based functional group identification of organic molecules using machine learning

R. Nalla; R. Pinge; Manish Narwaria; B. Chaudhury

doi:10.1145/3152494.3152522

Functional groups in organic compounds determine the properties of the compounds/molecules. When multiple functional groups are present, the dominant functional group determines the majority of the properties of the compound. Hence, priority-based identification of functional groups is an important problem in chemistry. Fourier-transform infrared (FTIR) spectroscopy is a commonly used spectroscopic method for identifying the presence or absence of functional groups within a compound, and the current approach to this task relies mainly on visual inspection and analysis of the FTIR spectral data. However, visual identification by humans is error-prone, especially when patterns in the FTIR spectrum overlap, so that the features that help distinguish different functional groups in the unknown sample are no longer unique. Therefore, the main goal of this paper is to develop a machine-learning-based classification system that can perform priority-based functional group identification of organic molecules. To the best of our knowledge, this is the first effort to address this problem using machine learning (ML), and a unique aspect of our study is the incorporation of domain-specific information into the classification process by employing a set of priority rules generated from expert knowledge. We have carried out an extensive study on real IR spectral data, first using a rule-based approach and then using ML in an effort to improve the classification accuracy. Our analysis indicates that the basic rule-based method is reasonably effective in predicting the presence (or absence) of functional groups. However, such an approach is not accurate enough in practice for the more challenging problem of priority-based identification, and ML-based classification offers much higher identification accuracy in this case. The primary reason is that an ML algorithm can adaptively exploit data patterns to classify the functional group, unlike the rule-based approach, which uses a fixed set of rules for this purpose. Finally, we have also carried out extensive statistical analysis of the results using confidence intervals and permutation tests, in an effort to gain more descriptive information about the learning process rather than simply treating it as a black box. © 2018 Association for Computing Machinery.
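
The abstract mentions permutation tests but does not describe how they were set up. As a rough illustration only, the sketch below shows one common form of the idea: shuffle the true labels many times and check how often the shuffled labels agree with a classifier's predictions at least as well as the real labels do. The function names, the toy labels, and the choice to shuffle labels against fixed predictions (rather than retraining a model on each permutation) are all assumptions made for this example and are not taken from the paper.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Estimate how likely the observed accuracy is under random labelling.

    Hypothetical helper: shuffles the true labels and counts how often the
    shuffled labels match the fixed predictions at least as well as the
    real labels do.
    """
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, p_value


# Toy labels (made up for illustration, not data from the paper).
y_true = ["alcohol", "ketone", "amine", "alcohol", "ketone", "amine"] * 5
y_pred = list(y_true)
y_pred[0] = "ketone"  # one deliberate misclassification

observed, p_value = permutation_p_value(y_true, y_pred)
print(f"accuracy = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value here suggests the agreement between predictions and labels is unlikely to arise by chance, which is the kind of evidence a permutation test is meant to provide beyond a single accuracy figure.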

Journal	ACM International Conference Proceeding Series
Publisher	Association for Computing Machinery