Jesus Vazquez

Machine Learning (ML) has gained momentum as a critical component of Natural Language Processing (NLP), a suite of analytical techniques for discerning meaning from vast text corpora. In particular, learning word embeddings (numerical vector representations of words in high-dimensional spaces) has become a popular tool for deriving semantic relationships and similarities between words. However, the application of word embeddings and their subsequent interpretation remain underexplored in the biomedical domain. In this research project, we explore the use of word embeddings to glean similarity and semantic relationships between biomedical entities (e.g., genes, cellular functions, diseases, and drugs) from PubMed, a corpus of 28 million biomedical abstracts produced over the past 52 years. We are specifically interested in testing the effect of Named Entity Recognition (NER) on the efficacy of the word embeddings in capturing previously known relationships. We are also comparing different similarity scores and developing methods to assess how well these learned embeddings recapitulate various aspects of our prior biomedical knowledge.