# Research Projects

## Modeling drug interactions in the body

**Mentors:** Alex Dickson, Kin Sing Steven Lee, Tom Dixon, Rajat Kumar Pal

When drugs are administered, they undergo processes such as absorption, distribution, metabolism and excretion (commonly referred to as ADME). These can be described by a set of ordinary differential equations, allowing us to predict the time course of drug action, including the length of time the drug remains in the body. In this project we will use computational modeling to investigate how the microscopic properties of a drug molecule affect its efficacy.
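As a rough illustration of what such a set of equations can look like, the sketch below integrates a textbook one-compartment model with first-order absorption and elimination. It is not the model used in the project; the function names, the dose, and the rate constants `ka` and `ke` are all hypothetical values chosen for illustration.

```python
import math


def one_compartment_rhs(state, ka, ke):
    """Rates of change for (amount at absorption site, amount in body)."""
    gut, body = state
    d_gut = -ka * gut               # drug leaving the absorption site
    d_body = ka * gut - ke * body   # absorbed drug minus eliminated drug
    return (d_gut, d_body)


def rk4_step(state, dt, ka, ke):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, factor):
        return tuple(b + factor * s for b, s in zip(base, slope))

    k1 = one_compartment_rhs(state, ka, ke)
    k2 = one_compartment_rhs(shifted(state, k1, dt / 2), ka, ke)
    k3 = one_compartment_rhs(shifted(state, k2, dt / 2), ka, ke)
    k4 = one_compartment_rhs(shifted(state, k3, dt), ka, ke)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(dose, ka, ke, t_end, dt=0.01):
    """Return time points and the amount of drug in the body at each one."""
    n_steps = int(round(t_end / dt))
    state = (dose, 0.0)  # the whole dose starts at the absorption site
    times, body_amounts = [0.0], [0.0]
    for i in range(n_steps):
        state = rk4_step(state, dt, ka, ke)
        times.append((i + 1) * dt)
        body_amounts.append(state[1])
    return times, body_amounts


def analytic_body_amount(dose, ka, ke, t):
    """Closed-form solution of the same model (valid when ka != ke)."""
    return dose * ka / (ka - ke) * (math.exp(-ke * t) - math.exp(-ka * t))


# Hypothetical parameters: 100 mg dose, ka = 1.0 per hour, ke = 0.2 per hour.
times, body = simulate(dose=100.0, ka=1.0, ke=0.2, t_end=24.0)
peak_index = max(range(len(body)), key=body.__getitem__)
print(f"peak amount {body[peak_index]:.1f} mg at t = {times[peak_index]:.2f} h")
```

Comparing the numerical result with `analytic_body_amount` is a cheap sanity check before moving on to models that have no closed-form solution.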

At the end of this project students will be able to model systems of ODEs, connect models to experimental data, ask and answer precise research questions, program in Python, and work with code repositories.

Students applying to this project should be able to program in Python.

**Keywords:** drug design, Python, ordinary differential equations

## Machine learning models to integrate massive amounts of text and molecular data

**Mentors:** Arjun Krishnan, Anna Yannakopoulos, Christopher Mancuso, Nathaniel Hawkins

Currently, we have data from more than 2 million genomics experiments. Each experiment is a high-throughput measurement of the activities of tens of thousands of genes/molecules in a biological sample that could come from various tissues, diseases, and experimental conditions. While these genomics data are invaluable for researchers to reuse and build on, finding the relevant subsets of data is challenging because of the lack of systematic and complete sample annotations. The focus of this REU project is to develop machine learning approaches that combine automated natural language processing with analysis of –omics data to systematically predict standardized annotations for every genomics sample.

At the end of this project, students will: (i) be better versed in data wrangling and visualization; (ii) become more skilled with statistical and machine learning toolkits; (iii) engage with problems and challenges in data-driven biology; (iv) build, organize, and sustain a complex computational project; and (v) communicate and collaborate with scientists across disciplines.

Students applying to this project should have some familiarity with data-wrangling and visualization in Python (with NumPy, Pandas, Matplotlib/Seaborn). Familiarity with machine learning toolkits in Python (Scikit-Learn) is a bonus.

**Keywords:** Machine learning, Data-driven biology, Complex diseases, Natural language processing, Omics data integration

## Land cover/use changes in ecologically sensitive places

**Mentors:** Nathan Moore, Dan Wanyama

Students will learn to do satellite image processing of any ecologically sensitive region, for example, Mongolia's massive overgrazing problem, Tanzania's loss of forest cover, Amazon deforestation, etc.

At the end of this project students will be able to perform some complex image processing, apply interesting tools in land change detection, and evaluate uncertainty (i.e. did anything actually happen?). Students will also develop data management skills, learn to distinguish correlation from causation, and may even develop some simple models of land cover change.

Students applying to this project should be able to program in Python.

**Keywords:** climate, land cover, land use

## Topological Time Series Analysis

**Mentors:** Liz Munch, Firas Khasawneh, Sarah Tymochko

Topological Data Analysis (TDA) is a young field of data analysis that encodes the shape and structure of data using tools from mathematics, specifically topology. Our data comes in the form of a time series, such as a heartbeat or car vibration. We will convert the data into a point cloud using embedding techniques from nonlinear time series analysis, and apply TDA methods to the resulting data to study and differentiate the systems that generate the data.
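One common embedding technique from nonlinear time series analysis is the time-delay (Takens) embedding. The sketch below is a minimal illustration of that idea, not necessarily the method used in the project; the function name `delay_embed` and the values of `dim` and `tau` are assumptions made for the example.

```python
import math


def delay_embed(series, dim, tau):
    """Turn a 1-D time series into points (x[i], x[i + tau], ..., x[i + (dim - 1) * tau])."""
    n_points = len(series) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("time series is too short for this dim and tau")
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(n_points)
    ]


# Toy periodic signal: a sine wave with a period of 100 samples.
signal = [math.sin(2 * math.pi * k / 100) for k in range(400)]

# A delay of a quarter period pairs sin(theta) with cos(theta),
# so the embedded points trace out a circle in the plane.
cloud = delay_embed(signal, dim=2, tau=25)
print(len(cloud), cloud[0])
```

A periodic signal becomes a loop in the point cloud, which is the kind of shape that persistent homology is designed to detect.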

At the end of this project, the students will understand basic time series analysis techniques, understand the basics of persistent homology, a tool from TDA, and produce topological signatures of the output for analysis purposes.

Students applying to this project will use Python and should have previously completed coursework in linear algebra and discrete math/graph theory. Completion of topology coursework is a bonus.

**Keywords:** topological data analysis, time series analysis

## Mapping burned areas from commercial satellite data

**Mentors:** David Roy, Haiyan Huang

Fires burn extensive areas and have important ecological, climate and anthropogenic impacts. Students will work on a NASA-funded project to develop a burned area product for Africa. Specifically, the PLANET 3m commercial satellite (https://www.planet.com/) will soon be available on a global, near-daily basis. This student project will develop a semi-automated method to map burned areas from PLANET image time series and then scale it to African coverage.

At the end of the project students will be able to process satellite data and will have an understanding of remote sensing techniques, image processing approaches, and data handling.

Students applying to this project will use C, Linux and remote sensing packages for data visualization. Students should have previously taken a machine learning course.

**Keywords:** Remote Sensing, Image Processing, Machine Learning

## Breaking barriers in single-cell genomics for microorganisms

**Mentors:** Ashley Shade, Jim Cole, Nejc Stopnisek

This bioinformatics project will involve design of nucleotide probes (barcodes) to effectively capture genome content from a diverse pool of single microbial cells. We will design different sets of probes and then test their potential efficacy in silico by asking how well the probes capture known diversity represented in microbial genome databases.

At the end of this project, students will be able to submit jobs on the high performance computing cluster, create scripts for probe design, design and execute in silico tests to assess the quality and potential of different probe combinations, and retrieve and work with big data from public genomic databases. The students will also learn how to use bioinformatics tools that support microbiome research.

Students applying to this project will code in Python and R.

**Keywords:** Bioinformatics

## Biodiversity and geodiversity across scales

**Mentors:** Phoebe Zarnetske, Beth Gerstner

The student will develop tools to visualize the relationships across spatial scales between biodiversity and geodiversity. Geodiversity is the variation in Earth's abiotic features and processes, and has the potential to buffer biodiversity against climate change. Target areas for the research include North America, Central America, and South America. The student will work with satellite remote sensing data and georeferenced point data from the National Ecological Observatory Network (NEON) and the Global Biodiversity Information Facility (GBIF). Google Earth Engine, Java, R, Python, and Shiny are potential languages and platforms for this project.

At the end of this project, students will be able to: (i) explain the linkages between biodiversity and ecosystem functions; (ii) understand the effects of climate change and land use change on biodiversity; (iii) work with geospatial data in R, Python, and Google Earth Engine; (iv) contribute to advancing species distribution models of data-poor species; (v) summarize geodiversity data using R packages; and (vi) develop a Shiny app.

Students applying to this project would benefit from having previously taken an ecology or statistics course, but this is not required.

**Keywords:** Biodiversity, Climate Change, Spatial Modeling, Species Distribution Modeling, Satellite Remote Sensing

## Visualizing high dimensional datasets

**Mentors:** Rongrong Wang, He Lyu

Analyzing high-dimensional data is a major problem facing science today. Given a new dataset, the first thing a data analyst will do is visualize it using various state-of-the-art techniques in order to obtain a basic understanding of the structure of the data. This project focuses on designing and testing novel data visualization methods that are better than the state-of-the-art methods at preserving the geometry of the dataset.

By the end of this project, students will be familiar with various data visualization and dimensionality reduction methods and will gain experience in coding and applying them to real datasets. They will be exposed to cutting-edge research topics in the field of manifold learning.

Students applying to this project should be able to program using Python and have previously completed one semester of numerical linear algebra and one semester of partial differential equations (PDE).

**Keywords:** machine learning