In my data science with engineering applications class, I learned how practical data science can be helpful in other fields. In this project, I used clustering and principal component analysis to see if Metformin, a drug used for diabetes, is effective in treating cancer.
Some Background
The idea behind the project was to apply unsupervised learning approaches to identify genes with significant differential expression across single-cell subpopulations induced by therapeutic treatment. In the real-life application, researchers at the University of Illinois, in collaboration with the Mayo Clinic, used data science techniques to measure how effective Metformin, a drug typically prescribed for Type 2 diabetes, is against cancer.
Some Technical Background
The overall goal of the project is to determine where Metformin is effective in treating cancer. Testing out every gene and cell would be too costly in terms of time and resources, so isolating more promising candidates would definitely reduce the workload.
MiMoSA stands for “mixture model-based single-cell analysis”. In other words, this was the workflow the researchers came up with to cluster and analyze cells.
Gene expression, the activity level of a gene, is measured here by RPKM: reads per kilobase of transcript per million mapped reads.
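To make the unit concrete, here is a minimal sketch of the standard RPKM calculation. The read counts, gene lengths, and the `rpkm` helper are made up for illustration; the project data already came as RPKM values.

```python
import numpy as np

def rpkm(read_counts, gene_lengths_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads (hypothetical helper)."""
    read_counts = np.asarray(read_counts, dtype=float)
    gene_lengths_bp = np.asarray(gene_lengths_bp, dtype=float)
    per_kilobase = read_counts / (gene_lengths_bp / 1_000)   # normalize by gene length
    per_million = total_mapped_reads / 1_000_000             # normalize by sequencing depth
    return per_kilobase / per_million

# Made-up numbers: three genes in one cell with 2 million mapped reads.
values = rpkm([200, 50, 0], [2_000, 500, 1_500], 2_000_000)
print(values)
```

Dividing by both gene length and sequencing depth is what lets expression be compared across genes and across cells.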
For this project, pandas was used for loading datasets, NumPy was used for vector manipulation, Matplotlib and Seaborn were used for plotting, and sklearn was used for its ML algorithm implementations.
Part 1: Getting Familiar
Naturally, the first thing to do was to load the data. The baseline and Metformin-treated cell data were presented in a .csv with 1,170 genes and roughly 170 cells. Each of us was assigned baseline and Metformin-treated cells and asked to look at how expression varied across genes within those cells. To do this, I created slices of the RPKM data for my assigned cells and applied the log function to each slice. Each slice was then fed into Seaborn to make a probability density plot of that cell’s overall gene expression.
We were also assigned a specific gene to observe how its expression varied across different cells. A similar process was followed: slice the data for that gene, apply the log function, then plot the probability density function.
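A minimal sketch of the slice-and-log step, under a few assumptions: the table below is synthetic, the cell names are hypothetical, and I use `log1p` so that zero RPKM values stay finite (the exact log variant is a guess). It also swaps Seaborn for SciPy's `gaussian_kde` so the density values can be inspected without drawing a plot.

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Synthetic stand-in for the RPKM table: rows are genes, columns are cells.
rpkm_table = pd.DataFrame(
    rng.lognormal(mean=2.0, sigma=1.0, size=(200, 4)),
    index=[f"gene_{i}" for i in range(200)],
    columns=["cell_A", "cell_B", "cell_C", "cell_D"],
)

assigned_cells = ["cell_A", "cell_C"]  # hypothetical assignment

# Slice the assigned cells and log-transform (log1p is an assumption).
log_expr = np.log1p(rpkm_table[assigned_cells])

# Evaluate a kernel density estimate for each cell on a shared grid.
grid = np.linspace(0.0, float(log_expr.to_numpy().max()), 100)
densities = {
    cell: gaussian_kde(log_expr[cell].to_numpy())(grid)
    for cell in assigned_cells
}
```

Looking at one gene across cells works the same way, except the slice is a row, such as `rpkm_table.loc["gene_0"]`, instead of a column.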
We now had to see if Metformin had any statistically significant effect on the genes. To do this, we used the two-sample Kolmogorov-Smirnov (KS) test, which tests whether two one-dimensional samples are drawn from the same distribution. Different significance levels were tested, from 0.1 down to 0.001. We reject the null hypothesis that the two samples come from the same distribution if the p-value returned is smaller than the significance level we set. There were 833 genes common to both the baseline and Metformin datasets. Of these, 309 genes had different expressions. I wrote some code to determine which genes had different expressions at each significance level.
From the first part, Metformin seems to have an effect somewhere. But is there a way to visualize it further?
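The per-gene test can be sketched roughly as follows. The two tables are synthetic (I shift five genes on purpose so they differ), and the variable names are mine rather than the project's; `scipy.stats.ks_2samp` is the two-sample KS test.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
genes = [f"gene_{i}" for i in range(20)]

# Synthetic log-expression tables: rows are genes, columns are cells.
baseline = pd.DataFrame(rng.normal(0.0, 1.0, size=(20, 60)), index=genes)
treated = pd.DataFrame(rng.normal(0.0, 1.0, size=(20, 70)), index=genes)
treated.iloc[:5] += 3.0  # shift the first five genes so they clearly differ

# Only genes present in both tables can be compared.
common = baseline.index.intersection(treated.index)

# One two-sample KS test per gene, comparing its values across cells.
p_values = pd.Series({
    g: ks_2samp(baseline.loc[g].to_numpy(), treated.loc[g].to_numpy()).pvalue
    for g in common
})

# Count rejections of the null hypothesis at each significance level.
for alpha in (0.1, 0.05, 0.01, 0.001):
    rejected = p_values.index[p_values < alpha]
    print(f"alpha={alpha}: {len(rejected)} genes differ")
```

Lowering the significance level can only shrink the list of rejected genes, which is why it is worth reporting the count at several levels.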
Part 2: Fun With Clustering
This is where we got to apply our knowledge of clustering. Using sklearn, we clustered using Gaussian mixture models (GMM) and k-means. To begin clustering, we started off by extracting all the RPKM data from both data sets. We fed this into sklearn’s GaussianMixture, using two clusters for the baseline data and three clusters for the Metformin data. From the baseline data, there were clusters of sizes 152 and 17, while the Metformin data had clusters of 161, 11, and 5. From this, we created new mean baseline and Metformin data sets according to the clusters, using the means of each cluster for each gene. We defined clusters based on membership size, and filtered the data to meet certain conditions.
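Here is a minimal sketch of the fit, cluster sizes, and per-cluster gene means using sklearn's `GaussianMixture`. The data is synthetic with two well-separated groups, and I assume cells are rows and genes are columns; the real tables may have been oriented differently.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n_genes = 5
genes = [f"gene_{i}" for i in range(n_genes)]

# Synthetic stand-in: rows are cells, columns are genes; one large and one small group.
big = rng.normal(0.0, 1.0, size=(150, n_genes))
small = rng.normal(8.0, 1.0, size=(20, n_genes))
cells = pd.DataFrame(np.vstack([big, small]), columns=genes)

gmm = GaussianMixture(n_components=2, random_state=0).fit(cells)
labels = gmm.predict(cells)

# Cluster sizes: number of cells assigned to each cluster.
sizes = pd.Series(labels).value_counts()

# Mean expression of each gene within each cluster: one row per cluster.
cluster_means = cells.groupby(labels).mean()
print(sizes)
print(cluster_means)
```

The cluster labels returned by a mixture model are arbitrary integers, which is one reason to identify clusters by membership size afterward.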
To account for inherently low-expression genes, we removed from M_x the genes it had in common with B_u. Afterward, we kept the genes common to both data sets, keeping only those that were downregulated. From this, we created a dataframe containing the downregulated genes and their values in each cluster, then applied the log(x + 1) transformation to normalize the data. The final result is this boxplot, showing the log of the mean gene expression per cluster. Notice how cluster x has zero expression.
Running the same process again with k-means, we obtained similar results.
Part 3: Principal Component Analysis
The idea behind principal component analysis is to use an orthogonal transformation to convert a set of observations of possibly correlated variables into linearly uncorrelated variables, the principal components. To do this, the data’s eigenvalues and eigenvectors were extracted using NumPy’s linear algebra suite. Pairing each eigenvalue with its eigenvector and sorting the pairs by eigenvalue, we can take a cumulative sum to see how many components are needed to explain the variance.
Why is this important? A good goal is to reach 95% explained variance. As you can see, we did it with two components. But where does that take us? Well, when we investigate the biplot, here’s what we can see:
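The eigenvalue route can be sketched like this. I am assuming the decomposition is applied to the covariance matrix of the centered data, which is the usual choice but not something stated above, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 100 observations of 4 correlated variables driven by 2 latent factors.
latent = rng.normal(size=(100, 2))
mixing = np.array([[3.0, 0.5], [2.5, -0.4], [0.2, 1.0], [0.1, 0.9]])
X = latent @ mixing.T + rng.normal(scale=0.05, size=(100, 4))

# Center the data, then decompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, because a covariance matrix is symmetric

# Sort by descending eigenvalue, keeping each eigenvector paired with its eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of variance explained by each component, and the running total.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto those components.
scores = X_centered @ eigvecs[:, :n_components]
print(cumulative, n_components)
```

The projected `scores` are the coordinates that would be drawn as dots on a biplot.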
These are the biplots of both the baseline and Metformin datasets. The colored dots represent cells, and the colors represent their cluster assignments from the previous section. The four lines in each biplot represent the magnitude and direction of the most influential genes, the ones that contribute most to the variance along the axes. We can see some interesting things happened once Metformin was introduced. Relative to the baseline biplot, RPL18A and RPL37A changed direction once Metformin was introduced. RPL18A’s change in direction was rather mild, but RPL37A’s change was quite drastic.
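For readers wondering where the biplot lines come from, here is a rough sketch under my own assumptions: each gene's arrow is its pair of weights (loadings) on the first two principal components, and the "most influential" genes are the ones with the longest arrows. The data and gene names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
gene_names = np.array([f"gene_{i}" for i in range(8)])  # hypothetical names

# Synthetic cells-by-genes matrix where gene_0 and gene_1 carry most of the variance.
X = rng.normal(scale=0.2, size=(120, 8))
X[:, 0] += rng.normal(scale=3.0, size=120)
X[:, 1] += rng.normal(scale=2.0, size=120)

X_centered = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order[:2]]  # one row per gene: its (PC1, PC2) weights

# Rank genes by the length of their arrow in the PC1-PC2 plane.
arrow_length = np.linalg.norm(loadings, axis=1)
top = np.argsort(arrow_length)[::-1][:4]
for name, (pc1, pc2) in zip(gene_names[top], loadings[top]):
    print(f"{name}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
```

One caveat when comparing two separately fitted biplots: the sign of each eigenvector is arbitrary, so an arrow that flips by 180 degrees along an axis may reflect that convention rather than a change in the data.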
The Takeaways
Let’s learn about learning. In school, there’s probably a point where you’re given information by a teacher and are supposed to use it to draw conclusions about similar problems (apples are fruit, but are oranges fruit?). This is a very simplistic example of supervised learning: in machine learning, a system is given labeled examples and is supposed to learn the relationship between inputs and output labels. In this assignment, however, we used clustering, which is a form of unsupervised learning, where a system is supposed to find hidden patterns or structures in the input.
K-means clustering basically means defining k groups, assigning the data randomly to these k groups, evaluating how good these assignments are, then reassigning the data to create better fits. What’s great about it is that it’s simple to understand and rather quick to use. However, its hard assignment causes some difficulties. That is, when it assigns a data point, it assigns that point 100% to one group without consideration of the other groups.
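The assign-evaluate-reassign loop can be written out in a few lines of NumPy. This is a teaching sketch, not what sklearn does internally, and it uses a common variant that starts from randomly chosen data points as centroids rather than a random assignment of points to groups. The `kmeans` function and the data are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Hard assignment: each point goes to its single nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if its cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stopped changing
        centroids = new_centroids
    return labels, centroids

# Synthetic 2-D data: two groups of 50 points each.
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(10.0, 1.0, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Note that `labels` holds exactly one integer per point: there is no way to express that a point sits halfway between two groups.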
This is where Gaussian mixture models differ. In essence, we again define k groups, but instead of being centroids, these are probability distributions. The process follows a similar route – assign data to these distributions, evaluate fits, and adjust the parameters of the distributions to create better fits. Rather than the hard assignment of k-means, GMM does soft assignment. When assigning data points, the system can consider that some data points have chances to be part of different distributions. For example, a data point could have an 80% chance of being part of distribution A, but a 20% chance of being part of distribution B.
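Soft assignment is easy to see with sklearn's `GaussianMixture.predict_proba`. The sketch below uses synthetic 1-D data with two overlapping groups; the probe points are arbitrary values I picked to show a clear case on each side and an ambiguous one in the middle.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Two overlapping 1-D groups, so points in the middle are genuinely ambiguous.
X = np.concatenate([
    rng.normal(0.0, 1.0, 200),
    rng.normal(4.0, 1.0, 200),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one membership probability per component, summing to 1 per point.
probe = np.array([[0.0], [2.0], [4.0]])
probs = gmm.predict_proba(probe)

# Hard assignment simply picks the most probable component.
hard = gmm.predict(probe)
print(probs.round(3))
print(hard)
```

The middle probe point gets a split probability, while the outer two are assigned almost entirely to one distribution each.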
Now is probably a good time to talk about choosing how many clusters to create. There are two approaches that I know of for choosing the number of clusters: use domain knowledge, or use criteria. Domain knowledge is advantageous to have. I’d love a career in consultation, though that would mean I’d be working on projects from different disciplines and domains. In this data class, I’ve had to learn just as much about the subject matter (supercomputer maintenance, medical research, and network security) as I did about the data concepts themselves. Domain knowledge removes some guesswork from the process, since you would have an intuition for what’s needed and what to look for.
Otherwise, at least in this project, there are silhouette scores and BIC to look at. A silhouette score is a metric of how well data points are assigned to clusters using distance metrics. It considers both how dissimilar a data point is from its own cluster (smaller is better) and how badly matched a data point is to a neighboring cluster (larger is better). Together, the closer this metric is to 1, the better a particular data point is matched to its cluster. The average score of all the points in a cluster measures how tightly grouped the points in that cluster are. In essence, choose a number of clusters where the silhouette score is high.
The Bayesian information criterion (BIC) considers the likelihood function and the number of parameters estimated by a model. To discourage overfitting, it adds a penalty for having too many parameters. Overall, when using BIC, choose the number of clusters with the lowest BIC score.
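Both criteria are available in sklearn, so a scan over candidate cluster counts can be sketched as below. The data is synthetic with three well-separated groups, and pairing the silhouette score with k-means and BIC with GMM is my choice for the example, not necessarily how the project did it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)

# Synthetic data with three well-separated groups in 2-D.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])

silhouettes, bics = {}, {}
for k in range(2, 6):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, km_labels)  # mean over all points; higher is better
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics[k] = gmm.bic(X)                             # lower is better

best_by_silhouette = max(silhouettes, key=silhouettes.get)
best_by_bic = min(bics, key=bics.get)
print(best_by_silhouette, best_by_bic)
```

On real data the two criteria do not always agree, which is where domain knowledge helps break the tie.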