Last summer, I spent three months doing machine-learning research in a Bangkok data science and engineering lab through U of T’s Engineering Science Summer Research Opportunities Program. Before arriving in Thailand, I had questions. What is engineering research? What is it like to work in machine learning? I had read a lot about generative artificial intelligence (AI) that can write entire fake news articles, produce fiction and poetry from a few starter lines supplied by a user, or create ‘deepfake’ footage of celebrities and political figures using images of their faces found online. Can an AI have original thoughts, too, or make academic discoveries? Is there a difference between machine learning and human education?
How do machines learn?
Neural networks, a common type of machine learning model, learn inductively from data: they infer general rules from many specific examples. This paralleled how I learned to behave alone in a foreign country. I got used to following the locals: bowing forward with my hands folded to say hello in the wai, a greeting gesture; covering my legs and shoulders when entering temple grounds; taking off my shoes to go inside; ending my sentences with “kha” to sound polite. I got used to doing many things without knowing why.
This is what many machine learning models do. They copy patterns in data without being told or educated as to why these patterns exist. During the process of building or ‘training’ a neural network, the machine is shown a series of training examples to learn from. These consist of the values of several independent variables, or features, along with the true value of the output that it is tasked with predicting.
Training is the process of learning the weight of each feature in predicting the output variable. Basic models can predict the price of your car insurance from demographic information, or your chances of having cancer from genetic data. A generative model might predict the next word in a fake news story based on all of the previous words, or the value of each pixel in the next frame of a video based on previous ones and videos it has seen in the past.
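To make this concrete, here is a minimal sketch of a training example in code, written with scikit-learn; the features, outcomes, and numbers are invented for illustration and are not real insurance data.

```python
# A toy supervised-learning setup: each training example pairs feature
# values with the true output the model must learn to predict.
from sklearn.neural_network import MLPClassifier

X = [
    [35, 0, 12],  # hypothetical features: age, prior claims, mileage (thousands of km)
    [22, 2, 30],
    [58, 1, 8],
    [41, 0, 15],
]
y = [0, 1, 1, 0]  # the true outcome for each example, e.g. whether a claim was made

# 'Training' adjusts the network's internal weights until its predictions
# match the known outputs of the examples it has been shown.
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)

print(model.predict([[30, 1, 20]]))  # a prediction for a new, unseen example
```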
Complex models like these, with many layers of internal neurons, are often described as ‘black box’ systems because their internal logic can’t be understood in human terms; it is often impossible to find out what each neuron represents. Most neural networks are built to reproduce patterns in the data they are trained on, and have no sense of causality. Can you call that learning, when they do the farthest thing from thinking for themselves?
Applying machine learning to global health
Are there ways we can take advantage of this inductive ‘learning’ style — learning without being educated, if you will? This summer, I worked in bioinformatics, a field at the intersection of machine learning and biology, in which the answer is: yes. There are many ways.
Cells are goldmines of data. Inside each human cell, there is DNA, which contains 20,000–25,000 genes, each of which encodes a recipe for producing a unique protein with a specific function. Factors in the cell’s environment determine the expression of each gene, or the rate at which the cell manufactures that gene’s specific protein.
Each human cell has its own gene expression profile, which is why the cells in your eye, in your skin, and in your gut are different. Even though they all carry the same genetic code, they produce different proteins that do different things. Gene expression can be measured on a microarray chip from messenger RNA, a gene-specific molecule involved in protein synthesis, yielding rich data that can be used to train neural networks to predict critical medical events. Such predictions can include a cancer patient’s responses to different drugs, or the likelihood of a pregnancy resulting in a preterm birth. Although the functions of many genes are unknown, patterns in past data are often all that is needed to make accurate predictions.
In my research, I looked at gene expression data — not from humans, but from malaria parasites obtained from patient blood samples. In the past decade, resistance to the state-of-the-art malaria drug dihydroartemisinin (DHA) has been spreading through Southeast Asia. If this resistance reaches Africa, much of our progress in controlling the disease will be reversed.
If we produce a machine learning model that can process thousands of gene expression values and predict malarial drug resistance much better than a random guess, we could potentially use this information to track the movement of the drug-resistant parasite strain, and develop measures to control carrier mosquitoes — all without ever knowing the mechanism of DHA resistance. Education is not required.
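As a rough sketch of the setup, and not the lab’s actual pipeline, the following trains a classifier on a gene expression matrix and checks whether it beats random guessing; because the data here is synthetic noise, the score will hover around the chance baseline.

```python
# Hypothetical sketch: predicting drug resistance from gene expression.
# X_expr is a matrix of expression values (one row per parasite sample,
# one column per gene); y_resist marks DHA-resistant samples. Synthetic
# random data stands in for real measurements.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X_expr = rng.normal(size=(200, 5000))    # 200 samples x 5,000 genes
y_resist = rng.integers(0, 2, size=200)  # 1 = resistant, 0 = sensitive

X_train, X_test, y_train, y_test = train_test_split(
    X_expr, y_resist, test_size=0.25, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# An AUC near 0.5 is no better than a random guess; a useful model
# needs to score meaningfully above that.
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```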
What should be considered academic?
Why is this considered research worthy of being done at a university, in a lab? Arguably, a biologist would be better equipped to tackle this problem in a scientific way than a machine learning specialist would. My summer consisted mostly of trying different model architectures and tuning hyperparameters to see which model performed best. It didn’t feel very scientific; I tried different software, read about different ways of preprocessing the features, and waited for my models to train, which can take hours due to the millions of computations involved.
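That trial-and-error loop looks roughly like the sketch below: define a grid of candidate architectures and hyperparameter values, cross-validate each configuration, and keep the best. The grid and the synthetic data are illustrative, not the actual search I ran.

```python
# Hyperparameter tuning by grid search over toy data.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))
y = (X[:, 0] > 0).astype(int)  # a planted signal for the models to find

param_grid = {
    "hidden_layer_sizes": [(16,), (32,), (32, 16)],  # candidate architectures
    "alpha": [1e-4, 1e-3, 1e-2],                     # regularization strength
}

search = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_grid,
    cv=5,  # score each configuration by 5-fold cross-validation
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```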
Although commonly thought of as cutting-edge, the neural network actually originated in the ’40s as a mathematical model of neurons in the brain, and the backpropagation algorithm used in training has roots in the ’60s, though it was only widely adopted decades later. Now that computer hardware can train complex neural networks in reasonable timeframes — hours or days instead of years — much of present-day machine learning work is just learning how to make the tool work with different problems. Compared to ‘purer’ areas with heavy theoretical bases, such as math, physics, and biology, is the trial-and-error process of machine learning research really academic at all?
A similar question to pose: did I belong in academia too? Living so far from home, I had a sense of being uneducated compared to those around me: I was following customs without knowing why, and I had never before been entirely surrounded by academic people. None of us Canadian interns were able to learn Thai in the short time we were abroad, and there were few people I could talk with besides the other interns and my colleagues. In our spare time, over meals and on day trips, we would talk about math, thought experiments, and politics — linear algebra, Maxwell’s demon, and Hong Kong. I learned a lot, but I sometimes had trouble holding my own in our conversations at restaurant tables and on the street.
As a result, many of my memories from Bangkok are of quietly listening and looking out windows. From taxis onto roads lined with yellow and white sashes hanging in neat arcs across fences and bridge railings, left over from the coronation of the new King Rama X in May, and larger-than-life pictures of him standing between neatly trimmed shrubs on the traffic islands. From the rows of balconies looking out of Bangkok State Tower, each with a pillared balustrade. The golden cables of the Rama VIII bridge across the Chao Phraya River. From restaurants: parents and children, groups of students laughing in pressed uniforms. Temple walls decorated with triangular pieces of coloured porcelain. The family of stray dogs that was always running down the alley beside our residence, chasing, playing, following each other on adventures.
Machine education: human-interpretable machine learning
Machine learning is still a growing research field. Are there, after all, ways that we can use it to produce new knowledge? If a model’s architecture is simple enough to avoid the black box effect, in which the model’s path to its output is obscured, the weights of different features can be used to infer their relative importance in predicting the output, which can give researchers more information about a problem. When predicting malarial DHA resistance, if there were a way to know the function of each gene, something could be learned about the resistance mechanism.
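As a toy illustration, assuming a simple linear model rather than my actual architecture, the learned weights of a logistic regression hint at which features drive its predictions; the gene names here are made up.

```python
# Interpretability in a simple (non-black-box) model: with a single
# linear layer, each feature's learned weight suggests its relative
# importance. Synthetic data in which only 'gene_a' matters.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.2 * rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

for name, weight in zip(["gene_a", "gene_b", "gene_c"], model.coef_[0]):
    print(f"{name}: weight {weight:+.2f}")  # gene_a's weight should dominate
```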
In my work, I tried to make this possible. Although the functions of many malaria genes are unknown, researchers have grouped many of them into gene sets that code for proteins that play related roles or are found in the same locations within the cell. I used a relatively new technique called ‘feature transformation’ to condense expression values for 5,000 genes, the functions of which are largely unknown, into about 100 values of gene set expression.
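A much-simplified sketch of that idea, assuming the transformation is a plain average over each set’s member genes (the real technique may combine them differently), might look like this:

```python
# Condense per-gene expression into per-gene-set features by averaging
# member genes. Gene set membership and data are invented.
import numpy as np

n_samples, n_genes = 50, 5000
expression = np.random.default_rng(0).normal(size=(n_samples, n_genes))

# Each gene set lists the column indices of its member genes.
gene_sets = {f"set_{i}": list(range(i * 50, (i + 1) * 50)) for i in range(100)}

set_expression = np.column_stack(
    [expression[:, idx].mean(axis=1) for idx in gene_sets.values()]
)
print(set_expression.shape)  # (50, 100): ~100 gene-set features per sample
```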
If we can find patterns in the most important gene sets for predicting DHA resistance, we can actually learn about how the resistance works and identify targets within the cell for the development of new malaria treatment drugs.
Machine miseducation: learned bias
Does human interpretability make our machine learning model educated? After all, it’s still just looking for patterns. In fact, there is very little blood sample data available, and in many cases, random sampling caused my model to find coincidental correlations between drug resistance and genes that were actually related to other traits, such as the life cycle stage of the parasite. These correlations didn’t generalize, so when I tested my trained model on new data, it performed poorly. This problem is called overfitting: the model learned quirks and biases of its small training set that do not hold in the world at large.
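A toy demonstration, on deliberately meaningless synthetic data, reproduces the signature I saw: near-perfect accuracy on the training samples and chance-level accuracy on held-out ones.

```python
# Overfitting: with few samples and many features, a flexible model can
# memorize coincidental patterns that do not generalize.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))  # tiny sample, thousands of features
y = rng.integers(0, 2, size=40)  # labels unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))  # near 1.0: memorized
print("test accuracy:", model.score(X_te, y_te))   # near 0.5: chance level
```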
In my case, the bias was likely random and caused by the small training sample size. However, overfitting creates an equity problem when undesired patterns exist in datasets that are thought to be unbiased. For example, risk-assessment models used in law enforcement have predicted that Black inmates are more likely to reoffend, because systemic racism shaped the past policing data they were trained on, and hiring algorithms have ranked women as poorer candidates for STEM jobs because of past underrepresentation. Even human-interpretable machine learning models reproduce the data they are trained on, in all its good and evil, without knowing why. This hardly sounds like education to me.
Engineering research: academia meets application
Is machine learning research less academic? Perhaps. Engineering research is fundamentally different from running experiments in the natural or physical sciences. Many of us data science lab interns were new to academic research, and we talked about this often. All of the lab projects were impactful — students were working on finding biomarkers for cancer diagnoses, text mining to find metabolic pathways in literature databases, and interpreting emotions from brainwave data. Engineering is not about answering the universe’s questions. To me, it’s something a little more humble but every bit as noble: it’s about making things that work and can help people.
If our machines will never be educated, at least they have found patterns that can help us diagnose and eradicate diseases. Although my model did not discover any new drugs, it contributed to a paper, and hopefully it will help with bioinformatics research that will be used to control malaria. Working in Bangkok, I discovered that learning and education are not mutually exclusive for humans, or for machines. I spent a lot of time listening rather than speaking, which has left me with many memories: of wai-ing and walking barefoot in temples, of pillared balconies, porcelain patterns on walls, parents, children, and students.
Although machine learning models cannot yet form thoughts or come up with cures for disease, they can mimic the things we do without knowing why, like our sentence structure, our media, and the way our cells behave. After all, maybe these are the most human things about us.
———
Photos by Nicola Lawford
Design by William Xiao