From massive wildfires to melting icebergs, news of the harmful impact of climate change is all around us. Many of the researchers who use Galileo—in the fields of alternative energy or flood modeling, for example—are crafting engineering solutions to these complex problems through careful modeling and computationally intensive simulations.
But social scientists will be quick to explain that predicting the outcome of a particular policy is not always simple, especially when it requires human action. Human behavior and decision making are hard to predict. Even when technical solutions are found, they have to be implemented in the real world, which is infinitely more complicated than the most advanced simulations. Therein lies the challenge.
Computational statistical models help economists and political scientists make sense of this complexity and predict the potential outcome of a course of action. This work, in turn, helps policymakers implement appropriate proposals. Galileo promises to greatly increase the ease, efficiency, and speed of this kind of modeling work by providing researchers with drag-and-drop, on-demand access to cloud computing resources.
Working Together to Face Complex Challenges
As part of the Coupled Natural and Human Systems program funded by the National Science Foundation, economist Dr. Greg Howard, along with colleagues in hydrology, biology, psychology, and engineering, sought to tackle the problem of harmful algal blooms in Lake Erie. To contribute to the project and contextualize the work of his colleagues, Dr. Howard used the statistical package Stata to analyze survey data gathered from 396 farmers in the Maumee watershed in northwestern Ohio.
Estimating a random parameters (or mixed) logit model with the control function method and bootstrapped standard errors took Greg three days on his laptop (24 GB of RAM, Intel Core i7 processor). By contrast, Galileo immediately and seamlessly connected him to a Google Cloud instance, where the same model ran in only 13 hours. And unlike other cloud setups, Galileo does not require social scientists or other domain experts like Dr. Howard to develop a completely new skill set just to access computing resources. He simply had to drag and drop his project into the Galileo interface.
The Problem: Harmful Algal Blooms
Although they have received relatively little attention in national and global news, harmful algal blooms are becoming more and more frequent in many parts of the world. They are responsible for effects ranging from the mass die-off of marine life to the poisoning of our water supply, and Lake Erie is especially at risk.
The problem in Lake Erie has become increasingly complicated for policymakers to solve because of the nature of the pollution at its root. While harmful algal blooms in Lake Erie in the 1960s were principally caused by pollution from residential and industrial emitters, their rise since the 2000s has been fueled by runoff from thousands of farms, primarily in the Maumee watershed. Nutrients in fertilizer cause eutrophication, which is the over-enrichment of a body of water with nutrients and minerals. While it was relatively simple to regulate the “point-source” pollution of the 1960s, it’s much more complicated to deal with nonpoint-source pollution from farms because it is often exceedingly difficult to identify which farms are contributing the most pollution.
In this situation, how might government actors convince farmers to voluntarily reduce runoff from their fields? And how can they best achieve this in a cost-effective manner, avoiding exorbitant spending for incentives, which ultimately costs taxpayers? These are the questions Dr. Howard is working to answer through his research.
He tackled the problem by carrying out a large-scale survey with the aim of understanding farmers’ preferences surrounding voluntary incentive programs run by the US Department of Agriculture. He used the survey data to build a model in Stata that would estimate farmers’ preferences on all sorts of aspects of a prospective program.
These included monetary compensation as well as practical features of a prospective program: which conservation practice would be adopted on the farm, the length of the payment contract (installments over shorter or longer periods of time), and the administrative setup (bureaucratic hurdles of greater or lesser complexity). By varying these attributes across versions of the survey, he could statistically identify how farmers felt about specific program attributes and assign dollar values to them by estimating how farmers trade off the monetary and non-monetary aspects of a program.
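The trade-off logic can be illustrated with a toy calculation. In a simple choice model with a payment attribute, the dollar value a respondent places on a non-monetary attribute is the ratio of that attribute's coefficient to the payment coefficient. A minimal sketch, where the attribute names and coefficient values are invented for illustration and are not Dr. Howard's estimates:

```python
# Hypothetical coefficients from a simple conditional logit.
coefs = {
    "payment_per_acre": 0.04,   # utility per dollar of annual payment
    "long_contract": -0.60,     # disutility of a longer contract term
    "extra_paperwork": -0.90,   # disutility of added administrative burden
}

def dollar_value(attribute):
    """Dollars of payment that offset one unit of the attribute:
    the ratio of the attribute's coefficient to the payment coefficient."""
    return coefs[attribute] / coefs["payment_per_acre"]

print(dollar_value("extra_paperwork"))  # about -22.5: roughly $22.50/acre
```

In other words, under these made-up numbers, a program that adds paperwork would need to pay farmers about $22.50 more per acre to keep adoption unchanged.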
Dr. Howard’s model allowed him to adjust aspects of these hypothetical future programs to see how the changes might affect program adoption. He could then ask questions such as, “If you want people to adopt a different practice, how much more would you have to pay them?” with the goal of informing the larger multidisciplinary study funded by the NSF. His results will be analyzed alongside hydrology studies in order to optimize program adoption (in terms of level of adoption and location) and meaningfully interrupt eutrophication.
Why did the model require so much computation?
The computational intensity of the work derives from the complexity of the model.
In looking at farmers’ preferences, Dr. Howard has to account for the fact that different people want different things, and the factors that drive these differences are often unobservable. Estimating the distribution of these different, or heterogeneous, preferences requires a random parameters (or mixed) logit model. While a simpler fixed-coefficient model would run quickly on Greg’s laptop, the mixed logit takes far longer because its choice probabilities have no closed form and must be approximated by simulation. What dramatically increases runtime, however, is the process of teasing out the variables with real causal impact.
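To see why the mixed logit is simulation-heavy, consider how its choice probabilities are computed: for each respondent, standard logit probabilities are averaged over many random draws of the coefficients. A minimal NumPy sketch, assuming independent normally distributed coefficients and entirely made-up attribute values (not Dr. Howard's specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def mixed_logit_prob(X, mean, sd, n_draws=500):
    """Simulated choice probabilities for one respondent.
    X: (n_alternatives, n_attributes) attribute matrix.
    Coefficients vary across people; here each is drawn from an
    independent normal (an assumption made for illustration)."""
    draws = rng.normal(mean, sd, size=(n_draws, len(mean)))   # (R, K)
    utils = draws @ X.T                                       # (R, J) utilities
    expu = np.exp(utils - utils.max(axis=1, keepdims=True))   # stable softmax
    probs = expu / expu.sum(axis=1, keepdims=True)            # logit prob per draw
    return probs.mean(axis=0)                                 # average over draws

# Two hypothetical program offers: (payment level, paperwork burden)
X = np.array([[1.0, 0.0],    # offer A: base payment, no paperwork
              [1.5, 1.0]])   # offer B: higher payment, more paperwork
p = mixed_logit_prob(X, mean=np.array([0.8, -1.2]), sd=np.array([0.3, 0.5]))
```

Every evaluation of the likelihood repeats this averaging for every respondent, which is why estimation scales with the number of simulation draws and parallelizes so well across cores.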
In this case, Dr. Howard employed the control function method to control for the potential endogeneity of one of his variables of interest, which meant he had to estimate multiple models in different stages. He then had to bootstrap the standard errors, which requires re-running the entire multi-stage estimation over and over: 1,000 replications in this instance. On Greg’s own laptop, this took approximately three days. With Galileo running Stata/MP on 64 cores, it ran in 13 hours.
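The bootstrap itself is conceptually simple, which makes the runtime multiplier easy to see: each replication resamples the data with replacement and re-runs the full estimation, and the standard error is the spread of the resulting estimates. A minimal sketch, with a trivial `fit` function standing in for the actual multi-stage Stata estimation:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit(sample):
    """Stand-in for the full multi-stage estimation; here just a mean."""
    return sample.mean()

data = rng.normal(loc=10.0, scale=2.0, size=396)  # 396 respondents, as in the survey
estimates = np.array([
    fit(rng.choice(data, size=data.size, replace=True))  # resample with replacement
    for _ in range(1000)                                 # 1,000 replications
])
boot_se = estimates.std(ddof=1)  # bootstrap standard error of the estimate
```

If a single estimation takes minutes, 1,000 replications takes days on a laptop; because the replications are independent, they spread naturally across many cores, which is exactly where Stata/MP on a 64-core cloud instance pays off.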
While this particular model took three days, on other projects he has run models that have taken three weeks. Using Galileo to run Stata in the cloud on multiple cores can massively speed up this process.
In the current cloud environment, economists and other domain experts effectively need to moonlight as coders or software developers just to access additional computing resources. This presents an enormous disruption to their real work. Galileo simplifies this process significantly and allows researchers to focus on their work rather than network infrastructure.