A niche for Machine Learning in geoscience

Is a 'black box' useful in science?

Despite strong performance across a range of remote sensing and time series applications, machine learning methods are often considered 'black boxes': we frequently cannot explain why or how they arrive at a particular answer. Because machine learning models are empirically derived (data-driven), a key weakness is that their predictions often cannot be transferred to other regions or to other similar kinds of problems. In science, we want to know the underlying physics of how the world works, so it is crucial to find mathematical, physics-based relationships between the predictor and response variables. Such predictive equations can be generalized beyond the scope of the observations already made and applied in other models. Machine learning algorithms do not provide mathematical relationships between variables that can be generalized for use in other models or for a better understanding of the physics involved (Hastie, 2009). They may rank certain variables as more important or draw specific relationships between predictor and response variables, but these relationships are often ambiguous in the outputs, or the exact relationships are unidentifiable. So is there value in an unexplainable or non-generalizable result in science?

The simplest answer is: it depends on your goal.

Advantages over traditional statistical approaches

Machine learning has the capacity to outperform more traditional statistical approaches in many classification, regression, and forecasting problems (Lary et al., 2016). Traditional methods require making many prior assumptions about the data and its distribution, such as linearity (the relationship can be represented as a straight line), homoscedasticity (the spread of the data around the average is constant), a zero error mean (differences between the actual data and the model cancel out on average, keeping the model unbiased and efficient), independence (individual observations are not influenced by or associated with one another), and normality (the data are symmetrically distributed following a Gaussian bell curve) (Lütkepohl, 2007). The user must know the data intimately to choose an applicable test, and must make assumptions that do not necessarily hold in every situation.
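
To make these assumption checks concrete, here is a minimal sketch (with made-up data; the library calls are from statsmodels and SciPy) of how one might test two of these assumptions, residual normality and homoscedasticity, after fitting an ordinary least squares model:

```python
# Minimal sketch (hypothetical data): fit ordinary least squares and
# check two of the classical assumptions discussed above.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1, 200)   # linear signal + Gaussian noise

X = sm.add_constant(x)                # add intercept term
model = sm.OLS(y, X).fit()
residuals = model.resid

# Normality of residuals (Shapiro-Wilk): p < 0.05 suggests non-normal errors.
print("Shapiro-Wilk p-value:", stats.shapiro(residuals)[1])

# Homoscedasticity (Breusch-Pagan): p < 0.05 suggests non-constant variance.
print("Breusch-Pagan p-value:", het_breuschpagan(residuals, X)[1])
```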

Alternatively, machine learning makes few prior assumptions and automatically improves with experience (Hastie, 2009; Murphy, 2012). In general, it is meant to parse large amounts of data, learn from them, and then apply that knowledge to predict something in the real world. Once trained, machine learning models also tend to be computationally inexpensive compared with traditional numerical methods. Another key advantage of some machine learning techniques, extremely valuable for remote sensing in particular, is that a trained model can be updated as new information is acquired; it can be an online learner (Zhang et al., 2017). Many techniques require building a completely new model from scratch when more data become available, but some (e.g., Long Short-Term Memory networks) allow the user to make continuous predictions while continuously updating to a stronger, better-informed model as new data arrive. Some require no parameter fine-tuning (Hochreiter and Schmidhuber, 1997). Some also generalize well, can handle sparse data, and can rapidly learn to distinguish between widely separated events that would be entirely missed by traditional models or human observation.
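
As a simple illustration of online learning, the sketch below uses scikit-learn's SGDRegressor, one of several estimators that expose a partial_fit method for incremental updates; the data stream here is simulated, and the setup is illustrative rather than tied to any of the cited studies:

```python
# Minimal sketch of online learning: the model is updated batch by batch
# instead of being retrained from scratch each time new data arrive.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(random_state=0)

# Simulate data arriving in batches (e.g., new satellite acquisitions).
for batch in range(5):
    X_new = rng.uniform(0, 1, size=(100, 3))
    y_new = X_new @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)
    model.partial_fit(X_new, y_new)   # update without discarding the old model

print(model.coef_)  # coefficients refined as each batch arrives
```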

Machine learning techniques are also capable of modeling nonparametric, high-dimensional, noisy, and nonlinear datasets (Lary et al., 2016). Many relationships in the real world are nonlinear and noisy. Numerical methods determine primary variable relationships and, if the model terms explain a significant amount of the variance within the data, the model is deemed useful. In practice, however, this may mean that only 30-70% of the variance in the data is explained (depending on the degrees of freedom), and the rest of the variance is relegated to an error term without being expressly modeled. Physics-based models are limited by the availability of a priori information about the processes involved and by the capacity of the modeler to identify those processes (Lütkepohl, 2007). If a modeler does not know to model a specific phenomenon, the model will not incorporate it and will be incomplete. Machine learning, however, can learn relationships buried within the noise to create better predictive models. The primary relationships captured in traditional models tend to be physical phenomena that are readily apparent and of interest at the scale of the system being modeled. The noise or chaos beyond those relationships, though, is still physically derived phenomena, occurring somewhere between the sub-atomic and planetary scales, that the modeler has overlooked or ignored. With enough data, a machine learning algorithm can identify a larger share of the relationships hidden in that noise than traditional models can, especially for highly complex problems like climate models. Taken together, these traits mean that 'black box' machine learning techniques could offer a more rapid means of understanding environmental problems without the need for a complete theoretical grasp of the systems involved (Lary et al., 2016). That is, they could make better predictions because they can discern unknown relationships from the data.
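
The sketch below illustrates this point on synthetic data: a nonlinear, noisy signal that a linear model under-fits but a data-driven method (a random forest, as one example) captures well. The data and model choices are illustrative, not drawn from any of the cited studies:

```python
# Minimal sketch: a nonlinear, noisy signal where a linear model leaves most
# of the variance in the error term, while a data-driven method recovers it.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(2 * X[:, 0]) + 0.3 * rng.normal(size=1000)  # nonlinear + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)

print("Linear R^2:", r2_score(y_te, linear.predict(X_te)))  # near zero
print("Forest R^2:", r2_score(y_te, forest.predict(X_te)))  # much higher
```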

A need in polar science

In glaciology, modeling outlet glacier dynamical responses to contemporary environmental changes has proven to be a challenge using traditional numerical methods, but data-driven machine learning techniques could accelerate progress. Glacier systems are complex and highly variable, and determining the physics behind some of the more rapid changes of the last two decades requires longer time series than are sometimes available. Often there are no direct observations of critical processes, because these systems lie beneath a kilometer or two of ice in remote, hard-to-reach parts of the world. Modeling these systems with traditional methods requires complex and computationally expensive numerical techniques. For these reasons, our knowledge of the physics is still incomplete, and physics-based models have not yet been able to adequately reproduce in situ observations (Joughin et al., 2012). These models have advanced rapidly over the last two decades, but they have not yet been incorporated into the complex climate models traditionally used to predict major climate changes in the foreseeable future, making accurate estimates of sea level change difficult. Estimates of the sea level contribution from the ice sheets vary widely, causing sea level rise predictions to range from 0.3 to 2.5 m by 2100 (Jevrejeva et al., 2014; DeConto and Pollard, 2016). Adequate coastal planning cannot be achieved with that scope of uncertainty.

Where can machine learning help? While complex climate models that include ice sheet variability may be several years out, we may be able to create climate forecasts that are more accurate, with a fraction of the lines of code, in a very short amount of time, and with limited knowledge of the systems. Machine learning often performs better when forecasting highly complex systems, no matter how much theoretical understanding was captured in the traditional model; a system that complex is impossible to model perfectly with numerical methods because of its scope. Alternatively, machine learning can be used to build emulators that improve the performance of physics-based models. Emulators are simplified models that mimic the behavior of more complex climate models; they can stand in for a piece of a physics-based model to speed up and simplify the overall model.
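
Here is a minimal sketch of the emulator idea, using a stand-in function for the expensive physics-based component and a Gaussian process as the surrogate (one common choice; any fast regressor would do):

```python
# Minimal sketch of an emulator: train a cheap statistical surrogate on
# input/output pairs from an expensive physics model (here a stand-in).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def physics_model(x):
    """Stand-in for an expensive physics-based component (hypothetical)."""
    return np.sin(3 * x) * np.exp(-0.2 * x)

# Run the expensive model a modest number of times to build training data.
X_train = np.linspace(0, 10, 40).reshape(-1, 1)
y_train = physics_model(X_train).ravel()

emulator = GaussianProcessRegressor().fit(X_train, y_train)

# The emulator now gives near-instant approximations at new inputs,
# along with an uncertainty estimate on each prediction.
X_query = np.array([[2.5], [7.1]])
y_pred, y_std = emulator.predict(X_query, return_std=True)
print(y_pred, y_std)
```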

Additionally, there is an incredible need for automated analyses as data volumes expand exponentially and we must conduct ice-sheet-wide analyses of features ranging from very small (sub-kilometer) to massive. Much work has gone into automating the detection of glacier and ice shelf fronts (Baumhoer et al., 2019; Mohajerani et al., 2019; Cheng et al., 2021), a task historically conducted by hand over hundreds of intern and researcher hours. Machine learning has also been applied to sea ice monitoring (Dumitri et al., 2019), identifying supraglacial lakes (Dirscherl et al., 2020), crevasse detection (Williams et al., 2014), iceberg tracking (e.g., Barbat et al., 2021), and many other uses. The image classification methods predominantly utilize U-Nets, a class of Convolutional Neural Network (CNN) architectures that excels at image segmentation. (Still under construction.)
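
For illustration, a toy two-level U-Net is sketched below in PyTorch; the depth, channel widths, and input size are placeholders, not the configurations used in the cited papers:

```python
# Minimal sketch of a U-Net-style architecture: an encoder that downsamples,
# a decoder that upsamples, and a skip connection joining the two.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 16)
        self.enc2 = conv_block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)           # 32 = 16 skip + 16 upsampled
        self.head = nn.Conv2d(16, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                        # encoder, full resolution
        e2 = self.enc2(self.pool(e1))            # encoder, half resolution
        d1 = self.up(e2)                         # decoder, back to full res
        d1 = self.dec1(torch.cat([d1, e1], 1))   # skip connection
        return self.head(d1)

# One forward pass on a fake single-band 64x64 satellite tile.
logits = TinyUNet()(torch.randn(1, 1, 64, 64))
print(logits.shape)  # torch.Size([1, 2, 64, 64]): a per-pixel segmentation
```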

Our work: Image classification in polar regions

When we use a satellite image, we are often only interested in making measurements of certain kinds of surfaces. For example, I am often interested in measuring sea surface temperatures, which requires that I accurately know where the ocean exists in an image and that I exclude temperatures over ice or cloud. Classifying satellite images of polar regions is a particularly challenging problem, though, because different kinds of surfaces can look very similar. Ocean can be hard to distinguish from bare land in the polar regions because both are dark and can be relatively warm compared to ice. Even more problematic is the identification of clouds, which can be thin and nearly transparent, taking on the appearance of the surface they cover, or can be white or gray like ice.

Appearing similar to our eye means that, in an image, these surfaces may have similar spectral properties and therefore be hard to tell apart mathematically using a computer and traditional statistical methods. That is, they may look the same across the wavelengths of the electromagnetic spectrum that we measure: the visible spectrum (i.e., the colors of the rainbow we can see by eye), near infrared, shortwave infrared, and thermal infrared (heat). Still under construction....
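
One classical, pre-machine-learning illustration of how these bands get used is a normalized-difference snow index (NDSI), which contrasts a green band against a shortwave-infrared band: snow and ice are bright in the visible but dark in the SWIR, while most clouds are bright in both. The arrays and threshold below are illustrative:

```python
# Minimal sketch: NDSI computed from hypothetical green and shortwave-infrared
# reflectance arrays, then thresholded to flag snow/ice pixels.
import numpy as np

def ndsi(green, swir, eps=1e-6):
    """Per-pixel (green - swir) / (green + swir); eps avoids divide-by-zero."""
    green = green.astype(float)
    swir = swir.astype(float)
    return (green - swir) / (green + swir + eps)

# Fake 2x2 reflectance tiles: top-left pixel snow-like, top-right cloud-like.
green = np.array([[0.9, 0.8], [0.1, 0.3]])
swir = np.array([[0.1, 0.7], [0.1, 0.3]])

snow_mask = ndsi(green, swir) > 0.4   # common rule-of-thumb threshold
print(snow_mask)
```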

Making AI open

