Using a Random Forest model to predict the distribution of benthic biomass in the Bering Sea
Model description
What is a Random Forest model?
A Random Forest is an algorithmic model that provides highly accurate predictions in complex datasets, can incorporate a large number of predictor variables, can automatically handle interactions, can handle missing data, and is simple to apply and interpret. Learn more about Random Forests.
Building a benthic biomass model for the Bering Sea
We used publicly available benthic biomass data from sampling stations in the northern and eastern Bering Sea to construct the model. The distribution of sampling stations is shown in the figure to the left.
We standardized the data to a common unit (g wet weight/sqm) and divided the wet weight of benthic biomass at each sampling station into five categories to facilitate graphic presentation: <50 g/sqm, 50-100 g/sqm, 100-200 g/sqm, 200-500 g/sqm, and >500 g/sqm.
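The binning step can be illustrated with a short Python sketch. It is not the original workflow; the function name `biomass_category` is hypothetical, and the text does not say which category a value exactly on a boundary (for example 100 g/sqm) was assigned to, so the sketch assumes it goes to the higher category.

```python
from bisect import bisect_right

# Category boundaries in g wet weight per square metre, taken from the text.
BREAKS = [50, 100, 200, 500]
LABELS = ["<50", "50-100", "100-200", "200-500", ">500"]


def biomass_category(wet_weight_g_per_sqm):
    """Return the biomass category (1-5) for one sampling station.

    Assumption: a value exactly on a boundary falls into the higher
    category; the source text does not specify this.
    """
    if wet_weight_g_per_sqm < 0:
        raise ValueError("biomass cannot be negative")
    # bisect_right counts how many boundaries are <= the value.
    return bisect_right(BREAKS, wet_weight_g_per_sqm) + 1


def category_label(category):
    """Return the legend label for a category number (1-5)."""
    return LABELS[category - 1]
```

For example, `biomass_category(150)` falls in category 3, whose label is `"100-200"`.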
Environmental data
The distribution and abundance of benthic invertebrates are influenced by a number of environmental variables. We obtained data for six important environmental predictor variables from publicly available sources: sediment grain size, distance to coastline, water depth, chlorophyll a concentration, sea ice cover, and sea surface temperature. We imported each environmental variable as a separate layer into ArcGIS, and used a point grid across the Bering Sea with a grid cell size of 10 x 10 km. Thus, our data are not suited to modelling local-scale variation, but are appropriate for a model surface covering the entire Bering Sea.
Constructing the Random Forest model
We overlaid each sampling station with the layers of all environmental predictor variables in ArcGIS, and thus obtained environmental data for each sampling station. Using the dependent variable (benthic biomass category 1-5) and the predictor variables for each of the 624 sampling stations, we constructed 1500 classification trees. Each tree was built from a random subset of 64% of the data, drawn without replacement. We chose m, the number of predictor variables tried at each split of a tree, to maximize classification accuracy, which we report as the percentage of sampling stations for which the category of benthic biomass was predicted correctly. We conducted our analyses in R 2.7.1 with the add-on package randomForest, version 4.5-25.
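To make the per-tree subsampling concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the R code used in the analysis. The function name `draw_tree_samples`, the seed, and the rounding of 64% of 624 stations to a whole number are all assumptions. Whether the reported accuracy was computed on the stations left out of each tree is not stated in the text.

```python
import random

N_STATIONS = 624            # number of sampling stations (from the text)
N_TREES = 1500              # number of classification trees (from the text)
SUBSAMPLE_FRACTION = 0.64   # share of stations used per tree (from the text)


def draw_tree_samples(n_stations, fraction, rng):
    """Draw the stations used to build one tree, without replacement.

    Returns (in_bag, left_out): the station indices used for the tree and
    the indices not used. Rounding to the nearest whole station is an
    assumption made for this sketch.
    """
    n_in_bag = round(fraction * n_stations)
    # random.sample draws without replacement, so no station repeats.
    in_bag = set(rng.sample(range(n_stations), n_in_bag))
    left_out = set(range(n_stations)) - in_bag
    return in_bag, left_out


rng = random.Random(42)  # arbitrary seed, only for reproducibility here
samples = [
    draw_tree_samples(N_STATIONS, SUBSAMPLE_FRACTION, rng)
    for _ in range(N_TREES)
]
```

Because sampling is without replacement, every tree sees each selected station exactly once, and the stations it does not see differ from tree to tree.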
The model constructed with m = 6 classified 78.2% of sampling stations correctly. Runs with other values of m (3-7) gave lower classification accuracy (from 77.9% at m = 7 down to 72.6% at m = 3). Overall, the model performed well: it predicted the correct benthic biomass category at more than 75% of sampling stations.
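The accuracy measure and the choice of m can be sketched in a few lines of Python. The function name `classification_accuracy` is hypothetical. Only the three accuracies stated in the text are used; values for other settings of m are not reported and are not filled in here.

```python
def classification_accuracy(observed, predicted):
    """Percentage of stations whose biomass category was predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one station is required")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Accuracies (%) reported in the text, keyed by m.
reported_accuracy = {3: 72.6, 6: 78.2, 7: 77.9}

# Pick the m with the highest reported accuracy.
best_m = max(reported_accuracy, key=reported_accuracy.get)

# Approximate number of the 624 stations classified correctly at the best m.
approx_correct = round(reported_accuracy[best_m] / 100 * 624)
```

With the reported figures, `best_m` is 6 and `approx_correct` is about 488 of 624 stations.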
Prediction of benthic biomass across the Bering Sea
We used the environmental variables available at a 10 km grid cell resolution across the entire Bering Sea to predict benthic biomass in each grid cell, using the Random Forest model built from the 624 sampling stations. The map below shows the distribution of benthic biomass across the Bering Sea in five categories: <50 g/sqm, 50-100 g/sqm, 100-200 g/sqm, 200-500 g/sqm, and >500 g/sqm.
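A Random Forest classifier assigns a category by majority vote across its trees. The Python sketch below illustrates that voting step for one grid cell. The three stand-in "trees", their thresholds, and the example cell values are entirely made up for illustration and are not results from the model. Resolving ties in favour of the category voted first is also an assumption of this sketch.

```python
from collections import Counter


def forest_predict(trees, cell_predictors):
    """Return the biomass category that most trees vote for.

    `trees` is any list of callables mapping a dict of predictor values
    to a category (1-5). Ties go to the category that was voted first,
    which is a simplification chosen for this sketch.
    """
    votes = Counter(tree(cell_predictors) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for fitted trees; thresholds are invented.
toy_trees = [
    lambda cell: 4 if cell["depth_m"] < 60 else 2,
    lambda cell: 4 if cell["chlorophyll_a"] > 1.5 else 1,
    lambda cell: 3 if cell["sea_ice_cover"] > 0.5 else 2,
]

# Hypothetical predictor values for one 10 x 10 km grid cell.
example_cell = {"depth_m": 45, "chlorophyll_a": 2.0, "sea_ice_cover": 0.7}

predicted_category = forest_predict(toy_trees, example_cell)
```

Here two of the three toy trees vote for category 4, so that category is returned. Applying the same function to every grid cell would yield one predicted category per cell.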