R is a powerful language for statistics and programming. Despite its reputation for being hard to learn, it is increasingly used in many areas of research and has become an essential tool in oceanography and marine ecology. For instance, R is used to read, process and plot in situ oceanographic data through dedicated packages (e.g. oce) or, more generally, to manage satellite data in order to produce maps with high temporal and spatial resolution, which are useful to synoptically explore and monitor vast areas of the world's oceans.
In this post we briefly describe a practical use of R in conjunction with satellite data to identify marine bioregions of the Labrador Sea (an arm of the North Atlantic Ocean between the Labrador Peninsula and Greenland) with different patterns in the phytoplankton seasonal cycle (https://en.wikipedia.org/wiki/Phytoplankton). Phytoplankton are microscopic plants that occupy the lowest level of the marine food chain. Their presence in surface waters is revealed by their chlorophyll-a and other photosynthetic pigments, which change the color of ocean waters. Nowadays, satellite ocean color sensors are routinely used to estimate the concentration of chlorophyll-a and other parameters in the surface waters of the oceans. All these data are freely available for research and educational purposes.
The approach used to identify the bioregions is therefore based on the chlorophyll-a concentration, an index of phytoplankton biomass. The Globcolour project (http://www.globcolour.info), which combines data from several satellites to reduce spatial and temporal gaps, provides a set of satellite parameters including estimates of chlorophyll-a. The data are provided at several temporal resolutions (daily images, 8-day composite images and monthly averages) and spatial resolutions (1 km, 25 km, 100 km) and are stored in NetCDF (https://en.wikipedia.org/wiki/NetCDF) files, a format that includes metadata in addition to the data sets. In our case, among other information, each file contains the latitude and longitude values that identify each pixel on the grid. For our purpose we downloaded 8-day composite images (about one image every week, from 1998 to 2015) with a spatial resolution of 25 km. To work with NetCDF files we used the R package ncdf4, which replaces the former ncdf package.
Once the time series had been downloaded and unzipped (.nc files), several steps were needed to reach our objective:
1. Using the functions nc_open and ncvar_get from the R package ncdf4, the .nc files were opened and the chlorophyll-a values (pixels) were extracted together with the spatial coordinates and the date.
2. A large data frame was then created by assigning to each pixel its latitude and longitude, an id-pixel (i.e. each pixel was numbered) and an id-date (i.e. year, month and day of the year). In this way each pixel was uniquely identified within the data frame.
3. An 8-day climatological time series of chlorophyll-a concentrations was created by averaging, for each pixel within the area of interest, the values of the same 8-day period over 1998-2015 (i.e. averaging all the first weeks, all the second weeks, etc.).
4. The resulting time series was normalized (https://en.wikipedia.org/wiki/Feature_scaling) in order to scale the values between 0 and 1.
5. A cluster analysis was carried out on the normalized climatology (see points 3 and 4) to identify marine regions with similar chlorophyll-a cycles (clusters).
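The first four steps above can be sketched roughly as follows. This is only an illustration, not the code used for the analysis: the helper read_chl and the variable names "CHL1_mean", "lat" and "lon" are assumptions (the names stored in a given file should be checked by printing the opened file object), the chlorophyll-a values are synthetic, and the choice to rescale each pixel's series separately with min-max scaling is also an assumption.

```r
# Step 1 (sketch): read one NetCDF file with ncdf4.
# "CHL1_mean", "lat" and "lon" are assumed variable names, not confirmed ones.
read_chl <- function(path, var = "CHL1_mean") {
  nc <- ncdf4::nc_open(path)
  on.exit(ncdf4::nc_close(nc))          # always close the file handle
  list(lon = ncdf4::ncvar_get(nc, "lon"),
       lat = ncdf4::ncvar_get(nc, "lat"),
       chl = ncdf4::ncvar_get(nc, var))
}

# Step 2: long data frame, one row per pixel and date (synthetic values here).
set.seed(1)
grid <- expand.grid(lon = c(-60, -55, -50), lat = c(55, 60, 65))
grid$id_pixel <- seq_len(nrow(grid))                    # number each pixel
dates <- expand.grid(year = 1998:2000, period = 1:46)   # 46 eight-day periods per year
df <- merge(grid, dates)                                # no shared columns: all combinations
df$chl <- rlnorm(nrow(df), meanlog = -0.5, sdlog = 0.5) # fake chlorophyll-a, mg/m3
df$chl[df$year == 1998 & df$period <= 3] <- NA          # mimic missing winter data

# Step 3: 8-day climatology, i.e. mean over years for each pixel and period.
clim <- tapply(df$chl, list(id_pixel = df$id_pixel, period = df$period),
               mean, na.rm = TRUE)                      # matrix: pixels x periods

# Step 4: rescale each pixel's climatological series to the range 0-1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
clim_norm <- t(apply(clim, 1, rescale01))               # still pixels x periods
```

With real data, read_chl would be applied to every downloaded file and the results stacked into df before computing the climatology.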
To perform the cluster analysis we used the function kmeans (package stats). The Calinski-Harabasz index was used to evaluate the optimal number of clusters. More detailed information about the procedure described above can be found in D'Ortenzio and Ribera d'Alcalà 2009 and Lacour et al. 2015.
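As a hedged illustration of this clustering step, the sketch below runs kmeans on synthetic normalized time series and computes the Calinski-Harabasz index directly from the sums of squares that kmeans returns. The helpers bloom and ch_index, the synthetic data and the range of candidate cluster numbers (2 to 6) are assumptions made for the example; the original analysis may have computed the index differently, for instance with a dedicated package.

```r
set.seed(42)
days <- seq(1, 361, by = 8)                    # 46 eight-day periods

# Synthetic pixels: a bell-shaped seasonal cycle plus noise (hypothetical data).
bloom <- function(peak, n) {
  t(replicate(n, exp(-((days - peak) / 30)^2) + rnorm(length(days), sd = 0.05)))
}
x <- rbind(bloom(110, 40), bloom(140, 40))     # 80 pixels x 46 periods

# Calinski-Harabasz index: between-cluster variance over within-cluster
# variance, each divided by its degrees of freedom.
ch_index <- function(km, n) {
  k <- length(km$size)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ks <- 2:6                                      # candidate numbers of clusters
ch <- sapply(ks, function(k) {
  ch_index(kmeans(x, centers = k, nstart = 20), nrow(x))
})
best_k <- ks[which.max(ch)]                    # k with the highest index

km <- kmeans(x, centers = best_k, nstart = 20)
table(km$cluster)                              # number of pixels per bioregion
```

Using nstart = 20 repeats the algorithm from several random starts, which makes the result less dependent on the initial centers. The cluster label of each pixel (km$cluster) can then be joined back to its latitude and longitude to map the bioregions.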
The final outcome of this analysis is shown in the figure below.
As we can see, two main areas were identified: bioregion 1 (the yellow area), located north of about 60°N, and bioregion 2 (the green area), located south of 60°N. The two bioregions show different climatological cycles of phytoplankton biomass (blooms). In the northern part of the Labrador Sea (bioregion 1) the bloom starts earlier (around day 102, dashed line in the figure) and is more intense (more than 1.75 mg/m3). Conversely, in the southern part (bioregion 2) the bloom starts later (day 128) and is less intense (less than 1.75 mg/m3). Note that, for simplicity, the bloom onset (represented by the dashed line and usually used as a warning bell for possible changes in trophic interactions and biogeochemical processes) was identified as the time when the chlorophyll-a concentration rises to the threshold of 1.0 mg/m3. Finally, the figure was created with three R packages: rasterVis, ggplot2 and gridExtra.
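The threshold rule for the bloom onset can be written in a few lines of R. The sketch below is only an illustration under stated assumptions: the function onset_day is hypothetical, the chlorophyll-a curve is synthetic, and the onset is taken as the first 8-day period whose value is at or above the threshold, without interpolating between periods.

```r
days <- seq(1, 361, by = 8)                        # start day of each 8-day period
chl  <- 0.3 + 1.6 * exp(-((days - 130) / 25)^2)    # synthetic climatology, mg/m3

# Return the first day on which chlorophyll-a reaches the threshold,
# or NA if the threshold is never reached.
onset_day <- function(day, chl, threshold = 1.0) {
  idx <- which(chl >= threshold)
  if (length(idx) == 0) return(NA_real_)
  day[idx[1]]
}

onset <- onset_day(days, chl)
```

Applied to the mean climatological curve of each bioregion, a function of this kind gives one onset day per bioregion, which can then be drawn as a dashed vertical line on the plot.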
Overall, this simple example shows how statistical methods implemented in R, combined with satellite data, can help to characterize vast oceanic areas and thus to better illustrate how ecosystems function and, possibly, how they respond to environmental changes.
D'Ortenzio F., Ribera d'Alcalà M. (2009). On the trophic regimes of the Mediterranean Sea: a satellite analysis. Biogeosciences, 6, 139-148.
Lacour L., Claustre H., Prieur L., D'Ortenzio F. (2015). Phytoplankton biomass cycles in the North Atlantic subpolar gyre: A similar mechanism for two different blooms in the Labrador Sea. Geophysical Research Letters, 42.