2. Exploring ship echosounder data from the Pacific Hake survey#

Jupyter notebook accompanying the manuscript:

Echopype: A Python library for interoperable and scalable processing of ocean sonar data for biological information
Authors: Wu-Jung Lee, Emilio Mayorga, Landung Setiawan, Kavin Nguyen, Imran Majeed, Valentina Staneva

2.1. Introduction#

2.1.1. Goals#

Illustrate a common workflow for echosounder data conversion, calibration and use. This workflow leverages the standardization applied by echopype and the power, ease of use and familiarity of libraries in the scientific Python ecosystem.
Extract and visualize data with relative ease using geospatial and temporal filters.

2.1.2. Description#

This notebook uses EK60 echosounder data collected during the 2017 Joint U.S.-Canada Integrated Ecosystem and Pacific Hake Acoustic Trawl Survey (‘Pacific Hake Survey’) to illustrate a common workflow for data conversion, calibration and analysis using echopype and core scientific Python software packages, particularly xarray, GeoPandas, pandas and NumPy.

Two days of cloud-hosted .raw data files are accessed by echopype directly from an Amazon Web Services (AWS) S3 “bucket” maintained by the NOAA NCEI Water-Column Sonar Data Archive. The total data used are 170 .raw files at approximately 25 MB each (1 Hz pinging rate from first light to dusk), corresponding to approximately 4.2 GB. With echopype, each file is converted to a standardized representation based on the SONAR-netCDF4 v1.0 convention and saved to the cloud-optimized Zarr format.

Data stored in the netCDF-based SONAR-netCDF4 convention can be conveniently and intuitively manipulated with xarray in combination with related scientific Python packages. Mean Volume Backscattering Strength (MVBS) is computed with echopype from each raw data file and exported to a netCDF file. Here, we define two geographical bounding boxes encompassing two ship tracks and use these to extract corresponding timestamp intervals from the GPS data, and then the corresponding MVBS data based on those intervals. Finally, these extracted MVBS subsets are plotted as track echograms.

2.1.3. Outline#

Establish AWS S3 file system connection and generate list of target EK60 .raw files
Process S3-hosted raw files with echopype: convert, calibrate and export to standardized files
Extract and process GPS locations from the Platform group of converted raw files
Read MVBS and plot track echograms for time periods corresponding to two ship tracks

2.1.4. Running the notebook#

This notebook can be run with a conda environment created using the conda environment file https://github.com/OSOceanAcoustics/echopype-examples/blob/main/binder/environment.yml. The notebook creates two directories, if not already present: ./exports/hakesurvey_convertedzarr and ./exports/hakesurvey_calibratednc. netCDF and Zarr files will be exported there.

2.1.5. Note#

We encourage importing echopype as ep for consistency.

from pathlib import Path

import fsspec
import numpy as np
import geopandas as gpd
import xarray as xr

import matplotlib.pyplot as plt
from shapely.geometry import box
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

import echopype as ep
from echopype.qc import exist_reversed_time

import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

2.2. Establish AWS S3 file system connection and generate list of target EK60 `.raw` files#

Access and inspect the publicly accessible NCEI WCSD S3 bucket on the AWS cloud as if it were a local file system. This will be done through the Python fsspec file system and bytes storage interface. We will use fsspec.filesystem.glob (fs.glob) to generate a list of all EK60 .raw data files in the bucket, then filter on file names for target dates of interest.

The directory path on the ncei-wcsd-archive S3 bucket is s3://ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/. All .raw files from the 2017 Hake survey cruise are found here.

fs = fsspec.filesystem('s3', anon=True)

bucket = "ncei-wcsd-archive"
rawdirpath = "data/raw/Bell_M._Shimada/SH1707/EK60"

s3rawfiles = fs.glob(f"{bucket}/{rawdirpath}/*.raw")

# print out the last two S3 raw file paths in the list
s3rawfiles[-2:]

['ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/Summer2017-D20170913-T180733.raw',
 'ncei-wcsd-archive/data/raw/Bell_M._Shimada/SH1707/EK60/Winter2017-D20170615-T002629.raw']

Generate list of target EK60 .raw files from AWS S3 bucket based on dates. The dates are found in the middle string token (e.g., “D20170913”). Select files from 2 days, 2017-07-28 and 2017-07-29.

s3rawfiles = [
    s3path for s3path in s3rawfiles 
    if any([f"D2017{datestr}" in s3path for datestr in ['0728', '0729']])
]

print(f"There are {len(s3rawfiles)} target raw files available")

There are 170 target raw files available

2.4. Extract and process GPS locations from the `Platform` group of converted raw files#

Use xarray.open_mfdataset to open the Platform group from all the converted raw netcdf files as a single concatenated (combined) xarray dataset. Then extract GPS time1 (time stamp), latitude and longitude from this group and transform that data into a GeoPandas GeoDataFrame containing point-geometry objects that are readily manipulated via geospatial operations. A GeoDataFrame adds geospatial capabilities to a Pandas DataFrame.

Due to the presence of multiple time coordinates in this group, care must be taken in defining how the concatenation (combine) operation is to be performed. This is captured in the arguments passed to open_mfdataset.

%%time
platform_ds = xr.open_mfdataset(
    str(converted_dpath / '*.zarr'), group='Platform', 
    engine='zarr',
    data_vars='minimal', coords='minimal',
    combine='nested'
)

CPU times: user 32.5 s, sys: 1.2 s, total: 33.7 s
Wall time: 34.6 s

platform_ds

<xarray.Dataset> Size: 11MB
Dimensions:              (channel: 3, time1: 244846, time2: 88959)
Coordinates:
  * channel              (channel) <U37 444B 'GPT  18 kHz 009072058c8d 1-1 ES...
  * time1                (time1) datetime64[ns] 2MB 2017-07-28T00:05:36.10331...
  * time2                (time2) datetime64[ns] 712kB 2017-07-28T00:05:34.897...
Data variables: (12/20)
    MRU_offset_x         float64 8B nan
    MRU_offset_y         float64 8B nan
    MRU_offset_z         float64 8B nan
    MRU_rotation_x       float64 8B nan
    MRU_rotation_y       float64 8B nan
    MRU_rotation_z       float64 8B nan
    ...                   ...
    sentence_type        (time1) object 2MB dask.array<chunksize=(244846,), meta=np.ndarray>
    transducer_offset_x  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    transducer_offset_y  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    transducer_offset_z  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
    vertical_offset      (time2) float64 712kB dask.array<chunksize=(88959,), meta=np.ndarray>
    water_level          float64 8B 9.15
Attributes:
    platform_code_ICES:  315
    platform_name:       Bell M. Shimada
    platform_type:       Research vessel

We can use time1 (the timestamps for NMEA datagrams, or GPS timestamps) to examine the exact timestamp interval spanned by the combined dataset.

print(f"{platform_ds.time1.values.min()}, {platform_ds.time1.values.max()}")

2017-07-28T00:05:36.103315000, 2017-07-30T00:17:59.381813000

To create the GeoPandas GeoDataFrame, first transform the latitude and longitude arrays to a single Pandas DataFrame that retains the time1 coordinate as a common index. This is done by using the DataFrame to_dataframe method together with a Pandas join operation. Then, a point GeoDataFrame is created from this DataFrame.

gps_df = platform_ds.latitude.to_dataframe().join(platform_ds.longitude.to_dataframe())

gps_df.head(3)

	latitude	longitude
time1
2017-07-28 00:05:36.103315	43.533072	-124.683998
2017-07-28 00:05:37.511097	43.533080	-124.684005
2017-07-28 00:05:37.669428	43.533167	-124.684000

gps_gdf = gpd.GeoDataFrame(
    gps_df,
    geometry=gpd.points_from_xy(gps_df['longitude'], gps_df['latitude']), 
    crs="epsg:4326"
)

A simple, easily generated map plot of the point GeoDataFrame

gps_gdf.plot(markersize=2)

<Axes: >

_images/ms_PacificHake_EK60_cruisetracks_29_1.png

2.5. Read MVBS and plot track echograms for time periods corresponding to two ship tracks#

2.5.1. Read MVBS as a concatenated dataset#

Use xarray.open_mfdataset again to read and concatenate (combine) data files into a single xarray Dataset. This time, we’re reading the MVBS netCDF files.

%%time
MVBS_ds = xr.open_mfdataset(
    str(calibrated_dpath / 'MVBS_*.nc'), 
    data_vars='minimal', coords='minimal',
    combine='by_coords'
)

CPU times: user 2.14 s, sys: 180 ms, total: 2.32 s
Wall time: 3.16 s

MVBS_ds

<xarray.Dataset> Size: 32MB
Dimensions:            (channel: 3, ping_time: 8819, echo_range: 150)
Coordinates:
  * ping_time          (ping_time) datetime64[ns] 71kB 2017-07-28T00:05:20 .....
  * channel            (channel) <U37 444B 'GPT  18 kHz 009072058c8d 1-1 ES18...
  * echo_range         (echo_range) float64 1kB 0.0 5.0 10.0 ... 740.0 745.0
Data variables:
    Sv                 (channel, ping_time, echo_range) float64 32MB dask.array<chunksize=(3, 60, 150), meta=np.ndarray>
    water_level        float64 8B 9.15
    frequency_nominal  (channel) float64 24B dask.array<chunksize=(3,), meta=np.ndarray>
Attributes:
    processing_software_name:     echopype
    processing_software_version:  0.8.4
    processing_time:              2024-04-25T17:58:23Z
    processing_function:          commongrid.compute_MVBS

Replace the channel dimension and coordinate with the frequency_nominal variable containing actual frequency values. Note that this step is possible only because there are no duplicated frequencies present.

MVBS_ds = ep.consolidate.swap_dims_channel_frequency(MVBS_ds)

MVBS_ds

<xarray.Dataset> Size: 32MB
Dimensions:            (frequency_nominal: 3, ping_time: 8819, echo_range: 150)
Coordinates:
  * ping_time          (ping_time) datetime64[ns] 71kB 2017-07-28T00:05:20 .....
  * echo_range         (echo_range) float64 1kB 0.0 5.0 10.0 ... 740.0 745.0
  * frequency_nominal  (frequency_nominal) float64 24B 1.8e+04 3.8e+04 1.2e+05
Data variables:
    Sv                 (frequency_nominal, ping_time, echo_range) float64 32MB dask.array<chunksize=(3, 60, 150), meta=np.ndarray>
    channel            (frequency_nominal) <U37 444B 'GPT  18 kHz 009072058c8...
    water_level        float64 8B 9.15
Attributes:
    processing_software_name:     echopype
    processing_software_version:  0.8.4
    processing_time:              2024-04-25T17:58:23Z
    processing_function:          commongrid.compute_MVBS

2.5.2. Extract MVBS along two N-S tracks selected via geographical bounding boxes#

Define rectangular bounding boxes around two tracks oriented North-South, then plot a reference map showing all the GPS points (in red), the two bounding boxes (black), and the OOI mooring location (CE04 Oregon Offshore, yellow star) examined in the accompanying Jupyter notebook.

tracksouth_bbox = gpd.GeoSeries(box(-125.17, 43.65, -125.14, 43.84), crs=gps_gdf.crs)
tracknorth_bbox = gpd.GeoSeries(box(-125.17, 43.98, -125.14, 44.17), crs=gps_gdf.crs)

basemap = cimgt.OSM()

_, ax = plt.subplots(
    figsize=(7, 7), subplot_kw={"projection": basemap.crs}
)
bnd = gps_gdf.geometry.bounds
ax.set_extent([bnd.minx.min() - 1, bnd.maxx.max() + 2, 
               bnd.miny.min() - 0.8, bnd.maxy.max() + 2.2])
ax.add_image(basemap, 7)
ax.gridlines(draw_labels=True, xformatter=LONGITUDE_FORMATTER, yformatter=LATITUDE_FORMATTER)

# GPS points
gps_gdf.plot(ax=ax, markersize=0.1, color='red', 
             transform=ccrs.PlateCarree())

# Bounding box for selected tracks
tracksouth_bbox.plot(ax=ax, edgecolor="black", linewidth=1.2, facecolor='none', 
                     transform=ccrs.PlateCarree())
tracknorth_bbox.plot(ax=ax, edgecolor="black", linewidth=1.2, facecolor='none', 
                     transform=ccrs.PlateCarree())

# OOI CE04 Oregon Offshore mooring location
plt.plot(-124.95, 44.37, marker='*', color='yellow', markersize=13, 
         transform=ccrs.PlateCarree());

_images/ms_PacificHake_EK60_cruisetracks_40_0.png

Clip the GPS locations GeoPandas GeoDataFrame gps_gdf generated from the Platform group with the two rectangular regions, tracksouth_bbox and tracknorth_bbox. Extract from those clipped GeoDataFrames (tracksouth_gps_gdf and tracknorth_gps_gdf) the minimum and maximum time1 timestamps for each track and use them to select the corresponding time span in the MVBS data.

Finally, create a MVBS Dataset subset for each track (tracksouth_MVBS_ds and tracknorth_MVBS_ds) by taking advantage of xarray’s “label”-based selection capability via the sel method. We select MVBS_ds data with ping_time values within the timestamp interval “slice” extracted from the geographical track. As we can see above, ping_time is a coordinate variable in the MVBS_ds Dataset.

tracksouth_gps_gdf = gpd.clip(gps_gdf, tracksouth_bbox)
tracksouth_MVBS_ds = MVBS_ds.sel(
    ping_time=slice(tracksouth_gps_gdf.index.min(), tracksouth_gps_gdf.index.max())
)

tracknorth_gps_gdf = gpd.clip(gps_gdf, tracknorth_bbox)
tracknorth_MVBS_ds = MVBS_ds.sel(
    ping_time=slice(tracknorth_gps_gdf.index.min(), tracknorth_gps_gdf.index.max())
)

2.5.3. Plot MVBS echograms for the two N-S tracks, for all 3 frequencies#

The final step is to create echogram plots (range vs ping_time) for each echosounder frequency and each of the two selected ship tracks. That’s 6 subplots. We first define two functions to simplify the task. plot_echograms plots the echograms of the 3 frequencies as a column of subplots, extracting the data for each frequency by using xarray’s isel index selector method, which uses index counts rather than values. As we can see above, frequency_nominal is a coordinate of the Sv (MVBS) DataArray.

def track_interval_str(trackdt):
    """ Create the timestamp interval title string for a plot column
    """
    track_interval_title_str = (
        f"{trackdt.index.min().strftime('%b-%d %H:%MZ')}"
        f" to {trackdt.index.max().strftime('%b-%d %H:%MZ')}"
    )
    return track_interval_title_str

def plot_echograms(ds, freq_len, column_idx):
    """Plot echograms of the 3 frequencies for xarray dataset ds,
       as a column of subplots"""
    for f in range(freq_len):
        ax = axes[f][column_idx]
        # Select Sv data by frequency using the frequency_nominal coordinate index "f",
        # then plot the echogram of the selected data
        ds.Sv.isel(frequency_nominal=f).plot(
            ax=ax, 
            x='ping_time',
            y='echo_range',
            yincrease=False,
            vmin=-80,
            vmax=-50,
        )
        if f < 2:
            ax.set_xlabel(None);

freq_len = len(MVBS_ds.frequency_nominal)

fig, axes = plt.subplots(nrows=freq_len, ncols=2, constrained_layout=True, figsize=(16, 16))

fig.suptitle(
    (f"    South Track, {track_interval_str(tracksouth_gps_gdf)}"
     "                                         "
     f"North Track, {track_interval_str(tracknorth_gps_gdf)}"),
    fontsize=16)

plot_echograms(tracksouth_MVBS_ds, freq_len, column_idx=0) # left column
plot_echograms(tracknorth_MVBS_ds, freq_len, column_idx=1) # right column

_images/ms_PacificHake_EK60_cruisetracks_45_0.png

2.6. Package versions#

import datetime, s3fs
print(f"echopype: {ep.__version__}, xarray: {xr.__version__}, geopandas: {gpd.__version__}, "
      f"fsspec: {fsspec.__version__}, s3fs: {s3fs.__version__}")

print(f"\n{datetime.datetime.utcnow()} +00:00")

echopype: 0.8.4, xarray: 2024.3.0, geopandas: 0.14.3, fsspec: 2024.3.1, s3fs: 2024.3.1

2024-04-25 18:08:15.744108 +00:00

Echopype Examples

Exploring ship echosounder data from the Pacific Hake survey

Contents

2. Exploring ship echosounder data from the Pacific Hake survey#

2.1. Introduction#

2.1.1. Goals#

2.1.2. Description#

2.1.3. Outline#

2.1.4. Running the notebook#

2.1.5. Note#

2.2. Establish AWS S3 file system connection and generate list of target EK60 `.raw` files#

2.3. Process S3-hosted raw files with `echopype`: convert, calibrate and export to standardized files#

2.3.1. echopype processing#

2.3.2. Test for time reversals#

2.3.3. Examine the EchoData object for one of the data files#

2.4. Extract and process GPS locations from the `Platform` group of converted raw files#

2.5. Read MVBS and plot track echograms for time periods corresponding to two ship tracks#

2.5.1. Read MVBS as a concatenated dataset#

2.5.2. Extract MVBS along two N-S tracks selected via geographical bounding boxes#

2.5.3. Plot MVBS echograms for the two N-S tracks, for all 3 frequencies#

2.6. Package versions#

Echopype Examples

Exploring ship echosounder data from the Pacific Hake survey

Contents

2. Exploring ship echosounder data from the Pacific Hake survey#

2.1. Introduction#

2.1.1. Goals#

2.1.2. Description#

2.1.3. Outline#

2.1.4. Running the notebook#

2.1.5. Note#

2.2. Establish AWS S3 file system connection and generate list of target EK60 .raw files#

2.3. Process S3-hosted raw files with echopype: convert, calibrate and export to standardized files#

2.3.1. echopype processing#

2.3.2. Test for time reversals#

2.3.3. Examine the EchoData object for one of the data files#

2.4. Extract and process GPS locations from the Platform group of converted raw files#

2.5. Read MVBS and plot track echograms for time periods corresponding to two ship tracks#

2.5.1. Read MVBS as a concatenated dataset#

2.5.2. Extract MVBS along two N-S tracks selected via geographical bounding boxes#

2.5.3. Plot MVBS echograms for the two N-S tracks, for all 3 frequencies#

2.6. Package versions#

2.2. Establish AWS S3 file system connection and generate list of target EK60 `.raw` files#

2.3. Process S3-hosted raw files with `echopype`: convert, calibrate and export to standardized files#

2.4. Extract and process GPS locations from the `Platform` group of converted raw files#