Learn to explore, search, and use marine data from the EDITO Data Lake
Presented by Samuel Fooks (VLIZ)
For all the PDFs and code, check out the workshop GitHub repository
EDITO stands for the European Digital Twin of the Ocean.
It is a European infrastructure for exploring, searching, and using marine data.
The data available in the EU DTO consists of a STAC (SpatioTemporal Asset Catalog) as well as data storage on S3 buckets.
EDITO Data Lake
EDITO Data Lake uses modern cloud storage solutions to host public datasets. These datasets are stored in cloud-optimized formats (Zarr, Parquet/GeoParquet) on S3 object storage.
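As a first taste of the data lake, here is a minimal sketch of listing one public bucket with s3fs. The endpoint and bucket path are borrowed from the EurOBIS example used later in this workshop; other datasets live under different paths.

import s3fs

# Public data: anonymous access, no credentials needed
fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.waw3-1.cloudferro.com"},
)

# List the EMODnet biology folder holding the EurOBIS occurrence export
for path in fs.ls("emodnet/biology/eurobis_occurrence_data"):
    print(path)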
Explore: 38 million occurrence records
EDITO offers a standardized STAC (SpatioTemporal Asset Catalog) built on CMEMS and EMODnet data, designed to integrate diverse marine and environmental datasets.
A gateway to an interoperable ocean of FAIR data
STAC = SpatioTemporal Asset Catalog
A community standard for describing and searching spatiotemporal (geospatial + time) data.
Used across satellite, model, and in-situ data.
Learn more: stacspec.org
Catalogs – high-level groupings (e.g., "All CMEMS data")
Collections – thematic datasets (e.g., temperature, sea level)
Items – individual assets with time + space (e.g., a file for 2024-01-01)
Assets – actual data files: GeoTIFF, Zarr, Parquet...
Each has consistent metadata (bbox, datetime, etc.)
viewer.dive.edito.eu
We can follow the STAC structure down to the EurOBIS database exported as Parquet:
Catalog -> Catalog -> Collection -> Item
EMODnet -> Biodiversity -> Occurrence data -> EurOBIS database observations
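A minimal sketch of walking that tree with pystac_client, assuming the API root serves a browsable STAC catalog; printing the child ids lets you confirm the EMODnet -> Biodiversity path shown above.

import pystac_client

client = pystac_client.Client.open("https://api.dive.edito.eu/data/")

# Each child is itself a catalog or collection; recurse as needed
for child in client.get_children():
    print(child.id, "-", child.title)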
You can also browse the catalog in your browser at radiantearth.github.io/stac-browser
Base URL for STAC:
https://api.dive.edito.eu/data/
Docs: Interact with Data API
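Before reaching for any client library, you can sanity-check the API root with plain HTTP. This sketch assumes the endpoint serves a standard STAC landing page, whose links advertise the collections and search endpoints.

import requests

resp = requests.get("https://api.dive.edito.eu/data/")
resp.raise_for_status()
landing = resp.json()

print(landing.get("title"))
for link in landing.get("links", []):
    print(link.get("rel"), "->", link.get("href"))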
ARCO = Analysis Ready Cloud Optimized
EDITO adopts modern cloud-friendly formats: Zarr for multidimensional arrays and Parquet/GeoParquet for tabular data.
Let's explore each format!
Zarr is used for chunked N-dimensional arrays (like NetCDF but cloud-native)
Ideal for model outputs, time series, and climate reanalyses.
Works well with xarray, kerchunk, and zarr-python.
zarr.readthedocs.io
import zarr  # Zarr backend; xarray uses it under the hood
import xarray as xr

# Open a consolidated Zarr store directly from object storage
ds = xr.open_zarr("https://s3...zarr/", consolidated=True)
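Because Zarr is chunked, you only download the chunks you touch. Here is a sketch of lazy subsetting on the ds opened above; the variable name "thetao" and the coordinate names are hypothetical, so inspect the store with print(ds) first.

# Hypothetical variable/coordinate names - check print(ds) for the real ones
subset = ds["thetao"].sel(
    time="2024-01-01",          # one time slice
    latitude=slice(50, 60),     # a latitude band
    longitude=slice(-5, 10),    # a longitude band
)
values = subset.load()  # only now are the needed chunks fetched
print(values)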
Parquet = columnar tabular format, very efficient
GeoParquet = Parquet + geospatial metadata
Good for point observations, events, tracks, etc.
Efficient for large queries and spatial joins.
Docs: parquet.apache.org and geoparquet.org
import pyarrow.dataset as ds
import s3fs

# Anonymous access to the public bucket
fs = s3fs.S3FileSystem(anon=True)

# Lazily open the Parquet dataset, then materialise it as a pandas DataFrame
dataset = ds.dataset("s3://...your-parquet-folder...", filesystem=fs, format="parquet")
df = dataset.to_table().to_pandas()
print(df.head())
Let's go read that Parquet file: https://s3.waw3-1.cloudferro.com/emodnet/biology/eurobis_occurrence_data/eurobis_occurrences_geoparquet_2024-10-01.parquet
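Because Parquet is columnar, you can pull just a few columns out of this large file without downloading the rest. Here is a sketch using the same pyarrow/s3fs pattern as above; the column name "scientificname" is an assumption, so check dataset.schema.names first.

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem(
    anon=True,
    client_kwargs={"endpoint_url": "https://s3.waw3-1.cloudferro.com"},
)
dataset = ds.dataset(
    "emodnet/biology/eurobis_occurrence_data/"
    "eurobis_occurrences_geoparquet_2024-10-01.parquet",
    filesystem=fs,
    format="parquet",
)

print(dataset.schema.names)  # confirm the real column names first

# Columnar read: only the requested column is fetched over the network
table = dataset.to_table(columns=["scientificname"])  # assumed column name
print(table.num_rows)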
Or use a pre-configured service on EDITO: explore_data/view_parquet
import pystac_client

# Open the STAC API at its root endpoint (not /collections)
url = "https://api.dive.edito.eu/data/"
editocollections = pystac_client.Client.open(url)

collections = list(editocollections.get_collections())
print("Found collections:", len(collections))

for col in collections[:5]:
    print(col.id, ":", col.title)
    for item in col.get_items():
        print(item.properties.get("title"))
        print(item.assets)
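If the API implements STAC item search, pystac_client can also filter items by space and time instead of looping over everything. A sketch; the collection id below is a placeholder, so substitute one printed by the loop above.

# Placeholder collection id - use one printed by the loop above
search = editocollections.search(
    collections=["some-collection-id"],
    bbox=[-5, 50, 10, 60],              # west, south, east, north (lon/lat)
    datetime="2024-01-01/2024-12-31",
    max_items=10,
)
for item in search.items():
    print(item.id, item.datetime)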
library(rstac)

stac_endpoint <- "https://api.dive.edito.eu/data/"
collections <- stac(stac_endpoint) %>%
  rstac::collections() %>%
  get_request()
length(collections$collections)  # how many collections
R packages like arrow, sf, and terra also help with asset processing.
Understand the EDITO API and data stack.
Find and filter collections/items.
Read Parquet or Zarr data with Python or R.
Go explore: my-ocean.dive.edito.eu and viewer.dive.edito.eu
Questions? Reach us at: edito-infra-dev@mercator-ocean.eu
Docs: Interact with EDITO Data
Happy exploring!