Using EDITO Datalab - Complete Tutorial

Learn how to use EDITO Datalab for marine data analysis with this comprehensive guide! From finding services to advanced data processing and storage, this tutorial covers everything you need to know.

🚀 Quick Start

  1. Follow the presentation: Using the Datalab
  2. Explore the examples: Check out the various data viewers and scripts
  3. Start analyzing: Use the provided templates and examples

🎯 What You'll Learn

  • Navigate to EDITO Datalab and find services
  • Configure RStudio, Jupyter, or VSCode services
  • Search the STAC catalog for marine data collections
  • Process Parquet files (biodiversity data) and Zarr data (oceanographic data)
  • Combine different data types spatially
  • Save and manage data using personal storage
  • Work with both R and Python environments

🚀 Quick Start (15 minutes)

Step 1: Find Services

  1. Go to datalab.dive.edito.eu
  2. Browse the service catalog
  3. Choose your preferred environment:
       • RStudio for R analysis and visualization
       • Jupyter for Python notebooks and machine learning
       • VSCode for mixed R/Python projects and development

Step 2: Configure Service

  • Select CPU and memory resources that fit your workload (2-8 cores, 4-16 GB RAM)
  • Choose your preferred environment
  • Launch the service (credentials are automatically configured)

Step 3: Run Analysis

  • Use the provided scripts to get started
  • Search STAC catalog for data collections
  • Process Parquet and Zarr files
  • Combine and analyze marine data
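
Searching the catalog and processing files meet at the asset level: each STAC item lists assets, and the workflow needs to know which ones are Parquet and which are Zarr. The sketch below is one way to sort them, assuming assets follow the STAC convention of an `href` plus an optional `type` media type. The function names and the matching rules (file extension or media-type substring) are illustrative, not taken from the tutorial scripts.

```python
from urllib.parse import urlparse


def asset_format(asset):
    """Guess 'parquet', 'zarr', or None from a STAC asset dict."""
    media_type = (asset.get("type") or "").lower()
    # Strip a trailing slash so "store.zarr/" still matches ".zarr".
    path = urlparse(asset.get("href") or "").path.lower().rstrip("/")
    if "parquet" in media_type or path.endswith(".parquet"):
        return "parquet"
    if "zarr" in media_type or path.endswith(".zarr"):
        return "zarr"
    return None


def group_assets(item):
    """Map each recognised format to the list of asset hrefs in one STAC item."""
    groups = {"parquet": [], "zarr": []}
    for asset in item.get("assets", {}).values():
        fmt = asset_format(asset)
        if fmt is not None:
            # STAC requires every asset to carry an href.
            groups[fmt].append(asset["href"])
    return groups
```

Assets that match neither rule are skipped, so the result only contains files the later Parquet and Zarr steps can open.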

📁 Complete Workflow Scripts

R Scripts (Ready to Run)

  • r/01_stac_search.R - Search the EDITO STAC catalog for marine data collections
      • Connects to the STAC API and lists available collections
      • Filters for biodiversity-related data
      • Shows data access URLs and formats
  • r/02_read_parquet.R - Read and process biodiversity data from Parquet files
      • Direct access to EUROBIS marine species data
      • Data filtering and basic analysis
  • r/03_personal_storage.R - Complete personal storage workflow
      • Connect to EDITO storage (credentials auto-configured)
      • Upload/download data to/from personal storage
      • Process and save marine data in multiple formats

Python Scripts (Interactive Workflow)

  • python/01_get_stac_collections.py - Get and explore STAC collections
      • Interactive collection discovery
      • Search functionality for specific data types
      • Saves collections metadata for next steps
  • python/02_search_stac_assets.py - Search for specific data assets
      • Filter collections by keywords (biodiversity, ocean, etc.)
      • Find Parquet and Zarr data assets
      • Interactive asset selection
  • python/03_get_zarr_to_df.py - Process oceanographic Zarr data
      • Convert Zarr arrays to DataFrames
      • Handle large datasets with smart sampling
      • Spatial data processing
  • python/04_get_parquet_data.py - Process biodiversity Parquet data
      • Read Parquet files from S3
      • Data exploration and filtering
      • Schema analysis and sample extraction
  • python/05_combine_and_save.py - Complete data combination workflow
      • Select and combine Parquet + Zarr datasets
      • Spatial data integration
      • Save to local files and personal storage
      • Metadata generation and tracking
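
To make "spatial data integration" concrete, here is a minimal sketch of one common approach: snapping each occurrence point to the nearest cell of a regular longitude/latitude grid and attaching that cell's value. It is not the method used in python/05_combine_and_save.py, which this page does not describe. The function names and the plain-list data layout are assumptions chosen to keep the example free of dependencies.

```python
from bisect import bisect_left


def nearest_index(axis, value):
    """Index of the coordinate in `axis` closest to `value`.

    `axis` must be sorted in ascending order. Values outside the axis
    range are clamped to the first or last index. Ties go to the lower index.
    """
    if len(axis) == 0:
        raise ValueError("axis is empty")
    pos = bisect_left(axis, value)
    if pos == 0:
        return 0
    if pos == len(axis):
        return len(axis) - 1
    before, after = axis[pos - 1], axis[pos]
    return pos if (after - value) < (value - before) else pos - 1


def attach_grid_values(points, lons, lats, grid):
    """Attach the nearest grid value to each (lon, lat) point.

    `grid[i][j]` holds the value at latitude `lats[i]`, longitude `lons[j]`.
    """
    combined = []
    for lon, lat in points:
        i = nearest_index(lats, lat)
        j = nearest_index(lons, lon)
        combined.append({"lon": lon, "lat": lat, "value": grid[i][j]})
    return combined
```

Because out-of-range points are clamped rather than dropped, a real analysis would likely filter occurrences to the grid extent first.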

Additional Tools

  • python/check_credentials.py - Verify storage credentials
  • python/run_full_demo.py - Run complete workflow automatically

🛠️ Services Available

RStudio Service

Perfect for:

  • Statistical analysis and spatial data processing
  • Data visualization and reporting
  • R-based marine research workflows
  • Quick data exploration and analysis

Getting Started:

  1. Launch the RStudio service in EDITO Datalab
  2. Run r/01_stac_search.R to discover data collections
  3. Use r/02_read_parquet.R to process biodiversity data
  4. Try r/03_personal_storage.R for data management

Jupyter Service

Perfect for:

  • Machine learning and data science
  • Interactive data exploration
  • Python-based analysis and visualization
  • Notebook-based research workflows

Getting Started:

  1. Launch the Jupyter service in EDITO Datalab
  2. Run the Python scripts in sequence:
       • python/01_get_stac_collections.py
       • python/02_search_stac_assets.py
       • python/03_get_zarr_to_df.py
       • python/04_get_parquet_data.py
       • python/05_combine_and_save.py

VSCode Service

Perfect for:

  • Mixed R/Python projects
  • Large codebases and development
  • Collaborative research
  • Advanced data processing workflows

Getting Started:

  1. Launch the VSCode service in EDITO Datalab
  2. Open the using_datalab folder
  3. Run either R or Python scripts as needed
  4. Use the integrated terminal for command-line tools

📊 Data Formats & Sources

STAC (SpatioTemporal Asset Catalog)

  • Purpose: Find and discover marine datasets
  • API: https://api.dive.edito.eu/data/
  • Use: Search for available data collections
  • Scripts: 01_stac_search.R, 01_get_stac_collections.py
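
As a rough illustration of collection discovery, the sketch below lists collections from the API above using only the Python standard library. It assumes the service exposes the standard STAC API `/collections` endpoint under that base URL. The function names and the keyword matching are illustrative and are not taken from the tutorial scripts.

```python
import json
import urllib.request

# Base URL from this tutorial. The /collections path is the standard
# STAC API endpoint and is assumed (not confirmed here) to be available.
STAC_API = "https://api.dive.edito.eu/data"


def fetch_collections(base_url=STAC_API, timeout=30):
    """Download the raw /collections response and return it as a dict."""
    url = base_url.rstrip("/") + "/collections"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize_collections(payload, keyword=None):
    """Return (id, title) pairs, optionally keeping only keyword matches.

    The keyword is matched case-insensitively against id, title, and
    description.
    """
    rows = []
    for collection in payload.get("collections", []):
        cid = collection.get("id", "")
        title = collection.get("title") or ""
        description = collection.get("description") or ""
        text = " ".join([cid, title, description]).lower()
        if keyword is None or keyword.lower() in text:
            rows.append((cid, title))
    return rows
```

Inside a running service, `summarize_collections(fetch_collections(), keyword="biodiversity")` would return the matching collections. Large catalogs may paginate the response, which this sketch does not handle.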

Parquet (Biodiversity Data)

  • Purpose: Efficient tabular data storage for occurrence records
  • Use: Marine species observations, biodiversity data
  • Example: EUROBIS marine species occurrence data
  • Scripts: 02_read_parquet.R, 04_get_parquet_data.py
  • Features: Fast querying, columnar storage, schema evolution
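
Columnar storage pays off when you read only the columns and rows you need. The sketch below pushes a bounding-box filter down to the Parquet reader, assuming pandas with the pyarrow engine is installed in the service. The default column names `longitude` and `latitude` are placeholders, so check the actual schema of the file first. The Parquet URL is left as a parameter because this page does not give one.

```python
def bbox_filters(min_lon, min_lat, max_lon, max_lat,
                 lon_col="longitude", lat_col="latitude"):
    """Build pyarrow-style filter tuples (all ANDed) for a bounding box.

    The default column names are hypothetical placeholders.
    """
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("bounding box minimum exceeds maximum")
    return [
        (lon_col, ">=", min_lon),
        (lon_col, "<=", max_lon),
        (lat_col, ">=", min_lat),
        (lat_col, "<=", max_lat),
    ]


def read_occurrences(parquet_url, bbox, columns=None):
    """Read only the rows inside `bbox` (and optionally a column subset)."""
    # Imported here so the filter helper works without pandas installed.
    import pandas as pd

    return pd.read_parquet(
        parquet_url,
        engine="pyarrow",
        columns=columns,
        filters=bbox_filters(*bbox),
    )
```

A call such as `read_occurrences(url, (-5.0, 48.0, 10.0, 60.0))` would then return a DataFrame restricted to that box.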

Zarr (Oceanographic Data)

  • Purpose: Cloud-optimized array data for large datasets
  • Use: Oceanographic data, climate reanalyses, satellite data
  • Tools: xarray, zarr-python
  • Scripts: 03_get_zarr_to_df.py
  • Features: Chunked storage, parallel access, compression
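
Chunked storage means a Zarr store can be opened lazily and thinned before anything is loaded into memory. The sketch below shows one simple way to do that with xarray: pick a uniform stride so the subsampled array stays under a point budget, then convert it to a DataFrame. This uniform stride is a stand-in for illustration and is not necessarily the sampling used in 03_get_zarr_to_df.py. The function names, the `max_points` default, and the store URL and variable name passed in are all assumptions.

```python
import math


def sampling_stride(shape, max_points):
    """Smallest uniform stride that keeps at most `max_points` cells."""
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    stride = 1
    # Taking every `stride`-th element along an axis of length n keeps
    # ceil(n / stride) elements.
    while math.prod(math.ceil(n / stride) for n in shape) > max_points:
        stride += 1
    return stride


def zarr_to_dataframe(zarr_url, variable, max_points=1_000_000):
    """Open a Zarr store lazily, subsample one variable, return a DataFrame."""
    # Imported here so the stride helper works without xarray installed.
    import xarray as xr

    ds = xr.open_zarr(zarr_url)
    da = ds[variable]
    stride = sampling_stride(da.shape, max_points)
    thinned = da.isel({dim: slice(None, None, stride) for dim in da.dims})
    return thinned.to_dataframe().reset_index()
```

Subsetting by region or time with `sel` before thinning usually preserves more detail than striding the whole array.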

Personal Storage (MyFiles)

  • Purpose: Your personal cloud storage for data and results
  • Access: Automatically configured in EDITO services
  • Use: Save processed data, share results, backup analysis
  • Scripts: 03_personal_storage.R, 05_combine_and_save.py
  • Formats: CSV, Parquet, JSON, any file type
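
Since credentials are configured automatically, a script mostly needs to read them and hand them to an S3 client. The sketch below assumes they are exposed through the environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, and `AWS_S3_ENDPOINT`. That naming is an assumption about the platform, so verify it in your own service. The function names are illustrative, and the bucket and key are supplied by the caller.

```python
import os

# Assumed variable names; confirm them in your running service.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_ENDPOINT")


def s3_settings(env=None):
    """Translate environment variables into boto3 client keyword arguments."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing storage credentials: " + ", ".join(missing))
    endpoint = env["AWS_S3_ENDPOINT"]
    if "://" not in endpoint:
        # boto3 expects a full URL, so add a scheme when only a host is given.
        endpoint = "https://" + endpoint
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        "aws_session_token": env.get("AWS_SESSION_TOKEN") or None,
    }


def upload_file(local_path, bucket, key, env=None):
    """Upload one local file to personal storage."""
    # Imported here so the credential check works without boto3 installed.
    import boto3

    client = boto3.client("s3", **s3_settings(env))
    client.upload_file(local_path, bucket, key)
```

Calling `s3_settings()` on its own is a quick credential check: it raises a clear error naming any variable that is not set.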

🎥 Video Examples

The tutorial includes video demonstrations of:

  • Service Configuration: RStudio, Jupyter, and VSCode setup
  • STAC Search: Finding and exploring marine data collections
  • Data Processing: Working with Parquet and Zarr data
  • Personal Storage: Uploading and managing data in MyFiles
  • Complete Workflow: End-to-end data analysis pipeline

🚀 Getting Started

Option 1: Quick Start (15 minutes)

  1. Launch a service at datalab.dive.edito.eu
  2. Run one script to get familiar with the workflow
  3. Explore the data and see what's available

Option 2: Complete Workflow (1 hour)

  1. Start with R: Run r/01_stac_search.R to discover data
  2. Process data: Use r/02_read_parquet.R for biodiversity data
  3. Manage storage: Try r/03_personal_storage.R for data management
  4. Advanced Python: Run the Python scripts in sequence for the full workflow

Option 3: Automated Demo

  1. Run the complete demo: python/run_full_demo.py
  2. Watch the process: See all steps automated
  3. Examine results: Check the output files and storage

📖 Additional Resources

🤝 Contributing

Found an issue or have suggestions? Please contribute to improve this workshop!


Ready to start? Go to datalab.dive.edito.eu and launch your first service! 🌊🐠