Using EDITO Datalab - Complete Tutorial
Learn how to use EDITO Datalab for marine data analysis. This tutorial covers finding and configuring services, searching the STAC catalog, processing Parquet and Zarr data, and saving results to personal storage.
🚀 Quick Start
- Follow the presentation: Using the Datalab
- Explore the examples: Check out the various data viewers and scripts
- Start analyzing: Use the provided templates and examples
🎯 What You'll Learn
- Navigate to EDITO Datalab and find services
- Configure RStudio, Jupyter, or VSCode services
- Search the STAC catalog for marine data collections
- Process Parquet files (biodiversity data) and Zarr data (oceanographic data)
- Combine different data types spatially
- Save and manage data using personal storage
- Work with both R and Python environments
🚀 Quick Start (15 minutes)
Step 1: Find Services
- Go to datalab.dive.edito.eu
- Browse the service catalog
- Choose your preferred environment:
  - RStudio for R analysis and visualization
  - Jupyter for Python notebooks and machine learning
  - VSCode for mixed R/Python projects and development
Step 2: Configure Service
- Select appropriate CPU/memory resources (2-8 cores, 4-16GB RAM)
- Choose your preferred environment
- Launch the service (credentials are automatically configured)
Step 3: Run Analysis
- Use the provided scripts to get started
- Search STAC catalog for data collections
- Process Parquet and Zarr files
- Combine and analyze marine data
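As an illustration of the step from STAC search to Parquet and Zarr processing, the sketch below sorts the assets of a STAC item by data format. It is not taken from the workshop scripts: the helper names `asset_format` and `group_assets` and the sample item are hypothetical, and the format is only guessed from the asset's media type and URL suffix.

```python
from urllib.parse import urlparse


def asset_format(asset):
    """Guess whether a STAC asset is Parquet or Zarr (returns None otherwise).

    Hypothetical helper: looks at the optional "type" (media type) field
    and at the path of the "href" URL.
    """
    media_type = (asset.get("type") or "").lower()
    path = urlparse(asset.get("href", "")).path.lower().rstrip("/")
    for fmt in ("parquet", "zarr"):
        if fmt in media_type or path.endswith("." + fmt):
            return fmt
    return None


def group_assets(item):
    """Map each recognised format to the list of asset URLs in a STAC item."""
    groups = {"parquet": [], "zarr": []}
    for asset in item.get("assets", {}).values():
        fmt = asset_format(asset)
        if fmt is not None:
            groups[fmt].append(asset["href"])
    return groups


# Made-up item for illustration only; real items come from the STAC API.
sample_item = {
    "id": "example-item",
    "assets": {
        "occurrences": {"href": "https://example.org/data/occurrences.parquet"},
        "temperature": {
            "href": "https://example.org/data/temperature.zarr/",
            "type": "application/vnd+zarr",
        },
        "thumbnail": {
            "href": "https://example.org/data/preview.png",
            "type": "image/png",
        },
    },
}

print(group_assets(sample_item))
```

Grouping by format first makes it easy to hand Parquet URLs to a tabular reader and Zarr URLs to an array reader.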
📁 Complete Workflow Scripts
R Scripts (Ready to Run)
- r/01_stac_search.R - Search EDITO STAC catalog for marine data collections
  - Connects to STAC API and lists available collections
  - Filters for biodiversity-related data
  - Shows data access URLs and formats
- r/02_read_parquet.R - Read and process biodiversity data from Parquet files
  - Direct access to EUROBIS marine species data
  - Data filtering and basic analysis
- r/03_personal_storage.R - Complete personal storage workflow
  - Connect to EDITO storage (credentials auto-configured)
  - Upload/download data to/from personal storage
  - Process and save marine data in multiple formats
Python Scripts (Interactive Workflow)
- python/01_get_stac_collections.py - Get and explore STAC collections
  - Interactive collection discovery
  - Search functionality for specific data types
  - Saves collections metadata for next steps
- python/02_search_stac_assets.py - Search for specific data assets
  - Filter collections by keywords (biodiversity, ocean, etc.)
  - Find Parquet and Zarr data assets
  - Interactive asset selection
- python/03_get_zarr_to_df.py - Process oceanographic Zarr data
  - Convert Zarr arrays to DataFrames
  - Handle large datasets with smart sampling
  - Spatial data processing
- python/04_get_parquet_data.py - Process biodiversity Parquet data
  - Read Parquet files from S3
  - Data exploration and filtering
  - Schema analysis and sample extraction
- python/05_combine_and_save.py - Complete data combination workflow
  - Select and combine Parquet + Zarr datasets
  - Spatial data integration
  - Save to local files and personal storage
  - Metadata generation and tracking
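To make the spatial integration step of the last script more concrete, here is a minimal sketch of one common approach: attaching the value of the nearest grid cell to each occurrence point. It is an assumption about how such a combination can work, not the code of python/05_combine_and_save.py. The function names, the regular latitude/longitude grid, and the sample values are all hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_coords, value):
    """Index of the coordinate closest to `value` in an ascending list."""
    pos = bisect_left(sorted_coords, value)
    if pos == 0:
        return 0
    if pos == len(sorted_coords):
        return len(sorted_coords) - 1
    before, after = sorted_coords[pos - 1], sorted_coords[pos]
    # Ties go to the lower coordinate.
    return pos - 1 if value - before <= after - value else pos


def attach_grid_values(points, lats, lons, grid, name="value"):
    """Add the nearest grid value to each point.

    points: list of dicts with "lat" and "lon" keys (e.g. occurrence records)
    lats, lons: ascending grid coordinates
    grid: nested list indexed as grid[lat_index][lon_index]
    """
    combined = []
    for point in points:
        i = nearest_index(lats, point["lat"])
        j = nearest_index(lons, point["lon"])
        combined.append({**point, name: grid[i][j]})
    return combined


# Made-up data: two occurrences and a 2 x 3 sea surface temperature grid.
occurrences = [
    {"species": "Gadus morhua", "lat": 54.9, "lon": 3.2},
    {"species": "Clupea harengus", "lat": 56.1, "lon": 4.8},
]
grid_lats = [55.0, 56.0]
grid_lons = [3.0, 4.0, 5.0]
sst = [[10.5, 10.8, 11.0], [9.7, 9.9, 10.1]]

for row in attach_grid_values(occurrences, grid_lats, grid_lons, sst, name="sst"):
    print(row)
```

With xarray, the same lookup is usually written with `sel(..., method="nearest")` on the opened dataset rather than by hand.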
Additional Tools
- python/check_credentials.py - Verify storage credentials
- python/run_full_demo.py - Run complete workflow automatically
🛠️ Services Available
RStudio Service
Perfect for:
- Statistical analysis and spatial data processing
- Data visualization and reporting
- R-based marine research workflows
- Quick data exploration and analysis
Getting Started:
1. Launch RStudio service in EDITO Datalab
2. Run r/01_stac_search.R to discover data collections
3. Use r/02_read_parquet.R to process biodiversity data
4. Try r/03_personal_storage.R for data management
Jupyter Service
Perfect for:
- Machine learning and data science
- Interactive data exploration
- Python-based analysis and visualization
- Notebook-based research workflows
Getting Started:
1. Launch Jupyter service in EDITO Datalab
2. Run the Python scripts in sequence:
   - python/01_get_stac_collections.py
   - python/02_search_stac_assets.py
   - python/03_get_zarr_to_df.py
   - python/04_get_parquet_data.py
   - python/05_combine_and_save.py
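The numbered prefixes define the execution order, so the sequence can also be run from a small driver. The sketch below is hypothetical and is not python/run_full_demo.py: it sorts the scripts by prefix and stops at the first failure. Because the scripts are described as interactive, an unattended run may wait for input.

```python
import re
import subprocess
import sys
from pathlib import Path


def ordered_scripts(names):
    """Return script names that start with a numeric prefix, sorted by it."""
    numbered = []
    for name in names:
        match = re.match(r"(\d+)_", Path(name).name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def run_in_sequence(script_dir="python"):
    """Run each numbered script with the current interpreter, in order."""
    scripts = ordered_scripts(str(p) for p in Path(script_dir).glob("*.py"))
    for script in scripts:
        print(f"Running {script}")
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, script], check=True)


# Example call from the folder that contains the python/ directory:
# run_in_sequence("python")
```

Helper scripts without a numeric prefix, such as check_credentials.py, are skipped by design.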
VSCode Service
Perfect for:
- Mixed R/Python projects
- Large codebases and development
- Collaborative research
- Advanced data processing workflows
Getting Started:
1. Launch VSCode service in EDITO Datalab
2. Open the using_datalab folder
3. Run either R or Python scripts as needed
4. Use integrated terminal for command-line tools
📊 Data Formats & Sources
STAC (SpatioTemporal Asset Catalog)
- Purpose: Find and discover marine datasets
- API: https://api.dive.edito.eu/data/
- Use: Search for available data collections
- Scripts: 01_stac_search.R, 01_get_stac_collections.py
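A minimal sketch of collection discovery, using only the Python standard library, might look like the following. The API root is the one listed above. The `collections` path follows the STAC API specification and is an assumption about this deployment, as are the function names and the sample records. Pagination is ignored for brevity.

```python
import json
from urllib.request import urlopen

# API root as given in this tutorial; the "collections" path below follows the
# STAC API specification and is an assumption about this particular deployment.
STAC_API = "https://api.dive.edito.eu/data/"


def fetch_collections(api_root=STAC_API, timeout=30):
    """Download the collection list from a STAC API (needs network access)."""
    url = api_root.rstrip("/") + "/collections"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response).get("collections", [])


def filter_collections(collections, keyword):
    """Keep collections whose id, title, description, or keywords mention `keyword`."""
    needle = keyword.lower()
    matches = []
    for collection in collections:
        text = " ".join(
            [
                collection.get("id", ""),
                collection.get("title") or "",
                collection.get("description") or "",
                " ".join(collection.get("keywords") or []),
            ]
        )
        if needle in text.lower():
            matches.append(collection)
    return matches


# Offline example with made-up collections; inside a Datalab service the list
# would come from fetch_collections() instead.
sample = [
    {"id": "occurrence-data", "title": "Species occurrences", "keywords": ["biodiversity"]},
    {"id": "sea-temperature", "description": "Gridded ocean temperature"},
]
print([c["id"] for c in filter_collections(sample, "biodiversity")])
```

Keeping the download and the filtering in separate functions lets the filter be tried on saved metadata without network access.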
Parquet (Biodiversity Data)
- Purpose: Efficient tabular data storage for occurrence records
- Use: Marine species observations, biodiversity data
- Example: EUROBIS marine species occurrence data
- Scripts: 02_read_parquet.R, 04_get_parquet_data.py
- Features: Fast querying, columnar storage, schema evolution
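Columnar storage means a reader can fetch only the columns and row groups it needs. The sketch below shows the idea with pandas and the pyarrow engine: a bounding box is expressed as row filters and passed to `pandas.read_parquet`. The helper names, the default column names (`decimallongitude`, `decimallatitude`), and the placeholder URL are assumptions for illustration, so check the actual file schema before relying on them.

```python
def bbox_filters(min_lon, min_lat, max_lon, max_lat,
                 lon_col="decimallongitude", lat_col="decimallatitude"):
    """Build row filters, as (column, operator, value) tuples, for a bounding box.

    The default column names are assumptions; check the file's schema first.
    """
    return [
        (lon_col, ">=", min_lon),
        (lon_col, "<=", max_lon),
        (lat_col, ">=", min_lat),
        (lat_col, "<=", max_lat),
    ]


def read_occurrences(path, columns=None, filters=None, storage_options=None):
    """Read selected columns and rows of a Parquet file into a DataFrame."""
    # Imported here so the filter helper above works without pandas installed.
    import pandas as pd

    return pd.read_parquet(
        path,
        engine="pyarrow",
        columns=columns,
        filters=filters,
        storage_options=storage_options,
    )


north_sea = bbox_filters(-4.0, 51.0, 9.0, 61.0)
print(north_sea)

# Hypothetical call; the URL is a placeholder, not a real EDITO asset:
# df = read_occurrences(
#     "https://example.org/occurrences.parquet",
#     columns=["scientificname", "decimallongitude", "decimallatitude"],
#     filters=north_sea,
# )
```

Selecting columns and filtering at read time keeps memory use low on large occurrence tables, which matters when the service has only a few gigabytes of RAM.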
Zarr (Oceanographic Data)
- Purpose: Cloud-optimized array data for large datasets
- Use: Oceanographic data, climate reanalyses, satellite data
- Tools: xarray, zarr-python
- Scripts: 03_get_zarr_to_df.py
- Features: Chunked storage, parallel access, compression
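Because Zarr stores are chunked, a subset can be read without downloading the whole array. The sketch below shows one simple way to thin a large variable before flattening it to a DataFrame with xarray: every dimension is read with a fixed stride so that at most `max_per_dim` values remain. This is an assumed interpretation of sampling, not the logic of 03_get_zarr_to_df.py, and the function names are hypothetical.

```python
import math


def subsample_step(size, max_size):
    """Smallest stride that keeps at most `max_size` values out of `size`."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return max(1, math.ceil(size / max_size))


def zarr_to_dataframe(store, variable, max_per_dim=100):
    """Open a Zarr store, thin every dimension, and flatten one variable."""
    # Imported here so subsample_step can be used without xarray installed.
    import xarray as xr

    dataset = xr.open_zarr(store)
    data = dataset[variable]
    strides = {
        dim: slice(None, None, subsample_step(size, max_per_dim))
        for dim, size in data.sizes.items()
    }
    # Only the strided subset is loaded; to_dataframe yields one row per cell.
    return data.isel(strides).to_dataframe().reset_index()


# Stride needed to reduce a 1440-point longitude axis to at most 100 points.
print(subsample_step(1440, 100))
```

A regular stride is easy to reason about but can miss small features; selecting a region or time window first is often a better way to reduce the data volume.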
Personal Storage (MyFiles)
- Purpose: Your personal cloud storage for data and results
- Access: Automatically configured in EDITO services
- Use: Save processed data, share results, backup analysis
- Scripts: 03_personal_storage.R, 05_combine_and_save.py
- Formats: CSV, Parquet, JSON, any file type
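The sketch below illustrates how automatically configured credentials could be picked up in Python. It assumes the service exposes S3-style credentials through the conventional environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, and `AWS_S3_ENDPOINT`. That is an assumption, not something this tutorial confirms, so run python/check_credentials.py to see what your service actually provides. The function names and sample values are hypothetical.

```python
import os

# Assumed variable names; adjust them to what the service actually sets.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_S3_ENDPOINT")


def missing_credentials(env=None):
    """List the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


def storage_options_from_env(env=None):
    """Build fsspec/s3fs-style storage options from environment variables."""
    env = os.environ if env is None else env
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError("Missing storage credentials: " + ", ".join(missing))
    endpoint = env["AWS_S3_ENDPOINT"]
    # Assumption: the endpoint may be given as a bare host name.
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "https://" + endpoint
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
        "client_kwargs": {"endpoint_url": endpoint},
    }
    if env.get("AWS_SESSION_TOKEN"):
        options["token"] = env["AWS_SESSION_TOKEN"]
    return options


# Made-up values to show the resulting structure.
fake_env = {
    "AWS_ACCESS_KEY_ID": "fake-key",
    "AWS_SECRET_ACCESS_KEY": "fake-secret",
    "AWS_SESSION_TOKEN": "fake-token",
    "AWS_S3_ENDPOINT": "minio.example.org",
}
print(sorted(storage_options_from_env(fake_env)))

# Hypothetical use with pandas (requires s3fs); bucket and path are placeholders:
# df.to_parquet("s3://my-bucket/results/combined.parquet",
#               storage_options=storage_options_from_env())
```

Checking for missing variables first gives a clear error message instead of an authentication failure later in the workflow.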
🎥 Video Examples
The tutorial includes video demonstrations of:
- Service Configuration: RStudio, Jupyter, and VSCode setup
- STAC Search: Finding and exploring marine data collections
- Data Processing: Working with Parquet and Zarr data
- Personal Storage: Uploading and managing data in MyFiles
- Complete Workflow: End-to-end data analysis pipeline
🚀 Getting Started
Option 1: Quick Start (15 minutes)
- Launch a service at datalab.dive.edito.eu
- Run one script to get familiar with the workflow
- Explore the data and see what's available
Option 2: Complete Workflow (1 hour)
- Start with R: Run r/01_stac_search.R to discover data
- Process data: Use r/02_read_parquet.R for biodiversity data
- Manage storage: Try r/03_personal_storage.R for data management
- Advanced Python: Run the Python scripts in sequence for the full workflow
Option 3: Automated Demo
- Run the complete demo: python/run_full_demo.py
- Watch the process: See all steps automated
- Examine results: Check the output files and storage
📖 Additional Resources
- EDITO Datalab: datalab.dive.edito.eu
- EDITO Data API: data.dive.edito.eu
- STAC Specification: stacspec.org
- Personal Storage: datalab.dive.edito.eu/account/storage
- Workshop Repository: GitHub
🤝 Contributing
Found an issue or have suggestions? Please contribute to improve this workshop!
Ready to start? Go to datalab.dive.edito.eu and launch your first service! 🌊🐠