Add Process to EDITO

Learn how to deploy computational models and data processing workflows to EDITO.

⚠️ Important Notice: The example process provided in this tutorial is purely demonstrative and serves as a template for learning. It may need to be adjusted, customized, or completely rewritten to meet your specific data processing needs and requirements.

🎯 What You'll Learn

  • Identify when your application is a model (input → output transformation)
  • Dockerize computational workflows and models
  • Deploy processes to EDITO using the process playground
  • Configure Helm charts for batch processing jobs
  • Handle input data from online sources or personal storage

🛠️ Requirements

  • Docker - Containerization
  • GitLab account - EDITO infrastructure access
  • Container registry (GitHub Packages, Docker Hub, etc.)

🤔 When Is My App a Model?

Your application qualifies as a model when it:

  • Takes input data and transforms it into output data
  • Performs computational analysis, prediction, or simulation
  • Processes data through algorithms or mathematical operations
  • Generates results that can be used for decision-making or further analysis

Examples:

  • Machine learning models (prediction, classification)
  • Statistical analysis workflows
  • Simulation models
  • Data processing pipelines
  • Image processing algorithms

📊 Input Data Sources

Your model can work with data from several sources:

External APIs and URLs

  • Download data from external services
  • Access real-time data streams
  • Connect to public datasets and repositories
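
For example, a model script can fetch a public dataset over HTTP at startup. A minimal R sketch, where the URL and file paths are placeholders rather than part of this tutorial's example:

# Download a public CSV into the shared data directory (URL is a placeholder)
input_url <- "https://example.com/public-dataset.csv"
dir.create("/data/input", recursive = TRUE, showWarnings = FALSE)
download.file(input_url, destfile = "/data/input/sample_data.csv", mode = "wb")
data <- read.csv("/data/input/sample_data.csv")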

Pre-loaded Data

  • Include static data in your Docker image
  • Copy data files during container build
  • Access data from /app/data/ directory
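
A static dataset can be baked into the image at build time. A minimal Dockerfile fragment, where the source path is illustrative and not taken from the example_model/ Dockerfile:

# Copy static input data into the image at build time (source path is illustrative)
COPY data/ /app/data/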

Generated Data

  • Create sample data for demonstration
  • Generate synthetic datasets for testing
  • Use built-in R data generation functions
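
If no external input is needed, the scripts can create their own test data. A hedged R sketch, with column names chosen to match the analysis example further below:

# Generate a small synthetic dataset for testing (column names are illustrative)
set.seed(42)
sample_data <- data.frame(
  category = sample(c("A", "B", "C"), 100, replace = TRUE),
  value = rnorm(100, mean = 10, sd = 2)
)
dir.create("/data/input", recursive = TRUE, showWarnings = FALSE)
write.csv(sample_data, "/data/input/sample_data.csv", row.names = FALSE)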

📁 Tutorial Examples

  • example_model/ - Complete example model workflow with R scripts
  • example_process/ - Demonstrative process template (Helm chart for deployment)

🔬 Example Model

The example_model/ directory contains a complete R-based model that demonstrates the typical workflow:

Model Components

  • Dockerfile - Container configuration with R environment
  • Scripts/ - R scripts for data processing and analysis
    • 01_data_preparation.R - Data preprocessing script
    • 02_model_analysis.R - Statistical analysis and visualization
  • requirements.txt - R package dependencies

Model Workflow

# 01_data_preparation.R
# Load and clean input data
library(dplyr)

data <- read.csv("/data/input/sample_data.csv")
processed_data <- data %>%
  filter(!is.na(value)) %>%
  mutate(processed_value = value * 2)

# 02_model_analysis.R
# Run analysis and generate results
model <- lm(processed_value ~ category, data = processed_data)
results <- summary(model)

# Save outputs (write the coefficient table, since the full summary object is not tabular)
write.csv(as.data.frame(results$coefficients), "/data/output/analysis_results.csv")
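
The example_model/ Dockerfile itself is not reproduced in this tutorial. A minimal sketch of what an R-based model image could look like, where the base image, package list, and entry point are assumptions rather than the actual file contents:

# Illustrative Dockerfile for an R-based model (not the actual example_model/ Dockerfile)
FROM rocker/r-ver:4.3.1
RUN R -e "install.packages('dplyr', repos = 'https://cloud.r-project.org')"
COPY Scripts/ /Scripts/
CMD ["Rscript", "/Scripts/01_data_preparation.R"]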

🐳 Dockerize and Push

1. Build Your Docker Image

cd example_model/
docker build -t your-registry/your-model:latest .

2. Push to Container Registry

# Tag for your registry
docker tag your-registry/your-model:latest your-registry.com/your-model:latest

# Push to registry
docker push your-registry.com/your-model:latest

Supported Registries:

  • GitHub Packages
  • Docker Hub
  • GitLab Container Registry
  • Any OCI-compatible registry
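
Most registries require you to authenticate before pushing. A hedged example for GitHub Packages, where the username and token are placeholders:

# Log in to GitHub Packages (ghcr.io) before pushing; the token is a placeholder
echo "$GITHUB_TOKEN" | docker login ghcr.io -u your-github-username --password-stdin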

🚀 Deploy to EDITO

1. Clone EDITO Process Playground

git clone https://gitlab.mercator-ocean.fr/pub/edito-infra/process-playground.git
cd process-playground

Note: Follow the instructions in the EDITO Process Playground README before deploying your process.

2. Deploy Your Process

Use the process playground interface to deploy your containerized model with the Helm chart template.
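
If you want a quick sanity check before using the playground, the chart can be linted and rendered locally with Helm. This assumes Helm is installed and that you run the commands from this tutorial repository:

# Optional local validation of the example chart before deploying through the playground
helm lint example_process/
helm template my-test example_process/ --values example_process/values.yaml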

📋 Example Process Template

The example_process/ directory contains a complete Helm chart that demonstrates how to deploy your model as a Kubernetes job.

Process Flow

  • Download: Input data is downloaded from your personal S3 storage to /data/input
  • Process: Two sequential processing steps run in /data:
    • Data preparation (Rscript /Scripts/01_data_preparation.R)
    • Model analysis (Rscript /Scripts/02_model_analysis.R)
  • Upload: Results are uploaded from /data/output back to your personal S3 storage

Key Features

  • Simple S3 Integration: Downloads from and uploads to your personal storage
  • Configurable Processing: Commands can be customized in values.yaml
  • Tutorial-Friendly: Clear data flow with /data directory structure
  • Error Handling: Proper timeout and logging mechanisms

🔧 Important Components

Job YAML (templates/job.yaml)

The main Kubernetes Job configuration that orchestrates your process:

# Init container downloads data
initContainers:
- name: s3-download
  command: ["aws", "s3", "sync", "s3://your-bucket/input/", "/data/input/"]

# Your processing containers run sequentially  
containers:
- name: data-prep
  command: ["Rscript", "/Scripts/01_data_preparation.R"]
- name: model-analysis  
  command: ["Rscript", "/Scripts/02_model_analysis.R"]

# Final container uploads results
- name: s3-upload
  command: ["aws", "s3", "sync", "/data/output/", "s3://your-bucket/output/"]

Key Features:

  • Init Container: Downloads input data from S3
  • Processing Containers: Run your model scripts sequentially
  • Upload Container: Uploads results back to S3
  • Shared Volume: All containers share the /data directory
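
The shared /data directory is typically backed by an emptyDir volume mounted into every container. A hedged YAML fragment showing how this can look in templates/job.yaml; the field names in the actual example chart may differ:

# Scratch volume shared by the download, processing, and upload containers
volumes:
- name: shared-data
  emptyDir: {}

# Mounted at /data in each container
volumeMounts:
- name: shared-data
  mountPath: /data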

Values Schema (values.schema.json)

Defines the configuration form in the EDITO playground:

{
  "type": "object",
  "properties": {
    "image": {
      "type": "object",
      "properties": {
        "repository": {
          "type": "string",
          "title": "Docker Image Repository",
          "description": "Your container registry URL"
        }
      }
    },
    "s3": {
      "type": "object",
      "properties": {
        "inputPath": {
          "type": "string",
          "title": "Input S3 Path",
          "description": "S3 path to your input data"
        },
        "outputPath": {
          "type": "string",
          "title": "Output S3 Path",
          "description": "S3 path for results"
        }
      }
    }
  }
}

Purpose:

  • UI Form Generation: Creates input fields in the playground
  • Validation: Ensures required values are provided
  • Documentation: Describes each configuration option

Chart Metadata (Chart.yaml)

Helm chart information and dependencies:

apiVersion: v2
name: example-process
description: Example data processing workflow
version: 0.1.0
dependencies:
- name: s3-secret
  version: "1.0.0"
  repository: "file://../s3-secret"

Components:

  • Chart Identity: Name, version, description
  • Dependencies: Required sub-charts (S3 secrets, etc.)
  • Metadata: For chart management and discovery
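
Because the chart pulls in a local s3-secret sub-chart, the dependency usually needs to be resolved before the chart can be packaged or installed directly. A hedged example, assuming s3-secret sits next to example_process/ as the file:// path suggests:

# Fetch declared dependencies (copies the local s3-secret chart into example_process/charts/)
helm dependency update example_process/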

Configuration Values (values.yaml)

Default configuration for your process:

# Docker image configuration
image:
  repository: "your-registry.com/your-model"
  tag: "latest"

# S3 configuration
s3:
  inputPath: "your-bucket/input/"
  outputPath: "your-bucket/output/"

# Processing commands
processing:
  dataPrep: "Rscript /Scripts/01_data_preparation.R"
  modelAnalysis: "Rscript /Scripts/02_model_analysis.R"

Configuration Options:

  • Docker Image: Your containerized model
  • S3 Paths: Input and output data locations
  • Processing Commands: Customizable R/Python scripts
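
If you deploy the chart directly with Helm instead of through the playground form, any of these defaults can be overridden on the command line. A hedged example, where the release name and values are placeholders:

# Override the default image and S3 paths at install time (values are placeholders)
helm install my-model-run example_process/ \
  --set image.repository=ghcr.io/your-org/your-model \
  --set image.tag=v1.0.0 \
  --set s3.inputPath=your-bucket/input/ \
  --set s3.outputPath=your-bucket/output/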

🔄 Input/Output Handling

The Kubernetes Job YAML orchestrates data flow through a simple three-stage process:

Input Stage: An init container downloads your data from S3 storage to /data/input

Processing Stage: Your containers run sequentially, processing data in the shared /data directory

Output Stage: A final container uploads results from /data/output back to your S3 storage
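
In practice, the input and output stages reduce to two aws s3 sync calls around your processing scripts. A hedged shell sketch, where the environment variable names are assumptions, since EDITO injects the S3 endpoint and credentials through its own secret mechanism:

# Input stage: pull data from your personal S3 storage (variable names are illustrative)
aws s3 sync "s3://${S3_BUCKET}/input/" /data/input/ --endpoint-url "${S3_ENDPOINT}"

# ...processing scripts run here...

# Output stage: push results back to your personal S3 storage
aws s3 sync /data/output/ "s3://${S3_BUCKET}/output/" --endpoint-url "${S3_ENDPOINT}"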

Learn More:

  • Kubernetes Jobs Documentation
  • EDITO Process Examples

🤝 Contributing

Found an issue or have suggestions? Please contribute to improve this workshop!

📖 Additional Resources


⚠️ Final Reminder: This tutorial provides a demonstrative example process template to help you understand the concepts and structure. The actual processing logic, data handling, and workflow steps will need to be customized or completely rewritten to match your specific use case and requirements. Use this as a starting point for learning, not as a production-ready solution.

📄 Presentation: Process Deployment Guide