Open Source Compute Scheduler
The Compute Gardener Scheduler is a Kubernetes scheduler plugin that solves a key challenge: optimizing when and where workloads run based on real-time carbon intensity data. With robust pod-level tracking and metrics, you get precise visibility into your workloads' energy consumption and carbon footprint. The scheduler helps teams make meaningful progress on sustainability goals while maintaining operational excellence.
Key Features
Core Features
Carbon-Aware Scheduling
- Schedule pods based on real-time carbon intensity data
- Connect to Electricity Map API or create custom intensity sources
- Built-in caching to limit external API calls
- Track and measure impact with detailed metrics
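The carbon-aware flow can be sketched in a few lines. This is hypothetical Python, not the scheduler's actual Go internals: a small TTL cache stands in for CACHE_TTL, and scheduling is delayed while intensity exceeds CARBON_INTENSITY_THRESHOLD.

```python
import time
from dataclasses import dataclass


@dataclass
class CachedIntensity:
    """Tiny TTL cache standing in for CACHE_TTL=5m; illustrative only."""
    ttl_s: float = 300.0
    _value: float = 0.0
    _fetched_at: float = -1.0

    def get(self, fetch) -> float:
        now = time.monotonic()
        if self._fetched_at < 0 or now - self._fetched_at > self.ttl_s:
            self._value = fetch()  # e.g. a call to the Electricity Map API
            self._fetched_at = now
        return self._value


def should_delay(intensity: float, threshold: float = 200.0) -> bool:
    """Delay scheduling while grid intensity (gCO2/kWh) exceeds the threshold."""
    return intensity > threshold


cache = CachedIntensity()
intensity = cache.get(lambda: 214.56)  # stubbed API response
print(should_delay(intensity))         # 214.56 > 200 -> True
```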
Comprehensive Metrics
- Track carbon intensity, energy usage, and cost metrics
- Measure actual carbon and cost savings
- Monitor resource utilization at pod and node level
- Prometheus integration for observability
Additional Capabilities
Price-Aware Scheduling
- Schedule based on time-of-use electricity pricing
- Define custom pricing schedules via YAML
- Pod-level controls via annotations
- Configurable scheduling delays with flexible time formats
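A minimal sketch of the price-aware gate, assuming an illustrative time-of-use schedule (the real schedule is loaded from the YAML file referenced by PRICING_SCHEDULES_PATH, and its schema may differ):

```python
from datetime import time

# Illustrative TOU periods: (start, end, $/kWh). Not the scheduler's schema.
TOU_PERIODS = [
    (time(0, 0), time(6, 0), 0.10),            # overnight off-peak
    (time(6, 0), time(16, 0), 0.18),
    (time(16, 0), time(21, 0), 0.28),          # evening peak
    (time(21, 0), time(23, 59, 59), 0.14),
]


def rate_at(t: time) -> float:
    """Look up the electricity rate for a wall-clock time."""
    for start, end, rate in TOU_PERIODS:
        if start <= t < end:
            return rate
    return TOU_PERIODS[-1][2]


def should_delay_for_price(t: time, threshold: float) -> bool:
    """Delay while the current rate exceeds the pod's price threshold."""
    return rate_at(t) > threshold


print(should_delay_for_price(time(17, 30), 0.12))  # peak rate 0.28 > 0.12 -> True
```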
Energy Budget Tracking
- Define and monitor energy usage limits for workloads
- Configurable actions when budgets are exceeded
- Track energy usage over pod lifecycle
- Namespace-level energy budgets for workload groups
Hardware Power Profiling
- Automatically detect node hardware to determine power profiles
- Map cloud instance types to their hardware components
- Accurate power modeling with datacenter PUE consideration
- GPU-specific power profiles for workload types
Advanced Policies
- Define team-based energy quotas at namespace level
- GPU workload classification for inference, training, and rendering
- Enable gradual adoption by starting with specific namespaces
- Workload-type optimization for batch jobs, services, and stateful workloads
Configuration
Prerequisites
- Metrics Server (Recommended): Without it, the scheduler won't be able to collect real-time node utilization data, resulting in less accurate energy usage estimates. Core carbon-aware and price-aware scheduling will still function using requested resources.
- Prometheus (Recommended): For visualizing scheduler performance metrics and validating carbon/cost savings. The scheduler will continue to function without it, but you'll miss valuable insights.
Environment Variables
```bash
# API Configuration
# Required: Your API key for Electricity Map API
ELECTRICITY_MAP_API_KEY=<your-api-key>
# Optional: Default is https://api.electricitymap.org/v3/carbon-intensity/latest?zone=
ELECTRICITY_MAP_API_URL=<api-url>
# Optional: Default is US-CAL-CISO
ELECTRICITY_MAP_API_REGION=<region>
# Optional: API request timeout
API_TIMEOUT=10s
# Optional: Maximum API retry attempts
API_MAX_RETRIES=3
# Optional: Delay between retries
API_RETRY_DELAY=1s
# Optional: API rate limit per minute
API_RATE_LIMIT=10
# Optional: Cache TTL for API responses
CACHE_TTL=5m
# Optional: Maximum age of cached data
MAX_CACHE_AGE=1h
# Optional: Enable pod priority-based scheduling
ENABLE_POD_PRIORITIES=false

# Scheduling Configuration
# Optional: Maximum pod scheduling delay
MAX_SCHEDULING_DELAY=24h

# Carbon Configuration
# Optional: Enable carbon-aware scheduling (default: true)
CARBON_ENABLED=true
# Optional: Base carbon intensity threshold (gCO2/kWh)
CARBON_INTENSITY_THRESHOLD=200.0

# Pricing Configuration
# Optional: Enable TOU pricing
PRICING_ENABLED=false
# Optional: Default is 'tou'
PRICING_PROVIDER=tou
# Path to pricing schedules
PRICING_SCHEDULES_PATH=/path/to/schedules.yaml

# Node Power Configuration
# Default idle power consumption in watts
NODE_DEFAULT_IDLE_POWER=100.0
# Default maximum power consumption in watts
NODE_DEFAULT_MAX_POWER=400.0
# Node-specific power settings
NODE_POWER_CONFIG_worker1=idle:50,max:300
# Path to hardware profiles ConfigMap
HARDWARE_PROFILES_PATH=/path/to/hardware-profiles.yaml

# Metrics Collection Configuration
# Interval for collecting pod metrics
METRICS_SAMPLING_INTERVAL=30s
# Maximum number of metrics samples per pod
MAX_SAMPLES_PER_POD=500
# How long to keep metrics for completed pods
COMPLETED_POD_RETENTION=1h
# Strategy for downsampling metrics
DOWNSAMPLING_STRATEGY=timeBased

# Observability Configuration
# Optional: Logging level
LOG_LEVEL=info
# Optional: Enable tracing
ENABLE_TRACING=false
```
Pod Annotations
```yaml
# Basic scheduling controls
# Opt out of compute-gardener scheduling
compute-gardener-scheduler.kubernetes.io/skip: "true"
# Disable carbon-aware scheduling for this pod
compute-gardener-scheduler.kubernetes.io/carbon-enabled: "false"
# Set custom carbon intensity threshold
compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "250.0"
# Set custom price threshold
compute-gardener-scheduler.kubernetes.io/price-threshold: "0.12"
# Set custom maximum scheduling delay
compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "12h"

# Energy budget controls
# Set energy budget in kilowatt-hours
compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
# Action when budget exceeded: log, notify, annotate, label
compute-gardener-scheduler.kubernetes.io/energy-budget-action: "notify"

# Hardware efficiency controls
# Maximum power consumption threshold
compute-gardener-scheduler.kubernetes.io/max-power-watts: "300.0"
# Minimum efficiency requirement
compute-gardener-scheduler.kubernetes.io/min-efficiency: "0.8"
# GPU workload type (inference, training, rendering)
compute-gardener-scheduler.kubernetes.io/gpu-workload-type: "inference"

# PUE configuration
# Power Usage Effectiveness for datacenter
compute-gardener-scheduler.kubernetes.io/pue: "1.2"
# GPU-specific Power Usage Effectiveness
compute-gardener-scheduler.kubernetes.io/gpu-pue: "1.15"

# Node hardware labels (for improved energy profiles)
node.kubernetes.io/cpu-model: "Intel(R) Xeon(R) Platinum 8275CL"
node.kubernetes.io/gpu-model: "NVIDIA A100"
```
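For example, a Job that tolerates delay in exchange for cleaner energy might combine several of these annotations. The Job name, image, and `schedulerName` value below are illustrative; check your deployment for the scheduler's actual registered name.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-training   # hypothetical workload name
spec:
  template:
    metadata:
      annotations:
        compute-gardener-scheduler.kubernetes.io/carbon-intensity-threshold: "250.0"
        compute-gardener-scheduler.kubernetes.io/max-scheduling-delay: "12h"
        compute-gardener-scheduler.kubernetes.io/energy-budget-kwh: "5.0"
        compute-gardener-scheduler.kubernetes.io/energy-budget-action: "notify"
        compute-gardener-scheduler.kubernetes.io/gpu-workload-type: "training"
    spec:
      schedulerName: compute-gardener-scheduler   # assumed; verify in your install
      restartPolicy: Never
      containers:
      - name: train
        image: registry.example.com/train:latest   # placeholder image
```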
Namespace-Level Energy Policies
Define energy policies at the namespace level that automatically apply to all pods in the namespace:
```yaml
# Enable energy policies for this namespace
labels:
  compute-gardener-scheduler.kubernetes.io/energy-policies: "enabled"

# Default policies for all pods in this namespace
annotations:
  # Default carbon intensity threshold for all pods
  compute-gardener-scheduler.kubernetes.io/policy-carbon-intensity-threshold: "200"
  # Default energy budget (in kWh) for all pods
  compute-gardener-scheduler.kubernetes.io/policy-energy-budget-kwh: "10"
  # Default action when budget is exceeded
  compute-gardener-scheduler.kubernetes.io/policy-energy-budget-action: "notify"

  # Workload-specific policy overrides
  # Energy budget for batch jobs (like training jobs)
  compute-gardener-scheduler.kubernetes.io/workload-batch-policy-energy-budget-kwh: "20"
  # GPU workload type for batch jobs
  compute-gardener-scheduler.kubernetes.io/workload-batch-policy-gpu-workload-type: "training"
  # Price threshold for service workloads (like APIs, web servers)
  compute-gardener-scheduler.kubernetes.io/workload-service-policy-price-threshold: "0.15"
```
Hardware Power Profiles
The scheduler uses hardware-specific power profiles to accurately estimate and optimize energy consumption:
- Hardware Profile Database provides power profiles for various CPU, GPU, and memory types
- Cloud Instance Detection automatically maps cloud instances to their hardware components
- Hybrid Cloud Hardware Detection uses node labels or runtime detection to identify hardware
- Accurate energy estimation with datacenter PUE (Power Usage Effectiveness) consideration
- GPU workload-specific power profiles to accurately model AI/ML workloads
Hardware Profile ConfigMap
```yaml
# Global PUE defaults
# Default datacenter PUE (typical range: 1.1-1.6)
defaultPUE: 1.1
# Default GPU-specific PUE for power conversion losses
defaultGPUPUE: 1.15

# CPU power profiles
cpuProfiles:
  "Intel(R) Xeon(R) Platinum 8275CL":
    # Idle power in watts
    idlePower: 10.5
    # Max power in watts
    maxPower: 120.0
  "Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz":
    idlePower: 5.0
    maxPower: 65.0

# GPU power profiles with workload type coefficients
gpuProfiles:
  "NVIDIA A100":
    # Idle power in watts
    idlePower: 25.0
    # Max power in watts at 100% utilization
    maxPower: 400.0
    # Power coefficients for different workload types
    workloadTypes:
      # Inference typically uses ~60% of max power at 100% utilization
      inference: 0.6
      # Training uses full power
      training: 1.0
      # Rendering uses ~90% of max power at 100% utilization
      rendering: 0.9
  "NVIDIA GeForce GTX 1660":
    idlePower: 7.0
    maxPower: 125.0
    workloadTypes:
      inference: 0.5
      training: 0.9
      rendering: 0.8

# Memory power profiles
memProfiles:
  "DDR4-2666 ECC":
    # Idle power per GB in watts
    idlePowerPerGB: 0.125
    # Max power per GB in watts at full utilization
    maxPowerPerGB: 0.375
    # Base power overhead in watts
    baseIdlePower: 1.0

# Cloud instance mappings to hardware components
cloudInstanceMapping:
  aws:
    "m5.large":
      cpuModel: "Intel(R) Xeon(R) Platinum 8175M"
      memoryType: "DDR4-2666 ECC"
      numCPUs: 2
      # in MB
      totalMemory: 8192
  gcp:
    "n2-standard-4":
      cpuModel: "Intel Cascade Lake"
      memoryType: "DDR4-3200"
      numCPUs: 4
      # in MB
      totalMemory: 16384
```
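One plausible reading of these profiles is linear interpolation between idle and max power, with the workload coefficient capping GPU draw and PUE applied on top. The sketch below assumes that model using the A100 inference values; the scheduler's exact formula may differ.

```python
def estimate_power_watts(
    cpu_util: float,
    gpu_util: float,
    cpu_idle: float = 10.5, cpu_max: float = 120.0,  # Xeon Platinum 8275CL profile
    gpu_idle: float = 25.0, gpu_max: float = 400.0,  # NVIDIA A100 profile
    workload_coeff: float = 0.6,                     # inference coefficient
    pue: float = 1.1,                                # defaultPUE
) -> float:
    """Assumed model: CPU interpolates idle..max; the GPU's effective max is
    workload_coeff * maxPower (e.g. inference peaks at ~60% of max power);
    PUE scales the whole node to account for datacenter overhead."""
    cpu_power = cpu_idle + cpu_util * (cpu_max - cpu_idle)
    gpu_power = gpu_idle + gpu_util * (workload_coeff * gpu_max - gpu_idle)
    return (cpu_power + gpu_power) * pue


# 50% CPU, fully loaded A100 running inference:
# (65.25 + 240.0) * 1.1 = 335.775 W
print(round(estimate_power_watts(cpu_util=0.5, gpu_util=1.0), 1))  # -> 335.8
```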
Observability & Metrics
The scheduler exports comprehensive Prometheus metrics for monitoring:
- Carbon and Pricing Metrics: Current carbon intensity, electricity rates, scheduling delays
- Energy Budget Metrics: Budget usage percentage, exceeded budget counts, job energy usage
- Hardware Efficiency Metrics: Node PUE, efficiency metrics, power-filtered nodes
- Resource Utilization Metrics: CPU, memory, and GPU usage across nodes
- Power Estimation Metrics: Estimated node power consumption with PUE consideration
- Carbon Emissions Metrics: Estimated job carbon emissions in grams of CO2
- Scheduler Performance: Scheduling attempts, latency, estimated savings
- Metrics System: Sampling counts, cache size, and system health
These metrics help validate the scheduler's behavior, measure carbon and cost savings, and ensure optimal performance.
Metrics Integration
The scheduler exposes metrics through multiple integration methods:
- Health checks on port 10259 (HTTPS) path /healthz
- Metrics on port 10259 (HTTPS) path /metrics
- Built-in ServiceMonitor resources for Prometheus Operator
- Prometheus annotation-based discovery support
- Configurable sampling intervals and downsampling strategies
- Customizable metrics retention for completed jobs
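A hypothetical illustration of a `timeBased` downsampling pass (the actual strategy behind DOWNSAMPLING_STRATEGY=timeBased may differ): once a pod exceeds MAX_SAMPLES_PER_POD, keep one sample per evenly sized time bucket.

```python
def downsample_time_based(samples, max_samples):
    """samples: list of (timestamp_seconds, watts), assumed sorted by time.
    Keeps the first sample in each of max_samples equal time buckets."""
    if len(samples) <= max_samples:
        return list(samples)
    start, end = samples[0][0], samples[-1][0]
    bucket = (end - start) / max_samples
    kept, next_cut = [], start
    for ts, watts in samples:
        if ts >= next_cut:
            kept.append((ts, watts))
            next_cut = ts + bucket
    return kept


# 1000 samples at 30 s intervals reduced to the 500-sample cap
samples = [(i * 30, 100.0) for i in range(1000)]
print(len(downsample_time_based(samples, 500)))  # -> 500
```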
Example Metrics
```
# Carbon intensity and electricity rate metrics
scheduler_compute_gardener_carbon_intensity{region="US-CAL-CISO"} 214.56
scheduler_compute_gardener_electricity_rate{region="US-CAL-CISO"} 0.28

# Energy budget tracking metrics
scheduler_compute_gardener_energy_budget_usage_percent{namespace="ai-training",pod="training-job-1"} 78.5
scheduler_compute_gardener_job_energy_usage_kwh{job="batch-job-123"} 4.75
scheduler_compute_gardener_job_carbon_emissions_grams{job="batch-job-123"} 1023.8

# Hardware efficiency metrics
scheduler_compute_gardener_node_pue{node="worker-1"} 1.15
scheduler_compute_gardener_node_power_estimate_watts{node="worker-1"} 267.4
```
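These metric names can be queried or alerted on directly in Prometheus; a few hypothetical PromQL examples (label values as shown above):

```promql
# Current carbon intensity for the configured region
scheduler_compute_gardener_carbon_intensity{region="US-CAL-CISO"}

# Pods within 10% of their energy budget (candidate alert rule)
scheduler_compute_gardener_energy_budget_usage_percent > 90

# Total estimated power draw across the fleet, in watts
sum(scheduler_compute_gardener_node_power_estimate_watts)
```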
Common Use Cases
ML/GPU Workloads
Run AI training jobs when carbon intensity is lowest, with specialized power profiles for GPU workload types.
Inference Services
Monitor energy usage of services without impacting SLAs, while gaining insights into optimal scheduling windows for future deployments.
Batch Processing
Schedule data processing jobs during low-cost electricity periods with configurable scheduling delays.
Energy Budgeting
Track workload energy usage with configurable alerts as budgets approach limits, enabling proactive planning while maintaining service availability.
Multi-Cloud Optimization
Accurately model power across different cloud providers using hardware profiles and PUE configurations.
Research & Academia
Measure energy consumption of compute-intensive research workloads while still maximizing resource utilization with flexible scheduling policies.
Join Our Community
Get involved with the Compute Gardener community. Ask questions, share your experience, and contribute to making computing more sustainable.
Stay Updated
Get updates about new features, carbon optimization tips, and community highlights.