Monitoring and Observability

Terralist exposes Prometheus metrics covering registry operations, storage backend performance, and system health.

Metrics Endpoint

Metrics are available at:

GET /metrics

This endpoint returns metrics in Prometheus format and can be scraped by Prometheus or compatible monitoring systems.

Available Metrics

Application Metrics

Build Information

terralist_build_info{version="...", commit="...", build_time="..."}

Static metadata about the running Terralist instance.
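
As a sketch of how this metric can be used: `count()` counts series and ignores their values, so the queries below work regardless of the value Terralist assigns to the metric. They assume only the `version` label shown above.

```promql
# Number of scraped instances per version
count by (version) (terralist_build_info)

# Returns a result only when more than one version is running at the same time
count(count by (version) (terralist_build_info)) > 1
```

The second query can be handy during rolling upgrades to spot instances left on an old build.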

Uptime

terralist_uptime_seconds

Total seconds since Terralist started.
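
Example queries (a sketch, assuming the value restarts near zero whenever the process restarts and that the series keeps the same labels across restarts):

```promql
# Instances that started within the last 10 minutes
terralist_uptime_seconds < 600

# Number of times the value dropped (i.e. the process restarted) in the last hour
resets(terralist_uptime_seconds[1h])
```

`resets()` is documented for counters, but it counts any decrease between consecutive samples, which is why it is used here on an uptime value.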

Errors

terralist_errors_total{type="..."}

Total count of errors by type.
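
Example queries (a sketch using only the `type` label shown above; the actual label values are not listed here):

```promql
# Per-second error rate, broken down by type
sum by (type) (rate(terralist_errors_total[5m]))

# The three error types with the most occurrences in the last hour
topk(3, sum by (type) (increase(terralist_errors_total[1h])))
```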


Artifacts Metrics

Track module and provider operations across authorities.

Uploads

terralist_artifacts_uploaded_total{type="module|provider", authority="..."}

Total number of uploaded artifacts.

Example queries:

# Upload rate per minute (rate() returns a per-second value, so scale by 60)
rate(terralist_artifacts_uploaded_total[5m]) * 60

# Uploads by authority
sum by (authority) (terralist_artifacts_uploaded_total)

Downloads

terralist_artifacts_downloaded_total{type="module|provider", authority="..."}

Total number of downloaded artifacts.

Example queries:

# Authorities with the most module downloads
topk(10, sum by (authority) (terralist_artifacts_downloaded_total{type="module"}))

Deletions

terralist_artifacts_deleted_total{type="module|provider", authority="..."}

Total number of deleted artifacts.
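
Example queries (a sketch; the 24-hour window is an arbitrary choice):

```promql
# Artifacts deleted per authority over the last 24 hours
sum by (authority) (increase(terralist_artifacts_deleted_total[24h]))

# Net change (uploads minus deletions) per authority and type over the last 24 hours
sum by (authority, type) (increase(terralist_artifacts_uploaded_total[24h]))
-
sum by (authority, type) (increase(terralist_artifacts_deleted_total[24h]))
```

The subtraction only returns results for authority and type combinations that have both an upload series and a deletion series, so an authority that has never deleted anything will not appear in the second query.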

Current Artifact Count

terralist_artifacts_total{type="module|provider", authority="..."}

Current number of artifacts (gauge that increases with uploads, decreases with deletions).

Example queries:

# Total artifacts in registry
sum(terralist_artifacts_total)

# Artifacts per authority
sum by (authority, type) (terralist_artifacts_total)


Request Metrics

Requests by Authority

terralist_requests_by_authority_total{authority="...", operation="upload|download|list"}

Total requests grouped by authority and operation type.

Example queries:

# Request rate per authority
rate(terralist_requests_by_authority_total[5m])

# Most active authorities
topk(5, sum by (authority) (rate(terralist_requests_by_authority_total[5m])))
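
The `operation` label can also be used to compare read and write traffic. The following is a sketch; the one-hour window is an arbitrary choice.

```promql
# Request rate broken down by operation
sum by (operation) (rate(terralist_requests_by_authority_total[5m]))

# Downloads per upload, per authority
sum by (authority) (rate(terralist_requests_by_authority_total{operation="download"}[1h]))
/
sum by (authority) (rate(terralist_requests_by_authority_total{operation="upload"}[1h]))
```

The ratio is `+Inf` for authorities with downloads but no uploads in the window, and `NaN` when both rates are zero.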


API Keys Metrics

terralist_api_keys_total{authority="...", status="active|expired"}

Current number of API keys by authority and status.

Example queries:

# Total active API keys
sum(terralist_api_keys_total{status="active"})

# Expired keys that need cleanup
terralist_api_keys_total{status="expired"} > 0
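
To judge how stale the key inventory is, the expired count can be related to the total. This is a sketch using only the labels shown above.

```promql
# Fraction of API keys that are expired, per authority (0 to 1)
sum by (authority) (terralist_api_keys_total{status="expired"})
/
sum by (authority) (terralist_api_keys_total)
```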


Storage Backend Metrics

Monitor the performance and health of storage backends (S3, Azure, GCS, Local).

Operations

terralist_storage_operations_total{operation="upload|download|delete", backend="s3|azure|gcs|local", status="success|error"}

Total storage operations by type, backend, and status.

Example queries:

# Error rate by backend
sum by (backend) (rate(terralist_storage_operations_total{status="error"}[5m]))

# Success rate percentage
sum(rate(terralist_storage_operations_total{status="success"}[5m])) 
/ 
sum(rate(terralist_storage_operations_total[5m])) * 100

Data Transfer

terralist_storage_bytes_total{operation="upload|download", backend="..."}

Total bytes transferred through storage operations.

Example queries:

# Upload throughput (bytes/sec)
rate(terralist_storage_bytes_total{operation="upload"}[5m])

# Total data uploaded per backend
sum by (backend) (terralist_storage_bytes_total{operation="upload"})

Operation Duration

terralist_storage_operation_duration_seconds{operation="...", backend="..."}

Histogram of storage operation durations.

Example queries:

# P95 upload latency
histogram_quantile(0.95, sum(rate(terralist_storage_operation_duration_seconds_bucket{operation="upload"}[5m])) by (le, backend))

# P50 download latency by backend
histogram_quantile(0.50, sum(rate(terralist_storage_operation_duration_seconds_bucket{operation="download"}[5m])) by (le, backend))

# Operations slower than 5s (total count minus the operations that finished within 5s)
terralist_storage_operation_duration_seconds_count - ignoring(le) terralist_storage_operation_duration_seconds_bucket{le="5.0"}
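
Prometheus histograms expose `_sum` and `_count` series alongside the `_bucket` series, so an average duration can be derived without relying on specific bucket boundaries. A minimal sketch:

```promql
# Average storage operation duration (seconds) per backend and operation
sum by (backend, operation) (rate(terralist_storage_operation_duration_seconds_sum[5m]))
/
sum by (backend, operation) (rate(terralist_storage_operation_duration_seconds_count[5m]))
```

Averages hide outliers, so this is best read together with the percentile queries above.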


HTTP Metrics

Standard HTTP metrics provided by Prometheus middleware.

Request Duration

terralist_http_request_duration_seconds{method="GET|POST|PUT|DELETE", path="...", status="200|404|500"}

Histogram of HTTP request durations.

Requests Total

terralist_http_requests_total{method="...", path="...", status="..."}

Total HTTP requests.

Request Size

terralist_http_request_size_bytes{method="...", path="..."}

Histogram of HTTP request body sizes.

Response Size

terralist_http_response_size_bytes{method="...", path="..."}

Histogram of HTTP response sizes.

Example queries:

# Request rate per endpoint
sum by (path) (rate(terralist_http_requests_total[5m]))

# Error rate (4xx + 5xx)
sum(rate(terralist_http_requests_total{status=~"4..|5.."}[5m]))

# P99 response time
histogram_quantile(0.99, sum(rate(terralist_http_request_duration_seconds_bucket[5m])) by (le))
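
The size histograms can be queried the same way as the duration histogram. A sketch, using the standard `_sum`, `_count`, and `_bucket` series that Prometheus histograms expose:

```promql
# Average request body size (bytes) per path
sum by (path) (rate(terralist_http_request_size_bytes_sum[5m]))
/
sum by (path) (rate(terralist_http_request_size_bytes_count[5m]))

# P95 response size (bytes) per path
histogram_quantile(0.95, sum by (le, path) (rate(terralist_http_response_size_bytes_bucket[5m])))
```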


Database Metrics

Connection pool and query performance metrics.

Active Connections

terralist_database_connections_active

Current number of active database connections.

Idle Connections

terralist_database_connections_idle

Current number of idle database connections in the pool.

Connections in Use

terralist_database_connections_in_use

Current number of connections actively executing queries.

Wait Count

terralist_database_connections_wait_count_total

Total number of times a connection had to wait.

Wait Duration

terralist_database_connections_wait_duration_seconds_total

Total time spent waiting for connections.

Example queries:

# Connection pool utilization %
(terralist_database_connections_in_use / terralist_database_connections_active) * 100

# Average wait time
rate(terralist_database_connections_wait_duration_seconds_total[5m]) 
/ 
rate(terralist_database_connections_wait_count_total[5m])
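
The average wait time query returns `NaN` while no waits occur, because both rates are zero. The two counters are also useful on their own, as sketched below.

```promql
# How many times per second a caller has to wait for a connection
rate(terralist_database_connections_wait_count_total[5m])

# Seconds spent waiting per second of wall-clock time, summed across all waiters
rate(terralist_database_connections_wait_duration_seconds_total[5m])
```

A wait rate that stays above zero suggests the pool is regularly saturated, even if the average wait is short.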


Prometheus Configuration

Add Terralist to your prometheus.yml:

scrape_configs:
  - job_name: 'terralist'
    scrape_interval: 15s
    static_configs:
      - targets: ['terralist:5758']
    metrics_path: /metrics
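
Once the job is loaded, scraping can be verified with the series Prometheus generates automatically for every target. The `job` value below matches the `job_name` in the configuration above.

```promql
# 1 when the last scrape of the target succeeded, 0 when it failed
up{job="terralist"}

# How long each scrape of the /metrics endpoint takes, in seconds
scrape_duration_seconds{job="terralist"}
```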

Alerting Examples

Note: These are example alerts to get you started. Thresholds, severity levels, and time windows should be adjusted based on your specific workload, SLA requirements, and operational experience. Start with conservative thresholds and refine them based on actual production behavior to avoid alert fatigue.

Storage Backend Alerts

High Storage Error Rate

groups:
  - name: terralist_storage
    rules:
      - alert: HighStorageErrorRate
        expr: |
          (
            sum by (backend) (rate(terralist_storage_operations_total{status="error"}[5m]))
            /
            sum by (backend) (rate(terralist_storage_operations_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High storage error rate on {{ $labels.backend }} ({{ $value | humanizePercentage }})"
          description: "Storage backend {{ $labels.backend }} has error rate above 5% for 5 minutes"

Slow Storage Operations

      - alert: SlowStorageOperations
        expr: |
          histogram_quantile(0.95,
            sum by (le, backend, operation) (rate(terralist_storage_operation_duration_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow {{ $labels.operation }} on {{ $labels.backend }} (P95: {{ $value | humanizeDuration }})"
          description: "Storage {{ $labels.operation }} P95 latency exceeds 5s on {{ $labels.backend }}"

      - alert: CriticallySlowStorageOperations
        expr: |
          histogram_quantile(0.95,
            sum by (le, backend, operation) (rate(terralist_storage_operation_duration_seconds_bucket[5m]))
          ) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical storage latency on {{ $labels.backend }} (P95: {{ $value | humanizeDuration }})"
          description: "Storage operations critically slow, may impact user experience"

Storage Backend Down

      - alert: StorageBackendNoActivity
        expr: |
          sum by (backend) (increase(terralist_storage_operations_total[1h])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No storage activity on {{ $labels.backend }} for over 1 hour"
          description: "Storage backend {{ $labels.backend }} may be unreachable or experiencing issues"

HTTP and API Alerts

High HTTP Error Rate

  - name: terralist_http
    rules:
      - alert: HighHTTPErrorRate
        expr: |
          (
            sum(rate(terralist_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(terralist_http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP 5xx error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests returning 5xx errors"

      - alert: HighHTTPClientErrorRate
        expr: |
          (
            sum(rate(terralist_http_requests_total{status=~"4.."}[5m]))
            /
            sum(rate(terralist_http_requests_total[5m]))
          ) > 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP 4xx error rate ({{ $value | humanizePercentage }})"
          description: "More than 20% of requests returning 4xx errors, check authentication"

Slow HTTP Responses

      - alert: SlowHTTPResponses
        expr: |
          histogram_quantile(0.95,
            sum by (le, path) (rate(terralist_http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow responses on {{ $labels.path }} (P95: {{ $value | humanizeDuration }})"
          description: "API endpoint {{ $labels.path }} responding slowly"

Database Alerts

Connection Pool Exhaustion

  - name: terralist_database
    rules:
      - alert: DatabaseConnectionPoolNearLimit
        expr: |
          (terralist_database_connections_in_use / terralist_database_connections_active) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool {{ $value | humanizePercentage }} utilized"
          description: "Connection pool usage above 80%, consider increasing pool size"

      - alert: DatabaseConnectionPoolExhausted
        expr: |
          (terralist_database_connections_in_use / terralist_database_connections_active) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted ({{ $value | humanizePercentage }})"
          description: "Immediate action required - connection pool at capacity"

High Connection Wait Times

      - alert: HighDatabaseConnectionWaitTime
        expr: |
          rate(terralist_database_connections_wait_duration_seconds_total[5m])
          /
          rate(terralist_database_connections_wait_count_total[5m])
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High database connection wait time ({{ $value | humanizeDuration }} avg)"
          description: "Requests are waiting too long for database connections"

System Health Alerts

Service Down

  - name: terralist_health
    rules:
      - alert: TerralistDown
        expr: up{job="terralist"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Terralist service is down"
          description: "Terralist instance unreachable for 2 minutes"

High Error Count

      - alert: HighApplicationErrorCount
        expr: |
          increase(terralist_errors_total[5m]) > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High application error count ({{ $value }} in 5m)"
          description: "Application experiencing elevated error rates"