Monitoring and Observability¶
Terralist exposes Prometheus metrics to provide deep insights into registry operations, storage backend performance, and system health.
Metrics Endpoint¶
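Metrics are available at the /metrics path of the Terralist server (the scrape configuration on this page targets terralist:5758).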
This endpoint returns metrics in Prometheus format and can be scraped by Prometheus or compatible monitoring systems.
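To make the format concrete, here is a minimal Python sketch that parses a Prometheus text payload into (name, labels, value) tuples. It is not part of Terralist. The payload below is hypothetical: the label values and numbers are made up for illustration, and the parser handles only simple sample lines, not the full exposition grammar.

```python
import re

# Matches sample lines such as: metric_name{label="value",...} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload: label values and numbers are illustrative only.
EXAMPLE = '''
# HELP terralist_artifacts_uploaded_total Total number of uploaded artifacts.
# TYPE terralist_artifacts_uploaded_total counter
terralist_artifacts_uploaded_total{authority="example",type="module"} 12
terralist_artifacts_uploaded_total{authority="example",type="provider"} 3
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```

In practice the text would come from an HTTP GET against the endpoint, for example with urllib.request.urlopen, rather than from an inline string.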
Available Metrics¶
Application Metrics¶
Build Information¶
Static metadata about the running Terralist instance.
Uptime¶
Total seconds since Terralist started.
Errors¶
Total count of errors by type.
Artifacts Metrics¶
Track module and provider operations across authorities.
Uploads¶
Total number of uploaded artifacts.
Example queries:
# Upload rate per minute (rate() returns a per-second average, hence the factor of 60)
rate(terralist_artifacts_uploaded_total[5m]) * 60
# Uploads by authority
sum by (authority) (terralist_artifacts_uploaded_total)
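Because these metrics are counters, queries usually go through rate(), which yields a per-second average over the window. The Python sketch below shows the idea under simplifying assumptions: it handles counter resets but omits the window-boundary extrapolation that Prometheus also applies. The sample values are hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter over (timestamp, value) samples.

    Simplified: counter resets are handled, but the boundary extrapolation
    performed by Prometheus' rate() is not reproduced here.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0.
        increase += curr if curr < prev else curr - prev
    return increase / elapsed


# Hypothetical samples: (unix_seconds, counter_value), scraped every 15s.
window = [(0, 100), (15, 103), (30, 109), (45, 112)]
rate = per_second_rate(window)
print(f"{rate:.4f} uploads/sec, {rate * 60:.1f} uploads/min")
```

Multiplying the result by 60 converts it to a per-minute figure.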
Downloads¶
Total number of downloaded artifacts.
Example queries:
# Authorities with the most module downloads
topk(10, sum by (authority) (terralist_artifacts_downloaded_total{type="module"}))
Deletions¶
Total number of deleted artifacts.
Current Artifact Count¶
Current number of artifacts (a gauge that increases with uploads and decreases with deletions).
Example queries:
# Total artifacts in registry
sum(terralist_artifacts_total)
# Artifacts per authority and type
sum by (authority, type) (terralist_artifacts_total)
Request Metrics¶
Requests by Authority¶
Total requests grouped by authority and operation type.
Example queries:
# Request rate per authority
sum by (authority) (rate(terralist_requests_by_authority_total[5m]))
# Most active authorities
topk(5, sum by (authority) (rate(terralist_requests_by_authority_total[5m])))
API Keys Metrics¶
Current number of API keys by authority and status.
Example queries:
# Total active API keys
sum(terralist_api_keys_total{status="active"})
# Expired keys that need cleanup
terralist_api_keys_total{status="expired"} > 0
Storage Backend Metrics¶
Monitor the performance and health of storage backends (S3, Azure, GCS, Local).
Operations¶
terralist_storage_operations_total{operation="upload|download|delete", backend="s3|azure|gcs|local", status="success|error"}
Total storage operations by type, backend, and status.
Example queries:
# Error rate by backend
sum by (backend) (rate(terralist_storage_operations_total{status="error"}[5m]))
# Success rate percentage
sum(rate(terralist_storage_operations_total{status="success"}[5m]))
/
sum(rate(terralist_storage_operations_total[5m])) * 100
Data Transfer¶
Total bytes transferred through storage operations.
Example queries:
# Upload throughput (bytes/sec)
rate(terralist_storage_bytes_total{operation="upload"}[5m])
# Total data uploaded per backend
sum by (backend) (terralist_storage_bytes_total{operation="upload"})
Operation Duration¶
Histogram of storage operation durations.
Example queries:
# P95 upload latency
histogram_quantile(0.95, sum(rate(terralist_storage_operation_duration_seconds_bucket{operation="upload"}[5m])) by (le, backend))
# P50 download latency by backend
histogram_quantile(0.50, sum(rate(terralist_storage_operation_duration_seconds_bucket{operation="download"}[5m])) by (le, backend))
# Operations slower than 5s (all observations minus those that finished within 5s)
terralist_storage_operation_duration_seconds_count - ignoring(le) terralist_storage_operation_duration_seconds_bucket{le="5.0"}
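The percentile queries rely on cumulative buckets: each le bucket counts every observation at or below its bound. The Python sketch below shows the linear interpolation that PromQL documents for histogram_quantile() on classic histograms. It assumes 0 <= q <= 1 and simplifies edge cases. The bucket bounds and counts are hypothetical and are not Terralist's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan  # a +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bucket: report its bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    if count == below:
        return lower  # empty bucket, nothing to interpolate
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative buckets: (le, count).
BUCKETS = [(0.5, 40), (1.0, 70), (2.5, 90), (5.0, 98), (math.inf, 100)]

print("P95:", histogram_quantile(0.95, BUCKETS))
# Observations slower than 5s: total minus the cumulative le=5.0 bucket.
print("slower than 5s:", BUCKETS[-1][1] - BUCKETS[3][1])
```

The estimate can only be as precise as the bucket boundaries, so a quantile that falls in the +Inf bucket is reported as the highest finite bound.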
HTTP Metrics¶
Standard HTTP metrics provided by Prometheus middleware.
Request Duration¶
terralist_http_request_duration_seconds{method="GET|POST|PUT|DELETE", path="...", status="200|404|500"}
Histogram of HTTP request durations.
Requests Total¶
Total HTTP requests.
Request Size¶
Histogram of HTTP request body sizes.
Response Size¶
Histogram of HTTP response sizes.
Example queries:
# Request rate per endpoint
sum by (path) (rate(terralist_http_requests_total[5m]))
# Error rate (4xx + 5xx)
sum(rate(terralist_http_requests_total{status=~"4..|5.."}[5m]))
# P99 response time
histogram_quantile(0.99, sum(rate(terralist_http_request_duration_seconds_bucket[5m])) by (le))
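The status matcher in the error-rate query uses a regular expression, and PromQL anchors such expressions at both ends. The Python sketch below mirrors that with re.fullmatch and turns per-status request rates into an error ratio. The status codes and rates are hypothetical.

```python
import re


def error_ratio(rates, pattern=r"5.."):
    """Share of the total request rate whose status fully matches the regex.

    rates: iterable of (status_code_string, requests_per_second) pairs.
    """
    matcher = re.compile(pattern)
    total = sum(per_second for _, per_second in rates)
    if total == 0:
        return 0.0
    # fullmatch mirrors PromQL, where =~ patterns are anchored at both ends.
    errors = sum(
        per_second for status, per_second in rates if matcher.fullmatch(status)
    )
    return errors / total


# Hypothetical per-status request rates (requests/sec).
RATES = [("200", 9.0), ("404", 0.6), ("500", 0.3), ("503", 0.1)]

print(f"5xx ratio: {error_ratio(RATES):.2%}")
print(f"4xx+5xx ratio: {error_ratio(RATES, r'4..|5..'):.2%}")
```

Dividing by the total rate gives a ratio that stays comparable across traffic levels, unlike the raw error rate.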
Database Metrics¶
Connection pool and query performance metrics.
Active Connections¶
Current number of active database connections.
Idle Connections¶
Current number of idle database connections in the pool.
Connections in Use¶
Current number of connections actively executing queries.
Wait Count¶
Total number of times a connection had to wait.
Wait Duration¶
Total time spent waiting for connections.
Example queries:
# Connection pool utilization %
(terralist_database_connections_in_use / terralist_database_connections_active) * 100
# Average wait time
rate(terralist_database_connections_wait_duration_seconds_total[5m])
/
rate(terralist_database_connections_wait_count_total[5m])
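The arithmetic behind these two queries is sketched below in Python. The function and parameter names are hypothetical. The inputs stand for the gauge values and for the increases of the two wait counters over the same window.

```python
def pool_utilization(in_use, active):
    """Percentage of active connections that are currently in use."""
    if active == 0:
        return 0.0
    return in_use / active * 100


def average_wait_seconds(wait_seconds_delta, wait_count_delta):
    """Mean wait per blocked acquisition, from two counter increases."""
    if wait_count_delta == 0:
        # Nothing waited in the window. PromQL would yield NaN (0/0) here,
        # which never satisfies a "> threshold" comparison.
        return 0.0
    return wait_seconds_delta / wait_count_delta


# Hypothetical readings.
print(pool_utilization(in_use=8, active=10))   # percent of pool in use
print(average_wait_seconds(1.5, 6))            # seconds per wait
```

Both values are averages over the window, so short spikes in wait time can be hidden by a long range such as 5m.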
Prometheus Configuration¶
Add Terralist to your prometheus.yml:
scrape_configs:
  - job_name: 'terralist'
    scrape_interval: 15s
    static_configs:
      - targets: ['terralist:5758']
    metrics_path: /metrics
Alerting Examples¶
Note: These are example alerts to get you started. Thresholds, severity levels, and time windows should be adjusted based on your specific workload, SLA requirements, and operational experience. Start with conservative thresholds and refine them based on actual production behavior to avoid alert fatigue.
Storage Backend Alerts¶
High Storage Error Rate¶
groups:
  - name: terralist_storage
    rules:
      - alert: HighStorageErrorRate
        expr: |
          (
            sum by (backend) (rate(terralist_storage_operations_total{status="error"}[5m]))
            /
            sum by (backend) (rate(terralist_storage_operations_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High storage error rate on {{ $labels.backend }} ({{ $value | humanizePercentage }})"
          description: "Storage backend {{ $labels.backend }} has error rate above 5% for 5 minutes"
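The for field delays firing until the expression has held continuously for the given duration. Until then the alert is pending, and a single evaluation where the expression does not match resets it. The Python sketch below simulates that state machine. It assumes a 1-minute rule evaluation interval, so for: 5m corresponds to five intervals, and the error ratios are hypothetical.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, one per evaluation (True = expr matched).
    for_evals: the "for" duration expressed in evaluation intervals.
    """
    states = []
    streak = 0  # consecutive evaluations in which the expression matched
    for matched in condition_by_eval:
        streak = streak + 1 if matched else 0
        if streak == 0:
            states.append("inactive")
        elif streak > for_evals:
            # The first match starts the clock, so for_evals full intervals
            # have elapsed once the streak exceeds for_evals.
            states.append("firing")
        else:
            states.append("pending")
    return states


# Hypothetical error ratios, one per 1-minute evaluation; threshold is 5%.
ratios = [0.02, 0.08, 0.09, 0.07, 0.06, 0.08, 0.10, 0.03]
for ratio, state in zip(ratios, alert_states([r > 0.05 for r in ratios], 5)):
    print(f"{ratio:.2f} -> {state}")
```

A longer for value reduces noise from brief spikes at the cost of slower notification.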
Slow Storage Operations¶
      - alert: SlowStorageOperations
        expr: |
          histogram_quantile(0.95,
            sum by (le, backend, operation) (rate(terralist_storage_operation_duration_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow {{ $labels.operation }} on {{ $labels.backend }} (P95: {{ $value | humanizeDuration }})"
          description: "Storage {{ $labels.operation }} P95 latency exceeds 5s on {{ $labels.backend }}"
      - alert: CriticallySlowStorageOperations
        expr: |
          histogram_quantile(0.95,
            sum by (le, backend, operation) (rate(terralist_storage_operation_duration_seconds_bucket[5m]))
          ) > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical storage latency on {{ $labels.backend }} (P95: {{ $value | humanizeDuration }})"
          description: "Storage operations critically slow, may impact user experience"
Storage Backend Down¶
      - alert: StorageBackendNoActivity
        # Matches when a backend recorded no storage operations during the last hour.
        expr: |
          sum by (backend) (increase(terralist_storage_operations_total[1h])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No storage activity on {{ $labels.backend }} for over 1 hour"
          description: "Storage backend {{ $labels.backend }} may be unreachable or experiencing issues"
HTTP and API Alerts¶
High HTTP Error Rate¶
  - name: terralist_http
    rules:
      - alert: HighHTTPErrorRate
        expr: |
          (
            sum(rate(terralist_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(terralist_http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP 5xx error rate ({{ $value | humanizePercentage }})"
          description: "More than 5% of requests returning 5xx errors"
      - alert: HighHTTPClientErrorRate
        expr: |
          (
            sum(rate(terralist_http_requests_total{status=~"4.."}[5m]))
            /
            sum(rate(terralist_http_requests_total[5m]))
          ) > 0.20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High HTTP 4xx error rate ({{ $value | humanizePercentage }})"
          description: "More than 20% of requests returning 4xx errors, check authentication"
Slow HTTP Responses¶
      - alert: SlowHTTPResponses
        expr: |
          histogram_quantile(0.95,
            sum by (le, path) (rate(terralist_http_request_duration_seconds_bucket[5m]))
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow responses on {{ $labels.path }} (P95: {{ $value | humanizeDuration }})"
          description: "API endpoint {{ $labels.path }} responding slowly"
Database Alerts¶
Connection Pool Exhaustion¶
  - name: terralist_database
    rules:
      - alert: DatabaseConnectionPoolNearLimit
        expr: |
          (terralist_database_connections_in_use / terralist_database_connections_active) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection pool {{ $value | humanizePercentage }} utilized"
          description: "Connection pool usage above 80%, consider increasing pool size"
      - alert: DatabaseConnectionPoolExhausted
        expr: |
          (terralist_database_connections_in_use / terralist_database_connections_active) > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool nearly exhausted ({{ $value | humanizePercentage }})"
          description: "Immediate action required - connection pool at capacity"
High Connection Wait Times¶
      - alert: HighDatabaseConnectionWaitTime
        expr: |
          rate(terralist_database_connections_wait_duration_seconds_total[5m])
          /
          rate(terralist_database_connections_wait_count_total[5m])
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High database connection wait time ({{ $value | humanizeDuration }} avg)"
          description: "Applications waiting too long for database connections"
System Health Alerts¶
Service Down¶
  - name: terralist_health
    rules:
      - alert: TerraListDown
        expr: up{job="terralist"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Terralist service is down"
          description: "Terralist instance unreachable for 2 minutes"