Log-Based Alerting
UDS Core deploys Loki with log alerting and metrics recording capabilities enabled through the Loki Ruler. This allows you to create alerts based on log patterns and generate metrics from logs using recording rules. The Loki Ruler evaluates LogQL expressions and can send alerts to Alertmanager and recording rule metrics to Prometheus.
Overview
Log-based alerting in UDS Core provides two main capabilities:
- Loki Alerting Rules: Generate alerts when specific log patterns are detected
- Loki Recording Rules: Create metrics from log queries for better performance and to enable metric-based alerting
The Loki Ruler component evaluates these rules and integrates with the existing monitoring stack:
- Alerts -> Sent to Alertmanager
- Recording Rules -> Metrics sent to Prometheus for storage and further alerting
Architecture Flow
- Applications send logs to Loki
- Loki Ruler (part of the Loki backend pods) evaluates rules against log data
- Alerting rules trigger notifications via Alertmanager
- Alerts are routed through notification channels
- Recording rules create metrics stored in Prometheus
- Metrics are available for dashboards and further alerting
Deploying Rules
Rules are deployed using Kubernetes ConfigMaps or Secrets with specific labels that Loki's sidecar watches for. The sidecar automatically loads any ConfigMaps/Secrets labeled with loki_rule: "1".
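For example, a minimal rule ConfigMap (names here are illustrative) needs only the label and a rules file under data:

apiVersion: v1
kind: ConfigMap
metadata:
  name: example-loki-rules # hypothetical name
  namespace: my-app-namespace
  labels:
    loki_rule: "1" # required label watched by the Loki sidecar
data:
  rules.yaml: |
    groups: [] # rule groups go here (see "Rule File Structure" below)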
Rule File Structure
Rules must be in YAML format following the Loki ruler configuration specification:
groups:
  - name: example-group
    rules:
      - alert: HighErrorRate
        expr: |
          rate({app="my-app"} |= "ERROR" [5m]) > 0.1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"
      - record: my_app:error_rate
        expr: |
          rate({app="my-app"} |= "ERROR" [5m])
Deploying Alerting Rules
Create a ConfigMap with your alerting rules. Here's an example that monitors error rates and application log availability:
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-alert-rules
  namespace: my-app-namespace
  labels:
    loki_rule: "1" # Required label for Loki sidecar to pick up
data:
  rules.yaml: |
    groups:
      - name: my-app-alerts
        rules:
          - alert: ApplicationErrors
            expr: |
              sum(rate({namespace="my-app-namespace"} |= "ERROR" [5m])) > 0.05
            for: 2m
            labels:
              severity: warning
              service: my-app
            annotations:
              summary: "High error rate for my-app"
              runbook_url: "https://wiki.company.com/runbooks/my-app-errors"
          - alert: ApplicationLogsDown
            expr: |
              absent_over_time({namespace="my-app-namespace",app="my-app"}[5m])
            for: 1m
            labels:
              severity: critical
              service: my-app
            annotations:
              summary: "Application is not producing logs"
              description: "No logs received from application for 5 minutes"
Deploying Recording Rules
Recording rules create metrics from log queries, which can be used to enable more performant metric-based alerting:
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-recording-rules
  namespace: my-app-namespace
  labels:
    loki_rule: "1"
data:
  recording-rules.yaml: |
    groups:
      - name: my-app-metrics
        interval: 30s # How often to evaluate the rules
        rules:
          - record: my_app:request_rate # Name of the metric produced in Prometheus
            expr: |
              sum(rate({namespace="my-app-namespace",app="my-app"} |= "REQUEST" [1m]))
          - record: my_app:error_rate
            expr: |
              sum(rate({namespace="my-app-namespace",app="my-app"} |= "ERROR" [1m]))
          - record: my_app:error_percentage
            expr: |
              (
                sum(rate({namespace="my-app-namespace",app="my-app"} |= "ERROR" [1m]))
                /
                sum(rate({namespace="my-app-namespace",app="my-app"}[1m]))
              ) * 100
You can then create Prometheus alerting rules based on these recorded metrics using a PrometheusRule CRD (alerts fired by these rules are automatically sent to Alertmanager):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-prometheus-alerts
  namespace: my-app-namespace
  labels:
    prometheus: kube-prometheus-stack-prometheus
spec:
  groups:
    - name: my-app-prometheus-alerts
      rules:
        - alert: HighErrorPercentage
          expr: my_app:error_percentage > 5
          for: 5m
          labels:
            severity: warning
            service: my-app
          annotations:
            description: "High error rate on my-app"
            runbook_url: "https://wiki.company.com/runbooks/my-app-high-errors"
Best Practices
Rule Organization
- Group related rules together using meaningful group names (see the sketch after this list)
- Use consistent labeling across rules for better organization
- Include runbook URLs in alert annotations for operational guidance
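As a sketch of these conventions (group names, labels, and the runbook URL are illustrative), related alerting and recording rules for one service can share a file and a consistent label set:

groups:
  - name: my-app-alerts # alerting rules for one service grouped together
    rules:
      - alert: ApplicationErrors
        expr: |
          sum(rate({app="my-app"} |= "ERROR" [5m])) > 0.05
        for: 2m
        labels:
          severity: warning
          team: backend # consistent labels across all my-app rules
          service: my-app
        annotations:
          summary: "High error rate for my-app"
          runbook_url: "https://wiki.company.com/runbooks/my-app-errors"
  - name: my-app-metrics # recording rules for the same service, in their own group
    rules:
      - record: my_app:error_rate
        labels:
          team: backend
          service: my-app
        expr: |
          sum(rate({app="my-app"} |= "ERROR" [5m]))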
Performance Considerations
- Use recording rules for complex or frequently evaluated queries
- Set appropriate evaluation intervals (default: 1m, adjust based on needs)
- Be mindful of cardinality when creating recorded metrics (see the sketch below)
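A minimal sketch of these points (the metric name is illustrative): a recording rule can pre-aggregate an expensive query on its own interval while keeping the label set, and therefore the resulting series count, small:

groups:
  - name: my-app-performance
    interval: 1m # evaluate once per minute; raise this for less time-sensitive rules
    rules:
      # Aggregating by namespace only keeps cardinality low; avoid grouping by
      # high-cardinality labels such as pod names or request IDs.
      - record: namespace:log_error_rate:sum # illustrative metric name
        expr: |
          sum by (namespace) (
            rate({namespace="my-app-namespace"} |= "ERROR" [5m])
          )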
Alert Quality
- Set appropriate "for" durations to reduce noise (see the sketch below)
- Include meaningful labels for routing and filtering
- Provide actionable descriptions and summary annotations
- Test alerts before deploying to production
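For example (thresholds and durations here are illustrative), a longer "for" duration plus a warning/critical split reduces noise while still escalating sustained problems:

groups:
  - name: my-app-alert-quality
    rules:
      - alert: ElevatedErrorRate
        expr: |
          sum(rate({app="my-app"} |= "ERROR" [5m])) > 0.05
        for: 10m # must stay above the threshold for 10 minutes before firing
        labels:
          severity: warning # routed to a low-urgency channel
          service: my-app
        annotations:
          summary: "Elevated error rate for my-app"
          description: "Error rate above 0.05/s for 10 minutes; check recent deployments first."
      - alert: SevereErrorRate
        expr: |
          sum(rate({app="my-app"} |= "ERROR" [5m])) > 0.5
        for: 5m # a much higher threshold can justify firing sooner
        labels:
          severity: critical # pages the on-call rotation
          service: my-app
        annotations:
          summary: "Severe error rate for my-app"
          runbook_url: "https://wiki.company.com/runbooks/my-app-errors"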
Integration with Existing Monitoring
Recording rules from Loki create metrics that integrate seamlessly with existing Prometheus-based monitoring:
- Grafana Dashboards: Use recorded metrics in dashboards alongside other Prometheus metrics (see the query sketch after this list)
- Prometheus Alerts: Create traditional metric-based alerts using recorded metrics
- Alertmanager: All alerts (direct from Loki and Prometheus-based) route through the same Alertmanager instance in UDS Core
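For instance, because recorded metrics are ordinary Prometheus series, a Grafana panel can combine them with native metrics in a single PromQL query (the http_requests_total metric name is illustrative):

# Hypothetical Grafana panel query: log-derived errors per HTTP request
sum(my_app:error_rate)
/
sum(rate(http_requests_total{namespace="my-app-namespace"}[5m]))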
This approach provides a unified monitoring experience while leveraging the power of log-based analysis.