# KubeNidra Agent Troubleshooting

This guide helps you diagnose and resolve common issues with the KubeNidra Agent.
## Quick Diagnostics

### Check Agent Status

```sh
# Check if agent is running
kubectl get pods -n kubenidra -l kubenidra/component=agent

# Check agent logs
kubectl logs -n kubenidra -l kubenidra/component=agent

# Check agent health
kubectl exec -n kubenidra deployment/kubenidra-agent -- curl -s http://localhost:8118/healthz
```

### Check Configuration

```sh
# View agent configuration
kubectl get configmap -n kubenidra kubenidra-agent-cm -o yaml
```
```sh
# Check environment variables if you are not using the YAML config option
kubectl exec -n kubenidra deployment/kubenidra-agent -- env | grep KUBENIDRA
```

## Common Issues

### 1. Agent Not Starting

**Symptoms:**

- Agent pod in `CrashLoopBackOff` or `Error` state
- No logs from agent

**Diagnosis:**

```sh
# Check pod status
kubectl describe pod -n kubenidra -l kubenidra/component=agent

# Check logs
kubectl logs -n kubenidra -l kubenidra/component=agent --previous

# Check events
kubectl get events -n kubenidra --sort-by='.lastTimestamp'
```

**Common Causes:**

- **Configuration Error**

  ```sh
  # Check config file syntax
  kubectl get configmap -n kubenidra kubenidra-agent-cm -o jsonpath='{.data.kubenidra\.yaml}' | yq eval
  ```

- **Missing RBAC Permissions**

  ```sh
  # Check if service account exists
  kubectl get serviceaccount -n kubenidra kubenidra-agent-sa
  ```
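  If the service account or its cluster-wide binding is missing, recreate them. A minimal sketch (resource names are taken from the commands in this section; the API groups and verbs are assumptions — check your chart or manifests for the real rules):

  ```yaml
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: kubenidra-agent
  rules:
    - apiGroups: ["apps"]
      resources: ["deployments"]
      verbs: ["get", "list", "watch", "patch"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: kubenidra-agent-crb
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: kubenidra-agent
  subjects:
    - kind: ServiceAccount
      name: kubenidra-agent-sa
      namespace: kubenidra
  ```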
  ```sh
  # Check cluster role binding
  kubectl get clusterrolebinding kubenidra-agent-crb
  ```

- **Resource Limits**

  ```sh
  # Check resource usage
  kubectl top pod -n kubenidra -l kubenidra/component=agent
  ```

### 2. Prometheus Connection Issues

**Symptoms:**

- Agent logs show "Failed to create Prometheus client"
- Workloads not snoozed because metrics validation fails

**Diagnosis:**

```sh
# Check Prometheus connectivity from the agent
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s "http://prometheus:9090/api/v1/query?query=up"
```
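A healthy query responds with JSON whose top-level `status` field is `success`. A quick sketch of checking that field without extra tooling — the canned `resp` string stands in for the curl output above:

```sh
# Canned response standing in for: curl -s ".../api/v1/query?query=up"
resp='{"status":"success","data":{"resultType":"vector","result":[]}}'

# Pull out the status field with sed (avoids a jq dependency)
status=$(printf '%s' "$resp" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
echo "$status"
[ "$status" = "success" ] || echo "Prometheus query failed: $resp"
```

With `jq` available, `curl -s ... | jq -r .status` does the same thing more robustly.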
```sh
# Check Prometheus service
kubectl get svc -n monitoring prometheus

# Check network connectivity
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  nslookup prometheus.monitoring.svc.cluster.local
```

**Solutions:**

- **Fix Prometheus Endpoint**

  ```yaml
  # Update configuration
  prometheus:
    endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
    timeout: "30s"
  ```

- **Check Prometheus Service**

  ```sh
  # Verify Prometheus is running
  kubectl get pods -n monitoring -l app=prometheus

  # Check service endpoints
  kubectl get endpoints -n monitoring prometheus
  ```

- **Network Policy Issues**

  ```yaml
  # Allow agent to access Prometheus
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-kubenidra-to-prometheus
    namespace: monitoring
  spec:
    podSelector:
      matchLabels:
        app: prometheus
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                name: kubenidra
        ports:
          - protocol: TCP
            port: 9090
  ```

### 3. Workloads Not Being Snoozed

**Symptoms:**

- Workloads remain active despite being idle
- No snooze operations in logs

**Diagnosis:**

```sh
# Check which workloads have KubeNidra enabled
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/enabled}{"\n"}{end}'
```
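The jsonpath loop above prints one `namespace<TAB>name<TAB>annotation` row per deployment, with an empty third column when the annotation is unset. A small awk filter keeps only the enabled workloads — the `printf` line here is canned sample output; in practice, pipe the kubectl command into the filter:

```sh
# Canned sample of the jsonpath output; replace with the kubectl pipeline
printf 'development\tmy-app\ttrue\nproduction\tapi\t\n' |
  awk -F'\t' '$3 == "true" {print $1 "/" $2}'
```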
```sh
# Check workload annotations
kubectl get deployment MY_APP_DEPLOYMENT -n development -o yaml | yq eval '.metadata.annotations'

# Check agent logs for a specific workload
kubectl logs -n kubenidra -l kubenidra/component=agent | grep "my-app"
```

**Common Causes:**

- **KubeNidra Not Enabled**

  ```yaml
  # Enable KubeNidra on the workload
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: my-app
    namespace: development
    annotations:
      kubenidra/enabled: "true"
  ```

- **Namespace Not Watched**

  ```sh
  # Check watched namespaces
  kubectl get configmap -n kubenidra kubenidra-agent-cm -o jsonpath='{.data.kubenidra\.yaml}'
  ```

- **Manual Override Active**

  ```sh
  # Check for manual overrides
  kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/manual-override}{"\n"}{end}'
  ```
  ```sh
  # Check for manual overrides with a duration
  kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/manual-override-until}{"\n"}{end}'
  ```

- **Metrics Validation Failing**

  ```sh
  # Check Prometheus data (-g disables curl's {} globbing so the query passes through)
  kubectl exec -n kubenidra deployment/kubenidra-agent -- \
    curl -s -g "http://prometheus:9090/api/v1/query?query=container_cpu_usage_seconds_total{namespace=\"development\"}"
  ```

**Solutions:**

- **Enable KubeNidra**

  ```sh
  # Enable on an existing deployment
  kubectl patch deployment my-app -n development -p '{"metadata":{"annotations":{"kubenidra/enabled":"true"}}}'
  ```

- **Add Namespace to Watch List**

  ```yaml
  # Update configuration
  watched_namespaces:
    - "development"
    - "staging"
  ```

- **Remove Manual Override**

  ```sh
  # Remove the manual override
  kubectl patch deployment my-app -n development -p '{"metadata":{"annotations":{"kubenidra/manual-override":null}}}'
  ```

- **Adjust Metrics Validation**

  ```yaml
  # Reduce validation requirements
  snooze:
    prometheus_validation_enabled: false
    # Or adjust the thresholds instead
    minimum_data_coverage: 0.5
    minimum_data_points: 3
  ```

### 4. Workloads Not Waking Up

**Symptoms:**

- Snoozed workloads don't wake up on schedule
- Manual wake requests are ignored

**Diagnosis:**

```sh
# Check wake schedules
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/wake-schedule}{"\n"}{end}'
```
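The `kubenidra/wake-schedule` value above looks like `09:00-17:00,mon-fri`. Assuming that format — a time window plus a day range, which is a reading of the annotation rather than a documented spec — a minimal sketch of checking whether a given time falls inside the window (`in_window` is a hypothetical helper; the day part after the comma is ignored here):

```sh
# in_window SCHEDULE HH:MM -> exit 0 if the time is inside the window
in_window() {
  echo "${1%%,*} $2" | awk -F'[ :-]' '{
    start = $1 * 60 + $2; end = $3 * 60 + $4; t = $5 * 60 + $6
    exit !(t >= start && t < end)
  }'
}

in_window "09:00-17:00,mon-fri" "13:30" && echo "inside window"
in_window "09:00-17:00,mon-fri" "18:05" || echo "outside window"
```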
```sh
# Check whether a workload is snoozed (replicas scaled to 0)
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}'

# Check agent logs for wake operations
kubectl logs -n kubenidra -l kubenidra/component=agent | grep "wake"
```

**Solutions:**

- **Set Wake Schedule**

  ```sh
  # Set wake schedule
  kubectl patch deployment my-app -n development \
    -p '{"metadata":{"annotations":{"kubenidra/wake-schedule":"09:00-17:00,mon-fri"}}}'
  ```

- **Manual Wake**

  ```sh
  # Trigger a manual wake
  kubectl patch deployment my-app -n development \
    -p '{"metadata":{"annotations":{"kubenidra/wake-now":"true"}}}'
  ```

- **Check Timezone**

  ```sh
  # Check the agent's timezone
  kubectl exec -n kubenidra deployment/kubenidra-agent -- date
  ```

### 5. Too Many Operations

**Symptoms:**

- Agent logs show "backoff" messages
- Operations stopped due to rate limiting

**Diagnosis:**

```sh
# Check operation history
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/operation-history}{"\n"}{end}'

# Check backoff status
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/backoff-until}{"\n"}{end}'
```
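The `kubenidra/backoff-until` value is a timestamp. Assuming RFC 3339 format (an assumption — check what the agent actually writes), a sketch of testing whether a backoff is still active, using GNU `date`:

```sh
# backoff_active TIMESTAMP -> exit 0 while the backoff deadline is in the future
backoff_active() {
  until_s=$(date -d "$1" +%s) || return 2   # GNU date; BSD date needs -j -f instead
  [ "$until_s" -gt "$(date +%s)" ]
}

backoff_active "2099-01-01T00:00:00Z" && echo "backoff still active"
backoff_active "2001-01-01T00:00:00Z" || echo "backoff expired"
```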
```sh
# Check agent metrics
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s http://localhost:1881/metrics | grep kubenidra_operations
```

**Solutions:**

- **Increase Rate Limits**

  ```yaml
  # Update configuration
  snooze:
    max_operations_per_hour: 20
    operation_cooldown: "2m"
  ```

- **Clear Backoff**

  ```sh
  # Remove the backoff annotation
  kubectl patch deployment my-app -n development \
    -p '{"metadata":{"annotations":{"kubenidra/backoff-until":null}}}'
  ```

- **Adjust Behavior Mode**

  ```yaml
  # Use conservative mode
  snooze:
    behavior_mode: "conservative"
    idle_duration: "30m"
  ```

## Performance Issues

### High Resource Usage

**Symptoms:**

- Agent pod using high CPU/memory
- Slow response times

**Diagnosis:**

```sh
# Check resource usage
kubectl top pod -n kubenidra -l kubenidra/component=agent
```

**Solutions:**

- **Limit Namespaces**

  ```yaml
  # Watch only specific namespaces
  watched_namespaces:
    - "development"
    - "staging"
  ```

- **Increase Check Intervals**

  ```yaml
  # Reduce check frequency
  snooze:
    check_interval: "10m"
    wake_check_interval: "5m"
  ```

- **Adjust Resource Limits**

  ```yaml
  # Increase resource limits
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  ```

## Recovery Procedures

### Agent Restart

```sh
# Restart the agent
kubectl rollout restart deployment/kubenidra-agent -n kubenidra
```
```sh
# Check rollout status
kubectl rollout status deployment/kubenidra-agent -n kubenidra
```
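Rather than probing health once after a restart, you can poll until the agent responds. A small generic retry helper (`wait_for` is a hypothetical name, not part of KubeNidra; pair it with the `kubectl exec ... curl .../healthz` probe from Quick Diagnostics):

```sh
# wait_for ATTEMPTS DELAY CMD... -> exit 0 as soon as CMD succeeds
wait_for() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example: wait_for 30 2 kubectl exec -n kubenidra deployment/kubenidra-agent -- \
#   curl -sf http://localhost:8118/healthz
wait_for 3 0 true && echo "healthy"
```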
```sh
# Verify health
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s http://localhost:8118/healthz
```

### Configuration Update

```sh
# Update configuration
kubectl patch configmap kubenidra-agent-cm -n kubenidra --patch-file new-config.yaml
```
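The `new-config.yaml` passed to `--patch-file` is a partial ConfigMap that gets merged into the existing one. A minimal sketch, assuming the `kubenidra.yaml` data key and the configuration keys shown earlier in this guide:

```yaml
# new-config.yaml -- partial ConfigMap merged into kubenidra-agent-cm
data:
  kubenidra.yaml: |
    prometheus:
      endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
      timeout: "30s"
    snooze:
      check_interval: "10m"
```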
```sh
# Restart to apply changes
kubectl rollout restart deployment/kubenidra-agent -n kubenidra
```

### Emergency Stop

```sh
# Scale the agent to 0 replicas; all snooze/wake activity stops
kubectl scale deployment kubenidra-agent -n kubenidra --replicas=0
```