
KubeNidra Agent Troubleshooting

This guide helps you diagnose and resolve common issues with the KubeNidra Agent.

Quick Diagnostics

Check Agent Status

# Check if agent is running
kubectl get pods -n kubenidra -l kubenidra/component=agent

# Check agent logs
kubectl logs -n kubenidra -l kubenidra/component=agent

# Check agent health
kubectl exec -n kubenidra deployment/kubenidra-agent -- curl -s http://localhost:8118/healthz

Check Configuration

# View agent configuration
kubectl get configmap -n kubenidra kubenidra-agent-cm -o yaml

# Check environment variables (if you are not using the YAML config file)
kubectl exec -n kubenidra deployment/kubenidra-agent -- env | grep KUBENIDRA
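
To inspect or lint the full YAML config offline, dump it to a file first. A quick sketch (the kubenidra.yaml key name matches the ConfigMap above):

# Dump the agent config to a local file and validate the YAML syntax
kubectl get configmap -n kubenidra kubenidra-agent-cm \
  -o jsonpath='{.data.kubenidra\.yaml}' > /tmp/kubenidra.yaml
yq eval /tmp/kubenidra.yaml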

Common Issues

1. Agent Not Starting

Symptoms:

  • Agent pod in CrashLoopBackOff or Error state
  • No logs from agent

Diagnosis:

# Check pod status
kubectl describe pod -n kubenidra -l kubenidra/component=agent

# Check logs
kubectl logs -n kubenidra -l kubenidra/component=agent --previous

# Check events
kubectl get events -n kubenidra --sort-by='.lastTimestamp'

Common Causes:

  1. Configuration Error
# Check config file syntax
kubectl get configmap -n kubenidra kubenidra-agent-cm -o jsonpath='{.data.kubenidra\.yaml}' | yq eval
  2. Missing RBAC Permissions (a minimal RBAC sketch follows this list)
# Check if service account exists
kubectl get serviceaccount -n kubenidra kubenidra-agent-sa

# Check cluster role binding
kubectl get clusterrolebinding kubenidra-agent-crb
  3. Resource Limits
# Check resource usage
kubectl top pod -n kubenidra -l kubenidra/component=agent
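
For the RBAC case, recreating the service account and binding usually gets the agent past startup. A minimal sketch, assuming the agent only needs to watch and patch Deployments and StatefulSets; the ClusterRole name and rules here are assumptions, so prefer your install's bundled manifests if you have them:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kubenidra-agent-sa
  namespace: kubenidra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubenidra-agent-role   # assumed name; match your install
rules:
  # Assumed minimum: read workloads and patch replicas/annotations
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubenidra-agent-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubenidra-agent-role
subjects:
  - kind: ServiceAccount
    name: kubenidra-agent-sa
    namespace: kubenidra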

2. Prometheus Connection Issues

Symptoms:

  • Agent logs show "Failed to create Prometheus client"
  • Workloads not being snoozed due to metrics validation

Diagnosis:

# Check Prometheus connectivity from agent
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s "http://prometheus:9090/api/v1/query?query=up"

# Check Prometheus service
kubectl get svc -n monitoring prometheus

# Check network connectivity
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  nslookup prometheus.monitoring.svc.cluster.local

Solutions:

  1. Fix Prometheus Endpoint
# Update configuration
prometheus:
  endpoint: "http://prometheus.monitoring.svc.cluster.local:9090"
  timeout: "30s"
  2. Check Prometheus Service
# Verify Prometheus is running
kubectl get pods -n monitoring -l app=prometheus

# Check service endpoints
kubectl get endpoints -n monitoring prometheus
  3. Network Policy Issues (a port-forward check after this list helps isolate this)
# Allow agent to access Prometheus
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-kubenidra-to-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: kubenidra
      ports:
        - protocol: TCP
          port: 9090
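
If the in-cluster checks keep failing, it helps to separate "Prometheus is unhealthy" from "the path from the agent is blocked". Port-forwarding goes through the API server rather than pod-to-pod networking, so a check like this succeeding while the in-pod curl fails points at DNS or a NetworkPolicy:

# Forward the Prometheus service to your workstation
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &

# If this returns results while the in-pod curl fails,
# suspect DNS or a NetworkPolicy rather than Prometheus itself
curl -s "http://localhost:9090/api/v1/query?query=up"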

3. Workloads Not Being Snoozed

Symptoms:

  • Workloads remain active despite being idle
  • No snooze operations in logs

Diagnosis:

# Check if KubeNidra is enabled
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/enabled}{"\n"}{end}'

# Check workload annotations
kubectl get deployment my-app -n development -o yaml | yq eval '.metadata.annotations'

# Check agent logs for specific workload
kubectl logs -n kubenidra -l kubenidra/component=agent | grep "my-app"

Common Causes:

  1. KubeNidra Not Enabled
# Enable KubeNidra on workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: development
  annotations:
    kubenidra/enabled: "true"
  2. Namespace Not Watched
# Check watched namespaces
kubectl get configmap -n kubenidra kubenidra-agent-cm -o jsonpath='{.data.kubenidra\.yaml}'
  3. Manual Override Active
# Check for manual overrides
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/manual-override}{"\n"}{end}'

# Check for manual overrides with duration
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/manual-override-until}{"\n"}{end}'
  4. Metrics Validation Failing
# Check Prometheus data (URL-encode the query so curl does not glob the braces)
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -sG "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query" \
  --data-urlencode 'query=container_cpu_usage_seconds_total{namespace="development"}'

Solutions:

  1. Enable KubeNidra (a verification sketch follows this list)
# Enable on existing deployment
kubectl patch deployment my-app -n development -p '{"metadata":{"annotations":{"kubenidra/enabled":"true"}}}'
  2. Add Namespace to Watch List
# Update configuration
watched_namespaces:
  - "development"
  - "staging"
  3. Remove Manual Override
# Remove manual override
kubectl patch deployment my-app -n development -p '{"metadata":{"annotations":{"kubenidra/manual-override":null}}}'
  4. Adjust Metrics Validation
# Reduce validation requirements
snooze:
  prometheus_validation_enabled: false
  # Or adjust thresholds
  minimum_data_coverage: 0.5
  minimum_data_points: 3
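
After applying any of these solutions, you can confirm the annotation actually landed before waiting out the next check interval. A quick sketch (deployment name is illustrative):

# An empty result means the annotation is not set
kubectl get deployment my-app -n development \
  -o jsonpath='{.metadata.annotations.kubenidra/enabled}'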

4. Workloads Not Waking Up

Symptoms:

  • Snoozed workloads don't wake up on schedule
  • Manual wake requests ignored

Diagnosis:

# Check wake schedule
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/wake-schedule}{"\n"}{end}'

# Check if workload is snoozed
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}'

# Check agent logs for wake operations
kubectl logs -n kubenidra -l kubenidra/component=agent | grep "wake"

Solutions:

  1. Set Wake Schedule
# Set wake schedule
kubectl patch deployment my-app -n development \
  -p '{"metadata":{"annotations":{"kubenidra/wake-schedule":"09:00-17:00,mon-fri"}}}'
  2. Manual Wake
# Trigger manual wake
kubectl patch deployment my-app -n development \
  -p '{"metadata":{"annotations":{"kubenidra/wake-now":"true"}}}'
  3. Check Timezone (a timezone fix sketch follows this list)
# Check agent timezone
kubectl exec -n kubenidra deployment/kubenidra-agent -- date
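
If the container clock is UTC but your wake schedules assume local time, setting TZ on the agent is usually the simplest fix. A sketch, assuming the agent honors the standard TZ environment variable (verify against your agent build; the timezone value is illustrative):

# Set the agent timezone; changing the pod template triggers a rolling restart
kubectl set env deployment/kubenidra-agent -n kubenidra TZ=Europe/Berlin

# Confirm the clock now matches your schedule's timezone
kubectl exec -n kubenidra deployment/kubenidra-agent -- date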

5. Too Many Operations

Symptoms:

  • Agent logs show "backoff" messages
  • Operations stopped due to rate limiting

Diagnosis:

# Check operation history
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/operation-history}{"\n"}{end}'

# Check backoff status
kubectl get deployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.kubenidra/backoff-until}{"\n"}{end}'

# Check agent metrics
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s http://localhost:8118/metrics | grep kubenidra_operations

Solutions:

  1. Increase Rate Limits (a verification sketch follows this list)
# Update configuration
snooze:
  max_operations_per_hour: 20
  operation_cooldown: "2m"
  2. Clear Backoff
# Remove backoff annotation
kubectl patch deployment my-app -n development \
  -p '{"metadata":{"annotations":{"kubenidra/backoff-until":null}}}'
  3. Adjust Behavior Mode
# Use conservative mode
snooze:
  behavior_mode: "conservative"
  idle_duration: "30m"
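
To confirm raised limits take effect, watch the operation counters for a while. A sketch, assuming the metrics endpoint and the kubenidra_operations metric family from the diagnosis step above:

# Sample the operation counters every 30 seconds
watch -n 30 "kubectl exec -n kubenidra deployment/kubenidra-agent -- curl -s http://localhost:8118/metrics | grep kubenidra_operations"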

Performance Issues

High Resource Usage

Symptoms:

  • Agent pod using high CPU/memory
  • Slow response times

Diagnosis:

# Check resource usage
kubectl top pod -n kubenidra -l kubenidra/component=agent

Solutions:

  1. Limit Namespaces
# Watch only specific namespaces
watched_namespaces:
  - "development"
  - "staging"
  2. Increase Check Intervals
# Reduce check frequency
snooze:
  check_interval: "10m"
  wake_check_interval: "5m"
  3. Adjust Resource Limits (a usage check follows this list)
# Increase resource limits
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
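
After narrowing the watch list or raising the limits, keep an eye on usage long enough to confirm the trend improved:

# Sample agent resource usage every 30 seconds
watch -n 30 kubectl top pod -n kubenidra -l kubenidra/component=agent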

Recovery Procedures

Agent Restart

# Restart agent
kubectl rollout restart deployment/kubenidra-agent -n kubenidra

# Check status
kubectl rollout status deployment/kubenidra-agent -n kubenidra

# Verify health
kubectl exec -n kubenidra deployment/kubenidra-agent -- \
  curl -s http://localhost:8118/healthz

Configuration Update

# Update configuration
kubectl patch configmap kubenidra-agent-cm -n kubenidra --patch-file new-config.yaml

# Restart to apply changes
kubectl rollout restart deployment/kubenidra-agent -n kubenidra
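
If you maintain the full config as a local kubenidra.yaml rather than a patch file, replacing the whole ConfigMap can be simpler. A sketch using the standard dry-run/apply idiom (the kubenidra.yaml key name matches the ConfigMap shown earlier):

# Rebuild the ConfigMap from the local file and apply it in place,
# then restart the agent as above
kubectl create configmap kubenidra-agent-cm -n kubenidra \
  --from-file=kubenidra.yaml --dry-run=client -o yaml | kubectl apply -f -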

Emergency Stop

# Scale agent to 0
kubectl scale deployment kubenidra-agent -n kubenidra --replicas=0
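
To resume after an emergency stop, scale the agent back up (a single replica is assumed to be the default):

# Resume normal operation
kubectl scale deployment kubenidra-agent -n kubenidra --replicas=1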