Troubleshooting Guide

This guide helps you diagnose and resolve common issues with KECS.

Diagnostic Tools

Health Checks

Check KECS component health:

bash

# API server health
curl http://localhost:8081/health

# Detailed health status
curl http://localhost:8081/health/detailed

# Kubernetes connectivity
kubectl cluster-info
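
When KECS has just been started, the health endpoint may not answer immediately. The following is a minimal sketch of a polling loop around the health check above. The function names, the retry counts, and the `HEALTH_URL` variable are illustrative assumptions, not part of KECS.

```bash
# Endpoint taken from the health check above; override via the environment if needed.
HEALTH_URL="${HEALTH_URL:-http://localhost:8081/health}"

# probe URL: fails on connection errors and on HTTP status >= 400.
probe() {
  curl --silent --fail --max-time 2 --output /dev/null "$1"
}

# wait_for_health ATTEMPTS DELAY_SECONDS [CHECK_FUNCTION]
# Calls the check until it succeeds or the attempts are used up.
wait_for_health() {
  local attempts="$1" delay="$2" check="${3:-probe}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$check" "$HEALTH_URL"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempt(s)" >&2
  return 1
}
```

For example, `wait_for_health 30 2` would poll every two seconds for up to a minute before giving up.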

Logs

View KECS logs:

bash

# If running directly
kecs server 2>&1 | tee kecs.log

# If running in Docker
docker logs kecs-container

# If running in Kubernetes
kubectl logs -n kecs-system deployment/kecs-control-plane
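
Full logs can be noisy, so it often helps to narrow them to likely problem lines first. This sketch assumes that problem lines contain one of the words `error`, `fatal`, `panic`, or `warn`; the actual KECS log format is not specified here, and the function name is illustrative.

```bash
# filter_problems: read log lines on stdin, keep only lines that look like problems.
# The keyword list is an assumption about the log format, not a KECS guarantee.
filter_problems() {
  grep -iE 'error|fatal|panic|warn'
}
```

It can be placed at the end of any of the commands above, for example `docker logs kecs-container 2>&1 | filter_problems`.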

Debug Mode

Enable debug logging:

bash

# Via command line
kecs server --log-level debug

# Via environment variable
export KECS_LOG_LEVEL=debug
kecs server

Common Issues

Installation Issues

Problem: Build Fails

Symptoms:

go: cannot find main module

Solution:

bash

# Ensure you're in the correct directory
cd /path/to/kecs

# Clean and rebuild
make clean
make deps
make build

Problem: Missing Dependencies

Symptoms:

package github.com/... is not in GOROOT

Solution:

bash

# Update dependencies
go mod download
go mod tidy

# Verify Go version
go version  # Should be 1.21+
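
The version requirement can also be checked in a script. This is a minimal sketch that assumes `go version` prints its usual `go version go<major>.<minor>...` line; the function name is illustrative.

```bash
# go_version_at_least "OUTPUT OF go version" MAJOR MINOR
# Returns 0 if the version is at least MAJOR.MINOR, 1 if it is older,
# and 2 if the version string cannot be parsed.
go_version_at_least() {
  local output="$1" want_major="$2" want_minor="$3"
  local version major minor
  # Extract "1.22.3" from "go version go1.22.3 linux/amd64".
  version=$(printf '%s\n' "$output" | sed -n 's/^go version go\([0-9][0-9.]*\).*/\1/p')
  [ -n "$version" ] || return 2
  major=${version%%.*}
  case "$version" in
    *.*) minor=${version#*.}; minor=${minor%%.*} ;;
    *)   minor=0 ;;
  esac
  if [ "$major" -gt "$want_major" ]; then
    return 0
  fi
  if [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; then
    return 0
  fi
  return 1
}
```

A typical call would be `go_version_at_least "$(go version)" 1 21 || echo "Go 1.21 or newer is required" >&2`.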

Startup Issues

Problem: Port Already in Use

Symptoms:

listen tcp :8080: bind: address already in use

Solution:

bash

# Find process using port
lsof -i :8080

# Stop the process (use kill -9 only if it ignores the default SIGTERM)
kill <PID>

# Or use different port
kecs server --api-port 9080
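
If you would rather pick a free port automatically than free up 8080, the search can be sketched as below. This relies on bash's `/dev/tcp` redirection, so it needs bash rather than plain sh, and it only detects listeners reachable on 127.0.0.1. The function names and the port range are illustrative.

```bash
# port_in_use PORT: succeeds if something accepts TCP connections on 127.0.0.1:PORT.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# first_free_port FIRST LAST [CHECK_FUNCTION]
# Prints the first port in [FIRST, LAST] for which the check reports "not in use".
first_free_port() {
  local port="$1" last="$2" check="${3:-port_in_use}"
  while [ "$port" -le "$last" ]; do
    if ! "$check" "$port"; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  return 1
}
```

For example, `kecs server --api-port "$(first_free_port 8080 8099)"` would start the server on the first free port in that range.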

Problem: Cannot Connect to Kubernetes

Symptoms:

failed to get kubernetes config: stat /home/user/.kube/config: no such file or directory

Solution:

bash

# Check kubeconfig exists
ls ~/.kube/config

# Set kubeconfig explicitly
kecs server --kubeconfig /path/to/kubeconfig

# Or, when KECS runs inside the cluster, apply the RBAC manifest so it can use the in-cluster config
kubectl apply -f deploy/kubernetes/rbac.yaml
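
To find out which kubeconfig file is actually available, the lookup can be scripted. This sketch follows the kubectl convention of checking the colon-separated `KUBECONFIG` variable first and then `~/.kube/config`. Whether KECS itself honors `KUBECONFIG` is an assumption, and the function name is illustrative.

```bash
# resolve_kubeconfig: print the first existing kubeconfig file, or fail if none exists.
resolve_kubeconfig() {
  local candidates="${KUBECONFIG:-}:${HOME}/.kube/config"
  local old_ifs="$IFS"
  local path
  IFS=':'
  for path in $candidates; do
    if [ -n "$path" ] && [ -f "$path" ]; then
      IFS="$old_ifs"
      printf '%s\n' "$path"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}
```

The result can be passed straight to the flag shown above: `kecs server --kubeconfig "$(resolve_kubeconfig)"`.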

Cluster Operations

Problem: Cluster Creation Fails

Symptoms:

failed to create kind cluster: exit status 1

Solution:

bash

# Check Docker is running
docker ps

# Check Kind is installed
kind version

# Create cluster manually
kind create cluster --name kecs-cluster

# Verify cluster
kubectl get nodes
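
The checks above can be combined into a single preflight step that runs before cluster creation. This is a sketch only; the function names are illustrative, and the tool list simply mirrors the commands used in this section.

```bash
# missing_tools TOOL...: print each tool that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# preflight: verify the tools exist and the Docker daemon answers.
preflight() {
  local missing
  missing=$(missing_tools docker kind kubectl)
  if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
    return 1
  fi
  if ! docker info >/dev/null 2>&1; then
    echo "Docker daemon is not reachable" >&2
    return 1
  fi
  echo "preflight ok"
}
```

Running `preflight && kind create cluster --name kecs-cluster` would then only attempt creation when the prerequisites are in place.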

Problem: Cluster Already Exists

Symptoms:

cluster already exists

Solution:

bash

# List existing clusters
aws ecs list-clusters --endpoint-url http://localhost:8080

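# Delete the existing cluster
aws ecs delete-cluster --cluster <name> --endpoint-url http://localhost:8080

# Recreate it
aws ecs create-cluster --cluster-name <name> --endpoint-url http://localhost:8080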

Service Deployment Issues

Problem: Service Won't Start

Symptoms:

Service stuck in PENDING
No running tasks

Solution:

Check task definition:

bash

aws ecs describe-task-definition \
  --task-definition <family:revision> \
  --endpoint-url http://localhost:8080

Check cluster resources:

bash

kubectl top nodes
kubectl describe nodes

Review service events:

bash

aws ecs describe-services \
  --cluster <cluster> \
  --services <service> \
  --endpoint-url http://localhost:8080

Problem: Tasks Keep Stopping

Symptoms:

Tasks transition to STOPPED
Service can't maintain desired count

Solution:

Check task stop reason:

bash

aws ecs describe-tasks \
  --cluster <cluster> \
  --tasks <task-arn> \
  --endpoint-url http://localhost:8080 \
  | jq '.tasks[0].stoppedReason'

View container logs:

bash

kubectl logs -n <cluster-name> <pod-name>

Common causes:
- Image pull failures
- Health check failures
- Resource constraints
- Application errors
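
The stop reason can be mapped onto the causes listed above with a small helper. This is a sketch under assumptions: the patterns are modeled on typical ECS stop-reason strings, the exact messages KECS emits may differ, and the function name and category labels are illustrative.

```bash
# classify_stop_reason "REASON": print a rough category for a task stop reason.
classify_stop_reason() {
  case "$1" in
    *CannotPullContainerError*|*"pull access denied"*) echo "image-pull" ;;
    *OutOfMemory*|*"memory limit"*)                    echo "resources" ;;
    *"health check"*)                                  echo "health-check" ;;
    *"Essential container in task exited"*)            echo "application" ;;
    *)                                                 echo "unknown" ;;
  esac
}
```

It can be fed from the `describe-tasks` command shown earlier by using `jq -r '.tasks[0].stoppedReason'` and passing the resulting string as the argument.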

Task Issues

Problem: Image Pull Error

Symptoms:

CannotPullContainerError: Error response from daemon: pull access denied

Solution:

Verify image exists:

bash

docker pull <image-name>

Check image registry credentials:

bash

# For private registries
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <cluster-name>

Update task definition with credentials:

json

{
  "containerDefinitions": [{
    "repositoryCredentials": {
      "credentialsParameter": "arn:aws:secretsmanager:region:account:secret:name"
    }
  }]
}

Problem: Out of Memory

Symptoms:

OutOfMemoryError: Container killed due to memory limit

Solution:

Increase the memory limits in the container definition (values in MiB):

json

{
  "memory": 1024,
  "memoryReservation": 512
}

Check memory usage:

bash

kubectl top pod -n <cluster-name>

Optimize application memory usage.

Networking Issues

Problem: Service Discovery Not Working

Symptoms:

Services can't communicate
DNS resolution fails

Solution:

Check service registration:

bash

aws servicediscovery list-services \
  --endpoint-url http://localhost:4566

Test DNS resolution:

bash

kubectl exec -n <namespace> <pod> -- nslookup <service-name>

Verify network policies:

bash

kubectl get networkpolicies -n <namespace>

Problem: Load Balancer Not Working

Symptoms:

Can't access service externally
Health checks failing

Solution:

Check service type:

bash

kubectl get svc -n <namespace>

Verify target health:

bash

aws elbv2 describe-target-health \
  --target-group-arn <arn> \
  --endpoint-url http://localhost:4566

Check security groups and ports

LocalStack Integration Issues

Problem: LocalStack Connection Failed

Symptoms:

Could not connect to the endpoint URL: "http://localhost:4566/"

Solution:

Verify LocalStack is running:

bash

docker ps | grep localstack
curl http://localhost:4566/_localstack/health

Check KECS configuration:

yaml

localstack:
  enabled: true
  endpoint: http://localhost:4566

Restart both services:

bash

docker-compose restart

Problem: AWS SDK Not Using LocalStack

Symptoms:

Requests going to real AWS
Authentication errors

Solution:

Check sidecar injection:

bash

kubectl describe pod <pod> -n <namespace> | grep localstack-proxy

Set AWS endpoint explicitly:

python

import boto3

# Point the client at LocalStack instead of the real AWS endpoint
s3 = boto3.client('s3', endpoint_url='http://localhost:4566')

Verify environment variables:

bash

kubectl exec <pod> -n <namespace> -- env | grep AWS
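
Recent AWS SDKs and AWS CLI v2 read the `AWS_ENDPOINT_URL` environment variable, so it is a useful value to look for in that output. The sketch below checks for it. Whether KECS injects this particular variable into task containers is an assumption, and the function name is illustrative.

```bash
# endpoint_from_env: read `env` output on stdin and print the value of
# AWS_ENDPOINT_URL; fail if the variable is not set.
endpoint_from_env() {
  local line found=""
  while IFS= read -r line; do
    case "$line" in
      AWS_ENDPOINT_URL=*) found="${line#AWS_ENDPOINT_URL=}" ;;
    esac
  done
  [ -n "$found" ] || return 1
  printf '%s\n' "$found"
}
```

For example, `kubectl exec <pod> -n <namespace> -- env | endpoint_from_env` prints the endpoint the container would use, or fails if none is configured.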

Performance Issues

Problem: Slow API Responses

Solution:

Check resource usage:

bash

# KECS server (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, kecs)"

# Database file size
ls -lh ~/.kecs/data/kecs.db

Inspect performance metrics:

bash

curl http://localhost:8081/metrics

Optimize database:

bash

# Vacuum database
sqlite3 ~/.kecs/data/kecs.db "VACUUM;"

Problem: High Memory Usage

Solution:

Check for memory leaks:

bash

go tool pprof http://localhost:8081/debug/pprof/heap

Limit concurrent operations:

yaml

server:
  maxConcurrentRequests: 100

Adjust cache settings:

yaml

cache:
  maxSize: 1000
  ttl: 5m

Advanced Debugging

Enable Verbose Logging

bash

# All components
export KECS_LOG_LEVEL=trace

# Specific components
export KECS_API_LOG_LEVEL=debug
export KECS_STORAGE_LOG_LEVEL=trace
export KECS_K8S_LOG_LEVEL=debug

Trace Requests

bash

# Enable request tracing
curl -H "X-Debug-Trace: true" \
  -X POST http://localhost:8080/v1/ListClusters \
  -H "Content-Type: application/x-amz-json-1.1" \
  -H "X-Amz-Target: AmazonEC2ContainerServiceV20141113.ListClusters" \
  -d '{}'

Database Inspection

bash

# Open database
sqlite3 ~/.kecs/data/kecs.db

# List tables
.tables

# Check clusters
SELECT * FROM clusters;

# Check services
SELECT * FROM services WHERE cluster_arn = 'arn:...';

Kubernetes Debugging

bash

# Get all resources in namespace
kubectl get all -n <cluster-name>

# Describe problematic pod
kubectl describe pod <pod-name> -n <cluster-name>

# Get pod events
kubectl get events -n <cluster-name> --sort-by='.lastTimestamp'

# Debug container
kubectl debug -it <pod-name> -n <cluster-name> --image=busybox

Getting Help

Collect Diagnostic Information

Run the diagnostic script:

bash

./scripts/collect-diagnostics.sh

This collects:

- KECS logs
- Configuration files
- Kubernetes cluster state
- System information

Report Issues

When reporting issues, include:

Environment Details
- KECS version: kecs version
- OS: uname -a
- Kubernetes version: kubectl version
- Docker version: docker version

Steps to Reproduce
- Exact commands run
- Configuration files used
- Expected vs actual behavior

Logs and Errors
- KECS server logs
- Relevant Kubernetes events
- Error messages

Diagnostic Bundle
- Output from diagnostic script
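
The environment details can be gathered in one pass. This is a minimal sketch that runs the four commands listed under Environment Details and notes any tool that is not installed. It is not the project's diagnostic script, and the function names and output layout are illustrative.

```bash
# report_section TITLE COMMAND [ARGS...]
# Prints a titled section with the command output, or a note if the tool is missing.
report_section() {
  local title="$1"
  shift
  echo "## $title"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || echo "(command exited with status $?)"
  else
    echo "(not installed: $1)"
  fi
  echo
}

# collect_env_details: the four commands from the Environment Details list.
collect_env_details() {
  report_section "KECS version" kecs version
  report_section "OS" uname -a
  report_section "Kubernetes version" kubectl version
  report_section "Docker version" docker version
}
```

Running `collect_env_details > environment.txt` produces a file that can be attached to an issue.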

Community Support

GitHub Issues: github.com/nandemo-ya/kecs/issues

Prevention Tips

Regular Maintenance

Update Regularly

bash

git pull origin main
make build

Monitor Resources

- Set up alerts for disk space
- Monitor memory usage
- Track API response times

Backup Data

bash

# Backup database
cp ~/.kecs/data/kecs.db ~/.kecs/data/kecs.db.backup
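
A single `.backup` file is overwritten on every run, so keeping timestamped copies can be safer. This is a sketch with illustrative function and directory names. Copying the file while the server is writing to it may capture an inconsistent snapshot, so consider stopping KECS first or using the sqlite3 `.backup` command instead.

```bash
# backup_db SOURCE_FILE DEST_DIR
# Copies the database to DEST_DIR/<name>.<timestamp> and prints the new path.
backup_db() {
  local src="$1" dest_dir="$2" stamp dest
  if [ ! -f "$src" ]; then
    echo "no database at $src" >&2
    return 1
  fi
  mkdir -p "$dest_dir" || return 1
  stamp=$(date +%Y%m%d-%H%M%S)
  dest="$dest_dir/$(basename "$src").$stamp"
  cp -p "$src" "$dest" || return 1
  printf '%s\n' "$dest"
}
```

For example, `backup_db ~/.kecs/data/kecs.db ~/.kecs/backups` writes a copy such as `kecs.db.<timestamp>` into the backups directory.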

Clean Up Resources

bash

# Remove pods for tasks that completed successfully
kubectl delete pods -n <namespace> --field-selector=status.phase=Succeeded

# Prune unused images
docker image prune -a

Best Practices

Use Resource Limits
- Set appropriate CPU/memory limits
- Monitor actual usage
- Leave headroom for spikes

Enable Health Checks
- Configure liveness probes
- Set readiness probes
- Monitor health metrics

Plan for Failures
- Test failure scenarios
- Document recovery procedures
- Keep backups current

Stay Informed
- Read release notes
- Follow security advisories
- Join community discussions
