Solutions for common issues encountered when deploying to AWS.

ECS Task Issues

Cause: Container not responding to health checks
Verify the /health endpoint works:
curl http://localhost:8000/health
Should return: {"status": "ok", "instantiated_at": "..."}
If this fails, check CloudWatch logs for startup errors:
aws logs tail /ecs/{infra_name}-prd --follow
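To script the /health check above, a small helper can test the response body instead of relying on a visual check. This is a minimal sketch: health_ok is a hypothetical name, and it assumes the body contains a "status": "ok" pair as shown above.

```shell
# Hypothetical helper: reads a response body on stdin and succeeds
# only if it contains "status": "ok" (whitespace around the colon is allowed).
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage (assumes the app listens on localhost:8000, as above):
#   curl -fsS http://localhost:8000/health | health_ok && echo "healthy"
```

With curl's -f flag a non-2xx response produces no body, so the helper fails in that case as well.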
Cause: Container starts but fails health checks
Check the logs for the startup sequence:
aws logs tail /ecs/{infra_name}-prd --since 10m
Look for:
  • Application startup complete - Container started
  • SIGTERM - Health check failed, container being killed
Common causes:
  • Database connection failing (check DB_HOST, DB_PASS)
  • Missing environment variables
  • App crashes after startup
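One way to rule out missing environment variables is to check them explicitly from inside the container. The sketch below is illustrative only: check_required_env is a hypothetical helper, and DB_HOST and DB_PASS are simply the variables named above, so extend the list to match what your app actually reads.

```shell
# Hypothetical helper: prints the name of every unset or empty variable
# and returns non-zero if any are missing.
check_required_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable named by $name.
    # eval is used for POSIX sh compatibility; only pass trusted names.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, from a shell inside the container:
#   check_required_env DB_HOST DB_PASS
```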
Cause: Multiple uvicorn workers with DuckDB
DuckDB requires single-writer access. Ensure your command uses one worker:
command="uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 1",
Do NOT increase --workers if you are using the Pal agent.
Cause: No EFS configured
Pal stores data in DuckDB at /data/pal.db. Without EFS, this is lost on container restart.
See: EFS Setup Guide
Cause: Missing IAM permissions, or the secret doesn't exist
Verify secrets exist:
aws secretsmanager list-secrets \
  --query "SecretList[?contains(Name, '{infra_name}-prd')].[Name]" \
  --output table
If missing, redeploy with ag infra up prd:aws to create them from your YAML files.

Docker & ECR Issues

Cause: Docker not authenticated to ECR
Run the authentication script:
./scripts/auth_ecr.sh
Or manually:
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  [ACCOUNT_ID].dkr.ecr.us-east-1.amazonaws.com
ECR tokens expire after 12 hours. Re-run if you get this error after a break.
Large images can time out on slow connections. Try:
  1. Build with -f flag to ensure fresh layers
  2. Check your network connection
  3. Consider using GitHub Actions for CI/CD builds

Database Issues

Cause: Special characters in password
Avoid @, #, %, & in DB_PASS. These require URL encoding and cause silent connection failures.
Safe characters: alphanumeric, !, -, _
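If a password with special characters cannot be changed, percent-encoding it shows what the URL-safe form looks like. This is a rough sketch, not part of the deployment tooling: urlencode is a hypothetical helper, it assumes bash and an ASCII-only password, and whether your connection string accepts an encoded password is an assumption to verify.

```shell
# Hypothetical helper (bash): percent-encodes every character except
# the URL "unreserved" set (letters, digits, '.', '~', '_', '-').
# Assumes ASCII input.
urlencode() {
  local LC_ALL=C            # make the character ranges below byte-exact
  local s="$1" out="" c i
  for ((i = 0; i < ${#s}; i++)); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) printf -v c '%%%02X' "'$c"   # "'x" makes printf use x's character code
         out+="$c" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Usage (the user and database names are placeholders):
#   echo "postgresql://user:$(urlencode "$DB_PASS")@$DB_HOST:5432/dbname"
```

The helper also encodes characters such as ! that are listed as safe above; encoding them is harmless, just not required.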
Check that the database security group allows ECS to reach RDS:
aws ec2 describe-security-groups \
  --filters "Name=group-name,Values=*-db-sg" \
  --query 'SecurityGroups[0].IpPermissions'
The database security group must allow inbound port 5432 from the ECS security group.
RDS must be in a public subnet with publicly_accessible=True (the default).
Add your IP to the security group or use a bastion host.

EFS Issues

Ensure mount targets exist in the same subnets as your ECS tasks:
aws efs describe-mount-targets --file-system-id fs-xxx
Each subnet in aws_subnet_ids needs its own mount target.
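Comparing the two subnet lists by eye gets error-prone with more than a couple of subnets. The helper below is a sketch of that comparison; missing_mount_targets is a hypothetical name, and the subnet IDs in the usage comment are placeholders for the values in your aws_subnet_ids.

```shell
# Hypothetical helper: prints each subnet from the first list that does
# not appear in the second list. Both arguments are whitespace-separated.
missing_mount_targets() {
  task_subnets="$1"      # subnets used by the ECS tasks
  mounted_subnets="$2"   # subnets that already have a mount target
  for subnet in $task_subnets; do
    # Unquoted echo collapses tabs and newlines into single spaces.
    case " $(echo $mounted_subnets) " in
      *" $subnet "*) ;;          # mount target present
      *) echo "$subnet" ;;       # no mount target in this subnet
    esac
  done
}

# Usage (fs-xxx and the subnet IDs are placeholders):
#   mounted=$(aws efs describe-mount-targets --file-system-id fs-xxx \
#     --query 'MountTargets[].SubnetId' --output text)
#   missing_mount_targets "subnet-aaa subnet-bbb" "$mounted"
```

Any subnet printed by the helper still needs a mount target.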
Check that your access point uses UID/GID 61000 to match the container user:
aws efs describe-access-points --access-point-id fsap-xxx
The POSIX user should be Uid: 61000, Gid: 61000.

Debugging Commands

# View ECS service events (replace {infra_name} with your infra_name)
aws ecs describe-services \
  --cluster {infra_name}-prd \
  --services {infra_name}-prd-service \
  --query 'services[0].events[:5]'

# View recent logs
aws logs tail /ecs/{infra_name}-prd --follow

# Check task status
aws ecs list-tasks --cluster {infra_name}-prd
aws ecs describe-tasks --cluster {infra_name}-prd --tasks [TASK_ARN]

SSH Access

Local Development

docker exec -it ai-api zsh

Production (ECS)

ECS_CLUSTER={infra_name}-prd
TASK_ARN=$(aws ecs list-tasks --cluster $ECS_CLUSTER --query "taskArns[0]" --output text)

aws ecs execute-command \
    --cluster $ECS_CLUSTER \
    --task $TASK_ARN \
    --container {infra_name}-prd \
    --interactive \
    --command "zsh"
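When the cluster has no running tasks, the list-tasks query above prints the literal text None, and execute-command then fails with a less obvious error. A small guard can catch that first. This is only a sketch: is_task_arn is a hypothetical helper that checks the general shape of a task ARN, not whether the task exists.

```shell
# Hypothetical helper: succeeds only if the value has the general
# shape of an ECS task ARN (arn:<partition>:ecs:<region>:<account>:task/...).
is_task_arn() {
  case "$1" in
    arn:aws*:ecs:*:task/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, after setting ECS_CLUSTER and TASK_ARN as above:
#   if ! is_task_arn "$TASK_ARN"; then
#     echo "No running task found in $ECS_CLUSTER" >&2
#   fi
```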