Monitoring

CloudWatch Logs captures output from your ECS containers.

View Logs

Tail logs in real-time:

aws logs tail /ecs/{infra_name}-prd --follow

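Search the last hour of logs for errors (the `date -d` syntax below is GNU-specific; on macOS use `date -v-1H +%s` instead):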

aws logs filter-log-events \
  --log-group-name /ecs/{infra_name}-prd \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s)000

Replace {infra_name} with your infra_name from settings.py (e.g., agentos-aws-template).
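The search command above can be wrapped in a small helper so the log group name and start time are built in one place. This is a minimal sketch, not part of the template: the function names `log_group`, `minutes_ago_ms`, and `search_logs` are hypothetical, and the start time is computed with plain arithmetic so it should work with both GNU and BSD `date`.

```shell
# Build the production log group name from infra_name (hypothetical helper).
log_group() {
  printf '/ecs/%s-prd\n' "$1"
}

# Epoch milliseconds for N minutes ago, using only `date +%s` and shell arithmetic.
minutes_ago_ms() {
  now=$(date +%s)
  echo $(( (now - $1 * 60) * 1000 ))
}

# Search the last N minutes (default 60) of the log group for a filter pattern.
# Arguments: infra_name, filter pattern, minutes.
search_logs() {
  aws logs filter-log-events \
    --log-group-name "$(log_group "$1")" \
    --filter-pattern "$2" \
    --start-time "$(minutes_ago_ms "${3:-60}")" \
    --query 'events[].message' \
    --output text
}
```

For example, `search_logs agentos-aws-template ERROR 60` prints the messages of matching events from the past hour.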

ECS Service Status

View service status and recent events:

aws ecs describe-services \
  --cluster {infra_name}-prd \
  --services {infra_name}-prd-service \
  --query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'

List running tasks:

aws ecs list-tasks --cluster {infra_name}-prd
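To turn the status query into a pass/fail check, the running and desired counts can be compared in a script. The sketch below assumes the cluster and service follow the `{infra_name}-prd` and `{infra_name}-prd-service` naming used above; `counts_match` and `check_service` are hypothetical helper names.

```shell
# Succeeds when the running count equals the desired count and both are nonzero.
# Arguments: running count, desired count.
counts_match() {
  [ "$1" -gt 0 ] && [ "$1" -eq "$2" ]
}

# Query ECS for the service's running and desired task counts and compare them.
# Argument: infra_name.
check_service() {
  name=$1
  counts=$(aws ecs describe-services \
    --cluster "${name}-prd" \
    --services "${name}-prd-service" \
    --query 'services[0].[runningCount,desiredCount]' \
    --output text) || return 2
  # Text output is "running<TAB>desired"; split it into positional parameters.
  set -- $counts
  if counts_match "$1" "$2"; then
    echo "OK: $1/$2 tasks running"
  else
    echo "DEGRADED: $1/$2 tasks running"
    return 1
  fi
}
```

To block until a deployment settles instead of polling by hand, the AWS CLI also provides `aws ecs wait services-stable --cluster {infra_name}-prd --services {infra_name}-prd-service`.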

What Success Looks Like

After a successful deployment, logs show:

INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

Health check passing:

INFO:     192.168.x.x - "GET /health HTTP/1.1" 200 OK
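The two success signals above can be checked mechanically by piping log output through `grep`. This is a sketch under the assumption that the log lines keep the Uvicorn format shown above; `startup_complete` and `health_ok` are hypothetical helper names that read log text from standard input.

```shell
# Succeeds if the input contains the Uvicorn startup-complete message.
startup_complete() {
  grep -q 'Application startup complete'
}

# Succeeds if the input contains a /health request answered with HTTP 200.
health_ok() {
  grep -Eq '"GET /health HTTP/1\.[01]" 200'
}
```

A possible use is `aws logs tail /ecs/{infra_name}-prd --since 10m | startup_complete && echo "app started"`.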

Warning Signs

| Log Pattern | Meaning | Action |
|---|---|---|
| `database is locked` | DuckDB concurrency issue | Reduce workers to 1 |
| `connection refused` | Can’t reach RDS | Check security group |
| `OOMKilled` | Out of memory | Increase task memory |
| `CannotPullContainerError` | ECR auth expired | Re-run `auth_ecr.sh` |
| `SIGTERM` then restart loop | Health check failing | Check app logs for errors |
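The table above can be encoded as a lookup so that a log line maps directly to its suggested action. This is an illustrative sketch: `suggest_action` is a hypothetical name, matching is case-insensitive by assumption, and a lone `SIGTERM` is treated as a hint rather than proof of a restart loop.

```shell
# Print the suggested action for a log line, or return 1 if no pattern matches.
suggest_action() {
  # Lowercase the line so "Connection refused" and "connection refused" both match.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"database is locked"*)     echo "Reduce workers to 1" ;;
    *"connection refused"*)     echo "Check security group" ;;
    *oomkilled*)                echo "Increase task memory" ;;
    *cannotpullcontainererror*) echo "Re-run auth_ecr.sh" ;;
    *sigterm*)                  echo "Check app logs for errors" ;;
    *)                          return 1 ;;
  esac
}
```

Piping `aws logs tail /ecs/{infra_name}-prd --since 1h` through a `while IFS= read -r line` loop that calls `suggest_action "$line"` gives a quick summary of which warning signs appeared recently.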

Health Checks

The load balancer checks /health every 30 seconds.

| Target Status | Meaning |
|---|---|
| healthy | Task passing health checks |
| unhealthy | Health check failing |
| draining | Task being replaced |

If a target is unhealthy, check:

- Container logs for startup errors
- The task security group allows inbound traffic on port 8000 from the ALB
- Database connectivity (DB_HOST, DB_PASS)
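Target states can also be read from the command line with `aws elbv2 describe-target-health`. The sketch below assumes you already know the target group ARN (it is not given in this guide, so it is passed in as an argument); `all_healthy` and `check_targets` are hypothetical helper names.

```shell
# Succeeds only if at least one state is given and every state is "healthy".
all_healthy() {
  [ "$#" -gt 0 ] || return 1
  for state in "$@"; do
    [ "$state" = "healthy" ] || return 1
  done
}

# Print the state of every target in a target group and report overall health.
# Argument: target group ARN (placeholder supplied by the caller).
check_targets() {
  states=$(aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' \
    --output text) || return 2
  echo "Target states: ${states:-none}"
  # Unquoted on purpose: split the whitespace-separated states into arguments.
  all_healthy $states
}
```

An empty target group is reported as not healthy, since no task is available to serve traffic.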

Log Retention

CloudWatch retains logs indefinitely by default. Set a retention policy to control costs:

aws logs put-retention-policy \
  --log-group-name /ecs/{infra_name}-prd \
  --retention-in-days 30

| Retention | Monthly Cost (10 GB/day) |
|---|---|
| 7 days | ~$3 |
| 30 days | ~$15 |
| 90 days | ~$45 |
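The figures in the table scale linearly with the volume of logs kept: at steady state, stored volume is roughly retention days times daily ingestion. The sketch below reproduces that arithmetic. The default rate of 0.05 per GB-month is inferred from the table's own numbers and is an assumption, not a quoted AWS price, and the estimate covers storage only (ingestion is billed separately). `retention_cost` is a hypothetical helper name.

```shell
# Estimate steady-state monthly storage cost for a retention period.
# Arguments: retention days, GB ingested per day (default 10),
#            price per GB-month (default 0.05, inferred from the table, an assumption).
retention_cost() {
  awk -v days="$1" -v gb_per_day="${2:-10}" -v rate="${3:-0.05}" \
    'BEGIN { printf "%.2f\n", days * gb_per_day * rate }'
}
```

For instance, `retention_cost 30` prints `15.00`, which matches the 30-day row. Pass your own volume and your region's current rate as the second and third arguments for a more realistic number.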

Alerts (Optional)

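Create a CloudWatch alarm that fires when the service has no running tasks. ECS does not publish a FailedTasks metric in the AWS/ECS namespace, so this alarm uses the RunningTaskCount metric from Container Insights, which must be enabled on the cluster:

aws cloudwatch put-metric-alarm \
  --alarm-name "{infra_name}-no-running-tasks" \
  --metric-name "RunningTaskCount" \
  --namespace "ECS/ContainerInsights" \
  --statistic Minimum \
  --period 300 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --dimensions Name=ClusterName,Value={infra_name}-prd Name=ServiceName,Value={infra_name}-prd-service \
  --evaluation-periods 1 \
  --alarm-actions YOUR_SNS_TOPIC_ARN

Replace YOUR_SNS_TOPIC_ARN with the ARN of your SNS topic.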

See AWS SNS documentation to create a notification topic.
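If you do not have a topic yet, one can be created and subscribed to from the CLI. This is a minimal sketch assuming email delivery; the topic name and address are placeholders you supply, and `is_sns_arn` and `create_alert_topic` are hypothetical helper names.

```shell
# Succeeds if the argument looks like an SNS topic ARN (any AWS partition).
is_sns_arn() {
  case "$1" in
    arn:aws*:sns:*:*:?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Create an SNS topic, subscribe an email address, and print the topic ARN.
# Arguments: topic name, email address.
create_alert_topic() {
  arn=$(aws sns create-topic --name "$1" --query TopicArn --output text) || return 1
  is_sns_arn "$arn" || return 1
  aws sns subscribe \
    --topic-arn "$arn" \
    --protocol email \
    --notification-endpoint "$2" >/dev/null || return 1
  echo "$arn"
}
```

The printed ARN is the value to pass to `--alarm-actions`. Email subscriptions stay pending until the recipient confirms them from the message SNS sends.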
