Deployment Hell: How My AI CTO Fixed a Broken ctrlman.dev Deployment

Ctrl Man
DevOps, AI Agents, Postmortem, Automation
28 Mar, 2026

The 2 AM Crisis

It was the middle of the night when everything went dark. Both ctrlman.dev and app.ctrlman.dev were returning 502 Bad Gateway errors. The VPS was still running, but almost nothing on it was: SSH kept prompting for a passphrase, the disk was full, MongoDB was missing, and nginx was proxying to a port the app wasn't listening on.

In the past, I would have spent hours debugging alone, jumping between terminals, Googling error messages, and slowly piecing together what went wrong.

But this time, I did something different. I opened Telegram and sent a message to my AI CTO.

Enter Hermes: My AI CTO

Hermes Agent is an autonomous AI assistant that lives on my machines. Think of it as having a 24/7 CTO who:

Never sleeps
Can execute commands directly
Remembers every deployment war story
Coordinates multiple AI models for complex tasks
Reports progress via Telegram

Our Workflow:

CEO (Mario)                    AI CTO (Hermes)
    │                               │
    ├─ "Server is down, fix it" ───►│
    │                               ├─ Diagnoses via SSH
    │                               ├─ Identifies 6 failures
    │                               ├─ Executes fixes
    │                               └─ Reports via Telegram
    │◄─ "Fixed. Here's what happened" ─┤

Communication: Telegram (voice messages, screenshots, logs)
Execution: Direct shell access with my SSH keys
Models: Qwen 3.5 Plus (reasoning) + Qwen 4B (execution on RTX 3060)

The Cascade of Failures

Hermes connected to the Hetzner VPS and immediately identified six cascading failures. Here’s how we fixed them, together:

1. The SSH Connection Block

Symptom: SSH commands were asking for a passphrase, even though I’d removed it.

Hermes Diagnosed: The SSH agent had an old cached key with a passphrase.

The Fix:

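# Remove the passphrase from the key file (prompts for the old passphrase if one is still set)
ssh-keygen -p -f ~/.ssh/id_ed25519 -N ""
# Start a fresh agent and load the updated key
eval "$(ssh-agent -s)" && ssh-add ~/.ssh/id_ed25519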

Telegram Update: “SSH agent had stale key. Cleared cache and reloaded. Now connected.”

Lesson: Always verify the SSH agent has the correct key loaded before deployment scripts run.
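One way to act on that lesson is to compare the fingerprint of the deploy key with what the agent reports before any deploy script runs. This is a minimal sketch, not what Hermes actually runs: the function names agent_has_key and check_deploy_key are made up for illustration, and it assumes the public key sits next to the private one at ~/.ssh/id_ed25519.pub.

```shell
# agent_has_key FINGERPRINT AGENT_LISTING
# Succeeds if FINGERPRINT is the second field of any line in AGENT_LISTING
# (the format printed by `ssh-add -l`). Hypothetical helper name.
agent_has_key() (
  [ -n "$1" ] || exit 1
  printf '%s\n' "$2" | awk -v fp="$1" '$2 == fp {found=1} END {exit found ? 0 : 1}'
)

# check_deploy_key [PUBLIC_KEY_FILE]
# Reads the key's fingerprint with ssh-keygen and looks for it in the running agent.
# The default path is an assumption.
check_deploy_key() (
  key_pub="${1:-$HOME/.ssh/id_ed25519.pub}"
  fp=$(ssh-keygen -lf "$key_pub" 2>/dev/null | awk '{print $2}')
  listing=$(ssh-add -l 2>/dev/null) || listing=""
  agent_has_key "$fp" "$listing"
)
```

A deploy script could then start with something like: check_deploy_key || { echo "deploy key not loaded in ssh-agent"; exit 1; }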

2. Disk Space Exhaustion (100% Full)

Symptom: apt-get install failed with “no free space in /var/cache/apt/archives/”

Hermes Diagnosed: Old project directories were consuming 11GB:

/home/webadmin/astroplate-combo2 (11GB)
/home/webadmin/proto-parsec (404MB)

The Fix:

df -h /
du -sh /home/webadmin/* | sort -rh | head -10
rm -rf /home/webadmin/astroplate-combo2
rm -rf /home/webadmin/proto-parsec

Result: Freed 11GB, disk usage dropped from 100% to 69%.

Telegram Update: “Disk was full. Old deployments consumed 11GB. Cleaned up. Now at 69%.”

Lesson: Add disk space checks to deployment pre-flight. Automate cleanup of old deployments.
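A disk check like the one this lesson calls for can be sketched in a few lines. The helper names parse_use_pct and disk_ok are hypothetical, and the limit is a parameter rather than anything Hermes is known to use; the sketch relies on `df -P`, which keeps each filesystem on one line so the fifth column is always the usage percentage.

```shell
# parse_use_pct: read `df -P` output on stdin and print the usage
# percentage of the first filesystem as a bare integer (e.g. 69).
parse_use_pct() {
  awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

# disk_ok LIMIT [MOUNT]: succeed if MOUNT (default /) is at or below LIMIT percent.
disk_ok() (
  limit="$1"
  mount="${2:-/}"
  used=$(df -P "$mount" | parse_use_pct)
  [ -n "$used" ] && [ "$used" -le "$limit" ]
)
```

For example, a deploy script might run: disk_ok 90 / || { echo "disk above 90%, clean up before deploying"; exit 1; }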

3. The Missing Database

Symptom: App crashed immediately with MongoDB connection error: connect ECONNREFUSED ::1:27017

Hermes Diagnosed: MongoDB was never installed during initial server setup. The app kept restarting because it couldn’t connect.

The Fix:

# Add MongoDB's signing key and the 7.0 apt repository (jammy = Ubuntu 22.04)
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | gpg --dearmor -o /usr/share/keyrings/mongodb-server-7.0.gpg
echo 'deb [ signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse' > /etc/apt/sources.list.d/mongodb-org-7.0.list
# Refresh package lists and install the server
apt-get update && apt-get install -y mongodb-org

systemctl start mongod
systemctl enable mongod

Telegram Update: “MongoDB was never installed. Installing now. Service started and enabled.”

Lesson: Database installation should be part of server bootstrap, not app deployment. Add health checks.
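For the health-check half of that lesson, a small retry helper is often enough, since mongod can take a few seconds to come up after a start. This is a sketch under assumptions: the function name retry and the attempt/delay values in the usage line are illustrative, not part of the original setup.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted,
# sleeping DELAY_SECONDS between tries. Hypothetical helper name.
retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)
```

Usage could look like: retry 10 2 systemctl is-active --quiet mongod || echo "mongod did not come up"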

4. Port Mismatch & Configuration Nightmares

Symptom: app.ctrlman.dev still wouldn’t respond.

Hermes Diagnosed: The port was defined in three places, and they did not agree:

.env file: PORT=8081
Nginx upstream: localhost:4321
PM2: Running on who-knows-what

The Fix:

sed -i 's/PORT=8081/PORT=4321/' /home/webadmin/proto-parsec-v2/.env
cd /home/webadmin/proto-parsec-v2
pm2 restart proto-parsec-app-v2 --update-env

Telegram Update: “Port mismatch found. .env had 8081, nginx expected 4321. Fixed and restarted.”

Lesson: Standardize PORT across .env, nginx, and PM2. Add port verification to deploy scripts.
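Port verification can be as simple as reading the port out of both files and comparing. The sketch below assumes the .env file has a plain PORT=<number> line and that the nginx site config mentions the upstream as localhost:<port>; the function names and the idea of passing the nginx file path as an argument are illustrative.

```shell
# env_port FILE: print the numeric value of the first PORT= line in a .env file.
env_port() {
  sed -n 's/^PORT=\([0-9][0-9]*\).*$/\1/p' "$1" | head -n 1
}

# nginx_port FILE: print the first port that follows "localhost:" in an nginx config.
nginx_port() {
  sed -n 's/.*localhost:\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# ports_match ENV_FILE NGINX_FILE: succeed only if both ports exist and are equal.
ports_match() (
  a=$(env_port "$1")
  b=$(nginx_port "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
)
```

With the files from this incident, ports_match would have failed on 8081 versus 4321 before the app was ever restarted.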

5. Wrong Entry Point & PM2 Persistence

Symptom: PM2 processes wouldn’t survive reboots.

Hermes Diagnosed: Two issues:

PM2 was started with dist/server/entry.mjs, the raw Astro build output, instead of the project's own server.mjs
PM2 wasn’t linked to system startup

The Fix:

# Start from the project's server.mjs instead of the raw build output
pm2 delete astroplate-landing-v2
cd /home/webadmin/astroplate-combo2-v2
pm2 start server.mjs --name astroplate-landing-v2

# Enable persistence (as a non-root user, pm2 startup prints a sudo command to run first)
pm2 startup
pm2 save

Telegram Update: “PM2 was using wrong entry point. Fixed to server.mjs. Enabled startup persistence.”

Lesson: In this setup, start PM2 from the project's server.mjs, not the raw build output in dist/. Run pm2 startup and pm2 save so processes survive reboots.
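To confirm that pm2 save actually captured a process, one option is to look for its name in the saved process list. This sketch assumes PM2's default dump location of ~/.pm2/dump.pm2 and that the file stores each process with a JSON "name" field; the function name pm2_saved is hypothetical, and the name is matched as a basic regular expression, so names containing dots match loosely.

```shell
# pm2_saved NAME [DUMP_FILE]
# Succeeds if a process called NAME appears in the saved PM2 process list.
# Default dump path is an assumption about a standard PM2 install.
pm2_saved() (
  name="$1"
  dump="${2:-$HOME/.pm2/dump.pm2}"
  [ -f "$dump" ] && grep -q "\"name\"[[:space:]]*:[[:space:]]*\"$name\"" "$dump"
)
```

For example: pm2_saved astroplate-landing-v2 || echo "process not in saved list, run pm2 save"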

6. Nginx Configuration Gaps

Symptom: Still getting 502 on app.ctrlman.dev.

Hermes Diagnosed: Nginx config was correct, but needed a reload after all the changes.

The Fix:

nginx -t
systemctl reload nginx

Telegram Update: “Nginx config valid. Reloaded. All services healthy. Deployment complete.”

Lesson: Always test nginx config and reload after changes.
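A reload is only half the story; it helps to confirm the site actually answers afterwards. The following is a sketch with hypothetical function names: it tests the config, reloads nginx, then asks curl for the HTTP status code and treats gateway errors (and curl's 000, which means no response at all) as failures.

```shell
# http_status URL: print the HTTP status code for URL (000 if no response).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# is_unhealthy_status CODE: succeed if CODE signals a gateway failure or no response.
is_unhealthy_status() {
  case "$1" in
    000|502|503|504) return 0 ;;
    *) return 1 ;;
  esac
}

# reload_and_verify URL: validate config, reload nginx, then check URL.
reload_and_verify() (
  nginx -t && systemctl reload nginx || exit 1
  code=$(http_status "$1")
  if is_unhealthy_status "$code"; then
    echo "still failing: HTTP $code"
    exit 1
  fi
  echo "ok: HTTP $code"
)
```

Usage might be: reload_and_verify https://app.ctrlman.dev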

The Final Checklist

After this war story, Hermes and I created an automated pre-flight check:

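#!/bin/bash
# Deployment Pre-Flight (auto-run by Hermes)
# Fail if the root filesystem is more than 90% full
df -P / | awk 'NR==2 {if ($5+0 > 90) exit 1}' || exit 1
# Fail if MongoDB is not running
systemctl is-active --quiet mongod || exit 1
# Fail if the app's health endpoint is unreachable or returns an HTTP error
curl -fsS -o /dev/null http://localhost:4321/health || exit 1
# Fail if the nginx configuration is invalid
nginx -t || exit 1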
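When a pre-flight fails at 2 AM, it is more useful to see every broken check at once than only the first. Here is a sketch of a variant that runs the same four checks, labels each as PASS or FAIL, and exits non-zero if any failed. The function names (check, disk_below_90, app_healthy, preflight) are illustrative and not part of the script above.

```shell
# check NAME COMMAND [ARGS...]
# Runs COMMAND quietly, prints "PASS NAME" or "FAIL NAME", and returns its outcome.
check() (
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS %s\n' "$name"
  else
    printf 'FAIL %s\n' "$name"
    exit 1
  fi
)

# Root filesystem is at most 90% full.
disk_below_90() {
  df -P / | awk 'NR==2 {if ($5+0 > 90) exit 1}'
}

# App health endpoint answers without an HTTP error.
app_healthy() {
  curl -fsS -o /dev/null http://localhost:4321/health
}

# preflight: run every check, then fail if any of them failed.
preflight() (
  failed=0
  check "disk below 90%" disk_below_90 || failed=1
  check "mongod active" systemctl is-active --quiet mongod || failed=1
  check "app health endpoint" app_healthy || failed=1
  check "nginx config" nginx -t || failed=1
  exit "$failed"
)
```

Calling preflight prints one line per check, which also makes a compact status message to forward over Telegram.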

Key Takeaways

#	Issue	Solution	Automated?
1	SSH agent cache	Clear and reload key	✅ Yes
2	Full disk (100%)	Clean old deployments	✅ Yes (cron)
3	Missing MongoDB	Install during bootstrap	✅ Yes
4	Port mismatch	Standardize to 4321	✅ Yes
5	PM2 wrong entry	Use server.mjs	✅ Yes
6	Nginx not reloaded	Test + reload	✅ Yes

The New Workflow: CEO + AI CTO

This deployment disaster became the catalyst for a completely new way of working:

Before (Solo Founder Struggle)

Problem → Google → Trial & Error → 4 hours later → Maybe fixed

After (CEO + AI CTO Partnership)

Problem → Telegram message → AI CTO diagnoses → Executes fixes → Reports back → 20 minutes → Done

Hermes Now Handles:

✅ Deployment automation (no more manual SSH)
✅ Health monitoring (disk, services, ports)
✅ Article writing (session files → blog posts via Qwen 4B on RTX 3060)
✅ Multi-agent coordination (Qwen 3.5 Plus for reasoning, Qwen 4B for execution)
✅ Cron-based publishing (2 articles/day at 09:00 & 18:00)

The Bigger Vision

This isn’t just about fixing deployments. It’s about building a scalable AI-first company:

CEO (Me): Strategy, vision, product decisions, user relationships
AI CTO (Hermes): Execution, automation, monitoring, documentation, content

Communication: Telegram (async, voice-friendly, mobile-first)
Infrastructure: Multi-machine (local + RTX 3060 remote for heavy tasks)
Content Pipeline: Session files → Qwen 4B → Blog articles → Published automatically

Result: I can focus on building the product while Hermes handles the operational complexity.

What’s Next

We’re now building:

Automated health checks - Hermes monitors all services and alerts via Telegram
Self-healing deployments - Auto-rollback on failure detection
Content automation - 2 blog posts/day generated from session files
Multi-agent workflows - Qwen 3.5 Plus for reasoning, Qwen 4B for execution, Kimi for review
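The self-healing idea can be sketched as a deploy step that is followed by a health check and, if either fails, a rollback. This is only an outline of the pattern, not our implementation: the function name deploy_or_rollback is hypothetical, and the three commands are passed in as strings so the sketch makes no claims about what the real deploy or rollback scripts look like.

```shell
# deploy_or_rollback DEPLOY_CMD HEALTH_CMD ROLLBACK_CMD
# Runs DEPLOY_CMD, then HEALTH_CMD. If either fails, runs ROLLBACK_CMD
# and exits non-zero. Each argument is a shell command string.
deploy_or_rollback() (
  if sh -c "$1" && sh -c "$2"; then
    echo "deploy healthy"
  else
    echo "deploy failed, rolling back"
    sh -c "$3"
    exit 1
  fi
)
```

A call might look like: deploy_or_rollback './deploy.sh' 'curl -fsS -o /dev/null http://localhost:4321/health' './rollback.sh', where both script names are placeholders.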

The deployment gods weren’t on our side that night. But having an AI CTO who never sleeps? That’s better than luck. 🛠️

Have you tried working with AI agents for DevOps? I’d love to hear your experience. Find me on Telegram @ctrlman_dev or drop a comment below.

About the Author: Mario is the CEO of ctrlman.dev, building productivity tools and AI agent workflows. He coordinates daily with his AI CTO Hermes via Telegram to ship features, fix deployments, and publish content - all while focusing on product vision and user needs.
