PoC: Building a "24/7 SRE" Teammate with LangGraph, AWS Bedrock, and Slack


It’s 3:14 AM. Your phone buzzes on the nightstand.

We all know the drill. You drag yourself out of bed, squinting at the screen, trying to authenticate into the VPN while your brain is still booting up. You have to SSH into a bastion host, run top, grep through logs, and check network graphs, all just to figure out if this is a real fire or just a noisy neighbor.

I wanted to change that workflow. The "15-minute tax" of just getting ready to solve a problem (logging in, context switching, and finding the right dashboard) is often more painful than the fix itself.

So, I built a Proof of Concept (PoC) using Python, LangGraph, and AWS Bedrock. It’s a chatbot that lives entirely in Slack threads, handles the investigation for you, and turns a 20-minute debugging session into a 30-second decision.

The Idea: A Teammate, Not Just a Tool

The goal wasn't to build another dashboard or a fancy CLI tool. I wanted a "Junior SRE" teammate. Someone (or something) reliable, tireless, and capable of executing standard operating procedures without deviating from the script.

We used Claude Sonnet 4.5 via AWS Bedrock as the reasoning engine. Why this model? We needed an LLM with strong reasoning to handle "intent detection." It has to look at a messy request like "Why is the payment service acting weird?" and work out that "acting weird" actually means "check latency metrics and error logs," without hallucinating a command that doesn't exist.

But here’s the kicker: everything happens in a Slack Thread.

This isn't a "fire and forget" script. It's a persistent, state-aware conversation. You can go back and forth with the bot to build a complete picture of the incident before you decide to do anything.

Under the Hood: The Architecture

The architecture relies on three distinct components working in harmony:

  1. The Interface (Slack): This serves as the "Ears & Mouth." We use Slack not just for chat, but for state management. Every incident lives in its own thread, keeping the main channel clean.

  2. The Orchestrator (LangGraph): This is the secret sauce. A simple script executes linearly, but LangGraph allows us to build a cyclic workflow: Listen -> Reason -> Propose -> Wait. It maintains the memory of the conversation, so if I say "check the other server," it knows which "server" I was talking about previously.

  3. The Hands (AWS Systems Manager): The bot never executes code directly. It triggers AWS SSM Runbooks. This ensures the bot is "Security First," executing only audited, pre-approved tasks.
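
The Listen -> Reason -> Propose -> Wait cycle and the per-thread memory can be sketched in plain Python. This is a minimal stand-in under stated assumptions, not LangGraph's API and not the PoC's actual code: `ThreadState`, `handle_message`, the `KNOWN_SERVERS` inventory, and the keyword-based `reason` function are all hypothetical, and the real system would call the model where `reason` sits.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ThreadState:
    """Memory for one incident thread (hypothetical shape)."""
    history: List[str] = field(default_factory=list)
    last_server: Optional[str] = None
    phase: str = "listen"


# One state object per Slack thread, keyed by the thread timestamp.
THREADS: Dict[str, ThreadState] = {}

KNOWN_SERVERS = ("web-1", "web-2")  # assumed inventory for this sketch


def reason(state: ThreadState, text: str) -> str:
    """Stand-in for the LLM: resolve which server the message refers to."""
    for server in KNOWN_SERVERS:
        if server in text:
            state.last_server = server
            return server
    if "other server" in text and state.last_server in KNOWN_SERVERS:
        # Use remembered context to resolve a vague reference.
        others = [s for s in KNOWN_SERVERS if s != state.last_server]
        state.last_server = others[0]
        return others[0]
    return state.last_server or "unknown"


def handle_message(thread_ts: str, text: str) -> str:
    """Run one pass of Listen -> Reason -> Propose, then park in Wait."""
    state = THREADS.setdefault(thread_ts, ThreadState())  # Listen
    state.history.append(text)
    state.phase = "reason"
    target = reason(state, text)                          # Reason
    state.phase = "propose"
    reply = f"Checking {target}"                          # Propose
    state.phase = "wait"                                  # Wait for the next message or approval
    return reply
```

Because state is looked up by thread, a follow-up such as "check the other server" resolves against what was said earlier in the same thread, while a brand-new thread starts with no context.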

How It Works: The Conversation Loop

Here is how a typical 3 AM interaction looks with this PoC. Notice how the bot doesn't just jump to a fix; it collaborates with me.

  1. The Trigger: An alert lands in the channel. The bot spins up a thread.

  2. The Investigation (Data Gathering):

    • Me: "Hey, is this actually impacting traffic?"

    • Bot (fetches CloudWatch metrics): "I pulled the network metrics. Inbound traffic is normal, but latency on the backend service spiked 500ms in the last 10 minutes."

    • Me: "Grab the error logs for that service."

    • Bot (fetches CloudWatch Logs): "Here are the last 20 lines. I see multiple OutOfMemoryError exceptions."

  3. The Proposal & Confidence Score:

    • The bot analyzes the data. It sees the correlation between the latency spike and the memory errors.

    • Bot: "Based on the OOM errors and high latency, I recommend restarting the service via the restart_backend_service runbook. Confidence Score: 95%."

    • Note: That confidence score is key. It tells me how sure the bot is, and the reasoning stated alongside it tells me why the bot thinks this is the right move.

  4. The Guardrail:

    • It stops. It waits. A button appears: [Approve].
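
The proposal message with its [Approve] button could be built as Slack Block Kit blocks, roughly as follows. This is a sketch, not the PoC's code: the function name `build_proposal_blocks`, the `approve_runbook` action ID, and the choice to carry the runbook name in the button's `value` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def build_proposal_blocks(runbook: str, reason: str, confidence: float) -> List[Dict[str, Any]]:
    """Return Block Kit blocks: a summary section plus an Approve button."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    summary = (
        f"{reason} I recommend running `{runbook}`. "
        f"Confidence Score: {confidence:.0%}"
    )
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": summary}},
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Approve"},
                    "style": "primary",
                    "action_id": "approve_runbook",  # hypothetical ID
                    "value": runbook,  # sent back in the click payload
                }
            ],
        },
    ]
```

These blocks would typically be posted into the incident thread by passing them as `blocks`, together with the thread's `thread_ts`, to Slack's `chat.postMessage`. Nothing runs until the click arrives.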

The "No-Op" Rule

One of the coolest features of our LangGraph implementation is its ability to discern intent.

If the AI decides that no operation is needed (maybe the CPU spike was transient and is already cooling down, or maybe I'm just asking it to fetch more read-only data like logs), it skips the approval step.

It doesn't annoy me with an "Are you sure?" prompt just to read a log file. It simply fetches the data and keeps the conversation going. The "Wait" state triggers only when the bot proposes a state-changing action (like restarting a server or flushing a cache). This makes the experience fluid rather than bureaucratic.
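
The routing behind this rule can be sketched as a small function, the kind of thing a LangGraph conditional edge would call to pick the next node. The tool names and the four return labels below are assumptions for illustration; the article does not show the real tool registry.

```python
from typing import Optional

# Assumed classification of tools; not taken from the PoC itself.
READ_ONLY_TOOLS = {"fetch_metrics", "fetch_logs"}
STATE_CHANGING_TOOLS = {"restart_backend_service", "flush_cache"}


def next_step(proposed_tool: Optional[str]) -> str:
    """Decide where the workflow goes after the reasoning step."""
    if proposed_tool is None:
        return "respond"            # no-op: nothing to run, just reply
    if proposed_tool in READ_ONLY_TOOLS:
        return "execute"            # read-only: run without approval
    if proposed_tool in STATE_CHANGING_TOOLS:
        return "wait_for_approval"  # state-changing: a human must approve
    return "refuse"                 # unknown tools are never executed
```

Defaulting unknown tools to a refusal, rather than to execution, keeps the fast path limited to actions that were explicitly classified as read-only.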

Power with Guardrails

We needed to make sure this thing wouldn't burn down production while we slept. The design follows a "Power with Guardrails" philosophy:

  • Pre-Configured Actions Only: The bot is restricted to a specific allowlist of AWS SSM Runbooks. If I ask it to rm -rf /, or even something benign but unauthorized, it will politely refuse because that action isn't in its registered toolset.

  • Immutable Audit Trail: Because every interaction happens in a Slack thread, we effectively get an automatic incident report. Post-mortem analysis becomes incredibly easy: just read the thread to see exactly what data was fetched, what logic the AI used, and who clicked "Approve."
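
A minimal sketch of the allowlist check, assuming the runbooks are SSM Automation documents started through the `start_automation_execution` API. The client is passed in as a parameter (in practice it would likely be `boto3.client("ssm")`), and the names `ALLOWED_RUNBOOKS`, `RunbookNotAllowed`, and `run_runbook` are hypothetical. Only `restart_backend_service` comes from the article.

```python
from typing import Any, Dict, List

# Hypothetical allowlist; only restart_backend_service is named in the article.
ALLOWED_RUNBOOKS = {"restart_backend_service"}


class RunbookNotAllowed(Exception):
    """Raised when a requested action is not in the allowlist."""


def run_runbook(
    ssm_client: Any,
    name: str,
    params: Dict[str, List[str]],
    approved_by: str,
) -> str:
    """Start an allowlisted SSM Automation runbook and return its execution ID."""
    if name not in ALLOWED_RUNBOOKS:
        raise RunbookNotAllowed(f"'{name}' is not a registered runbook")
    if not approved_by:
        raise PermissionError("a human approver is required")
    # SSM Automation parameters map each name to a list of string values.
    response = ssm_client.start_automation_execution(
        DocumentName=name,
        Parameters=params,
    )
    return response["AutomationExecutionId"]
```

Since the check happens before any AWS call is made, a request like rm -rf / never reaches SSM at all. It fails at the allowlist and the bot can report the refusal in the thread.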

Why This Solves the 3 AM Problem

You might ask: "If I still have to wake up to click Approve, what’s the point?"

The difference is cognitive load.

Without the bot, I'm waking up to investigate. I have to engage my brain, remember the correct CLI syntax, log in to VPNs, and correlate timestamps across three different dashboards.

With the bot, I'm waking up to manage. The investigation is done. The logs are already parsed. The metrics are plotted. The solution is proposed with a confidence score. I just check the work and tap Approve.

It’s the difference between 30 minutes of high-stress debugging and 30 seconds of executive review. By treating AI as a teammate rather than a tool, we don't just fix incidents faster; we transform on-call engineers from first responders into strategic problem solvers.

This article summarizes the key takeaways from Anuj Mali's presentation on PoC: Building a "24/7 SRE" Teammate with LangGraph, AWS Bedrock, and Slack at Aerawat Corp's #TechThursday event, a bi-weekly forum where we share insights on emerging trends, innovative ideas, and rapid product development strategies around Fintech, Artificial Intelligence, Autism and Diversity with Disability Engineering, and Accessibility hacking.