Pulumi (Python) • Automation API • AWS

Escaping the IaC Abyss

AI is notoriously terrible at Infrastructure as Code. Here's how we forced it to be competent, cutting redeploys by 60%.

Mike Henken, Chief Architect, Landi

It started with a little experiment. I was watching my terminal vomit red text for the third hour straight—something about a "circular dependency" that definitely didn't exist—when I decided to let the Pulumi AI bot take the wheel.

I expected it to fail. I wanted it to fail, just so I could feel superior to the machine. Instead, it did something terrifying: it analyzed the state file, identified a zombie lock, wrote a fix, opened a Pull Request, and then—I kid you not—ran the apply command itself.

It worked. First try. I sat there in silence, half-impressed, half-offended. If a specialized bot could navigate the minefield of state management better than me at 11 PM, what happens if we give that same specialized context to our primary AI agents?

That's when we decided to dig deeper with the improved Anticursor skills. Because let's be honest: most generic AI acts like a confident junior dev who deletes the production database because "it looked cluttered."

Why "Generic" AI Fails at IaC

The Infinite Failure Loop of DevOps: Deploy -> Error -> Scream -> Repeat

Figure 1: The "circle of life" for DevOps engineers using standard LLMs. Note the "Scream at Cloud" phase.

1. The Versioning Puzzle

Cloud providers change APIs faster than LLMs retrain. An AI trained on 2024 data will generate `acl="private"` for S3 buckets, unaware that AWS deprecated ACLs. The result? Immediate deployment failure.
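
One way to catch this class of mistake before `pulumi up` ever runs is a small static check over the generated program. The sketch below uses Python's standard `ast` module. The deny-list is a hypothetical, hand-maintained set whose single entry mirrors the S3 ACL example above; it is not an official list from AWS or Pulumi.

```python
import ast

# Hypothetical deny-list of (call-target suffix, keyword argument) pairs.
# The one entry mirrors the S3 ACL example; extend it from provider changelogs.
DEPRECATED_KWARGS = {("s3.Bucket", "acl")}


def dotted_name(node: ast.AST) -> str:
    """Rebuild a dotted call target such as 'aws.s3.Bucket' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
    return ".".join(reversed(parts))


def find_deprecated_kwargs(source: str) -> list[tuple[int, str, str]]:
    """Return (line, call target, keyword) for every deny-listed argument."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        target = dotted_name(node.func)
        for keyword in node.keywords:
            for suffix, arg in DEPRECATED_KWARGS:
                if keyword.arg == arg and target.endswith(suffix):
                    hits.append((node.lineno, target, arg))
    return hits


# Sample of AI-generated code; it is only parsed here, never executed.
generated = '''
import pulumi_aws as aws
bucket = aws.s3.Bucket("logs", acl="private")
table = aws.dynamodb.Table("t", hash_key="id")
'''
print(find_deprecated_kwargs(generated))  # [(3, 'aws.s3.Bucket', 'acl')]
```

Because the check only parses the source, it can run in CI without cloud credentials or an installed provider package.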

2. State Drift Blindness

LLMs treat code as the source of truth. But in IaC, the *real* source of truth is the live cloud state. Generic AI suggests "just delete the resource" code blocks, oblivious to the production database attached to it.

3. Permission Hell

To avoid errors, AI defaults to `*:*` permissions. It's easier to give AdminAccess than to figure out the exact `kms:Decrypt` action needed. This creates a security nightmare masked as "working code."
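
To make the contrast concrete, here is a minimal sketch of a policy builder that refuses wildcards, plus a checker that flags them in an existing document. The function names and the key ARN are hypothetical; only `kms:Decrypt` comes from the example above.

```python
import json


def least_privilege_policy(actions: list[str], resource_arn: str) -> str:
    """Build an IAM policy document granting only the listed actions on one ARN."""
    for action in actions:
        if "*" in action:
            raise ValueError(f"wildcard action not allowed: {action}")
    if resource_arn == "*":
        raise ValueError("wildcard resource not allowed")
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": resource_arn}
        ],
    }
    return json.dumps(document, indent=2)


def _as_list(value):
    """IAM allows a single item or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy_json: str) -> list[dict]:
    """Return Allow statements that grant '*' or 'service:*' actions, or '*' resources."""
    flagged = []
    for statement in _as_list(json.loads(policy_json).get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if any(a == "*" or a.endswith(":*") for a in actions) or "*" in resources:
            flagged.append(statement)
    return flagged


# Hypothetical key ARN, for illustration only.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/example-key-id"

scoped = least_privilege_policy(["kms:Decrypt"], KEY_ARN)
admin = json.dumps(
    {"Version": "2012-10-17",
     "Statement": {"Effect": "Allow", "Action": "*", "Resource": "*"}}
)
print(len(wildcard_statements(scoped)))  # 0
print(len(wildcard_statements(admin)))   # 1
```

The checker is deliberately blunt: it does not understand conditions or `NotAction`, so treat a clean result as a first filter rather than a security review.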

4. Secret Leakage

"Here, put your API key in this variable." No. Stop. The AI doesn't understand KMS, Vault, or Pulumi ESC unless it is explicitly forced to respect secret providers.
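
A cheap guardrail here is to audit a stack's configuration before anything is deployed. The sketch below assumes config entries expose `.value` and `.secret` fields, as the Automation API's `ConfigValue` objects returned by `stack.get_all_config()` do. `ConfigEntry` is a hypothetical stand-in so the example runs without Pulumi installed, and the name hints are an illustrative guess at what "looks like a secret."

```python
from typing import Mapping, NamedTuple, Optional

# URL schemes of KMS- or Vault-backed Pulumi secrets providers.
MANAGED_SCHEMES = ("awskms://", "gcpkms://", "azurekeyvault://", "hashivault://")

# Hypothetical hints for config keys that probably hold secrets.
SENSITIVE_HINTS = ("password", "secret", "token", "api_key")


class ConfigEntry(NamedTuple):
    """Hypothetical stand-in with the same .value/.secret fields as ConfigValue."""
    value: str
    secret: bool = False


def uses_managed_secrets_provider(secrets_provider: Optional[str]) -> bool:
    """True when the stack encrypts secrets with a cloud KMS or Vault key."""
    return bool(secrets_provider) and secrets_provider.startswith(MANAGED_SCHEMES)


def plaintext_suspects(config: Mapping[str, ConfigEntry]) -> list[str]:
    """Return keys that look sensitive but are not flagged as secret."""
    return sorted(
        key
        for key, entry in config.items()
        if not entry.secret and any(hint in key.lower() for hint in SENSITIVE_HINTS)
    )


config = {
    "app:db_password": ConfigEntry("hunter2"),                  # plaintext, flagged
    "app:api_token": ConfigEntry("[ciphertext]", secret=True),  # already a secret
    "aws:region": ConfigEntry("us-east-1"),                     # not sensitive
}
print(uses_managed_secrets_provider("awskms://alias/my-key"))  # True
print(uses_managed_secrets_provider("passphrase"))             # False
print(plaintext_suspects(config))                              # ['app:db_password']
```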

The Experiment: 60% Fewer Redeploys

We reverse-engineered that "Dashboard AI" competence into a reusable SKILL.md for Google Antigravity. Then we ran a controlled test with four groups of entry-level DevOps engineers (because let's be real, seniors don't have time for this).

  • Groups 1 & 2: Standard Cursor Rules. Immediate chaos. Struggles with state locks and circular dependencies.
  • Group 3: Armed with the Antigravity Pulumi Skill. Detected drift autonomously.
  • Group 4: Control group (No AI). Pure manual suffering.

The result? Group 3 needed 60% fewer redeploys, which meant far less time waiting for `pulumi up` to fail.

The "Intern Study"

Average redeploys required to fix a drifting Pulumi stack.

-60% Redeploy Rate

The Pulumi Mastery Skill

This isn't just a prompt. It's a procedural standard for managing Python-based Pulumi architecture. It handles programmatic drift detection (`stack.refresh()`), enforces KMS usage, and validates provider versions.
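
As a rough sketch of what "programmatic drift detection" can look like in practice: refresh the stack, preview, and refuse to continue if the preview contains destructive operations. The helper names and the choice of which operations count as destructive are assumptions made for illustration. `stack` is expected to behave like a `pulumi.automation.Stack`, whose `preview()` result carries a `change_summary` mapping of operation type to count.

```python
from typing import Mapping

# Assumption: these operation types are the ones worth halting on.
DESTRUCTIVE_OPS = frozenset(
    {"delete", "replace", "create-replacement", "delete-replaced"}
)


class DriftError(RuntimeError):
    """Raised when a preview contains changes that need human review."""


def destructive_changes(change_summary: Mapping) -> dict[str, int]:
    """Keep only destructive operations with a non-zero count."""
    result = {}
    for op, count in change_summary.items():
        name = str(getattr(op, "value", op))  # accept enum members or plain strings
        if name in DESTRUCTIVE_OPS and count > 0:
            result[name] = count
    return result


def refresh_and_gate(stack) -> Mapping:
    """Sync state with the live cloud, preview, and halt on destructive changes."""
    stack.refresh(on_output=print)
    preview = stack.preview()
    risky = destructive_changes(preview.change_summary)
    if risky:
        raise DriftError(f"destructive changes need review: {risky}")
    return preview.change_summary


print(destructive_changes({"same": 12, "update": 1, "delete": 2}))  # {'delete': 2}
```

With a real stack, `refresh_and_gate(stack)` would run just before `stack.up()`. A change summary only carries counts, so deciding which specific resources are critical would need the per-resource events from the preview instead.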

SKILL.md (Python/Pulumi)
.agent/skills/pulumi-architect/SKILL.md
name: pulumi-architect
description: Expert system for managing Pulumi (Python) Infrastructure as Code. Specialized in AWS/GCP, state management, and secret security.
triggers:
  - "deploy stack"
  - "fix pulumi error"
  - "drift detected"
  - "add resource"

# ---

**Pulumi Architect Protocol**

You are a Principal DevOps Engineer committed to Idempotency and Zero-Trust Security.

## **1. Pre-Flight Protocol (Mandatory)**

Before generating any resource code, you MUST execute these checks:

1.  **Provider Version Check:**
    - Query PyPI for the latest version of `pulumi-aws` or `pulumi-gcp`.
    - If strictly pinning versions (recommended), verify `requirements.txt`.
    - *Constraint:* Do NOT generate code using deprecated arguments (e.g., S3 ACLs).

2.  **Environment Inspection:**
    - Execute `pulumi whoami -v` to confirm Backend URL and Owner.
    - Ensure correct AWS_PROFILE or GOOGLE_APPLICATION_CREDENTIALS are set.

## **2. State Management Strategy**

### **Handling Locks**
If a `ConcurrentUpdateError` occurs:
1.  Do NOT advise "force unlock" immediately.
2.  Suggest `pulumi cancel` first.
3.  Check for zombie processes in the CI/CD pipeline.

### **Drift Detection**
Construct the deployment logic to be drift-aware:
```python
# AUTOMATION API PATTERN
# `stack` is a pulumi.automation.Stack
stack.refresh(on_output=print)  # sync state with the live cloud first
preview = stack.preview()
# change_summary maps each operation type ("create", "update", "delete", ...) to a count
pending = {op: n for op, n in preview.change_summary.items() if op != "same"}
if pending:
    # HALT for review if critical resources are modified
    raise RuntimeError(f"Changes detected, review before applying: {pending}")
```

## **3. Secret Management (Zero-Trust)**

**NEVER** accept plaintext secrets in variables.
**ALWAYS** use a Secrets Provider (KMS/Vault).

```python
from pulumi.automation import ConfigValue

# Enforcing KMS Encryption
# `stack` is a pulumi.automation.Stack; settings are read through its workspace
stack_settings = stack.workspace.stack_settings(stack.name)
if stack_settings.secrets_provider != "awskms://alias/my-key":
    # Assumes your Pulumi SDK version exposes this method; the CLI equivalent is
    # `pulumi stack change-secrets-provider "awskms://alias/my-key"`
    stack.change_secrets_provider("awskms://alias/my-key")

# Setting Secrets (`secure_pass` must come from a secure source, never a literal)
stack.set_config("db_password", ConfigValue(value=secure_pass, secret=True))
```

## **4. Permission Scoping**

When defining IAM Roles, do NOT use `AdministratorAccess`.
Synthesize Least-Privilege policies based on the resource graph:
- `aws.s3.Bucket` -> requires `s3:CreateBucket`, `s3:PutBucketPolicy`
- `aws.lambda.Function` -> requires `iam:PassRole`

## **5. Code Generation Standards**

output_logic:
  format: "python"
  structure: "component_based" (Use pulumi.ComponentResource for logical grouping)
  typing: "strict" (Use Optional[str], Output[str])

Generate full, runnable file content. No placeholders.
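
The Provider Version Check in the Pre-Flight Protocol is the easiest part of the skill to back with a script. The sketch below is one possible shape for it, not the skill's actual implementation. It reads an exact pin from `requirements.txt` text and compares it against the latest release reported by the PyPI JSON API at `https://pypi.org/pypi/<package>/json`. The helper names, the sample requirements, and the "one major version behind" rule are assumptions, and the version parsing only handles plain numeric versions such as `6.13.0`.

```python
import json
import re
import urllib.request
from typing import Optional


def pinned_version(requirements_text: str, package: str) -> Optional[str]:
    """Return the version from an exact `package==X.Y.Z` pin, or None if unpinned."""
    pattern = re.compile(
        rf"^\s*{re.escape(package)}\s*==\s*([^\s;#]+)",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(requirements_text)
    return match.group(1) if match else None


def latest_version(package: str, timeout: float = 10.0) -> str:
    """Ask the PyPI JSON API for the newest published release (needs network)."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["info"]["version"]


def major(version: str) -> int:
    """Leading numeric component of a dotted version string."""
    return int(version.split(".")[0])


def check_pin(requirements_text: str, package: str, latest: str) -> str:
    """Summarize whether the pin exists and how far it trails the latest release."""
    pinned = pinned_version(requirements_text, package)
    if pinned is None:
        return f"{package}: not pinned; add an exact pin"
    if major(pinned) < major(latest):
        return (
            f"{package}: pinned {pinned} is a major version behind {latest}; "
            "check the upgrade notes for removed arguments"
        )
    return f"{package}: pinned {pinned} (latest {latest})"


# Made-up sample requirements, for illustration only.
SAMPLE = "pulumi>=3.0.0,<4.0.0\npulumi-aws==6.13.0\n"
print(check_pin(SAMPLE, "pulumi-aws", latest="7.0.0"))
print(check_pin(SAMPLE, "pulumi", latest="3.100.0"))
```

In a real run, `latest` would come from `latest_version("pulumi-aws")`. It is passed in explicitly here so the comparison logic can be exercised offline.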

So, if you're tired of debugging state files at midnight, grab this skill. Or don't. You can always go back to Terraform, enjoy the HCL, and pretend that the terraform.tfstate file isn't plotting your demise. Personally? I'm letting the robot handle the resource graph. The last time I tried, I accidentally deleted the staging environment thinking it was a "test bucket."

(The robot hasn't made that mistake. Yet.)