AI Agents in Practice: Beyond Coding
The preceding chapters laid out the argument. Complexity always grows, and bureaucracy will not fix itself. Strategy requires mapping the terrain before you act. Teams are widening their scope, and the way we interact with software is inverting.
This chapter is a practical proof. Not a vision, but a working architecture.
Every piece of work on this book runs through an AI agent system. The chapters, the articles, the website, the PDF and EPUB production, the editorial reviews, the architecture diagrams, even the email drafts sent to co-authors. But the system does not stop at the book. Andy and I use the same architecture in our daily work as consultants and managing partners of DasScrumTeam: client logs, meeting agendas, contact profiles, email correspondence, calendar coordination, financial bookkeeping preparation. Not as an experiment, but as the daily production environment.
This chapter describes the architecture as of March 2026. On the evolution path, what you see here sits between Genesis and Custom Built. In one or two years, the tool setup will most likely look entirely different. That is not a disclaimer. That is Evolution Focus in action. The architecture changes constantly. By the time you read this, some components will have evolved further. That is the point.
This chapter describes a working system. Some tools and terms may be unfamiliar to readers outside software development, so here is a quick reference:
- Obsidian: a note-taking application that stores everything as plain text files
- Markdown: a lightweight text format using simple characters for formatting (e.g. **bold**, # heading)
- Claude Code: an AI coding agent by Anthropic that reads files, edits content, and orchestrates tools
- Git: a version control system that tracks every change to every file, enabling undo, comparison, and collaboration
- Pull request: a Git workflow where one person proposes changes and another reviews and approves them before they take effect
- Hugo: a tool that generates websites from Markdown files
- Quarto: a scientific publishing tool that renders Markdown into PDF and EPUB books
- LaTeX: a typesetting system used under the hood by Quarto for high-quality PDF output
- EPUB: an open ebook format readable on most devices and apps
- API: Application Programming Interface, a structured way for software systems to communicate
- MCP (Model Context Protocol): a protocol that connects AI agents to external services like email, calendar, and databases
- Prompt injection: a security risk where hidden instructions in external data (e.g. an email) manipulate an AI agent into unintended actions
- Miro: an online whiteboard tool for collaborative workshops
- Airtable: a cloud database with spreadsheet-like usability
The Architecture
The AI agent architecture: Obsidian as knowledge hub, Claude Code as AI engine, specialized skills, MCP server integrations, and output channels
The system has five layers. If the technical details are not your focus, skip ahead: the sections that follow demonstrate the value the system produces.
Layer 1: The User. A consultant, trainer, or author sits in front of Obsidian, a Markdown editor that serves as the knowledge hub. All content lives as plain Markdown files in a Git-versioned vault. The book chapters, client notes, email drafts, contact profiles, architecture diagrams: everything is text.
Layer 2: Obsidian + Claude Code. Obsidian connects to Claude Code through the ame3helper plugin and the Obsidian CLI (available since Obsidian 1.12). Claude Code is a coding agent, but “coding” is just the mechanism. The real capability is tool orchestration: reading files, editing content, calling APIs, spawning sub-agents, and managing persistent sessions.
Layer 3: 50+ Specialized Skills. Skills are pre-written prompts combined with pseudo-code scripts. They fall into four categories:
- Agents execute tasks autonomously: send emails, research contacts, build the book, transcribe recordings.
- Assistants work collaboratively: the Book-Writing assistant loads all chapter summaries into context before helping write. The Email assistant drafts messages but never sends without approval.
- Actions operate on specific content: rewrite a selected paragraph, link a mentioned person to their contact note.
- Infra skills manage the system itself: session tracking, note connections, tool memory.
Each skill is a Markdown file containing pseudo-code instructions. Not natural language, not full programming. Pseudo-code, because it constrains the model’s behavior more reliably than prose.
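To make this concrete, here is what such a skill file might look like. This is an illustrative sketch: the skill name, steps, and referenced files are invented for the example, not taken from the actual system.

```markdown
# Skill: draft-client-email (hypothetical example)

GIVEN: a recipient name selected in the current note

1. READ the contact profile for {recipient} from the vault
2. IF no profile exists: STOP and report "contact unknown". NEVER invent data.
3. LOAD the last three email threads with {recipient}
4. DRAFT a reply in the user's tone (see memory/writing-style.md)
5. SAVE the draft to Gmail. NEVER send without explicit approval.
```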
Layer 4: MCP Servers. The Model Context Protocol connects Claude Code to external services. Google Workspace (Gmail, Calendar, Drive), IMAP for additional email accounts, Miro for workshop boards, Airtable for customer and training databases, and vault-rag for semantic search across the knowledge vault. Each server provides a structured API that the agent can call autonomously.
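Under the hood, MCP is JSON-RPC: the agent discovers a server's tools and invokes them by name with structured arguments. The sketch below shows what such a call looks like on the wire; the tool name and arguments are hypothetical, and in practice Claude Code handles this plumbing itself:

```python
import json

# A JSON-RPC 2.0 request as defined by the Model Context Protocol.
# "tools/call" invokes a named tool on an MCP server; the tool name
# and arguments below are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_calendar",  # hypothetical tool on a calendar server
        "arguments": {"query": "sprint transition", "max_results": 5},
    },
}
print(json.dumps(request, indent=2))
```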
Layer 5: Output. The results: the ame3.ai website (Hugo), the book (PDF, EPUB via Quarto), email drafts in Gmail, LinkedIn posts, PDF letters with corporate branding, Miro boards for workshops.
How the Book Gets Built
The Book-Writing assistant is the most complex skill in the system. When started, it loads compressed context summaries of all book chapters into its working memory. This gives it awareness of the entire book structure, terminology, cross-references, and narrative flow before it writes a single word.
The publishing pipeline works like software deployment:
- I write or edit a chapter in Obsidian as plain Markdown.
- The ame3helper plugin exports content to Hugo (website) and Quarto (book).
- Quarto renders the full book as PDF and EPUB, using LaTeX under the hood. I have never looked at the LaTeX. Vibe Coding in its purest form.
- A test agent verifies that the website builds correctly and that links resolve.
- Everything is versioned in Git. Andy and I work like a software team: he comments on my chapters, I review the pull request, approve it, and the next build publishes automatically.
This workflow exists because AI agents are not reliable enough to trust blindly. Every generated output passes through version control and human review. The agent proposes. The human decides.
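For the technically curious, the render-and-verify steps might look roughly like this. A minimal sketch, assuming quarto and hugo are installed; the directory layout and the link-check script are placeholders, not the actual build code:

```python
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run one build step; abort the pipeline if it fails."""
    print("$", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"Build step failed: {' '.join(cmd)}")

# Render all formats configured in _quarto.yml (here: PDF and EPUB).
run(["quarto", "render", "book/"])

# Build the website from the exported Markdown files.
run(["hugo", "--source", "website/", "--minify"])

# Stand-in for the test agent: check that every link resolves.
run(["python", "scripts/check_links.py", "website/public/"])
```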
Memory and Retrospectives
The most underestimated capability is learning across sessions. AI models have no memory between conversations. Every new session starts from zero.
The solution is a structured memory system: Markdown files that capture rules, preferences, and context. After every task, a Learn and Remember agent analyzes what happened and proposes improvements to the skills or new memory entries. This is essentially a retrospective. The agent identifies what could work better, I review the suggestions, and the ones that make sense get merged into the system.
Over time, this compounds. The email skill learned to never send without approval. The book assistant learned to check cross-references before editing. The consulting agent learned which log format I prefer. None of these rules were written upfront. They emerged from practice.
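Rendered as files, these learned rules might look like the following. This is a hypothetical sketch of two memory entries, not a verbatim excerpt from the vault:

```markdown
# memory/email.md (hypothetical sketch)

- NEVER send an email without explicit user approval
- IF a contact cannot be found: STOP and report it. Do not guess addresses.

# memory/book.md (hypothetical sketch)

- Check cross-references against the chapter summaries before editing
- Keep terminology consistent across chapters
```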
AI in Consulting
Where this system truly proves its value is in consulting work. I keep a structured log for every client engagement: what we discussed, what the action items are, what changed since last time. Two weeks between meetings, sometimes two months. Without notes, context is lost.
With the consulting agent, I can ask: “What was the product strategy we discussed in the last session?” and get an accurate answer pulled from the log, the email history, and the calendar. Before a sprint transition, I can say: “The next sprint transition should be one and a half days. Draft an agenda based on improvements from the last retrospective.” Two minutes later, a complete agenda exists, consistent with previous sessions.
Contact profiles update automatically. When I add a new contact, the agent searches public data, pulls relevant information from past email exchanges, and creates a structured profile. This is not magic. It is structured access to data that already exists but would take thirty minutes to assemble manually.
Virtual Teams of Specialist Agents
Andy takes this further for training development. When building a new course module, he does not rely on a single agent. He deploys multiple specialist agents, each with a different perspective:
- A learning objectives agent checks whether the course actually covers the declared learning goals.
- An experienced trainer agent reviews whether the session timing is realistic.
- A narrative agent evaluates whether the course elements have a coherent storyline or are just assembled pieces.
Each agent produces independent feedback. The combined output resembles a review from a diverse team.
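The real setup runs these reviewers as Claude Code sub-agents. The sketch below shows the same pattern with direct API calls via the Anthropic Python SDK; the persona prompts are condensed, the model name is a placeholder, and the course material is a stand-in string:

```python
from concurrent.futures import ThreadPoolExecutor

import anthropic  # pip install anthropic; requires ANTHROPIC_API_KEY

client = anthropic.Anthropic()

# One system prompt per specialist perspective (condensed for this sketch).
PERSONAS = {
    "learning objectives": "You check whether a course covers its declared learning goals.",
    "experienced trainer": "You review whether the session timing is realistic.",
    "narrative": "You evaluate whether the course elements form a coherent storyline.",
}

material = "Block 1: intro (30 min) ... Block 3: simulation (45 min)"  # course module text

def review(persona: str, system_prompt: str) -> tuple[str, str]:
    """Ask one specialist agent for independent feedback on the material."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use a current model ID
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": f"Review this course module:\n\n{material}"}],
    )
    return persona, response.content[0].text

# Run all reviews in parallel; each agent answers independently.
with ThreadPoolExecutor() as pool:
    for persona, feedback in pool.map(lambda p: review(*p), PERSONAS.items()):
        print(f"--- {persona} ---\n{feedback}\n")
```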
During a live webinar, Andy demonstrated this with a new course module. He sent the same material to all three agents simultaneously. Within minutes, the results came back. All three agreed: the timing for Block 3 was unrealistic. The learning objectives agent flagged that two objectives were not at the declared level. The narrative agent found gaps in the storyline between modules. From the combined feedback, the system generated a prioritized adaptation backlog. Andy reviewed it, confirmed the priorities, and gave the go-ahead to implement the changes.
This mirrors the empirical loop at the heart of AME3: anticipate what a good course looks like, advance by generating content, assess through independent agent review, then adapt. The difference is speed. What would take a team of three human reviewers a day takes three agents a few minutes. This is practical today. Not a research concept, not a demo. A working tool that catches problems a single reviewer would miss.
What We Learned
Pseudo-code beats natural language for skills. Writing prompts as pseudo-code forces logical structure and produces more predictable behavior. Natural language is ambiguous. Pseudo-code is not.
Version control is non-negotiable. AI agents create and destroy content. Without Git, a single bad run could overwrite hours of work. The coding agent’s undo capabilities help, but a proper version history is the real safety net.
Human-in-the-loop is not optional. Prompt injection is a real risk. An incoming email could contain hidden instructions. An agent with full autonomy could leak data or send embarrassing messages. I learned this the hard way when an agent drafted an email to the wrong contact, simply because it could not find the right person and filled in data from its context window. The review step caught it. Every sensitive action requires explicit approval.
Start with software development practices. Test-driven development, code review, version control, continuous integration: these practices exist because humans make mistakes. AI agents make different mistakes, but the same practices apply. Treat content like code.
The tools change constantly. Six months ago, this system ran on a custom API integration. All of that was discarded and rebuilt on Claude Code, because it was simply more capable. Next month, something else might change. Build for evolution, not permanence.
Energy cost is real. Running a model like Opus is energy-intensive. Every prompt, every search, every sub-agent costs computation and therefore electricity. The cost is roughly 150 to 200 dollars per month per person for heavy use at the time of writing. Smaller models handle simpler tasks already. The system routes research queries to lighter models automatically.
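The routing itself can be simple. A sketch of the idea with hypothetical model tiers; the real system's heuristics are more involved:

```python
# Route each task to the cheapest model that can handle it.
# The tier values are hypothetical; substitute current model IDs.
MODEL_TIERS = {
    "light": "small-fast-model",   # classification, search query rewriting
    "medium": "mid-tier-model",    # summaries, routine drafts
    "heavy": "frontier-model",     # long-form writing, multi-step planning
}

def pick_model(task_type: str) -> str:
    """Very rough cost-aware routing by task type."""
    if task_type in {"classify", "extract", "rewrite_query"}:
        return MODEL_TIERS["light"]
    if task_type in {"summarize", "draft_email"}:
        return MODEL_TIERS["medium"]
    return MODEL_TIERS["heavy"]  # default: do not under-power complex work

print(pick_model("summarize"))  # -> mid-tier-model
```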
Efficiency improvements will come. But the Jevons Paradox predicts what will happen next: savings will not reduce spending. They will be reinvested into more capable tools, broader agent coverage, and tasks that were previously not worth automating. As long as the productivity gain far exceeds the cost, I expect this to become the standard computing budget per person for knowledge workers. Not a temporary spike, but the new baseline.
Limitations of AI Today
This chapter shows what works. Honesty demands showing what does not. The AI introduction describes why Large Language Models are fundamentally non-deterministic: they are coherence engines, not truth engines. Here is what that means in daily practice.
Hallucination. The model confidently generates plausible but wrong information. It does not know what it does not know. The email incident described above is the textbook case: the agent could not find a contact, did not report failure, and instead filled in a plausible name and address from its context window, drafting a message to someone who did not exist. Every output needs verification, especially when it looks convincing.
Context window limits. Models forget. In long conversations, earlier context fades. Skills degrade on complex multi-step tasks because the model loses track of decisions made twenty steps ago. This is why the memory system described earlier exists: structured files that persist across sessions, because the model itself cannot remember.
Knowledge cutoff. Models do not know what happened after their training date. They cannot tell you about last week’s product launch or yesterday’s regulatory change. This is why tool access matters: the model must be able to query live data rather than relying on what it learned during training.
Domain gaps. General LLMs train on broad internet text. Specialized domains (medicine, manufacturing, legal, non-English technical language) are underrepresented in that data, and performance drops significantly. Fine-tuning on domain data or retrieval-augmented generation can help, but both require significant investment. Enterprises with proprietary process knowledge face a structural disadvantage: their most valuable data is exactly what the model has never seen.
Bias. Every model carries the biases of its training data and its alignment tuning. RLHF (Reinforcement Learning from Human Feedback) is where alignment is shaped, and whoever pays for that tuning shapes the output. Different vendors produce observably different behavior on the same input. This is not neutral. Detecting bias is hard. No universal test exists.
A practical mitigation: run the same task through models from different vendors and compare the results. This book’s editorial process uses exactly this approach, reviewing chapters with Claude, GPT-4o, and Mistral in parallel. Where they agree, the finding is likely real. Where they disagree, the difference reveals alignment choices worth examining.
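A sketch of that comparison loop; call_model is a hypothetical adapter that hides each vendor's SDK behind one signature:

```python
def call_model(vendor: str, prompt: str) -> str:
    """Hypothetical adapter: in a real setup this would wrap each
    vendor's SDK (Anthropic, OpenAI, Mistral) behind one signature."""
    return f"[{vendor} review of: {prompt[:40]}...]"  # stub for the sketch

VENDORS = ["anthropic", "openai", "mistral"]

def cross_review(prompt: str) -> dict[str, str]:
    """Run the same editorial prompt through every vendor."""
    return {vendor: call_model(vendor, prompt) for vendor in VENDORS}

# Where the answers agree, the finding is likely real; where they
# diverge, the difference points at alignment choices worth examining.
results = cross_review("List factual errors in the attached chapter.")
for vendor, answer in results.items():
    print(f"--- {vendor} ---\n{answer}\n")
```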
The better long-term answer is models with open training data, not just open weights. The AI introduction already notes that most “open-source” models only publish weights while keeping training data and algorithms closed. Truly open models are emerging. Apertus, developed by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre, publishes everything: training data, code, weights, and alignment methods. It supports over 1,000 languages and was designed for EU AI Act compliance. OLMo from the Allen Institute follows a similar philosophy. These are still exceptions, because curating training data at scale is expensive. But open training data is the only path to auditable, reproducible AI. Enterprises should watch this space and push for transparency from their vendors.
None of these limitations are permanent in their current form. Context windows are growing. Domain-specific models are improving. Bias detection methods are maturing. But the fundamental nature of the technology, non-deterministic, probabilistic, dependent on training data, will not change. The response is not to wait for perfection. It is to work empirically: observe what the system produces, test your assumptions, set boundaries, and adapt.
What This Means for Enterprises
The architecture described here runs on a single consultant’s laptop. It is not enterprise infrastructure. But the patterns transfer directly.
Consider what this setup already does: the same architecture that manages consulting logs and email drafts also builds the website, renders the PDF, and runs editorial reviews for this book. One person, dozens of agents, a complete publishing pipeline. Now scale that to a Team of seven. The scope of what a single unit can own expands dramatically.
Three patterns emerge that apply to any enterprise:
Knowledge management becomes executable. Instead of documentation that nobody reads, knowledge lives in a structured vault that agents can query, update, and act on. The knowledge works for you, not the other way around.
Specialist agents replace specialist roles for routine tasks. Not the deep expertise, but the routine application of that expertise: reviewing compliance, checking consistency, drafting standard communications. Humans focus on the judgment calls.
Software development practices become universal. Version control, code review, continuous integration, test automation: these are no longer limited to software teams. Any Team that works with text, data, or structured content can benefit. Treat content like code.
Now imagine a compliance Team in a regulated industry. Their regulatory requirements live not in a PDF that nobody reads, but in a structured vault that an audit agent can query and cross-reference, flagging inconsistencies as it finds them. When new regulations arrive, the agent identifies which existing policies are affected and drafts updates. The compliance officer reviews, adjusts, and approves. The knowledge works for the team, not the other way around. The human is the Governor, not the operator.
This is what the AI-Enhanced Team looks like in practice. Not a team that writes code faster, but a team that owns an entire value chain, from customer need to delivered outcome. The human role in this architecture is the Governor: setting constraints, reviewing outcomes, adjusting boundaries. AI agents handle the routine. Humans focus on judgment, relationships, and decisions under uncertainty.
This setup is also an example of choosing Path A: using AI to simplify, not to amplify. No additional reporting layers, no extra compliance dashboards, no meetings about meetings. The agents reduce coordination overhead instead of generating more of it.
AI Can Only Build on What Is Already There
But here is what this architecture cannot do. It cannot fix a broken organization. AI agents amplify whatever structure they operate within. If your Teams lack clear ownership, agents will produce work that nobody is accountable for. If your backlog is a list of competing priorities, agents will execute competing priorities faster. If decisions require five levels of approval, agents will generate five times more requests for approval.
This architecture works because it operates within a structure designed for autonomy, transparency, and short feedback loops. AME3 provides that structure. The Match cadence creates regular inspection points. The Arena Owner provides clear decision authority. The System Lead ensures the work system evolves with the tools. Without these foundations, AI agents become expensive tools for accelerating dysfunction.
This is why the sequence matters. Before you deploy AI agents, your enterprise may need to answer more fundamental questions. How do you plan when you cannot predict? How do you choose the right framework for your situation? How do you move from projects to product? Whether you begin with AI or with the organizational foundations, the same empirical principles apply.
We are not there yet. But the architecture is already working.