Tabletop role-playing games (RPGs) such as Dungeons & Dragons (D&D) model many of the state management, task decomposition, workflow organization and context engineering challenges that long-horizon agents face. They make a great case study in long-horizon agent engineering where we can learn from both the practices of the players and the Dungeon Master (DM) who runs the game.
Like others, I’ve tried to play solo RPGs and adventure games directly with ChatGPT / Gemini / Claude via their chat UI. While LLM chat applications can convincingly create a world setting, narrate a scenario and interact with the user over a modest number of turns, they tend to eventually descend into problems with self-consistency, rambling storylines, hallucination and incorrect application of the rule mechanics governing the game.
This led me to start thinking about what it would take to create some coherent approximation of the DM capability in agent form. We’ll explore the connection between the challenges of long-horizon agent engineering and D&D DMing in this post, grounded in a demonstration DM agent we built as a testbed for a skill-based agent platform we’ve been developing.
What is a long-horizon agent anyway?
Let’s take the simple definition of an agent as an LLM with tools (and optionally memory) working towards some goal.
We’re particularly interested in long-horizon agents. These are the agents that execute over extended time frames to perform deep research, discover software vulnerabilities, analyze financial reports, clone a browser engine or play and referee complex games like D&D.
There is no official definition for long-horizon agents. The term means different things to different people, but the following criteria are commonly applied:
- Human equivalent task time. While an agent may run for hours, wall-clock time alone is not useful, as it would conflate a long-horizon agent with a cron job. Instead, a task-complexity metric such as METR’s task-completion time horizon (the duration of a task, measured in human expert time, at which an agent succeeds with a given reliability) is more helpful. A long-horizon agent might diagnose a network fault that would take a human hours to discover with various tools and probes. In contrast, a stateless agent that polls email every hour for months and labels new emails according to some fixed criteria in its system prompt is not operating on a long horizon.
- Trajectory length between goal setting, decisions and outcome evaluation. A tool call made by an LLM in a long-horizon agent may have consequences that are only apparent and assessable dozens or hundreds of steps later. Errors compound under this delayed feedback, and the naive math has steep odds: if each step has a 95% chance of success, then without recovery only about 0.6% of 100-step runs will succeed. Long-horizon agents need to self-heal and correct behavioral drift due to errors in a way that agents with shorter horizons do not.

  Even without errors, extended trajectories mean that context needs to be carefully managed to avoid accumulating historical context that is duplicative or no longer helpful. Goal drift occurs as agents pay attention to more recent context. Long-horizon agents require an architecture that structurally reduces or eliminates these problems.
- Outcome verifiability. Long-horizon agent outcomes are often hard to verify at all: these agents are typically applied to tasks with no easy, or sometimes any, objective verification criteria. Coding agents doing something like a pure refactor are an exception, where existing tests continuing to pass is a clean verification condition. Many long-horizon agents have outcomes that are verifiable only by human judgment, either directly or via some proxy such as LLM-as-judge standing in for that judgment.

  These judgments are inherently subjective. How “good” is the research report, the banter between podcast hosts or the architecture of a new code base?
- Accumulated stateful side effects. As the agent executes, it makes decisions that modify external state, for example updating a row in a database table, creating or deleting files, or invoking external payment APIs. Some of these actions are irreversible or are security risks. Long-horizon agents often accumulate significantly more stacked side effects and a larger blast radius than shorter-horizon agents. Even a deep research agent that does not mutate the external world will gather and manipulate intermediate research artifacts that are further transformed into other intermediate artifacts and eventually the final report.
There are engineering responses to the challenges of long-horizon agents, which have been covered at length elsewhere. They include context engineering, externalization of state, task decomposition and planning, agent-specific evals, agent harness architecture, Human-in-the-Loop at critical decision points, and so on. We’re not going to dive deep into all of these but will touch on a few as they arise in the context of a DM agent.
How long-horizon challenges are solved in D&D
D&D is a multi-player tabletop role-playing game that involves storytelling, imagination and rule-based mechanics, evolving unstructured and structured state over player turns. There are free-form exploration phases and also structured phases such as combat. A human or agent that participates as player or DM needs to follow rules and maintain narrative consistency with the world that they play in. I’m going to assume some basic familiarity below. If you’ve never played before, you should; it’s a lot of fun. If you’ve ever played a computer RPG like Baldur’s Gate, that’s also fine background.
A D&D session or campaign easily meets the definition of a long-horizon agent task (for both players and DM):
- Human equivalent task time. D&D sessions run over hours as players and the DM engage with the adventure and different phases of the game. Sessions are grouped into campaigns that may span years. This seems uncontroversially a long-horizon human endeavor.
- Trajectory length between goal setting, decisions and outcome evaluation. A D&D session occurs over dozens of turns and ultimately hundreds of steps if you include actions such as combat mechanics. Somehow the game avoids collapsing into nonsense where players or the DM are cognitively overloaded and errors fatally derail the trajectory.

  The reasons are structural rather than accidental. The game has workflows with phases (character creation, exploration, social encounter, combat, downtime), each with explicit boundaries and per-phase rules. Each phase has a manageable working set: combat tracks initiative order and HP, whereas exploration tracks party location and elapsed game time. State that does not belong to the current phase is parked in writing rather than carried in anyone’s head. Distributed memory helps too; each player owns the state of their own character on a sheet and in their head, while the DM owns the world.
- Outcome verifiability. The measure of a D&D game’s outcome is not pass/fail; it is whether the players had fun. DMs gauge this through the quality of player responses, and small differences in this signal separate boredom from engagement.
- Accumulated stateful side effects. D&D externalizes nearly all consequential state to shared artifacts, which are manipulated during player turns. These include character sheets, the initiative tracker, the map, the DM’s notes, and dice rolls. When a player character takes damage, the player erases and rewrites a number on the character sheet. State updates compound at every step of the game.

  Transitioning through different phases of the game allows earlier state and history to be discarded. After character creation, all that matters is the resulting character sheet. After combat finishes, the initiative order no longer matters. After a room is completed, never to be returned to, its state is no longer material. This provides a form of explicit compaction for players and DM.
Consistency and world-model concerns abound in the game. If an item is picked up and added to a player’s inventory, it no longer exists in the room in which it was found. If a player has a backstory as a farmer, this should be a durable fact and they should not abruptly become a software engineer. Humans are very good at maintaining self-consistency and reasoning along spatial and temporal dimensions. We innately understand concepts like object permanence.
The social dynamics of D&D can help self-heal errors before they contribute to drift. If a player attempts to make an illegal move or inadvertently cheat, the DM or other players can call them on it. The game is fault tolerant, with any accidental error correctable through narrative devices and unlikely to lead to complete narrative collapse.
Beyond just the state and rules that govern the game, the DM is responsible for planning the narrative arc of a session and ensuring it creates interest for the players. A combination of pre-planning, real-time human reasoning and storytelling, and notes taken for future sessions allows the DM to create an engaging and coherent storyline.
The Fraying
While working on ReadyLoop.ai’s agent platform I was looking for a testbed long-horizon agent to implement as a skill-based agent. Our platform provides a managed agent runtime with durable storage aimed at skill-based agents following the agentskills.io architecture, similar in some respects to the recently announced Claude Managed Agents and OpenAI’s workspace agents. A DM agent that could facilitate a single-player experience of a D&D-style game seemed like a natural fit to explore long-horizon agent engineering challenges on ReadyLoop. This is far from a fully capable DM agent but provides a starting point to explore the engineering challenges in this domain.
The game is built on a subset of the rules and features in the 5.5e SRD, set in the world of Aurvelen, where the Fundament, a vast lattice of arcane threads that encodes the operating rules of reality, has been discovered. The Fundament has simultaneously uplifted communities and individuals by curing sickness and increasing crop growth while leaving its users feeling thinned out and purposeless.
Our DM agent is implemented as a skill in the style of agentskills.io: a SKILL.md file bundled with Python scripts that the agent calls as tools. The scripts provide game state initialization and guided game evolution. There are over 50 of these scripts, with names like combat.py, death_restart.py, spellcasting.py, codex.py, dice.py, ability_generator.py and so on.
A quick sketch of the architecture is provided below. In the interest of brevity, I’m going to skip over details of the ReadyLoop agent platform; I’ll dive into those in later posts. In yellow you can see the static Markdown and image assets that follow the principle of progressive disclosure of state to the agent. There’s a lot going on in a game like The Fraying, even at the level of world building and the lore that explains Aurvelen. We don’t dump all of this into the prompt context at all times; instead, the content is structured so that a high-level overview is available in the system prompt via SKILL.md and enough detail is available for the agent to lazily fetch specific content on demand. Similarly, SKILL.md has a skill catalog that points the agent to the key scripts that organize the game workflow. Durable state is persisted to a distributed filesystem provided by the platform so that a turn can record and retrieve variables like the player’s Hit Points.
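To illustrate that last point, here is a minimal sketch of durable per-turn state, assuming a plain JSON-files-on-a-filesystem layout. This is not the ReadyLoop storage API; the directory name and helper functions are invented for illustration:

```python
# Sketch of durable per-turn state on a filesystem: record and retrieve
# a variable like the player's Hit Points across turns and context resets.
import json
from pathlib import Path

STATE_DIR = Path("state")  # hypothetical mount point for durable storage

def write_var(name: str, value) -> None:
    """Persist a single named variable as a JSON file."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    (STATE_DIR / f"{name}.json").write_text(json.dumps(value))

def read_var(name: str, default=None):
    """Read a named variable back, falling back to a default."""
    path = STATE_DIR / f"{name}.json"
    return json.loads(path.read_text()) if path.exists() else default

# A combat script can debit HP on one turn...
write_var("player_hp", read_var("player_hp", default=20) - 7)
# ...and a later turn, running in a fresh context, reads it back.
print(read_var("player_hp"))
```

The point is that nothing about the player’s Hit Points lives in the prompt context; any turn can reconstruct it from disk.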
Does all the state and structure discussed in the previous section make it easier to build a long-horizon agent that DMs D&D? Yes and no. The pre-existing tabletop state structure is a natural starting point for externalized state in a DM agent. You want to track the character sheets and maintain a session log on the filesystem, distinct from prompt context. There were still some major gaps to address:
- Storyline creation and consistency. The DM agent uses storyline JSON files which track plot premise, NPCs, facts, rooms and possible encounters. To keep things simple and entertaining, storylines follow the 5-room dungeon format, a well-known DM trope with rooms organized as a story arc: entrance, puzzle, twist, climax and reward. World facts are assigned unique keys and refer to conditions such as “A binding verdict on the caravan’s status has been pronounced before the council”. Each room has preconditions for entry (facts that must have been established). Game phases such as combat can establish facts on completion, as can the DM agent during story exposition. Without this structure, the DM agent would randomly transition rooms or invent NPCs. The LLM is still allowed to provide ad hoc narration around the storyline, but with strong guidance in its SKILL.md biasing it towards grounding in the facts provided by the storyline.
- World model. LLMs start with the disadvantage of limited world models and the inability to maintain any state outside their context window unless we teach them how to use tools and externalize state. In addition to the fixed storyline, we needed to track the dynamic state of each room: which items were in the room, which monsters were present, which doors were open, etc. Items transferred between the room and inventory needed to be modeled as a script invocation, and scripts needed to raise an error if the LLM decided to pick up an item that didn’t exist in a room.
- Sentinels and interlock. Each turn of the game involves multiple tool calls by the LLM to different scripts. We needed to keep track of the current state on the filesystem across these calls and have scripts read it to know whether an action was permitted. For example, we needed to know whether a restart following a player character death was permissible, based on a file sentinel written only when the player character dies. Scripts emit a “next action” hint to the LLM to make it aware of the scripts and arguments available to advance the workflow. This essentially models a workflow graph through state variables, script preconditions and next-action hints. Skill-based agents do not have an external workflow engine or orchestration capability, but can achieve the same effective workflow trajectory.
- Session logging. Every significant “beat” in the game was written to a Markdown log file by the agent. This provides a long-term memory that survives compaction and even player death.
- Compaction management. The DM agent was a single flat agent with no explicit subagent invocations. We achieved a similar effect to invoking multiple subagents by explicitly compacting with a primitive provided by the ReadyLoop agent platform. This compaction primitive lets scripts write additional context into the post-compaction prompt, providing a controlled context reset. Explicit compaction was performed at game phase transitions: following character creation, on room entry, before combat began, after combat ended, etc. This bounded context growth over the game’s extended horizon while avoiding the unpredictability of compaction forced by context window limits (which the agent harness also provided, and which we engineered scripts and prompts defensively against). Combined with context engineering optimizations baked into the ReadyLoop harness (for example, tool result pruning), this let us curb context growth and mitigate context rot.
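To make the storyline structure concrete, here is a hypothetical sketch of the fact/precondition mechanics from the first bullet. The schema, fact keys and room ids are invented for illustration; only the quoted fact text comes from the actual game:

```python
# Illustrative shape of a storyline file and the precondition check a
# room-transition script might run before allowing entry.

STORYLINE = {
    "premise": "A caravan stands accused before the council.",
    "facts": {
        "caravan_verdict_pronounced": (
            "A binding verdict on the caravan's status has been "
            "pronounced before the council"
        ),
    },
    # The 5-room dungeon arc: entrance, puzzle, twist, climax, reward.
    "rooms": [
        {"id": "entrance", "preconditions": []},
        {"id": "puzzle", "preconditions": []},
        {"id": "twist", "preconditions": ["caravan_verdict_pronounced"]},
        {"id": "climax", "preconditions": ["caravan_verdict_pronounced"]},
        {"id": "reward", "preconditions": []},
    ],
}

def can_enter(room_id: str, established: set) -> bool:
    """A room is enterable only once its precondition facts are established."""
    room = next(r for r in STORYLINE["rooms"] if r["id"] == room_id)
    return all(fact in established for fact in room["preconditions"])

established_facts = {"caravan_verdict_pronounced"}  # set by a completed phase
print(can_enter("twist", established_facts))
print(can_enter("climax", set()))
```

Because entry is gated on established facts rather than on the LLM’s narration, the agent cannot skip ahead in the arc by inventing a transition.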
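The item-transfer rule from the world-model bullet can be sketched in a few lines. All names here are invented; the point is that a hallucinated pickup surfaces as a hard error the LLM must react to:

```python
# Sketch of an item-transfer script: picking up an item moves it from the
# room's dynamic state into the inventory, and a nonexistent item raises,
# so the LLM cannot invent loot that was never placed in the room.

room_state = {"items": ["rusty key", "torch"], "doors_open": ["north"]}
inventory = []

def take_item(item: str) -> str:
    if item not in room_state["items"]:
        # Surfacing an error teaches the model the item is not there.
        raise ValueError(f"No '{item}' in this room.")
    room_state["items"].remove(item)  # object permanence: it leaves the room
    inventory.append(item)
    return f"Took {item}."

print(take_item("rusty key"))
print(room_state["items"])  # the key now exists only in the inventory
```

This is the externalized equivalent of a human player erasing an item from the room description and writing it onto their character sheet.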
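The death-restart sentinel is a minimal instance of the sentinel-and-interlock pattern. This is an illustrative sketch, not the actual scripts; the file path and return shape are assumptions:

```python
# Sketch of a file sentinel gating a restart: death_restart is permitted
# only if a sentinel written at player-character death exists, and each
# script returns a "next action" hint to steer the workflow.
from pathlib import Path

DEATH_SENTINEL = Path("state/player_dead.sentinel")

def on_player_death() -> dict:
    DEATH_SENTINEL.parent.mkdir(parents=True, exist_ok=True)
    DEATH_SENTINEL.touch()  # durable proof that a death occurred
    return {"result": "The character falls.",
            "next_action": "run death_restart.py to begin a new character"}

def death_restart() -> dict:
    if not DEATH_SENTINEL.exists():
        raise RuntimeError("Restart not permitted: the character is alive.")
    DEATH_SENTINEL.unlink()  # consume the sentinel so restart is one-shot
    return {"result": "Restarted.",
            "next_action": "run ability_generator.py for the new character"}

print(on_player_death()["next_action"])
print(death_restart()["result"])
```

The sentinel plus the next-action hint is what lets a flat skill-based agent walk a workflow graph without any external orchestrator.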
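Finally, a sketch of phase-boundary compaction, with a stand-in for the platform primitive (ReadyLoop’s real compaction API is not shown here; `compact` and `end_combat` are invented for illustration):

```python
# Sketch of explicit compaction at a phase boundary: the script decides
# exactly what survives into the post-compaction prompt, so the context
# reset is controlled rather than forced by window limits.

def compact(carry_forward: str) -> str:
    """Stand-in for a platform primitive that resets the agent's context,
    seeding the fresh prompt with only the text passed in."""
    return f"[context reset]\n{carry_forward}"

def end_combat(summary: str, player_hp: int) -> str:
    # After combat, the initiative order is irrelevant; only the outcome
    # and surviving durable state need to cross the boundary.
    return compact(
        f"Combat resolved: {summary}\n"
        f"Player HP is now {player_hp}. Resume exploration of the room."
    )

print(end_combat("the ambushers are driven off", player_hp=9))
```

Everything not written into the carry-forward string (or already persisted to the filesystem) is deliberately forgotten, which is exactly the explicit compaction players get when combat ends at a real table.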
Taking a step back, we’ve essentially built an RPG engine and expressed it through a set of agent tools and durable agent skill state. Building an RPG engine in 2026 is not particularly novel, but when combined with the agent’s LLM as driver and narrator, we ended up with an interactive game fiction experience. This agent was expressible as a skill bundle, remained consistent with the rules of D&D and maintained a coherent (if limited) world model over an extended horizon.
You can play the game at thefraying.com with WhatsApp or browser-based chat. Please reach out directly to me if you have any feedback and subscribe for future posts!