The Conversation Happening Behind Closed Doors

There's a particular type of meeting happening right now at Am Law 100 firms that nobody is writing press releases about. An innovation leader presents updated metrics from the agentic AI pilot that launched with considerable fanfare in early 2025. The numbers are complicated. Some efficiency gains materialized. Some didn't. A few incidents — a flawed research output that propagated further than it should have, a client question about autonomous task completion that nobody had a clean answer to — are being characterized internally as "learnings." The managing partner who forwarded the launch announcement to the firm's LinkedIn page has not mentioned the initiative publicly since.

This is the agentic cliff. Not a cliff in the sense that the technology failed — it largely didn't. A cliff in the sense that firms ran hard toward automation without first mapping the terrain, and discovered the drop only after they were already in the air.

The lazy contrarian read is that agentic AI is overhyped. That's not the argument here. Agentic systems — tools that can independently research, draft, route, and synthesize across multi-step workflows without constant human intervention — represent a genuine capability shift, and the firms dismissing them entirely will pay a different kind of price. The real insight is more precise: the firms succeeding with agentic AI aren't the ones who moved fastest. They're the ones who built deliberate human judgment architecture into their workflows before they automated them. The cliff isn't the technology. It's the gap between how a workflow appears to operate and how it actually needs to operate to meet professional responsibility standards.

The Supervision Paradox: When "Review" Becomes Theater

When an agentic system runs a 40-step research-and-draft workflow autonomously, the supervising attorney is left reviewing an output rather than participating in a process. That distinction matters enormously under professional responsibility frameworks — and bar associations are beginning to say so explicitly.

Guidance from the New York State Bar Association, the California State Bar, and the UK's Solicitors Regulation Authority, issued and updated between 2025 and early 2026, has increasingly signaled that attestations of final document review may not satisfy competence obligations when intermediate reasoning is opaque or unreviewable. The principle isn't new — it's the same logic that governs delegation to junior associates — but the scale and opacity of agentic pipelines create a new version of an old problem.

Consider a scenario playing out at multiple firms: a litigation support pipeline that autonomously pulls case law, synthesizes arguments, and drafts motion sections across hundreds of matters monthly. The supervising partner reviews the finished brief. The chain of reasoning that selected certain cases over others, the judgment calls made in synthesizing conflicting precedents, the implicit framing decisions — none of that is visible in the output. The partner signs off. The brief is filed. If it later emerges that the agent mischaracterized a line of authority, "I reviewed the final document" is not a robust defense. The profession has a term for this: supervision theater. The appearance of oversight without its substance.

Even a 10% per-step error rate on intermediate reasoning — a conservative assumption for any complex legal reasoning task — compounds rapidly across a long workflow, and multiplied across high-volume matter types it represents meaningful malpractice exposure. At most firms, that math was never run before deployment because the efficiency gains dominated the analysis.
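
The math is short enough to run in a few lines. A back-of-envelope sketch, under the simplifying assumption that per-step errors are independent and occur at a fixed rate:

    # Probability that a multi-step agentic run contains at least one flawed
    # intermediate step, assuming (for illustration only) that per-step
    # errors are independent and occur at a fixed rate.
    def p_flawed_run(per_step_error_rate: float, steps: int) -> float:
        return 1 - (1 - per_step_error_rate) ** steps

    print(f"{p_flawed_run(0.10, 40):.1%}")  # ~98.5% for the 40-step workflow above
    print(f"{p_flawed_run(0.02, 40):.1%}")  # ~55.4% even at a generous 2% rate

At any plausible per-step rate, the question isn't whether a long autonomous run contains a flawed step. It's whether the review process is positioned to catch it.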

Knowledge Contamination: The Risk That Siloed Tools Never Created

One flawed research output from a traditional AI tool affects one matter. One flawed agentic write-back to a firm's master contract playbook affects every associate who touches that clause for the next 18 months.

This is the knowledge contamination risk, and it caught firms without mature knowledge management infrastructure almost entirely off guard. Agentic systems designed to write back to knowledge bases — automatically tagging precedents, updating standard positions, flagging market norms — can propagate confident errors at scale in ways that siloed tools never could. The agent doesn't know what it doesn't know, and it writes with the same confident syntax whether it's right or wrong.

Knowledge management professionals at firms with dedicated KM leadership understood this risk intuitively from the beginning. The failure mode isn't exotic; it's a version of the same problem that emerges when a talented but unsupervised junior associate updates a precedent database without partner review. The difference is velocity and volume. An agent can execute in an hour what would take a first-year associate three weeks, which means the error propagation timeline compresses dramatically.

Firms navigating this successfully have implemented a clear architectural principle: agentic systems can propose knowledge base updates; humans with explicit subject-matter authority confirm them. That single checkpoint, rigorously enforced, is the difference between an agentic system that strengthens institutional knowledge and one that silently corrupts it.
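
In code, that checkpoint is small. The sketch below is illustrative rather than any vendor's API (the class, roster, and field names are hypothetical), but it shows the property that matters: agents can only ever create pending proposals, and the sole write path into firm standard runs through a named subject-matter authority.

    # Illustrative sketch of the propose/confirm checkpoint. Class names,
    # reviewer roster, and fields are hypothetical, not any vendor's API.
    from dataclasses import dataclass

    SME_REVIEWERS = {"contracts": {"j.alvarez", "p.chen"}}  # hypothetical roster

    @dataclass
    class KBProposal:
        clause_id: str
        practice_area: str
        proposed_text: str
        agent_rationale: str
        status: str = "pending"  # the only status an agent can ever create
        reviewed_by: str = ""

    def approve(proposal: KBProposal, reviewer: str) -> None:
        # Hard gate: only a named subject-matter authority can promote a
        # proposal into firm standard. The agent has no write path of its own.
        if reviewer not in SME_REVIEWERS.get(proposal.practice_area, set()):
            raise PermissionError(
                f"{reviewer} lacks SME authority for {proposal.practice_area}")
        proposal.status = "approved"
        proposal.reviewed_by = reviewer

The enforcement detail matters more than the data model: if the agent holds credentials that can write to the knowledge base directly, the checkpoint is policy, not architecture.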

Client Expectations Have Lapped Internal Policy

Major financial institution and PE fund clients — themselves now sophisticated AI users with their own governance frameworks — are inserting pointed questions into outside counsel guidelines that didn't exist two years ago. Which tasks were completed by autonomous agents? What was the human review checkpoint? Can you demonstrate the agent's reasoning chain? What data was the agent operating against, and was any of our matter information used to train or refine the model?

Firms that deployed agentic tools without building corresponding audit and explainability infrastructure are being caught flat-footed in these conversations. The consequences aren't hypothetical: at least one Am Law 50 firm lost a significant capital markets mandate in late 2025 when it could not produce a coherent answer to a client's AI governance questionnaire. The client's legal operations team had experienced a bad outcome with another outside vendor and was applying heightened scrutiny across its panel. The firm's competitors who could answer the questions — here is our agent taxonomy, here is where human judgment gates exist, here is the audit trail for this matter type — won the work.

This is an active competitive differentiator right now, not a future risk. Firms that can speak confidently about their agentic governance architecture are winning mandates from clients who've already learned what the absence of that architecture looks like.

The Brilliant Intern Problem at Enterprise Scale

The analogy circulating among legal innovation leaders captures something precise: agentic AI behaves like an extremely capable first-year associate who has read everything but has no judgment about what matters in this specific client relationship. The brilliant intern knows the law. The brilliant intern does not know that this particular GC has a risk tolerance shaped by a board incident two years ago, or that this client relationship has a history that makes a particular drafting approach politically fraught, or that this clause has been litigated with this counterparty before and the firm's standard position was adjusted as a result.

That's manageable at one or two deployments. It becomes a firm-wide quality and culture problem when the brilliant intern is running thousands of parallel tasks daily with no partner-level context, and when the supervising attorneys believe the agent has context it doesn't have.

Firms that defined their matter-type taxonomy, their client sensitivity tiers, and their escalation logic before deploying agents are managing this effectively. A Midwest-based Am Law 100 firm built a three-tier client sensitivity framework before expanding its agentic deployment: standard matters, elevated-sensitivity matters requiring intermediate human checkpoints, and restricted matters where agents operate in suggestion-only mode. The framework took three months to build and almost caused the innovation team to miss their deployment target. Eighteen months later, it's the reason they haven't had an incident and their client-facing governance narrative is coherent. The firms that deployed and planned to "tune later" are still tuning, against a backdrop of accumulating incidents.
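
Enforced technically, a framework like that one is not much code. A minimal sketch, assuming a hypothetical matter-intake mapping from client to tier:

    # Sketch of the three-tier framework, enforced in code rather than in a
    # policy memo. Client identifiers and field names are hypothetical.
    from enum import Enum

    class SensitivityTier(Enum):
        STANDARD = "standard"      # full agentic workflow permitted
        ELEVATED = "elevated"      # intermediate human checkpoints required
        RESTRICTED = "restricted"  # agent runs in suggestion-only mode

    CLIENT_TIERS = {"example-fund": SensitivityTier.RESTRICTED}  # from matter intake

    def agent_permissions(client_id: str) -> dict:
        # Unknown clients fail closed into the checkpointed tier, not the open one.
        tier = CLIENT_TIERS.get(client_id, SensitivityTier.ELEVATED)
        return {
            SensitivityTier.STANDARD:   {"autonomous": True,  "checkpoints": False},
            SensitivityTier.ELEVATED:   {"autonomous": True,  "checkpoints": True},
            SensitivityTier.RESTRICTED: {"autonomous": False, "checkpoints": True},
        }[tier]

The design choice worth copying is the default: a client nobody has classified lands in the checkpointed tier rather than the open one.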

The Infrastructure Fault Line That's Splitting the Market

A structural divide is now visible between two categories of firms. The first purchased agentic capability as a feature from a single vendor — fast to deploy, relatively easy to demonstrate, difficult to govern and nearly impossible to audit at the component level. The second built composable AI infrastructure where agents operate within firm-controlled guardrails, knowledge layers, and audit trails, often combining capabilities from multiple systems under a unified governance architecture.

The first group achieved faster initial deployment metrics and generated more compelling 2025 press coverage. The second group is pulling ahead on durability, client confidence, and the ability to iterate without rebuilding from scratch when requirements change — and legal AI requirements are changing continuously as bar guidance evolves, client expectations shift, and matter types expand.

This is the buy-versus-build-versus-architect conversation that CIOs should be leading right now, and most aren't having rigorously enough. The question isn't "which vendor has the best agentic features?" The question is "when the next bar guidance update requires us to demonstrate intermediate-step audit trails, can our current infrastructure produce them?" Vendors who control the full stack have little commercial incentive to make that question easy to answer. Firms that own their governance layer do.
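
To make the question concrete: component-level auditability means one append-only record per agent step, not one per finished document. A sketch of what such a record might capture, with illustrative field names:

    # One append-only record per agent step -- not per final output. Field
    # names are a sketch, not a standard.
    import hashlib
    import json
    from datetime import datetime, timezone

    def log_step(trail: list, matter_id: str, step_name: str,
                 inputs: dict, output_text: str) -> None:
        record = {
            "matter_id": matter_id,
            "step": step_name,  # e.g. "authority-selection", "synthesis"
            "inputs": inputs,   # what the agent was given at this step
            "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        trail.append(json.dumps(record, sort_keys=True))

A trail built this way can be handed to a client's legal operations team, or if necessary a disciplinary committee, without after-the-fact reconstruction.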

What Responsible Agentic Deployment Actually Looks Like

This is the constructive turn. Not a product pitch — a framework drawn from what's actually working in the field.

Successful agentic deployments share three structural characteristics:

  • Workflow archaeology before automation. Before any agent is deployed, the existing human workflow is mapped in detail — every decision point, every judgment call, every place where an experienced attorney does something a first-year couldn't. This is not glamorous work. It typically takes two to four weeks per matter type. It is the single most reliable predictor of deployment success. Firms that skip it are essentially asking the agent to replicate a process nobody has fully articulated, and then expressing surprise when it replicates it imperfectly.
  • Graduated autonomy tied to demonstrated performance. Agents earn expanded scope through audited performance on real matter samples, not through IT deployment schedules or vendor roadmaps. This means defining, in advance, what "good" looks like for each task type, running a structured evaluation period with human review of agent outputs at the component level, and gating expansion on measurable accuracy thresholds. It's the same logic applied to any new associate's advancement — and the parallel is professionally useful when explaining the framework to skeptical partners. A minimal sketch of the gating logic appears after this list.
  • Feedback loops that reach people, not just dashboards. Supervising attorneys receive structured prompts about specific agent decisions — the agent selected this authority over this alternative; does that reflect your judgment on this matter type? — not just completion notifications. This changes the attorney's role from passive reviewer to active calibrator. It is more intellectually honest about what supervision actually means in an agentic environment, and it produces the kind of documented engagement with the agent's reasoning that professional responsibility frameworks require. It also, not incidentally, makes the agent better over time in ways that passive review does not.
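
On the second characteristic, the gating logic itself is simple; the discipline is in refusing to bypass it. A sketch, with illustrative level names, thresholds, and sample sizes:

    # Sketch of performance-gated autonomy expansion. Level names, the 95%
    # threshold, and the minimum sample size are illustrative.
    AUTONOMY_LADDER = ["suggest-only", "draft-with-checkpoints", "autonomous"]

    def next_autonomy_level(current: str, audited_accuracy: float,
                            sample_size: int, threshold: float = 0.95,
                            min_sample: int = 50) -> str:
        # Expansion is earned on a human-audited sample of real matters,
        # never granted by a deployment calendar. Fail closed otherwise.
        if sample_size < min_sample or audited_accuracy < threshold:
            return current
        i = AUTONOMY_LADDER.index(current)
        return AUTONOMY_LADDER[min(i + 1, len(AUTONOMY_LADDER) - 1)]

The fail-closed default is the point: an agent that hasn't cleared an audited sample stays at its current level, whatever the deployment schedule says.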

A Self-Assessment for Monday Morning

If you greenlit an agentic pilot in 2024 or 2025 and the results have been more complicated than anticipated, the following questions are worth taking into your next internal conversation. They're not designed to produce a grade. They're designed to identify where the gaps are.

  • For each deployed agentic workflow: can you produce a map of every intermediate decision the agent makes, and identify which of those decisions requires attorney judgment under your jurisdiction's competence standards?
  • Does your agentic infrastructure produce an audit trail at the component level — not just final outputs — that you could show a client or, if necessary, a bar disciplinary committee?
  • Have you defined, in writing, which matter types and client relationships are appropriate for which levels of agent autonomy? Is that framework enforced technically, or only through policy?
  • If an agent wrote back to your firm's knowledge base or precedent library in the last 12 months, how many of those write-backs were reviewed by a subject-matter authority before becoming firm standard?
  • Can your supervising attorneys describe, without looking at a dashboard, what the agent did on a specific matter — not just what it produced?
  • Does your current vendor infrastructure give you the ability to add, modify, or enforce governance guardrails without the vendor's involvement?

Six honest answers to those six questions will tell you more about your firm's actual agentic AI posture than any deployment metric your innovation team has presented this year.

The firms that will define legal AI leadership over the next three years are not the ones who moved fastest in 2025. They're the ones who understood that the goal was never to automate the workflow — it was to automate the right parts of it, with the right human judgment preserved in the right places. That's a design problem before it's a technology problem. And the firms treating it that way are the ones quietly pulling away from the cliff's edge while others are still trying to figure out how to land.