04/14/2026 | Press release | Distributed by Public on 04/14/2026 11:08
1. Introduction: The End of 'Build Once, Ship, Move On'
For decades, the Software Development Lifecycle (SDLC) has been the bedrock of the technology industry. Its logic is elegant and straightforward: Define requirements, write code, test it, ship it, maintain it. If the code says 2 + 2 = 5, you have a bug - a logic error with a deterministic fix. You patch it, you push a new version, and the world moves on. Modern software engineering is built on one core assumption: predictability.
That assumption is now breaking.
As enterprises race to deploy autonomous AI agents that don't just execute instructions but reason, adapt, and make decisions, they're discovering an uncomfortable truth: Agents aren't traditional software. They're probabilistic, not deterministic. They don't "run"; they "behave." And the degree to which they behave well isn't a fixed property you can test once and trust forever.
The instinct in our industry is to treat agent development the way we have always treated software development: Write the prompt, connect the APIs, run a test suite, deploy, and move on. This results in agents that work brilliantly in staging and erratically in production, agents that degrade over time even when no one touches them, agents that pass every test case and still fail the customer.
The gap between how we build traditional software and how agents actually work is where most enterprise AI deployments break down. No amount of prompt engineering will close it. What is needed is a fundamentally different lifecycle designed from first principles for the realities of probabilistic AI.
At Salesforce, we have spent the past two years building and deploying autonomous agents at enterprise scale across thousands of use cases, serving tens of thousands of employees and millions of end customers. In the early days, we saw the same friction points many enterprises face today. We had disparate approaches across the organization, a lingering bias toward "build once and ship," and ad hoc AI governance. This fragmentation led to inconsistent results and inaccurate guidance for our teams.
Through that work, we have developed the Agent Development Lifecycle, or ADLC.
ADLC is a closed-loop learning system of testing, monitoring, calibration, and refinement that is designed to manage the inherent drift of an evolving AI ecosystem. Where SDLC ends at the release, ADLC recognizes that deployment is merely day one and delivers a model centered on continuous improvement post-launch, rigorous AI governance, and teams comprising agentic technologists who support technical build and behavioral coaching.
This paper introduces the ADLC framework, explains the first principles behind it, details its six core phases, and describes the new jobs to be done and organizational capabilities enterprises must develop to succeed in the agentic era.
2. First Principles: Why Agents Aren't Software
To understand ADLC, we need to move past the hype and look at the functional reality of building and managing probabilistic systems.
The Taxonomy of Agent Work
The first mistake most organizations make is treating every agent task the same way. In practice, when we decompose a job into its elemental tasks, those tasks fall into three distinct categories, each of which demands a different tooling approach.
The first category is deterministic code. These are tasks that require 100% accuracy and zero creativity: updating a CRM field, generating an invoice, changing a lead status. They belong in traditional code (Apex triggers, flows, APIs), not in a language model.
The second category is retrieval. These tasks require specific, factual context: "What are our Q3 pricing tiers?" or "What is this customer's contract renewal date?" The agent needs to be grounded in a trusted data source (like Salesforce Data 360) to ensure it's not hallucinating.
The third category is the reasoning engine: the large language model (LLM) itself. These are the tasks that require genuine judgment: "Based on this prospect's LinkedIn activity and recent support tickets, what is their most likely pain point?" This is where the LLM earns its keep.
Distinguishing between these three categories is the most important design decision in ADLC. If a task can be accomplished with a regular expression or an if-then statement, keep it out of the LLM. Save the reasoning tokens for where the ambiguity actually lives.
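To make the taxonomy concrete, here is a minimal routing sketch in Python. The task names, category labels, and handler strings are all hypothetical illustrations of the decision, not Agentforce APIs:

```python
# Hypothetical task names; in practice the taxonomy comes from the
# design-phase decomposition, not a hard-coded set.
DETERMINISTIC_TASKS = {"update_crm_field", "generate_invoice", "change_lead_status"}
RETRIEVAL_TASKS = {"q3_pricing_tiers", "contract_renewal_date"}

def classify_task(task: str) -> str:
    """Route exact-accuracy work to code, factual lookups to retrieval,
    and anything needing genuine judgment to the LLM."""
    if task in DETERMINISTIC_TASKS:
        return "deterministic"
    if task in RETRIEVAL_TASKS:
        return "retrieval"
    return "reasoning"

def route(task: str) -> str:
    handlers = {
        "deterministic": lambda t: f"code:{t}",   # Apex trigger, flow, API
        "retrieval": lambda t: f"grounded:{t}",   # trusted data source lookup
        "reasoning": lambda t: f"llm:{t}",        # spend reasoning tokens here
    }
    return handlers[classify_task(task)](task)
```

The design choice the sketch captures: the default branch is the LLM, but only after the cheaper, deterministic options have been explicitly ruled out.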
Job Decomposition, Then Job Automation
The second mistake is attempting to automate an entire role, for example, "Automate Sales" or "Automate Support." Automation isn't the wrong goal, but it's the wrong starting point. Before you can automate anything, you must decompose the role: break it into its elemental tasks and determine which specific tasks an agent can own, which require human judgment, and which should remain in deterministic code.
When we built our Engagement Agent - the same one that now generates over $120 million in annualized pipeline - we didn't try to "automate Sales." Instead, we looked for the high-volume, high-impact tasks where extra capacity would provide outsized gains for our sales development representatives (SDRs). Engagement Agent has succeeded not by trying to be an SDR but by doing specific SDR tasks exceptionally well.
Don't Throw the Kitchen Sink at Your Agent
This leads to the third first principle: Resist the urge to build a "general agent." The biggest mistake we see operators make is trying to build a single agent that can do everything - handle research, draft communications, verify facts, update records, and make recommendations all in one system. These monolithic agents become bloated, expensive, and prone to "logic collisions" where the instructions for one capability interfere with the performance of another.
ADLC instead advocates for a targeted agent architecture. Each agent should have a defined, coherent scope of work with clear purpose and boundaries. Build one agent for account research. One for customer outreach. One for seller coaching. Each does one thing exceptionally well, and each can be calibrated, monitored, and improved independently. Critically, these targeted agents must pass a quality check to earn their "ticket to ride" before being orchestrated into enterprisewide systems. ADLC is a gated process, and only high-quality agents get through.
Within each targeted agent, the work is further decomposed into subagents that handle specific tasks in the agent's broader scope. Consider our Engagement Agent, whose job is engaging a customer after a lead is submitted by email. The macro goal is to get that customer ready to book a meeting with a seller. But that scope breaks down into discrete subtasks: One subagent drafts the initial email. Another checks the message strength before it goes out. Another handles the follow-up sequence. Each subagent has its own skill and context and can be independently refined. The result is that, as any one component improves, the overall agent improves. And when a subagent gets overloaded, you can split it apart and create a new one - giving agents clear purpose and boundaries at every level.
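A minimal sketch of that decomposition, with hypothetical subagent names and placeholder skills (the real Engagement Agent's logic is far richer):

```python
from typing import Callable, List

class SubAgent:
    """One named skill with its own handler, so each subagent can be
    refined or replaced without touching the others."""
    def __init__(self, name: str, skill: Callable[[dict], dict]):
        self.name, self.skill = name, skill

    def run(self, ctx: dict) -> dict:
        return self.skill(ctx)

# Hypothetical subagents mirroring the Engagement Agent decomposition.
draft = SubAgent("draft_email",
                 lambda c: {**c, "email": f"Hi {c['lead']}, thanks for reaching out."})
check = SubAgent("check_strength",
                 lambda c: {**c, "approved": len(c["email"]) > 10})
follow = SubAgent("follow_up",
                  lambda c: {**c, "sequence": ["day+3", "day+7"]})

def engagement_agent(lead: str, pipeline: List[SubAgent]) -> dict:
    ctx = {"lead": lead}
    for sub in pipeline:        # each subagent enriches the shared context
        ctx = sub.run(ctx)
    return ctx
```

Because each subagent owns one step, splitting an overloaded subagent means adding a new entry to the pipeline rather than rewriting the whole agent.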
How these targeted agents then work together is a question of orchestration, and it's the ultimate destination of ADLC.
3. The ADLC Framework: Phases of the Agent Lifecycle
Unlike SDLC, which flows linearly from design to deployment to maintenance, ADLC is built as a continuous cycle. It starts with phases that are also present in SDLC (design, build, test, and deploy), which are foundational. Experiment and observe come next and form a perpetual loop of calibration and refinement. The final phase, control and orchestrate, is the long-term destination that wraps around the entire cycle. Governance runs throughout and ensures all agents are secure, trusted, compliant, and high quality.
Phase 1: Design
Every agent begins not with a line of code but with a specific job to be done. The design phase is where an organization identifies a business use case and validates that an autonomous agent is the right tool for the job, rather than a deterministic flow or a manual human process.
However, decomposition is a guessing game without a performance baseline. Before a job is dissected into tasks, organizations must capture hard data on how humans currently perform those tasks. You can't measure an agent's accuracy or efficiency if you don't have a documented baseline for what good looks like in the hands of an employee.
This phase also encompasses governance. Before an agent touches production data, the organization must define permissions, sharing rules, ethical guardrails, and data access boundaries. This includes a structured intake process with formal submission that captures the use case, expected business value, and scope, followed by risk, bias, and fairness assessments. We review the centralized AI registry to determine whether an approved agent or design pattern already exists, and we check whether the use case can be extended from an existing agent or requires building something new.
Governance isn't a gate that slows the process down. Done correctly, it accelerates the work; preapproved design patterns and intake templates eliminate ambiguity and reduce the review cycle from weeks to days.
The design phase produces three artifacts: a validated use case with defined business value, a task taxonomy (deterministic, retrieval, and reasoning), and a governance clearance with defined data access and ethical boundaries.
Phase 2: Build
With a use case and governance clearance in hand, the team moves to building and configuring the agent. This phase involves enabling the agentic platform within the organization's environment, configuring foundational LLM settings, and defining the agent's identity.
The most important artifact produced in this phase is the agent's Instructions - its natural language "job description" that tells it what it does, how it should behave, and where its boundaries are. Unlike traditional code, these instructions are written in plain language and are closer in nature to an onboarding document for a new employee than a technical specification.
The team also assigns actions - the specific tools the agent can invoke. These may include flows, Apex code, prompt templates, or external API calls. Each action is classified by topic and mapped to the task taxonomy from Phase 1. Deterministic tasks are handled by code-based actions, retrieval tasks connect to grounding data sources, and reasoning tasks route through the LLM with appropriate context.
Finally, the team configures knowledge - connecting the agent to knowledge articles, CRM records, Data 360 datasets, and any other structured or unstructured data sources it will need to do its job. The quality and completeness of this grounding data is, in practice, the largest determinant of agent performance - far more impactful than the sophistication of the prompt.
Phase 3: Test and Evaluate
This is the phase where ADLC diverges most sharply from SDLC. In traditional software development, testing is a checkpoint you pass through on the way to deployment. In ADLC, testing is a permanent state. An agent is never "done" being tested, because an agent is never a finished product.
Our initial testing process uses Agentforce Testing Center to input sample utterances and verify that an agent correctly identifies the right topics, selects the right actions, and produces grounded, accurate responses. When an agent fails (and it will fail), the team uses Agentforce Observability to understand the failure, refine the natural language instructions, adjust the action classification logic, and update the grounding data to reduce hallucinations and improve response accuracy.
ADLC also incorporates structured experimentation in the form of data-driven comparative testing that allows teams to try different instructional approaches, different grounding strategies, and different action configurations and then measure which performs best against defined metrics. (This emphasis on experimentation is a through line in ADLC.)
The key mindset shift is this: You're not debugging code. You're coaching a new employee. The agent's mistakes aren't errors to be fixed with a patch; they're learning signals that inform the next round of instruction refinement. This phase produces not just a functional agent but a record of what the agent got right, what it got wrong, and how the instructions were adjusted in response.
Phase 4: Deploy
Deployment in ADLC is not the triumphant finish line it represents in SDLC. It is the agent's first day on the job. Much like a human employee finishing orientation, an agent must be vetted for competency before it shows up for work. We're assuming that your agent has met the foundational requirements for day one readiness. (We'll explore the specific metrics of agentic excellence in an upcoming paper.)
In this phase, the team activates the agent and embeds it into the channels where users actually work, including our system of engagement, Slack. We've found that deploying agents into Slack dramatically reduces the adoption friction that plagues stand-alone "AI portals." Users don't have to learn a new interface or change their habits. The agent simply shows up in the channel where they're already asking questions. This turns agent adoption from a mandate into a convenient habit.
This phase also involves finalizing the agent's deployment configuration, managing inbound and outbound flow routing, and ensuring that the handoff between agent and human - the escalation path - is seamless. Because no matter how capable the agent is at launch, there will be cases it cannot handle, and the speed and accuracy of that handoff is often what determines if users trust the system.
Phase 5: Observe
Phase 5 isn't a phase in the traditional sense. It's a continuous process that begins the moment the agent enters production.
Observation starts with Agent Analytics: seeing how the agent reasons through real conversations, which actions it selects, and where it succeeds or struggles. In traditional software, observability means "Is the server up?" In ADLC, observability means "Is the agent thinking correctly?" An agent can be available 99.9% of the time and still be failing its mission by providing low-quality or irrelevant reasoning. We have to match the form to the function. LLMs have the ability to speak like a doctor (the form) without having the knowledge to be a doctor (the function). Through observation, it's critical that we push the agent beyond "Does this sound good?" to "Is this factually certain?"
The monitoring layer tracks quantitative performance indicators: resolution rate, handoff frequency to humans, customer satisfaction scores, session engagement patterns, and operational efficiency metrics. The evaluation layer goes further, systematically scoring the agent's responses against a "golden dataset," a curated set of known-good answers that serves as the benchmark for accuracy, groundedness, and safety. This golden dataset isn't static. It evolves as the business evolves, as customer expectations shift, and as the agent's capabilities expand, and we continuously update the dataset against agent performance, regression testing, and adversarial testing.
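A toy illustration of scoring against a golden dataset, assuming exact-match comparison (the questions and answers are invented, and production evaluation would score groundedness and safety with far more sophisticated methods than string equality):

```python
# Hypothetical curated known-good answers; the real golden dataset
# evolves with the business, as described above.
GOLDEN_DATASET = {
    "What are the support hours?": "24/7 for premier customers",
    "What is the refund window?": "30 days",
}

def score_against_golden(agent_answer) -> float:
    """Fraction of golden questions the agent answers exactly right."""
    correct = sum(
        agent_answer(q) == expected
        for q, expected in GOLDEN_DATASET.items()
    )
    return correct / len(GOLDEN_DATASET)
```

Because the benchmark is data rather than code, updating the golden dataset recalibrates every agent scored against it without touching the agents themselves.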
This phase also introduces the concept of reasoning efficacy - moving beyond "Did it get the right answer?" to "How well did it reason to get there?" This involves tracking the input variables (prompt, context, grounding data), inspecting the reasoning trace (the chain of thought the agent used), and measuring output variance (how much the answer changes when the same question is asked three times). Observability should also flag results as "uncertified" when the agent is reasoning from a system of engagement rather than a system of record.
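Output variance in particular lends itself to a simple sketch: ask the same question several times and measure disagreement. The helper below is an illustrative assumption, not a platform feature:

```python
from collections import Counter
from typing import Callable, List

def output_variance(ask: Callable[[str], str], question: str, trials: int = 3) -> float:
    """Ask the same question several times and report the fraction of
    answers that disagree with the most common one (0.0 = fully stable)."""
    answers: List[str] = [ask(question) for _ in range(trials)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common_count / trials
```

A fully deterministic responder scores 0.0; an agent that answers differently on every trial approaches 1.0, signaling that its reasoning on that question is unstable.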
Every data point generated in Phase 5 feeds directly back into Phase 3 (Test and Evaluate), creating the continuous improvement loop that is ADLC's defining structural feature. Every "incorrect" agent action becomes a training datum for the next calibration cycle. The goal is to build a data loop where the system gets smarter, not just through engineering effort but through operational use - where every learning becomes an improvement as we run deep session tracing, collect feedback, and create self-learning agents.
Phase 6: Control and Orchestrate
The first five phases describe the lifecycle of a single agent. Phase 6 describes the destination: an enterprise where multiple specialized agents collaborate, hand off tasks, and collectively deliver outcomes that no agent could achieve alone.
Orchestration is the ability for multiple agents to interact without creating what we call a "chaos of autonomy" where agents duplicate work, contradict each other, or drop tasks in the handoff between systems. Currently, most multi-agent workflows are serialized: Agent A completes a task, then passes it to Agent B. The future of ADLC is de-serialized orchestration, where multiple agents pursue complementary approaches simultaneously, with a coordination layer ensuring coherence.
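The difference between serialized and de-serialized orchestration can be sketched with two placeholder agents; the coordination layer here is reduced to a trivial merge purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder agents standing in for real specialized agents.
def agent_a(task: str) -> str: return f"a({task})"
def agent_b(task: str) -> str: return f"b({task})"

def serialized(task: str) -> str:
    # Agent A completes its work, then hands off to Agent B.
    return agent_b(agent_a(task))

def deserialized(task: str) -> str:
    # Both agents pursue complementary approaches simultaneously;
    # a coordination step merges the results for coherence.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda agent: agent(task), [agent_a, agent_b]))
    return " + ".join(results)
```

In the serialized form, Agent B's latency stacks on Agent A's; in the de-serialized form, the slowest agent sets the pace, and the hard problem moves into the merge step that keeps the agents from duplicating or contradicting each other.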
Agent owners should want orchestration because it delivers built-in adoption (no marketing required), inherited trust (users already trust the orchestrator brand), and reduced agent sprawl (no disconnected tools).
Quality is the ticket to ride for orchestration. In this phase, agents undergo regular performance reviews to determine their readiness for orchestration. This is where the Agent Scoring Rubric becomes critical: Agents must meet quality standards before they can be orchestrated into enterprisewide superagents like our Sales Agent or Employee Agent.
Without passing the quality bar, agents remain siloed - useful perhaps to a small team but too risky to scale. Orchestration is what transforms stand-alone agents into a unified digital workforce that meets users where they are, in their existing flow.
This phase ensures that only agents with proven accuracy, reliability, and compliance can graduate from small-scale experimentation to this high-value orchestrated experience. This gate prevents the proliferation of plausibly right but factually wrong information across the enterprise, where one poorly performing agent could undermine trust in the entire agentic ecosystem.
Orchestration is the next frontier in most enterprises' agentic transformation. At Salesforce, we're still developing new agents, but our focus is shifting to the orchestration of high-quality agents. We believe the natural orchestration layer for the enterprise is Slack, because it's where the flow of work already happens. Rather than forcing users to go to a new AI portal, we're bringing the agents into the channels where people are already collaborating. This reduces the friction of change and makes multi-agent collaboration feel natural and convenient.
4. Defining Critical Jobs to Be Done
The six phases of ADLC aren't a linear path to completion but an iterative, continuous loop requiring constant management from an agent's first day on the job to its eventual retirement. However, successful execution of the ADLC framework depends less on traditional job titles and more on the mastery of specific jobs to be done (JTBD).
At Salesforce, we're building agentic teams of expert generalists who flex and blur the lines between traditional roles to support both technical builds and behavioral coaching. Organizations don't necessarily need to hire net-new people; instead, they can upskill existing teams to handle these core functions.
How these jobs are staffed depends on two dimensions: the size of the company and the size of the job. A nimble startup might have one or two agentic technologists covering all nine functions. A large enterprise might have dedicated teams for each. But all functions are critical to success in ADLC - and a small company deploying a complex, high-stakes agent might need to flex a big team to support delivery.
Through our work building and deploying agents at scale, Salesforce has identified nine critical JTBD that every organization must address:
The Build Functions
The Improve Functions
The Drive Functions
The organizations that lead in this era will be those that stop thinking in terms of rigid headcounts and start staffing these critical functions intentionally, ensuring every JTBD has a seat at the table.
5. Managing Synthetic Reasoning: Drift, Calibration, and the Human in the Loop
Perhaps the most counterintuitive aspect of ADLC is the concept of drift. In traditional software, behavior is constant unless the code is changed. In the agentic era, behavior can shift due to external factors.
Drift is the phenomenon where agent performance degrades over time even when no one has changed the agent's instructions, its actions, or its code. It happens because the world around the agent changes. New data enters the knowledge base that conflicts with old assumptions. The foundation model provider releases an update that subtly alters the model's reasoning patterns. Customer expectations shift because a competitor launched a new feature or because a viral social media post changed what "good service" looks like. In each case, the agent's performance degrades, not because anything is broken but because the context in which it operates has moved.
Drift isn't a bug. It's an inherent property of probabilistic systems operating in a dynamic world. And managing it requires a fundamentally different approach to the human role in the system.
The Calibration Loop
In SDLC, human oversight is a safety net designed to catch errors before they reach production. In ADLC, human in the loop serves a dual purpose: It's a safety net and a data factory. When a human reviews an agent's output and corrects a mistake, that correction is both a fix and a data point that improves the agent's next calibration cycle.
Human in the loop shows up in three maturity stages post-deployment. During early deployment, humans review 100% of agent outputs. Every response is checked, and every correction is logged. As the agent matures, humans review only low-confidence outputs. These are cases where the agent itself signals uncertainty. At full maturity, humans review approximately 5% of outputs at random, performing quality control and spot checks. At each stage, the review process generates the calibration data that drives continuous improvement.
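The three maturity stages can be expressed as a simple review policy. The rates and the 0.8 confidence threshold below are hypothetical placeholders, not prescribed values:

```python
import random

# Hypothetical review-rate schedule for the three maturity stages.
REVIEW_POLICY = {
    "early": 1.00,     # review every output
    "maturing": None,  # review only low-confidence outputs
    "mature": 0.05,    # ~5% random spot checks
}

def needs_review(stage: str, confidence: float, rng: random.Random) -> bool:
    """Decide whether a given agent output goes to a human reviewer."""
    rate = REVIEW_POLICY[stage]
    if rate is None:              # maturing: gated on the agent's own
        return confidence < 0.8   # uncertainty signal (assumed threshold)
    return rng.random() < rate    # early/mature: rate-based sampling
```

Whatever the exact thresholds, the point is that the sampling rule is explicit and auditable, and every review it triggers feeds the calibration loop described above.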
Deterministic Fences
We also implement "deterministic fences," hard-coded guardrails that constrain the agent's probabilistic outputs in high-stakes scenarios. For example, an agent can draft a customer refund, but only a deterministic piece of code with a hard cap on dollar amounts can issue it. The agent reasons; the fence acts. This separation ensures that trust is maintained even as the agent is given increasing autonomy.
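A minimal sketch of a deterministic fence, assuming a hypothetical dollar cap; the agent drafts the refund and calls this code, which alone has the authority to issue it:

```python
REFUND_CAP = 250.00  # hypothetical hard cap, set in code, not in the prompt

class FenceViolation(Exception):
    """Raised when a drafted action exceeds a hard-coded guardrail."""

def issue_refund(amount: float) -> str:
    """Deterministic fence: the agent may *draft* any refund, but this
    code refuses to *issue* one above the cap. The agent reasons;
    the fence acts."""
    if amount > REFUND_CAP:
        raise FenceViolation(
            f"refund ${amount:.2f} exceeds cap ${REFUND_CAP:.2f}"
        )
    return f"issued:${amount:.2f}"
```

Because the cap lives in deterministic code rather than in the agent's instructions, no amount of prompt drift or adversarial input can move it.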
Trust is the enterprise's number one value. We don't mind introducing a measure of friction if it keeps the system safe. We use agents themselves to streamline that friction wherever possible, creating a virtuous cycle where safety and speed improve together.
6. Conclusion: The Crossover Point
The ultimate goal of ADLC isn't to build agents that are "good enough." It's to reach the "crossover point," a data-driven milestone where an agent's performance metrics consistently beat the human baseline captured during the design phase. We're already seeing this in customer service, where Salesforce agents handle certain task categories at performance levels that match or exceed human agents, at a fraction of the cost and with around-the-clock availability.
Reaching the crossover point consistently, across thousands of use cases, requires the discipline of ADLC. It requires decomposing jobs with surgical precision, building targeted agents that do one thing exceptionally well, testing not once but continuously, observing not just whether the agent is "up" but whether it's thinking correctly, and orchestrating multiple agents into a coherent system. Scaling in the agentic era is about shortening the time it takes to move an agent from "trainee" to "senior contributor" with ADLC as the training program.
But the crossover point isn't the finish line. Once an agent consistently beats the human average, the question becomes "How much further can it go?" The agent's ceiling may actually exceed what any human could achieve at that particular task. In agentic transformation, first we do no harm. Then we continue to push forward and test what's possible.
This transition won't be easy. It requires a tolerance for imperfection that deterministic software never demanded, including a willingness to make a thousand small mistakes in service of building systems that learn from each one. It requires new jobs to be done, new metrics, and a willingness to treat AI agents not as products to be shipped but as colleagues to be mentored. The organizations that master this transition will build better agents and transform how work gets done in the agentic era.
In our next paper, we'll explore what this transformation means for the workforce and the enterprise: how the role of the manager evolves when the team includes both humans and agents, how to manage token consumption, and what it takes to lead a hybrid workforce at scale.