03/24/2026 | Press release | Distributed by Public on 03/24/2026 09:52
There's a lot of noise right now about AI agents. Every vendor has one. Every keynote promises them. But very few organizations have actually shipped agents that work in production, at scale, against real business outcomes. At Salesforce, we have. And what we've learned in the process is that most of the industry is thinking about this wrong.
This paper isn't a product pitch. It's a set of hard-won lessons from building and deploying AI agents inside one of the largest enterprise software companies in the world. As Customer Zero, we use our products first, test them in real-world scenarios, tune them, and then build those lessons directly back into the platform. We've made mistakes internally so that our customers don't have to. Every insight in this guide is the result of a disciplined plan, refined through experimentation and the trial and error that comes with it.
These lessons apply whether you're running a five-person team or a thousand-person organization. The principles are the same. The mistakes are the same. And the opportunity, if you get it right, is enormous.
1. AI Agents Aren't Software
This is the first and most important thing to internalize. Software is deterministic. You write a function, you pass it an input, you get the same output every time. That's what makes it reliable. It's also what makes it limited.
Agents are fundamentally different. They have their own reasoning capabilities. They interpret context. They generate responses that vary, and that variation isn't a bug; it's the feature. It's the very thing that makes them useful in complex, ambiguous, real-world business situations where rigid logic breaks down.
The problem is that most enterprise leaders are still treating agents like software deployments. They expect deterministic behavior. They expect to configure it once and walk away. And when the agent does something unexpected, they call it a failure. That's like hiring a new analyst and being surprised when they ask a clarifying question.
The mental model shift is this: Managing agents is closer to managing employees. The better you guide and coach them, the better they perform. Clear, well-written briefs produce better work. Active monitoring catches mistakes early. Calibration through examples - showing what good looks like and what bad looks like - drives improvement over time. And when a single agent is overloaded with too many tasks, the answer isn't to push harder. It's to hire another one.
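The "calibration through examples" idea, showing the agent what good and bad work look like before it acts, can be sketched as a brief-builder. This is a hypothetical illustration, not a Salesforce API; every name and example string below is invented:

```python
# Hypothetical sketch: coach an agent the way you'd coach an employee,
# by pairing the task with labeled examples of good and bad work.

def build_calibration_brief(task: str, good: list[str], bad: list[str]) -> str:
    """Assemble a brief that shows the agent what 'good' and 'bad' look like."""
    lines = [f"Task: {task}", "", "Examples of good output:"]
    lines += [f"  + {g}" for g in good]
    lines += ["", "Examples of bad output:"]
    lines += [f"  - {b}" for b in bad]
    lines += ["", "Match the good examples; avoid the failure patterns above."]
    return "\n".join(lines)

brief = build_calibration_brief(
    "Draft a follow-up email to a prospect",
    good=["Short subject line tied to the prospect's industry"],
    bad=["Generic 'Just checking in' opener"],
)
```

The point of the sketch is that the brief, like a manager's written guidance, is an artifact you iterate on as the agent's output is reviewed.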
The best agent builders we've seen operate like great managers. They design work assignments across a team of specialized agents, coordinating them the way a competent leader coordinates an organization. That's the unlock.
2. Think in Terms of 'Jobs to Be Done,' Not 'Jobs'
There's a persistent myth that you can build one universal agent to replace an entire human role. You can't. Not today. And probably not for a long time. The roles humans play in an enterprise are composed of dozens of discrete tasks, each with its own context, judgment requirements, and edge cases. Trying to automate an entire job with a single agent is a recipe for mediocrity across the board.
The breakthrough comes when you break the job down into its component tasks - the specific "jobs to be done" - and build agents that master each one.
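That decomposition can be sketched as a simple dispatcher that routes each "job to be done" to the specialized agent that owns it. This is an illustrative sketch with invented names and stub agents, not Salesforce's implementation:

```python
# Illustrative sketch: decompose a role into discrete "jobs to be done"
# and route each one to a specialized agent. Stubs stand in for real agents.
from typing import Callable

def schedule_meeting(lead: dict) -> str:
    return f"meeting booked with {lead['name']}"

def answer_product_question(lead: dict) -> str:
    return f"answered question for {lead['name']}"

AGENTS: dict[str, Callable[[dict], str]] = {
    "schedule_meeting": schedule_meeting,
    "product_question": answer_product_question,
}

def dispatch(job: str, lead: dict) -> str:
    """Route a job to the agent that masters it; unknown jobs go to a human."""
    agent = AGENTS.get(job)
    if agent is None:
        return f"escalated to human: {job}"
    return agent(lead)

print(dispatch("schedule_meeting", {"name": "Acme Corp"}))
# -> meeting booked with Acme Corp
```

The design choice worth noting is the fallback: anything outside an agent's mastered task list escalates rather than letting one generalist agent attempt the whole job.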
Take our Sales Development Representative (SDR) team as an example. SDRs are our front line for connecting with prospective customers. In their working day, they do many things: They schedule meetings with prospects, share product information, run initial qualifying conversations, hand off the most promising leads to account teams, log their calls, and create sales opportunities inside Salesforce.
When we built our Engagement Agent, we focused on a few specific, measurable tasks drawn from the human SDR's job. The goal was to give our SDR team more capacity to meet their quota by engaging more customers. The agent kept following up where an SDR would normally have to stop and move on to other customers. It reached out to prospects our human SDR teams didn't have capacity to speak with, nurturing them until they were ready to book a meeting. And it answered customers' initial product questions so they arrived ready to connect with our sales teams.
The results spoke for themselves. Our initial Engagement Agent pilot generated more than $120 million in annualized pipeline in just the first few months of operation. Not by trying to be an SDR but by doing specific SDR tasks exceptionally well. Our SDR team couldn't have hit their targets without the Engagement Agent.
We didn't get it right overnight. In our first version, the agent could beat only the bottom 10% of our human SDRs. To improve, we looked at what our top sellers do: how they write the best emails, create the right headline, and deliver a call to action. We measured how those actions led to more customer appointments and more revenue. Through consistent measurement and experimentation, we made the Engagement Agent better and better. Once we knew what worked, we built those insights into the product. Today, that agent can beat 90% of our sellers in those tasks. We built the Engagement Agent for ourselves, made it great through our own experience, and then delivered that refined version for our customers.
3. Measuring Agent Competency
If you accept that agents are more like employees than software, you have to measure them like employees too. That means evaluating competency on a specific task, not running generic benchmarks that have nothing to do with your business context.
Think about a call center. Entry-level call center employees work from fixed scripts. They're effective within narrow guardrails, and they're measured on adherence and resolution within those bounds. As they advance in their expertise, they earn more flexibility. They can deviate from the script. They can handle more-complex customer situations. Their competency is measured by their ability to resolve increasingly difficult cases while maintaining quality standards.
AI agents should be evaluated with this same comparative rigor. An agent that's new to a task should be tightly constrained, with every output reviewed. As it demonstrates competency - resolving simple tickets accurately and consistently - you give it more latitude. Over time, the best agents graduate from handling simple inquiries to resolving complex, multistep customer cases that would have previously required senior human intervention.
Applying this level of rigor to our AI agents held up a mirror to our own organization: to know whether an agent is succeeding, we have to measure it against a human baseline. In our own deployments, we often found we had more granular data on the agents than on the tasks humans perform in their jobs. By rigorously quantifying what excellence looks like in the work our teams do today, we can measure whether an agent is actually delivering on its mission.
This isn't theoretical. It's how we're building competency models inside Salesforce right now. The metric that matters isn't whether the agent is "smart." It's whether the agent can do this specific thing, reliably, at a level that meets or exceeds human performance.
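That competency check can be sketched as a simple gate against a human baseline. The scoring scale, sample threshold, and numbers below are illustrative assumptions, not Salesforce's actual competency model:

```python
# Hedged sketch: an agent "passes" a specific task only when there is enough
# reviewed evidence AND its average quality meets or beats the human baseline.

def meets_baseline(agent_scores: list[float], human_baseline: float,
                   min_samples: int = 50) -> bool:
    """Gate task-level competency against a human baseline, not a generic benchmark."""
    if len(agent_scores) < min_samples:
        return False  # not enough evidence yet; keep every output under review
    return sum(agent_scores) / len(agent_scores) >= human_baseline

# Example: 60 reviewed outputs averaging 0.92 against a 0.88 human baseline.
scores = [0.92] * 60
passed = meets_baseline(scores, human_baseline=0.88)
print(passed)  # True
```

The sample-size floor matters as much as the threshold: a high average over a handful of outputs is not yet competency.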
4. The Abundant Enterprise
Here's where the economics gets interesting and where most executives are still underestimating what's possible.
Every business leader we work with has a list of things they would do if they had unlimited headcount. Research every customer in the pipeline. Audit every data asset. Provide every employee with a personalized technical assistant. Run proactive outreach to every lead, not just the ones above the scoring threshold. These aren't fantasies - they're strategies that have clear ROI. They just couldn't be justified under the economics of human labor.
Agents flip that equation. When the cost of intelligence on demand drops by an order of magnitude, the math changes on hundreds of initiatives that were previously shelved. Functions that weren't worth the lift suddenly become trivial to deploy.
When we introduced Account POV in Sales Agent, our sales teams unlocked new consistency, quality, and scale. Manual account POVs were labor-intensive, taking four to five hours to complete. Not every account was covered, and the quality varied. Today, we have 100% coverage with expert-level POVs automatically generated for every account.
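To make the cost flip concrete, here is a back-of-the-envelope comparison. Only the four-to-five-hours-per-POV figure comes from the text above; the account count, labor rate, and per-POV inference cost are hypothetical:

```python
# Hypothetical arithmetic: what 100% account-POV coverage costs under human
# labor versus agent economics. All inputs except hours_per_manual_pov are invented.

accounts = 1000
hours_per_manual_pov = 4.5    # midpoint of the 4-5 hours cited above
analyst_hourly_cost = 75.0    # hypothetical fully loaded labor rate
agent_cost_per_pov = 2.0      # hypothetical inference/compute cost

manual_total = accounts * hours_per_manual_pov * analyst_hourly_cost
agent_total = accounts * agent_cost_per_pov

print(f"manual: ${manual_total:,.0f}")  # manual: $337,500
print(f"agent:  ${agent_total:,.0f}")   # agent:  $2,000
```

Under these assumed numbers the same initiative drops by two orders of magnitude, which is why previously shelved, clear-ROI work suddenly clears the bar.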
We call this the abundant enterprise. It's an organization that operates at a level of coverage and responsiveness that was, until now, mathematically impossible. Not because the strategy wasn't sound but because the cost structure didn't allow it. Agents remove that constraint.
Even at Salesforce, with the resources we have, there were prospective customers we couldn't engage and data quality issues we couldn't address at scale. Agents have changed that. And we're still early.
5. Trust, Observability, and the Agent Development Lifecycle
None of this works without trust. And trust, with a probabilistic system, has to be earned.
Because agents aren't deterministic, they're inherently subject to model drift. Subtle changes in the underlying model, the data it receives, or the context it operates in can cause output quality to degrade. Hallucinations emerge. Accuracy drops. Performance drifts in ways that are invisible unless you're actively looking for them.
This is why observability isn't optional - it's foundational. You need comprehensive data measurement across every agent's outputs. You need real-time monitoring that detects when performance is degrading before it impacts the business. And you need the ability to tune the dials - adjusting constraints, retraining on new examples, and narrowing scope in response to what the data tells you.
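A minimal sketch of such a monitor, assuming per-output quality scores are already available; the window size and alert floor are invented for illustration and not a Salesforce product API:

```python
# Illustrative drift monitor: track a rolling window of per-output quality
# scores and flag degradation before it compounds into a business impact.
from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 100, alert_below: float = 0.85):
        self.scores = deque(maxlen=window)  # most recent N quality scores
        self.alert_below = alert_below      # rolling-average floor before alerting

    def record(self, score: float) -> bool:
        """Record one scored output; return True if the rolling average
        has drifted below the acceptable floor."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.alert_below

monitor = DriftMonitor(window=5, alert_below=0.85)
for s in [0.95, 0.94, 0.93, 0.70, 0.60]:  # quality quietly degrading
    drifting = monitor.record(s)
print(drifting)  # True: the rolling average fell to 0.824
```

The rolling window is the key design choice: it catches gradual degradation that a single bad output, or a lifetime average, would hide.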
We've formalized this into what we call the Agent Development Lifecycle (ADLC). The core principle is simple: Autonomy is granted incrementally. An agent starts with tight human oversight. As it demonstrates competency in a specific domain, measured against human baselines for speed and accuracy, it earns more independence. This isn't a one-time evaluation. It's a continuous process of measurement, calibration, and graduated trust.
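The ADLC's graduated-trust principle can be sketched as a tiering rule. The tier names, thresholds, and minimum sample count below are illustrative assumptions, not the actual ADLC specification:

```python
# Hypothetical sketch of graduated autonomy: independence is earned through
# sustained, measured accuracy, never granted up front.

TIERS = [  # (minimum sustained accuracy, autonomy level)
    (0.98, "autonomous"),        # acts without review
    (0.90, "spot_checked"),      # sampled human review
    (0.00, "fully_supervised"),  # every output reviewed
]

def autonomy_tier(accuracy: float, reviewed_outputs: int,
                  min_outputs: int = 100) -> str:
    """New agents start fully supervised; tiers unlock only with evidence."""
    if reviewed_outputs < min_outputs:
        return "fully_supervised"
    for floor, tier in TIERS:
        if accuracy >= floor:
            return tier
    return "fully_supervised"

tier_experienced = autonomy_tier(0.95, reviewed_outputs=250)
tier_new = autonomy_tier(0.95, reviewed_outputs=20)
print(tier_experienced, tier_new)  # spot_checked fully_supervised
```

Because the function is re-evaluated continuously against fresh reviews, an agent can also be demoted when its measured accuracy slips, which is the "continuous calibration" half of the lifecycle.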
The ADLC is the bridge between the probabilistic nature of foundation models and the deterministic expectations of enterprise operations. It's how you move from experimentation to production without accepting unacceptable risk.
6. The Era of Predictive Competency
Looking ahead, what's most exciting isn't what agents can do today. It's where the trajectory leads.
We're moving toward a fundamental shift in the human-machine relationship inside the enterprise. As agents become more competent, the human role transitions from doing the work to mentoring the agents that do it. The best managers in the future won't be judged by how many people they lead. They'll be judged by how effectively they orchestrate teams of agents alongside human talent.
The destination is what we call predictive competency: agents that don't just respond to instructions but anticipate business needs. Unlike today's agents, which solve problems in an input-output model, the agents of the future will recognize patterns across the enterprise and proactively surface insights, initiate workflows, and resolve issues before they become problems. This isn't science fiction. The building blocks exist today. The gap is in how we deploy, measure, and trust these systems.
At Salesforce, we're pioneering this ecosystem. Our current solutions are designed to bridge the gap between probabilistic models and deterministic business needs. Every tool we build, every agent we deploy, is a step toward a future in which the enterprise operates with a level of sophistication that matches - and in many domains exceeds - human judgment.
In the coming installments of this series, we will dive deeper into the mechanics of this shift.
We're building from the ground up as practitioners. As Customer Zero, we're documenting this journey in real time to guide enterprise transformation toward a more autonomous, competent, and abundant future.