Case study
Teaching a CRM to qualify its own deals
Daniel Hull · Founder, 80x
Published · Updated
Scout is a qualification agent I built for a legal-tech company selling AI tools to in-house legal teams. It reads sales notes and fills in the CRM’s MEDIC fields on its own. Most of the work had nothing to do with the AI; it was making the thing safe to run unattended against a couple thousand live deals.
The problem
The company qualifies deals with MEDIC: Metrics, Economic Buyer, Decision Criteria, Identified Pain, Champion, plus two they tack on, Paper Process and Competition. The idea is that if you can fill those in for a deal, you have a decent sense of whether it’s real and what’s left before it closes.
In practice the fields sit empty. Not for lack of information. The information is right there in the deal’s notes. It’s that moving it from a note into the right structured field is tedious, and tedious loses to the next call every time. A rep hears “we’re losing about forty hours a month to manual contract review,” which is a clean Metric, types it into a call summary, and there it stays, never reaching the Metrics field where anyone running the forecast would see it.
The company asked whether an agent could just do that part: read the notes, pull the MEDIC signals out, write them into Attio. The raw materials were all there. The notes existed, the CRM had an API, and Claude is good at turning prose into structured data. I figured the first version was an afternoon’s work.
It wasn’t, though not because the extraction was hard. The extraction was the easy part. I’ll get it out of the way quickly so I can spend the rest of this on the part that wasn’t.
What I built
I called it Scout. Once a day it walks the company’s pipeline and, for each deal that’s still live, does roughly this:
- reads the recent notes (Granola call summaries and hand-typed notes both land as Attio notes here, so that’s the whole corpus),
- throws out the junk, since a note that just says “how about 2pm Tuesday?” has no MEDIC in it,
- hands what’s left to Claude and asks for the signals back as JSON, one citation per claim,
- works out whether each signal is worth writing, and
- writes only to its own fields, never to anything a person filled in.
Then it logs what it did. I’ll come back to the log, because it ended up earning its keep more than anything else in here.
The whole thing is a deterministic Python script on a 6am cron. It only calls Claude when a deal actually has new notes, so most deals on most days cost nothing. There’s no agent sitting in a loop “thinking.” I know an autonomous loop is the fashionable shape for this kind of thing, but for a job that runs unattended against a system of record, boring and predictable was what I wanted.
There’s also a read-only version that lives in Slack. You type @Scout and a company name and it answers with that deal’s MEDIC picture and whatever it can pull from the latest notes, citations attached. It never writes anything. It’s mostly there so the team can see what Scout would say without it having said it.
The hard part is trust, not the AI
Here’s what I underestimated. Wiring notes through a model and PATCHing the result back into the CRM is genuinely easy, and a demo of it looks great. The problem is that it should never run against a real pipeline in that form, and the reasons why are most of the actual engineering.
A CRM is a system of record, which is a roundabout way of saying people act on what’s in it. Reps make commitments based on those fields; managers build forecasts off them. So the failure that actually scares you isn’t the model being a bit wrong. It’s the model being confidently wrong: overwriting something a rep spent a call earning, say Economic Buyer: CFO, budget confirmed quietly replaced by a guess, with nobody noticing until a forecast is sitting on top of it. Do that once and people stop trusting the fields, which was the whole point of having them.
So a lot of Scout is restraint. Four things, mostly.
1. The agent gets its own fields
This is the load-bearing decision. Every MEDIC concept exists twice in the schema. There’s the field the rep owns, economic_buyer, and a parallel one Scout owns, scout_economic_buyer. Scout reads the first and only ever writes the second.
Each suggestion is really three fields: the value, a _source holding the link it came from, and a _status the rep flips as they accept or reject it. The rep sees the suggestion next to their own field and accepts it with a click. What I like about this setup is that it isn’t a rule the model has to remember. Scout doesn’t refrain from clobbering the rep’s field because the prompt asked it to. It can’t, because there’s no code path that writes there. I trust prompts about as far as I can throw them. Schema constraints I trust.
The logic for what to do with a given signal is dull on purpose:
- rep’s field has something in it: leave it alone, but log that we saw it
- rep’s field empty, no suggestion there yet: write the suggestion
- rep’s field empty, the same suggestion already there: do nothing
- rep’s field empty, a different suggestion there: replace it
Four outcomes, and a finding only turns into a real write if two further switches are both on. More on those below.
The only mildly subtle part is what “empty” means, which depends on the field type. A text field is empty when it’s blank. A linked record or a dropdown counts as filled the moment there’s anything in it, because nobody half-fills those by accident.
2. No citation, no write
The rule I cared most about is that Scout can’t state a MEDIC fact it can’t point at. Every finding has to come back with the source note’s ID, a link, and the verbatim sentence it came from, not a paraphrase of it.
Models don’t reliably do what you tell them, so there’s a plain check after the response that drops anything that doesn’t comply:
def clean_findings(findings, valid_ids):
valid_slugs = set(MEDIC_CONCEPTS)
cleaned = []
for f in findings or []:
if f.get("field") not in valid_slugs: # a real MEDIC concept?
continue
if f.get("source_id") not in valid_ids: # cites a note we actually gave it?
continue
if all(f.get(k) for k in ("value", "excerpt", "source_url", "source_type")):
cleaned.append(f) # and a full citation?
return cleanedThe line doing the real work is the middle one. Scout hands the model a fixed set of notes with their IDs. If a finding comes back citing an ID that wasn’t in that set, which is exactly what a fabricated citation looks like, it’s gone before it can turn into a write. The model is free to suggest whatever it likes; only the things that trace back to a note it was actually shown make it through.
The nice consequence is that every value Scout writes traces back to a specific sentence in a specific note. When someone asks why a deal’s champion is listed as the VP of Legal, there’s a quote to point at instead of a shrug.
3. Two locks, both off by default
Going live needs two separate switches on at once: an --apply flag on the run, and a LIVE_WRITES=1 in the environment. With either one missing, Scout does everything except the final write. Full extraction, full decisions, full logging, and then nothing leaves the building. Partly this is belt-and-suspenders, and partly it’s that I didn’t trust my own typing: a wrong command can’t push to two thousand records, because the default state is decide-everything-write-nothing, and only the deliberate combination of both switches changes that.
4. Log everything, especially the non-events
Underneath is a SQLite file, one row per finding Scout considers. Run id, timestamp, deal, field, the old value, the new value, what the rep had at the time, the decision, the model, the citation. The part that matters is that it logs the things it didn’t do: skipped because the rep already owned the field, or “would have written this, but we’re in dry-run.” Not just the writes.
I didn’t think of the log as the important part when I was building it. It turned out to be the thing that made the whole project debuggable, for reasons I’ll get to in a second.
The extraction itself
This part is deliberately unclever. No tools, no loop, one model call per deal that has fresh notes. The prompt does a few specific things that matter more than they look. It’s told that saying nothing is fine: at most one row per field, and only when the notes actually support it. Left alone these models want to fill in every field you show them, and here that instinct is exactly wrong. The readable summary and the verbatim quote go in separate fields, so the citation stays literal and checkable. A teammate speculating in a comment thread is flagged as not the same as the customer saying it, so internal guesses don’t get promoted to facts. And the system prompt is identical on every call, so it’s cached, which across a couple thousand deals adds up to real money.
What actually went wrong
This is the honest part, and it’s where most of the calendar time went. None of these were in the design. They were all the same kind of bug, too: not a crash, just the agent quietly doing nothing useful while looking like it worked.
The sources weren’t what the spec said. The plan assumed three of them, calls and emails and notes. The real Attio workspace turned out to have no call transcripts exposed through the API and no email object at all. Everything usable was in notes. So Scout is notes-first, and the call and email handling still exists but sits switched off, waiting for data that may never arrive. Had I built to the spec instead of opening the actual workspace first, I’d have shipped something that confidently found nothing.
The second one is my favorite, because it hid inside dry-run. Scout skips any note it has already handled for a given deal and field, which is how a daily job avoids re-suggesting the same thing forever. But the original “already handled?” check didn’t look at whether the previous handling was a real write or a dry-run. I’d done months of dry-run testing, so the log was full of rows that all said handled. The day we flipped it live, Scout dutifully looked at everything, recognized its own dry-run ghosts, and wrote almost nothing. The fix was one clause, count it as handled only if it actually wrote. But a live agent that runs green and produces nothing is about the easiest failure in the world to miss, and the only reason I caught it was that the log had recorded every one of those skips.
The third was a cap. There was a fifty-deals-per-run limit left over from testing. The pipeline has 2,086 active deals and the query returned them newest first, so Scout was lovingly processing the fifty most recently created deals, the brand-new ones with no notes on them yet, and never getting anywhere near the older, late-stage deals where the signal actually lives. Nothing was broken. The cap was just pointed at the wrong end of the list, and the result was a tool that ran every morning and found nothing worth writing.
The thread running through all three is that the failures don’t announce themselves. A crash you notice immediately. An agent that runs cleanly and does subtly the wrong amount of nothing, you can watch for a week and not see. The log is what made each of them visible, which is the closest thing to a lesson I’d pull out of the whole project: write down what the thing decided not to do, not just what it did.
What I’d keep
Strip away the sales specifics and this is one answer to a question that keeps coming up: how do you let a model maintain something people actually depend on. A few things I’d do again without thinking about it.
Give the agent its own namespace, so it physically can’t write where people write. “Be careful” is not a safety property; “there is no code path” is one. Make the citation check real code that runs after the model, not a sentence in the prompt, so an uncited claim isn’t discouraged but unsaveable. Default to doing nothing, behind more than one switch, because the cost of an accidental no-op is zero and the cost of an accidental write to a few thousand records is that nobody trusts the fields again. And log the decisions, including the non-decisions, since those are the rows that catch the silent failures.
The thing I keep coming back to is that almost none of this was about making the model smarter. The model was fine. The work was building enough structure around it that it stays useful even when it’s wrong, because it will be wrong sometimes, and a tool people stop trusting the moment it’s occasionally wrong is a tool they stop using.
Scout still runs daily against the company’s pipeline. A shorter, engagement-shaped summary of this project lives in the studio’s work section.