Wiser

Designing trust into AI-powered workflows

The AI matches thousands of products in seconds. The hard part is getting Ops to trust the answers - and to override the wrong ones without breaking a sweat.

Wiser AI Matching Tool
Business Problem
Wiser had multiple products built on separate legacy internal matching systems - each with its own logic and portal. That meant higher costs, more complexity, and a lot of confusion. We needed an AI-powered tool with a unified approach to reduce costs and enhance data quality. This was a 0→1 brand-new product, not a reskin of the legacy ones.
My Role
  • Owned end-to-end design as sole designer (research → UAT) and drove discovery on matching logic, edge cases, and shared terminology
  • Took on PM responsibilities - scope, stakeholder alignment, and UAT coordination
  • Partnered with Ops/Eng leadership to review AI results, make tradeoffs, and unblock decisions
Timeline
2025 H1
Team
1 PM · 1 Designer (me) · 8 Devs
Context

What is Matching?

Matching is how we link a customer's catalog product to the same product sold by other retailers. It's the foundation of Wiser's app - every price insight depends on these links being correct.

Retailer data is messy (inconsistent titles, attributes, identifiers), so matches can be inaccurate or incomplete. And match quality is critical: it's the #1 reason customers churn.

Step 1
A customer sells Product A
They want to know: who else sells the same item, and at what price? That's the question matching answers.
👟 Nike Air Max 2025 · $89.99
Brand: Nike · Size: US 10 · Color: Black/White
Customer's catalog → ? → 🛒 Other retailers?
Step 2
The matching system searches
The system scans millions of retailer listings and uses AI to find the same product - despite messy, inconsistent data.
How the pipeline works
👟 Product A (customer catalog) → Matching engine (scans all retailers · AI) → ✓ Matches (linked results)
Step 3
Results can be right, wrong, or missing
Retailer data is messy. The system does its best - but results need to be reviewed. This is why Ops exists.
Nike Air Max 2025 · amazon.com - Brand, Title, and Price all align → Accurate
Nike Air Max 2025 · walmart.com - UPC token match confirmed → Accurate
Nike Air Max 2024 · footlocker.com - wrong year: similar title, different product → Inaccurate
⚠️ Nike Air Max 2025 · target.com - exists but wasn't found (incomplete data) → Missing
Why it matters
Bad data = customer churn
Every price insight is built on match data. When matches are wrong, insights are wrong - and customers leave.
📉
The #1 reason customers churn
Bad match data breaks trust fast. A customer sees wrong competitor prices, makes decisions on them, loses revenue, and blames Wiser.

This is why accurate, complete matching isn't a nice-to-have - it's the foundation everything else depends on.
User Problem

Our internal team struggled with ineffective workflows (and a '90s UI)

Ops had to jump between different portals (and different legacy logic) to manage matching across products. The workflow was fragmented, inconsistent, and slowed them down.

Their job: ensure matches are accurate and complete for customers. That means setting matching rules, reviewing results, spotting issues, and fixing them.

The legacy matching system UI Ops were working with

Discovery

What Ops actually needed

Before designing anything, I ran research sessions with the Ops team - shadowing their workflow and asking them to walk through real tasks. Three questions kept surfacing. Not about features - about the work itself.

"I don't know if I've seen everything."
No coverage confidence
Ops had no way to know if they'd reviewed all risky matches for a product, or if something had slipped through another portal undetected.
"I fix it here, but it breaks somewhere else."
Fragmented state across systems
Changes in one legacy portal didn't reliably propagate. A fix in one place could quietly create a gap downstream - with no way to know until a customer noticed.
"I'm not sure the rule is doing what I think."
Rules felt like a black box
There was no way to verify a rule was working as intended without checking output case by case - making every rule change feel uncertain and slow.

The underlying problem wasn't the legacy UI. It was lack of control and visibility - Ops couldn't trust that their actions were having the effect they intended, so every decision felt fragile.

Define Direction

Where to use AI (and where not to)?

The goal wasn't to "AI everything." I mapped the end-to-end Ops workflow and helped decide what AI should do and where humans must stay in the loop.

The MVP principle
AI generates matches at scale - humans review, identify issues, and correct them.
Step 1 · Define rules · Human - sets IF/THEN rules that drive the AI. High judgment, low frequency.
Step 2 · Generate matches · AI - the token, vector, and LLM pipeline runs at scale. Too large for humans to do alone.
Step 3 · Review results · Human - reviews AI output at scale with filters; spots issues, verifies. Catches errors before they hit customers.
Step 4 · Address issues · Human - fixes and corrects: adds or removes matches manually. Stakes too high to automate corrections.

Why not automate more? Bad matches flow directly to customers and erode trust. The cost of an AI error is too high to remove human oversight from review and correction - at least at this stage.

Design Challenges

What the design had to solve

01
How do you give non-technical users real control over AI?
Ops need to influence what the AI does, but they're not ML engineers. A simple toggle is too coarse - full parameter control is too technical. The interface had to feel powerful without requiring ML fluency.
02
How do you surface issues across thousands of matches without manual scanning?
Ops can't review every result individually. The tool needed to help them find what matters - surfacing patterns and high-risk results without requiring them to already know what to look for.
03
How do you show AI results users can act on, not just interpret?
AI gives a confidence score. But a number alone is just another black box - Ops still can't tell if they should trust it or where it breaks down. Confidence needed to be legible, not just visible.
Solution

An AI-powered matching tool that's smooth and trustworthy

Feature 01

Manage AI-powered rules and control what goes to LLM

Users set rules to send uncertain cases to LLM validation - separating confirmed matches from cases needing review. The system generates a similarity score from vector embeddings of selected attributes, with configurable thresholds.

Manage matching rules
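
To make the threshold mechanics concrete, here's a minimal sketch of how a configurable similarity cutoff could split auto-confirmed matches from cases routed to LLM review. Names and numbers are illustrative, not Wiser's production code:

import numpy as np

CONFIRM_THRESHOLD = 0.9    # hypothetical: above this, auto-confirm
CANDIDATE_FLOOR = 0.6      # hypothetical: below this, reject outright

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # similarity between two attribute embeddings (brand/title/price/model)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(a: np.ndarray, b: np.ndarray) -> str:
    score = cosine_similarity(a, b)
    if score > CONFIRM_THRESHOLD:
        return "confirmed"      # confident enough to skip the LLM
    if score > CANDIDATE_FLOOR:
        return "send_to_llm"    # uncertain: route to LLM validation
    return "rejected"
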
Feature 02

Slice and view results in multiple ways

Review matching results at scale - switch between views, scan key signals, and customize columns to support different review workflows and use cases.

View results
Feature 03

Identify issues faster with filters and signals

Use robust filters and key signals - match count, price delta, suspicious patterns - to surface potential issues quickly without manual scanning.

Filters and signals
Feature 04

Take action in context

Find listings to add as matches, remove incorrect matches in a click, and jump to the exact view you need - without losing your place.

In-context actions
Outcome

Happier Ops team, better data, stronger retention

−56%
Task Time
"This is so intuitive and much more efficient - it helps us get our jobs done much faster."
~$700K
Saved / Year
The new tool gave Ops everything they needed - we could sunset all the legacy systems entirely.
↑ Trust
Customer Retention
Improved data quality built customer trust and helped retain key accounts like Best Buy.
Key Process

How we got there

Process 01

How decisions laddered up - and where they got hard

Three challenges drove four features. Not every mapping was 1-to-1 - and a few decisions needed multiple iterations to get right. Here's the ladder, then the trade-offs that took the most thinking.

Decision map · How features ladder back to challenges
Challenge 01 · Real AI control without ML fluency
→ Feature 01 · Manage AI rules - direct answer: gives Ops a rule-based dial instead of ML knobs
→ Trade-off 01 · Sliders vs. rules - how I designed the dial without requiring ML literacy (deep-dive below)
Challenge 02 · Surface issues across 34K+ listings without manual scanning
→ Feature 02 · Slice and view - multiple lenses (catalog, listings, matches) for different review modes
→ Feature 03 · Filters and signals - cut straight to high-risk results: price delta, match count, anomalies
Challenge 03 · Make AI results actionable, not just visible
→ Feature 04 · Take action in context - add, remove, or jump to a match in one click, no losing place
→ Trade-off 02 · Calibrated confidence - show why a score is what it is (deep-dive below)

Two of the three challenges have a deep-dive trade-off below - Challenge 01 (the AI control dial) and Challenge 03 (how to display confidence). Each one had multiple options on the table; the trade-offs walk through the rejected one alongside the chosen one, and why.

Trade-off 01 Giving Ops AI control - sliders vs. rules

Ops needs to influence what the AI does. The question was how to expose that control without requiring ML literacy.

Option A · Threshold sliders
Surface the underlying ML thresholds: vector similarity, token weight, LLM trigger. Precise, scientific, gives ML-fluent users full control. But Ops aren't ML engineers - 0.85 means nothing without ML fluency.
Vector similarity threshold · 0.85
Token weight · 0.40
LLM trigger threshold · 0.60
Confidence floor · 0.75
Option B · IF-THEN rule builder (chosen)
Same dial, different knob. An IF-THEN rule builder grounded in Ops' language. Same underlying math - vector, token, LLM thresholds - but expressed as business logic Ops already speak.
IF Price diff > 30% → REJECT
IF Title contains "Refurbished" → REJECT
IF Brand mismatch → LLM
IF Score > 0.9 → CONFIRM
+ Add rule

Threshold sliders give precise control - to ML-fluent users. Most Ops are product experts, not ML engineers. The IF-THEN builder respects that: same dial, with a knob users already know how to turn. AI does the math; Ops make the call.
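
As a rough illustration of "same dial, different knob": the sketch below shows how business-language IF-THEN rules could compile down to the same underlying checks the sliders would expose. The fields, actions, and rule order are hypothetical:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    label: str                         # shown to Ops in their own words
    condition: Callable[[dict], bool]  # the underlying math, hidden
    action: str                        # REJECT | CONFIRM | SEND_TO_LLM

RULES = [
    Rule('Price diff > 30% -> REJECT',
         lambda m: m["price_diff_pct"] > 30, "REJECT"),
    Rule('Title contains "Refurbished" -> REJECT',
         lambda m: "refurbished" in m["title"].lower(), "REJECT"),
    Rule('Brand mismatch -> LLM',
         lambda m: m["brand_a"] != m["brand_b"], "SEND_TO_LLM"),
    Rule('Score > 0.9 -> CONFIRM',
         lambda m: m["score"] > 0.9, "CONFIRM"),
]

def evaluate(match: dict, default: str = "SEND_TO_LLM") -> str:
    # first matching rule wins; anything unmatched stays uncertain
    for rule in RULES:
        if rule.condition(match):
            return rule.action
    return default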

Trade-off 02 Displaying AI confidence - precision vs. clarity

All matches that went through LLM validation carry an AI confidence score. The question was how to show it. A raw percentage looks precise - but precise and useful are not the same thing.

Option A · Raw score
Show the percentage. Looks scientific. But 87% confident compared to what? Without context, Ops can't tell if 61% is a near-miss or clearly wrong - every borderline still needs manual judgment.
👟 Nike Air Max 2025 · amazon.com · $92.99 → 87%
👟 Nike Air Max 2025 Sneaker · walmart.com · $89.00 → 61%
👟 Nike Air Max 2025 · target.com · $94.99 → 43%
Option B · Calibrated tier (chosen)
Replace the percentage with one of two clear tiers grounded in historical Ops confirmation data. No middle zone - the system either commits to a confidence, or hands judgment back to Ops.
👟 Nike Air Max 2025 · amazon.com · $92.99 → High confidence (strong match · 87%)
👟 Nike Air Max 2025 Sneaker · walmart.com · $89.00 → Low confidence (needs review · 61%)
👟 Nike Air Max 2025 · target.com · $94.99 → Low confidence (needs review · 43%)

A score is a tool for machines. A calibrated tier is a tool for humans. "Needs review" tells Ops what to do. "61%" tells them nothing unless they already know what 61% means for this product type - and Ops shouldn't have to learn that. Two tiers replace false precision with honest, actionable clarity - and the percentage is still there for users who want it.
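
A minimal sketch of the calibration idea: bucket historical scores, measure how often Ops actually confirmed matches in each bucket, and only call a score "High confidence" where history backs it up. The buckets and rates below are invented for illustration:

HISTORICAL_CONFIRM_RATE = {
    # score bucket -> how often Ops confirmed matches in that bucket
    (0.90, 1.01): 0.97,
    (0.80, 0.90): 0.93,
    (0.70, 0.80): 0.71,
    (0.00, 0.70): 0.42,
}
HIGH_TIER_CUTOFF = 0.90  # hypothetical confirmation rate needed for "High"

def tier(score: float) -> str:
    for (lo, hi), confirm_rate in HISTORICAL_CONFIRM_RATE.items():
        if lo <= score < hi:
            return ("High confidence" if confirm_rate >= HIGH_TIER_CUTOFF
                    else "Low confidence")
    return "Low confidence"

print(tier(0.87))  # High confidence - this bucket was confirmed 93% of the time
print(tier(0.61))  # Low confidence - needs review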

Trade-off 03 Find Match layout - context vs. simplicity

Ops need to search and filter listings fast to find potential missing matches and keep customer data complete. The question was how much to show at once.

Option A · Clean view
Searchable listings side-by-side - matched listings ordered on top, with a single search field for the unmatched ones below.
Find Match Option A - clean view
Option B · Persistent context (chosen)
Matched listings stay in their own table - users can keep scanning the matches while they search below, no context loss.
Find Match Option B - persistent context

Went with Option B - it adds cognitive load, but it's productive cognitive load. Ops use matched listings as reference points to spot similarities and find likely matches faster.

Process 02

Vibe coding to speed things up

We needed to shut down the legacy systems ASAP, so the validation timeline was tight (5–6 weeks). I used UX Pilot, V0, Figma Make, and Claude Code to explore and iterate faster. Because this was a 0→1 product, I validated the high-level flow and IA with users first, before polishing details.

Timeline
5–6 weeks
Tools used
UX Pilot
V0
Figma Make
Claude Code · since 2026
Claude Code - new addition to my workflow in 2026. Lets me develop code directly, no handoff to devs - I raise the PR myself.
Approach
01
Validate IA & flow
With real users first
02
Rapid iteration
AI tools to move fast
03
Polish details
After direction locked
Process 03

Facilitate tech discovery and keep everyone aligned

Matching has sophisticated logic - hard to keep everyone aligned without shared language. I drove discovery sessions, created workflow diagrams, and established ubiquitous language so the team could decide without confusion.

Matching strategy and version-type diagram
Phase 1 Takeaway

Keep aligned, keep moving

Building a new AI-powered matching system meant new logic, a complex workflow, and high stakes for data quality. To move fast, it's critical to get the team aligned through ubiquitous language, workflow diagrams, and frequent checkpoints - so everyone makes decisions from the same version of the truth.

Phase 2 · 2025 H2

"Design trust into AI-powered workflows"

We launched the platform. Users launched questions: "wait…why?" - How do we make AI results clear, testable, and trustworthy?

Context
Platform launched, but Ops didn't fully trust the AI results - adoption stayed low and UAT surfaced a lot of questions. The core problem isn't training. It's trust.
My Role
Sole designer, end-to-end. Led discovery → flows → UI → iteration. Identified trust blockers and designed patterns to make AI results explainable, testable, and safe to iterate on.
Timeline
2025 H2
Team
1 PM · 1 Designer (me) · 7 Devs
Context

What is AI doing?

AI is used in two steps: Vector matching and LLM validation. We embed selected attributes to generate a similarity score; if a case is still uncertain, users can set pre-validation rules to route it to the LLM (a prompt) for a final yes/no.

Listings to match
1 · Token-based matching: UPC ✓ · MPN ✓ · EAN ✓ - token found, exact identifier match
2 · Vector-based matching (AI): embedding attributes (Brand, Title, Price, Model) → similarity score, e.g. 0.91; score > 0.6 → confirmed candidate
3 · Pre-validation (rules set by users):
IF Price Difference > 30% → LLM
IF Title contains "Refurbished" → Reject
IF Score > 0.9 → Confirm
→ confirmed match · rejected match · uncertain (send to LLM)
4 · LLM validation (AI, prompt): system prompt → decision (match or not), one-sentence reasoning explaining why, confidence (e.g. 78%) → confirmed or rejected match
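
Read end to end, the pipeline order looks roughly like the sketch below. The helper functions are stand-ins for the real token, vector, and LLM stages, with illustrative values:

def shares_identifier(a: dict, b: dict) -> bool:
    # token step: exact identifier match on UPC / MPN / EAN
    return any(a.get(k) and a.get(k) == b.get(k) for k in ("upc", "mpn", "ean"))

def vector_similarity(a: dict, b: dict) -> float:
    return 0.91  # stand-in for embedding brand/title/price/model + cosine

def llm_validate(a: dict, b: dict) -> tuple[str, str, float]:
    return "confirmed", "same model year and colorway", 0.78  # stand-in prompt call

def match_listing(a: dict, b: dict, rules=()) -> tuple[str, str]:
    # 1. Token matching short-circuits on exact identifiers.
    if shares_identifier(a, b):
        return "confirmed", "token: exact identifier match"
    # 2. Vector matching scores attribute embeddings.
    score = vector_similarity(a, b)
    if score <= 0.6:
        return "rejected", f"vector: {score:.2f} below candidate floor"
    # 3. Pre-validation: user-set IF-THEN rules may decide outright.
    for condition, action in rules:
        if condition(a, b, score):
            return action, "rule matched"
    # 4. Still uncertain -> LLM validation with a one-sentence reason.
    verdict, reason, confidence = llm_validate(a, b)
    return verdict, f"LLM ({confidence:.0%}): {reason}"
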
Problem

Internal teams hesitate to adopt the AI-powered workflow

The new platform launched, but adoption is slowing because Ops doesn't feel confident using it. During UAT, the same questions kept surfacing: how the system works, why it produced a result, and whether the data is reliable - making the workflow feel risky to rely on for real decisions.

The core problem isn't "training." It's trust.

Discovery

What's blocking adoption?

Through task-based UAT and debriefs with Ops, I treated feedback as a trust audit: I logged recurring "how/why" questions, observed hesitation points, and clustered patterns into themes. 3 trust blockers emerged:

01
High Cognitive Load
"How does it work?"
The system introduces multiple concepts - pipelines, rules, AI outcomes, data states. Users struggle to form a correct mental model quickly, so every decision feels slower and more error-prone.
02
Low Traceability
"Why did this happen?"
Users can't easily see what inputs drove an outcome, what logic was applied, or where data came from - so results feel like a black box and trust breaks down.
03
Unclear Impact
"What will this trigger?"
Rule changes can send large volumes to the LLM, but users can't predict impact (volume + cost) upfront - so they avoid experimenting or updating rules at all.
Solution

4 patterns to make AI trustworthy

Pattern 01 · Traceability

Explain reasoning with Match Lineage

For every match result, surface the full reasoning chain - token/vector signals, the rules applied, and (when used) the LLM's one-sentence rationale with a confidence score. Everything is auditable.

Match Lineage panel
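
For illustration, a lineage record could hold something like the fields below - the names are mine, not the shipped schema:

from dataclasses import dataclass

@dataclass
class MatchLineage:
    match_id: str
    token_signals: dict          # e.g. {"upc": "exact", "mpn": None}
    vector_score: float          # similarity from attribute embeddings
    rules_applied: list[str]     # which IF-THEN rules fired, in order
    llm_used: bool = False
    llm_reasoning: str = ""      # the one-sentence rationale, if any
    llm_confidence: float | None = None

lineage = MatchLineage(
    match_id="m-1024",
    token_signals={"upc": None, "mpn": None},
    vector_score=0.78,
    rules_applied=["Brand mismatch -> LLM"],
    llm_used=True,
    llm_reasoning="Same model and size; title differs by retailer suffix.",
    llm_confidence=0.78,
)
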
Pattern 02 · Transparency

Show and filter by data source

Users needed to understand where each match came from - when it was added and which pipeline produced it (token, vector, or LLM validation). Source-aware labels and filters let them isolate results, compare pipeline performance, and troubleshoot with confidence.

Filter by data source
Pattern 03 · Safety

Give users a sandbox (in a system that isn't one)

Match edits are high-stakes: each accept/reject/override updates the catalog customers see, triggers LLM revalidation cost, and writes a permanent audit row. A draft mode flow lets Ops review and edit dozens of matches without every save going live - publish the batch once it's ready, with aggregated impact you can actually evaluate.

Draft mode sandbox
Pattern 04 · Predictability

Make cost impact visible before running

Users hesitate to reprocess or update rules because they can't predict the cost impact - "Am I about to spend $5K?" anxiety. I surfaced estimated volume and LLM cost before users commit to a run, turning anxiety into informed confidence.

Cost impact visibility
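
The estimate itself can stay simple; what matters is surfacing it before the commit. A sketch, with hypothetical per-call pricing:

COST_PER_LLM_CALL = 0.004  # hypothetical blended $/call

def estimate_impact(candidates: list, would_route_to_llm) -> dict:
    to_llm = [c for c in candidates if would_route_to_llm(c)]
    return {
        "llm_volume": len(to_llm),
        "estimated_cost": round(len(to_llm) * COST_PER_LLM_CALL, 2),
    }

# e.g. a proposed rule that routes every sub-0.8 score to the LLM
candidates = [{"score": s / 100} for s in range(50, 100)]
print(estimate_impact(candidates, lambda c: c["score"] < 0.8))
# {'llm_volume': 30, 'estimated_cost': 0.12}
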
A closer look · Pattern 03

How I got to "draft mode"

Two ways to keep match edits safe. The interesting question wasn't whether to add friction - it was where to put it.

Option A · Save + confirm per match
Each match edit triggers a confirmation modal. Every accept, reject, or override interrupts. Friction at the wrong moment - you've already decided; the modal just stalls you.
iPhone 14 Pro 256GB → BestBuy #1024 · Edit ✏
Sony WH-1000XM5 → Amazon B0B... · Saving…
Confirmation modal: "Publish this match decision? 1 match will go live in the catalog. Customers will see this immediately." · Cancel / Publish
Option B · Draft mode + batch publish (chosen)
Match edits accumulate as drafts. Publish them when you're ready. Friction at the right moment - one decision covers many matches, with the impact aggregated.
3 drafts · 3 matches updated · ready to publish → Publish all
iPhone 14 Pro 256GB → BestBuy #1024 · DRAFT
Sony WH-1000XM5 → Amazon B0B... · DRAFT
Dyson V15 Detect → Amazon B09... · Live

Save + confirm per match felt fast - and that was the problem. Every match decision is high-stakes (catalog correctness, customer-facing data, audit trail), but a modal after every save trains users to click through without reading. Draft mode flips it: edits are cheap, only publishing is high-stakes - and it happens once for the whole batch, with aggregated impact you can actually understand. Friction belongs on the irreversible step, not the review step.
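
A sketch of the draft-first model: staging edits is cheap, and the single publish step carries both the friction and the audit event. Names and fields are illustrative:

from dataclasses import dataclass, field

@dataclass
class MatchEdit:
    match_id: str
    action: str            # "accept" | "reject" | "override"
    status: str = "DRAFT"  # nothing customer-facing until published

@dataclass
class DraftBatch:
    edits: list = field(default_factory=list)

    def stage(self, edit: MatchEdit) -> None:
        self.edits.append(edit)  # cheap, reversible, no modal

    def impact_summary(self) -> str:
        return f"{len(self.edits)} drafts · ready to publish"

    def publish_all(self) -> list:
        # the one irreversible step: go live + write the audit rows
        for e in self.edits:
            e.status = "LIVE"
        published, self.edits = self.edits, []
        return published

batch = DraftBatch()
batch.stage(MatchEdit("m-1024", "accept"))
batch.stage(MatchEdit("m-2048", "reject"))
print(batch.impact_summary())  # 2 drafts · ready to publish
batch.publish_all()            # friction lives here, once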

Outcome

Less friction, fewer incidents, faster reviews

↓ Q's
Faster Reviews
Match Lineage + clear reasoning means fewer "how/why" questions - Ops can self-serve answers and reviews move faster.
↓ Risk
Fewer Incidents
Draft-first iteration and cost guardrails reduce costly mistakes - bad pushes, unintended customer impact, and expensive reruns.
↑ Confidence
Higher Adoption
When Ops understands and trusts the AI output, they actually use the tool - and the data quality flywheel starts spinning.
Next Step

Trust is a continuous job

Trust didn't end at launch - it needs ongoing visibility and control. Next, I'd focus on:

01
Explain "why vector"
Make vector similarity scores more interpretable - what drove the score, what changed between runs.
02
Prompt tuning with guardrails
Customer-level prompt configs with versioning, approvals, and built-in guidance when results change.
03
Quality metrics per run
Run-level accuracy and completeness scores so teams can judge whether changes actually helped.
Takeaway

Trust is the product

AI isn't powerful just because it's "AI" - it shouldn't be a buzzword taped onto a workflow. What makes it actually useful is when users understand it, trust it, and feel safe acting on it. Designing for trust means building in explainability, predictability, and control from the start - not as an afterthought after adoption stalls.

Next Project
Product-Led Growth onboarding