Wiser

Designing trust into AI-powered workflows

The AI matches thousands of products in seconds. The hard part is getting Ops to trust the answers - and to override the wrong ones without breaking a sweat.

Wiser AI Matching Tool
Business Problem
Wiser had multiple products built on separate legacy internal matching systems - each with its own logic and portal. That meant higher costs, more complexity, and a lot of confusion. We needed an AI-powered tool with a unified approach to reduce costs and enhance data quality. This was a 0→1 brand-new product, not a reskin of the legacy ones.
My Role
  • Owned end-to-end design as sole designer (research → UAT) and drove discovery on matching logic, edge cases, and shared terminology
  • Took on PM responsibilities - scope, stakeholder alignment, and UAT coordination
  • Partnered with Ops/Eng leadership to review AI results, make tradeoffs, and unblock decisions
Timeline
2025 H1
Team
1 PM · 1 Designer (me) · 8 Devs
Context

What is Matching?

Matching is how we link a customer's catalog product to the same product sold by other retailers. It's the foundation of Wiser's app - every price insight depends on these links being correct.

Retailer data is messy (inconsistent titles, attributes, identifiers), so matches can be inaccurate or incomplete. And match quality is critical: it's the #1 reason customers churn.

Step 1
A customer sells Product A
They want to know: who else sells the same item, and at what price? That's the question matching answers.
👟 Nike Air Max 2025 · $89.99
Brand: Nike · Size: US 10 · Color: Black/White
Customer's catalog → ? → 🛒 Other retailers?
Step 2
The matching system searches
The system scans millions of retailer listings and uses AI to find the same product - despite messy, inconsistent data.
How the pipeline works
👟 Product A (customer catalog) → Matching engine (scans all retailers · AI) → ✓ Matches (linked results)
Step 3
Results can be right, wrong, or missing
Retailer data is messy. The system does its best - but results need to be reviewed. This is why Ops exists.
Nike Air Max 2025 · amazon.com - Brand, Title, and Price all align → Accurate
Nike Air Max 2025 · walmart.com - UPC token match confirmed → Accurate
Nike Air Max 2024 · footlocker.com - wrong year: similar title, different product → Inaccurate
⚠️ Nike Air Max 2025 · target.com - exists but wasn't found (incomplete data) → Missing
Why it matters
Bad data = customer churn
Every price insight is built on match data. When matches are wrong, insights are wrong - and customers leave.
📉
The #1 reason customers churn
Bad match data breaks trust fast. A customer sees wrong competitor prices, makes decisions on them, loses revenue, and blames Wiser.

This is why accurate, complete matching isn't a nice-to-have - it's the foundation everything else depends on.
User Problem

Our internal team struggled with ineffective workflows (and a '90s UI)

Ops had to jump between different portals (and different legacy logic) to manage matching across products. The workflow was fragmented, inconsistent, and slowed them down.

Their job: ensure matches are accurate and complete for customers. That means setting matching rules, reviewing results, spotting issues, and fixing them.

The legacy matching system UI Ops were working with

Discovery

What Ops actually needed

Before designing anything, I ran research sessions with the Ops team - shadowing their workflow and asking them to walk through real tasks. Three questions kept surfacing. Not about features - about the work itself.

"I don't know if I've seen everything."
No coverage confidence
Ops had no way to know if they'd reviewed all risky matches for a product, or if something had slipped through another portal undetected.
"I fix it here, but it breaks somewhere else."
Fragmented state across systems
Changes in one legacy portal didn't reliably propagate. A fix in one place could quietly create a gap downstream - with no way to know until a customer noticed.
"I'm not sure the rule is doing what I think."
Rules felt like a black box
There was no way to verify a rule was working as intended without checking output case by case - making every rule change feel uncertain and slow.

The underlying problem wasn't the legacy UI. It was lack of control and visibility - Ops couldn't trust that their actions were having the effect they intended, so every decision felt fragile.

Define Direction

Where to use AI (and where not to)?

The goal wasn't to "AI everything." I mapped the end-to-end Ops workflow and helped decide what AI should do and where humans must stay in the loop.

The MVP principle
AI generates matches at scale - humans review, identify issues, and correct them.
Step 1 · Define rules · Human - sets IF/THEN rules that drive the AI. High judgment, low frequency.
Step 2 · Generate matches · AI - the token, vector, and LLM pipeline runs at scale. Too large for humans to do alone.
Step 3 · Review results · Human - reviews AI output at scale with filters; spots issues, verifies. Catches errors before they hit customers.
Step 4 · Address issues · Human - fixes and corrects: adds or removes matches manually. Stakes too high to automate corrections.

Why not automate more? Bad matches flow directly to customers and erode trust. The cost of an AI error is too high to remove human oversight from review and correction - at least at this stage.

Design Challenges

What the design had to solve

01
How do you give non-technical users real control over AI?
Ops need to influence what the AI does, but they're not ML engineers. A simple toggle is too coarse - full parameter control is too technical. The interface had to feel powerful without requiring ML fluency.
02
How do you surface issues across thousands of matches without manual scanning?
Ops can't review every result individually. The tool needed to help them find what matters - surfacing patterns and high-risk results without requiring them to already know what to look for.
03
How do you show AI results users can act on, not just interpret?
AI gives a confidence score. But a number alone is just another black box - Ops still can't tell if they should trust it or where it breaks down. Confidence needed to be legible, not just visible.
Solution

An AI-powered matching tool that's smooth and trustworthy

Feature 01

Manage AI-powered rules and control what goes to LLM

Users set rules to send uncertain cases to LLM validation - separating confirmed matches from cases needing review. The system generates a similarity score from vector embeddings of selected attributes, with configurable thresholds.

Manage matching rules
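
To make the threshold mechanics concrete, here's a minimal sketch of how a configurable similarity cutoff could split auto-confirmed matches from cases routed to LLM review. Names and numbers are illustrative, not Wiser's production code:

import numpy as np

CONFIRM_THRESHOLD = 0.9    # hypothetical: above this, auto-confirm
CANDIDATE_FLOOR = 0.6      # hypothetical: below this, reject outright

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # similarity between two attribute embeddings (brand/title/price/model)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(a: np.ndarray, b: np.ndarray) -> str:
    score = cosine_similarity(a, b)
    if score > CONFIRM_THRESHOLD:
        return "confirmed"      # confident enough to skip the LLM
    if score > CANDIDATE_FLOOR:
        return "send_to_llm"    # uncertain: route to LLM validation
    return "rejected"
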
Feature 02

Slice and view results in multiple ways

Review matching results at scale - switch between views, scan key signals, and customize columns to support different review workflows and use cases.

View results
Feature 03

Identify issues faster with filters and signals

Use robust filters and key signals - match count, price delta, suspicious patterns - to surface potential issues quickly without manual scanning.

Filters and signals
Feature 04

Take action in context

Find listings to add as matches, remove incorrect matches in a click, and jump to the exact view you need - without losing your place.

In-context actions
Outcome

Happier Ops team, better data, stronger retention

−56%
Task Time
"This is so intuitive and much more efficient - it helps us get our jobs done much faster."
~$700K
Saved / Year
The new tool gave Ops everything they needed - we could sunset all the legacy systems entirely.
↑ Trust
Customer Retention
Improved data quality built customer trust and helped retain key accounts like Best Buy.
Key Process

How we got there

Process 01

How decisions laddered up - and where they got hard

Three challenges drove four features. Not every mapping was 1-to-1 - and a few decisions needed multiple iterations to get right. Here's the ladder, then the trade-offs that took the most thinking.

Decision map · How features ladder back to challenges
Challenge 01 · Real AI control without ML fluency
→ Feature 01 · Manage AI rules - direct answer: gives Ops a rule-based dial instead of ML knobs
→ Trade-off 01 · Sliders vs. rules - how I designed the dial without requiring ML literacy (deep-dive below)
Challenge 02 · Surface issues across 34K+ listings without manual scanning
→ Feature 02 · Slice and view - multiple lenses (catalog, listings, matches) for different review modes
→ Feature 03 · Filters and signals - cut straight to high-risk results: price delta, match count, anomalies
Challenge 03 · Make AI results actionable, not just visible
→ Feature 04 · Take action in context - add, remove, or jump to a match in one click, no losing place
→ Trade-off 02 · Calibrated confidence - show why a score is what it is (deep-dive below)

Two of the three challenges have a deep-dive trade-off below - Challenge 01 (the AI control dial) and Challenge 03 (how to display confidence). Each one had multiple options on the table; the trade-offs walk through the rejected one alongside the chosen one, and why.

Trade-off 01 Giving Ops AI control - sliders vs. rules

Ops needs to influence what the AI does. The question was how to expose that control without requiring ML literacy.

Option A · Threshold sliders
Surface the underlying ML thresholds: vector similarity, token weight, LLM trigger. Precise, scientific, gives ML-fluent users full control. But Ops aren't ML engineers - 0.85 means nothing without ML fluency.
Vector similarity threshold · 0.85
Token weight · 0.40
LLM trigger threshold · 0.60
Confidence floor · 0.75
Option B · IF-THEN rule builder (chosen)
Same dial, different knob. An IF-THEN rule builder grounded in Ops' language. Same underlying math - vector, token, LLM thresholds - but expressed as business logic Ops already speak.
IF Price diff > 30% → REJECT
IF Title contains "Refurbished" → REJECT
IF Brand mismatch → LLM
IF Score > 0.9 → CONFIRM
+ Add rule

Threshold sliders give precise control - to ML-fluent users. Most Ops are product experts, not ML engineers. The IF-THEN builder respects that: same dial, with a knob users already know how to turn. AI does the math; Ops make the call.
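
As a rough illustration of "same dial, different knob": the sketch below shows how business-language IF-THEN rules could compile down to the same underlying checks the sliders would expose. The fields, actions, and rule order are hypothetical:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    label: str                         # shown to Ops in their own words
    condition: Callable[[dict], bool]  # the underlying math, hidden
    action: str                        # REJECT | CONFIRM | SEND_TO_LLM

RULES = [
    Rule('Price diff > 30% -> REJECT',
         lambda m: m["price_diff_pct"] > 30, "REJECT"),
    Rule('Title contains "Refurbished" -> REJECT',
         lambda m: "refurbished" in m["title"].lower(), "REJECT"),
    Rule('Brand mismatch -> LLM',
         lambda m: m["brand_a"] != m["brand_b"], "SEND_TO_LLM"),
    Rule('Score > 0.9 -> CONFIRM',
         lambda m: m["score"] > 0.9, "CONFIRM"),
]

def evaluate(match: dict, default: str = "SEND_TO_LLM") -> str:
    # first matching rule wins; anything unmatched stays uncertain
    for rule in RULES:
        if rule.condition(match):
            return rule.action
    return default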

Trade-off 02 Displaying AI confidence - precision vs. clarity

All matches that went through LLM validation carry an AI confidence score. The question was how to show it. A raw percentage looks precise - but precise and useful are not the same thing.

Option A · Raw score
Show the percentage. Looks scientific. But 87% confident compared to what? Without context, Ops can't tell if 61% is a near-miss or clearly wrong - every borderline still needs manual judgment.
👟 Nike Air Max 2025 · amazon.com · $92.99 → 87%
👟 Nike Air Max 2025 Sneaker · walmart.com · $89.00 → 61%
👟 Nike Air Max 2025 · target.com · $94.99 → 43%
Option B · Calibrated tier (chosen)
Replace the percentage with one of two clear tiers grounded in historical Ops confirmation data. No middle zone - the system either commits to a confidence, or hands judgment back to Ops.
👟 Nike Air Max 2025 · amazon.com · $92.99 → High confidence (strong match · 87%)
👟 Nike Air Max 2025 Sneaker · walmart.com · $89.00 → Low confidence (needs review · 61%)
👟 Nike Air Max 2025 · target.com · $94.99 → Low confidence (needs review · 43%)

A score is a tool for machines. A calibrated tier is a tool for humans. "Needs review" tells Ops what to do. "61%" tells them nothing unless they already know what 61% means for this product type - and Ops shouldn't have to learn that. Two tiers replace false precision with honest, actionable clarity - and the percentage is still there for users who want it.
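
A minimal sketch of the calibration idea: bucket historical scores, measure how often Ops actually confirmed matches in each bucket, and only call a score "High confidence" where history backs it up. The buckets and rates below are invented for illustration:

HISTORICAL_CONFIRM_RATE = {
    # score bucket -> how often Ops confirmed matches in that bucket
    (0.90, 1.01): 0.97,
    (0.80, 0.90): 0.93,
    (0.70, 0.80): 0.71,
    (0.00, 0.70): 0.42,
}
HIGH_TIER_CUTOFF = 0.90  # hypothetical confirmation rate needed for "High"

def tier(score: float) -> str:
    for (lo, hi), confirm_rate in HISTORICAL_CONFIRM_RATE.items():
        if lo <= score < hi:
            return ("High confidence" if confirm_rate >= HIGH_TIER_CUTOFF
                    else "Low confidence")
    return "Low confidence"

print(tier(0.87))  # High confidence - this bucket was confirmed 93% of the time
print(tier(0.61))  # Low confidence - needs review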

Trade-off 03 Find Match layout - context vs. simplicity

Ops need to search and filter listings fast to find potential missing matches and keep customer data complete. The question was how much to show at once.

Option A · Clean view
Searchable listings side-by-side - matched listings ordered on top, with a single search field for the unmatched ones below.
Find Match Option A - clean view
Option B · Persistent context (chosen)
Matched listings stay in their own table - users can keep scanning the matches while they search below, no context loss.
Find Match Option B - persistent context

Went with Option B - it adds cognitive load, but it's productive cognitive load. Ops use matched listings as reference points to spot similarities and find likely matches faster.

Process 02

Vibe coding to speed things up

We needed to shut down the legacy systems ASAP, so the validation timeline was tight (5–6 weeks). I used UX Pilot, V0, Figma Make, and Claude Code to explore and iterate faster. Because this was a 0→1 product, I validated the high-level flow and IA with users first, before polishing details.

Timeline
5–6 weeks
Tools used
UX Pilot
V0
Figma Make
Claude Code · since 2026
Claude Code - new addition to my workflow in 2026. Lets me develop code directly, no handoff to devs - I raise the PR myself.
Approach
01
Validate IA & flow
With real users first
02
Rapid iteration
AI tools to move fast
03
Polish details
After direction locked
Process 03

Facilitate tech discovery and keep everyone aligned

Matching has sophisticated logic - hard to keep everyone aligned without shared language. I drove discovery sessions, created workflow diagrams, and established ubiquitous language so the team could decide without confusion.

Matching strategy and version-type diagram
Phase 1 Takeaway

Keep aligned, keep moving

Building a new AI-powered matching system meant new logic, a complex workflow, and high stakes for data quality. To move fast, it's critical to get the team aligned through ubiquitous language, workflow diagrams, and frequent checkpoints - so everyone makes decisions from the same version of the truth.

Phase 2 · 2025 H2

"Design trust into AI-powered workflows"

We launched the platform. Users launched questions: "wait…why?" - How do we make AI results clear, testable, and trustworthy?

Context
Platform launched, but Ops didn't fully trust the AI results - adoption stayed low and UAT surfaced a lot of questions. The core problem isn't training. It's trust.
My Role
Sole designer, end-to-end. Led discovery → flows → UI → iteration. Identified trust blockers and designed patterns to make AI results explainable, testable, and safe to iterate on.
Timeline
2025 H2
Team
1 PM · 1 Designer (me) · 7 Devs
Context

What is AI doing?

AI is used in two steps: Vector matching and LLM validation. We embed selected attributes to generate a similarity score; if a case is still uncertain, users can set pre-validation rules to route it to the LLM (a prompt) for a final yes/no.

Listings to match
1 · Token-based matching: UPC ✓ · MPN ✓ · EAN ✓ - token found, exact identifier match
2 · Vector-based matching (AI): embedding attributes (Brand, Title, Price, Model) → similarity score, e.g. 0.91; score > 0.6 → confirmed candidate
3 · Pre-validation (rules set by users):
IF Price Difference > 30% → LLM
IF Title contains "Refurbished" → Reject
IF Score > 0.9 → Confirm
→ confirmed match · rejected match · uncertain (send to LLM)
4 · LLM validation (AI, prompt): system prompt → decision (match or not), one-sentence reasoning explaining why, confidence (e.g. 78%) → confirmed or rejected match
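
Read end to end, the pipeline order looks roughly like the sketch below. The helper functions are stand-ins for the real token, vector, and LLM stages, with illustrative values:

def shares_identifier(a: dict, b: dict) -> bool:
    # token step: exact identifier match on UPC / MPN / EAN
    return any(a.get(k) and a.get(k) == b.get(k) for k in ("upc", "mpn", "ean"))

def vector_similarity(a: dict, b: dict) -> float:
    return 0.91  # stand-in for embedding brand/title/price/model + cosine

def llm_validate(a: dict, b: dict) -> tuple[str, str, float]:
    return "confirmed", "same model year and colorway", 0.78  # stand-in prompt call

def match_listing(a: dict, b: dict, rules=()) -> tuple[str, str]:
    # 1. Token matching short-circuits on exact identifiers.
    if shares_identifier(a, b):
        return "confirmed", "token: exact identifier match"
    # 2. Vector matching scores attribute embeddings.
    score = vector_similarity(a, b)
    if score <= 0.6:
        return "rejected", f"vector: {score:.2f} below candidate floor"
    # 3. Pre-validation: user-set IF-THEN rules may decide outright.
    for condition, action in rules:
        if condition(a, b, score):
            return action, "rule matched"
    # 4. Still uncertain -> LLM validation with a one-sentence reason.
    verdict, reason, confidence = llm_validate(a, b)
    return verdict, f"LLM ({confidence:.0%}): {reason}"
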
Problem

Internal teams hesitate to adopt the AI-powered workflow

The new platform launched, but adoption is slowing because Ops doesn't feel confident using it. During UAT, the same questions kept surfacing: how the system works, why it produced a result, and whether the data is reliable - making the workflow feel risky to rely on for real decisions.

The core problem isn't "training." It's trust.

Discovery

What's blocking adoption?

Through task-based UAT and debriefs with Ops, I treated feedback as a trust audit: I logged recurring "how/why" questions, observed hesitation points, and clustered patterns into themes. 3 trust blockers emerged:

01
High Cognitive Load
"How does it work?"
The system introduces multiple concepts - pipelines, rules, AI outcomes, data states. Users struggle to form a correct mental model quickly, so every decision feels slower and more error-prone.
02
Low Traceability
"Why did this happen?"
Users can't easily see what inputs drove an outcome, what logic was applied, or where data came from - so results feel like a black box and trust breaks down.
03
Unclear Impact
"What will this trigger?"
Rule changes can send large volumes to the LLM, but users can't predict impact (volume + cost) upfront - so they avoid experimenting or updating rules at all.
Solution

4 patterns to make AI trustworthy

Pattern 01 · Traceability

Explain reasoning with Match Lineage

For every match result, surface the full reasoning chain - token/vector signals, the rules applied, and (when used) the LLM's one-sentence rationale with a confidence score. Everything is auditable.

Match Lineage panel
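
For illustration, a lineage record could hold something like the fields below - the names are mine, not the shipped schema:

from dataclasses import dataclass

@dataclass
class MatchLineage:
    match_id: str
    token_signals: dict          # e.g. {"upc": "exact", "mpn": None}
    vector_score: float          # similarity from attribute embeddings
    rules_applied: list[str]     # which IF-THEN rules fired, in order
    llm_used: bool = False
    llm_reasoning: str = ""      # the one-sentence rationale, if any
    llm_confidence: float | None = None

lineage = MatchLineage(
    match_id="m-1024",
    token_signals={"upc": None, "mpn": None},
    vector_score=0.78,
    rules_applied=["Brand mismatch -> LLM"],
    llm_used=True,
    llm_reasoning="Same model and size; title differs by retailer suffix.",
    llm_confidence=0.78,
)
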
Pattern 02 · Transparency

Show and filter by data source

Users needed to understand where each match came from - when it was added and which pipeline produced it (token, vector, or LLM validation). Source-aware labels and filters let them isolate results, compare pipeline performance, and troubleshoot with confidence.

Filter by data source
Pattern 03 · Safety

Give users a sandbox (in a system that isn't one)

Match edits are high-stakes: each accept/reject/override updates the catalog customers see, triggers LLM revalidation cost, and writes a permanent audit row. A draft mode flow lets Ops review and edit dozens of matches without every save going live - publish the batch once it's ready, with aggregated impact you can actually evaluate.

Draft mode sandbox
Pattern 04 · Predictability

Make cost impact visible before running

Users hesitate to reprocess or update rules because they can't predict the cost impact - "Am I about to spend $5K?" anxiety. I surfaced estimated volume and LLM cost before users commit to a run, turning anxiety into informed confidence.

Cost impact visibility
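
The estimate itself can stay simple; what matters is surfacing it before the commit. A sketch, with hypothetical per-call pricing:

COST_PER_LLM_CALL = 0.004  # hypothetical blended $/call

def estimate_impact(candidates: list, would_route_to_llm) -> dict:
    to_llm = [c for c in candidates if would_route_to_llm(c)]
    return {
        "llm_volume": len(to_llm),
        "estimated_cost": round(len(to_llm) * COST_PER_LLM_CALL, 2),
    }

# e.g. a proposed rule that routes every sub-0.8 score to the LLM
candidates = [{"score": s / 100} for s in range(50, 100)]
print(estimate_impact(candidates, lambda c: c["score"] < 0.8))
# {'llm_volume': 30, 'estimated_cost': 0.12}
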
A closer look · Pattern 03

How I got to "draft mode"

Two ways to keep match edits safe. The interesting question wasn't whether to add friction - it was where to put it.

Option A · Save + confirm per match
Each match edit triggers a confirmation modal. Every accept, reject, or override interrupts. Friction at the wrong moment - you've already decided; the modal just stalls you.
iPhone 14 Pro 256GB → BestBuy #1024 · Edit ✏
Sony WH-1000XM5 → Amazon B0B... · Saving…
Confirmation modal: "Publish this match decision? 1 match will go live in the catalog. Customers will see this immediately." · Cancel / Publish
Option B · Draft mode + batch publish (chosen)
Match edits accumulate as drafts. Publish them when you're ready. Friction at the right moment - one decision covers many matches, with the impact aggregated.
3 drafts · 3 matches updated · ready to publish → Publish all
iPhone 14 Pro 256GB → BestBuy #1024 · DRAFT
Sony WH-1000XM5 → Amazon B0B... · DRAFT
Dyson V15 Detect → Amazon B09... · Live

Save + confirm per match felt fast - and that was the problem. Every match decision is high-stakes (catalog correctness, customer-facing data, audit trail), but a modal after every save trains users to click through without reading. Draft mode flips it: edits are cheap, only publishing is high-stakes - and it happens once for the whole batch, with aggregated impact you can actually understand. Friction belongs on the irreversible step, not the review step.
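
A sketch of the draft-first model: staging edits is cheap, and the single publish step carries both the friction and the audit event. Names and fields are illustrative:

from dataclasses import dataclass, field

@dataclass
class MatchEdit:
    match_id: str
    action: str            # "accept" | "reject" | "override"
    status: str = "DRAFT"  # nothing customer-facing until published

@dataclass
class DraftBatch:
    edits: list = field(default_factory=list)

    def stage(self, edit: MatchEdit) -> None:
        self.edits.append(edit)  # cheap, reversible, no modal

    def impact_summary(self) -> str:
        return f"{len(self.edits)} drafts · ready to publish"

    def publish_all(self) -> list:
        # the one irreversible step: go live + write the audit rows
        for e in self.edits:
            e.status = "LIVE"
        published, self.edits = self.edits, []
        return published

batch = DraftBatch()
batch.stage(MatchEdit("m-1024", "accept"))
batch.stage(MatchEdit("m-2048", "reject"))
print(batch.impact_summary())  # 2 drafts · ready to publish
batch.publish_all()            # friction lives here, once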

Outcome

Less friction, fewer incidents, faster reviews

↓ Q's
Faster Reviews
Match Lineage + clear reasoning means fewer "how/why" questions - Ops can self-serve answers and reviews move faster.
↓ Risk
Fewer Incidents
Draft-first iteration and cost guardrails reduce costly mistakes - bad pushes, unintended customer impact, and expensive reruns.
↑ Confidence
Higher Adoption
When Ops understands and trusts the AI output, they actually use the tool - and the data quality flywheel starts spinning.
Next Step

Trust is a continuous job

Trust didn't end at launch - it needs ongoing visibility and control. Next, I'd focus on:

01
Explain "why vector"
Make vector similarity scores more interpretable - what drove the score, what changed between runs.
02
Prompt tuning with guardrails
Customer-level prompt configs with versioning, approvals, and built-in guidance when results change.
03
Quality metrics per run
Run-level accuracy and completeness scores so teams can judge whether changes actually helped.
Takeaway

Trust is the product

AI isn't powerful just because it's "AI" - it shouldn't be a buzzword taped onto a workflow. What makes it actually useful is when users understand it, trust it, and feel safe acting on it. Designing for trust means building in explainability, predictability, and control from the start - not as an afterthought after adoption stalls.

Next Project
Product-Led Growth onboarding