
HS Code Classifier


A free ITC-HS classifier that returns an 8-digit export code with its legal basis cited and a calibrated confidence band — a six-stage Gemini pipeline backstopped by a mechanical SQL verifier, graded by an honest eval harness instead of a demo.


hscode.prevyl.com

12,460 ITC-HS codes

84% top-3 accuracy

Model: Gemini 3.5 Flash

01 The challenge

Why this needed to exist.

Every Indian exporter must put an 8-digit ITC-HS code on every shipment, and the taxonomy is unforgiving: 21 sections, 97 chapters, and 12,460 tariff lines whose distinctions turn on legal notes and General Interpretive Rules rather than common sense. Getting it wrong risks ₹50K–5L in customs penalties, so small exporters either burn 30+ minutes per product hunting the catalogue or pay a consultant ₹2K–10K per lookup. A language model can guess a code in seconds — but a confidently wrong code is worse than none, and a black-box guess nobody can check is unfileable. The real problem isn't speed; it's producing a code an exporter can audit against the actual tariff text before staking a shipment on it.

02 The approach

How I built it.

Prevyl runs a six-stage pipeline (L0–L5) rather than a single prompt:

L0 normalizes the query and flags multi-material goods.
L1 (Google Gemini 3.5 Flash) triages each request into classify, ask, or refuse and extracts attributes.
L2 does hybrid retrieval over the whole catalogue: pgvector HNSW search on 1536-dimension Gemini embeddings alongside Postgres full-text search, reranked by Gemini Flash.
L3 applies 1,505 deterministic chapter-exclusion rules and narrows the field to a handful of candidates.
L4 (Gemini again) selects one code and must cite the heading or note and the General Interpretive Rule it relied on.
L5 (pure SQL and TypeScript, no model) mechanically verifies that citation against the real database text, checks the selected code's embedding against the query embedding, and feeds structured failures back to L4 to repair, up to two rounds.

When a description is too thin to separate two siblings, the system asks one targeted question instead of guessing. Confidence is surfaced as a High/Medium/Low band whose underlying number is stripped at the type, client, and server layers, so the interface can't imply precision it doesn't have. Every result exports as a cited 'Classification Record' PDF.
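The L4/L5 repair loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the type names, `selectWithRepair`, and the callback shapes are all hypothetical, and the selector is written synchronously to keep the sketch short even though a real model call would be asynchronous. The only detail taken from the description is the cap of two repair rounds.

```typescript
// Hypothetical shapes; the source only states that L5 feeds structured
// failures back to L4 for up to two repair rounds.
interface Selection {
  code: string;     // 8-digit code chosen by the selector
  citation: string; // heading or note text the selector quoted
}

interface CheckFailure {
  check: string;  // name of the verifier check that failed
  detail: string; // reason handed back to the selector
}

// Stand-ins for L4 (selection) and L5 (verification).
type Selector = (query: string, failures: CheckFailure[]) => Selection;
type Verifier = (query: string, selection: Selection) => CheckFailure[];

interface RepairOutcome {
  selection: Selection;
  verified: boolean;
  repairRounds: number;
}

const MAX_REPAIR_ROUNDS = 2; // "up to two rounds" per the description

function selectWithRepair(
  query: string,
  select: Selector,
  verify: Verifier,
): RepairOutcome {
  // First attempt: the selector sees no prior failures.
  let selection = select(query, []);
  let failures = verify(query, selection);
  let repairRounds = 0;

  // Re-select with the verifier's failures until it passes or the cap is hit.
  while (failures.length > 0 && repairRounds < MAX_REPAIR_ROUNDS) {
    repairRounds += 1;
    selection = select(query, failures);
    failures = verify(query, selection);
  }

  return { selection, verified: failures.length === 0, repairRounds };
}
```

A caller would treat `verified: false` as a signal to downgrade confidence or ask a clarifying question; how the real system handles that case is not stated in the description.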

03 Impact

What it did.

Prevyl is live and free at hscode.prevyl.com — an Express backend on Railway, a Next.js 16 frontend on Vercel, and Supabase Postgres with pgvector. What makes the accuracy claims trustworthy is the eval harness: a 385-case master suite over a frozen gold denominator, scored with real calibration (Brier, ECE, Wilson intervals) and McNemar significance gating so one principled change is judged per round. The validated numbers are measured, not marketed — 75.2% outright on the exact 8-digit code, 77.9% effective once clarifying-question recovery is counted, 84.4% in the top three, 87.3% at chapter and 84.1% at heading level — with a separate 60-case messy-real-world-input suite holding the honest figure at 67.8%. Median latency is about 26 seconds, and because every change is gated against the suite, a regression can't ship hidden behind a cherry-picked example.
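For readers unfamiliar with the Wilson interval mentioned above, here is a minimal sketch of the standard Wilson score interval for a proportion. The function name and the default `z = 1.96` (roughly a 95% interval) are assumptions made for illustration; the harness's own implementation and confidence level are not described in the source.

```typescript
interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for `successes` out of `total` trials.
// z = 1.96 corresponds to roughly 95% confidence (assumed default).
function wilsonInterval(successes: number, total: number, z = 1.96): Interval {
  if (total <= 0 || successes < 0 || successes > total) {
    throw new RangeError("need 0 <= successes <= total and total > 0");
  }
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  // Centre is pulled towards 0.5, which keeps the interval sensible
  // for small samples and for proportions near 0 or 1.
  const centre = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return {
    lower: Math.max(0, centre - half),
    upper: Math.min(1, centre + half),
  };
}
```

Unlike the plain normal approximation, this interval never runs outside the 0 to 1 range, which matters when a suite has only a few dozen cases.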

04 Tech stack

What I used — and why.

Google Gemini 3.5 Flash

Drives L1 triage, L4 selection, and the L2 reranker (thinking level low), served through Google Vertex AI in production with the Gemini Developer API as a drop-in fallback. It replaced an earlier OpenAI GPT-4o-mini pipeline — and the eval surfaced a finding that shaped the design: the selection bottleneck is information, not model size (a larger model picked the same wrong sibling), so effort went into retrieval and verification rather than a bigger brain.

Postgres + pgvector on Supabase

The 12,460-line ITC-HS taxonomy lives in a normalized schema (sections → chapters → headings → subheadings → tariff lines) with FK and regex CHECK constraints, chapter notes, and 1,505 exclusion rules as first-class data. Every tariff line carries a 1536-dim embedding for HNSW cosine retrieval, run beside Postgres full-text search so semantic and lexical recall back each other up.
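The source does not say how the semantic and lexical result lists are combined before reranking. One common option is reciprocal rank fusion (RRF), sketched below purely as an assumption: the function name, the string IDs, and the constant `k = 60` are illustrative and not taken from the project.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// so items ranked well by both retrievers rise to the top.
// NOTE: RRF is an assumed technique here, not confirmed by the source.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical candidate IDs from the two retrievers.
const vectorHits = ["A", "B", "C"];
const fullTextHits = ["B", "D", "A"];
const fused = reciprocalRankFusion([vectorHits, fullTextHits]);
// fused is ["B", "A", "D", "C"]
```

The fused list would then be truncated and handed to the reranker. Rank-based fusion avoids having to put cosine distances and full-text scores on a common scale.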

L5 mechanical verifier

A pure SQL/TypeScript layer — no LLM — that runs ten checks on every selected code. The load-bearing one is verbatim-citation containment: the note or heading the model quoted must actually match the database text above a strict threshold, and the code's embedding must clear a cosine floor against the query. It's what lets the product promise 'a code you can check' instead of 'trust the AI', and it powers the repair loop that catches and re-selects bad codes.
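A rough sketch of what a citation-containment check plus a cosine floor might look like is given below. Everything in it is an assumption for illustration: the trigram-based fallback score, the function names, and the two constants `CITATION_THRESHOLD` and `COSINE_FLOOR` are hypothetical, since the source does not state how matching is scored or what the thresholds are.

```typescript
const CITATION_THRESHOLD = 0.9; // hypothetical value
const COSINE_FLOOR = 0.5;       // hypothetical value

function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns 1 when the normalized quote is a substring of the source;
// otherwise the fraction of the quote's word trigrams found in the source.
function containmentScore(quote: string, source: string): number {
  const q = normalize(quote);
  const s = normalize(source);
  if (q.length === 0) return 0;
  if (s.includes(q)) return 1;
  const words = q.split(" ");
  const total = words.length - 2;
  if (total <= 0) return 0;
  let hits = 0;
  for (let i = 0; i < total; i++) {
    if (s.includes(words.slice(i, i + 3).join(" "))) hits += 1;
  }
  return hits / total;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new RangeError("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Returns the names of failed checks; an empty array means both passed.
function verifyCitation(
  quote: string,
  sourceText: string,
  queryEmbedding: number[],
  codeEmbedding: number[],
): string[] {
  const failures: string[] = [];
  if (containmentScore(quote, sourceText) < CITATION_THRESHOLD) {
    failures.push("citation_containment");
  }
  if (cosineSimilarity(queryEmbedding, codeEmbedding) < COSINE_FLOOR) {
    failures.push("cosine_floor");
  }
  return failures;
}
```

Because both checks are deterministic, the same selection always yields the same failures, which is what makes them usable as structured feedback for a repair round.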

Evaluation harness

A 385-case suite over a frozen gold denominator measures outright vs ask-recovered accuracy, top-3, per-level routing, confident-wrong rate, and calibration (Brier + ECE with bootstrap CIs), with McNemar significance testing. It's why the accuracy figures are specific and defensible — and why regressions are caught before they ship rather than after.
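To illustrate the McNemar gating mentioned above, here is a minimal sketch of the exact (binomial) form of the test on paired results. The function names, the `alpha = 0.05` default, and the gate rule itself are assumptions; the source says only that McNemar significance testing is used, not which variant or threshold.

```typescript
// Exact two-sided McNemar test on the discordant pairs only:
//   onlyBaselineCorrect  = cases the baseline got right and the candidate wrong
//   onlyCandidateCorrect = cases the candidate got right and the baseline wrong
function mcnemarExact(
  onlyBaselineCorrect: number,
  onlyCandidateCorrect: number,
): number {
  const b = onlyBaselineCorrect;
  const c = onlyCandidateCorrect;
  if (b < 0 || c < 0 || !Number.isInteger(b) || !Number.isInteger(c)) {
    throw new RangeError("counts must be non-negative integers");
  }
  const n = b + c;
  if (n === 0) return 1; // no disagreement, no evidence of a difference
  const k = Math.min(b, c);
  // Binomial(n, 0.5) lower tail: sum over i = 0..k of C(n, i) * 0.5^n.
  let term = Math.pow(0.5, n); // the i = 0 term
  let tail = term;
  for (let i = 1; i <= k; i++) {
    term = (term * (n - i + 1)) / i; // C(n, i) from C(n, i - 1)
    tail += term;
  }
  return Math.min(1, 2 * tail); // two-sided p-value, capped at 1
}

// Hypothetical gate: accept a change only if it wins more discordant
// cases than it loses and the difference is significant at alpha.
function passesGate(
  onlyBaselineCorrect: number,
  onlyCandidateCorrect: number,
  alpha = 0.05,
): boolean {
  return (
    onlyCandidateCorrect > onlyBaselineCorrect &&
    mcnemarExact(onlyBaselineCorrect, onlyCandidateCorrect) < alpha
  );
}
```

Cases that both versions get right, or both get wrong, carry no information about which is better, so the test looks only at the pairs where they disagree.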
