Production Proof
The body of measurable engineering evidence backing SuiteCentral 2.0’s “production-ready” claim.
What it is
The set of objective, verifiable signals that SuiteCentral 2.0 is not vaporware: passing test suites, broad coverage, real connectivity proofs, and multi-provider AI redundancy.
Why it matters (to the adoption case)
Squire’s CTO and CFO will not approve a pilot for software that hasn’t been engineered carefully. Production proof is what converts “interesting product story” into “responsible technology bet.” 01-executive-summary places it on slide 4 — between the differentiation pitch and the Squire-specific value framing — which signals it’s load-bearing for the executive case.
The numbers
Note: the slide-script numbers below are from a presentation created circa late 2025/early 2026. The Preston-Test repo README.md (current as of April 2026) shows higher numbers. Both are accurate snapshots in time: the slide-script numbers are what executives saw in the presentation; the README numbers are what’s true today. Don’t conflate.
Per 01-executive-summary slide 4 and 11-role-brief-cto (slide-script vintage)
- 100% suite pass rate: 392/392 suites
- 100% executed-test pass rate: 9,038/9,038
- 9,038/9,061 total: 23 intentionally skipped (confirmed across two sources)
- 95%+ AI accuracy (per 11-role-brief-cto only — methodology not yet ingested)
- Multi-provider AI stack (asserted as production-validated)
- NetSuite sandbox connectivity (asserted as proven)
- Failure-path visibility and fallback handling (per 11-role-brief-cto — see cto)
Per read-talking-points (TALKING-POINTS vintage, formally ingested 2026-04-07)
- 100% suite pass rate: 392/392 suites
- 9,207/9,237 tests passing: 30 tests intentionally skipped (the Talking-Points vintage skip baseline)
- Six production connectors (not a test number, but part of the “what’s proven” talking point)
Per 15-start-here-async-standalone (CURRENT vintage, formally re-baselined 2026-06-12)
- 100% suite pass rate: 607/607 suites
- 12,254/12,270 tests passing: 16 intentionally skipped
- 68.73% statement coverage
- 2,282 tracked files / ~854K text LOC repository scale snapshot
These match the Preston-Test repo README.md and are the numbers a reviewer sees on the live demo site Start Here page.
Canonical test breakdown (per 26-canonical-metrics-and-wording + 04-technical-proof)
The canonical way to cite the test counts per the style guide:
- 100% suite pass rate (607 suites)
- 100% of executed tests passed (12,254 tests across unit, integration, and E2E)
- Full breakdown: 11,731 unit (0 skipped), 503 integration (16 skipped), 20 E2E portal
Math check: 11,731 + 503 + 20 = 12,254 passing. 0 + 16 + 0 = 16 skipped (all skipped tests are in integration, none in unit or E2E). 12,254 + 16 = 12,270 total. Perfect reconciliation.
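The math check above can be re-run mechanically whenever the counts are re-baselined. The following is a minimal sketch; `TierCount` and `reconcile` are illustrative names and are not taken from the repository.

```typescript
// Illustrative only: TierCount and reconcile are hypothetical names.
interface TierCount {
  tier: "unit" | "integration" | "e2e";
  passed: number;
  skipped: number;
}

// Figures copied from the canonical breakdown above.
const breakdown: TierCount[] = [
  { tier: "unit", passed: 11731, skipped: 0 },
  { tier: "integration", passed: 503, skipped: 16 },
  { tier: "e2e", passed: 20, skipped: 0 },
];

// Sum passed and skipped counts across tiers and derive the total.
function reconcile(tiers: TierCount[]): { passed: number; skipped: number; total: number } {
  const passed = tiers.reduce((sum, t) => sum + t.passed, 0);
  const skipped = tiers.reduce((sum, t) => sum + t.skipped, 0);
  return { passed, skipped, total: passed + skipped };
}

const totals = reconcile(breakdown);
console.log(totals); // passed 12254, skipped 16, total 12270
```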
Canonical coverage (per 26-canonical-metrics-and-wording)
| Metric | Value |
|---|---|
| Statements | 68.73% |
| Branches | 57.69% |
| Functions | 71.29% |
| Lines | 68.98% |
Short form per the style guide: “69% line coverage across 53,903 lines of production TypeScript.” The 53,903 number is the production TypeScript subset — the 15-start-here-async-standalone page’s “~854K text LOC” figure is the total repo (code + tests + config + docs).
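The “69%” short form is the Lines row rounded to a whole percent, and the uncovered share of any metric is its complement. A small sketch of both calculations, with the `coverage` object and `uncovered` helper as illustrative names only:

```typescript
// Coverage percentages copied from the canonical table above.
const coverage = {
  statements: 68.73,
  branches: 57.69,
  functions: 71.29,
  lines: 68.98,
};

// Whole-percent rounding used for the short form.
const shortFormLines = Math.round(coverage.lines);

// Uncovered share is the complement, rounded to two decimals to avoid float noise.
function uncovered(pct: number): number {
  return Math.round((100 - pct) * 100) / 100;
}

console.log(`${shortFormLines}% line coverage`); // 69% line coverage
console.log(`uncovered statements: ${uncovered(coverage.statements)}%`); // 31.27%
```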
Four production AI providers (per 04-technical-proof + 26-canonical-metrics-and-wording)
The multi-provider AI stack is now fully enumerated:
| Provider | Model (primary) | Role | Cost per mapping |
|---|---|---|---|
| OpenAI | GPT-5.4 mini (default; earlier sources cited GPT-4o) | Primary inference | ~$0.0007 (benchmark-measured) |
| Anthropic Claude | Claude Haiku 4.5 (Sonnet 4.6 upgrade tier; earlier env config showed claude-sonnet-4-5-20250929) | Secondary / validation | ~$0.0011 (benchmark-measured) |
| OpenRouter | Multi-model | Routing / fallback | Free tier available |
| LMStudio | Llama 3.1 8B | On-premise / fallback | Free (local) |
All four operational per the technical proof document. Per-mapping costs are measured by the Phase-B accuracy benchmark matrix (2026-06-10 live run — Claude Haiku 4.5 measured 96.7% top-1 on the NetSuite pair, edging out GPT-5.4 mini’s 95.1%). The $0.003/mapping shown in oracle-comparison’s live demo is a legacy Claude 3.5 Sonnet-era price.
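The primary, secondary, and fallback roles in the table suggest an ordered fallback chain. The sketch below illustrates that idea under stated assumptions: `MappingProvider`, `FallbackResult`, and `suggestWithFallback` are hypothetical names, and the source does not describe how the repository actually orders or invokes providers.

```typescript
// Hypothetical shapes for illustration; not the repository's API.
interface MappingProvider {
  id: string;
  suggest: (sourceField: string) => Promise<string>;
}

interface ProviderFailure {
  providerId: string;
  reason: string;
}

interface FallbackResult {
  providerId: string;
  targetField: string;
  failures: ProviderFailure[];
}

// Try providers in priority order and record each failure,
// so the failure path stays visible to the caller.
async function suggestWithFallback(
  providers: MappingProvider[],
  sourceField: string,
): Promise<FallbackResult> {
  const failures: ProviderFailure[] = [];
  for (const provider of providers) {
    try {
      const targetField = await provider.suggest(sourceField);
      return { providerId: provider.id, targetField, failures };
    } catch (err) {
      failures.push({
        providerId: provider.id,
        reason: err instanceof Error ? err.message : String(err),
      });
    }
  }
  throw new Error(`all providers failed: ${failures.map((f) => f.providerId).join(", ")}`);
}
```

Returning the accumulated `failures` alongside the successful answer is one way to keep fallback events auditable rather than silent.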
Canonical OpenAI model list (per src/services/ai/ModelCatalogService.ts)
The actual source-of-truth capability matrix in the code lists these OpenAI models:
| Model | Context window | Vision | JSON mode | Tool use | Reasoning |
|---|---|---|---|---|---|
| gpt-4o | 128K | ✓ | ✓ | ✓ | ✓ |
| gpt-4o-mini | 128K | ✓ | ✓ | ✓ | — |
| gpt-4.1 | 128K | ✓ | ✓ | ✓ | ✓ |
Note: ai-provider-system (the AI Provider System doc) lists older OpenAI models (GPT-4, GPT-4 Turbo, GPT-3.5 Turbo) — that document is stale and predates the gpt-4o upgrade. The canonical list is ModelCatalogService.ts; 04-technical-proof (March 3, 2026) already reflects the current gpt-4o primary.
Canonical Claude model list
| Model | Context window |
|---|---|
| claude-3-5-sonnet-20241022 | 200K |
| claude-3-opus-20240229 | 200K |
| claude-3-haiku-20240307 | 200K |
Also from ModelCatalogService.ts. Consistent with ai-provider-system.
The 7-provider total (per ai-provider-system)
In addition to the 4 production providers above, the code supports 3 more:
- Grok (xAI) — experimental. Models: grok-beta, grok-vision-beta.
- Gemini (Google) — experimental. Models: gemini-1.5-flash (1M context), gemini-1.5-pro (2M context).
- Rule-Based Engine — deterministic fallback. 78% accuracy baseline; no external API calls.
Canonical wording per 26-canonical-metrics-and-wording is “4 production-ready AI providers” — the 3 extras are categorized as experimental or fallback and should not be counted in pitch materials.
AI accuracy — measured signals (per 04-technical-proof)
| Metric | Last-measured Signal | When |
|---|---|---|
| Field Mapping Accuracy | 95.1% top-1 (GPT-5.4 mini) / 96.7% (Claude Haiku 4.5) on SFDC→NetSuite, 100% both providers on SFDC→Business Central, 0 hallucinations — Phase-B fixture benchmark matrix (npm run benchmark:ai), disclosed as a fixture figure, not a production number | Jun 2026 |
| Confidence Calibration | 90%+ | Oct 2025 |
| Multi-provider Consensus Boost | +5-15% | Oct 2025 |
The field-mapping accuracy row is intentionally qualified — the canonical 04-technical-proof removed an earlier unqualified-percentage-range framing because the absolute upper bound was a single-source point estimate, not a reproducible measurement. Once a benchmark harness ships (tracked as a Tier-3 follow-up in the source-of-truth repository’s A-grade remediation plan), the absolute number returns to the table sourced from a CI-emitted artifact. Confidence calibration of 90%+ means the AI’s self-reported confidence scores match the actual correctness rate. The multi-provider consensus boost is the architectural payoff for running four providers — they vote on ambiguous mappings.
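The voting idea can be sketched as a simple majority with a confidence tie-break. This is an assumption made for illustration: the source says only that providers vote on ambiguous mappings and does not specify the rule, and `ProviderVote`, `ConsensusResult`, and `pickConsensus` are hypothetical names.

```typescript
// Hypothetical shapes for illustration; the actual voting rule is not documented in the source.
interface ProviderVote {
  providerId: string;
  targetField: string;
  confidence: number; // self-reported, 0..1
}

interface ConsensusResult {
  targetField: string;
  agreement: number; // share of providers that chose the winner
}

// Majority vote; ties are broken by summed self-reported confidence.
function pickConsensus(votes: ProviderVote[]): ConsensusResult | undefined {
  if (votes.length === 0) return undefined;
  const tally = new Map<string, { count: number; confidence: number }>();
  for (const vote of votes) {
    const entry = tally.get(vote.targetField) ?? { count: 0, confidence: 0 };
    entry.count += 1;
    entry.confidence += vote.confidence;
    tally.set(vote.targetField, entry);
  }
  const ranked = Array.from(tally.entries()).sort(
    ([, a], [, b]) => b.count - a.count || b.confidence - a.confidence,
  );
  const top = ranked[0];
  if (!top) return undefined;
  return { targetField: top[0], agreement: top[1].count / votes.length };
}

const consensusExample = pickConsensus([
  { providerId: "openai", targetField: "email", confidence: 0.9 },
  { providerId: "anthropic", targetField: "email", confidence: 0.8 },
  { providerId: "lmstudio", targetField: "custentity_email", confidence: 0.95 },
]);
console.log(consensusExample); // "email" wins with 2 of 3 votes
```

Note that the single high-confidence dissenting vote does not win here; whether that is the desired behavior depends on how well calibrated each provider's confidence is.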
NetSuite integration proof (per 04-technical-proof)
- Squire’s actual NetSuite sandbox: TSTDRV2698307 (first concrete Squire-specific infrastructure identifier in the corpus)
- Auth: OAuth 1.0 HMAC-SHA256
- Connector source: src/connectors/NetSuiteConnector.ts (500+ LOC)
- Verified CRUD: Customer records, Vendor records, Transaction records, Custom record types, Saved searches; full Create / Read / Update / Delete / Search on all five
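For readers unfamiliar with the auth scheme named above, the following sketch shows how an OAuth 1.0 HMAC-SHA256 signature is generally computed per RFC 5849, using Node's built-in `crypto` module. It is not taken from NetSuiteConnector.ts; `pct` and `signRequest` are illustrative names, and all credential values are placeholders supplied by the caller.

```typescript
import { createHmac } from "node:crypto";

// RFC 3986 percent-encoding as required by OAuth 1.0 (RFC 5849, section 3.6).
function pct(value: string): string {
  return encodeURIComponent(value).replace(
    /[!'()*]/g,
    (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase(),
  );
}

// Illustrative helper, not the connector's code.
function signRequest(
  method: string,
  baseUrl: string, // URL without query string
  params: Record<string, string>, // oauth_* parameters plus any query parameters
  consumerSecret: string,
  tokenSecret: string,
): string {
  // 1. Encode parameters, sort by key then value, join as key=value pairs.
  const normalized = Object.keys(params)
    .map((k) => [pct(k), pct(params[k] ?? "")] as [string, string])
    .sort((a, b) =>
      a[0] === b[0] ? (a[1] < b[1] ? -1 : a[1] > b[1] ? 1 : 0) : a[0] < b[0] ? -1 : 1,
    )
    .map(([k, v]) => `${k}=${v}`)
    .join("&");
  // 2. Signature base string: METHOD & encoded URL & encoded parameter string.
  const base = [method.toUpperCase(), pct(baseUrl), pct(normalized)].join("&");
  // 3. Signing key: encoded consumer secret & encoded token secret.
  const key = `${pct(consumerSecret)}&${pct(tokenSecret)}`;
  return createHmac("sha256", key).update(base).digest("base64");
}
```

The resulting value would be sent as `oauth_signature` in the Authorization header alongside the other `oauth_*` parameters.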
SOC 2 Trust Services Criteria mapped to production code (per compliance-dashboard)
All 5 TSC categories are implemented and each is backed by specific source files where applicable. See compliance-dashboard for the full mapping. Summary:
- CC6 Security: JWT auth, RBAC, timing-safe key validation, rate limiting, production guards
- A1 Availability: Health checks, circuit breakers, DR with RTO/RPO, Kubernetes auto-scaling (2-10 replicas)
- PI1 Processing Integrity: AI confidence scoring, hallucination detection, schema drift blocking (SCHEMA_DRIFT_BLOCKED result code), DB-persisted reasoning traces
- C1 Confidentiality: DLP/PII detection (10 patterns per the actual code; see DLP reconciliation below), masking utility, encrypted credential storage
- P1 Privacy: GDPR/CCPA compliance, audit trail logging, 90-day default data retention
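As one illustration of the Processing Integrity controls listed above, schema drift blocking can be thought of as comparing the fields a mapping was approved against with the fields observed at run time. The sketch below is an assumption about the general technique: only the SCHEMA_DRIFT_BLOCKED result code comes from the source, while the `"OK"` code, `DriftResult`, and `checkSchemaDrift` are hypothetical.

```typescript
// Hypothetical types; only the SCHEMA_DRIFT_BLOCKED code appears in the source.
type DriftResult =
  | { code: "OK" }
  | { code: "SCHEMA_DRIFT_BLOCKED"; missing: string[]; unexpected: string[] };

// Block when the observed field set differs from the approved one in either direction.
function checkSchemaDrift(approved: string[], observed: string[]): DriftResult {
  const approvedSet = new Set(approved);
  const observedSet = new Set(observed);
  const missing = approved.filter((f) => !observedSet.has(f));
  const unexpected = observed.filter((f) => !approvedSet.has(f));
  if (missing.length === 0 && unexpected.length === 0) return { code: "OK" };
  return { code: "SCHEMA_DRIFT_BLOCKED", missing, unexpected };
}

console.log(checkSchemaDrift(["id", "email"], ["id", "phone"]));
// SCHEMA_DRIFT_BLOCKED with missing ["email"] and unexpected ["phone"]
```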
DLP pattern count — reconciled from source code and dashboard HTML
The PII detection surface spans two subsystems and the compliance dashboard dynamically reports a combined count:
DLPService.ts (src/services/security/DLPService.ts, lines 53-65) — 10 regex patterns:
| # | Pattern name | What it matches |
|---|---|---|
| 1 | ssn | Social Security Numbers (3-2-4 format or 9 digits) |
| 2 | creditCard | 16-digit credit cards (4-4-4-4) |
| 3 | email | Email addresses |
| 4 | phoneUS | US phone numbers |
| 5 | phoneIntl | International phone numbers |
| 6 | medicalRecordNumber | MRN / Medical Record # patterns |
| 7 | accountNumber | Account # patterns (8-17 digits) |
| 8 | ipAddress | IPv4 addresses |
| 9 | apiKey | Generic API keys (32+ alphanumeric) |
| 10 | jwt | JWT tokens |
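To make the table concrete, here is a minimal sketch of regex-based PII detection and masking for three of the pattern names. The regular expressions are illustrative stand-ins written for this example; the actual expressions in DLPService.ts are not reproduced here, and `samplePatterns`, `detectPii`, and `maskPii` are hypothetical names.

```typescript
// Illustrative regexes only; not the expressions used in DLPService.ts.
const samplePatterns: Record<string, RegExp> = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
  ipAddress: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g,
};

// Return the names of all patterns that match the given text.
function detectPii(text: string): string[] {
  return Object.keys(samplePatterns).filter((key) => {
    const re = samplePatterns[key];
    if (!re) return false;
    re.lastIndex = 0; // global regexes keep state between test() calls
    return re.test(text);
  });
}

// Replace every match of every pattern with a fixed mask.
function maskPii(text: string): string {
  return Object.keys(samplePatterns).reduce((acc, key) => {
    const re = samplePatterns[key];
    return re ? acc.replace(re, "[REDACTED]") : acc;
  }, text);
}

console.log(detectPii("Contact jane@example.com from 10.0.0.1")); // ["email", "ipAddress"]
console.log(maskPii("SSN 123-45-6789")); // "SSN [REDACTED]"
```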
GovernanceService.ts (src/services/ai/orchestrator/GovernanceService.ts, lines 381-398) — 6 content-filter patterns (partial overlap with DLPService):
- ssn, email, phone, credit_card, ip_address, name (title-prefix name detection)
Compliance dashboard (public/compliance-dashboard.html, lines 375-378) — the JavaScript snapshot renders 14 patterns when the page loads in unauthenticated/demo mode:
SSN, credit card, email, phone, intl phone, medical record, IP address, API key, JWT, bank account, DOB, passport, driver’s license, name
When authenticated, the dashboard fetches live from /api/compliance/dlp-patterns and replaces the snapshot with the API’s real-time count. A [snapshot] badge appears to distinguish snapshot mode from live API data.
Reconciliation:
- 10 of the 14 snapshot items have confirmed regex implementations in DLPService.ts
- name is implemented in GovernanceService.ts (11th confirmed)
- DOB, passport, driver’s license: not found as regex patterns in either service file. These three may represent planned additions, patterns behind the /api/compliance/dlp-patterns endpoint at runtime, or the design-intent target that the snapshot was written to reflect. A CTO who needs to verify can check the API endpoint directly.
- The “8 patterns” figure that appeared in earlier wiki source summaries came from a stale NotebookLM web scrape of the compliance dashboard. NotebookLM’s extraction captured a pre-Alpine.js render state with different content than the actual page. The repo HTML source has the 14-pattern snapshot, not 8. Earlier wiki claims of “the dashboard is lying” have been corrected.
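The reconciliation amounts to a set difference between the dashboard snapshot and the patterns confirmed in code. A sketch under stated assumptions: the identifiers are normalized by hand for comparison, the snapshot's “phone”, “intl phone”, and “bank account” entries are assumed to correspond to phoneUS, phoneIntl, and accountNumber, and the `dob`, `passport`, and `driversLicense` keys are invented labels for the three unconfirmed items.

```typescript
// Snapshot entries from the compliance dashboard, normalized to code-style names (assumed mapping).
const snapshotPatterns = [
  "ssn", "creditCard", "email", "phoneUS", "phoneIntl", "medicalRecordNumber",
  "ipAddress", "apiKey", "jwt", "accountNumber", "dob", "passport", "driversLicense", "name",
];

// The 10 patterns listed for DLPService.ts.
const dlpServicePatterns = [
  "ssn", "creditCard", "email", "phoneUS", "phoneIntl",
  "medicalRecordNumber", "accountNumber", "ipAddress", "apiKey", "jwt",
];

// The one snapshot entry confirmed only in GovernanceService.ts.
const governanceOnlyPatterns = ["name"];

const confirmed = new Set<string>(dlpServicePatterns.concat(governanceOnlyPatterns));
const unconfirmed = snapshotPatterns.filter((p) => !confirmed.has(p));

console.log(confirmed.size); // 11
console.log(unconfirmed); // ["dob", "passport", "driversLicense"]
```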
Reconciliation with source summaries:
- 04-technical-proof says “9 patterns” — counts phones as one, omits GovernanceService patterns. Reasonable approximation of the 10 DLPService patterns.
- compliance-dashboard — the NotebookLM scrape originally showed “8 patterns” but this was a scrape artifact. The actual repo HTML says 14. Updated.
- oracle-comparison — also showed “8 patterns” from the same scrape vintage. Corrected to note the actual snapshot says 14.
Vintage comparison table
| Vintage | Suites | Tests passing | Skipped | Total | Sources |
|---|---|---|---|---|---|
| Slide | 379/379 | 9,038 | 23 | 9,061 | 01-executive-summary, 11-role-brief-cto |
| Talking-Points | 404/404 | 9,207 | 30 | 9,237 | read-talking-points, read-elevator-pitch |
| Current | 607/607 | 12,254 | 16 | 12,270 | 15-start-here-async-standalone |
Trajectory between vintages
- Slide → Talking-Points: +25 suites, +169 passing tests, +7 skipped tests (the skipped count rose from 23 to 30 across this vintage)
- Talking-Points → Current: +203 suites, +3,047 passing tests, -14 skipped (the skip count dropped from 30 to 16 between vintages; the PR #694/#695 skip-discipline cleanup pruned long-stale it.skip placeholders)
- Full arc: +228 suites and +3,216 passing tests between slide and current
Three observations: (1) the test base has grown consistently across three measured points in time, so this is a real codebase with active engineering, not a static pitch deck; (2) the skip count went down from 30 (Talking-Points) to 16 (Current), reflecting an active prune of stale it.skip placeholders rather than the earlier “frozen list” pattern; (3) coverage is reported only in the Current vintage, while earlier snapshots optimize for “100% pass rate” (an easier executive number) over coverage percent.
Mixed-vintage caveat: a Path B reviewer will see three different test counts depending on which page they land on. Start Here has Current; Leadership Talking Points has Talking-Points vintage; the CTO role brief has Slide vintage. Same package, three vintages. Worth flagging to the asset owner — see demo-site.
What this proves
- The engineering organization can ship. Many teams claim “production-ready” with sub-1k test counts; 9k+ is materially different.
- The system has been exercised in real conditions. NetSuite sandbox connectivity is not an in-memory simulation.
- The AI provider abstraction works. Multi-provider stack means no single-vendor dependency.
What this does NOT prove
- The 68.73% coverage figure means about 31% of statements (100% − 68.73% = 31.27%) are uncovered. Squire’s CTO will likely ask which subsystems are under-covered. Open question for next technical ingest.
- “100% pass rate” excludes the 23 skipped tests — what are they, why are they skipped? Not in the slide script. Open question.
Open questions
- Where is the coverage gap? (Which modules / subsystems are under-covered?)
- What are the 23 skipped tests, and is there a plan to enable them?
- What does the multi-provider AI stack actually consist of? (Probably OpenAI + Claude + OpenRouter + LMStudio per the README, but not confirmed from a formally-ingested source yet.)
- What does “NetSuite sandbox connectivity proof” mean concretely — read-only metadata? Two-way write tests? Auth round-trips?
- What is the 95%+ AI accuracy measuring? Now PARTIALLY answered by ai-governance-layer-video (01:32): “We reduced manual field mapping from 15 hours to 30 seconds with 95% accuracy.” So the 95% is specifically about field-mapping accuracy, not AI accuracy generally. Still single-task / single-source for methodology; needs 04-TECHNICAL-PROOF.md or AI Provider System Documentation for evaluation-harness detail.
- Field mapping efficiency claim: 15 hours → 30 seconds is a dramatic efficiency claim. At face value that’s a ~1,800× speed-up. The 15-hour baseline matches read-elevator-pitch’s “three years ago our problem was manual mapping, with about 15 hours of labor per integration.” The 30-second target is from ai-governance-layer-video. The ratio is what makes the “per-integration” economics work for the HintonBurdick-driven client-base doubling.
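The ~1,800× figure follows directly from converting both durations to seconds. A short arithmetic sketch, with variable names chosen for illustration:

```typescript
// 15 hours of manual mapping, expressed in seconds.
const baselineSeconds = 15 * 60 * 60; // 54,000
// Claimed AI-assisted mapping time.
const assistedSeconds = 30;

const speedup = baselineSeconds / assistedSeconds;
console.log(`${speedup}x speed-up`); // 1800x speed-up
```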
Sources
- 01-executive-summary — claims 2, 6, 7 (test counts slide-vintage, multi-provider stack, NetSuite connectivity)
- 11-role-brief-cto — second-source confirmation of 9038/9061 slide-vintage; also claims (slide-vintage, single-source) “95%+ AI accuracy” — the canonical 04-technical-proof now describes field-mapping accuracy as qualified (“measurably improved across Phases 1-5; absolute numbers depend on schema complexity and are tracked as a Tier-3 follow-up — benchmark harness”), so treat this 95%+ figure as a single-source slide-vintage data point until the Tier-3 harness ships
- 15-start-here-async-standalone — claims 12-15 (CURRENT-vintage 12,254/12,270, 100% 607/607 suites, 68.73% coverage, 2,282 files / ~854K LOC)
- read-talking-points — claim 4 (TALKING-POINTS-vintage 9,207/9,237, 392/392 suites, 30 skipped) and claim 5 (six production connectors)
- read-elevator-pitch — claim 7 (second-source confirmation of Talking-Points vintage test counts)
- ai-governance-layer-video — claims 3, 12 (video claims 95% mapping accuracy — single-source point estimate that the canonical 04-technical-proof now qualifies as “measurably improved… tracked as Tier-3 follow-up”; also 9,000+ tests; 15-hours-to-30-seconds efficiency quantification)
- 04-technical-proof — all claims re: canonical test breakdown, 4 AI providers with model names, AI accuracy (qualified field-mapping accuracy per Tier-3 benchmark roadmap, 90%+ confidence calibration, +5-15% consensus boost), NetSuite sandbox TSTDRV2698307, full CRUD verified, line coverage 68.98%
- 26-canonical-metrics-and-wording — canonical test sequence, 4 coverage metrics (statements / branches / functions / lines), 53,903 lines of production TypeScript, AI provider per-mapping costs
- compliance-dashboard — 5 SOC 2 Trust Services Criteria mapped to production code with source file paths