Now accepting pilot partners

See how they actually work.
Not how they interview.

AI-native work trials that predict on-the-job performance — with evidence. Stop guessing. Start knowing.

Trusted by founders from YC, SPC, and more

Watch the demo

The Problem

Hiring is broken.
Everyone knows it.

You've felt it. That sinking feeling three months in when your "great interview" turns into a performance problem. The process is designed to produce false positives.

Resumes lie

Credentials and keywords don't predict performance. You're hiring based on marketing, not capability.

1-hour interviews are theater

Candidates perform. They've practiced the questions. You learn who interviews well, not who ships.

LeetCode is cargo cult

Inverting binary trees has nothing to do with building products. You're testing memorization, not problem-solving.

AI changes everything

Your best hire in 2025 uses AI to 10x their output. Traditional interviews penalize this. You're selecting for the past.

"The best predictor of future performance is past performance — in similar conditions."

So why are we still using artificial conditions to predict real work?

How It Works

Four steps to certainty

A week-long work trial that shows you exactly who someone is — not who they pretend to be in an interview.

01

Instrumented Sandbox

Real environment. Real tools.

The candidate gets SSH access to a fully equipped VM with their choice of AI tools (Claude Code, Cursor, whatever they use), just like their actual job.

Connected to sandbox-7f3a2b
$ claude --version
Claude Code v1.0.14
$ git clone project-repo && cd project-repo
$ npm run dev

02

Ambiguous Assignment

Designed to reveal agency.

A real-world project with incomplete requirements. We test how they handle ambiguity, make decisions, and turn chaos into shipped code.

Assignment Brief

Build a support inbox triage system. Ingest tickets, cluster by topic, propose auto-replies, admin UI for review.

"Requirements intentionally incomplete. Ask questions. Make decisions. Ship something real."

ambiguity · agency · tradeoffs

03

Observer Agent

Intelligent probing. Not surveillance.

Our AI observes commits, commands, and tool usage. At key moments, it asks surgical questions to understand their thinking.

P: I noticed you chose SQLite over Postgres. What's driving that decision?
C: Faster to prototype. I'll note Postgres migration as tech debt if this scales.
P: What would you cut first if time halves?

04

Evidence-Based Report

Signal, not vibes.

A comprehensive hiring report with rubric scores, evidence clips, timeline analysis, and a clear recommendation you can act on.

Candidate Report: HIRE

Agency: 4.5
AI Leverage: 4.8
Problem Framing: 4.2
Speed to Value: 4.6

"Strong evidence of independent thinking and effective AI usage..."

The Report

Decisions backed by evidence

Not a score. Not a vibe. A comprehensive report that makes the hiring decision obvious — with receipts.

Candidate Evaluation Report

Full-Stack Engineer • Jan 2025

STRONG HIRE

Active Time: 14.2 hours
Commits: 47
Decisions Logged: 12

Rubric Scores

Problem Framing: 4.5
Plan Quality: 4.4
AI Leverage: 4.6
Engineering Fundamentals: 4.7

Key Evidence

14:23 — "Chose to ship auth-less MVP first, documented security as Day 2 priority. Explicitly traded speed for completeness."

16:45 — When tests failed, diagnosed root cause in 8 mins using Claude Code. Fixed without introducing regressions.

What we measure

Eight dimensions that actually predict on-the-job success. Each scored 1–5 with specific evidence.

Problem Framing

Did they define success clearly before building?

Plan Quality

Milestones, sequencing, realistic scope

AI Leverage

Uses tools well, avoids hallucination traps

Engineering Fundamentals

Testing, correctness, architecture decisions

Speed to Value

Ships a working v0 quickly

Iteration & Debugging

How they react when stuck

Communication

Clarity, tradeoffs, stakeholder handling

Ownership & Agency

Proactive decisions, escalates with options
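
For readers who want to picture the rubric as data, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the product's actual schema: the names `DimensionScore` and `summarize` are hypothetical, the evidence strings are placeholders, and averaging the scores is an assumed way to summarize them. Only the eight dimension names, the 1–5 scale, and the sample scores come from this page.

```python
from dataclasses import dataclass, field
from statistics import mean

# The eight dimensions listed above.
DIMENSIONS = (
    "Problem Framing",
    "Plan Quality",
    "AI Leverage",
    "Engineering Fundamentals",
    "Speed to Value",
    "Iteration & Debugging",
    "Communication",
    "Ownership & Agency",
)


@dataclass
class DimensionScore:
    """One rubric entry (hypothetical structure): a 1-5 score plus evidence."""
    dimension: str
    score: float
    evidence: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")
        if not self.evidence:
            raise ValueError("every score needs at least one piece of evidence")


def summarize(scores: list[DimensionScore]) -> float:
    """Mean of the scores, rounded to 2 places (an assumed aggregation)."""
    return round(mean(s.score for s in scores), 2)


# Sample scores from the report mockup; evidence strings are placeholders.
sample = [
    DimensionScore("Problem Framing", 4.5, ["placeholder evidence"]),
    DimensionScore("Plan Quality", 4.4, ["placeholder evidence"]),
    DimensionScore("AI Leverage", 4.6, ["placeholder evidence"]),
    DimensionScore("Engineering Fundamentals", 4.7, ["placeholder evidence"]),
]
print(summarize(sample))
```

Rejecting a score that has no evidence attached mirrors the claim that each dimension is scored "with specific evidence"; how the real report weighs dimensions into a recommendation is not described here, so the sketch stops at a plain average.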

Why It Works

Built for the real world

Not another coding test. A fundamental rethinking of how to predict who will ship in your environment.

Tests ambiguity tolerance

Real work is messy. We give incomplete specs and shifting requirements. You see who thrives in chaos and who freezes.

Measures actual agency

Do they wait for permission or make decisions? Do they escalate blockers with options? Agency is observable, not self-reported.

Observes AI usage (doesn't ban it)

Your best hire uses AI as a force multiplier. We measure how well they leverage tools — not whether they can work without them.

Week-long signal vs 1-hour snapshot

See how they work over time. Energy management, iteration cycles, how they handle getting stuck. Not a performance, a pattern.

Defensible hiring decisions

Every recommendation backed by timestamped evidence. No more 'gut feeling' regret. Show your board exactly why you hired them.

Works with AI-native candidates

Traditional interviews penalize AI users. We're designed for 2025: the best engineers use every tool available. So should the evaluation.

Traditional hiring vs. Polymath

Traditional Hiring
  • Resume screening (marketing document)
  • LeetCode (tests memorization)
  • Behavioral interviews (rehearsed answers)
  • "Culture fit" (vibes-based)
  • Decision made on 3 hours of performance

Polymath Society
  • Real work trial (actual output)
  • Ambiguous problems (tests thinking)
  • Observed behavior (not self-reported)
  • Evidence-backed rubric (defensible)
  • Week-long signal (patterns, not moments)

Who It's For

For teams who can't afford to guess

Bad hires cost 6–12 months. The wrong engineer can kill a startup. If that risk keeps you up at night, we built this for you.

Startup Founders

Seed to Series B

You can't afford a bad hire. Every engineer either accelerates or drags. Get certainty before committing $200K+/year.

"We've been burned twice by 'great interviews.' Never again."

Hiring Managers

Engineering leads

Your gut says hire, but can you defend it? Get a report that makes the decision obvious — and backs you up later.

"Finally, something that shows me how they actually work."

Technical Recruiters

In-house teams

Resumes all look the same. LeetCode scores don't predict success. Give your hiring managers signal they can act on.

"The report sells itself. Hiring managers trust it."

Currently onboarding pilot partners

We're working with a small group of startups to refine the evaluation process. If you're hiring engineers in the next 3 months, let's talk.

Become a Pilot Partner

Get Early Access

Join the waitlist

Be among the first to transform how you evaluate candidates. We're accepting a limited number of pilot partners.

  • No spam, ever
  • Reply within 48 hours
  • Free pilot for early partners