AGI REMAINS UNTESTED IN VC.
VCBENCH SETS THE STANDARD.
MODELS EVALUATED
13
HIGHEST PRECISION
GPT-5 58.0%
HIGHEST F₀.₅ SCORE
Vela-PolicyInduction-Gemini-2.5-Flash 34.0%
This scatter plot shows the relationship between Precision and F₀.₅ scores for various models on the VCBench founder-success prediction task. Human benchmarks (Y Combinator at ~14% precision and Tier-1 VCs at ~23% precision) provide context for model performance. Models above these benchmarks demonstrate beyond-human capabilities in founder-success prediction.
Rank | Model | Organization | Precision (%) | Recall (%) | F₀.₅ (%) | Cost In ($/1M tokens) | Cost Out ($/1M tokens) | Latency (s) | Score |
---|---|---|---|---|---|---|---|---|---|
1 | Vela-PolicyInduction-Gemini-2.5-Flash | Vela | 41.0 | 20.2 | 34.0 | $0.30 | $2.50 | 8.5 | 90.2 |
2 | o3 | OpenAI | 52.3 | 8.3 | 25.2 | $2.00 | $8.00 | 6.9 | 89.5 |
3 | DeepSeek-V3 | DeepSeek | 34.2 | 10.9 | 23.8 | $0.27 | $1.10 | 10.1 | 87.2 |
4 | DeepSeek-R1 | DeepSeek | 27.4 | 15.2 | 23.5 | $0.55 | $2.19 | 37.8 | 86.8 |
5 | GPT-4o | OpenAI | 35.7 | 9.8 | 23.1 | $2.50 | $10.00 | 3.6 | 85.5 |
6 | GPT-4o-mini | OpenAI | 29.4 | 11.2 | 22.1 | $0.15 | $0.60 | 3.0 | 84.2 |
7 | Gemini-2.5-Pro | Google | 17.4 | 62.2 | 20.3 | $1.25 | $10.00 | 10.7 | 82.3 |
8 | Claude-3.5-Haiku | Anthropic | 15.9 | 55.2 | 18.5 | $0.80 | $4.00 | 3.4 | 79.7 |
9 | GPT-5 | OpenAI | 58.0 | 4.2 | 16.2 | $1.25 | $10.00 | 1.5 | 78.4 |
10 | Gemini-2.5-Flash | Google | 13.4 | 72.8 | 16.0 | $0.30 | $2.50 | 8.3 | 77.8 |
11 | Humans (Tier-1 VCs) | Tier1 | 23.0 | 5.2 | 10.7 | $0.00 | $0.00 | 0.0 | 70.0 |
12 | Humans (Y Combinator) | YC | 14.0 | 6.9 | 8.6 | $0.00 | $0.00 | 0.0 | 65.0 |
13 | Random Classifier | Baseline | 9.0 | 9.0 | 9.0 | $0.00 | $0.00 | 0.0 | 50.0 |
VCBench introduces the first standardized benchmark for founder-success prediction in venture capital. Benchmarks such as SWE-bench (software engineering) and ARC-AGI (general reasoning) have shown how shared datasets accelerate progress toward AGI. Venture capital is a particularly compelling testbed: it is a domain of uncertain and incomplete information where even expert humans perform poorly.
At the inception stage, the market index baseline is only 1.9% precision. Y Combinator achieves 3.2% precision (about 1.7× over the index), while tier-1 venture firms average 5.6% precision (about 2.9×). Recalls for both typically hover around 6%. To compare these human benchmarks fairly against VCBench, where the positive prevalence is 9%, we adjust for prevalence differences. Recall and false positive rates are invariant to prevalence, so precision is recalculated at π=9%. Under this normalization, tier-1 VC precision rises to about 23% at 5.2% recall (F₀.₅ ≈ 10.7), while YC precision normalizes to about 14% at 6.9% recall (F₀.₅ ≈ 8.6). These values remain below the strongest model baselines, underscoring both the difficulty of the task and the relevance of VC as a proxy for testing human-level and beyond-human-level intelligence.
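A minimal sketch of this prevalence adjustment, under the stated assumption that recall (TPR) and the false positive rate are unchanged and only the base rate differs between settings:

```python
def precision_at_prevalence(precision_src, recall, prev_src, prev_target):
    """Re-express a precision figure at a different positive prevalence.

    Assumes recall (TPR) and false positive rate (FPR) are invariant to
    prevalence, so only the base rate changes between settings.
    """
    # Recover FPR from the source operating point:
    #   precision = prev * TPR / (prev * TPR + (1 - prev) * FPR)
    fpr = prev_src * recall * (1 - precision_src) / (precision_src * (1 - prev_src))
    # Re-apply the same TPR / FPR at the target prevalence.
    tp = prev_target * recall
    fp = (1 - prev_target) * fpr
    return tp / (tp + fp)

# Tier-1 VCs: 5.6% precision, ~5.2% recall at the ~1.9% inception-stage base rate
print(precision_at_prevalence(0.056, 0.052, 0.019, 0.09))  # ~0.23
# Y Combinator: 3.2% precision, ~6.9% recall
print(precision_at_prevalence(0.032, 0.069, 0.019, 0.09))  # ~0.14
```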
Our initial release, vcbench-founder-prediction-v1, contains 9,000 anonymized founder profiles, of which 810 (9%) are labeled successful. Success is defined rigorously: a founder’s company either exited or IPO’d above $500M, or raised more than $500M in funding. Profiles were collected from LinkedIn and Crunchbase, then passed through a multi-stage pipeline for data standardization, filtering, enrichment, and anonymization. This process reduced identifiable founders by over 80% in adversarial testing, while preserving predictive features such as education quality (via QS university rankings), job histories, and industry clustering.
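For concreteness, the success criterion could be expressed as a labeling rule along these lines (field names are hypothetical and chosen for illustration; the released dataset's schema may differ):

```python
SUCCESS_THRESHOLD_USD = 500_000_000  # $500M threshold from the v1 definition

def label_success(founder: dict) -> bool:
    """Apply the v1 success criterion: an exit or IPO above $500M,
    or more than $500M raised in funding.
    Field names here are illustrative, not the dataset's actual schema."""
    exit_value = founder.get("exit_or_ipo_value_usd") or 0
    total_raised = founder.get("total_funding_usd") or 0
    return exit_value > SUCCESS_THRESHOLD_USD or total_raised > SUCCESS_THRESHOLD_USD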
The benchmark enables consistent comparisons between prediction systems. In our baseline study, nine state-of-the-art models were evaluated across six folds of the dataset. Precision was emphasized through the F₀.₅ metric, reflecting the higher cost of false positives in VC decision-making. The strongest performer achieved 5.9× the dataset baseline precision, already surpassing normalized human performance.
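For reference, F₀.₅ combines precision and recall with precision weighted more heavily (β = 0.5); a minimal implementation, checked against the top leaderboard entry:

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Top leaderboard entry: precision 41.0%, recall 20.2%
print(round(f_beta(0.410, 0.202), 3))  # ≈ 0.340
```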
By framing venture capital as a benchmarked environment, VCBench provides both an applied challenge and a representation of AGI-level testing: success depends on reasoning under uncertainty, extracting weak signals from sparse histories, and outperforming human experts.
Initiated by Vela Research, the research arm of Vela Partners.
VCBench evaluates AI models on venture capital prediction tasks.
Updated in real time as new submissions are processed.