A standardized benchmark

AGI remains untested in VC.

VCBench evaluates how well AI models and humans predict founder success in venture capital. The leaderboard compares 22 systems head to head and is updated as new submissions arrive.

Models evaluated: 22, across 9 organizations
Best precision: 87.5% (Reasoned-Rule-Mining · Vela + Oxford)
Best F₀.₅: 36.6% (Verifiable-RL · Vela + Oxford)

About VCBench

VCBench is the first standardized benchmark for founder-success prediction in venture capital. Given a structured profile of a founder and their company, a model emits a probability that the company will reach a defined success milestone within a fixed horizon.
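As a rough illustration of the task interface described above, the sketch below scores a batch of predicted probabilities against binary success labels. The 0.5 decision threshold, the function name, and the toy data are assumptions made for illustration; the source does not say how probabilities are turned into decisions.

```python
from typing import Sequence, Tuple


def precision_recall(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off, not specified by VCBench
) -> Tuple[float, float]:
    """Turn probabilities into bets and compare them with 0/1 success labels."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")

    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)

    # Guard against empty denominators (no bets placed, or no successes).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example with made-up numbers (not VCBench data).
toy_probs = [0.9, 0.7, 0.2, 0.6, 0.1]
toy_labels = [1, 0, 0, 1, 1]
toy_precision, toy_recall = precision_recall(toy_probs, toy_labels)
```

In this toy batch, three founders are selected and two of them succeed, so precision and recall both come out at two thirds.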

Overview

Benchmarks such as SWE-bench (software engineering) and ARC-AGI (general reasoning) have shown how shared datasets accelerate progress toward AGI. Venture capital is a particularly compelling testbed. It is a domain of uncertain and incomplete information where even expert humans perform poorly.

Human Benchmarks

At the inception stage, the market index baseline is only 1.9% precision. Y Combinator achieves 3.2% precision (about 1.7× over the index), while tier-1 venture firms average 5.6% precision (about 2.9×). Recall for both hovers around 6%. To compare these human benchmarks with VCBench, we normalize using the dataset's 9% random success baseline. Under this normalization, tier-1 VC precision rises to about 23% (F₀.₅ ≈ 10.7%), while YC precision rises to about 14% (F₀.₅ ≈ 8.6%). These values remain below the strongest model baselines, which underscores both the difficulty of the task and the relevance of VC as a proxy for testing human-level and beyond-human-level intelligence.
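The arithmetic behind these figures can be sketched as follows. The lift over the index is a plain ratio of precisions. The source does not state the normalization formula, so the odds-ratio rescaling below is an assumption: it happens to reproduce the quoted 23% and 14% to rounding, but it may not be the method actually used. All function and constant names are illustrative.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def lift(precision: float, baseline: float) -> float:
    """Multiplicative improvement of a precision over a baseline rate."""
    return precision / baseline


def normalize_precision(precision: float, source_base: float, target_base: float) -> float:
    """Assumed normalization: keep the odds ratio against the source base
    rate fixed and re-apply it to the target base rate."""
    odds_ratio = odds(precision) / odds(source_base)
    target_odds = odds_ratio * odds(target_base)
    return target_odds / (1.0 + target_odds)


INDEX_BASE = 0.019    # market index precision at inception stage
DATASET_BASE = 0.09   # VCBench random success baseline

yc_lift = lift(0.032, INDEX_BASE)       # about 1.7
tier1_lift = lift(0.056, INDEX_BASE)    # about 2.9

yc_norm = normalize_precision(0.032, INDEX_BASE, DATASET_BASE)      # about 0.14
tier1_norm = normalize_precision(0.056, INDEX_BASE, DATASET_BASE)   # about 0.23
```

A simple lift-based rescaling (lift × 9%) would give roughly 15% and 27% instead, which is why the odds-based form is used in this sketch.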

Dataset

Our initial release contains 9,000 anonymized founder profiles, of which 810 (9%) are labeled successful. Success is defined as a founder leading a company that either exits or IPOs at a valuation above $500M, or raises more than $500M in funding. Profiles were collected from LinkedIn and Crunchbase, then passed through a multi-stage pipeline for data standardization, filtering, enrichment, and anonymization. This process reduced the number of identifiable founders by over 90% in adversarial testing, while preserving predictive features such as education quality (via QS university rankings), job histories, and industry clustering.
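The success definition above can be written as a small labeling rule. This is a minimal sketch: the `Outcome` record and its field names are hypothetical and do not reflect the actual VCBench schema, and the fixed prediction horizon is left out because the source does not give it here.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_USD = 500_000_000  # $500M, from the success definition


@dataclass
class Outcome:
    """Hypothetical outcome record for one founder's company."""
    exit_or_ipo_valuation_usd: Optional[float] = None  # None if no exit or IPO
    total_funding_usd: float = 0.0


def is_success(outcome: Outcome) -> bool:
    """Label a founder successful if either milestone is reached."""
    big_exit = (
        outcome.exit_or_ipo_valuation_usd is not None
        and outcome.exit_or_ipo_valuation_usd > THRESHOLD_USD
    )
    big_raise = outcome.total_funding_usd > THRESHOLD_USD
    return big_exit or big_raise


# Base rate implied by the release: 810 successes out of 9,000 profiles.
base_rate = 810 / 9000
```

For example, `is_success(Outcome(total_funding_usd=600e6))` is true, while a company with a $300M exit and $100M raised is labeled unsuccessful.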

Next steps

VCBench is a living, community-driven benchmark that grows with feedback, new features, and fresh evaluation modes, offering a solid foundation for reproducible research and more realistic tests of decision-making under uncertainty. If you want to participate, report errors, or join the benchmark committee, please reach out to benchmark@vela.partners.

This project was initiated by the University of Oxford and Vela Research, the research arm of Vela Partners.

Precision vs F₀.₅
Higher precision means fewer false bets. Higher F₀.₅ means better overall, weighted toward precision.
[Scatter plot. Axes: Precision (%) and F₀.₅ (%). Points are grouped by organization: Vela + Oxford, Independent, OpenAI, Anthropic, Google, DeepSeek, Columbia, Humans, Baseline.]

Model rankings
Sorted by F₀.₅. Precision, recall, and F₀.₅ on the private test set.
| # | Model | Organization | Precision (%) | Recall (%) | F₀.₅ (%) |
|---|-------|--------------|---------------|------------|----------|
| 1 | Verifiable-RL | Vela + Oxford | 42.6 | 23.6 | 36.6 |
| 2 | Policy-Induction | Vela + Oxford | 41.0 | 20.2 | 34.0 |
| 3 | GemVC-v0 | Independent · Madhusudhana Naidu | 39.4 | 20.3 | 32.9 |
| 4 | Structured-Rule-Stump | Independent · Yagiz Ihlamur | 32.8 | 18.0 | 28.1 |
| 5 | Random-Rule-Forest | Vela + Oxford | 42.5 | 12.1 | 28.1 |
| 6 | verifiable-reasoning | Vela + Oxford | 30.6 | 21.0 | 27.7 |
| 7 | large-founder-model-v0 | Vela + Oxford | 31.7 | 17.5 | 27.2 |
| 8 | GPT-4o | OpenAI | 30.0 | 16.3 | 25.7 |
| 9 | FinGPT-VC2 | Columbia | 24.4 | 27.2 | 24.9 |
| 10 | GPT-4o-mini | OpenAI | 31.5 | 11.1 | 23.0 |
| 11 | FinGPT-VC1 | Columbia | 21.8 | 24.2 | 22.2 |
| 12 | o3 | OpenAI | 43.2 | 7.4 | 21.5 |
| 13 | Reasoned-Rule-Mining | Vela + Oxford | 87.5 | 5.0 | 21.0 |
| 14 | Gemini-2.5-Pro | Google | 17.1 | 58.0 | 19.9 |
| 15 | DeepSeek-Reasoner | DeepSeek | 31.8 | 6.9 | 18.4 |
| 16 | Claude-3.5-Haiku-Latest | Anthropic | 15.8 | 46.4 | 18.2 |
| 17 | GPT-5 | OpenAI | 59.1 | 4.2 | 16.2 |
| 18 | Gemini-2.5-Flash | Google | 12.5 | 68.4 | 14.9 |
| 19 | DeepSeek-Chat | DeepSeek | 80.6 | 3.0 | 12.1 |
| 20 | Tier-1 VCs | Humans | 23.0 | 5.2 | 10.7 |
| 21 | Random Classifier | Baseline | 9.0 | 9.0 | 9.0 |
| 22 | Y Combinator | Humans | 14.0 | 6.9 | 8.6 |
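For readers unfamiliar with the ranking metric, the sketch below computes the standard F-beta score with beta = 0.5, which weights precision more heavily than recall. The function name is illustrative, and recomputing from the rounded precision and recall values in the table may differ slightly from the published F₀.₅ column.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision above recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Spot checks against two leaderboard rows (inputs as fractions).
policy_induction = f_beta(0.410, 0.202)    # about 0.340
random_classifier = f_beta(0.090, 0.090)   # about 0.090
```

Because precision dominates the score, a model such as Reasoned-Rule-Mining can post the highest precision on the board and still rank mid-table when its recall is very low.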