A standardized benchmark

AGI remains untested in VC.

VCBench evaluates how well AI models and humans predict founder success in venture capital. The leaderboard compares 22 systems head to head and is updated as new submissions arrive.

Models evaluated: 22, across 9 organizations

Best precision: 87.5% (Reasoned-Rule-Mining · Vela + Oxford)

Best F₀.₅: 36.6% (Verifiable-RL · Vela + Oxford)

About VCBench

VCBench is the first standardized benchmark for founder-success prediction in venture capital. Given a structured profile of a founder and their company, a model emits a probability that the company will reach a defined success milestone within a fixed horizon.
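As a minimal sketch of how such probabilities could be turned into the precision and recall that the leaderboard reports, the snippet below binarizes predictions at a cutoff and counts hits and misses. The 0.5 threshold, the function name `precision_recall`, and the toy data are assumptions made for illustration; they are not taken from VCBench's scoring code.

```python
from typing import Sequence, Tuple


def precision_recall(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float = 0.5,  # assumed cutoff, not specified by VCBench
) -> Tuple[float, float]:
    """Binarize predicted probabilities and score them against 0/1 success labels."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")

    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)

    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: five founders, three of them truly successful.
probs = [0.9, 0.7, 0.2, 0.6, 0.1]
labels = [1, 0, 0, 1, 1]
precision, recall = precision_recall(probs, labels)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case three founders are flagged, two of them correctly, and one successful founder is missed, so precision and recall both come out at two thirds.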

Overview

Benchmarks such as SWE-bench (software engineering) and ARC-AGI (general reasoning) have shown how shared datasets accelerate progress toward AGI. Venture capital is a particularly compelling testbed: it is a domain of uncertain and incomplete information where even expert humans perform poorly.

Human Benchmarks

At the inception stage, the market index baseline is only 1.9% precision. Y Combinator achieves 3.2% precision (about 1.7× the index), while tier-1 venture firms average 5.6% precision (about 2.9×). Recall for both hovers around 6%. To compare these human benchmarks with VCBench, we normalize using the dataset's 9% random success baseline. Under this normalization, tier-1 VC precision rises to about 23% (F₀.₅ ≈ 10.7%), while YC precision rises to about 14% (F₀.₅ ≈ 8.6%). These values remain below the strongest model baselines, which underscores both the difficulty of the task and the relevance of VC as a proxy for testing human-level and beyond-human-level intelligence.

Dataset

Our initial release contains 9,000 anonymized founder profiles, of which 810 (9%) are labeled successful. Success is defined as a founder leading a company that either exits or IPOs with a valuation above $500M, or raises more than $500M in funding. Profiles were collected from LinkedIn and Crunchbase, then passed through a multi-stage pipeline for data standardization, filtering, enrichment, and anonymization. This process reduced identifiable founders by over 90% in adversarial testing, while preserving predictive features such as education quality (via QS university rankings), job histories, and industry clustering.
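The success definition above can be sketched as a simple labeling rule. This is an illustration under stated assumptions: the `Outcome` record and its field names are hypothetical and do not reflect the actual VCBench schema; only the $500M thresholds and the 810 / 9,000 counts come from the text.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_USD = 500_000_000  # $500M, as stated in the success definition


@dataclass
class Outcome:
    # Hypothetical fields, named here for illustration only.
    exit_or_ipo_valuation_usd: Optional[float] = None  # None if no exit or IPO
    total_funding_usd: float = 0.0


def is_successful(outcome: Outcome) -> bool:
    """Label a founder's company as successful under the stated definition."""
    big_exit = (
        outcome.exit_or_ipo_valuation_usd is not None
        and outcome.exit_or_ipo_valuation_usd > THRESHOLD_USD
    )
    big_raise = outcome.total_funding_usd > THRESHOLD_USD
    return big_exit or big_raise


# Base rate implied by the release: 810 successful profiles out of 9,000.
base_rate = 810 / 9_000
print(f"base rate = {base_rate:.0%}")
```

The base rate matters for reading the leaderboard: a classifier that flags founders at random would be right about 9% of the time, so precision well above that level indicates real signal.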

Next steps

VCBench is a living, community-driven benchmark that grows with feedback, new features, and fresh evaluation modes, offering a solid foundation for reproducible research and more realistic tests of decision-making under uncertainty. If you want to participate, report errors, or join the benchmark committee, please reach out to benchmark@vela.partners.

This project was initiated by the University of Oxford and Vela Research, the research arm of Vela Partners.

Precision vs F₀.₅

Higher precision means fewer false bets. Higher F₀.₅ means better overall performance, with precision weighted more heavily than recall.
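For reference, F₀.₅ is the β = 0.5 case of the standard F-beta score, F_β = (1 + β²) · P · R / (β² · P + R). The sketch below assumes the leaderboard uses this standard definition; the function name `f_beta` is illustrative. The two checks use precision and recall values from the GPT-4o and Gemini-2.5-Pro leaderboard rows.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# The formula is scale-invariant, so percentages can be passed in directly.
gpt_4o = f_beta(30.0, 16.3)          # precision 30.0, recall 16.3
gemini_25_pro = f_beta(17.1, 58.0)   # precision 17.1, recall 58.0

print(round(gpt_4o, 1))         # 25.7
print(round(gemini_25_pro, 1))  # 19.9
```

The comparison shows the precision weighting at work: Gemini-2.5-Pro has far higher recall than GPT-4o, yet its lower precision leaves it with the lower F₀.₅.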


Model rankings

Sorted by F₀.₅. Precision, recall, and F₀.₅ are percentages measured on the private test set.

| Model | Organization | Precision | Recall | F₀.₅ |
| --- | --- | --- | --- | --- |
| Verifiable-RL | Vela + Oxford | 42.6 | 23.6 | 36.6 |
| Policy-Induction | Vela + Oxford | 41.0 | 20.2 | 34.0 |
| GemVC-v0 | Independent · Madhusudhana Naidu | 39.4 | 20.3 | 32.9 |
| Structured-Rule-Stump | Independent · Yagiz Ihlamur | 32.8 | 18.0 | 28.1 |
| Random-Rule-Forest | Vela + Oxford | 42.5 | 12.1 | 28.1 |
| verifiable-reasoning | Vela + Oxford | 30.6 | 21.0 | 27.7 |
| large-founder-model-v0 | Vela + Oxford | 31.7 | 17.5 | 27.2 |
| GPT-4o | OpenAI | 30.0 | 16.3 | 25.7 |
| FinGPT-VC2 | Columbia | 24.4 | 27.2 | 24.9 |
| GPT-4o-mini | OpenAI | 31.5 | 11.1 | 23.0 |
| FinGPT-VC1 | Columbia | 21.8 | 24.2 | 22.2 |
| | OpenAI | 43.2 | 7.4 | 21.5 |
| Reasoned-Rule-Mining | Vela + Oxford | 87.5 | 5.0 | 21.0 |
| Gemini-2.5-Pro | Google | 17.1 | 58.0 | 19.9 |
| DeepSeek-Reasoner | DeepSeek | 31.8 | 6.9 | 18.4 |
| Claude-3.5-Haiku-Latest | Anthropic | 15.8 | 46.4 | 18.2 |
| GPT-5 | OpenAI | 59.1 | 4.2 | 16.2 |
| Gemini-2.5-Flash | Google | 12.5 | 68.4 | 14.9 |
| DeepSeek-Chat | DeepSeek | 80.6 | 3.0 | 12.1 |
| Tier-1 VCs | Humans | 23.0 | 5.2 | 10.7 |
| Random Classifier | Baseline | | | 9.0 |
| Y Combinator | Humans | 14.0 | 6.9 | 8.6 |