VCBENCH

QUICK STATS

Models evaluated: 13
Highest precision: GPT-5 (58.0%)
Highest F₀.₅ score: Vela-PolicyInduction-Gemini-2.5-Flash (34.0%)

RESOURCES

Paper

F₀.₅ vs Precision for Vanilla LLMs on VCBench

Understanding the Graph

This scatter plot shows the relationship between Precision and F₀.₅ scores for various models on the VCBench founder-success prediction task. Human benchmarks (Y Combinator at ~14% precision and Tier-1 VCs at ~23% precision) provide context for model performance. Models above these benchmarks demonstrate beyond-human capabilities in founder-success prediction.
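
For reference, F_β = (1 + β²)·P·R / (β²·P + R); with β = 0.5 the score weights precision more heavily than recall. A minimal Python sketch, using the top-ranked model's values from the table below:

    def f_beta(precision, recall, beta=0.5):
        """F-beta score; beta < 1 weights precision above recall."""
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # Top-ranked model below: 41.0% precision, 20.2% recall.
    print(f_beta(41.0, 20.2))  # ~34.0, matching its reported F0.5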

Model Rankings
Rank  Model                                  Organization  Precision (%)  Recall (%)  F₀.₅ (%)  Cost In ($)  Cost Out ($)  Latency (s)  Score
1     Vela-PolicyInduction-Gemini-2.5-Flash  Vela                   41.0        20.2      34.0         0.30          2.50          8.5   90.2
2     o3                                     OpenAI                 52.3         8.3      25.2         2.00          8.00          6.9   89.5
3     DeepSeek-V3                            DeepSeek               34.2        10.9      23.8         0.27          1.10         10.1   87.2
4     DeepSeek-R1                            DeepSeek               27.4        15.2      23.5         0.55          2.19         37.8   86.8
5     GPT-4o                                 OpenAI                 35.7         9.8      23.1         2.50         10.00          3.6   85.5
6     GPT-4o-mini                            OpenAI                 29.4        11.2      22.1         0.15          0.60          3.0   84.2
7     Gemini-2.5-Pro                         Google                 17.4        62.2      20.3         1.25         10.00         10.7   82.3
8     Claude-3.5-Haiku                       Anthropic              15.9        55.2      18.5         0.80          4.00          3.4   79.7
9     GPT-5                                  OpenAI                 58.0         4.2      16.2         1.25         10.00          1.5   78.4
10    Gemini-2.5-Flash                       Google                 13.4        72.8      16.0         0.30          2.50          8.3   77.8
11    Humans (Tier-1 VCs)                    Tier-1                 23.0         5.2      10.7         0.00          0.00          0.0   70.0
12    Humans (Y Combinator)                  YC                     14.0         6.9       8.6         0.00          0.00          0.0   65.0
13    Random Classifier                      Baseline                9.0         9.0       9.0         0.00          0.00          0.0   50.0

ABOUT VCBENCH

Overview

VCBench introduces the first standardized benchmark for founder-success prediction in venture capital. Benchmarks such as SWE-bench (software engineering) and ARC-AGI (general reasoning) have shown how shared datasets accelerate progress toward AGI. Venture capital is a particularly compelling testbed: it is a domain of uncertain and incomplete information where even expert humans perform poorly.

Human Benchmarks

At the inception stage, the market index baseline is only 1.9% precision. Y Combinator achieves 3.2% precision (about 1.7× over the index), while tier-1 venture firms average 5.6% precision (about 2.9×). Recalls for both typically hover around 6%. To compare these human benchmarks fairly against VCBench, where the positive prevalence is 9%, we adjust for prevalence differences. Recall and false positive rates are invariant to prevalence, so precision is recalculated at π = 9%. Under this normalization, tier-1 VC precision rises to about 23% at 5.2% recall (F₀.₅ ≈ 10.7), while YC precision normalizes to about 14% at 6.9% recall (F₀.₅ ≈ 8.6). These values remain below the strongest model baselines, underscoring both the difficulty of the task and the relevance of VC as a proxy for testing human-level and beyond-human-level intelligence.
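
The adjustment follows from Bayes' rule: holding recall (the true positive rate) and the false positive rate fixed, precision at prevalence π is π·TPR / (π·TPR + (1 − π)·FPR). A minimal sketch of this normalization, assuming the 1.9% index baseline precision equals the inception-stage base rate:

    def precision_at_prevalence(p0, recall, pi0, pi):
        """Recompute precision at prevalence pi, given precision p0 and
        recall observed at prevalence pi0 (recall and FPR are invariant)."""
        # False positive rate implied by (p0, recall) at the original base rate.
        fpr = pi0 * recall * (1 - p0) / ((1 - pi0) * p0)
        # Re-mix positives and negatives at the new base rate.
        return pi * recall / (pi * recall + (1 - pi) * fpr)

    # Tier-1 VCs: 5.6% precision, 5.2% recall at 1.9% prevalence -> ~23% at 9%.
    print(precision_at_prevalence(0.056, 0.052, 0.019, 0.09))  # ~0.23
    # Y Combinator: 3.2% precision, 6.9% recall -> ~14% at 9%.
    print(precision_at_prevalence(0.032, 0.069, 0.019, 0.09))  # ~0.14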

Dataset

Our initial release, vcbench-founder-prediction-v1, contains 9,000 anonymized founder profiles, of which 810 (9%) are labeled successful. Success is defined rigorously: a founder’s company either exited or IPO’d above $500M, or raised more than $500M in funding. Profiles were collected from LinkedIn and Crunchbase, then passed through a multi-stage pipeline for data standardization, filtering, enrichment, and anonymization. This process reduced identifiable founders by over 80% in adversarial testing, while preserving predictive features such as education quality (via QS university rankings), job histories, and industry clustering.
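
For concreteness, here is a hypothetical sketch of what a single anonymized record and the stated success rule might look like; the field names are illustrative assumptions, not the released schema:

    # Illustrative only: field names are assumptions, not the actual v1 schema.
    profile = {
        "founder_id": "anon-00421",  # anonymized identifier
        "education": [{"qs_rank_bucket": "top-50", "degree": "MS"}],
        "jobs": [{"industry": "fintech", "role": "engineer", "years": 4}],
    }

    def is_success(exit_or_ipo_usd, total_raised_usd):
        """Success rule stated above: exit/IPO above $500M, or >$500M raised."""
        return exit_or_ipo_usd > 500e6 or total_raised_usd > 500e6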

Results

The benchmark enables consistent comparisons between prediction systems. In our baseline study, nine state-of-the-art models were evaluated across six folds of the dataset. Precision was emphasized through the F₀.₅ metric, reflecting the higher cost of false positives in VC decision-making. The strongest performer achieved 5.9× the dataset baseline precision, already surpassing normalized human performance.
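
As a rough sketch of this protocol (not the authors' exact harness), the fold-averaged F₀.₅ can be computed with scikit-learn, where predict_success stands in for whatever model is being scored:

    import numpy as np
    from sklearn.metrics import fbeta_score
    from sklearn.model_selection import StratifiedKFold

    def mean_f05(profiles, labels, predict_success, n_folds=6):
        """Average F0.5 over stratified folds; each fold is a pure test split,
        since the evaluated models are prompted rather than trained."""
        labels = np.asarray(labels)
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        scores = []
        for _, test_idx in skf.split(profiles, labels):
            y_pred = np.array([predict_success(profiles[i]) for i in test_idx])
            # beta=0.5 weights precision twice as heavily as recall.
            scores.append(fbeta_score(labels[test_idx], y_pred, beta=0.5))
        return float(np.mean(scores))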

By framing venture capital as a benchmarked environment, VCBench offers both an applied challenge and a proxy for AGI-level evaluation: success depends on reasoning under uncertainty, extracting weak signals from sparse histories, and outperforming human experts.

Initiated by Vela Research, the research arm of Vela Partners.

VCBench evaluates AI models on core venture capital tasks.

Updated in real time as new submissions are processed.