AI Benchmark Leaderboard
Model Evaluation · Metrics · Visualization
Benchmark Pipeline
STEP A
Train & Infer
Model + Dataset → Train → Infer → Results
STEP B
Calc Metrics
Inference → Metrics → Summary Report
STEP C
Rank Models
Debiased Weights → Dominance → PageRank
Why — Fair comparison requires identical conditions — same dataset, same scale, same pipeline.
How — Each model is implemented per its paper, trained for up to 1,000 epochs with early stopping, and the best checkpoint is selected for inference (see the training-loop sketch below).
What — Best checkpoint → 500 standardized samples + training/inference time records.
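A minimal sketch of the Step A training loop, assuming a generic PyTorch model with hypothetical `training_loss` / `validation_loss` methods; the epoch cap mirrors the 1,000-epoch limit above, while the patience value, learning rate, and optimizer choice are illustrative rather than the project's actual configuration.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, *,
                              max_epochs=1000, patience=20, lr=1e-4):
    """Train up to max_epochs, stop early when validation loss stops
    improving, and return the model loaded with its best checkpoint."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, stale = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss = model.training_loss(batch)   # hypothetical model-specific loss
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(model.validation_loss(b).item() for b in val_loader)

        if val < best_val:                      # new best checkpoint
            best_val, stale = val, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:               # early stopping
                break

    model.load_state_dict(best_state)           # best checkpoint used for inference
    return model
```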
Why — A single metric misleads. We measure quality, diversity, and efficiency simultaneously.
How — 500 generated vs. 500 real samples, compared across FID·CD·MAE (quality), Precision·Recall·Coverage (diversity), and params·time (efficiency); see the Chamfer Distance sketch below.
What — Per-model metric report with ↑ higher-is-better and ↓ lower-is-better indicators.
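Of the Step B metrics, Chamfer Distance (CD) is the most self-contained to illustrate. Below is a minimal sketch using one common CD formulation (mean squared nearest-neighbor distance in both directions, lower is better); the point-cloud shapes and the 500-vs-500 aggregation shown in the comment are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(gen_pts: np.ndarray, real_pts: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds."""
    d_gen_to_real, _ = cKDTree(real_pts).query(gen_pts)  # nearest real point per generated point
    d_real_to_gen, _ = cKDTree(gen_pts).query(real_pts)  # nearest generated point per real point
    return float(np.mean(d_gen_to_real ** 2) + np.mean(d_real_to_gen ** 2))

# Illustrative per-model aggregation over the 500-vs-500 comparison:
# gen_samples, real_samples: lists of (8192, 3) arrays
# cd = np.mean([chamfer_distance(g, r) for g, r in zip(gen_samples, real_samples)])
```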
Why — Simply averaging metrics is unfair — correlated metrics and different scales distort results. BenchRank solves this.
How — Debias correlated metrics → build head-to-head dominance graph → PageRank scoring → one Total Score per model; see the ranking sketch below.
What — Ranked leaderboard per scale (S/M/L/XL), switchable between quality-only and quality+efficiency views.
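A minimal sketch of the ranking stage, assuming metric scores have already been debiased and oriented so that higher is better; the dominance rule (edge weight = number of metrics won) and the PageRank damping factor are illustrative stand-ins, not BenchRank's exact formulation.

```python
import networkx as nx

def benchrank(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Rank models from debiased, higher-is-better metric scores:
    build a head-to-head dominance graph, then score it with PageRank."""
    models = list(scores)
    g = nx.DiGraph()
    g.add_nodes_from(models)
    for a in models:
        for b in models:
            if a == b:
                continue
            wins = sum(scores[a][m] > scores[b][m] for m in scores[a])
            if wins:
                g.add_edge(b, a, weight=wins)   # dominated model points to the dominant one
    return nx.pagerank(g, alpha=0.85, weight="weight")

# Example with hypothetical debiased scores:
# total = benchrank({"PointFlow": {"fid": 0.9, "cov": 0.8},
#                    "ShapeGF":   {"fid": 0.7, "cov": 0.6}})
```

Because losing models point at winning models, PageRank mass accumulates on models that dominate many head-to-head comparisons, yielding one Total Score per model per scale.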
3D Generation · DONE · 100%
DeepJEB + DrivAerNet · 8,192 pts · S/M/L/XL
6 Models Complete
3D-GAN, DeepSDF, PointFlow, ShapeGF, AtlasNet, Diffusion3D
🏆 PointFlow — Best on Total Score
METRICS
MV-FID · FPD · CD · EMD · F-Score · MS-SSIM · Precision · Recall · Density · Coverage · Train Time · Infer Time
3D Evaluation — Field · DONE · 100%
DeepJEB + DrivAerNet · 8,192 pts · S/M/L/XL
6 Models Complete
Transolver++, AB-UPT, Transolver, PointNet, RegDGCNN, GeoFNO
🏆 Transolver — Best on Total Score
METRICS
MAE · RMSE · MAPE · R² · Rel-L2 · MaxAE · MAC · Train Time · Infer Time
2D Generation · DONE · 100%
DeepJEB + DrivAerNet · 128×128 · S/M/L/XL
9 Models Complete
GAN, VAE, DCGAN, LSGAN, WGAN-CP/GP, R1GAN, DDPM, VQVAE
🏆 DDPM — Best on FID · Precision · Recall · Coverage
METRICS
IS · FID · LPIPS · PSNR · MS-SSIM · Precision · Density · Recall · Coverage · Train Time · Infer Time
Mission
Objective evaluation of AI model accuracy, with decision support for selecting the optimal model
Benchmark Dataset · Evaluation Methods · Automation Framework
Annual KPI
Built-in Model Coverage: 90% · Workflow Coverage: 90% · Annual Roadmap: 20 workflows