Engineering AI Leaderboard
Model Evaluation · Metrics · Visualization
Benchmark Pipeline
STEP A
Train & Infer
Train each model → produce standardized outputs
STEP B
Compute Metrics
Inference → Metrics → Summary Report
STEP C
Rank Models
Fair multi-metric ranking via BenchRank
Why: Fair comparison requires identical conditions: same dataset, same scale, same pipeline.
How: Each model is implemented per its paper and trained with early stopping on a held-out validation metric.
What: Best-validation checkpoint (not the final-epoch model) → 500 standardized samples + training/inference time records.
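A minimal sketch of the Step A idea, keeping the best-validation checkpoint rather than the final-epoch model. The function name, the `patience` parameter, and the two callbacks are hypothetical stand-ins for illustration, not the leaderboard's actual training code; it assumes a validation metric where lower is better.

```python
import copy
import time
from typing import Any, Callable, Dict


def train_with_early_stopping(
    train_one_epoch: Callable[[], Any],
    validate: Callable[[Any], float],
    max_epochs: int = 100,
    patience: int = 5,
) -> Dict[str, Any]:
    """Train until the validation metric stops improving for `patience` epochs.

    Assumption: lower validation values are better (e.g. a loss).
    Returns the best-validation state, not the last one seen.
    """
    best_state, best_epoch, best_val = None, -1, float("inf")
    start = time.perf_counter()
    for epoch in range(max_epochs):
        state = train_one_epoch()          # hypothetical: returns the model state
        val = validate(state)              # hypothetical: held-out validation metric
        if val < best_val:
            # Snapshot the state so later epochs cannot overwrite it.
            best_state, best_epoch, best_val = copy.deepcopy(state), epoch, val
        elif epoch - best_epoch >= patience:
            break                          # no improvement for `patience` epochs
    return {
        "state": best_state,
        "epoch": best_epoch,
        "val": best_val,
        "train_seconds": time.perf_counter() - start,  # training time record
    }
```

The returned `state` is what would then be used to produce the standardized samples; the elapsed time is kept alongside it so efficiency can be reported later.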
Why: A single metric misleads. We measure quality, distribution-level behavior, and efficiency simultaneously.
How: Task-appropriate metrics for each output: FID · Coverage for generation, MAE · RMSE · R² for prediction. Plus efficiency (params, time).
What: Per-model metric report with ↑ higher-is-better and ↓ lower-is-better indicators.
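For the prediction metrics named above, a small sketch of how MAE, RMSE, and R² can be computed and tagged with a direction indicator. The function name and the `DIRECTION` table are hypothetical; the formulas are the standard definitions, not code taken from the leaderboard.

```python
import math
from typing import Dict, Sequence

# Direction indicators as used in the report: lower or higher is better.
DIRECTION = {"MAE": "↓", "RMSE": "↓", "R2": "↑"}


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE and R² for paired targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the targets are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": math.sqrt(ss_res / n), "R2": r2}


def format_report(metrics: Dict[str, float]) -> str:
    """Render one line per metric with its direction indicator."""
    return "\n".join(f"{name} {DIRECTION[name]}: {value:.4f}" for name, value in metrics.items())
```

For example, `format_report(regression_metrics([1, 2, 3, 4], [1, 2, 3, 5]))` yields one line each for MAE, RMSE, and R², with the arrow showing which direction is better.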
Why: Simply averaging metrics is unfair: correlated metrics and different scales distort results. BenchRank solves this.
How: Debias correlated metrics → build head-to-head dominance graph → PageRank scoring → one Total Score per model.
What: Ranked leaderboard per scale (S/M/L/XL), switchable between quality-only and quality+efficiency views.
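The page does not spell out BenchRank's internals, so the following is only an illustrative sketch of the three named stages, not the actual BenchRank implementation. It assumes (1) debiasing means down-weighting each metric by how strongly it correlates with the others, (2) a dominance edge points from the loser to the winner of each pairwise comparison, weighted by the metrics the winner takes, and (3) a standard PageRank power iteration with damping 0.85. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def rank_models(
    scores: Dict[str, List[float]],
    higher_is_better: List[bool],
    damping: float = 0.85,
    iters: int = 100,
) -> Dict[str, float]:
    """Return {model: total_score}, sorted best first (scores sum to 1)."""
    names = list(scores)
    n, m = len(names), len(higher_is_better)
    # Orient every metric so that larger means better.
    cols = [
        [scores[k][j] if higher_is_better[j] else -scores[k][j] for k in names]
        for j in range(m)
    ]
    # Stage 1 (assumed): a metric's weight shrinks as it correlates with others.
    weights = [
        1.0 / (1.0 + sum(abs(pearson(cols[j], cols[k])) for k in range(m) if k != j))
        for j in range(m)
    ]
    # Stage 2: edge[a][b] is the weight with which model a "concedes" to model b.
    edge = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                edge[a][b] = sum(w for w, col in zip(weights, cols) if col[b] > col[a])
    # Stage 3: PageRank by power iteration over the dominance graph.
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for a in range(n):
            out = sum(edge[a])
            for b in range(n):
                # A model that never loses spreads its mass uniformly.
                share = edge[a][b] / out if out > 0 else 1.0 / n
                new[b] += damping * rank[a] * share
        rank = new
    return dict(sorted(zip(names, rank), key=lambda kv: kv[1], reverse=True))
```

Under these assumptions, two near-duplicate metrics each carry roughly half the weight of an independent one, so a model cannot win simply by being good at the same thing twice. A quality-only versus quality+efficiency view would correspond to passing a different subset of metric columns.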
3D Generation · DONE
100%
DeepJEB + DeepWheel + DrivAerNet · 8192 pts · S/M/L/XL
6 Models Complete
3D-GAN, DeepSDF, PointFlow, ShapeGF, AtlasNet, Diffusion3D
DeepSDF: Best on Total Score
2D Generation · DONE
100%
DeepJEB + DeepWheel + DrivAerNet · 128×128 · S/M/L/XL
9 Models Complete
GAN, VAE, DCGAN, LSGAN, WGAN-CP/GP, R1GAN, DDPM, VQVAE
DDPM: Best on Total Score
3D Prediction: Field · DONE
100%
DeepJEB + DrivAerNet + DrivAerML · 8192 pts · S/M/L/XL
9 Models Complete
GeoTransolver, Transolver++, Transolver, DoMINO, LinearNO (×2), RegDGCNN, PointNet, GeoFNO
GeoTransolver: Best on Total Score
2D Prediction: Field · DONE
100%
DeepWheel + DeepJEB + PDEBench-Darcy + AirfRANS · 128×128 · depth / stress / PDE / RANS
8 Models Complete
U-Net, ResNet-UNet, Attention U-Net, U-Net++, SegFormer-B0, FPN, GLPN, DPT-Hybrid
FPN (ResNet-18): Best on Total Score
3D Prediction: Scalar · DONE
100%
DeepWheel · 8192 pts · mass / mode7 / mode11
10 Models
PointNet, DGCNN, PointNet++ (×2), PCT (×2), Point Transformer (×2), PointMLP (×2)
DGCNN: Best on Total Score
2D Prediction: Scalar · DONE
100%
DeepWheel · 128×128 · mass / mode7 / mode11
7 Models Complete
SimpleCNN, ResNet-18/34, EfficientNet-B0, ConvNeXt-Tiny, DenseNet-121, ViT-Tiny
DenseNet-121: Best on Total Score
1D Prediction: Scalar · DONE
100%
Concrete + Airfoil + CMAPSS · tabular / timeseries · S/M/L/XL
16 Models Complete
MLP, FT-Transformer, NODE, TabNet, TabPFN, XGBoost, LightGBM, Random Forest, Gaussian Process, Ridge, LSTM, BiLSTM, DCNN, TCN, CNN-LSTM, DAST
TabPFN: Best on Total Score
Mission
Objective accuracy evaluation of AI models and decision support for optimal model selection
Benchmark Dataset · Evaluation Methods · Automation Framework
Annual KPI
90%
Built-in Model Coverage
90%
Workflow Coverage
20 WF
Annual Roadmap
Ultimate Goal
Agentic Leaderboard
A system that automatically recommends the optimal AI model based on context and conditions
Objective Comparison
Universal Metrics
Context-aware Recommendation
Auto-Validation Pipeline