Engineering AI Leaderboard
Model Evaluation · Metrics · Visualization
Real-World Engineering Decisions, Backed by Objective Evaluation
🔬 Real engineering data, not academic benchmarks
🏭 Realistic data scales, not unlimited training sets
Most AI benchmarks evaluate on academic datasets at unlimited scale. Manufacturing engineers face a different reality: real geometries, limited data, and metrics that matter in production. Narnia Benchmark tests SOTA models on real engineering datasets across multiple data scales, so teams pick what actually works for their data, not what wins on a synthetic leaderboard.
STEP A
Train & Infer
Train each model → produce standardized outputs
STEP B
Compute Metrics
Inference → Metrics → Summary Report
STEP C
Rank Models
Fair multi-metric ranking via BenchRank
Why: Fair comparison requires identical conditions: same dataset, same scale, same pipeline.
How: Each model is implemented per its paper and trained with early stopping on a held-out validation metric.
What: Best-validation checkpoint (not the final-epoch model) → 500 standardized samples + training/inference time records.
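The checkpoint rule in Step A can be sketched as follows. This is a minimal illustration, not the benchmark's actual training loop: the function name `select_best_checkpoint`, the `patience` parameter, the lower-is-better assumption, and the validation values are all hypothetical.

```python
from typing import Iterable, Tuple


def select_best_checkpoint(
    val_metric_per_epoch: Iterable[float], patience: int = 3
) -> Tuple[int, float, int]:
    """Simulate early stopping on a lower-is-better validation metric.

    Returns (best_epoch, best_value, last_epoch_run).
    """
    best_epoch, best_value = -1, float("inf")
    last_epoch = -1
    for epoch, value in enumerate(val_metric_per_epoch):
        last_epoch = epoch
        if value < best_value:
            # A real training loop would save the model weights here.
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_epoch, best_value, last_epoch


# Made-up validation curve, for illustration only.
history = [0.90, 0.70, 0.55, 0.60, 0.58, 0.61, 0.40]
best_epoch, best_value, last_epoch = select_best_checkpoint(history, patience=3)
print(best_epoch, best_value, last_epoch)
```

The returned `best_epoch` can differ from `last_epoch`, which is why the evaluated model is the best-validation checkpoint rather than the final-epoch one.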
Why: A single metric misleads. We measure quality, distribution-level behavior, and efficiency simultaneously.
How: Task-appropriate metrics for each output: FID · Coverage for generation, MAE · RMSE · R² for prediction. Plus efficiency (params, time).
What: Per-model metric report with ↑ higher-is-better and ↓ lower-is-better indicators.
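For the prediction metrics named in Step B, the standard definitions of MAE, RMSE, and R² can be sketched in plain Python. The function name `regression_metrics`, the `DIRECTION` table, and the sample values are assumptions for illustration; the benchmark's own implementation is not shown in the source.

```python
import math
from typing import Dict, Sequence

# Direction indicators as used in the metric report (assumed encoding).
DIRECTION = {"MAE": "lower", "RMSE": "lower", "R2": "higher"}


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Dict[str, float]:
    """Compute MAE, RMSE, and R^2 with their textbook definitions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 is undefined when the targets have zero variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Made-up targets and predictions, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f} ({DIRECTION[name]} is better)")
```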
Why: Simply averaging metrics is unfair: correlated metrics and different scales distort results. BenchRank solves this.
How: Debias correlated metrics → build head-to-head dominance graph → PageRank scoring → one Total Score per model.
What: Ranked leaderboard per scale (S/M/L/XL), switchable between quality-only and quality+efficiency views.
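The dominance-graph and PageRank stages of Step C could look roughly like the sketch below. It is not the BenchRank implementation: the function name `dominance_pagerank`, the edge-weighting rule, the damping factor, and the example scores are all assumptions, and the debiasing stage is only stood in for by optional caller-supplied per-metric `weights`.

```python
from typing import Dict, Optional


def dominance_pagerank(
    scores: Dict[str, Dict[str, float]],
    higher_is_better: Dict[str, bool],
    weights: Optional[Dict[str, float]] = None,
    damping: float = 0.85,
    iterations: int = 100,
) -> Dict[str, float]:
    """Score models by PageRank over a head-to-head dominance graph."""
    models = list(scores)
    n = len(models)
    weights = weights or {metric: 1.0 for metric in higher_is_better}

    # edge[loser][winner] = total weight of metrics on which winner beats loser.
    edge = {a: {b: 0.0 for b in models} for a in models}
    for metric, up in higher_is_better.items():
        for a in models:
            for b in models:
                if a == b:
                    continue
                va, vb = scores[a][metric], scores[b][metric]
                if (vb > va) if up else (vb < va):
                    edge[a][b] += weights[metric]

    # Power iteration: rank flows from losers to the models that beat them.
    rank = {m: 1.0 / n for m in models}
    for _ in range(iterations):
        new_rank = {m: (1.0 - damping) / n for m in models}
        for a in models:
            out = sum(edge[a].values())
            for b in models:
                # An unbeaten model has no out-edges; spread its rank uniformly.
                share = edge[a][b] / out if out > 0 else 1.0 / n
                new_rank[b] += damping * rank[a] * share
        rank = new_rank
    return rank


# Hypothetical models and made-up metric values, for illustration only.
example_scores = {
    "model_a": {"FID": 10.0, "Coverage": 0.9},
    "model_b": {"FID": 20.0, "Coverage": 0.8},
    "model_c": {"FID": 30.0, "Coverage": 0.7},
}
directions = {"FID": False, "Coverage": True}
total_score = dominance_pagerank(example_scores, directions)
for model in sorted(total_score, key=total_score.get, reverse=True):
    print(model, round(total_score[model], 4))
```

Because the graph uses only pairwise wins, the result does not depend on each metric's numeric scale, which is the problem plain averaging runs into.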
3D Generation | DONE
100%
DeepJEB + DeepWheel + DrivAerNet · 8192pts · S/M/L/XL
6 Models Complete
3D-GAN, DeepSDF, PointFlow, ShapeGF, AtlasNet, Diffusion3D
24/24 cells complete · 0 pending
🏆 DeepSDF: Best on Total Score
2D Generation | DONE
100%
DeepJEB + DeepWheel + DrivAerNet · 128×128 · S/M/L/XL
9 Models Complete
GAN, VAE, DCGAN, LSGAN, WGAN-CP/GP, R1GAN, DDPM, VQVAE
9 done · 0 pending
🏆 DDPM: Best on Total Score
3D Prediction: Field | DONE
100%
DeepJEB + DrivAerNet + DrivAerML · 8192pts · S/M/L/XL
9 Models Complete
GeoTransolver, Transolver++, Transolver, DoMINO, LinearNO (×2), RegDGCNN, PointNet, GeoFNO
9 done · 0 pending
🏆 GeoTransolver: Best on Total Score
2D Prediction: Field | DONE
100%
DeepWheel + DeepJEB + PDEBench-Darcy + AirfRANS · 128×128 · depth / stress / PDE / RANS
8 Models Complete
U-Net, ResNet-UNet, Attention U-Net, U-Net++, SegFormer-B0, FPN, GLPN, DPT-Hybrid
8 done · 0 pending
🏆 FPN (ResNet-18): Best on Total Score
3D Prediction: Scalar | DONE
100%
DeepWheel · 8192pts · mass / mode7 / mode11
10 Models
PointNet, DGCNN, PointNet++ (×2), PCT (×2), Point Transformer (×2), PointMLP (×2)
10 done · 0 pending
🏆 DGCNN: Best on Total Score
2D Prediction: Scalar | DONE
100%
DeepWheel · 128×128 · mass / mode7 / mode11
7 Models Complete
SimpleCNN, ResNet-18/34, EfficientNet-B0, ConvNeXt-Tiny, DenseNet-121, ViT-Tiny
7 done · 0 pending
🏆 DenseNet-121: Best on Total Score
1D Prediction: Scalar | DONE
100%
Concrete + Airfoil + CMAPSS · tabular / timeseries · S/M/L/XL
16 Models Complete
MLP, FT-Transformer, NODE, TabNet, TabPFN, XGBoost, LightGBM, Random Forest, Gaussian Process, Ridge, LSTM, BiLSTM, DCNN, TCN, CNN-LSTM, DAST
16 done · 0 pending
🏆 TabPFN: Best on Total Score
Objective accuracy evaluation of AI models and decision support for optimal model selection
Benchmark Dataset · Evaluation Methods · Automation Framework
90%
Built-in Model Coverage
90%
Workflow Coverage
20 WF
1Q
Core Pipeline
+ MVP Leaderboard
~2026.02
DONE | 3 WF
2Q
Model Expansion
+ Domain Extension
2026.03~05
DONE | 7 WF
NOW
3Q
Full Coverage
+ 100% Validation
2026.06~08
In Progress | 20 WF
4Q
Agentic Leaderboard
+ Competitor Benchmark
2026.09~11
Planned | 20 WF
3Q Details (Current)
Comprehensive 3D/2D/1D workflow coverage with fully automated verification
Overall Annual Progress ~55%
Agentic Leaderboard
A system that automatically recommends the optimal AI model based on context and conditions
Objective Comparison
Universal Metrics
Context-aware Recommendation
Auto-Validation Pipeline