AI Benchmark Leaderboard
Model Evaluation · Metrics · Visualization
Benchmark Pipeline
STEP A
Train & Infer
Model + Dataset → Train → Infer → Results
STEP B
Calc Metrics
Inference → Metrics → Summary Report
STEP C
Rank Models
Debiased Weights → Dominance → PageRank
Why — Fair comparison requires identical conditions — same dataset, same scale, same pipeline.
How — Each model is implemented per its paper, trained for up to 1,000 epochs with early stopping, and the best checkpoint is selected for inference (see the training-loop sketch below).
What — Best checkpoint → 500 standardized samples + training/inference time records.
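A minimal sketch of the Step A training loop, assuming a generic PyTorch model with hypothetical `training_loss` / `validation_loss` methods; the epoch cap mirrors the 1,000-epoch limit above, while the patience value, learning rate, and optimizer choice are illustrative rather than the project's actual configuration.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader, *,
                              max_epochs=1000, patience=20, lr=1e-4):
    """Train up to max_epochs, stop early when validation loss stops
    improving, and return the model loaded with its best checkpoint."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state, stale = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            opt.zero_grad()
            loss = model.training_loss(batch)   # hypothetical model-specific loss
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val = sum(model.validation_loss(b).item() for b in val_loader)

        if val < best_val:                      # new best checkpoint
            best_val, stale = val, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            stale += 1
            if stale >= patience:               # early stopping
                break

    model.load_state_dict(best_state)           # best checkpoint used for inference
    return model
```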
Why — A single metric misleads. We measure quality, diversity, and efficiency simultaneously.
How — 500 generated vs. 500 real samples, compared across FID·CD·MAE (quality), Precision·Recall·Coverage (diversity), and params·time (efficiency); see the Chamfer Distance sketch below.
What — Per-model metric report with ↑ higher-is-better and ↓ lower-is-better indicators.
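Of the Step B metrics, Chamfer Distance (CD) is the most self-contained to illustrate. Below is a minimal sketch using one common CD formulation (mean squared nearest-neighbor distance in both directions, lower is better); the point-cloud shapes and the 500-vs-500 aggregation shown in the comment are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(gen_pts: np.ndarray, real_pts: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds."""
    d_gen_to_real, _ = cKDTree(real_pts).query(gen_pts)  # nearest real point per generated point
    d_real_to_gen, _ = cKDTree(gen_pts).query(real_pts)  # nearest generated point per real point
    return float(np.mean(d_gen_to_real ** 2) + np.mean(d_real_to_gen ** 2))

# Illustrative per-model aggregation over the 500-vs-500 comparison:
# gen_samples, real_samples: lists of (8192, 3) arrays
# cd = np.mean([chamfer_distance(g, r) for g, r in zip(gen_samples, real_samples)])
```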
Why — Simply averaging metrics is unfair — correlated metrics and different scales distort results. BenchRank solves this.
How — Debias correlated metrics → build head-to-head dominance graph → PageRank scoring → one Total Score per model; see the ranking sketch below.
What — Ranked leaderboard per scale (S/M/L/XL), switchable between quality-only and quality+efficiency views.
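A minimal sketch of the ranking stage, assuming metric scores have already been debiased and oriented so that higher is better; the dominance rule (edge weight = number of metrics won) and the PageRank damping factor are illustrative stand-ins, not BenchRank's exact formulation.

```python
import networkx as nx

def benchrank(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Rank models from debiased, higher-is-better metric scores:
    build a head-to-head dominance graph, then score it with PageRank."""
    models = list(scores)
    g = nx.DiGraph()
    g.add_nodes_from(models)
    for a in models:
        for b in models:
            if a == b:
                continue
            wins = sum(scores[a][m] > scores[b][m] for m in scores[a])
            if wins:
                g.add_edge(b, a, weight=wins)   # dominated model points to the dominant one
    return nx.pagerank(g, alpha=0.85, weight="weight")

# Example with hypothetical debiased scores:
# total = benchrank({"PointFlow": {"fid": 0.9, "cov": 0.8},
#                    "ShapeGF":   {"fid": 0.7, "cov": 0.6}})
```

Because losing models point at winning models, PageRank mass accumulates on models that dominate many head-to-head comparisons, yielding one Total Score per model per scale.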
3D Generation · DONE · 100%
DeepJEB + DrivAerNet · 8,192 pts · S/M/L/XL
6 Models Complete
3D-GAN, DeepSDF, PointFlow, ShapeGF, AtlasNet, Diffusion3D
🏆 PointFlow — Best on Total Score
METRICS
MV-FID · FPD · CD · EMD · F-Score · MS-SSIM · Precision · Recall · Density · Coverage · Train Time · Infer Time
3D Evaluation — Field · DONE · 100%
DeepJEB + DrivAerNet · 8,192 pts · S/M/L/XL
6 Models Complete
Transolver++, AB-UPT, Transolver, PointNet, RegDGCNN, GeoFNO
🏆 Transolver — Best on Total Score
METRICS
MAE · RMSE · MAPE · R² · Rel-L2 · MaxAE · MAC · Train Time · Infer Time
2D Generation · DONE · 100%
DeepJEB + DrivAerNet · 128×128 · S/M/L/XL
9 Models Complete
GAN, VAE, DCGAN, LSGAN, WGAN-CP/GP, R1GAN, DDPM, VQVAE
🏆 DDPM — Best on FID · Precision · Recall · Coverage
METRICS
IS · FID · LPIPS · PSNR · MS-SSIM · Precision · Density · Recall · Coverage · Train Time · Infer Time
Mission
Objective evaluation of AI model accuracy, with decision support for selecting the optimal model
Benchmark Dataset · Evaluation Methods · Automation Framework
Annual KPI
Built-in Model Coverage: 90% · Workflow Coverage: 90% · Annual Roadmap: 20 workflows