SerenQA

Model Evaluation Leaderboard

Performance on Knowledge Retrieval and Reasoning Tasks

Knowledge Retrieval (T1)

Best scores bolded, 2nd best underlined

Model            | One-Hop           | Two-Hop           | Multiple (3+) Hop | Intersection
DeepSeek-V3      | 20.45 78.71 72.88 | 3.46  10.71 9.86  | 1.97  6.22  6.55  | 2.64  7.15  8.03
GPT-4o           | 19.71 77.16 60.17 | 2.08  6.36  7.89  | 1.4   4.2   4.85  | 1.56  4.65  5.21
Claude-3.5-Haiku | 13.28 48.54 48.73 | 9.78  39.01 32.89 | 4.43  8.64  14.08 | 1.38  3.9   4.66
Llama-3.3-70B    | 19.28 70.67 74.58 | 16.63 44.34 56.57 | 2.98  10.16 11.89 | 4.8   9.61  6.05
DeepSeek-R1-70B  | 19.87 69.07 80.08 | 12.03 37    43.42 | 2.97  8.06  13.11 | 3.49  6.16  16.46
Med42-V2-70B     | 18.34 69.43 69.92 | 5.92  19.12 19.74 | 0.23  0.51  1.21  | 0.08  0.13  0.68
Qwen3-32B        | 0.37  1.27  1.27  | 0.16  0.65  0.65  | 0.24  0.36  0.48  | 0     0     0
Qwen3-8B         | 10.07 37.24 39.83 | 0.98  2.87  3.95  | 0.9   2.01  4.85  | 1.58  1.91  5.62
DeepSeek-R1-8B   | 1.27  3.41  5.51  | 0     0     0     | 0.04  0.24  0.24  | 0     0     0
Med42-V2-8B      | 8.11  23.94 9.15  | 1.05  3.31  3.97  | 1.71  4.07  4.12  | 0.04  0.13  0.14

Scatter Plot: Hit(%) across Hops

Subgraph Reasoning (T2)

Best scores bolded, 2nd best underlined

Model           | LLM Ensemble        | Expert Crowdsourced | RNS Guided
DeepSeek-V3     | 2.283  3.341  0.101 | 2.306  3.34   0.1   | 2.253  3.326  0.106
Llama-3.3-70B   | 2.519  3.842  0.07  | 2.553  3.853  0.068 | 2.531  3.829  0.075
DeepSeek-R1-70B | 2.573  2.206  0.223 | 2.572  2.238  0.204 | 2.582  2.202  0.217
Qwen-2.5-72B    | 2.024  2.683  0.153 | 2.093  2.715  0.152 | 2.114  2.719  0.155
Mixtral-8x7B    | 2.271  2.963  0.642 | 2.272  2.958  0.61  | 2.347  2.924  0.632
Qwen-2.5-32B    | 2.243  2.929  0.148 | 2.255  2.91   0.146 | 2.26   2.886  0.152
Gamma-2-27B     | 2.365  3.41   0.088 | 2.381  3.439  0.084 | 2.385  3.415  0.089
Mistral-24B     | 2.114  3.016  0.141 | 2.114  3.048  0.136 | 2.134  3.049  0.141
Qwen-2.5-7B     | 1.92   1.817  0.592 | 1.9    1.848  0.58  | 1.955  1.832  0.593
→ wo. summary   | -      -      -     | -      -      -     | -      -      -

Scatter Plot: Faithful., Compre., SerenCov

Serendipity Exploration (T3)

Best scores bolded, 2nd best underlined

Model           | LLM Ensemble        | Expert Crowdsourced | RNS Guided
DeepSeek-V3     | 2.436  0.482  0.048 | 2.494  0.462  0.061 | 2.538  0.463  0.077
Llama-3.3-70B   | 2.537  0.502  0.046 | 2.559  0.483  0.067 | 2.594  0.478  0.106
DeepSeek-R1-70B | 2.544  0.505  0.043 | 2.565  0.478  0.086 | 2.63   0.483  0.127
Qwen-2.5-72B    | 1.935  0.424  0.032 |        0.409  0.034 | 2.033  0.418  0.049
Mixtral-8x7B    | 2.269  0.428  0.028 | 2.337  0.416  0.05  | 2.409  0.412  0.07
Qwen-2.5-32B    | 2.158  0.324  0.016 | 2.25   0.312  0.022 | 2.22   0.306  0.042
Gamma-2-27B     | 2.357  0.45   0.033 | 2.379  0.414  0.057 | 2.443  0.431  0.08
Mistral-24B     | 1.903  0.212  0.011 | 1.962  0.204  0.023 | 2.006  0.213  0.035
Qwen-2.5-7B     | 1.636  0.221  0.022 | 1.721  0.229  0.026 | 1.708  0.215  0.041
→ wo. summary   | 2.447  0.482  0.05  | 2.482  0.463  0.095 | 2.51   0.468  0.134

Scatter Plot: Relevance, TypeMatch, SerenHit

Notes