SerenQA

Model Evaluation Leaderboard

Performance on Knowledge Retrieval and Reasoning Tasks

Knowledge Retrieval (T1)

Best scores bolded, 2nd best underlined

	One-Hop			Two-Hop			Multiple(3+)Hop			Intersection

DeepSeek-V3	20.45	78.71	72.88	3.46	10.71	9.86	1.97	6.22	6.55	2.64	7.15	8.03
GPT-4o	19.71	77.16	60.17	2.08	6.36	7.89	1.4	4.2	4.85	1.56	4.65	5.21
Claude-3.5-Haiku	13.28	48.54	48.73	9.78	39.01	32.89	4.43	8.64	14.08	1.38	3.9	4.66
Llama-3.3-70B	19.28	70.67	74.58	16.63	44.34	56.57	2.98	10.16	11.89	4.8	9.6	16.05
DeepSeek-R1-70B	19.87	69.07	80.08	12.03	37	43.42	2.97	8.06	13.11	3.49	6.16	16.46
Med42-V2-70B	18.34	69.43	69.92	5.92	19.12	19.74	0.23	0.51	1.21	0.08	0.13	0.68
Qwen3-32B	0.37	1.27	1.27	0.16	0.65	0.65	0.24	0.36	0.48	0	0	0
Qwen3-8B	10.07	37.24	39.83	0.98	2.87	3.95	0.9	2.01	4.85	1.58	1.91	5.62
DeepSeek-R1-8B	1.27	3.41	5.51	0	0	0	0.04	0.24	0.24	0	0	0
Med42-V2-8B	8.11	23.9	49.15	1.05	3.31	3.97	1.71	4.07	4.12	0.04	0.13	0.14

Showing 10 of 12 rows (page 1 of 2).

Scatter Plot: Hit(%) across Hops

Subgraph Reasoning (T2)

	LLM Ensemble			Expert Crowdsourced			RNS Guided
Model	Faithful.	Compre.	SerenCov	Faithful.	Compre.	SerenCov	Faithful.	Compre.	SerenCov

DeepSeek-V3	2.283	3.341	0.101	2.306	3.34	0.1	2.253	3.326	0.106
Llama-3.3-70B	2.519	3.842	0.07	2.553	3.853	0.068	2.531	3.829	0.075
DeepSeek-R1-70B	2.573	2.206	0.223	2.572	2.238	0.204	2.582	2.202	0.217
Qwen-2.5-72B	2.024	2.683	0.153	2.093	2.715	0.152	2.114	2.719	0.155
Mixtral-8x7B	2.271	2.963	0.642	2.272	2.958	0.61	2.347	2.924	0.632
Qwen-2.5-32B	2.243	2.929	0.148	2.255	2.91	0.146	2.26	2.886	0.152
Gemma-2-27B	2.365	3.41	0.088	2.381	3.439	0.084	2.385	3.415	0.089
Mistral-24B	2.114	3.016	0.141	2.114	3.048	0.136	2.134	3.049	0.141
Qwen-2.5-7B	1.92	1.817	0.592	1.9	1.848	0.58	1.955	1.832	0.593
→ wo. summary	-	-	-	-	-	-	-	-	-

Showing 10 of 16 rows (page 1 of 2).

Scatter Plot: Faithful., Compre., SerenCov

Serendipity Exploration (T3)

	LLM Ensemble			Expert Crowdsourced			RNS Guided
Model	Relevance	TypeMatch	SerenHit	Relevance	TypeMatch	SerenHit	Relevance	TypeMatch	SerenHit

DeepSeek-V3	2.436	0.482	0.048	2.494	0.462	0.061	2.538	0.463	0.077
Llama-3.3-70B	2.537	0.502	0.046	2.559	0.483	0.067	2.594	0.478	0.106
DeepSeek-R1-70B	2.544	0.505	0.043	2.565	0.478	0.086	2.63	0.483	0.127
Qwen-2.5-72B	1.935	0.424	0.03	2	0.409	0.034	2.033	0.418	0.049
Mixtral-8x7B	2.269	0.428	0.028	2.337	0.416	0.05	2.409	0.412	0.07
Qwen-2.5-32B	2.158	0.324	0.016	2.25	0.312	0.022	2.22	0.306	0.042
Gemma-2-27B	2.357	0.45	0.033	2.379	0.414	0.057	2.443	0.431	0.08
Mistral-24B	1.903	0.212	0.011	1.962	0.204	0.023	2.006	0.213	0.035
Qwen-2.5-7B	1.636	0.221	0.022	1.721	0.229	0.026	1.708	0.215	0.041
→ wo. summary	2.447	0.482	0.05	2.482	0.463	0.095	2.51	0.468	0.134

Showing 10 of 16 rows (page 1 of 2).

Scatter Plot: Relevance, TypeMatch, SerenHit

Notes

Bold indicates the best score in each column.
Underline indicates the second-best score.
Rows marked "→ wo. summary" show ablation results without (w/o) the summary input.
All metrics are averaged across multiple tasks and datasets.
Higher is better for all reported metrics.
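
Bold and underline markers do not carry over into this plain-text copy of the tables, so the best and second-best entries can be recomputed from the numbers. The sketch below is one way to do that under the "higher is better" rule stated above. The function name top_two_per_column and the dictionary layout are illustrative assumptions, not part of the leaderboard. The data are the three One-Hop columns of the T1 rows visible on page 1, so the two rows on page 2 are not considered and could change the ranking.

```python
from typing import Dict, List, Tuple


def top_two_per_column(rows: Dict[str, List[float]]) -> List[Tuple[str, str]]:
    """Return (best, second_best) model names for each column; higher is better."""
    n_cols = len(next(iter(rows.values())))
    result = []
    for col in range(n_cols):
        # Sort model names by their score in this column, largest first.
        ranked = sorted(rows, key=lambda model: rows[model][col], reverse=True)
        result.append((ranked[0], ranked[1]))
    return result


# One-Hop columns of the T1 table, copied from the rows shown on page 1 only.
t1_one_hop = {
    "DeepSeek-V3": [20.45, 78.71, 72.88],
    "GPT-4o": [19.71, 77.16, 60.17],
    "Claude-3.5-Haiku": [13.28, 48.54, 48.73],
    "Llama-3.3-70B": [19.28, 70.67, 74.58],
    "DeepSeek-R1-70B": [19.87, 69.07, 80.08],
    "Med42-V2-70B": [18.34, 69.43, 69.92],
    "Qwen3-32B": [0.37, 1.27, 1.27],
    "Qwen3-8B": [10.07, 37.24, 39.83],
    "DeepSeek-R1-8B": [1.27, 3.41, 5.51],
    "Med42-V2-8B": [8.11, 23.9, 49.15],
}

for best, second in top_two_per_column(t1_one_hop):
    print(f"best: {best}, second best: {second}")
```

Ties are broken by row order here, which is an arbitrary choice; the leaderboard does not say how it handles equal scores.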