Performance on Knowledge Retrieval and Reasoning Tasks
| Model | One-Hop | | | Two-Hop | | | Multiple (3+) Hop | | | Intersection | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3 | 20.45 | 78.71 | 72.88 | 3.46 | 10.71 | 9.86 | 1.97 | 6.22 | 6.55 | 2.64 | 7.15 | 8.03 |
| GPT-4o | 19.71 | 77.16 | 60.17 | 2.08 | 6.36 | 7.89 | 1.40 | 4.20 | 4.85 | 1.56 | 4.65 | 5.21 |
| Claude-3.5-Haiku | 13.28 | 48.54 | 48.73 | 9.78 | 39.01 | 32.89 | 4.43 | 8.64 | 14.08 | 1.38 | 3.90 | 4.66 |
| Llama-3.3-70B | 19.28 | 70.67 | 74.58 | 16.63 | 44.34 | 56.57 | 2.98 | 10.16 | 11.89 | 4.80 | 9.60 | 16.05 |
| DeepSeek-R1-70B | 19.87 | 69.07 | 80.08 | 12.03 | 37.00 | 43.42 | 2.97 | 8.06 | 13.11 | 3.49 | 6.16 | 16.46 |
| Med42-V2-70B | 18.34 | 69.43 | 69.92 | 5.92 | 19.12 | 19.74 | 0.23 | 0.51 | 1.21 | 0.08 | 0.13 | 0.68 |
| Qwen3-32B | 0.37 | 1.27 | 1.27 | 0.16 | 0.65 | 0.65 | 0.24 | 0.36 | 0.48 | 0.00 | 0.00 | 0.00 |
| Qwen3-8B | 10.07 | 37.24 | 39.83 | 0.98 | 2.87 | 3.95 | 0.90 | 2.01 | 4.85 | 1.58 | 1.91 | 5.62 |
| DeepSeek-R1-8B | 1.27 | 3.41 | 5.51 | 0.00 | 0.00 | 0.00 | 0.04 | 0.24 | 0.24 | 0.00 | 0.00 | 0.00 |
| Med42-V2-8B | 8.11 | 23.90 | 49.15 | 1.05 | 3.31 | 3.97 | 1.71 | 4.07 | 4.12 | 0.04 | 0.13 | 0.14 |
| Model | LLM Ensemble | | | Expert Crowdsourced | | | RNS Guided | | |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3 | 2.283 | 3.341 | 0.101 | 2.306 | 3.340 | 0.100 | 2.253 | 3.326 | 0.106 |
| Llama-3.3-70B | 2.519 | 3.842 | 0.070 | 2.553 | 3.853 | 0.068 | 2.531 | 3.829 | 0.075 |
| DeepSeek-R1-70B | 2.573 | 2.206 | 0.223 | 2.572 | 2.238 | 0.204 | 2.582 | 2.202 | 0.217 |
| Qwen-2.5-72B | 2.024 | 2.683 | 0.153 | 2.093 | 2.715 | 0.152 | 2.114 | 2.719 | 0.155 |
| Mixtral-8x7B | 2.271 | 2.963 | 0.642 | 2.272 | 2.958 | 0.610 | 2.347 | 2.924 | 0.632 |
| Qwen-2.5-32B | 2.243 | 2.929 | 0.148 | 2.255 | 2.910 | 0.146 | 2.260 | 2.886 | 0.152 |
| Gemma-2-27B | 2.365 | 3.410 | 0.088 | 2.381 | 3.439 | 0.084 | 2.385 | 3.415 | 0.089 |
| Mistral-24B | 2.114 | 3.016 | 0.141 | 2.114 | 3.048 | 0.136 | 2.134 | 3.049 | 0.141 |
| Qwen-2.5-7B | 1.920 | 1.817 | 0.592 | 1.900 | 1.848 | 0.580 | 1.955 | 1.832 | 0.593 |
| → w/o summary | - | - | - | - | - | - | - | - | - |
| Model | LLM Ensemble | | | Expert Crowdsourced | | | RNS Guided | | |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3 | 2.436 | 0.482 | 0.048 | 2.494 | 0.462 | 0.061 | 2.538 | 0.463 | 0.077 |
| Llama-3.3-70B | 2.537 | 0.502 | 0.046 | 2.559 | 0.483 | 0.067 | 2.594 | 0.478 | 0.106 |
| DeepSeek-R1-70B | 2.544 | 0.505 | 0.043 | 2.565 | 0.478 | 0.086 | 2.630 | 0.483 | 0.127 |
| Qwen-2.5-72B | 1.935 | 0.424 | 0.030 | 2.000 | 0.409 | 0.034 | 2.033 | 0.418 | 0.049 |
| Mixtral-8x7B | 2.269 | 0.428 | 0.028 | 2.337 | 0.416 | 0.050 | 2.409 | 0.412 | 0.070 |
| Qwen-2.5-32B | 2.158 | 0.324 | 0.016 | 2.250 | 0.312 | 0.022 | 2.220 | 0.306 | 0.042 |
| Gemma-2-27B | 2.357 | 0.450 | 0.033 | 2.379 | 0.414 | 0.057 | 2.443 | 0.431 | 0.080 |
| Mistral-24B | 1.903 | 0.212 | 0.011 | 1.962 | 0.204 | 0.023 | 2.006 | 0.213 | 0.035 |
| Qwen-2.5-7B | 1.636 | 0.221 | 0.022 | 1.721 | 0.229 | 0.026 | 1.708 | 0.215 | 0.041 |
| → w/o summary | 2.447 | 0.482 | 0.050 | 2.482 | 0.463 | 0.095 | 2.510 | 0.468 | 0.134 |
The "→ w/o summary" rows report ablation results obtained without the summary input.