Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi1 · Xi Ye2 · Hao Tang3 · Zhigang Zhu4 · Eunsol Choi5
1CUNY Graduate Center 2Princeton University 3BMCC, CUNY 4CCNY, CUNY 5New York University
A compact 0.5–3 B add-on that recovers 94 % of Pass@5 oracle accuracy (only a 6 % relative gap), all without any base-model fine-tuning.
Abstract
Scaling test-time compute by sampling multiple reasoning paths yields large gains but leaves an oracle gap. We introduce SSA, a tiny LLM fine-tuned with GRPO to read k candidate solutions and emit one final answer. Averaged over GSM8K, MATH, AIME-24, AMC-23, and OlympiadBench, SSA reaches 56.1 % accuracy, just 3.6 pp shy of the Pass@5 oracle, and beats 7 B process-reward models while training on <5 % of their data. The same SSA generalises across base-model family (Qwen → Llama-3), base-model size (7 B → 32 B), and k without re-tuning.
Method at a Glance
Parallel candidates → small RL‑tuned SSA → final answer
- Step 1: Freeze a base LLM; sample k solutions.
- Step 2: Concatenate solutions + prompt → SSA (0.5–3 B).
- Training: GRPO with a sparse, verifiable reward (correct final answer / output format); a minimal sketch follows.
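
The sketch below illustrates the two pieces above. The function names, prompt template, and reward weights are assumptions made for illustration; the paper's exact choices may differ. `build_ssa_prompt` packs a question and its k sampled candidates into a single SSA input, and `verifiable_reward` is a sparse correct/format reward of the kind GRPO can optimize.

```python
import re
from typing import List, Optional

def build_ssa_prompt(question: str, candidates: List[str]) -> str:
    """Pack the question and its k sampled solutions into one SSA input.

    Hypothetical template; the paper's exact wording is not reproduced here.
    """
    parts = [f"Question: {question}", ""]
    for i, sol in enumerate(candidates, start=1):
        parts.append(f"Candidate solution {i}:\n{sol}\n")
    parts.append("Review the candidate solutions and give the final answer in \\boxed{}.")
    return "\n".join(parts)

ANSWER_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(text: str) -> Optional[str]:
    """Return the last \\boxed{...} answer in a completion, if any."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Sparse, verifiable reward (the weights here are assumptions):
    1.0 for a correct final answer, 0.1 for a well-formatted but wrong
    answer, 0.0 when no \\boxed{} answer can be parsed."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if pred == gold_answer.strip() else 0.1
```

During training, one would sample several SSA completions per packed prompt and feed these rewards to a GRPO trainer (e.g., TRL's `GRPOTrainer`); that wiring is omitted here.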
Results
SSA‑3B surpasses a 7 B process‑reward verifier. The same checkpoint can be plugged into frozen base models up to 32 B with no re‑tuning.
The results below report accuracy averaged over GSM8K, MATH, AIME-24, AMC-23, and OlympiadBench.

| Method | Model size | Samples (k) | Avg. accuracy (%) |
|---|---|---|---|
| Pass@1 (base) | 7 B | 1 | 45.5 |
| Pass@5 (oracle) | — | 5 | 59.7 |
| Majority vote | — | 5 | 49.7 |
| Qwen‑PRM | 7 B | 5 | 53.0 |
| SSA (ours) | 3 B | 5 | 56.1 |
Below we also compare SSA‑3B with sequential RL models that fine‑tune the entire 7 B / 14 B / 32 B backbone. Despite a roughly 10× smaller trained footprint, SSA stays within 2–3 pp of their accuracy at a similar test‑time token budget.
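
For reference, the majority-vote baseline in the table above simply picks the most common extracted answer across the k candidates. A minimal version, reusing `extract_answer` from the earlier sketch, might look like this:

```python
from collections import Counter
from typing import List, Optional

def majority_vote(candidates: List[str]) -> Optional[str]:
    """Pick the most frequent extracted answer across k candidate solutions.

    Reuses extract_answer() from the sketch above; candidates with no
    parsable answer are ignored.
    """
    answers = [a for a in (extract_answer(c) for c in candidates) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```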

Discussion
Key Questions:
- Does RL optimize the output distribution? We find that reasoning ability can be separated from the base model's weights: plugging a small SSA on top of the base model's sampled outputs significantly increases performance, much as full RL training of the base model does. One way to view SSA is as shaping the base model's output distribution not through its weights but through its sampled outputs.
- How helpful is the thinking process? Consistent with prior work, we find that an explicit thinking process is not always helpful for final-answer benchmark performance. Is thinking necessary for the model? Probably not. However, the thinking variant does appear somewhat more robust to out-of-domain generalization when we test on other tasks such as MMLU-PRO and ARC-C.
- Can RL learn new reasoning abilities? A more interesting setting applies SSA on top of truncated reasoning outputs. We find that even a small SSA can recover most of the correct answers when the last 10 % of each reasoning trace, including the final answer, is cut off (see the sketch after this list). However, the pure-RL SSA is much worse here than the RL+SFT version. We believe stitching partial reasoning processes together to arrive at a final answer is an interesting direction to explore.
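
As a rough illustration of the truncation setup, here is a sketch that keeps only the first 90 % of each candidate. Whitespace-delimited tokens are used as a stand-in for model tokens, which is an assumption for illustration; the paper's exact truncation unit is not specified on this page.

```python
from typing import List

def truncate_candidates(candidates: List[str], keep_frac: float = 0.9) -> List[str]:
    """Cut off the tail of each candidate solution (final steps and answer).

    Whitespace tokens are a proxy for model tokens here (assumption).
    """
    truncated = []
    for sol in candidates:
        tokens = sol.split()
        keep = max(1, int(len(tokens) * keep_frac))
        truncated.append(" ".join(tokens[:keep]))
    return truncated
```

The truncated candidates can then be packed into the same SSA input as before, e.g. `build_ssa_prompt(question, truncate_candidates(candidates))`.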

