Learning to Reason Across Parallel Samples for LLM Reasoning
Jianing Qi1 · Xi Ye2 · Hao Tang3 · Zhigang Zhu4 · Eunsol Choi5
1CUNY Graduate Center 2Princeton University 3BMCC, CUNY 4CCNY, CUNY 5New York University
A compact 0.5–3 B add-on that recovers 94 % of Pass@5 oracle accuracy (only a 6 % relative gap), all without any base-model fine-tuning.
Abstract
Scaling test-time compute by sampling multiple reasoning paths yields large gains but leaves an oracle gap. We introduce SSA, a tiny LLM fine-tuned with GRPO to read k candidate solutions and emit one final answer. Averaged over GSM8K, MATH, AIME-24, AMC-23, and OlympiadBench, SSA reaches 56.1 % accuracy, just 3.6 pp shy of the Pass@5 oracle, and beats 7 B process-reward models while training on <5 % of their data. The same SSA generalises across base-model family (Qwen → Llama-3), base-model size (7 B → 32 B), and k without re-tuning.
Method at a Glance
Parallel candidates → small RL‑tuned SSA → final answer
- Step 1: Freeze a base LLM; sample k solutions.
- Step 2: Concatenate solutions + prompt → SSA (0.5–3 B).
- Training: GRPO with a sparse, verifiable reward (correct final answer / output format); a minimal sketch follows.
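
The sketch below illustrates the two pieces above. The function names, prompt template, and reward weights are assumptions made for illustration; the paper's exact choices may differ. `build_ssa_prompt` packs a question and its k sampled candidates into a single SSA input, and `verifiable_reward` is a sparse correct/format reward of the kind GRPO can optimize.

```python
import re
from typing import List, Optional

def build_ssa_prompt(question: str, candidates: List[str]) -> str:
    """Pack the question and its k sampled solutions into one SSA input.

    Hypothetical template; the paper's exact wording is not reproduced here.
    """
    parts = [f"Question: {question}", ""]
    for i, sol in enumerate(candidates, start=1):
        parts.append(f"Candidate solution {i}:\n{sol}\n")
    parts.append("Review the candidate solutions and give the final answer in \\boxed{}.")
    return "\n".join(parts)

ANSWER_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(text: str) -> Optional[str]:
    """Return the last \\boxed{...} answer in a completion, if any."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Sparse, verifiable reward (the weights here are assumptions):
    1.0 for a correct final answer, 0.1 for a well-formatted but wrong
    answer, 0.0 when no \\boxed{} answer can be parsed."""
    pred = extract_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if pred == gold_answer.strip() else 0.1
```

During training, one would sample several SSA completions per packed prompt and feed these rewards to a GRPO trainer (e.g., TRL's `GRPOTrainer`); that wiring is omitted here.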
Results
SSA‑3B surpasses a 7 B process‑reward verifier. The same checkpoint can be plugged into frozen base models up to 32 B with no re‑tuning.
The results below report accuracy averaged over GSM8K, MATH, AIME-24, AMC-23, and OlympiadBench.

| Method | Model size | Samples (k) | Avg. accuracy (%) |
|---|---|---|---|
| Pass@1 (base) | 7 B | 1 | 45.5 |
| Pass@5 (oracle) | — | 5 | 59.7 |
| Majority vote | — | 5 | 49.7 |
| Qwen‑PRM | 7 B | 5 | 53.0 |
| SSA (ours) | 3 B | 5 | 56.1 |
Below we also compare SSA‑3B with sequential RL models that fine‑tune the entire 7 B / 14 B / 32 B backbone. Despite a roughly 10× smaller trained footprint, SSA stays within 2–3 pp of their accuracy at a similar test‑time token budget.
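
For reference, the majority-vote baseline in the table above simply picks the most common extracted answer across the k candidates. A minimal version, reusing `extract_answer` from the earlier sketch, might look like this:

```python
from collections import Counter
from typing import List, Optional

def majority_vote(candidates: List[str]) -> Optional[str]:
    """Pick the most frequent extracted answer across k candidate solutions.

    Reuses extract_answer() from the sketch above; candidates with no
    parsable answer are ignored.
    """
    answers = [a for a in (extract_answer(c) for c in candidates) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```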

Discussion
Key Questions:
- Does RL optimize the output distribution? We find that reasoning ability can be separated from the base model's weights: plugging a small SSA on top of the base model's sampled outputs significantly increases performance, much as full RL training of the base model does. One way to view SSA is as shaping the base model's output distribution not through its weights but through its sampled outputs.
- How helpful is the thinking process? Consistent with prior work, we find that an explicit thinking process is not always helpful for final-answer benchmark performance. Is thinking necessary for the model? Probably not. However, the thinking variant does appear somewhat more robust to out-of-domain generalization when we test on other tasks such as MMLU-PRO and ARC-C.
- Can RL learn new reasoning abilities? A more interesting setting applies SSA on top of truncated reasoning outputs. We find that even a small SSA can recover most of the correct answers when the last 10 % of each reasoning trace, including the final answer, is cut off (see the sketch after this list). However, the pure-RL SSA is much worse here than the RL+SFT version. We believe stitching partial reasoning processes together to arrive at a final answer is an interesting direction to explore.
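
As a rough illustration of the truncation setup, here is a sketch that keeps only the first 90 % of each candidate. Whitespace-delimited tokens are used as a stand-in for model tokens, which is an assumption for illustration; the paper's exact truncation unit is not specified on this page.

```python
from typing import List

def truncate_candidates(candidates: List[str], keep_frac: float = 0.9) -> List[str]:
    """Cut off the tail of each candidate solution (final steps and answer).

    Whitespace tokens are a proxy for model tokens here (assumption).
    """
    truncated = []
    for sol in candidates:
        tokens = sol.split()
        keep = max(1, int(len(tokens) * keep_frac))
        truncated.append(" ".join(tokens[:keep]))
    return truncated
```

The truncated candidates can then be packed into the same SSA input as before, e.g. `build_ssa_prompt(question, truncate_candidates(candidates))`.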

