Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis

Conversational AI Lab
University of Illinois at Urbana-Champaign

Synthetic data has been a powerful driver in advancing LLMs. It enables the creation of high-quality datasets at scale, overcoming the obstacles of manual data collection, such as high cost, time, and effort. Building on this, we introduce a novel methodology for synthesizing high-quality instruction-tuning datasets using Reference-Level Feedback.

Our method revolves around collecting feedback from high-quality reference samples. This feedback captures the desirable characteristics that make these reference samples effective, and we use it to guide the synthesis process. Our experimental results demonstrate significant improvements in both data quality and efficiency compared to traditional feedback approaches.

An overview of our data synthesis pipeline. Starting from our seed data, we select a reference sample and collect Reference-Level Feedback for both the instruction and response. The instruction feedback is used to synthesize new instructions; we then generate their corresponding responses and improve them using the response feedback.

Reference-Level Feedback

Feedback is a well-known approach for improving synthetic data quality. Traditional approaches operate at the sample level: an LLM generates a response, receives feedback (either through self-reflection or from an external source), and then refines its original response. This approach has proven effective in enhancing LLM performance on alignment benchmarks and reinforcing key principles such as helpfulness and truthfulness.

Our method takes a different approach by collecting feedback at the reference level, from carefully selected reference samples in the seed data. Many approaches already use seed data as in-context examples during the synthesis process. We leverage it further by systematically analyzing the samples and capturing information about their desirable characteristics (e.g., clarity and relevance) through feedback. This feedback is then used throughout the synthesis process.

Comparison of feedback approaches for data synthesis. Left: Traditional sample-level feedback generates and applies feedback individually for each sample. Right: Our Reference-Level Feedback approach collects feedback once from a high-quality reference sample and applies it to synthesize and improve multiple new samples.

To be more specific, feedback is collected on both the instruction and response components of each reference sample. The instruction-specific feedback is used to guide the synthesis of new instructions, and response-specific feedback is used to refine the corresponding responses. Since synthesized instructions share key characteristics of their reference counterparts, response-specific feedback remains relevant and is used to improve the quality of synthesized responses. This framework enables us to systematically propagate the desirable qualities of reference samples to newly generated samples, establishing overall higher quality standards for data synthesis.
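
As a rough illustration, the sketch below renders this pipeline in Python. It is a minimal, hypothetical sketch rather than our actual implementation: llm(prompt) stands in for any chat-completion call to the teacher model (GPT-4o-mini in our setup), and the prompt strings are placeholders, not the prompts used in the paper.

    # Minimal sketch of Reference-Level Feedback synthesis (hypothetical helper: llm(prompt) -> str).
    def synthesize_from_reference(reference, llm, n_new=5):
        # 1. Collect feedback ONCE from the high-quality reference sample.
        instruction_feedback = llm(
            "What characteristics make this instruction clear, relevant, and effective?\n"
            f"Instruction: {reference['instruction']}"
        )
        response_feedback = llm(
            "What characteristics make this response high quality?\n"
            f"Response: {reference['response']}"
        )
        new_samples = []
        for _ in range(n_new):
            # 2. Instruction feedback guides the synthesis of a new instruction.
            new_instruction = llm(
                f"Write a new instruction that exhibits these characteristics:\n{instruction_feedback}"
            )
            # 3. Generate an initial response for the new instruction.
            initial_response = llm(f"Respond to the following instruction:\n{new_instruction}")
            # 4. Response feedback from the reference refines the initial response.
            improved_response = llm(
                "Improve the response below so that it reflects these characteristics:\n"
                f"{response_feedback}\n\nInstruction: {new_instruction}\nResponse: {initial_response}"
            )
            new_samples.append({"instruction": new_instruction, "response": improved_response})
        return new_samples

Note that the two feedback calls are made once per reference sample and then reused for every sample synthesized from it, which is where the efficiency gain over sample-level feedback comes from.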

Experiments

We present REFED, a high-quality instruction-tuning dataset of 10K samples. It was created using our framework, with GPT-4o-mini as our teacher model and the LIMA training dataset as our seed data.

To evaluate the effectiveness of our dataset, we finetune various language models and then assess their instruction-following abilities on AlpacaEval 2.0 and Arena-Hard. These benchmarks use an LLM as a judge to compare model responses against reference responses and report metrics such as win rate and length-controlled win rate.
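
For intuition, the sketch below tallies a judge-based win rate in its simplest form. This is a simplification of what the benchmarks actually do (AlpacaEval 2.0, for example, uses weighted judge preferences and, for the length-controlled variant, an additional regression-based correction for response length); the judge function here is a hypothetical stand-in for the LLM judge.

    # Simplified sketch of a judge-based win rate (not the benchmarks' actual scoring code).
    # judge(instruction, model_response, reference_response) is a hypothetical LLM-judge call
    # returning "model", "reference", or "tie".
    def win_rate(examples, judge):
        score = 0.0
        for ex in examples:
            verdict = judge(ex["instruction"], ex["model_response"], ex["reference_response"])
            if verdict == "model":
                score += 1.0      # model response preferred over the reference response
            elif verdict == "tie":
                score += 0.5      # ties count as half a win
        return 100.0 * score / len(examples)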

How Effective is Reference-Level Feedback for Data Synthesis?

To demonstrate the effectiveness of Reference-Level Feedback, we finetune Llama-3.1-8B-Instruct on datasets synthesized using several approaches:

  • No Feedback: the seed dataset alone, with no synthesis.
  • Reference-Level Instruction Feedback: Using instruction feedback to synthesize new instructions, then generating responses for them.
  • Reference-Level Instruction Feedback and Sample-Level Response Feedback: Using instruction feedback to synthesize new instructions, generating a response for each, collecting sample-level feedback on that response, and then improving it.
  • Reference-Level Instruction + Response Feedback: Our proposed framework, using both reference-level instruction and response feedback.

Analysis of the different components of Reference-Level Feedback for data synthesis. We evaluate the impact of the instruction and response feedback, and also compare against traditional sample-level feedback for response improvement. Results show performance after finetuning Llama-3.1-8B-Instruct on each generated dataset. Green subscripts indicate improvements after fine-tuning. Metrics shown are: Length-Controlled Win Rate (LC), Win Rate (WR), Standard Error (SE), Average Length (Len.), and Average # Tokens (Tok.).

The results show an increase in performance every time we introduce a component of our framework. We also see that models trained on datasets synthesized with Reference-Level Feedback outperform those trained on datasets synthesized with sample-level feedback.

How Does Our Method Compare Against Other Baselines?

We compare our model against various other baselines, including Llama-3.1-8B-Instruct finetuned on various well-known synthetic datasets, as well as leading SFT-based, 8B-parameter models from the AlpacaEval 2.0 leaderboard. Our results show that training on our dataset achieves state-of-the-art performance, with our model sometimes even outperforming significantly larger and more powerful models such as GPT-3.5 and Llama-3.1-405B-Instruct.

Evaluation results of Llama-3.1-8B-Instruct finetuned on REFED against selected baselines. Top shows results from finetuning on various synthetic datasets. Middle shows the performance of leading models from the AlpacaEval 2.0 leaderboard. Bottom shows our model trained on REFED. Results demonstrate that our model outperforms these baselines across both evaluation benchmarks.

Does REFED Generalize To Different Model Architectures?

We also show that finetuning on REFED yields improvements across different models (Llama-3.1-8B and Mistral-7B) for both the base and instruct variants. Our results demonstrate consistent improvement across all model variants, with the instruct variants showing the most significant improvements. Notably, the base models finetuned on REFED either outperform or are competitive with their instruct counterparts.

Evaluation results of finetuning the base and instruct variants of Llama-3.1-8B and Mistral-7B on REFED. Green subscripts indicate improvements after finetuning. Note that we do not report the performance of the original base models because they are not instruction-tuned.

Does Filtering Enhance the Effectiveness of REFED?

Lastly, we explore the effectiveness of different filtering strategies (both sketched in code below):

  • LLM-Judge Filtering: We use an LLM judge to compare the original response with the improved response, keeping only the samples where the improved response is ranked higher.
  • ROUGE-L Similarity Filtering: Starting with a randomly selected sample, we iteratively add candidates whose instruction's maximum ROUGE-L similarity with already-selected instructions is below a certain threshold.
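
For concreteness, here is a rough Python sketch of both strategies. It is illustrative only: llm_judge is a hypothetical stand-in for the pairwise LLM judge, the 0.7 threshold is an arbitrary example value rather than the one used in our experiments, and ROUGE-L is computed with a plain longest-common-subsequence F1 instead of a dedicated library.

    # Illustrative sketches of the two filtering strategies (not our exact implementation).

    def llm_judge_filter(samples, llm_judge):
        # Keep samples whose improved response is ranked above the original by the judge.
        return [s for s in samples
                if llm_judge(s["instruction"], s["original_response"], s["improved_response"]) == "improved"]

    def rouge_l_f1(a, b):
        # ROUGE-L F1 via longest common subsequence over whitespace tokens.
        x, y = a.split(), b.split()
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
        lcs = dp[len(x)][len(y)]
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(y), lcs / len(x)
        return 2 * precision * recall / (precision + recall)

    def rouge_similarity_filter(samples, threshold=0.7):
        # Greedily keep samples whose instruction stays below the similarity threshold with
        # respect to every instruction already kept (start from a random sample in practice).
        kept = [samples[0]]
        for s in samples[1:]:
            max_sim = max(rouge_l_f1(s["instruction"], k["instruction"]) for k in kept)
            if max_sim < threshold:
                kept.append(s)
        return kept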

Using these methods, along with random sampling, we create subsets of size 1K, 2K, 4K, and 8K. We finetune Llama-3.1-8B-Instruct on these subsets and report the length-controlled win rate in the graph below. Results indicate that LLM-Judge filtering is the most effective, with ROUGE-L similarity filtering also showing some improvement.

Length-Controlled Win Rate on AlpacaEval 2.0 for Llama-3.1-8B-Instruct finetuned on various subsets of REFED, based on different filtering strategies.

BibTeX

@misc{mehri2025samplelevelfeedbackusingreferencelevel,
  title={Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis},
  author={Shuhaib Mehri and Xiusi Chen and Heng Ji and Dilek Hakkani-Tür},
  year={2025},
  eprint={2502.04511},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.04511}
}