Synthetic data has been a powerful driver in advancing LLMs. It enables the creation of high-quality datasets at scale, sidestepping the high cost, time, and effort of manual data collection. Building on this, we introduce a novel methodology for synthesizing high-quality instruction-tuning datasets using Reference-Level Feedback.
Our method revolves around collecting feedback from high-quality reference samples. This feedback captures the desirable characteristics that make the reference samples effective, and we use it to guide the synthesis process. Our experiments demonstrate significant improvements in both the quality and efficiency of synthesized data compared to traditional feedback approaches.
Feedback is a well-established approach for improving synthetic data quality. Traditional approaches operate at the sample level: an LLM generates a response, receives feedback (either through self-reflection or from an external source), and then refines its original response. This approach has proven effective in enhancing LLM performance on alignment benchmarks and reinforcing key principles such as helpfulness and truthfulness.
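For concreteness, a minimal sketch of this sample-level loop is shown below; the `llm` helper and the prompt wording are illustrative assumptions rather than the exact prompts used in prior work.

```python
# Minimal sketch of a sample-level feedback loop. The `llm` helper is a
# placeholder (an assumption) for any chat-completion call to a teacher model.

def llm(prompt: str) -> str:
    """Placeholder for a call to a teacher model's chat-completion API."""
    raise NotImplementedError

def sample_level_refine(instruction: str) -> str:
    # 1. Generate an initial response.
    response = llm(f"Respond to the following instruction:\n{instruction}")
    # 2. Critique the response (self-reflection or an external critic).
    feedback = llm(
        "Critique this response for helpfulness and truthfulness.\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    # 3. Refine the original response using the critique.
    return llm(
        "Revise the response to address the feedback.\n"
        f"Instruction: {instruction}\nResponse: {response}\nFeedback: {feedback}"
    )
```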
Our method takes a different approach by collecting feedback at the reference level, drawing it from carefully selected reference samples in the seed data. While many approaches use seed data only as in-context examples during synthesis, we further leverage it by systematically analyzing the reference samples and capturing their desirable characteristics (e.g., clarity and relevance) as feedback. This feedback is then used throughout the synthesis process.
More specifically, we collect feedback on both the instruction and response components of each reference sample. The instruction-specific feedback guides the synthesis of new instructions, and the response-specific feedback is used to refine the corresponding responses. Because synthesized instructions share key characteristics with their reference counterparts, the response-specific feedback remains relevant and improves the quality of the synthesized responses. This framework lets us systematically propagate the desirable qualities of reference samples to newly generated samples, raising the overall quality bar for data synthesis.
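Under the same assumptions as the sketch above (a generic `llm` placeholder standing in for the teacher model, and illustrative prompt wording rather than the paper's exact prompts), the reference-level pipeline can be sketched roughly as follows.

```python
# Rough sketch of Reference-Level Feedback for a single reference sample.

def synthesize_from_reference(ref_instruction: str, ref_response: str) -> tuple[str, str]:
    # 1. Collect feedback on what makes the reference instruction effective.
    instr_feedback = llm(
        "Describe the desirable characteristics (e.g., clarity, relevance) "
        f"of this instruction:\n{ref_instruction}"
    )
    # 2. Collect feedback on what makes the reference response effective.
    resp_feedback = llm(
        "Describe the desirable characteristics of this response:\n"
        f"Instruction: {ref_instruction}\nResponse: {ref_response}"
    )
    # 3. Synthesize a new instruction guided by the instruction-specific feedback.
    new_instruction = llm(
        "Write a new instruction that shares these characteristics:\n"
        f"{instr_feedback}\nReference instruction: {ref_instruction}"
    )
    # 4. Draft a response, then refine it with the response-specific feedback,
    #    which stays relevant because the new instruction shares the reference
    #    instruction's key characteristics.
    draft = llm(f"Respond to the following instruction:\n{new_instruction}")
    refined = llm(
        "Revise the response so that it exhibits these characteristics:\n"
        f"{resp_feedback}\nInstruction: {new_instruction}\nResponse: {draft}"
    )
    return new_instruction, refined
```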
We present REFED, a high-quality instruction-tuning dataset of 10K samples created with our framework, using GPT-4o-mini as the teacher model and the LIMA training set as seed data.
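As a rough illustration of how such a dataset could be assembled from seed data, the driver loop below iterates over reference samples and synthesizes several new samples from each; the sampling details and the `per_reference` count are assumptions, not the exact construction of REFED.

```python
# Illustrative driver loop (not the exact REFED construction procedure).

def build_dataset(seed_samples: list[dict], per_reference: int = 10) -> list[dict]:
    synthetic = []
    for ref in seed_samples:
        for _ in range(per_reference):
            instruction, response = synthesize_from_reference(
                ref["instruction"], ref["response"]
            )
            synthetic.append({"instruction": instruction, "response": response})
    return synthetic
```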
To evaluate the effectiveness of our dataset, we finetune various language models and then assess their instruction-following abilities with AlpacaEval 2.0 and Arena-Hard. These benchmarks use an LLM as a judge to compare model responses against reference responses and report metrics such as win rate and length-controlled win rate.
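As a simplified illustration of the win-rate metric these benchmarks report (this is not the actual AlpacaEval 2.0 or Arena-Hard implementation, which use their own judging prompts, baselines, and length-control adjustments), a pairwise LLM-as-judge comparison can be sketched as follows.

```python
# Simplified pairwise win-rate computation with an LLM judge (illustrative only).

def judge_prefers_model(instruction: str, model_resp: str, reference_resp: str) -> bool:
    verdict = llm(
        "Which response better follows the instruction? Answer 'A' or 'B'.\n"
        f"Instruction: {instruction}\nA: {model_resp}\nB: {reference_resp}"
    )
    return verdict.strip().upper().startswith("A")

def win_rate(examples: list[dict]) -> float:
    wins = sum(
        judge_prefers_model(e["instruction"], e["model_response"], e["reference_response"])
        for e in examples
    )
    return 100.0 * wins / len(examples)
```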
To demonstrate the effectiveness of Reference-Level Feedback, we finetune Llama-3.1-8B-Instruct on datasets synthesized using several approaches, adding one component of our framework at a time.
Performance increases every time we introduce a component of our framework. Models trained on datasets synthesized with Reference-Level Feedback also outperform those trained on datasets synthesized with sample-level feedback.
We compare our model against a range of baselines, including Llama-3.1-8B-Instruct finetuned on several well-known synthetic datasets as well as leading SFT-based, 8B-parameter models from the AlpacaEval 2.0 leaderboard. Our results show that training on our dataset achieves state-of-the-art performance and, in some cases, even outperforms significantly larger and more powerful models such as GPT-3.5 and Llama-3.1-405B-Instruct.
We also show that finetuning on REFED yields improvements across different models (Llama-3.1-8B and Mistral-7B) for both the base and instruct variants. Improvements are consistent across all model variants, with the instruct variants showing the most significant gains. Notably, the base models finetuned on REFED either outperform or are competitive with their instruct counterparts.
Lastly, we explore the effectiveness of different strategies for filtering the synthesized data.
@misc{mehri2025samplelevelfeedbackusingreferencelevel,
  title={Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis},
  author={Shuhaib Mehri and Xiusi Chen and Heng Ji and Dilek Hakkani-Tür},
  year={2025},
  eprint={2502.04511},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.04511},
}