In a not-so-distant future, imagine an AI that learns to improve its own performance much like a budding scientist refining experiments in a busy laboratory. Picture a system that not only receives instructions but also hones its own “prompts” and “weights” to solve complex tasks—from answering multi-hop questions to cracking arithmetic puzzles and classifying delicate signals from nature. This is the promise of modern Natural Language Processing (NLP) pipelines, and it’s the heart of the breakthrough captured in “Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together.”
In this article, we explore how researchers from Stanford University have developed a dual approach that brings together the best of fine-tuning and prompt optimization—two techniques that have long stood apart. Their clever “BetterTogether” algorithm is transforming how we think about modular language model (LM) programs and is steadily raising the bar for what AI systems can achieve.
🧩 Building Modular Minds: The Promise of LM Programs
Imagine building an intricate machine—not a simple clock, but a sophisticated system where each part plays a precise, calibrated role. In the world of NLP, this “machine” is a modular LM program. Rather than relying on one monolithic model to tackle every task, engineers now design pipelines where different modules focus on different sub-tasks. For example:
- A module might generate search queries based on a question.
- Another module might then retrieve relevant passages.
- Yet another produces the final answer after reasoning over the collected data.
These pipelines, often seen in systems like Retrieval Augmented Generation (RAG), are remarkably powerful because they break down complex issues into bite-sized challenges. However, this modularity also comes with a burden: How do you ensure that every module (with its own prompt template and LM weight configuration) works in harmony to produce the best overall result?
The study we’re discussing tackles exactly that issue. It introduces a framework for simultaneously optimizing both the discrete language prompts and the continuous LM weights across the pipeline to maximize a downstream performance metric. And it does so by letting the LM essentially “teach itself” to improve, in a cycle that alternates between prompt optimization and weight fine-tuning.
🔍 Redefining the Art of Optimization
🤖 From Static Prompts to Self-Improving Agents
Traditionally, building NLP systems has involved a lot of manual engineering of prompt strings—those carefully crafted instructions that tell the model what to do. Fine-tuning, on the other hand, adjusts the underlying neural network weights to better fit the data. Until now, these two approaches have largely been pursued separately. But what if you could have your cake and eat it too? What if the same LM could both adjust its parameters and refine its instructions for itself?
The BetterTogether algorithm achieves this by first optimizing the prompt templates using a bootstrapping method. In this stage, the system uses a few-shot strategy to automatically generate numerous candidate prompts, evaluates them against a quality metric, and picks the best-performing set. Think of it as letting the AI sift through a library of potential instructions and select the ones that best fit its current understanding.
Once the best prompts are identified, the LM’s weights are fine-tuned using data that has been self-generated from these well-optimized prompts. In effect, the LM is bootstrapping training labels for every module based not on hand-curated data but on its own “correct” outputs as determined by the metric. After this weight optimization (using techniques such as Low-Rank Adaptation, or LoRA), the system once again optimizes its prompts to adjust to the newly fine-tuned weights. This alternating strategy results in an impressive synergy: the LM not only learns how to answer questions better but also learns how to ask itself the right questions.
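As a rough illustration of the weight-update half of this loop, here is a minimal LoRA sketch using the Hugging Face transformers and peft libraries; the model identifier, adapter rank, and target modules are illustrative assumptions rather than the paper’s exact configuration.

```python
# Minimal LoRA fine-tuning sketch (hyperparameters are illustrative assumptions).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # one of the LMs evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA injects small low-rank adapter matrices into selected projection layers,
# so only a tiny fraction of the parameters is trained on the bootstrapped data.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which weight matrices receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small fraction of weights being trained

# `model` can now be fine-tuned (e.g., with transformers' Trainer) on
# prompt-completion pairs bootstrapped from the program's own successful traces.
```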
🛠️ Mathematical Formulation: The Optimization Objective
At its core, the strategy can be viewed in terms of an optimization problem. Given a modular LM program, denoted as Φ, the goal is to configure its module-level prompts (Π) and weights (Θ) to maximize the expected performance on a training set. Mathematically, we are solving:
$$
\operatorname*{arg\,max}_{\Theta,\,\Pi}\;\frac{1}{|X|}\sum_{(x,m) \in X}\mu\Big(\Phi_{\langle\Theta,\Pi\rangle}(x),\, m\Big)
$$

Here,
- $X$ represents the set of training inputs along with metadata or labels,
- $\mu$ is the downstream task metric (such as accuracy), and
- $\Phi_{\langle\Theta,\Pi\rangle}$ is our program with its component weights $\Theta$ and prompt templates $\Pi$.
This formulation shows that the optimization doesn’t simply rely on gradient descent (as in classic fine-tuning) because the overall system output has a discontinuous dependence on the discrete prompts. Instead, approximate strategies—like bootstrapping and random search—must be employed.
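To make the formulation concrete, here is a plain-Python sketch of scoring candidate prompt configurations by the average metric and keeping the best one; `make_program` and `candidate_prompt_sets` are hypothetical placeholders, not part of any library:

```python
def average_metric(program, X, mu):
    """The objective: the mean of mu over the training pairs (x, m)."""
    scores = [mu(program(x), m) for x, m in X]
    return sum(scores) / len(scores)

def select_best_prompts(make_program, candidate_prompt_sets, X, mu):
    """Pick the prompt set that maximizes the objective by enumeration,
    since no gradient flows through the discrete prompt choice."""
    best_prompts, best_score = None, float("-inf")
    for prompts in candidate_prompt_sets:
        score = average_metric(make_program(prompts), X, mu)
        if score > best_score:
            best_prompts, best_score = prompts, score
    return best_prompts, best_score
```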
🌐 BetterTogether: A Symphony of Alternating Optimization
🎻 Tuning Prompts and Weights in Harmony
Let’s take a behind-the-scenes look at the BetterTogether strategy. The fundamental insight is that prompt optimization and weight fine-tuning should not be performed in isolation. Consider the analogy of an orchestra: fine-tuning is like the careful calibration of each instrument’s pitch, while prompt optimization is akin to conducting the ensemble so that every instrument plays in perfect synchrony.
In their approach, researchers first run the program using default (or “vanilla”) prompts, collecting performance data. Then they alternate between two steps:
Prompt Optimization:
Using a bootstrapping strategy (such as the BootstrapFewShotRS algorithm provided in DSPy), the system generates multiple candidate few-shot prompts for each module. It samples subsets of program traces (the intermediate outputs and reasoning steps) and scores each candidate based on how well it improves downstream performance. The candidate that scores highest on the validation split is chosen.
Weight Fine-Tuning:
With the newly optimized prompts in hand, the system then uses bootstrapped traces to fine-tune the LM’s weights. The BootstrapFinetune algorithm aggregates many module-level prompt-completion pairs to form a training dataset and then fine-tunes using techniques like LoRA. This results in updated weights that are better aligned with the task at hand.
After these two steps, prompt optimization is run again with the fine-tuned weights. Thus, the system benefits twice from the optimization: it first gets better at generating promising training examples, and then it refines its internal knowledge based on those examples.
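As a rough illustration of that bootstrapping step, the sketch below shows how module-level training pairs might be harvested from successful runs; `run_with_trace` and the acceptance threshold are hypothetical stand-ins for DSPy’s internal trace machinery, not its actual API.

```python
def bootstrap_traces(program, X, mu, threshold=1.0):
    """Collect module-level prompt-completion pairs from runs whose final output scores well."""
    training_pairs = []
    for x, m in X:
        # run_with_trace is a hypothetical helper that returns the final prediction
        # plus a record of every module call (inputs and outputs) made along the way.
        prediction, trace = run_with_trace(program, x)
        if mu(prediction, m) >= threshold:   # keep only runs judged "correct" by the metric
            training_pairs.extend(trace)     # each item: (module, prompt, completion)
    return training_pairs
```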
📝 Pseudocode Walkthrough
For those who love to code, here’s a pseudocode snippet that encapsulates the essence of the BetterTogether algorithm:
    function BetterTogether(Φ⟨Θ, Π⟩, X, μ)
        Π'  ← OptimizePrompts(Φ⟨Θ, Π⟩, X, μ)     # 1) optimize prompts against the initial weights
        Θ'  ← FinetuneWeights(Φ⟨Θ, Π'⟩, X, μ)    # 2) fine-tune weights on traces bootstrapped with Π'
        Π'' ← OptimizePrompts(Φ⟨Θ', Π⟩, X, μ)    # 3) re-optimize prompts for the fine-tuned weights Θ'
        return Φ⟨Θ', Π''⟩                         # final program: new weights plus re-optimized prompts
    end function
This simple loop belies the complexity hidden underneath. The OptimizePrompts and FinetuneWeights functions aren’t simple gradient updates: they involve generating, filtering, and evaluating massive quantities of self-generated training data.
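In DSPy terms, this loop roughly corresponds to chaining two of the framework’s optimizers. The sketch below uses BootstrapFewShotWithRandomSearch (the random-search few-shot optimizer referred to above as BootstrapFewShotRS) and BootstrapFinetune; constructor arguments and fine-tuning backends vary across DSPy versions, so treat it as a shape rather than a recipe.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, BootstrapFinetune

def better_together(program, trainset, metric):
    """Approximate the Π → Θ → Π loop: prompt search, weight fine-tuning, prompt search again."""
    # Step 1: bootstrap few-shot demonstrations and randomly search over candidate prompts.
    prompt_opt = BootstrapFewShotWithRandomSearch(metric=metric, num_candidate_programs=8)
    program_pi = prompt_opt.compile(program, trainset=trainset)

    # Step 2: fine-tune the underlying LM weights on traces bootstrapped
    # by the prompt-optimized program (LoRA or similar, depending on the backend).
    weight_opt = BootstrapFinetune(metric=metric)
    program_theta = weight_opt.compile(program_pi, trainset=trainset)

    # Step 3: re-run prompt optimization against the fine-tuned weights.
    return prompt_opt.compile(program_theta, trainset=trainset)
```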
🚀 Real-World Applications: Experiments That Speak Volumes
The true test of any optimization strategy comes in its empirical results. Researchers applied the BetterTogether approach to three distinct tasks—each representing a different aspect of language understanding:
Multi-Hop Question Answering (HotPotQA):
Here, a modular program first creates a series of search queries to retrieve relevant passages from a corpus of five million Wikipedia abstracts. The final answer module then uses these passages to generate a concise factoid response. In experiments, BetterTogether strategies improved accuracy by as much as 78% in some cases compared to strategies that only optimized either the weights or the prompts.
Arithmetic Reasoning (GSM8K):
For grade-school math problems, the model uses chain-of-thought prompting to derive a series of logical steps leading to a numerical answer. Even small percentage gains (in the range of 2.5–10%) are significant given the precise nature of arithmetic reasoning.
Feature-Based Classification (Iris):
In the classic Iris classification task—where the goal is to sort flowers into distinct species based on sepal and petal measurements—BetterTogether produced gains from 3.5% to as high as 88% compared to baselines. This result is particularly striking because classification tasks, when approached purely with gradient descent, often yield diminishing returns when the dataset is small.
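To give a feel for how compact these task programs can be, here is a hedged sketch of plausible DSPy signatures for the three settings; the field names are illustrative, not the authors’ exact code.

```python
import dspy

# Multi-hop QA (HotPotQA-style): generate a search query, then answer from retrieved context.
generate_query = dspy.ChainOfThought("context, question -> search_query")
answer_question = dspy.ChainOfThought("context, question -> answer")

# Arithmetic reasoning (GSM8K): chain-of-thought straight to a numeric answer.
solve_math = dspy.ChainOfThought("question -> answer")

# Feature-based classification (Iris): map measurements to a species label.
classify_iris = dspy.Predict(
    "petal_length, petal_width, sepal_length, sepal_width -> species"
)
```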
The paper reports these gains across a range of models, including mistral-7b-instruct-v0.2, llama-2-7b-chat, and llama-3-8b-instruct. In almost every evaluated pairing, the best performance was achieved when both optimization methods were applied in sequence rather than individually.
📊 Data in Action: A Snapshot from Table 1
Below is a simplified Markdown table summarizing some of the reported gains. (Note: all numbers are percentages representing accuracy on held-out test sets.)
| Strategy | HotPotQA | GSM8K | Iris |
| --- | --- | --- | --- |
| Vanilla Zero-shot | 17.2 | 40.3 | 26.0 |
| Prompt Optimization Only (Π) | 33.8 | 46.4 | 57.3 |
| Weight Optimization Only (Θ) | 22.9 | 40.7 | 29.3 |
| BetterTogether (Π → Θ → Π) | 37.6 | 46.8 | 52.7 |
While the exact numbers vary across different LM architectures and tasks, the overall message is clear: combining prompt and weight optimization leads to robust improvements that neither method can achieve alone.
🛠 DSPy: Empowering the Self-Improving AI
One of the exciting aspects of this approach is its implementation in the DSPy framework. DSPy—Declarative Self-improving Python—is more than just another library; it represents a paradigm shift in how we think about programming language models. Instead of laboriously tweaking prompt strings in isolation, DSPy encourages developers to compose modular pipelines as structured Python code. By decoupling the “what” (the signature and high-level task) from the “how” (the low-level prompt instructions and LM weights), DSPy enables rapid iteration and fine-grained optimization.
In DSPy, each module’s behavior is defined declaratively. Developers specify a signature (for example, “question → answer”) and then write a module that wraps a Language Model call. DSPy automatically converts these signatures into effective prompts and, crucially, provides a suite of optimizers—including the BootstrapFewShotRS and BootstrapFinetune algorithms—that make it possible to implement BetterTogether strategies.
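Concretely, a signature and the module that wraps it might look like this minimal sketch (the class and field names are illustrative):

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using the retrieved context."""
    context = dspy.InputField(desc="passages that may contain the answer")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short factoid answer")

class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        # DSPy turns the declarative signature into an actual prompt at runtime.
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, context, question):
        return self.generate_answer(context=context, question=question)
```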
For instance, consider a DSPy snippet that configures a Retrieval-Augmented Generation (RAG) system, sketched after the list below. With just a few lines of code, you can:
- Connect to your favorite LM provider.
- Define a module that generates a search query.
- Implement a fine-tuning strategy by bootstrapping your training data from the system’s own outputs.
- Optimize both the prompts and the LM weights in an alternating fashion with minimal human intervention.
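Assuming the same DSPy APIs as above (the LM identifier, retriever URL, metric helper, and training examples below are placeholders), a minimal end-to-end sketch might look like this:

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, BootstrapFinetune

# 1) Connect to an LM provider and a retrieval server (identifiers are placeholders).
dspy.configure(
    lm=dspy.LM("openai/gpt-4o-mini"),
    rm=dspy.ColBERTv2(url="http://your-colbert-server/wiki17_abstracts"),
)

# 2) A two-module RAG program: generate a search query, retrieve, then answer.
class RAG(dspy.Module):
    def __init__(self, k=3):
        super().__init__()
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.retrieve = dspy.Retrieve(k=k)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        query = self.generate_query(question=question).search_query
        context = self.retrieve(query).passages
        return self.answer(context=context, question=question)

# 3) Alternate prompt optimization and weight fine-tuning, as sketched earlier.
trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]  # placeholder data
metric = dspy.evaluate.answer_exact_match   # assumed built-in exact-match metric
prompt_opt = BootstrapFewShotWithRandomSearch(metric=metric)
rag = prompt_opt.compile(RAG(), trainset=trainset)                       # Π: prompts
rag = BootstrapFinetune(metric=metric).compile(rag, trainset=trainset)   # Θ: weights
rag = prompt_opt.compile(rag, trainset=trainset)                         # Π: prompts again
```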
All of this makes DSPy a fertile sandbox for researchers and practitioners who want to push the boundaries of what AI systems can do.
🌟 The Scientific Narrative: Teaching a Machine to Teach Itself
Let’s step back and imagine a laboratory late at night. The hum of computer fans is the only sound as a group of researchers watches in awe as their language model “learns” from itself. Rather than waiting for human experts to write more exhaustive instructions, the system begins proposing its own methods to interpret complex data. It starts by generating multiple candidate prompts, evaluates each one’s effectiveness, and then cascades these improvements into its weight parameters. In a few dramatic iterations—much like an apprentice honing a craft by learning from countless experiments—the AI produces results that are more than the sum of their parts.
This self-improving behavior is reminiscent of a young scientist who not only memorizes textbooks but learns from every experiment, refining hypotheses based on new evidence. When a system can both ask the right questions and learn to answer them better, we edge closer to having truly autonomous AI that can contribute to scientific discovery.
The BetterTogether algorithm shows that, when a large model is put to work, the interplay of prompt optimization and fine-tuning isn’t just additive—it’s multiplicative. Even when training data is sparse or gradient flow is unavailable, the model’s ability to bootstrap its own training process opens new avenues in AI self-reliance. This paradigm is particularly promising in scenarios where getting human-labeled data is expensive or impractical.
🔬 Behind the Scenes: Challenges and Trade-Offs
No revolutionary approach comes without its share of challenges. The study acknowledges several limitations and open questions that invite further exploration:
⚠️ Search Space Overload
The optimization problem tackled here is inherently intractable in its full form. The space of possible prompt strings and weight configurations is staggeringly large, and with no intermediate gradients available for every module, the system must rely on approximate search methods. This means that while gains of up to 78% or even 88% have been observed, the process is computationally expensive—not to mention sensitive to the choice of hyperparameters and the quality of bootstrapped training examples.
🔄 The Role of Bootstrapping
Bootstrapping self-generated training labels is both the secret sauce and the potential pitfall of the approach. The algorithms presented rely on collecting “traces”—records of intermediate outputs when the model’s final output is correct. However, if the vanilla program (i.e., without optimization) produces too few correct outputs, then there isn’t enough data for effective fine-tuning. In those cases, certain settings are marked with “--” in the experimental results because the available data is insufficient. Future work may need to explore more robust bootstrapping strategies or alternative forms of weight adaptation.
🧪 Methodological Boundaries
The research focuses exclusively on weight optimization via LoRA fine-tuning. It remains an open question whether other fine-tuning strategies—for example, full-parameter updates or even hybrid approaches—could obviate the need for explicit prompt optimization. Additionally, the intriguing interaction between prompt optimization and fine-tuning in modular systems is still not fully understood; the researchers themselves note that the reasons behind the observed improvements are not entirely clear. As with any new technique, the risk of unanticipated interactions between components persists, especially compared to the long-studied dynamics of standard gradient descent.
🔮 Looking Forward: The Dawn of Self-Improving AI Pipelines
As we stand on the cusp of a new era in AI, the implications of these findings are as profound as they are promising. By allowing an LM to both refine its instructions and adjust its internal parameters, we’re taking a significant step toward systems that can adapt on their own. In the near future, this could lead to:
More Reliable NLP Systems:
Modular pipelines that self-optimize will be inherently more robust, making them suitable for mission-critical applications in medicine, law, and scientific research.
Reduced Reliance on Hand-Labeling:
With self-generated training data guiding the optimization process, researchers may find that even small training sets are sufficient to produce high-quality outputs across a range of tasks.
The Democratization of AI Research:
Implemented in open-source frameworks like DSPy, these techniques lower the barrier to entry for smaller labs and independent researchers, fueling further innovation across the community.
A New Paradigm for AI Self-Improvement:
The idea that an LM could “teach itself” is not merely a technical trick—it hints at a future where machines might autonomously explore problems, generate hypotheses, and even assist in scientific breakthroughs.
Within the DSPy ecosystem, ongoing work on optimizers such as MIPROv2 and BetterTogether has already begun to change the way researchers approach LM deployment. Moreover, by making these methods accessible as modular building blocks, DSPy creates an environment where improvements can be composed—a sort of AI assembly line where each optimized module contributes to an increasingly capable system.
🌈 A Day in the Life of a Self-Taught LM
Let’s paint a picture. It’s early morning in a bustling research institute. Over coffee, a team of data scientists reviews the latest performance metrics of their deployed RAG system. Overnight, the system has been iteratively optimizing its prompts and weights using the BetterTogether strategy implemented via DSPy. The results?
• More precise search queries are guiding the retrieval step,
• Logical chains of thought have become more coherent, and
• The final answers on multi-hop questions are increasingly accurate.
Instead of chasing every new benchmark by manual intervention, the team now monitors a self-improving system that continually refines its methods—almost like an AI-powered research assistant that never sleeps. This vision of autonomous self-improvement is not science fiction; it is rapidly emerging from the labs and into production environments.
🧬 In Conclusion: The Future Is Self-Optimizing
The research on “Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together” offers a tantalizing glimpse into the future of AI. By combining prompt optimization and weight fine-tuning in an iterative, bootstrapped manner, the BetterTogether algorithm achieves performance gains that neither approach can reach on its own. In experiments spanning multi-hop QA, arithmetic reasoning, and classification—all across three different language models—the benefits are clear: self-teaching AI pipelines can outperform conventional methods by dramatic margins.
While challenges remain in terms of computational cost, data scarcity, and unexplained interactions between optimization components, the work opens exciting new avenues. It challenges the traditional boundaries between prompt engineering and neural fine-tuning and paves the way for systems that learn to learn autonomously.
For practitioners and researchers alike, frameworks like DSPy offer the practical tools needed to get started on this journey. They allow you to build, deploy, and—most importantly—optimize AI systems as if you were writing Python code rather than wrestling with brittle prompt strings. As these techniques mature, we may soon see AI that not only understands language but also understands how best to use its own language to solve ever more difficult problems.
📚 References
- Soylu, D., Potts, C., & Khattab, O. (2024). Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together. Stanford University. Retrieved from http://dspy.ai
- Khattab, O., et al. (2024). DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines. In Proceedings of the Twelfth International Conference on Learning Representations.
- Qi, P., et al. (2021). Answering Open-Domain Questions of Varying Reasoning Steps from Text. In Proceedings of EMNLP.
- Cobbe, K., et al. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint.
- Fisher, R. A. (1988). Iris. UCI Machine Learning Repository.
🚀 Epilogue: Beyond the Horizon
As AI systems become ever more enmeshed into the fabric of daily life, the need for robust, adaptable, and self-improving models is clearer than ever. The BetterTogether strategy described above is more than just a clever trick—it is a step toward systems that learn continuously, adapt to changing data landscapes, and eventually, might help us unlock answers to questions we haven’t even thought to ask yet.
With frameworks like DSPy driving open-source research, the democratization of these ideas is well underway. The journey from static prompt engineering to dynamic, self-taught AI is just beginning, and its trajectory promises to reshape not only technology, but the very nature of inquiry and discovery.
Welcome to the future of AI—a future where language models learn, optimize, and reinvent themselves to meet the challenges of tomorrow.
Whether you’re a researcher fascinated by the interplay of prompts and weights or a developer eager to build next-generation NLP systems, the BetterTogether paradigm and DSPy’s modular approach herald an era of unprecedented flexibility and innovation. As we continue to push the envelope of what AI can achieve, one thing is clear: in the collaboration of fine-tuning and prompt optimization, the whole truly is greater than the sum of its parts.
Happy coding, and here’s to a future of self-taught genius!