One of the most common complaints I hear from users of AI agents is, "Why do I have to tell it the same thing over and over?" They expect their tools to learn from experience, but the reality is that most don't. This is because today's LLM-powered apps are fundamentally static; they simply don't learn from individual interactions.
As building agents becomes better defined and many products ship their first agentic MVPs, what's becoming clear is that the next big thing may be how to get these agents to reliably and securely self-improve. This applies to both knowledge (gaining persistent user-related context) and behavior (learning to solve problems more effectively), which are distinct but highly interrelated. In some online contexts you'll see this referred to as agent "memory," and to me, that's just one implementation for achieving this experience.
If machine learning (ML) was supposed to "learn from experience E with respect to some class of tasks T…", why are our GPT wrappers, built using ML, not actually learning from experience? The answer is: technically they could, but training these next-token-prediction models is a fairly non-trivial problem compared to their task-specific classification and regression counterparts.
In this post, I wanted to go through the modern toolbox for agent self-improvement and why it’s complicated.
Training (as in updating parameters) LLMs is still hard
If you have a knowledge base, you can’t just “train” on it. Traditional Supervised Fine-Tuning (SFT) requires a large dataset of conversational examples (user_question, expected_response) rather than just knowledge material.
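To make that gap concrete, here is a minimal sketch (with invented content and field names): a knowledge base is just documents, while SFT expects conversational pairs, so someone or something has to author those pairs before any training can happen.

```python
import json

# A knowledge base is just documents -- not directly trainable with SFT.
# Shown only for contrast with the examples below.
knowledge_base = [
    "Our refund window is 30 days from the date of purchase.",
    "Enterprise customers can enable SSO via the admin console.",
]

# SFT needs (user_question, expected_response) conversations, which have to be
# authored by hand or synthesized from the documents above.
sft_examples = [
    {"messages": [
        {"role": "user", "content": "How long do I have to request a refund?"},
        {"role": "assistant", "content": "You have 30 days from the date of purchase."},
    ]},
    {"messages": [
        {"role": "user", "content": "Can my company use single sign-on?"},
        {"role": "assistant", "content": "Yes, enterprise customers can enable SSO via the admin console."},
    ]},
]

# Many fine-tuning services accept a JSONL layout along these lines.
with open("sft_dataset.jsonl", "w") as f:
    for example in sft_examples:
        f.write(json.dumps(example) + "\n")
```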
If you are building a tool-use agent or a reasoning model, you often can't train on examples alone; instead you rely on reinforcement learning to steer the model towards a reward. This takes quite a bit more compute, relies on a high-quality reward function (which isn't maximizing user ratings!), and requires either user data or highly realistic simulated environments.
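To illustrate the reward-function point, here is a toy sketch of a verifiable reward for a coding agent: score the checkable outcome (tests passing) rather than the user's rating. The trajectory fields and penalty weight are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """One agent rollout: the tool calls it made and whether the task was verified."""
    tool_calls: list[str] = field(default_factory=list)
    tests_passed: bool = False
    user_rating: int | None = None  # deliberately ignored below

def reward(traj: Trajectory, max_tool_calls: int = 20) -> float:
    # Reward the verifiable outcome, not the thumbs-up: ratings are noisy and
    # easy to game ("sounds confident"), which is exactly what RL will exploit.
    score = 1.0 if traj.tests_passed else 0.0
    # Small penalty for inefficient tool-use loops.
    score -= 0.01 * max(0, len(traj.tool_calls) - max_tool_calls)
    return score

print(reward(Trajectory(tool_calls=["run_tests"] * 3, tests_passed=True)))  # 1.0
```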
While you can attempt to anonymize, a global model trained on one user's data still has the potential to leak information to others. While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making this a non-starter in practice.
Today's models have hundreds of billions of parameters with quite a bit of complexity around how to both train and serve them. While we’ve developed several ways of efficiently fine-tuning, there’s no platform (yet) that makes it trivial to regularly turn feedback into new, servable models.
Training (as in prompting, aka in-context learning) is costly
Every piece of information added to the prompt (past conversations, tool outputs, user feedback) consumes tokens. This makes naive feedback accumulation quadratic in cost and latency, as each interaction potentially generates feedback that is then appended to the prompt of every future interaction.
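A rough back-of-the-envelope (with made-up token counts) shows why: if each interaction appends a fixed chunk of feedback that is replayed in every later prompt, cumulative prompt tokens grow with the square of the number of interactions.

```python
BASE_PROMPT_TOKENS = 2_000      # system prompt + current query (illustrative)
FEEDBACK_TOKENS_PER_TURN = 300  # feedback/memory appended after each interaction

total = 0
for n in range(1, 101):  # 100 interactions
    # Interaction n replays all feedback from the previous n - 1 interactions.
    total += BASE_PROMPT_TOKENS + (n - 1) * FEEDBACK_TOKENS_PER_TURN

print(total)  # 1,685,000 tokens; the (n - 1) term makes the sum grow ~O(n^2)
```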
Applications rely heavily on prompt caching to manage costs. However, the more you personalize the context with user-specific rules and feedback, the lower your cache hit rate becomes.
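One common mitigation, sketched below with hypothetical helpers, is to order the prompt so that everything shared across users forms one long static prefix and all personalization comes last, since prefix-based caches can only reuse the unchanged leading portion.

```python
def build_prompt(system_prompt: str, shared_docs: list[str],
                 user_rules: list[str], query: str) -> str:
    """Illustrative prompt layout: stable content first, personalized content last.

    Prefix caches match on the longest common prefix, so any user-specific text
    placed early in the prompt invalidates the cache for everything after it.
    """
    static_prefix = "\n\n".join([system_prompt, *shared_docs])              # cacheable across users
    personalized_suffix = "\n".join([*user_rules, f"User query: {query}"])  # varies per user
    return static_prefix + "\n\n" + personalized_suffix
```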
State makes everything more complicated
Once an agent starts learning, its past interactions can influence future behavior. Did the agent give a bad response because of a recent change in the system prompt, a new feature, or a piece of user feedback from three weeks ago? The "blast radius" of a single piece of learned information is hard to predict and control.
What happens when a user's preferences change, or when information becomes outdated? A system that can't effectively forget is doomed to make mistakes based on old, irrelevant data. Imagine telling your agent to never answer questions on a certain topic, but then a product update makes that topic relevant again. The agent's "memory" might prevent it from adapting.
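One simple pattern for letting the system forget, assuming learned rules are stored as discrete entries, is to attach a timestamp, a time-to-live, and an explicit revocation flag to each one so stale or invalidated rules stop being injected:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class MemoryEntry:
    text: str              # e.g. "Never answer questions about topic X"
    created_at: datetime
    ttl: timedelta         # how long before the rule must be re-confirmed
    revoked: bool = False  # set when a product update invalidates the rule

    def is_active(self, now: datetime) -> bool:
        return not self.revoked and (now - self.created_at) < self.ttl

now = datetime.now(timezone.utc)
rule = MemoryEntry("Never answer questions about topic X",
                   created_at=now - timedelta(days=90), ttl=timedelta(days=60))
print(rule.is_active(now))  # False: stale rules are no longer injected into prompts
```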
For any of this to work, users have to trust you with their data and their feedback. This brings us back to the data leakage problem. There's an inherent tension between creating a globally intelligent system that learns from all users and a personalized one that respects individual privacy.
The core determinant of how you do self-improvement is what data you can get from the user, ranging from nothing at all to detailed corrections and explanations. The richer the feedback, the fewer samples are needed to make a meaningful improvement.
It’s also a key product decision to determine the effect radius for different forms of feedback. I’ll call this the “preference group”: the group of users (or interactions) in which a given piece of feedback causes a change in agent behavior. These groups could follow explicit boundaries (user, team, or other legal organization) or derived ones (geographic region, working file paths, usage persona, etc.).
Grouping too small (e.g. user level) increases cold-start friction and means several users will experience the same preventable mistakes, some never seeing any improvement until they provide sufficient feedback. For parameter-based training, it can also be unmanageable to maintain highly granular copies of the model weights (even with PEFT adapters).
Grouping too large (e.g. globally) leads to riskier agent updates and unusual behavior. One user with “weird” feedback could directly degrade the efficacy of the agent for all other users.
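As a sketch of the idea (all field names here are hypothetical), feedback can be keyed to a resolved preference group rather than stored per user or globally, with a fallback hierarchy that trades cold start against blast radius:

```python
from collections import defaultdict

# Feedback stored per preference group, not per user and not globally.
feedback_by_group: dict[str, list[str]] = defaultdict(list)

def preference_group(user: dict) -> str:
    """Hypothetical resolution: prefer an explicit team boundary, fall back to a
    derived persona, and only then to a single global bucket."""
    if user.get("team_id"):
        return f"team:{user['team_id']}"
    if user.get("persona"):
        return f"persona:{user['persona']}"
    return "global"

def record_feedback(user: dict, note: str) -> None:
    feedback_by_group[preference_group(user)].append(note)

record_feedback({"team_id": "acme-platform"}, "Prefer terse answers with code blocks.")
print(dict(feedback_by_group))
```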
Even when you have no explicit signal from the user on how your agent is performing, you can still improve the system. While users get a potentially more focused experience, the lack of signal means you’ll need to derive approximate feedback from high-volume, low-signal proxy data. There’s high potential to make false assumptions, but this can be compensated for by aggregating more data (i.e. over time or preference group size) per model update.
What you could do:
Use LLMs to determine preferences or explanations — Take (question, answer) pairs and use LLMs (or even simpler heuristics) to determine whether this was a preferred answer or what the preferred answer would have been. Effectively, you run your own LLM-as-judge setup to determine what the user might’ve told you (a minimal sketch follows this list). With this, proceed to cases 1, 2, or 3.
Use engagement metrics to determine preferences — Take traditional analytics on engagement with your agent to approximate the quality of responses. Did the user come back? Did they buy the thing you showed them? How much time did they spend? Turn these analytics into preferences over your agent’s responses. With this, proceed to case 1.
Use agent tool failures as implicit signals — You can log every tool call and its outcome (success, failure, or the content of the response). Recurring tool failures, inefficient tool-use loops, or patterns where the agent calls a tool with nonsensical arguments are all strong implicit signals that the agent's reasoning is flawed for a particular type of task. These failed "trajectories" can be automatically flagged and used as negative examples for Case 1.
Use simulation to generate feedback — Use an LLM to act as a "user simulator", generating a diverse set of realistic queries and tasks. Then, have your agent attempt to solve these tasks in a synthetic gym environment. Since you define the environment and task, you can often automatically verify if the agent succeeded (e.g., "Did it pass the tests?") and use this outcome as a reward signal. This synthetic data can then be used to create preference pairs or corrections, allowing you to train your agent using the methods from cases 1, 2, or 3.
Keep the chat history — While there are plenty of reasons this might make things worse, another option when no clear preference or feedback is provided is to just include the previous chats (or chat summaries) in future prompts within the same preference group. You do this with the hope that, given the collective context of previous chats, the agent can steer towards better responses.
Rely on third-party grounding — You could also rely on a 3rd-party API to give the agent hints or updated instructions for how to solve a particular task. A simple example would be an agent that can “google” for how to solve the problem; as Google indexes online posts, your agent might naturally begin to improve. For any given agent you are building, there might be some pre-existing knowledge base you can lean on for “self-improvement”.
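Here is a minimal sketch of the LLM-as-judge option from the first bullet. It uses the OpenAI chat completions client; the model name, judge prompt, and two-label scheme are all assumptions, and in practice you would run this in batch over logged interactions rather than one call at a time.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are reviewing an AI assistant's answer.
Question: {question}
Answer: {answer}

Reply with exactly one word: GOOD if the answer likely satisfied the user,
BAD if it likely did not."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> str:
    """Approximate the preference label the user never explicitly gave us."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# label = judge("How do I reset my password?", "Please contact support.")
```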
Simple 👍/👎 ratings are one of the most common feedback mechanisms. They're low-friction for the user, provide a clear signal that can be easily turned into a metric, and are a step up from inferring feedback from proxy data. However, the signal itself can be noisy. Users might downvote a correct answer because it was unhelpful for their specific need, or upvote an incorrect one that just sounds confident.
What you could do:
Fine-tune with preferences — You can train the model by constructing (chosen, rejected) pairs from the data you collect: a response that receives a 👍 becomes a "chosen" example, one that gets a 👎 becomes a "rejected" one, and these are paired for training (see the sketch after this list). From there, classic RLHF can use these pairs to train a reward model that guides the main agent. A more direct alternative is DPO, which skips the reward model and uses the constructed pairs to directly fine-tune the agent's policy.
Use LLMs to derive explanations — Aggregate the 👍/👎 data across a preference group and use another LLM to analyze the patterns and generate a hypothesis for why certain responses were preferred. This process attempts to turn many low-quality signals into a single, higher-quality explanation, which you can then use to update documentation or create few-shot examples as described in Case 2.
Use in-context learning with examples — Dynamically pull examples of highly-rated and poorly-rated responses and place them into the context window for future queries within the same preference group. This lets the agent "learn" at inference time to steer its answers towards a preferred style or content format.
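As a sketch of the pair construction from the first bullet: group logged responses by prompt, then pair upvoted and downvoted answers to the same prompt. The (prompt, chosen, rejected) layout mirrors what most preference-tuning libraries expect, though exact field names vary and the logs here are invented.

```python
from collections import defaultdict

# Logged interactions: (prompt, response, rating), where rating is "up" or "down".
logs = [
    ("How do I rotate my API key?", "Go to Settings > API Keys and click Rotate.", "up"),
    ("How do I rotate my API key?", "You can't rotate keys.", "down"),
]

by_prompt: dict[str, dict[str, list[str]]] = defaultdict(lambda: {"up": [], "down": []})
for prompt, response, rating in logs:
    by_prompt[prompt][rating].append(response)

# Build (prompt, chosen, rejected) preference pairs for reward modeling or DPO.
preference_pairs = [
    {"prompt": prompt, "chosen": chosen, "rejected": rejected}
    for prompt, buckets in by_prompt.items()
    for chosen in buckets["up"]
    for rejected in buckets["down"]
]
print(preference_pairs)
```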
Here, instead of a simple preference, the user provides a natural language explanation of what went wrong (e.g., "That's not right, you should have considered the legacy API," or "Don't use that library, it's deprecated."). This feedback requires more effort from the user, but the signal quality is extremely high; a single good explanation can be more valuable than hundreds of thumbs-ups. Users are often willing to provide this level of detail if they believe the agent will actually learn from it and save them time in the future. This feedback can be collected through an explicit UI, in the flow of conversation, or even inferred from subsequent user actions.
What you could do:
Synthesize a corrected answer — One use of an explanation is to try to generate the corrected answer. You can use another LLM as a "refiner" that takes the (original_response, user_explanation) and outputs a corrected_response (sketched after this list). If this synthesis is successful, you've effectively created a high-quality (original_response, corrected_response) pair and can move to Case 3.
Use in-context learning with explanations — Store the (response, user_explanation) pairs. When a new, similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt. This gives the agent a just-in-time example of a pitfall to avoid and the reasoning behind it, steering it away from repeating the same mistake and towards doubling down on what worked.
Distill feedback into reusable knowledge — Aggregate explanations to find recurring issues—like an agent's travel suggestions being too generic. An LLM can then synthesize these complaints into a single, concise rule. This new rule can either be added to the system prompt to fix the behavior for a user group, or it can be inserted into a knowledge base. For example, a synthesized rule like, "When planning itineraries, always include a mix of popular sites and unique local experiences," can be stored and retrieved for any future travel-related queries, ensuring more personalized and higher-quality suggestions.
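A minimal sketch of the "refiner" from the first bullet, again with an assumed model name and prompt; the synthesized correction would still need validation before being treated as ground truth.

```python
from openai import OpenAI

client = OpenAI()

REFINER_PROMPT = """An assistant produced this response:
---
{original_response}
---
The user gave this feedback:
---
{user_explanation}
---
Rewrite the response so it fully addresses the feedback. Return only the corrected response."""

def refine(original_response: str, user_explanation: str, model: str = "gpt-4o-mini") -> str:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REFINER_PROMPT.format(
            original_response=original_response, user_explanation=user_explanation)}],
    )
    # The resulting (original_response, corrected_response) pair feeds Case 3.
    return completion.choices[0].message.content
```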
Here, the user doesn't just explain what's wrong; they provide the correct answer by directly editing the agent's output. The "diff" between the agent's suggestion and the user's final version creates a high-quality training example. Depending on the product's design, this can often be a low-friction way to gather feedback, as the user was going to make the correction anyway as part of their natural workflow, whether they're fixing a block of generated code or rewriting a paragraph in a document.
What you could do:
Fine-tune with edit pairs — Use the (query, user_edited_response) pair for Supervised Fine-Tuning (SFT) to teach the model the correct behavior. Alternatively, you can use the (original_response, user_edited_response) pair for preference tuning methods like DPO, treating the user's edit as the "chosen" response and the agent's initial attempt as the "rejected" one.
Use in-context learning with corrections — Store the (question, user_edited_diff) pairs. When a similar query comes in, you can retrieve the most relevant pairs and inject them into the prompt as a concrete example of what to do and what to avoid, steering the agent toward the correct format or content at inference time.
Derive explanations — You can also work backward from the edit to enrich your prompts and/or knowledge bases. Use an LLM to analyze the "diff" between the original and edited text to generate a natural language explanation for the change, in some sense capturing the user's intent. This synthesized explanation can then be used in all the ways described in Case 2.
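A small sketch of working backward from an edit: compute the diff with the standard library, then hand it to a judge or refiner LLM (as in the earlier sketches) to summarize the user's intent. The example strings are invented.

```python
import difflib

original = "Use the requests library to call the legacy v1 endpoint."
edited = "Use httpx to call the v2 endpoint; v1 is deprecated."

# The "diff" between the agent's suggestion and the user's final version.
diff = "\n".join(difflib.unified_diff(
    original.splitlines(), edited.splitlines(),
    fromfile="agent_response", tofile="user_edit", lineterm=""))
print(diff)

# An LLM pass over this diff could yield an explanation such as:
# "Prefer httpx and the v2 endpoint; the v1 endpoint is deprecated."
```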
How do you handle observability and debuggability? — When an agent's "memory" causes unexpected behavior, debugging becomes a challenge. A key design choice is whether to provide users with an observable "memory" panel to view, edit, or reset learned information. This creates a trade-off between debuggability and the risk of overwhelming or confusing users with their own data profile.
How do you pick the "preference group"? — Choosing the scope for feedback involves a trade-off between cold-starts and risk. User-level learning is slow to scale, while global learning can be degraded by outlier feedback. A common solution is grouping users by explicit boundaries (like a company) or implicit ones (like a usage persona). The design of these groups also has business implications; a group could be defined to span across both free and paid tiers, allowing feedback from a large base of unpaid users to directly improve the product for paying customers.
How do you decide which feedback case to use? — The progression from simple preferences (Case 1) to detailed explanations or edits (Cases 2 & 3) depends heavily on user trust. Users will only provide richer feedback when they believe the system is actually listening. This trust can be accelerated by making the agent's reasoning process transparent, which empowers users to self-debug and provide more targeted suggestions.
How much should be learned via fine-tuning vs. in-context learning? — A core architectural choice is whether to learn via parameter changes (fine-tuning) or prompt changes (in-context learning/RAG). ICL is often faster and cheaper, especially as foundational models improve rapidly, making fine-tuned models quickly obsolete. While fine-tuning on synthetic data is an option for enterprises with privacy concerns, generating high-quality synthetic data is a significant challenge, often making prompt-based learning the more practical path.