Prompt Smells
The term "code smell" was introduced by Kent Beck in the late 1990s. It was meant to capture something deeper than just bugs or technical inaccuracies. Code smells help us identify poor design fundamentals that lead to low-quality code. Engineering teams often deal with code smells as a form of technical debt, and it’s a problem as old as software engineering itself.
In this article, I'll introduce a new term: prompt smells. Just like code, your prompts can smell too. The prompts in an LLM-powered app are a crucial part of your software, every bit as important as the code. When a prompt smells, it usually means the prompt is poorly designed or written, its output is harder to explain, and sometimes it even contains instructions that shouldn't be there.
The intention here isn’t to list every prompt smell that exists, but to make sure you understand what they are and help you feel confident in removing them. Let’s get started.
Prompt Smells vs Code Smells: Parallels
Before we dive into how to fix prompt smells, let’s look at some of the parallels between prompt smells and code smells. What follows is just a subset of all possible prompt smells. There are also some new “smells” that are unique to the world of prompt engineering.
Binary Operator in Name
In code, a function with a binary operator in its name, such as SaveAndExit, does more than one thing. It both saves and exits, which breaks the Single Responsibility Principle (SRP).
In the prompt engineering world, a prompt that breaks this principle might look something like this:
```
You are given a text in English Language. You should summarise the text and translate it to Marathi. Only output the translated text.
```

Notice how we're asking the LLM to do two things in a single prompt. This could have been split into two prompts, which would make the output much more explainable and easier to debug. If the final output is wrong, you won't know whether the issue was with the summarisation or the translation.
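To make the split concrete, here's a minimal sketch. The `call_llm` helper is a hypothetical stand-in for whichever LLM client you actually use:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client call."""
    raise NotImplementedError

def summarise(text: str) -> str:
    prompt = (
        "You are given a text in English Language. "
        "Summarise the text. Only output the summary.\n\n" + text
    )
    return call_llm(prompt)

def translate_to_marathi(text: str) -> str:
    prompt = (
        "You are given a text in English Language. "
        "Translate the text to Marathi. Only output the translated text.\n\n" + text
    )
    return call_llm(prompt)

source_text = "<your English text here>"
summary = summarise(source_text)             # step 1: observable on its own
translation = translate_to_marathi(summary)  # step 2: debuggable in isolation
```

Because each step is a separate call, you can log and inspect the intermediate summary, and you immediately know which step to blame when the output is wrong.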
While we’re drawing parallels, one way to partially fix this in prompt engineering is to change the prompt to something like this:
```
You are given a text in English Language. You should summarise the text and translate it to Marathi. Output in this format:
Summarised Text: <Summary in English>
Translated Text: <Summary in English Translated to Marathi>
```

Notice that although the prompt still does two things, it's arguably more debuggable in this format. You can actually figure out what went wrong, since you haven't removed the summary from the output. Having said that, it uses more output tokens and is more costly to run.
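If you go this route, a small parser can recover both parts for logging. A sketch, assuming the model follows the format above; the helper name is mine:

```python
import re

def parse_output(raw: str) -> tuple[str, str]:
    """Split the model's combined output into (summary, translation)."""
    match = re.search(
        r"Summarised Text:\s*(.*?)\s*Translated Text:\s*(.*)",
        raw,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("Output did not follow the expected format")
    return match.group(1), match.group(2)
```

Log the summary alongside the translation: when the translation is wrong, the summary tells you whether the problem started earlier.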
If you’re developing an agent, you might want to treat these as separate tasks or handle them in different turns of the conversation.
Duplicated Code/Prompt
You've written seven lines of code to check if the database connection is active and, if not, to make the connection. Instead of turning this into a function, you end up copying the same code in five different places. You know how the rest goes: if you ever need to change the logic, say to add a retry, you have to update all five places.
The same problem shows up with prompts. Sometimes your prompt is made up of various components, and some of them can be reused. It’s important to identify these common parts and make sure you reuse them. For example, if your app has a feature that lets users add more context, like internal documentation or terms, you should reuse the way you integrate that context into the LLM’s working memory instead of writing it from scratch each time. As with traditional code, it’s a trade-off: you have to make an educated guess about which prompt fragments are worth abstracting, and be sure the reuse benefit outweighs the added overhead.
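Here's a sketch of what a reusable prompt component might look like. The helper name and the delimiter format are illustrative, not a standard:

```python
def with_user_context(base_prompt: str, documents: list[str]) -> str:
    """Attach user-supplied context (internal docs, glossaries, terms)
    in one consistent format, so every feature integrates it the same way."""
    if not documents:
        return base_prompt
    context = "\n\n".join(f"<document>\n{doc}\n</document>" for doc in documents)
    return f"{base_prompt}\n\nUse the following context:\n{context}"

docs = ["<internal documentation>", "<user-defined terms>"]

# Two different features, one shared way of integrating context:
summary_prompt = with_user_context("Summarise the user's report.", docs)
qa_prompt = with_user_context("Answer the user's question.", docs)
```

Now a change to how context is delimited or ordered happens in one place, exactly like extracting the duplicated connection check into a function.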
Let’s look at one more example before we get to the important point:
Doing Deterministic Operations in Prompt
This is a particularly interesting prompt smell that comes up in prompt engineering. Sometimes, people start adding deterministic operations and expect the LLM to handle them, or expect agents to do them without using any tools.
Here’s an example:
```
Give me the list of all people from the list below whose names are associated with Maratha history. Also sort the output by year of birth.
1 Tanaji Malusare 1626
2 Bahirji Naik 1625
3 Nana Saheb 1824
.... <another 200 names>
```

Assume the names are dynamically populated on each call and you want the output sorted by year of birth. Unless you're experimenting, or the sorting requirements change frequently, asking the LLM to sort by date of birth is a prompt smell. You're not only asking the LLM to do multiple things in a single prompt, but also adding a layer of non-determinism to your code. Since you already have the data, the sorting should be done deterministically after you get the results.
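A sketch of pushing the sort back into code, reusing the hypothetical `call_llm` helper from earlier. The model only filters; Python sorts:

```python
people = [
    ("Tanaji Malusare", 1626),
    ("Bahirji Naik", 1625),
    ("Nana Saheb", 1824),
    # ... another 200 names
]

names_block = "\n".join(name for name, _ in people)
prompt = (
    "From the list below, output only the names associated with "
    "Maratha history, one per line.\n\n" + names_block
)
selected = set(call_llm(prompt).splitlines())

# Sorting by year of birth is deterministic work: do it in code, not the prompt.
maratha_people = sorted(
    (p for p in people if p[0] in selected),
    key=lambda person: person[1],  # year of birth
)
```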
I'll again highlight that this is a nuanced case: sorting with an LLM can't be universally labelled a prompt smell. Treat deterministic work, like alphabetising names or sorting purely by birth year, as code, not prompt. But when the sort key is inherently semantic or subjective (e.g., "rank these summaries by optimism"), the LLM's judgement is the feature, not a smell. In short, push algorithmic tasks back into code and delegate meaning-making tasks to the model.
💡 There are many other prompt smells that can come up. This article is just meant to make you aware that they exist; it is not a comprehensive list. I have also started documenting prompt smells separately and plan to release a full list in the future.
How to Reduce Prompt Smells
Now comes the interesting question. You know prompt smells exist, and you know that over time they add up: prompts get harder to refactor or change, coupling grows, and sometimes they even cause incidents in your app. So how do we stop this? How do we reduce prompt smells?
Since we’re drawing parallels to code smells, the way to fix them in code has two parts. First, you identify the function, class, or file that smells, and then you refactor it.
Is that all? Yes, but there's a catch. You can't just go ahead and refactor code unless there are enough unit or functional tests covering it. After all, this is a production system, and smelly code is already hard to understand, so you're likely to miss some edge cases. Will developers want to take responsibility if the code fails in production?
So, the real solution for code smells is broader. In a complex system without enough tests, code will keep smelling, and as time goes by, it only gets worse. I’ve worked with companies where I could trace the failure of the entire company back to smelly code. It stops your ability to make changes, slows down progress, and kills innovation. Timelines stretch, things break, and customers become unhappy.
Write ~~Tests~~ Evals
Well, you might notice a strikethrough above, but the main point is that you need to write tests or evals for your prompts or workflows. While working on AI apps, I’ve noticed that people—including myself—are often hesitant to touch a prompt or change any part of it. Without evals, you have no idea what effect changing the prompt will have on the larger application.
That's not all. A customer might ask, "Hey, does your product work with Gemini?" and you might give your best guess and say something like, "It should, it works with GPT-4.1." That's like saying your app works with Postgres just because it works with MySQL. You can't make that claim until you've actually run those damned tests.
Things get even trickier with prompt smells. With code smells, the code is often deterministic, so you can be reasonably confident that your refactored code will work. With prompts, though, there’s a big fat LLM adding a layer of non-determinism. Assuming things will be fine just because you broke up a complex prompt into smaller parts is not enough.
To make it more complex, outputs from prompts are often subjective. The result isn't always a simple boolean or something from a finite set, so your evals need to handle that subjectivity too. In a complex system, especially one involving an LLM, butterfly effects are not uncommon. If you don't have evals, your prompts will always smell.
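Even a small, objective check beats no check. Here's a minimal eval sketch using pytest, reusing the hypothetical `summarise` helper from the earlier sketch; the pass-rate threshold is illustrative:

```python
import pytest

CASES = [
    # (input text, an entity the summary must keep)
    ("Tanaji Malusare led the assault on Sinhagad fort in 1670 ...", "Tanaji"),
]

@pytest.mark.parametrize("text,must_mention", CASES)
def test_summary_keeps_key_entity(text, must_mention):
    # The model is non-deterministic: sample it several times and
    # assert a pass rate instead of a single pass.
    passes = sum(must_mention in summarise(text) for _ in range(5))
    assert passes >= 4, f"summary dropped '{must_mention}' in too many samples"
```

For genuinely subjective outputs, you'd swap the substring check for a rubric or an LLM-as-judge, but the pattern of sampling plus a threshold stays the same.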
If you are new to evals, I recommend you read this article.
Takeaway
Let’s face it, prompt engineering is still new territory for most teams, but some problems feel very familiar. Just like messy code can drag down your entire product, messy prompts will do the same for any LLM-powered app. The good news is that you already know what to do about it. You don’t need a totally new mindset, just a willingness to pay attention to the quality of your prompts the same way you already care about code.
Go through your prompts regularly. Ask if each one is clear, single-purpose, and easy to debug. Watch out for duplication, for bloated or unclear instructions, and for anything that feels like you're asking the model to do a job your own code could handle better. And above all, set up evals: without them, you're just guessing every time you change something.
In the long run, the teams that take prompt smells seriously are the ones whose apps keep working, scale smoothly, and deliver consistent results to users. It’s not always exciting work, but it’s the difference between an app that survives and one that fizzles out when things get complicated. So don’t ignore those smells. Deal with them early, keep your prompts clean, and watch everything else get a little bit easier.