I’ve been building coding tools to help LLMs write q/kdb+ code, and one of the quants I spoke to asked: “Could we get it to write really terse code?”
The q/kdb+ community has a long tradition of preferring terse code, and they feel quite strongly that this is the right way to write code. I used to be in that camp myself before LLM coding assistants came along, as I use small screens and like to minimize keypresses. Nowadays, however, when it’s LLMs writing my code, all I care about is keeping my LLMs performing well. Because of that, I think the aesthetic preference for terseness should give way to the preference for LLM accuracy, which may mean more verbose code. So to answer that quant’s question: Yes, we could get it to write really terse code, but I really think we shouldn’t.
Let’s start with an example: (2#x)#1,x#0 is code from the official q phrasebook for constructing an x-by-x identity matrix. It’s roughly equivalent to Python np.tile(np.concatenate([[1],x*[0]]),x)[:x*x].reshape(*2*[x]). If this seems confusing, worry not: I asked Claude to explain it below! (all transcripts have been abridged, but are available in full at the end of the article)
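For the skeptical reader, the NumPy version can be checked directly (a quick sanity check of the expression above, not part of the phrasebook):

```python
import numpy as np

x = 3
# 1 followed by x zeros, tiled x times, truncated to x*x elements,
# then reshaped into an x-by-x matrix.
result = np.tile(np.concatenate([[1], x * [0]]), x)[:x * x].reshape(*2 * [x])
print(result)  # the 3-by-3 identity matrix
```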
What does np.tile(np.concatenate([[1],x*[0]]),x)[:x*x].reshape(*2*[x]) do in Python ?
[…]
np.concatenate([[1],x*[0]]) — This concatenates these lists into a single array: [1, 0, 0, 0]
np.tile(…, x) — This tiles/repeats the array x times. For x=3: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
[:x*x] — This slices to take only the first x*x elements. For x=3, x*x=9: [1, 0, 0, 0, 1, 0, 0, 0, 1]
*2*[x] — This is *[x, x], which unpacks to two arguments each equal to x
.reshape(*2*[x]) — This reshapes the 1D array into a 2D array of shape (x, x). For x=3:
[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
I see a unique approach to generating an identity matrix through a creative NumPy manipulation. While unconventional, the code cleverly constructs a diagonal matrix with 1s along the main diagonal and 0s elsewhere. It’s a roundabout method compared to NumPy’s built-in functions, demonstrating an inventive way of matrix creation.
Let’s ask the same about the q code:
What does (2#x)#1,x#0 do in q ?
[…]
(2#x)#1,x#0 — reshapes 1,x#0 into a matrix with dimensions from 2#x
Actually, let me reconsider. In q:
[…]
So if x is something like 3 4 5, then[…]
(2#x)#1,x#0 reshapes this into a 3x4 matrix (but we only have 6 elements, so it will cycle)
Wait, I need to think about this more carefully. If x is 3 4 5:
[…]
I see a pattern emerging for generating an identity matrix in q. By manipulating list creation and reshaping, we can construct a matrix with 1s on the diagonal and 0s elsewhere, using a compact technique that leverages q’s list manipulation capabilities.
Notice that here Claude twice retraces its steps (“Actually, let me reconsider.” and “Wait, I need to think about this more carefully.”) before going over all of the steps each time.
Finally, for my last example, I asked Claude to explain code that Claude itself had produced in a separate session, where it was prompted to write q code for constructing an identity matrix using only the # operator — it came up with (x;x) # (x*x) # 1,x#0. Prompting a fresh session to explain this:
What does (x;x) # (x*x) # 1,x#0 do in q?
[…]
So this creates an identity matrix of size x by x.
Let me trace through an example with x=3:
[…]
Yes, this creates an identity matrix!
This time there was no “let me reconsider” or “I need to think about this more carefully”; it did add an example at the end, but that’s similar to what happened in the Python case.
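Claude's two-step variant maps naturally onto NumPy as well. Here is a rough translation of my own (an assumption about the correspondence, using np.resize to mimic q's cyclic take):

```python
import numpy as np

x = 3
base = np.concatenate([[1], x * [0]])  # 1 followed by x zeros, like 1,x#0
flat = np.resize(base, x * x)          # cycle to length x*x, like (x*x)#
matrix = flat.reshape(x, x)            # shape into x-by-x, like (x;x)#
print(matrix)                          # the 3-by-3 identity matrix
```

Note that np.resize (the function, not the method) repeats the source array cyclically, which matches q's overtaking behavior.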
A Theory for Why This is Happening
Consider prompting an LLM to prove a theorem given some axioms, a task at which today’s LLMs are quite good. The output would be a string containing the proof, whose correctness a deterministic, non-LLM proof-checker could then decide.
Now let’s think about it from the perspective of Shannon’s Information Theory: before the LLM answers, there is something we don’t know: the proof; afterwards, our ignorance is dispelled¹. We can therefore say we have gained some knowledge, which can be quantified in bits of surprisal following Shannon’s theory. Since surprisal in Shannon’s theory is not objective, but computed relative to some probability distribution, I’d suggest using the one distribution we have at hand: the distribution over future tokens provided by the LLM itself².
I find it fascinating that Shannon studied the mathematical properties of information transmission, yet never applied his theory to the question of what makes some information useful rather than meaningless! This makes sense: in the pre-LLM era, information was typically produced by humans, whose probability distributions were inaccessible; we’ve arrived at a unique point in history where objectively useful strings are now produced by agents that expose their output distributions.
Now, let’s think about taking a proof and trying to rewrite it in fewer steps. The shorter proof would either contain the same information, but somehow compressed (lossless compression), or it might leave out some more obvious steps (lossy compression). Following Shannon’s theory, we would quantify the amount of information in the output proof as I(proof) = -sum log P(token | previous tokens), and we would have I(short proof) <= I(long proof), with < for lossy compression and = for lossless compression. In practice I’d expect we’d have both some lossy and some lossless compression, but let’s focus on the lossless part, which I think is the more interesting case.
When I(short proof) = I(long proof), average per-token surprisal must be lower for the long proof than for the short proof. But since surprisal for a single token is simply -log P, that means that, on average, the shorter proof is made up of less probable tokens. Average surprisal per token, when exponentiated (a monotonic transform), is also known as perplexity, a common metric applied to LLMs and their inputs and outputs. This means that, given equal information content, we would expect the shorter proof to have higher perplexity than the longer proof.
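To make the arithmetic concrete, here is a minimal sketch with made-up token probabilities (not real LLM outputs): the same total information spread over fewer tokens yields higher perplexity.

```python
import math

def total_information(token_probs):
    # Total information content: sum of per-token surprisals -log P(token | context).
    return sum(-math.log(p) for p in token_probs)

def perplexity(token_probs):
    # Perplexity: exponentiated average surprisal per token.
    return math.exp(total_information(token_probs) / len(token_probs))

# Hypothetical per-token probabilities for a "long" and a "short" rendition
# of the same content: 8 tokens at P=0.5 vs 4 tokens at P=0.25.
long_version = [0.5] * 8    # total information: 8 * log 2
short_version = [0.25] * 4  # total information: 4 * log 4 = 8 * log 2 (equal)

print(perplexity(long_version))   # 2.0
print(perplexity(short_version))  # 4.0
```

Both sequences carry exactly the same number of bits, yet the shorter one has double the perplexity.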
When LLMs output tokens, they first compute the output distribution over all tokens (at every step), and then apply some pruning strategy (such as top-P sampling) that discards all but the most likely next tokens. If a particular short proof required certain less probable tokens that wouldn’t make it into the top-P when the LLM is prompted to prove a certain theorem, the LLM would simply be unable to provide us with that particular proof. Instead, it would be forced to output a more verbose one.³
I’ve been talking about proving theorems so far, but I’d argue the case of writing code to fulfill tasks is quite similar: we are still prompting the LLM to output a string that we can check for correctness. So let’s imagine we’ve prompted the LLM to write code. Whether we ask it to write terse code from the start, or ask it to rewrite verbose code more tersely, the same information-theoretic principles apply. The LLM can produce terse code when explicitly asked — but I think we shouldn’t ask it to, because terse code has higher perplexity and is therefore harder for the LLM to work with reliably.
When making the code less verbose, similarly to shortening a proof, we could be dealing with either lossy compression (removing comments, type hints, descriptive variable names, etc.) or lossless compression. But what might lossless compression mean? I’d argue lossless compression in code corresponds to expressing a combination of multiple, simpler operations, as a single, more complex operation; e.g. incrementing a variable with i += 1 vs i = i + 1. My theory above would suggest that the latter would have lower perplexity.
To test this claim, I measured the actual perplexity using Qwen 2.5 Coder 7B Instruct. Prompted with “Respond only with code: How do you increment i by 1 in Python?”, I compared the two valid outputs: i += 1 has a perplexity of approximately 38.68, while i = i + 1 has a perplexity of approximately 10.88. The compound operator version packs the same semantic information into fewer tokens, resulting in nearly 4× higher perplexity.
This confirms the theoretical prediction: when the same information is compressed into a terser representation, the average surprisal per token increases. And empirical research has shown that higher input perplexity correlates with lower task performance in LLMs. This suggests that prompting an LLM with terser code than it might have otherwise produced will cause the LLM to systematically underperform at a task like debugging, explaining, or extending it, compared to using the more verbose version of the code that the LLM naturally produces.
This information-theoretic view reveals a broader principle: surprisal per token represents the cognitive throughput the LLM can handle during generation. For straightforward tasks, LLMs naturally produce an appropriate amount of verbosity that keeps per-token surprisal manageable. But for very complex tasks, LLMs can underestimate how many tokens they need — they attempt to pack too much information into too few tokens, resulting in high per-token surprisal and increased errors. This framework explains why prompting strategies like “think step by step” or “be verbose” improve performance on complex tasks: they force the LLM to spread the cognitive load across more tokens, keeping per-token surprisal within bounds.
As for the practical takeaway for q/kdb+ developers: in the LLM era, let your coding assistant write verbose code. It’s not your fingers doing the typing anymore, and the performance gains from working with lower-perplexity code might outweigh any aesthetic preference for terseness.
Footnotes
[1] There is an argument to be made, popularized by Wittgenstein, that deducing a theorem from axioms shouldn’t provide us with any additional information, as all information contained in a theorem is already contained in all of the necessary starting axioms, and therefore all theorems are tautologies which add nothing to our knowledge (this is also one way to view the situation from the Bayesian interpretation of Information Theory). Here I’d counter that what we seek is not whether the theorem is true or false, but a string representing a proof that is validated by a proof-checker, and such a string itself is not contained in the axioms.
[2] We’d ideally want to measure surprisal relative to our own internal distribution (taking a Bayesian view of our beliefs and ignorance), however, given that our LLM is actually quite good at proving theorems, I’m happy to assume the distributions are reasonably similar.
[3] For completeness, I should mention that LLMs can (and do) backtrack if they realize their initial proof approach is heading toward a dead end. However, since the output so far impacts the distribution of future tokens, any backtracking will affect the distribution of tokens in the “second attempt”. It also means that the analysis described above doesn’t really fit this case, only cases where the LLM output the proof without backtracking.
Full Transcripts
Explaining Python code:
What does np.tile(np.concatenate([[1],x*[0]]),x)[:x*x].reshape(*2*[x]) do in Python ?
[Thought process]
This is a question about Python code using NumPy. Let me break down what this expression does step by step.
The expression is: np.tile(np.concatenate([[1],x*[0]]),x)[:x*x].reshape(*2*[x])
Let me trace through this with an example value, say x=3:
x*[0] — This creates a list with x zeros. For x=3: [0, 0, 0]
[[1], x*[0]] — This is a list containing [1] and [0, 0, 0]: [[1], [0, 0, 0]]
np.concatenate([[1],x*[0]]) — This concatenates these lists into a single array: [1, 0, 0, 0]
np.tile(…, x) — This tiles/repeats the array x times. For x=3: [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
[:x*x] — This slices to take only the first x*x elements. For x=3, x*x=9: [1, 0, 0, 0, 1, 0, 0, 0, 1]
*2*[x] — This is *[x, x], which unpacks to two arguments each equal to x
.reshape(*2*[x]) — This reshapes the 1D array into a 2D array of shape (x, x). For x=3:
[[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
I see a unique approach to generating an identity matrix through a creative NumPy manipulation. While unconventional, the code cleverly constructs a diagonal matrix with 1s along the main diagonal and 0s elsewhere. It’s a roundabout method compared to NumPy’s built-in functions, demonstrating an inventive way of matrix creation.
Explaining q code:
What does (2#x)#1,x#0 do in q ?
[Thought process]
The user is asking about q/kdb+ syntax. This looks like a q expression. Let me break down what this means:
In q/kdb+:
# is the take/reshape operator
x#y takes x elements from y (if x is a number) or reshapes y to dimensions x (if x is a list)
The expression (2#x)#1,x#0 involves multiple uses of #
Let me parse this step by step:
x#0 — takes x zeros (creates a list of x zeros)
1,x#0 — prepends 1 to the list of x zeros, so we get a list like 1 0 0 0… (x+1 elements total)
2#x — takes the first 2 elements of x
(2#x)#1,x#0 — reshapes 1,x#0 into a matrix with dimensions from 2#x
Actually, let me reconsider. In q:
2#x when x is a list takes the first 2 elements
Then (2#x)#y reshapes y into a matrix/array with dimensions specified by the first 2 elements of x
So if x is something like 3 4 5, then:
2#x gives 3 4
x#0 gives 5 zeros: 0 0 0 0 0
1,x#0 gives 1 0 0 0 0 0 (6 elements)
(2#x)#1,x#0 reshapes this into a 3x4 matrix (but we only have 6 elements, so it will cycle)
Wait, I need to think about this more carefully. If x is 3 4 5:
count x is 3
x#0 creates a list of zeros using the first element of x
Specifically, 3#0 results in 0 0 0
When prepending 1, we get 1 0 0 0
2#x extracts 3 4
The final reshape will create a matrix cycling through the limited elements
I see a pattern emerging for generating an identity matrix in q. By manipulating list creation and reshaping, we can construct a matrix with 1s on the diagonal and 0s elsewhere, using a compact technique that leverages q’s list manipulation capabilities.
Unlike in the Python case, in q the LLM is less certain that x should be a scalar rather than a list; yet even when the prompt includes this additional detail, Claude still retraces its steps twice.
What does (2#x)#1,x#0 do in q for scalar x?
[Thought process]
This is a question about q/kdb+, which is a programming language. Let me think about what this expression does.
In q/kdb+:
# is the take/reshape operator
When used as n#list, it takes/repeats elements
When used as shape#list, it reshapes
Let’s break down (2#x)#1,x#0:
x#0 — This takes x copies of 0, creating a list of x zeros
1,x#0 — This prepends 1 to that list, so we get 1 0 0 0… (with x zeros)
2#x — This takes 2 copies of the scalar x, creating x x
(2#x)#1,x#0 — This reshapes the list 1,x#0 into shape x x, creating an x by x matrix
Wait, let me reconsider. When you have a shape vector like 2 3 and use it with #, it reshapes the right argument into that shape.
So if we have (2#x)#1,x#0:
2#x creates a 2-element list where both elements are x: x x
This is used as a shape for reshaping
The data to reshape is 1,x#0 which is 1 followed by x zeros
Reshaping into shape x x means creating an x-by-x matrix
This would create an x by x identity matrix! The first element is 1, then x zeros, and when reshaped into an x-by-x matrix, you get 1s on the diagonal and 0s elsewhere.
Actually, let me verify this logic:
For x=3: 1,3#0 gives 1 0 0 0 (4 elements)
Reshaping with shape 3 3 takes these 4 elements and repeats cyclically to fill a 3x3 matrix
Reading row by row, the matrix cycles through the initial data
The reshape operation fills the matrix by repeatedly using the source data
Each row follows a cyclic pattern determined by the initial list indices
Specifically, the indices wrap around using modulo arithmetic
This creates a unique diagonal pattern resulting in an identity matrix
When prompted with the code Claude itself wrote:
What does (x;x) # (x*x) # 1,x#0 do in q?
[Thought process]
This is a question about q/kdb+, which is a programming language. Let me break down this expression:
(x;x) # (x*x) # 1,x#0
In q/kdb+:
# is the “take” operator (or “reshape” depending on context)
; is used to create lists
, is the join/concatenate operator
Let me parse this from right to left (since q is generally evaluated right to left):
x#0 — this creates a list of x zeros
1,x#0 — this prepends 1 to that list, so you get a list like 1 0 0 0… with x zeros
(x*x) # 1,x#0 — this reshapes the list 1,x#0 to have length x*x
(x;x) # (x*x) # 1,x#0 — this reshapes the result into a matrix with dimensions x by x
So this creates an identity matrix of size x by x.
Let me trace through an example with x=3:
3#0 gives 0 0 0
1,3#0 gives 1 0 0 0 (4 elements)
(3*3) # 1,0 0 0 = 9 # 1 0 0 0 gives 1 0 0 0 1 0 0 0 1 (reshapes/cycles the input to length 9)
(3;3) # 1 0 0 0 1 0 0 0 1 reshapes this into a 3x3 matrix:
1 0 0
0 1 0
0 0 1
Yes, this creates an identity matrix!