Is Matrix Multiplication Ugly?


A few weeks ago I was minding my own business, peacefully reading a well-written and informative article about artificial intelligence, when I was ambushed by a passage in the article that aroused my pique. That’s one of the pitfalls of knowing too much about a topic a journalist is discussing; journalists often make mistakes that most readers wouldn’t notice but that raise the hackles or at least the blood pressure of those in the know.

The article in question appeared in The New Yorker. The author, Stephen Witt, was writing about the way that your typical Large Language Model, starting from a blank slate, or rather a slate full of random scribbles, is able to learn about the world, or rather the virtual world called the internet. Throughout the training process, billions of numbers called weights get repeatedly updated so as to steadily improve the model’s performance. Picture a tiny chip with electrons racing around in etched channels, and slowly zoom out: there are many such chips in each server node and many such nodes in each rack, with racks organized in rows, many rows per hall, many halls per building, many buildings per campus. It’s a sort of computer-age version of Borges’ Library of Babel. And the weight-update process that all these countless circuits are carrying out depends heavily on an operation known as matrix multiplication.

Witt explained this clearly and accurately, right up to the point where his essay took a very odd turn.

HAMMERING NAILS

Here’s what Witt went on to say about matrix multiplication:

“‘Beauty is the first test: there is no permanent place in the world for ugly mathematics,’ the mathematician G. H. Hardy wrote, in 1940. But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board. It is possessed of neither beauty nor symmetry: in fact, in matrix multiplication, a times b is not the same as b times a.”

The last sentence struck me as a bizarre non sequitur, somewhat akin to saying “Number addition has neither beauty nor symmetry, because when you write two numbers backwards, their new sum isn’t just their original sum written backwards; for instance, 17 plus 34 is 51, but 71 plus 43 isn’t 15.”

The next day I sent the following letter to the magazine:

“I appreciate Stephen Witt shining a spotlight on matrices, which deserve more attention today than ever before: they play important roles in ecology, economics, physics, and now artificial intelligence (“Information Overload”, November 3). But Witt errs in bringing Hardy’s famous quote (“there is no permanent place in the world for ugly mathematics”) into his story. Matrix algebra is the language of symmetry and transformation, and the fact that a followed by b differs from b followed by a is no surprise; to expect the two transformations to coincide is to seek symmetry in the wrong place — like judging a dog’s beauty by whether its tail resembles its head. With its two-thousand-year-old roots in China, matrix algebra has secured a permanent place in mathematics, and it passes the beauty test with flying colors. In fact, matrices are commonplace in number theory, the branch of pure mathematics Hardy loved most.”

Confining my reply to 150 words required some finesse. Notice for instance that the opening sentence does double duty: it leavens my many words of negative criticism with a few words of praise, and it stresses the importance of the topic, preëmptively¹ rebutting editors who might be inclined to dismiss my correction as too arcane to merit publication.

I haven’t heard back from the editors, and I don’t expect to. Regardless, Witt’s misunderstanding deserves a more thorough response than 150 words can provide. Let’s see what I can do with 1500 words and a few pictures.

THE GEOMETRY OF TRANSFORMATIONS

As static objects, matrices are “just” rectangular arrays of numbers, but that doesn’t capture what they’re really about. If I had to express the essence of matrices in a single word, that word would be “transformation”.

One example of a transformation is the operation f that takes an image in the plane and flips it from left to right, as if in a vertical mirror.


Another example is the operation g that takes an image in the plane and reflects it across a diagonal line that goes from lower left to upper right.


The key thing to notice here is that the effect of f followed by g is different from the effect of g followed by f. To see why, write a capital R on one side of a square piece of paper–preferably using a dark marker and/or translucent paper, so that you can still see the R even when the paper has been flipped over–and apply f followed by g; you’ll get the original R rotated by 90 degrees clockwise. But if instead, starting from that original R, you were to apply g followed by f, you’d get the original R rotated by 90 degrees counterclockwise.

Same two operations, different outcomes! Symbolically we write g ◦ f ≠ f ◦ g, where g ◦ f means “First do f, then do g” and f ◦ g means “First do g, then f”.² The symbol ◦ denotes the meta-operation (operation-on-operations) called composition.
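If you’d rather not sacrifice a square of paper, a few lines of Python make the same point. This is just a sketch of mine, assuming the standard coordinate formulas for the two reflections: f sends (x, y) to (−x, y), as we’ll see in the next section, and g swaps the two coordinates.

```python
def f(point):
    x, y = point          # flip across the vertical axis
    return (-x, y)

def g(point):
    x, y = point          # flip across the diagonal y = x
    return (y, x)

p = (1, 2)
print(g(f(p)))  # (2, -1): f first, then g -- a quarter-turn clockwise
print(f(g(p)))  # (-2, 1): g first, then f -- a quarter-turn counterclockwise
```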

The fact that the order in which transformations are applied can affect the outcome shouldn’t surprise you. After all, when you’re composing a salad, if you forget to pour on salad dressing until after you’ve topped the base salad with grated cheese, your guests will have a different dining experience than if you’d remembered to pour on the dressing first. Likewise, when you’re composing a melody, a C-sharp followed by a D is different from a D followed by a C-sharp. And as long as mathematicians used the word “composition” rather than “multiplication”, nobody found it paradoxical that in many contexts, order matters.

THE ALGEBRA OF MATRICES

If we use the usual x, y coordinates in the plane, the geometric operation f can be understood as the numerical operation that sends the pair (x, y) to the pair (−x, y), which we can represent via the 2-by-2 array

( −1  0 )
(  0  1 )

where more generally the array

( a  b )
( c  d )

stands for the transformation that sends the pair (x, y) to the pair (ax+by, cx+dy). This kind of array is called a matrix, and when we want to compose two operations like f and g together, all we have to do is combine the associated matrices under the rule that says that the matrix

( a  b )
( c  d )

composed with the matrix

( p  q )
( r  s )

equals the matrix

( ap+br  aq+bs )
( cp+dr  cq+ds )

For more about where this formula comes from, see my Mathematical Enchantments essay “What Is A Matrix?”.
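Here’s a minimal Python sketch of the rule in action (my own illustration; the matrix for g, which swaps x and y, is the standard one for the diagonal flip):

```python
def compose(M, N):
    """Matrix for 'first apply N, then apply M', via the 2-by-2 rule above."""
    (a, b), (c, d) = M
    (p, q), (r, s) = N
    return ((a*p + b*r, a*q + b*s),
            (c*p + d*r, c*q + d*s))

F = ((-1, 0), (0, 1))  # f: (x, y) -> (-x, y)
G = ((0, 1), (1, 0))   # g: (x, y) -> (y, x)

print(compose(G, F))  # ((0, 1), (-1, 0)): f then g, the clockwise quarter-turn
print(compose(F, G))  # ((0, -1), (1, 0)): g then f, the counterclockwise quarter-turn
```

The two products disagree, and each is one of the quarter-turns from the paper-folding experiment, now redone in arithmetic.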

There’s nothing special about 2-by-2 matrices; you could compose two 3-by-3 matrices, or even two 1000-by-1000 matrices. Going in the other direction (smaller instead of bigger), if you look at 1-by-1 matrices, the composition of ( a ) and ( b ) is just ( ab ), so ordinary number-multiplication arises as a special case of matrix composition; turning this around, we can see matrix-composition as a sort of generalized multiplication. So it was natural for mid-19th-century mathematicians to start using words like “multiply” and “product” instead of words like “compose” and “composition”, at roughly the same time they stopped talking about “substitutions” and “tableaux” and started to use the word “matrices”.

In importing the centuries-old symbolism for number multiplication into the new science of linear algebra, the 19th-century algebraists were saying “Matrices behave kind of like numbers,” with the proviso “except when they don’t”. Witt is right when he says that when A and B are matrices, A times B is not always equal to B times A. Where he’s wrong is in asserting that this is a blemish on linear algebra. Many mathematicians regard linear algebra as one of the most elegant sub-disciplines of mathematics ever devised, and it often serves as a role model for the kind of sleekness that a new mathematical discipline should strive to achieve. If you dislike matrix multiplication because AB isn’t always equal to BA, it’s because you haven’t yet learned what matrix multiplication is good for in math, physics, and many other subjects. It’s ironic that Witt invokes the notion of symmetry to disparage matrix multiplication, since matrix theory and an allied discipline called group theory are the tools mathematicians use in fleshing out our intuitive ideas about symmetry that arise in art and science.

So how did an intelligent person like Witt go so far astray?

PROOFS VS CALCULATIONS

I’m guessing that part of Witt’s confusion arises from the fact that actually multiplying matrices of numbers to get a matrix of bigger numbers can be very tedious, and tedium is psychologically adjacent to distaste and a perception of ugliness. But the tedium of matrix multiplication is tied up with its symmetry (whose existence Witt mistakenly denies). When you multiply two n-by-n matrices A and B in the straightforward way, you have to compute n² numbers in the same unvarying fashion, and each of those n² numbers is the sum of n terms, and each of those n terms is the product of an element of A and an element of B in a simple way. It’s only human to get bored and inattentive and then make mistakes because the process is so repetitive. We tend to think of symmetry and beauty as synonyms, but sometimes excessive symmetry breeds ennui; repetition in excess can be repellent. Picture the Library of Babel and the existential dread the image summons.
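To see where the tedium comes from, here’s the schoolbook method written out as a Python sketch (nothing optimized, just the straightforward way):

```python
def matmul(A, B):
    """Schoolbook product of two n-by-n matrices, given as lists of rows."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):      # n*n entries to fill in...
            for k in range(n):  # ...each the sum of n little products
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Nothing in the loop is conceptually hard; it’s the n³ multiply-and-add steps, all performed in the same unvarying fashion, that numb a human mind and that we happily hand off to circuitry.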

G. H. Hardy, whose famous remark Witt quotes, was in the business of proving theorems, and he favored conceptual proofs over calculational ones. If you showed him a proof of a theorem in which the linchpin of your argument was a 5-page verification that a certain matrix product had a particular value, he’d say you didn’t really understand your own theorem; he’d assert that you should find a more conceptual argument and then consign your brute-force proof to the trash. But Hardy’s aversion to brute force was specific to the domain of mathematical proof, which is far removed from math that calculates optimal pricing for annuities or computes the wind-shear on an airplane wing or fine-tunes the weights used by an AI. Furthermore, Hardy’s objection to your proof would focus on the length of the calculation, and not on whether the calculation involved matrices. If you showed him a proof that used 5 turgid pages of pre-19th-century calculation that never mentioned matrices once, he’d still say “Your proof is a piece of temporary mathematics; it convinces the reader that your theorem is true without truly explaining why the theorem is true.”

If you forced me at gunpoint to multiply two 5-by-5 matrices together, I’d be extremely unhappy, and not just because you were threatening my life; the task would be inherently unpleasant. But the same would be true if you asked me to add together a hundred random two-digit numbers. It’s not that matrix-multiplication or number-addition is ugly; it’s that such repetitive tasks are the diametrical opposite of the kind of conceptual thinking that Hardy loved and I love too. Any kind of mathematical content can be made stultifying when it’s stripped of its meaning and reduced to mindless toil. But that casts no shade on the underlying concepts. When we outsource number-addition or matrix-multiplication to a computer, we rightfully delegate the soul-crushing part of our labor to circuitry that has no soul. If we could peer into the innards of the circuits doing all those matrix multiplications, we would indeed see a nightmarish, Borgesian landscape, with billions of nails being hammered into billions of boards, over and over again. But please don’t confuse that labor with mathematics.

Join the discussion of this essay over at Hacker News!

This essay is related to chapter 10 (“Out of the Womb”) of a book I’m writing, tentatively called “What Can Numbers Be?: The Further, Stranger Adventures of Plus and Times”. If you think this sounds interesting and want to help me make the book better, check out http://jamespropp.org/readers.pdf. And as always, feel free to submit comments on this essay at the Mathematical Enchantments WordPress site!

ENDNOTES

#1. Note the New Yorker-ish diaeresis in “preëmptively”: as long as I’m being critical, I might as well be diacritical.

#2. I know this convention may seem backwards on first acquaintance, but this is how ◦ is defined. Blame the people who first started writing things like “log x” and “cos x”, with the x coming after the name of the operation. This led to the notation f(x) for the result of applying the function f to the number x. Then the symbol for the result of applying g to the result of applying f to x is g(f(x)); even though f is performed first, “f” appears to the right of “g”. From there, it became natural to write the function that sends x to g(f(x)) as “g ◦ f”.
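The same convention falls out automatically when functions are composed in code; a tiny illustration in Python:

```python
import math

def compose(g, f):
    return lambda x: g(f(x))  # "g after f": f gets applied first

h = compose(math.cos, math.log)  # h(x) = cos(log x), though log acts first
print(h(1.0))  # log(1.0) = 0.0, then cos(0.0) = 1.0
```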
