Benford's Law and the Ahlstrom Conjecture


I was chit-chatting with my boss, Eric, before work one morning. Somehow we got onto the topic of identifying fraudulent financial data. I told Eric about Benford’s Law, and he speculated about a technique I hadn’t thought of before.

Benford’s Law states that in many real-life data sets, including financial data, the leading digit of a number will be a ‘1’ or a ‘2’ almost 50% of the time (theoretically, 47.7%). Very surprising. If a criminal generated fake data, it’s highly unlikely they would make the leading digit a ‘1’ or a ‘2’ so often.
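The theoretical figure comes from the usual statement of Benford’s Law, P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. Here is a minimal sketch that evaluates it. The function name benford_prob is just my own illustrative choice.

```python
import math

def benford_prob(d: int) -> float:
    """Benford's Law probability that the leading digit equals d (1 through 9)."""
    return math.log10(1.0 + 1.0 / d)

# Print the predicted frequency of each leading digit.
for d in range(1, 10):
    print(f"{d}  {benford_prob(d):.4f}")

# Probability that the leading digit is a 1 or a 2.
p12 = benford_prob(1) + benford_prob(2)
print(f"P(1 or 2) = {p12:.4f}")  # 0.4771
```

The nine probabilities sum to 1, and the first two sum to log10(3).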

The real-life financial data I used

I decided to see for myself. I grabbed an old, real-life page of financial data. There were 65 numbers on the page. I manually counted the number of times each digit, ‘0’ through ‘9’, occurred as the first digit, and computed each frequency:

digit  count  freq
------------------
  0      3    0.05
  1     18    0.28
  2     15    0.23
  3      6    0.09
  4      4    0.06
  5      5    0.08
  6      3    0.05
  7      3    0.05
  8      7    0.11
  9      1    0.02
------------------
        65    1.00

The leading digit in the real-life financial data was a ‘1’ or a ‘2’ with a frequency of 0.28 + 0.23 = 0.51 — very close to the 0.48 predicted by Benford’s Law. Surprising!
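To see the comparison digit by digit, here is a small sketch that puts the observed frequencies next to the Benford predictions. The counts are copied from my table. Benford’s Law has no entry for a leading ‘0’, so the sketch only compares digits 1 through 9 (the three ‘0’ entries still count toward the total of 65). The variable names are illustrative.

```python
import math

# Leading-digit counts copied from the table above (65 numbers in total).
counts = {0: 3, 1: 18, 2: 15, 3: 6, 4: 4, 5: 5, 6: 3, 7: 3, 8: 7, 9: 1}
total = sum(counts.values())

def benford_prob(d: int) -> float:
    """Benford's Law probability for leading digit d (1 through 9)."""
    return math.log10(1.0 + 1.0 / d)

print("digit  observed  benford")
for d in range(1, 10):
    print(f"{d:>5}  {counts[d] / total:8.2f}  {benford_prob(d):7.2f}")

# Combined frequency of a leading 1 or 2: observed versus predicted.
observed_12 = (counts[1] + counts[2]) / total
expected_12 = benford_prob(1) + benford_prob(2)
print(f"observed 1-or-2: {observed_12:.2f}, Benford: {expected_12:.2f}")
```

With only 65 numbers, individual digits can wander quite a bit from the prediction (the ‘8’ row, for example), so I wouldn’t read much into any single row.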

Eric wasn’t familiar with Benford’s Law. Before I explained it, Eric speculated that if a criminal made up fake financial data, they would almost never use two consecutive digits that are the same because that doesn’t appear random. I dubbed the idea the “Ahlstrom Conjecture”.

It turns out that consecutive digits that are the same are very frequent. Theoretically, in a sequence of 10 random digits, the probability that there is at least one adjacent pair of digits that are the same, such as ‘33’ or ‘77’, is approximately 0.61: 1.0 - ((10 * 9^9) / 10^10) = 1.0 - 0.9^9. The reasoning is that there are 10 * 9^9 sequences with no adjacent repeat (10 choices for the first digit, then 9 choices for each of the remaining nine digits) out of 10^10 possible sequences.

I applied the “Ahlstrom Conjecture” to the real-life financial data. I broke the data up into 32 sequences of ten digits. For real, non-fraudulent data, you’d expect about 0.61 * 32 ≈ 20 of the sequences to have one or more pairs of repeated digits.
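As a sanity check on the arithmetic, here is a minimal sketch that evaluates the closed form 1 - (9/10)^(n-1), which is algebraically the same as the formula above, and cross-checks it with a quick simulation. It assumes the digits are independent and uniformly random. The function names and the seed are my own choices.

```python
import random

def prob_adjacent_repeat(n: int, base: int = 10) -> float:
    """P(at least one adjacent equal pair) in n independent uniform digits."""
    return 1.0 - ((base - 1) / base) ** (n - 1)

def simulate(n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        digits = [rng.randrange(10) for _ in range(n)]
        if any(a == b for a, b in zip(digits, digits[1:])):
            hits += 1
    return hits / trials

p = prob_adjacent_repeat(10)
print(f"exact:     {p:.4f}")                           # 0.6126
print(f"simulated: {simulate(10, 100_000):.4f}")
print(f"expected among 32 sequences: {32 * p:.1f}")    # 19.6
```

The simulated value should land within a few thousandths of the exact one.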

sequence    repeat?
-------------------
8667894929     x
8949298949
2917898624
4117654517     x
6545718384
1881615806     x
4135792598
0425980424
4103024712
4714561098
0001040041     x
0739148300     x
0012644146     x
4191092821
1214783251
1734225676     x
7292926499     x
6110928211     x
6499611157     x
8172290880     x
6740213173
1143822133     x
3822133292     x
9236051173     x
4225676749     x
6698151729     x
1650002314     x
7262314726
5441313894     x
9339140000     x
1073916469
6801000000     x

There were repeated digits in 20 of the 32 sequences of digits, almost exactly what the theory predicts.
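I did the check by eye, but it’s easy to automate. Here is a minimal sketch that scans the 32 sequences from the table for adjacent repeated digits. The helper name has_adjacent_repeat is my own.

```python
# The 32 ten-digit sequences copied from the table above.
sequences = """
8667894929 8949298949 2917898624 4117654517 6545718384 1881615806
4135792598 0425980424 4103024712 4714561098 0001040041 0739148300
0012644146 4191092821 1214783251 1734225676 7292926499 6110928211
6499611157 8172290880 6740213173 1143822133 3822133292 9236051173
4225676749 6698151729 1650002314 7262314726 5441313894 9339140000
1073916469 6801000000
""".split()

def has_adjacent_repeat(seq: str) -> bool:
    """True if any two neighboring characters in seq are identical."""
    return any(a == b for a, b in zip(seq, seq[1:]))

n_repeat = sum(has_adjacent_repeat(s) for s in sequences)
print(f"{n_repeat} of {len(sequences)} sequences have a repeated pair")
print(f"fraction = {n_repeat / len(sequences):.3f}")
```

Sequences are kept as strings rather than integers so that leading zeros, as in 0001040041, aren’t lost.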

Therefore, I think the “Ahlstrom Conjecture” is correct: in fraudulent financial data generated by a human, you would expect far fewer of the ten-digit sequences to contain consecutive same digits than the theory predicts for random digits.

To the best of my knowledge, this “Ahlstrom Conjecture” isn’t a generally known technique to identify fraudulent financial data. But it seems to be a practical and useful technique. Of course, analyzing data for randomness is a standard technique, but the specific idea of looking for repeated pairs of digits doesn’t seem to be described anywhere (at least that I could find).

Fascinating!

Generating fake financial data is not too difficult. With the rise of generative AI, creating fake images has become relatively easy. I did a brief review of the topic and discovered that several of the discussions used images of Japanese geishas as examples. I speculate this is because of the heavy makeup and somewhat artificial appearance of real geishas to begin with.