No, this isn’t about formally verifying benchmarking (BM) crimes. It’s about the use of misleading statistics in papers that apply formal methods (FM) to verify (prove correct) operating systems (OS) code – something that has been bugging me for a while.
System | Implementation | Prover | Proof:Code |
------ | -------------- | ------ | ---------- |
seL4 | C+Asm | Isabelle | 20:1 |
CertiKOS | C+Asm | Coq | 15:1 |
SeKVM | C+Asm | Coq | 7:1 |
Ironclad | Dafny | Dafny | 5:1 |
XXX | Rust | Verus | 10:1 |
YYY | Rust | Verus | 7.5:1 |
Table 1 above shows a recent example from a peer-reviewed paper (it would be unfair to pillory the authors, given how widespread this flaw is, so I’m keeping it anonymous). It seems to show a significant improvement in verification efficiency since the original seL4 work, given that the lines of proof required to verify a certain amount of code have been reduced by factors of 3–4. Note that the systems in the upper half of the table use interactive (i.e. manual) theorem proving (ITP), while those in the lower half use automatic theorem proving (ATP) based on SMT solvers. Given the automation, one would naturally expect the verification effort of the ATP systems to be significantly less than that of the ITP systems, and the table seems to roughly confirm this.
So, what does “proof-to-code ratio” mean?
Short answer: Very little.
To understand why I say that, let’s take as an example a function sort() that takes a list of integers and returns the list sorted by increasing value. Presumably, a “proof” of that function should confirm exactly that: it returns the original numbers sorted. But in order to have a meaningful proof, we need to have a more precise specification of the expected result. And such a specification will demand a number of properties of the return value; informally this could look something like:
1. The return value, R, is a list of integers.
2. The length, n, of the output list equals the length of the argument list, A.
3. The output list is sorted: $\forall i,\ 0 < i < n: R_{i-1} \le R_i$.
4. Every element of the output occurs in the argument: $\forall i,\ 0 \le i < n: R_i \in A$.
Clearly, our expectation of the correctness of sort() is that all those conditions hold. Condition 1 is little more than syntax (and may or may not be discharged by the type system). Condition 2 is fairly trivial and clearly insufficient for correctness.
Condition 3 is the core property that says what we mean by “sorted”. But it’s clearly not enough, because a function that returns the list [0, 1, …, n−1] would satisfy Conditions 1–3, but would still be utterly wrong for what we want. So we also want Condition 4, which assures that we sort the correct numbers.
Sort-of. Conditions 1–4 would be sufficient if we had a precondition that all numbers in A are unique. If we don’t have that, then the results could still be wrong, e.g. if sort([1,0,1]) returned [0,0,1], which presumably is not what we’re after.
To fully capture the correctness of sort(), we need an additional Condition 5, which somehow captures that we’re dealing with the same numbers: each integer occurs exactly the same number of times in both lists. I’m not even trying to specify this semi-formally, as I’m sure I’d stuff it up…
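To make this concrete, here is a minimal illustrative sketch in Python that encodes Conditions 1–5 as executable checks. These are not the formal specifications of any of the systems in the tables; the helper `satisfies_spec` and its `up_to` parameter are purely my own illustration.

```python
from collections import Counter

def satisfies_spec(A, R, up_to=5):
    """Check (executable approximations of) the first `up_to` of Conditions 1-5
    for a claimed sorting result R of the input list A."""
    conds = [
        all(isinstance(x, int) for x in R),             # 1: R is a list of integers
        len(R) == len(A),                               # 2: output length equals input length
        all(R[i-1] <= R[i] for i in range(1, len(R))),  # 3: output is sorted
        all(x in A for x in R),                         # 4: every output element occurs in the input
        Counter(R) == Counter(A),                       # 5: same numbers with the same multiplicities
    ]
    return all(conds[:up_to])

A = [1, 0, 1]
print(satisfies_spec(A, [0, 1, 2], up_to=3))  # True:  [0, 1, ..., n-1] passes Conditions 1-3 yet is wrong
print(satisfies_spec(A, [0, 0, 1], up_to=4))  # True:  passes Conditions 1-4, still not a sort of A
print(satisfies_spec(A, [0, 0, 1], up_to=5))  # False: Condition 5 finally rules it out
print(satisfies_spec(A, [0, 1, 1], up_to=5))  # True:  the correct result satisfies all five
```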
Why am I telling you all this?
Well, between proving
1. Conditions 1–3
2. Conditions 1–4
3. Conditions 1–5
the code hasn’t changed. But it’s obvious that the later cases have a much more complex specification than the earlier ones. And it should also be obvious that proving (2) will require more proof effort than proving (1), and proving (3) will require much more proof effort than proving (2), resulting in a larger proof, and hence a larger proof-to-code ratio.
So, just comparing proof sizes without looking at the completeness of the spec is meaningless. Implying that it has meaning is definitely a benchmarking crime in my books.
Sorting is also a good example for a different aspect: there are many sort algorithms of varying (conceptual as well as computational) complexity. Any of them, if implemented correctly, will satisfy our Conditions 1–5, but with significantly differing code sizes. The different code sizes may have an impact on the size of the proof (very likely with ITP, maybe not as much with SMT, if it can handle it!). How much this can tell you about verification efficiency is an open question.
This is what I mean when I say that:
Proof-to-code ratio is a meaningless metric!
So, what does make sense?
Clearly the size of the specification is a major factor, more important than the code size! Note that Table 1 commits the classic BM crime of showing relative numbers only, so I need to dig out the full information.
System | Short | Type | Code (kLOC) | Spec (kLOC) | Proof (kLOC) | P:C | P:S |
------ | ----- | ---- | ----------- | ----------- | ------------ | --- | --- |
seL4 | seL4I | ITP | 8.5 | 4.9 | 165 | 19 | 34 |
CertiKOS | CKOSI | ITP | 6.5 | 1.4 | 95 | 15 | 68 |
SeKVM | SKVMI | ITP | 3.8 | 1.7 | 24 | 6.4 | 14 |
Ironclad | IcldA | ATP | 7.0 | 3.5 | 33 | 4.8 | 9.5 |
XXX | XXXA | ATP | 0.5 | 0.6 | 5 | 10 | 8.3 |
YYY | YYYA | ATP | 4.6 | 2.3 | 11 | 2.3 | 4.6 |
Table 2 gives a more complete picture in that it shows not only the actual code and proof sizes, but also the spec sizes. (I’m grateful to the authors for supplying these figures where they weren’t in the paper, and note that some are updated with respect to the paper.)
In the table, the “Short” column defines the acronym used in the graphs below, and the “Type” column indicates the verification approach used, interactive or automated theorem proving; all sizes are in thousands of lines (kLOC), P:C stands for the proof-to-code ratio and P:S for the proof-to-spec ratio.
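As a sanity check, the ratio columns of Table 2 can be recomputed from the size columns. The snippet below is just my own arithmetic on the published numbers; small deviations from the table (e.g. for YYY) are presumably due to rounding of the size figures.

```python
# Recompute P:C and P:S from the code, spec and proof sizes of Table 2 (all in kLOC).
systems = {
    # name:      (code, spec, proof)
    "seL4":      (8.5, 4.9, 165),
    "CertiKOS":  (6.5, 1.4, 95),
    "SeKVM":     (3.8, 1.7, 24),
    "Ironclad":  (7.0, 3.5, 33),
    "XXX":       (0.5, 0.6, 5),
    "YYY":       (4.6, 2.3, 11),
}
for name, (code, spec, proof) in systems.items():
    print(f"{name:9s} P:C = {proof / code:5.1f}   P:S = {proof / spec:5.1f}")
```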
The table also corrects a common misrepresentation of the seL4 verification effort, which is often summarised as 20 person-years (py) and 200 k lines of proof for 8.5 k lines of code. This is quite incorrect: if you bother to check the paper [Klein et al., 2009] you’ll see that these proof-size figures include some automatically generated proofs; the manually written proofs were 165 k lines. Furthermore, the 20 py includes a significant investment into re-usable proof libraries, e.g. for dealing with fixed-size machine words, leaving 11 py for the actual seL4 verification. Unfortunately, this doesn’t stop (these and other) authors quoting incorrect numbers with reference to the paper.
If we look at the proof/spec size ratio (last column) we actually see an even bigger difference than in the proof/code ratio. Why is that?
The striking observation from the table is that seL4 is by far the most complex system, going by the spec size. This aligns with the informal complexity arguments I’ve made above, so it’s hardly surprising that the proof required more effort.
Before we get carried away with looking at ratios, we need to first remind ourselves of a simple fact:
Comparing ratios of two quantities only makes sense if you can reasonably expect a linear relationship between these quantities.
For example, it is nonsense to compare the ratio of surface gravity to diameter of different planets: by Newton’s law the surface gravity depends on both mass and radius ($g = GM/r^2$), so there is no linear relationship with the diameter, and thus the ratio numbers are meaningless.
So, can we expect a linear relationship between code size/complexity and proof size? In fact, it is far from linear: Matichuk et al. [2015] found that the size of a proof is quadratic in the size of the formal statement of a property. In other words, the proof size can be expected to grow with the square of the spec size!
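A back-of-the-envelope illustration of why this matters (the quadratic constant below is made up, and both systems are hypothetical): under a pure quadratic law $P = c\,S^2$, two systems with identical verification “efficiency” (the same $c$) still show very different proof-to-spec ratios, simply because the ratio $P/S = c\,S$ grows with the spec size.

```python
# Hypothetical quadratic law: proof size P = c * S**2, same constant c for both systems.
c = 8.0  # made-up proof kLOC per (spec kLOC)**2 -- not fitted to any real system

for name, spec_kloc in [("small system", 1.5), ("large system", 5.0)]:
    proof_kloc = c * spec_kloc**2
    print(f"{name}: spec = {spec_kloc} kLOC, proof = {proof_kloc:.0f} kLOC, "
          f"P:S = {proof_kloc / spec_kloc:.1f}")
# small system: P:S = 12.0; large system: P:S = 40.0 -- a 3x "worse" ratio
# despite identical underlying verification efficiency.
```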
I found that result highly intuitive: proving that some piece of code maintains an invariant will obviously depend on how much state there is that can affect the functionality of the code, and how much other code there is that can affect that state. Intuitively, this is an $n^2$ problem.
So, just comparing ratios as if they had meaning is simply ~~bullshit~~ a benchmarking crime.
Note that Matichuk et al. only looked at ITP, so there’s no guarantee that the same relationship will hold for ATP. However, as I said above, the $n^2$ relationship is intuitive.
So, how do the stats look in this light?
In Figure 1 we plot lines-of-proof against lines-of-code (red ovals) and against lines-of-spec (blue ovals). The straight lines are sample quadratics (with no claim that they are good predictors). Note that in this log-log plot all quadratics are parallel: $y = c\,x^2$ becomes $\log y = \log c + 2\log x$, a straight line of slope 2, whatever the constant $c$.
If we just look at code and proof sizes we see that the purple line fits the ITP systems (solid red ovals) reasonably well, while two of the ATP systems (Ironclad and YYY, dotted red ovals) are, unsurprisingly, lower (we’ll look at XXX later). However, as argued, this is a meaningless comparison.
If we look instead at the proof-to-spec ratio then we see that among the ITP systems (solid blue ovals), seL4 and SeKVM are quite similar (fitted by the green line), while CertiKOS has a surprisingly large amount of proof given its spec size.
Comparing the ATP systems (dotted blue ovals) we see that the orange line fits Ironclad and YYY quite well, while the proof effort of XXX is more in line with the ITP systems.
Why is XXX such an outlier? I’m not going to make any strong claims here, but I have at least a suspicion: XXX is a much simpler system than the others, in terms of both code and spec size. The quadratic relationship found by Matichuk et al. is obviously asymptotic; we can reasonably expect that there’d be linear and constant terms in the complete relationship. Such terms would result in a dependency that is no longer a straight line in Figure 1 – the dependency graph would rise above the straight lines towards the left of the figure. This could well be the reason why XXX is so high compared to the purple line.
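To illustrate that suspicion, here is a toy calculation with a made-up relationship $P = aS^2 + bS + c$; the coefficients are arbitrary and not fitted to anything. For small specs the linear and constant terms dominate, lifting the point above the pure quadratic asymptote, which is exactly a straight line in the log-log plot.

```python
# Toy model: proof size with lower-order terms, P(S) = a*S**2 + b*S + c.
# All coefficients are made up; the point is only the shape of the curve.
a, b, c = 8.0, 5.0, 3.0

for S in (0.5, 1.0, 2.0, 5.0):
    P = a * S**2 + b * S + c
    excess = P / (a * S**2)          # how far above the pure quadratic asymptote
    print(f"S = {S:3}: P = {P:6.1f}, pure quadratic = {a * S**2:6.1f}, excess factor = {excess:.2f}")
# S = 0.5 -> excess ~3.8x; S = 5.0 -> excess ~1.1x: only small systems sit
# noticeably above the asymptote, which could explain why XXX looks so high.
```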
Modularity
The time-honoured engineering principle for dealing with complexity is modularity: Break the complex system into simpler parts that are largely independent. This likely applies to verification as well. Specifically, if the verification cost, V, asymptotically grows with the square of the size of the spec, S, i.e.
$$V \sim O(S^2),$$
then, if we break the whole system into k modules, each with a spec size of S/k, the cost of verifying one module would be
$$V_k \sim O\!\left((S/k)^2\right) \sim O(V/k^2),$$
and the resulting cost for verifying all modules would be
$$\sum_k V_k \sim O\!\left(k\,(S/k)^2\right) \sim O(V/k).$$
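Here is the same argument as a toy calculation, under the over-simplified assumptions spelled out below (equal-sized modules, no spec growth from splitting, no composition proof); the constant and sizes are made up:

```python
# Toy model of the modularity argument: V = c * S**2, split into k equal modules.
c, S = 8.0, 5.0                      # made-up constant and total spec size (kLOC)
V_mono = c * S**2                    # monolithic verification effort

for k in (1, 2, 5, 10):
    V_modular = k * c * (S / k)**2   # k modules, each costing c * (S/k)**2
    print(f"k = {k:2d}: modular effort = {V_modular:6.1f}  (= V/{k} = {V_mono / k:6.1f})")
# The per-module quadratic term shrinks faster than the number of modules grows,
# giving the O(V/k) total -- before accounting for spec growth and composition proofs.
```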
Obviously, this argument is vastly over-simplified in many ways. Not only will the modules not all be of equal complexity, but the total spec size of two sub-modules making up a system will almost certainly be larger than the spec of the complete system. Furthermore, there will be effort required to prove that the composition of all the modules will satisfy the overall system spec.
Nevertheless, experience shows that modularity may dramatically simplify constructing, and, importantly, reasoning about, complex systems, and it is intuitive that this will also somehow apply to the verification effort (and the above provides some over-simplified justification for this).
Would modularity make verifying something like seL4 cheaper?
Pretty unlikely, I claim. Why?
Because you can’t really modularise seL4. My tongue-in-cheek definition of a microkernel is: The black gunk that’s left when you pull all the easy bits out of the kernel. More seriously: In the seL4 kernel, almost everything is connected to everything else, with little chance to separate things. Consequently, a large part of the verification effort went into proving global invariants.
There are systems that claim to be “similar” to seL4 yet simpler or more modular. Each such case I have seen inevitably makes some significant simplification to the system model that reduces generality or performance (or both). As such, I don’t expect to see anything that is in a real sense like seL4 verified at significantly less cost anytime soon.
And I certainly think there’s a snowball’s chance in hell that something of the complexity of seL4 will be verified by ATP any time soon.
So, what’s the take-away?
Well, it ain’t that easy!
First we need to note that the number of data points is really small, so we need to be careful not to over-interpret the numbers. If I can attempt a summary, then it would be:
- Looking at the proof-to-code ratio is BS; you need to look at the spec size, and even there expecting a linear relationship is nonsense.
- ATP requires less proof effort than ITP – surprise! Obviously, that’s the whole point of automated techniques: when they work, they are relatively cheap (if you accept the green and orange lines in the figure as predictors, which I don’t claim they are, the improvement would be a factor of five). But they are very limited in the complexity of specifications they can tackle.
- Even within each class (ITP vs ATP) there is a big range of effort. And, surprise, seL4 actually does pretty well in the ITP class!
Hence, my plea to the community: Please stop this unscientific nonsense of looking at proof:code ratios!