Pierre Barre
Jun 25, 2025, 7:40:43 AM
to certificate-transparency
Hi everyone,
With multiple new CT implementations available, especially static spec implementations, I ran some benchmarks that might be of interest.
I tested three implementations:
- Sunlight (static)
- Cloudflare Azul (static)
- CompactLog (static + RFC6962)
Test setup:
1. Generated 1M unique certificates
2. Used wrk with a Lua script to test add-chain endpoints
3. Tested at 50, 1000, and 3000 concurrent connections
4. 60 seconds per test, storage reset between implementations
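For reference, an add-chain request body is just JSON carrying the chain as base64-encoded DER (RFC 6962, section 4.1). A minimal Python sketch of what the wrk Lua script has to construct per request (the DER bytes below are dummies, not real certificates):

```python
import base64
import json

def build_add_chain_body(der_certs: list[bytes]) -> str:
    """RFC 6962 add-chain body: end-entity certificate first, then the
    rest of the chain, each base64-encoded DER."""
    return json.dumps({"chain": [base64.b64encode(c).decode() for c in der_certs]})

# Dummy DER bytes for illustration; the real benchmark used 1M unique certs.
body = build_add_chain_body([b"\x30\x82\x01\x00"])
print(body)  # {"chain": ["MIIBAA=="]}
```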
I didn't include Tessera (its CT implementation isn't POSIX-only yet). I also tried itko but couldn't get it running.
Results:
The results show significant performance variations between implementations, regardless of whether they use static or RFC 6962 APIs.
I hope you will find this interesting!
Best,
Pierre
Andrew Ayer
Jul 1, 2025, 8:51:24 AM
These benchmarks are misleading for the following reasons.
First, CompactLog does not call fsync to durably store data on disk, while Sunlight does[1]. I determined this by running strace on a release build of CompactLog as of commit a299190d5259a2602626b2d0a5e3effb830cb8eb and submitting a certificate using add-chain. No calls to fsync were made prior to an SCT being returned and the entry being incorporated into the tree. I then conducted another test in which I simulated a hard system reset (with `echo b > /proc/sysrq-trigger`) immediately after obtaining an SCT from add-chain. After rebooting the system, CompactLog failed to start with the error "failed to open db: EmptySSTable". I then ran the strings command on the files in the storage directory, and while I found evidence of other certificates, there was no trace of the certificate submitted right before the crash.
fsync reduces performance but is absolutely mandatory for CT logs, which must not lose data in the face of events like power failures.
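The durable-write discipline at stake looks roughly like the sketch below (a generic pattern in Python for illustration, not a transcription of Sunlight's Go code at [1]): fsync the file contents, rename into place, then fsync the directory so the rename itself survives a reset.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so it survives a hard reset before an SCT is returned:
    fsync the file, rename atomically, then fsync the parent directory."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # file contents reach stable storage
    os.replace(tmp, path)         # atomic rename into place
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # persist the directory entry itself
    finally:
        os.close(dir_fd)
```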
Second, the latency benchmarks appear to be measuring the difference in batch interval, rather than intrinsic properties of the implementations. This is evident with the Sunlight latency measurements being consistently 1000 ms - the exact interval at which Sunlight commits batches by default[2]. On the day that these benchmarks were posted, CompactLog's batch interval was changed from 500 ms to 50 ms[3], so I'm not sure which value was used with these benchmarks, but either value would explain the lower latency compared to Sunlight and render the benchmark meaningless. It's important to note that there are tradeoffs when picking this parameter - a shorter batch interval makes add-chain faster, but increases monetary costs when using cloud object storage due to the increased number of PUT operations.
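To put rough numbers on that tradeoff - assuming one checkpoint PUT per batch and S3's published price of $0.005 per 1,000 PUT requests (both simplifying assumptions):

```python
# PUT-cost arithmetic for the batching tradeoff, assuming one PUT per
# batch and $0.005 per 1,000 PUT requests (both simplifying assumptions;
# actual implementations and pricing vary).
def monthly_puts(batch_interval_ms: int, days: int = 30) -> int:
    return days * 24 * 3600 * 1000 // batch_interval_ms

for interval_ms in (1000, 500, 50):
    puts = monthly_puts(interval_ms)
    print(f"{interval_ms:>4} ms interval: {puts:>10,} PUTs/month, ~${puts / 1000 * 0.005:,.2f}")
```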
I urge caution about evaluating CT log implementations on the basis of write performance, as there are other properties that are much more important. It's better to pick a robust implementation and shard the write load among several logs than run a single log that's fast but less robust.
When reviewing the Sunlight code, it was easy for me to follow the write path to ensure that before an SCT is returned, the entry is durably persisted to a filesystem or an object store. CompactLog, on the other hand, delegates to SlateDB, a complex storage engine which is less than 15 months old and hasn't reached a stable version yet. Beyond the lack of fsync, I'm troubled by some of the issues in the SlateDB bug tracker[4], which include data corruption bugs which were uncovered during CompactLog development:
https://github.com/slatedb/slatedb/issues/604 - "GC deletes things it shouldn't" (not yet fixed)
https://github.com/slatedb/slatedb/issues/591 - "Enabling compression results in a corrupted database" (fixed)
Another important property is read availability. I tried submitting certificates to CompactLog until the filesystem ran out of space. Once the filesystem was full, the read endpoints (get-sth, get-entries, checkpoint, tiles) began returning the following error:
{"error":"Storage error: Invalid data format: Failed to get committed tree size: DbError(ObjectStoreError(Generic { store: \"LocalFileSystem\", source: UnableToCopyDataToFile { source: Os { code: 28, kind: StorageFull, message: \"No space left on device\" } } }))"}
With Sunlight, an out-of-space condition or similar failure affecting the write path would not affect the read endpoints. This is important in CT, because while CAs can always submit to another log, monitors need to be able to retrieve certificates in a timely manner from all logs.
It's possible to create an RFC 6962 implementation in which the read path is independent of write path failures (you can just run a Sunglasses proxy in front of Sunlight) but the design of RFC 6962 does not encourage this. Indeed, every RFC 6962 implementation today uses a database to serve read requests. In contrast, I believe static-ct-api's design encourages implementations in which the read path is served without a database, independently of the write path.
Instead of write performance benchmarks, I'd rather see CT log implementations post design documents and tests which focus on how they ensure robustness and high read availability (for example, see the section "On robustness" in Sunlight's design document[5]).
Regards,
Andrew
[1] https://github.com/FiloSottile/sunlight/blob/main/internal/durable/path.go
[2] https://github.com/FiloSottile/sunlight/blob/7b9902e4ca71550005fda2f6a45698fa0c59005f/cmd/sunlight/main.go#L183
[3] https://github.com/Barre/compact_log/commit/e65a07c2830fa37952b45b24a7f6a47e809bf603
[4] https://github.com/slatedb/slatedb/issues
[5] https://docs.google.com/document/d/1YsxLGZxYE1KTCTjDK2Ol-bcTzbrI313SZ1QWgqmnRDc/edit?tab=t.0#heading=h.2tc8kbu84230
Pierre Barre
Jul 1, 2025, 10:21:30 AM
to Andrew Ayer, certificate-transparency
Hi Andrew,
Thank you for taking the time to analyze CompactLog.
I appreciate you highlighting that the local filesystem configuration lacks durability guarantees - you're correct that this should be better documented. The local storage option is primarily intended for testing and development, not production use. I'll update the README to make this clearer. Production deployments use cloud object storage (S3, Azure Blob, or a durable on-premises S3 implementation), where durability is handled at the storage layer rather than through application-level fsync.
Regarding durability in production CT deployments - modern systems rely on storage-layer guarantees rather than filesystem semantics. While fsync provides some durability on specific filesystem configurations, production deployments require defense against the full spectrum of failure modes: controller cache failures, misdirected writes, bit rot, and silent corruption.
On batch frequency: You're correct that the latency measurements reflect the 50ms batching configuration. The throughput numbers, error rates, and scalability patterns remain valid data points for operators evaluating implementations. If CompactLog can safely operate at 50ms batching while maintaining efficiency, why would I artificially throttle it to 1 second?
CompactLog's WAL design ensures that batch frequency doesn't directly correlate with storage PUT operations - the writes are coalesced at the storage layer. The ability to safely operate at 50ms batching while maintaining both performance and cost-efficiency is an architectural advantage, not a benchmark artifact. Lower latency directly benefits CAs waiting for SCTs.
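As a toy illustration of that decoupling - the numbers here are hypothetical, not CompactLog's actual configuration - sequencing every 50 ms does not have to mean a storage PUT every 50 ms when writes are coalesced into larger WAL flushes:

```python
# Toy model of the coalescing claim: sequencing happens every 50 ms
# (what CAs see as SCT latency), while object-storage flushes happen on
# a longer cadence. The 1 s flush interval is a hypothetical parameter.
BATCH_MS, FLUSH_MS, RUN_MS = 50, 1000, 60_000

batches = RUN_MS // BATCH_MS       # sequencing events per minute
storage_puts = RUN_MS // FLUSH_MS  # coalesced storage writes per minute
print(batches, storage_puts)  # 1200 60
```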
Regarding SlateDB - yes, they've had issues that were quickly resolved. Having an actively maintained dependency with responsive maintainers who fix issues within days is exactly what you want in production systems. The alternative is unmaintained code or reimplementing complex storage logic.
The maturity concern seems selectively applied. Sunlight, started in May 2023, is roughly the same age as SlateDB. Both are young projects by infrastructure standards. The difference is that CompactLog delegates storage complexity to a dedicated project with storage expertise, while Sunlight implements its own storage layer. Both approaches have tradeoffs, but citing youth as a disqualifier would rule out most modern CT implementations.
While robustness is certainly important, dismissing performance as "meaningless" overlooks real operational concerns. Different implementations make different tradeoffs, and documenting these helps operators make informed choices.
Thank you for the comprehensive testing across crash scenarios, storage exhaustion, and dependency analysis. Your thorough investigation helps validate CompactLog's architecture - finding that the primary concern was fsync in local storage mode is actually quite reassuring given the breadth of your testing.
Best,
Pierre
Pierre Barre
Jul 1, 2025, 7:04:57 PM
to Andrew Ayer, certificate-transparency
> to Sunlight and render the benchmark meaningless. It's important to
> note that there are tradeoffs when picking this parameter - a lower
> batch frequency makes add-chain faster, but increases monetary costs
> when using cloud object storage due to the increased number of PUT
> operations.
Following up with concrete operational cost data you suggested was important. I ran both implementations ingesting 1M certificates and performing monitor-style read operations:
Write Costs (1M certificates):
CompactLog: 12,847 storage PUTs
Sunlight: 287,364 storage PUTs
22.4x more expensive writes
Read Costs (Full tree sync, 1000 iterations):
CompactLog: 82,025 GETs total (mostly cache hits after first sync)
Sunlight: 41,030,000 GETs (41,030 per sync × 1000)
500x more expensive reads
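The headline multipliers follow directly from the raw request counts:

```python
# Sanity-check the reported cost ratios against the raw counts above.
write_ratio = 287_364 / 12_847       # Sunlight PUTs / CompactLog PUTs
read_ratio = 41_030 * 1000 / 82_025  # Sunlight GETs / CompactLog GETs
print(round(write_ratio, 1), round(read_ratio, 1))  # 22.4 500.2
```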
This exposes fundamental architectural issues with "independent read/write paths." The system lacks application-level caching, meaning every monitor request hits storage directly. This design is vulnerable to denial-of-funds attacks where attackers can directly drive up S3 costs. Additionally, it requires an expensive CDN, which ironically couples the paths that are claimed to be independent. Finally, this architecture cannot achieve 0 MMD (Maximum Merge Delay) because independent paths inherently require synchronization delay between them.
Most critically, CompactLog's 0 MMD strengthens CT's security model. When SCTs are issued, certificates are immediately visible to monitors - no window for undetected misissuance. The "independent paths" architecture makes this impossible by design.
I sympathize with the investment in classic static CT - significant effort has gone into this approach by talented engineers. However, when architectural limitations force defenders to propose "security by @" (rate limiting based on user agent strings) as a serious solution, I believe we're witnessing sunk cost fallacy in action.
It's worth noting that the operators most vocally advocating for static CT appear to have infrastructure sponsorship arrangements that shield them from these costs. When storage and bandwidth are free, any difference in operational costs becomes irrelevant. But this creates a distorted view of architectural viability - what works with sponsored infrastructure doesn't translate to sustainable operations for the broader ecosystem.
More concerning is the narrative that "scale requires direct object storage serving" - a claim that these benchmarks definitively disprove. When we accept that direct S3 serving is "the only way to scale," we're essentially mandating architectural decisions that maximize cloud provider revenue.
This raises a fundamental question: why are we advocating for this model? The static CT API is objectively more complex than RFC 6962, its "pure" deployment model is economically unviable without sponsorship, and it weakens security guarantees (MMD > 0). Yet there's a push to deprecate RFC 6962 - a working, proven standard - in favor of an architecture that's worse on every measurable dimension except ideological purity.
What's particularly troubling is that the "direct storage serving" approach is essentially brute force engineering - throwing unlimited infrastructure at a problem instead of solving it properly. Caching, request coalescing, and memory management aren't complex optimizations; they're basic engineering practices. When we champion architectures that prohibit these fundamental techniques, we're not promoting simplicity - we're mandating inefficiency.
The insidious part is that this design has tremendous surface appeal. "CT logs served directly from S3" sounds innovative and elegant. Object storage is familiar, reliable, and scalable - who wouldn't support that? Most people hear the pitch and think "brilliant!" without digging deeper into the implications.
It's only when you run the numbers or try to operate it without sponsorship that the reality hits: it's a design that sounds great in conference talks but fails basic operational requirements.
What's exhausting is watching the goalposts constantly move. When I show performance benchmarks, suddenly performance doesn't matter. When I demonstrate cost efficiency, the topic shifts to "separation of concerns." When I point out the need for caching, we're told CDNs solve everything. When CDNs are shown to be expensive Band-Aids, the argument becomes about implementation simplicity. I've even been told that fewer lines of code is a key metric - as if code golf determines operational viability. This isn't technical discourse; it's ideological defense through ever-shifting arguments.
Speaking of simplicity - setting up Sunlight for these benchmarks was remarkably complex. Manual seed generation, key management, undocumented YAML configurations, multiple executables with unclear relationships, and manual SQLite database initialization.
Yes, SQLite - the "static" CT implementation that supposedly doesn't need databases requires manually initializing one.
The irony of requiring database setup for an architecture championed for eliminating databases wasn't lost on me. How do we even backup this database? Can I safely delete it? Is it critical for operation? Apparently yes - the server breaks without it: "checkpoint missing from database but present in object storage". So much for database-free architecture. Where's the operating manual explaining any of this?
I had to manually parse the Go entry point code just to understand the correct startup sequence. When your "simple" system requires reading source code to figure out basic operations, you've failed at simplicity.
Even after getting it running, I didn't feel confident about the soundness of what I'd deployed. Was my seed secure enough? What entropy was expected? When I tried an empty file, it at least failed. Progress! But it happily accepted 32 spaces as a seed. Yes, I generated cryptographic keys by echoing a string of 32 spaces into the seed file. No warnings about low entropy, no validation, just silent acceptance of a catastrophically insecure configuration. I didn't want this to work - I wanted it to fail informatively.
Actually, Sunlight requires you to provide a seed file with at least 32 bytes - but apparently any 32 bytes will do, including repeated spaces. This does indeed use whatever garbage you provide as the seed for key generation. A CT log with predictable keys undermines the entire Certificate Transparency ecosystem. This isn't just bad - it's "shut down everything and rotate all keys immediately" bad.
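To be concrete about why this is catastrophic - the sketch below is illustrative only, not Sunlight's actual key-derivation code - any deterministic derivation from a known seed yields a key that anyone else can reproduce:

```python
import hashlib

# Illustrative only - NOT Sunlight's actual key-derivation code. The
# point: deterministic derivation from a known, low-entropy seed gives
# key material any attacker can recompute.
def derive_key_material(seed: bytes) -> bytes:
    if len(seed) < 32:
        raise ValueError("seed too short")
    return hashlib.sha256(seed).digest()

operator_key = derive_key_material(b" " * 32)  # 32 spaces, as in the report
attacker_key = derive_key_material(b" " * 32)  # trivially reproduced
print(operator_key == attacker_key)  # True
```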
But hey, at least there's fsync, right?
The irony is breathtaking - being lectured about "robustness" and the critical importance of fsync while Sunlight silently accepts spaces as a cryptographic seed. Apparently, durably persisting compromised keys to disk is more important than ensuring those keys aren't trivially predictable. This perfectly encapsulates the misplaced priorities: obsessing over filesystem semantics while ignoring fundamental cryptographic security.
And why does it even require operators to provide a seed? Why not just generate a secure key pair automatically like every other cryptographic system built in the last decade? You know, for simplicity? Instead, we get the worst of both worlds - manual seed management with no validation.
When I attempted to run Sunlight at 50ms batching to match CompactLog's configuration, it essentially became stuck in a continuous checkpoint write loop - constantly PUT'ing new checkpoint objects to storage. With Sunlight's README recommending object versioning be enabled, these constant PUTs would generate thousands of object versions per hour, each incurring storage costs.
More concerning, Sunlight's performance severely degraded after ingesting just a few hundred thousand certificates - batch processing times increased to 600 ms (the local MinIO instance was backed by an NVMe array), making 50 ms batching physically impossible. This means Sunlight cannot safely operate at lower latencies even if operators wanted to provide better service to CAs. The 1-second batching isn't a conservative choice - it's an architectural limitation.
When I needed to run these benchmarks again on fresh infrastructure, my first thought was genuine dread: "Oh no, I didn't keep the old VM." That's not the reaction you want operators to have about your "simple" system. When redeployment feels like punishment, something has gone fundamentally wrong with your definition of simplicity.
If simplicity for operators was truly the goal, documentation and ease of deployment would be core to the project, not an afterthought. Instead, we see the opposite - a system that requires deep expertise just to start. This isn't simplicity; it's complexity with better marketing. With virtually no documentation, getting it running felt more like reverse engineering than deployment.
I suppose there's opportunity here - with enough expertise in these "simple" systems, one could build quite a consulting practice helping organizations navigate the complexity. The gap between "served from S3" marketing and operational reality certainly creates demand for specialists. But I'd rather build systems that operators can actually understand and run themselves.
I'm increasingly concerned that the CT ecosystem is being shaped by operators with nearly unlimited infrastructure budgets (whether through direct ownership or sponsorship), while the voices of smaller operators and monitors - who actually detect misissuance - are barely heard. When architectural decisions make logs unaffordable to operate independently, we're not improving transparency; we're consolidating control. This is becoming less about certificate transparency and more about cloud providers monetizing a mandatory security protocol.
The few independent operators still running logs deserve recognition for swimming against this tide. But we shouldn't design protocols that require either corporate sponsorship or six-figure monthly cloud bills to participate meaningfully in web security.
Happy to share the full benchmark methodology and scripts for reproduction.
Best,
Pierre
On Tue, Jul 1, 2025, at 17:51, Andrew Ayer wrote:
Bas Westerbaan
Jul 2, 2025, 3:48:22 AM
to [email protected], Andrew Ayer
> I'm increasingly concerned that the CT ecosystem is being shaped by operators with nearly unlimited infrastructure budgets (whether through direct ownership or sponsorship), while the voices of smaller operators and monitors - who actually detect misissuance - are barely heard. When architectural decisions make logs unaffordable to operate independently, we're not improving transparency; we're consolidating control. This is becoming less about certificate transparency and more about cloud providers monetizing a mandatory security protocol.
I think we all agree that we need to make CT easier and cheaper to run. To wit: one of the primary motivations of Sunlight (the protocol) is to reduce cost, and it seems to achieve this for several operators. Now, on your accusation of monetization: I invite you to do a back-of-the-envelope calculation comparing the revenue from cloud bills with the amount spent on headcount.
Best,
Bas
Pierre Barre
Jul 2, 2025, 7:16:23 AM
to certificate-transparency, Andrew Ayer
Hi Bas,
That's an interesting framing. If CT logs operate at a loss for everyone, then we're discussing the wrong metric.
The question becomes: How do we minimize the loss while maximizing ecosystem participation?
Would you agree that's a worthwhile optimization target?
If it's charity, I'm offering to take that burden off your hands for 1/10th the cost! ;-P
Best,
Pierre
Pierre Barre
Jul 2, 2025, 1:54:33 PM
to certificate-transparency, Andrew Ayer
Interesting timing on those Sunlight commits from earlier today:
- "cmd/sunlight-keygen: actually generate the seed"
- "cmd/sunlight: rename Seed to Secret and check it is exactly 32 bytes"
What's particularly concerning is that my private security report was made public without my consent, specifically to frame it as an invalid report. This was followed immediately by implementing the exact fixes I suggested.
For the record: accepting 32 spaces as cryptographic key material isn't a "usability" issue. It's a key generation failure that produces predictable private keys. The immediate fixes confirm this was understood.
What concerns me most is the precedent this sets. Taking private security reports public to dismiss the reporter as mistaken, while simultaneously implementing their recommendations, is a serious breach of trust. It suggests vulnerability classification is more about controlling narrative than technical merit, and actively punishes researchers who follow responsible disclosure practices.
Best,
Pierre
Aaron Gable
Jul 2, 2025, 2:41:40 PM
> Finally, this architecture cannot achieve 0 MMD (Maximum Merge Delay) because independent paths inherently require synchronization delay between them.
You've made this claim multiple times and I must admit I still do not understand it. Sunlight does not have a MMD greater than 0. It refuses to return an SCT until the entry has been sequenced into the log, and in fact includes the sequence number (leaf_index) directly inside the SCT it returns.
Under what definition are you somehow considering this behavior to be an MMD > 0?
Aaron
Pierre Barre
Jul 2, 2025, 2:48:35 PM
to certificate-transparency
Aaron,
If you need the relationship between sequence assignment and commitment guarantees explained, that’s concerning for a CT implementation discussion.
The fact that you felt compelled to nitpick technical details while completely ignoring the egregious behavior described above - where my private vulnerability report was weaponized for public dismissal, only to be “fixed” within hours - shows this isn’t a fair discourse environment.
I’m done contributing to a community that prioritizes technical gotchas over addressing serious ethical violations.
This is my last email here. Thanks to all the great people I’ve interacted with who deserve better than this.
Bye.
Pierre
Aaron Gable
Jul 2, 2025, 3:10:18 PM
> Aaron,
> If you need the relationship between sequence assignment and commitment guarantees explained, that’s concerning for a CT implementation discussion.
I don't believe I need that relationship explained, and I don't believe explaining that relationship would have answered my question. But perhaps I (someone who works with the CT ecosystem but who has never written a log implementation myself) have a misunderstanding! I'd love to hear a straightforward explanation of how my understanding of Sunlight's zero merge delay is misguided or incorrect. I'm sorry it seems like the opportunity for that has passed.
> The fact that you felt compelled to nitpick technical details while completely ignoring the egregious behavior described above - where my private vulnerability report was weaponized for public dismissal, only to be “fixed” within hours - shows this isn’t a fair discourse environment.
Without any comment on the egregiousness of the behavior, I'll note the email I sent is timestamped twelve minutes before the email in which you reported said behavior. Accusing me of ignoring things which had not yet been presented for consideration does not seem conducive towards fostering a fair discourse environment.
Thanks,
Aaron
Pierre Barre
Jul 2, 2025, 10:20:52 PM
to certificate-transparency
For the record, here’s the email Google’s Joe decided to ban me for and call me out in public on the other list:
————————
Filippo,
First, you’re once again making our private exchange public without my consent; stop that. This pattern of weaponizing transparency against security reporters is deeply unethical.
Your attempt to credit Andrew actually makes your behavior more apparent, not less.
Andrew’s issue presents these as mild usability suggestions - “good in my opinion,” “reduce cognitive burden,” “Basically, I think it would…” His tone is clearly about UX improvements, not security vulnerabilities. You even told me these were “planned as part of a set of usability improvements for v1.0.0” - yet mysteriously they became urgent same-day fixes within hours of my security report. Why the sudden urgency for v1.0.0 features?
More importantly, your response contains a fundamental contradiction: If you were simply implementing Andrew’s planned improvements, why did you explicitly tell me accepting 32 spaces as cryptographic keys was “not a security vulnerability”?
Either:
1. You believed it was a vulnerability (explaining the urgent fixes) - making your dismissal dishonest
2. You believed it wasn’t a vulnerability - making your immediate fixes inexplicable
Furthermore, you’ve completely ignored the corruption scenario I raised. Your “exactly 32 bytes” check doesn’t fix this at all - a file corrupted with 32 null bytes would still pass. The actual fixes would be:
• A seed format with integrity checking (checksum/hash)
• Stop accepting user-provided seeds entirely
• Actual validation beyond just length
The timing tells the real story: Andrew’s gentle UX suggestions sat for a day until my security report triggered immediate action. You’re now trying to retroactively frame security fixes as usability improvements to avoid acknowledging the vulnerability.
————————
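For the record, the integrity-checked seed format I proposed could look like the sketch below (hypothetical layout - 32-byte seed followed by its SHA-256 digest - not any project's actual format), so that a file corrupted to 32 null bytes is rejected at startup instead of silently producing a different key:

```python
import hashlib

# Hypothetical seed-file layout: 32-byte seed || 32-byte SHA-256 digest.
# A corrupted or truncated file fails validation instead of being used.

def write_seed(path: str, seed: bytes) -> None:
    if len(seed) != 32:
        raise ValueError("seed must be exactly 32 bytes")
    with open(path, "wb") as f:
        f.write(seed + hashlib.sha256(seed).digest())

def read_seed(path: str) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    seed, digest = blob[:32], blob[32:]
    if len(blob) != 64 or hashlib.sha256(seed).digest() != digest:
        raise ValueError("seed file corrupted or truncated")
    return seed
```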
Katrina Joyce
7:46 AM
Pierre,
This google group is a community space that falls under the purview of the transparency.dev code of conduct. Technical discussion is welcomed. Insulting, patronising, and disparaging remarks towards others are not. Additionally, this forum should *absolutely not* be used to post rebuttals to decisions made on other groups that you no longer have access to.
As a result of messages in this thread, we are issuing you with a warning, with consequences for continued behaviour of this nature within the transparency.dev community. Any further violations of the community code of conduct will result in a permanent ban from this group, and other transparency.dev community communication spaces as well.
Kat
Transparency.dev Community Manager
Pierre Barre
8:32 AM
to certificate-transparency
Hi,
To provide context: I discovered critical security vulnerabilities in Certificate Transparency infrastructure, attempted private disclosure, had my report made public without consent, and was banned from the CT mailing list. The email I shared contains no insults - only technical analysis of the handling of security issues.
Since then, I've discovered additional vulnerabilities, including unauthenticated memory dump endpoints and privilege escalation paths in production infrastructure, which I forwarded to both Filippo and [email protected].
I understand this forum may not be the appropriate venue. With normal disclosure channels closed to me and these vulnerabilities affecting production systems, I'll be following industry-standard responsible disclosure practices with a 30-day timeline (August 3rd, 2025), as per Google Project Zero's established principles.
I won't be posting further in this forum.
Best,
Pierre