Do they even test this?


That’s the question every tester dreads hearing, because it usually means we’ve let something really embarrassing slip into a release.

The real answer is, “Yes, we do,” though that doesn’t offer much comfort if you’re facing issues in production. Still, during quieter moments, people sometimes ask less rhetorically what kind of testing takes place in MariaDB. Let’s dive into that.

The path of a bugfix into a maintenance release

When a pull request (or an internal patch outside the PR system) is pushed to the MariaDB/server repository on GitHub, it is picked up by the MariaDB server CI, Buildbot. The total number of builders in Buildbot is reported to be over a hundred; the set is dynamic and regularly updated, with new builders added and old ones removed when their operating systems reach EOL. The exact set of builders used depends on the MariaDB version the patch is intended for, but it always includes at least several dozen builders, covering standard release builds as well as instrumented binaries (ASAN, MSAN, UBSAN). In addition to Buildbot, the patch is also built in AppVeyor.
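For reference, instrumented builds of this kind can also be produced locally from the same source. Here is a minimal sketch, assuming a fresh checkout and the usual build dependencies; WITH_ASAN is the CMake option I would use for an ASAN build, but check the available options for your branch, as they vary:

    # Minimal sketch: build a debug, ASAN-instrumented server locally.
    # (Option names like WITH_ASAN come from the server's CMake files;
    # verify them for your branch, e.g. with `cmake -LH`.)
    git clone --recurse-submodules https://github.com/MariaDB/server.git
    cd server && mkdir build && cd build
    cmake .. -DCMAKE_BUILD_TYPE=Debug -DWITH_ASAN=ON
    cmake --build . -j"$(nproc)"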

After building the binaries, Buildbot and AppVeyor run MTR tests. MTR is the official MariaDB test framework and a collection of tests, mainly designed for functional regression testing. The number of test files, queries, or lines isn’t a particularly meaningful characteristic, but those curious can count them themselves, as MTR is part of the MariaDB source code and is included in official DEB/RPM packages and binary tarballs. It’s sufficient to say that the test set provides decent coverage, especially when run at full capacity, using different test “protocols” (in addition to executing plain SQL queries, it has several other modes, converting everything into prepared statements, views, stored procedures, or cursors), platform-dependent variations, debug-only facilities, and environment specifics like different SSL versions or FIPS support. MariaDB unit tests are also run via MTR. The test collection is actively maintained, because the general MariaDB server policy is that every code patch must be accompanied by a test case covering it. During code review, the reviewer ensures the patch includes sufficient MTR tests. Exceptions occur when changes can’t be tested with MTR, but they are not common.

Developers are expected to inspect the build/test results of their pushes. AppVeyor results and some of the Buildbot results are reported on GitHub, so they can easily be found there alongside the corresponding commit, though they can also be viewed via the CI interface.

MTR tests can also be run locally, and developers and contributors are strongly encouraged to do so to reduce the number of iterations a patch needs to reach maturity.
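For illustration, a local run might look like the sketch below; the suite name is just an example, and the protocol switches are the standard mtr options mentioned above:

    # From the mysql-test directory of a built tree: run one suite in
    # parallel, continuing past failures to collect all results.
    ./mtr --suite=main --parallel=4 --force

    # Re-run through one of the alternative "protocols", e.g. converting
    # statements into prepared statements or wrapping them in views.
    ./mtr --suite=main --ps-protocol
    ./mtr --suite=main --view-protocol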

A subset of the Buildbot builders is included in the branch protection set. This means that even when a patch is ready from the developer’s perspective and approved by the reviewer, GitHub won’t allow it to be pushed into the main branch until tests on all branch protection builders succeed. Due to resource and time constraints, only a fairly small number of builders currently participate in branch protection, though we strive to make the subset as representative as possible, so it changes from time to time.

In addition to fully automated CI testing, selected patches are also tested by human testers. MariaDB server development consists of several teams, each focusing on different server functionality (runtime, InnoDB, optimizer, etc.). Each team has at least one dedicated tester working directly with the team, participating in team-specific meetings, and reporting to the development team lead. Identifying potentially risky patches that may require more attention than the CI can provide is thus a shared responsibility of the tester, the team lead, and the developer. For internal patches, this happens within the team that develops the patch; for contributions, within the team that reviews and accepts the patch.

Depending on the nature of the patch, testers may perform more targeted functional (regression) testing, including manual testing (though that’s not common for maintenance releases), examine corner cases, and, importantly, run concurrent regression tests to verify the stability of the new code. Testers are free to use the tools they find most suitable for the task. Currently, the most actively used tools are two SQL fuzzers: Random Query Generator (RQG) and pQuery.

RQG was originally created by Philip Stoev. We use our own heavily modified forks, which allow us not only to randomize data and queries but also to create and run various non-trivial test scenarios and setups. The logical fuzzing capabilities have also been extended. RQG can be used for stability testing, for non-functional testing such as data upgrades and crash recovery, and, to some extent, for functional and functional regression testing.
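To give a rough idea, an upstream-style RQG invocation looks something like the sketch below. Treat it as illustrative only: the grammar file name is a placeholder, and the MariaDB forks add their own scenario, configuration, and reporting layers on top, so the exact options differ.

    # Illustrative, upstream-style RQG run (not the exact MariaDB-fork syntax):
    # generate random data and queries from a grammar and execute them
    # concurrently against a server built from the branch under test.
    perl runall-new.pl \
      --basedir=/path/to/mariadb/build \
      --grammar=conf/example.yy \
      --threads=8 \
      --duration=300 \
      --vardir=/tmp/rqg_vardir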

pQuery was originally created by Roel Van de Paar and Alexey Bychko. It is further developed by Roel and other MariaDB test engineers in the MariaDB fork and is mainly used for stability testing.

Once a patch has been approved by a reviewer and, if necessary, by a tester, and it has passed branch protection, it gets pushed into the main branch. In this context, “main branch” refers not to the branch called main but to the branch for the lowest server version the patch targets (e.g., 10.6, 10.11).

A push to a main branch is again picked up by the Buildbot (and AppVeyor). Buildbot uses a much wider set of builders on the main branches. It includes not only an extended variety of platforms and a docker-library image, but also types of tests not performed on development branches and PRs, such as package installation and upgrade tests. An installation test installs packages (RPMs, DEBs, MSI) on a clean system to ensure the packages are usable in a standard environment and don’t create dependencies that default distribution repositories cannot satisfy. A package upgrade test installs a previously released version of the server and then upgrades to the new one. Both major upgrades (e.g., 10.6 → 10.11) and minor upgrades (e.g., 10.6.22 → 10.6.23) are tested. The latter also verifies that new dependencies or requirements aren’t introduced in stable releases.
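As a rough sketch of what a minor-upgrade test boils down to on a Debian-based system (the versions and repository setup are placeholders; the actual Buildbot jobs are scripted quite differently):

    # Install the previous release on a clean system and create some data.
    apt-get install -y mariadb-server    # previous release, e.g. 10.6.22
    mariadb -e "CREATE DATABASE d; CREATE TABLE d.t (a INT); INSERT INTO d.t VALUES (1)"

    # Switch the package repository to the new release and upgrade in place.
    apt-get update && apt-get install -y mariadb-server    # new release, e.g. 10.6.23
    mariadb-upgrade                    # update system tables if needed
    mariadb -e "SELECT * FROM d.t"     # sanity check: the data is still readable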

Once the patch is pushed into the lowest main branch and the CI results are satisfactory, the developer’s work on the case is done, and the JIRA ticket/PR gets closed. However, the patch doesn’t immediately appear in higher release lines. For a while, a bug may be closed as fixed in versions like 10.11, 11.4, or 11.8, yet if you build the latest 11.8, you won’t find the fix there. Over time, the patch is merged into higher versions, with merges usually done a few times between subsequent releases. There is also a merge directly before the release to ensure that all bug fixes have indeed reached all the versions they were supposed to. Eventually, the patch reaches every version it targets.

The tops of all main branches are subject to further scheduled automated system testing performed outside Buildbot. These RQG-based tests include general concurrent stability/stress tests, basic traditional replication and Galera scenarios, crash recovery, data backup/restore, and more, with randomized data, queries, and configuration options. A two-hour test set runs daily on each main branch (we typically have 6-7 active main branches simultaneously; at the time of writing, these are 10.6, 10.11, 11.4, 11.8, 12.0, 12.1, and main), and an additional, wider six-hour test set runs weekly on each branch. These tests are usually performed on debug ASAN builds. Due to the nature of random tests, they can encounter both new regressions and old bugs that weren’t previously discovered (as well as plenty of known bugs that are recognized and filtered out). New regressions usually become blockers for the upcoming release, meaning the release won’t happen until they are fixed. Old bugs are typically added to the bug backlog.

During release preparation, when the final merges are being done, in addition to the scheduled tests, more test runs are triggered:

  • performance regression benchmarks;
  • extra concurrent tests, mostly the same as in the regular scheduled runs but performed on different build types (release builds, MSAN, UBSAN, non-debug ASAN) to ensure that different kinds of failures can be discovered;
  • data upgrade tests, covering live, dump-based, and mariabackup-based upgrades from earlier versions to the new one (a sketch of such a check follows this list).
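A minimal sketch of the dump-based and live variants of such an upgrade check, with installation paths as placeholders (the real tests are automated and randomized):

    # Dump-based upgrade: dump from the old server, load into the new one.
    /opt/mariadb-old/bin/mariadb-dump --all-databases > dump.sql
    /opt/mariadb-new/bin/mariadb < dump.sql

    # Live (in-place) upgrade: stop the old server, start the new binaries
    # on the same datadir, then run mariadb-upgrade on the system tables.
    /opt/mariadb-old/bin/mariadb-admin shutdown
    /opt/mariadb-new/bin/mariadbd --datadir=/var/lib/mysql &
    /opt/mariadb-new/bin/mariadb-upgrade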

Final merges are tested by Buildbot in the same way as described for main branches. The results are inspected, and the packages built by Buildbot become the maintenance releases you know and surely love.

The path of a feature into a new RC

The early life of a feature in MariaDB differs somewhat from that of a bugfix. We are naturally trying to avoid creating even more technical debt than we already have, so features nowadays undergo a stricter upbringing than they did even four years ago.

A new feature is initially developed in a separate branch, normally based on the main branch (in this context, the “main branch” refers to the branch actually called “main”). As with bugfixes, every push to the MariaDB/server repository is picked up by Buildbot, so the developer can inspect the results of their work in progress.

When the feature is ready from the developer’s perspective and doesn’t require major changes from the reviewer’s, the developer should also ensure that Buildbot doesn’t show feature-specific problems. Once these conditions are met, the feature task is assigned to a human tester, who will further test it in the development branch.

Meanwhile, the feature is merged into the preview-X.Y-preview branch, where X.Y is the next major MariaDB server version to be released. When the time comes, the branch with all such features is released as “X.Y Preview”. Note that, at this point, the features are generally not yet tested beyond the CI, so they cannot be considered stable, and the preview release must not be used in production. It is an early release to collect real-user feedback on the features’ design.

The test engineer who received the feature task and the development tree for testing analyzes what kind of testing the feature requires and performs it.

From my personal experience, it most commonly includes:

  • new functionality testing: positive and negative testing based on the feature specification, discussions with the feature developer and reviewer, review notes, etc., both manual and half-automated (where the tests are scripted but the results are inspected manually);
  • feature integration testing: it should be considered which existing server functionality the new feature is supposed to interact with and in what way, and the results of such interaction should be verified. For this purpose, I use the test management tool Testiny, where I currently have ~3,500 “test cases” (mostly just elements of functionality or testing directions, from a server variable to a specific engine), extended after each release. For a new feature that I test, I go through the entire list of “test cases” and check for any relevance to the new feature. Whenever an item is relevant, I add it to a list, which usually ends up containing some 100-400 items. Then I go through the resulting list item by item, either checking integration manually, creating a draft MTR test case, or adding the combination to the random tests;
  • MTR line coverage, to see whether the MTR tests accompanying the feature are sufficient (a coverage sketch follows this list);
  • stability tests on different build types, similar to those run for maintenance releases, but tuned for the new feature and its integration;
  • when applicable, component integration, package tests, data upgrade tests, etc.
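For the line-coverage item above, the general shape of such a check is a coverage-instrumented build plus a gcov/lcov report over the feature’s MTR tests. A sketch, where ENABLE_GCOV and the test name are assumptions/placeholders on my part:

    # Coverage-instrumented build (ENABLE_GCOV is the CMake option I'd expect
    # here; verify it exists for your branch, e.g. with `cmake -LH`).
    cmake .. -DCMAKE_BUILD_TYPE=Debug -DENABLE_GCOV=ON
    cmake --build . -j"$(nproc)"

    # Run the MTR tests accompanying the feature, then collect coverage.
    (cd mysql-test && ./mtr main.new_feature_test)    # placeholder test name
    lcov --capture --directory . --output-file coverage.info
    genhtml coverage.info --output-directory coverage-report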

If the feature is performance-related or performance-risky, performance testing is done, either by the dedicated performance team or by the developers themselves. Otherwise, benchmarking is done later for the upcoming release, as usual.
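As an illustration of the kind of workload used in such benchmarking (the actual performance setups are more elaborate; sysbench and its bundled oltp_read_write script are shown here only as a common example, with connection settings as placeholders):

    # Illustrative sysbench OLTP benchmark against a locally running server.
    sysbench oltp_read_write \
      --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-db=test \
      --tables=8 --table-size=100000 prepare
    sysbench oltp_read_write \
      --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-db=test \
      --tables=8 --table-size=100000 --threads=16 --time=300 run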

During feature testing, the tester reports discovered bugs and identifies those that must be fixed before the feature release.

By the end of the testing cycle, the tester either issues explicit approval (signs off on the feature) or declares it not ready for release. The outcome mainly depends on two factors: whether the tester considers the performed testing sufficient, and whether there are bugs left that must be fixed before the feature release.

If, by the time of the X.Y.1 RC release, the feature has received approval, the development branch is rebased on the latest main, Buildbot builds/tests the push, and, if branch protection permits and results are satisfactory, the feature branch is merged into the “main” branch. The merge goes through the same Buildbot and out-of-the-Buildbot tests as all main branches. If the feature didn’t get approval, it is postponed until the next release.

Before the release, the “main” branch becomes the X.Y main branch, from which the new RC is released.

So, with all this seemingly endless testing, how do the “Do they even …” situations happen?

First, the obvious — no testing is ever perfect. It’s not something you want to say to a frustrated user or a raging manager, but that’s the truth. While we analyze every specific blunder to figure out why it happened and make sure it won’t happen again, there’s no magic trick to prevent them all in advance; they will happen sometimes. Could we do better? Sure, but in my opinion, it’s not the amount of testing that’s our biggest bottleneck at the moment (or in recent years). It’s no secret, as JIRA shows: there are 1,186 bugs marked as fixed in the past 365 days in the MariaDB JIRA “MariaDB Server” (MDEV) project, and 1,418 bugs filed internally — by MariaDB employees and foundation members — not counting those from external users (another ~700). So, we already file more bugs than we can fix, which brings us to two closely intertwined problems: relevance and priorities.

We file lots of bugs internally, but internally filed bugs are often treated as artificial issues (as they often are; see, for example, MDEV-28900). Many of them come in the form of debug assertion failures or instrumentation errors, which of course a real user will never see; it may well be that they have no effect on real users at all, or that what the user gets and reports instead is a vague, seemingly random problem some time later. But the relevance can’t be known until the bug is fully analyzed (and maybe not even then).

At the same time, we — quite naturally and reasonably — try to treat bugs from real users with higher priority than internal ones, because those are real-life problems users suffer from. But it’s no secret that analyzing a failure in someone’s production environment is dozens, even hundreds of times more resource-consuming than analyzing an internal one. So, the more time our developers spend on user bugs, the fewer internal bugs get fixed, the more of them end up in the release, and the more vague reports we get from users.

I could say that we need a breakthrough, that we have to reach the point where we start fixing more bugs than we and users file combined, and that it’s the only way to solve the problem, but it’s easier said than done. Fixing more bugs requires resources which are not always available; reporting fewer bugs won’t do any good, for obvious reasons. There are still potentially helpful steps we could take, but for those, we would need active help from the community.

We don’t need more tests, at least not at this point; we need different tests. It’s meaningless to expect “more realistic tests” to be created internally, because everything created internally is artificial by definition. We need tests that represent the datasets and workflows of real users/products, and those can only be obtained from real users and products. There’s a set of tasks in the JIRA “MariaDB Foundation” (MDBF) project for ecosystem testing, mainly concerning connectors, but I suppose over time, it could be extended to open-source products that come with non-confidential datasets and tests used in their UAT (User Acceptance Testing), etc. However, even the existing track seems stalled, likely due to a lack of resources. Nevertheless, if there are users — e.g., maintainers of open-source products using MariaDB — who have publicly available tests that can be deployed and run non-interactively outside their internal test environments and are willing to share them, we could try to integrate them into our test systems.

The other side of this is adjusting the priorities of existing bug reports. Obviously, we must treat issues real users encounter with higher priority, but the problem is that we don’t know when users encounter bugs that already exist in our bug database — we can only see which bugs they report. So, if a responsible user observes a failure, searches JIRA, and finds a bug report for it, they think there’s nothing else to do but wait for it to be fixed. We appreciate the effort such a user makes in not filing a duplicate report; it’s very important. But we also remain unaware that real users encounter bugs we filed from internal tests and that stay in our backlog. What we need is some indication that a user is affected, e.g., a comment or a vote. Then we could re-prioritize such bugs accordingly.

So, dear MariaDB users, you don’t even have to be developers to contribute — start sharing those real-world scenarios and flagging bugs affecting your systems, and we’ll do the rest.
