Systematic design of multi-join GROUP BY queries


Author: Alexey Makhotkin [email protected], ~5400 words.

This is the first public revision of this text. Early readers have shared encouraging feedback, but I’m sure there’s still room for improvement. I’m releasing it now to gather input from a wider audience.

Problem

Many times I’ve seen people asking for help with fixing complicated SQL queries. A common scenario is that they try to solve the problem by using JOIN with multiple tables, sometimes with subqueries. Sometimes they see value multiplication (a resulting numeric value is two or three times bigger than it should be), or they see more rows than there should be. Often they try to use DISTINCT to remove this duplication. In the end, people get hopelessly confused, and ask for help.

Usually this is a frustrating experience for both sides:

  • people are often not even comfortable with sharing the exact text of the SQL query, so they have to censor it by using meaningless identifiers;
  • sharing the table structure is also often not possible, to say nothing of sample data;
  • occasionally people are not even clear about what they are trying to achieve; sometimes this is visible from the fact that two parts of the query seem to contradict each other;
  • very often the query works fine but is just unacceptably slow, and people ask for performance advice.

Approach

I’d like to present a way to systematically design an important class of such queries: the ones that are based on GROUP BY and use multiple joined tables.

The idea of this approach is to design the query incrementally, making sure that each step is correct before proceeding. This would be useful if you tried writing the entire query but cannot make it work. Also, you can use this approach to design the query from the ground up, with hopefully more predictable results.

This is not an SQL tutorial: this text assumes that you understand SQL syntax and know how GROUP BY works. However, it uses a possibly uncommon approach to designing joins; more on that later. There is a lot of historical confusion around JOIN usage, and a lot of leeway in how joins can be used.

It is not required to use this approach from start to finish. It’s possible that you will find your own mistake as you go through the validation steps, and then you can just stop reading, because your query will be fixed.

This text does not depend on the specific database server: you can use it for any OLAP/OLTP solution. You need to generally understand how your system processes queries. We’ll talk about the design space of optimizations that may be available to you.

Main example: social media stats

We’re going to design a query that returns some common stats on social media users. Let’s write down a short human-readable description of what we want to see in each column. This step is skipped surprisingly often.

  • User name (here we mean the unique login name, such as “chrisdate”);
  • Registration date;
  • Number of posts written by the user;
  • Timestamp of the most recent comment;
  • Number of likes on their posts;
  • Number of users subscribed to this user;
  • Number of users that this user follows;
  • Number of comments left on other users’ posts.

The table schema can be found in the appendix; it should not contain any surprises.

We will also use some other examples to illustrate some important distinctions.

An overview of the design process

Here is a list of design steps we’re going to use:

  • Write down a list of columns that you want to see in the output;
  • Write down a “unique key” that would be used in the resulting dataset;
  • Implement a unique key query;
  • Implement each of the aggregation sub-queries;
  • Write down the filtering conditions, separately for the unique key query and sub-queries;
    • Bonus: implicit filtering conditions;
  • Write down ordering conditions;
  • Implement non-aggregation sub-queries;
  • Decide which of the joining approaches are acceptable at this point:
    • joining materialized subqueries;
    • joining CTEs;
    • joining underlying tables directly;
  • Construct the result pipeline.

We will also tackle the question of “duplicate values with JOIN”, something that people try to fix by generously sprinkling DISTINCT on the query.

This post is a companion piece to the newly released “Database Design Book”. It is available as a PDF (145 pages, ~32000 words), or in EPUB format.


On the complexity of examples

We cannot use a truly complicated real-world example, because it would require a lot of setup. We would have to explain the business requirements, table structure, data distribution, and the database server that we’ll be using, and we’d have to provide a lot of sample data, preferably even a dump of production data. This is impossible.

We will necessarily use simple, though non-trivial examples. There is a sort of paradox here, because such simple examples do not really need all that complexity. This tension between simplicity of explanation and complexity of the task is very common. You will have to take a leap of faith in trusting this approach, and try it on your own tasks to see if it works for you.

Step 1. Defining unique key

This is basically a list of columns in an imaginary “GROUP BY” clause. Often this is a single column, sometimes it’s multiple columns.

For our social media example, the unique key would be a single column:

  • user_id (it corresponds to the users.id column which is a primary key);

It means that every output row is uniquely identified by a user ID: for each user there is one row.

Note that our unique key is NOT mentioned in the list of columns from the previous step.

Example of a composite unique key

Imagine a project tracking system, with projects and developers assigned to them. Suppose that we want to prepare a report for each developer per project: the number of hours that a specific developer worked on a specific project, number of tickets completed and open, the closest deadline for all assigned tasks, etc.

In this case, the unique key of our resulting dataset would consist of two fields:

  • project_id;
  • developer_id.

(For example, if a certain developer is assigned to two projects, we want to see two rows for that developer).

We’ll be dealing with single-column and multi-column unique keys uniformly, but for simplicity our main example is single-column.

Step 2. Implementing unique key query

So, the unique key for our example is “user_id”. Now we need to prepare a query that returns a list of user IDs that we could be interested in. Later on, we may filter out some of those IDs, depending on our problem, but not now.

Here, the unique query is very simple:

SELECT id AS user_id FROM users

The result of a unique key query must be, uhm, unique. We use a primary key here, so the uniqueness is guaranteed automatically.

It is essential to validate that the unique key query is specified correctly. In this example it’s really simple, but in more complicated examples it’s easy to choose it incorrectly in the beginning. The good news is that you can fix it, and our approach lets you do that independently from all other subqueries.

Implementing uniqueness

There are three main ways to implement uniqueness:

  • selecting a primary key or a unique key;
  • using GROUP BY without aggregating functions;
  • using DISTINCT;

Here is an example of GROUP BY without aggregating functions:

SELECT project_id, developer_id FROM assigned_projects GROUP BY 1, 2

Note that we used the old-school syntax, “GROUP BY 1, 2”: it is concise, reduces duplication, and most databases support it. If yours doesn’t, just use “GROUP BY project_id, developer_id” (strictly duplicating the list of columns from the SELECT clause).

GROUP BY here could be replaced with DISTINCT, but we use GROUP BY because we may want to extend this later, as we work on this query.
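For comparison, the DISTINCT version of the same unique key query would be:

SELECT DISTINCT project_id, developer_id FROM assigned_projects;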

Another reason why we discourage DISTINCT is because DISTINCT is sometimes used as a band-aid for incorrectly designed queries, and we want to reduce this bad habit.

Example: per-day aggregation

Here is an important example of a unique key query. Suppose that we want to build a sales report per-day, for the last 7 days, with the number of orders and total order sum. It’s a textbook example of grouping and you should be able to implement it without much thinking:

SELECT DATE(created_at), COUNT(*), SUM(total_paid) FROM orders GROUP BY 1;

Right?

Suppose that we have some days without any orders. This query will return only the rows for the days with orders. This may be acceptable to us in this scenario, but sometimes it would be better to have each day in the result: for example, if we want to calculate the average and to divide by the number of rows, or something.

If we need to have seven rows, no matter what, we would begin with a unique key query that looks like this:

SELECT '2025-03-01' AS date
UNION ALL SELECT '2025-03-02'
UNION ALL SELECT '2025-03-03'
UNION ALL SELECT '2025-03-04'
UNION ALL SELECT '2025-03-05'
UNION ALL SELECT '2025-03-06'
UNION ALL SELECT '2025-03-07';

This is, again, a very simple example but it demonstrates the potential problem.

We call it an implicit filtering condition. Our naively written query would have implicitly filtered on the condition “at least one order was placed on that day”. We’ll talk about implicit filtering conditions later.
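As an aside, some databases can generate such a list of dates for you, instead of the UNION ALL chain. For example, on PostgreSQL it could look like this (a dialect-specific sketch):

SELECT generate_series(DATE '2025-03-01', DATE '2025-03-07', INTERVAL '1 day')::date AS date;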

Step 3. Implementing sub-queries

We have six aggregating sub-queries in our example:

  • Number of posts written by the user;
  • Timestamp of the most recent comment;
  • Number of likes on their posts;
  • Number of users subscribed to this user;
  • Number of users that this user follows;
  • Number of comments left on other users’ posts.

They would query four different tables:

  • posts;
  • comments (twice);
  • likes;
  • subscriptions (twice);

Sometimes people just try to JOIN all those tables, add a GROUP BY and then try to extract all the aggregated information in the same query. Sometimes it even works. I won’t even try to implement it here, because I am not sure about my level of SQL knowledge. Let’s make it simpler, in the hope that it would also be easier.

We’re going to write six different queries, each returning a two-column dataset. The first column is our unique key column (user_id), and the second column is an aggregate value.

#1. Number of posts written by the user:

SELECT user_id, COUNT(*) AS posts_count FROM posts GROUP BY user_id;

#2. Timestamp of the most recent comment:

SELECT user_id, MAX(created_at) AS latest_commented_at FROM comments GROUP BY user_id;

#3. Number of likes on their posts;

SELECT posts.user_id, COUNT(*) AS likes_count FROM posts INNER JOIN likes ON posts.id = likes.post_id GROUP BY 1;

#4. Number of users subscribed to this user:

-- number of users subscribed to this user
SELECT timelines.their_id AS user_id, COUNT(*) AS subscribers_count FROM timelines INNER JOIN subscriptions ON timelines.id = subscriptions.timeline_id WHERE timelines.what = 'user' GROUP BY 1;

#5. Number of users that this user follows:

-- number of user’s subscriptions
SELECT subscriptions.user_id, COUNT(*) AS subscriptions_count FROM subscriptions INNER JOIN timelines ON subscriptions.timeline_id = timelines.id WHERE timelines.what = 'user' GROUP BY 1;

#6. Number of comments left on other user’s posts:

SELECT comments.user_id, COUNT(*) AS comments_others_count FROM comments INNER JOIN posts ON comments.post_id = posts.id WHERE comments.user_id <> posts.user_id GROUP BY 1;

The first two queries are textbook-trivial. Query #3 introduces an INNER JOIN. Queries #4 and #5 introduce a table that I did not warn you about (timelines). This is a typical situation in real-world databases: for some historical architectural reason a query may be a bit more complicated than it sounded. Note that #4 and #5 differ only in the column we’re grouping on (for readability, I’ve reversed the order of tables in the INNER JOIN); you really need to document this pair of queries because they are so easy to confuse. Query #6 could be written in different ways, for readability or performance.

What can we see from looking at those six queries?

First, each of them in isolation is pretty simple to write and understand. I’m working on this text at the freediving resort cafe, without access to the real database, and I’m almost sure that I got the queries mostly right, maybe with some trivial mistakes.

Second, good luck attempting to join them all and trying to extract that information back.

Third, each subquery contains GROUP BY on user_id, taken from a different table (“their_id” is also a “user_id” under the “what = 'user'” condition). This is the same unique key that we decided on in the beginning. If one of your subqueries turns out to need a different unique key, it means that you’ve defined the base unique key incorrectly, and you need to fix it. See the “Multi-column aggregating sub-queries” section below.

Fourth, each subquery returns only rows for the users who participated in corresponding activity. If a user never posted anything, they will not be mentioned in the “number of posts per user” subquery (#1). We discussed this problem in a previous section, but for aggregate subqueries it’s not a problem. We had to deal with it only when we implemented the unique key subquery above.

The process so far

So, we had a chance to implement, debug, validate and optimize all subqueries. Now you basically have all the data you would need in the result. You only need to:

  • Implement filtering conditions, both explicit and implicit;
  • Decide on the ordering;
  • Decide how you would combine all the subqueries (join the data, with or without using a JOIN operator);

Multi-column aggregating sub-queries

In a single-column unique key scenario, all sub-queries would have the same unique key. In a multi-column scenario, it’s possible for a sub-query to use a subset of the key.

Let’s go back to our developers/projects example. Suppose that for each “developer per project” we want to see the percentage of total tasks in a project that the developer has resolved. We’ll need the following aggregating sub-queries:

  • number of tasks resolved by this developer, per project;
  • total number of tasks in the project.

Then we’ll divide two numbers and get the percentage.

Here we see that the unique keys for both sub-queries are different.

SELECT project_id, resolved_by AS developer_id, COUNT(*) AS resolved_count FROM tasks WHERE tasks.status = 'Resolved' GROUP BY 1, 2;
SELECT project_id, COUNT(*) AS total_tasks_count FROM tasks GROUP BY 1;

Reminder: the unique key for this example is: (project_id, developer_id).
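Once both sub-queries are ready, the percentage is obtained by joining them on their shared project_id column and dividing. A sketch, with the two sub-queries wrapped into hypothetically named CTEs:

WITH resolved_per_developer AS (
  SELECT project_id, resolved_by AS developer_id, COUNT(*) AS resolved_count
  FROM tasks WHERE tasks.status = 'Resolved' GROUP BY 1, 2
),
project_totals AS (
  SELECT project_id, COUNT(*) AS total_tasks_count FROM tasks GROUP BY 1
)
SELECT r.project_id, r.developer_id,
       100.0 * r.resolved_count / t.total_tasks_count AS resolved_pct
FROM resolved_per_developer AS r
LEFT JOIN project_totals AS t ON r.project_id = t.project_id;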

In the general case, unique keys of sub-queries can be subsets of the base unique key. But again, if you see that your sub-query needs a different column in its unique key, it means that you may need to change the definition of the base unique key, and revisit every other subquery.

A subtle misunderstanding of the base unique key seems to be a common source of mistakes in designing queries. Most of the time this is not a problem, but sometimes this analysis may help.

Filtering conditions

All the queries that we wrote so far return all the data for all the users, even if there are millions of them. We probably need a much smaller subset. We can filter the dataset either in the unique key query, or based on the result of an aggregate sub-query.

Filtering on non-aggregated data

First example: suppose that we want to get the data only for the users who registered in the past month. It’s possible to implement this directly in the base unique key query:

SELECT id AS user_id FROM users WHERE created_at >= '2024-02-01' AND created_at < '2024-03-01';

Here the condition is not using aggregated data, so it should be relatively cheap to execute (but not necessarily).

Filtering on aggregated data

Second scenario is when we want to filter on aggregated data. Suppose that we want to show the user stats, but only for the “big enough accounts”, namely those that have more than 100 followers.

To filter in such a way, we have to execute a subquery, and use its result to filter. In this case, do not augment the base unique key query: it should stay aggregation-free. We’ll just apply this condition later, after we decide on our join strategy.

Note a common potential mistake here: you should not change the aggregated subquery either! This would be a mistake:

-- number of users subscribed to this user (with a MISTAKE)
SELECT timelines.their_id AS user_id, COUNT(*) AS subscribers_count FROM timelines INNER JOIN subscriptions ON timelines.id = subscriptions.timeline_id WHERE timelines.what = 'user' GROUP BY 1 HAVING subscribers_count > 100; -- <== DO NOT DO THIS

Why? Because after we LEFT JOIN the base query with this aggregate query, the “HAVING subscribers_count > 100” condition won’t be applied to the base query. The resulting dataset would contain all IDs from the base query. For accounts with fewer than 100 subscribers, it won’t even show the number: it will show NULL as the subscribers count.

To handle this, you need to remove the “HAVING” clause, and filter the resulting dataset by the normal “WHERE subscribers_count > 100”.

Ordering conditions

Most probably your query would need some ordering conditions. You could order by non-aggregate data (e.g., ordering by user registration date), or by aggregate data (ordering by the number of posts written by the user).

In both cases, ordering conditions would be applied at the very end of the query construction.

Implementing non-aggregation sub-queries

In our example we have two pieces of non-aggregation data:

  • User name (here we mean a unique login name, such as "chrisdate");
  • Registration date;

Handling them looks easy, but is also a source of mistakes, mainly due to what we call sloppy GROUP BY clauses (see below).

To ensure strict SQL we do something that may look like overkill at first sight. Let’s write down the subqueries so that they return two-column datasets: one column for the unique key, and another column for the data:

SELECT id, name FROM users;
SELECT id, created_at FROM users;

It’s not necessary that we use these queries directly, but we write them in this peculiar way because it is composable and prevents mistakes. Later we may decide to “open-code” those queries for more readability, but only when it’s safe to do so.

Now let’s look into what we consider sloppy and strict SQL, before we begin to combine all of this together.

Strict and sloppy GROUP BY clauses

In our example, the user name is unique and uniquely depends on user_id. Naively, it’s sometimes possible to use such a column directly in the GROUP BY clause, but we try to avoid that and keep the SQL query “strict”. Sloppy SQL queries seem to be a common reason for mistakes. Unfortunately, SQL is very forgiving, and it often lets you confuse yourself into a corner.

What we mean by sloppy GROUP BY is snippets like this:
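(For instance, a representative sketch that counts posts per user while also pulling the user name along:)

SELECT users.id AS user_id, users.name, COUNT(*) AS posts_count
FROM users
INNER JOIN posts ON posts.user_id = users.id
GROUP BY users.id, users.name; -- “name” is added to GROUP BY only to make the query accepted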

or even
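-- (again, a representative sketch of the same pattern, taken a step further)
SELECT users.id AS user_id, users.name, users.created_at, COUNT(*) AS posts_count
FROM users
INNER JOIN posts ON posts.user_id = users.id
GROUP BY users.id, users.name, users.created_at; -- every non-aggregated column dumped into GROUP BY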

In both those examples our query relies on the fact that “name” uniquely depends on “user_id”, and so adding it to GROUP BY does not change the result.

Common query processing engines also understand this, so performance may not be affected, because the execution plan would stay the same. However, this is not guaranteed. (Unfortunately, we need to do a lot of hand-waving in some parts of this text, because we want to cover the entire spectrum of database implementations.)

If we implement this as strict SQL (see below), this source of mistakes disappears.

Sloppy GROUP BY: non-aggregated extra columns

Another example of sloppy GROUP BY is the following:

SELECT id, name FROM users GROUP BY id;

Here you can see that the second result column is not aggregated (does not contain COUNT(), SUM() or similar functions). Some query processing engines forbid or discourage such constructs, but many of them are tolerant and this query apparently just works. However, it’s easy to make a mistake here and write a query that would return unstable and unpredictable results. This may happen especially when the query needs to be edited later.
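For example, if such a query is later edited to pull in a column that is not functionally dependent on the grouping key, the result becomes indeterminate. A sketch using our comments table; strict databases will reject it, tolerant ones will return an arbitrary created_at per user:

SELECT user_id, created_at FROM comments GROUP BY user_id;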

Step N. Joining it all together

Before this step we had a number of mistake-free SQL fragments. Now we need to assemble them in such a way that no additional mistakes are introduced.

Note also that all the sub-queries that we have here are very optimization-friendly, taken in isolation. You need to confirm that the execution plan of each subquery uses indexes. We’ll also discuss how a combined query can be optimized in different ways, and how we can preserve those simple and straightforward execution plans.

Thought experiment: materialized subqueries

Let’s do a little thought experiment. Imagine that somebody has executed all the queries that we’ve designed so far, and inserted their results into separate tables. Each of those tables would have one or two columns (more if we have a multi-column unique key). Each of those intermediate tables has a primary key (in our example, “user_id”), and an index on the second column.

With this setup, our hypothetical final query would look like this:

SELECT user_ids.user_id, users.name, users.created_at,
       posts_count, latest_commented_at, likes_count,
       subscribers_count, subscriptions_count, comments_others_count
FROM user_ids
LEFT JOIN users ON user_ids.user_id = users.id
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id
LEFT JOIN latest_commented_at ON user_ids.user_id = latest_commented_at.user_id
LEFT JOIN likes_count ON user_ids.user_id = likes_count.user_id
LEFT JOIN subscribers_count ON user_ids.user_id = subscribers_count.user_id
LEFT JOIN subscriptions_count ON user_ids.user_id = subscriptions_count.user_id
LEFT JOIN comments_others_count ON user_ids.user_id = comments_others_count.user_id
WHERE subscribers_count > 100
ORDER BY users.created_at DESC
LIMIT 50;

What can we see in this query? First, it is very straightforward: we take a base unique key table (user_ids) and then left join it with each intermediate table, plus an additional left join with the “users” table that contains non-aggregate data (user name and registration date). It is so repetitive that you have to be careful when copy-pasting: this has the potential to introduce trivial mistakes.

Second, all the LEFT JOINs that we have here are the most trivial possible: the tables are joined on a primary key! It’s the most performant JOIN possible: each row on the left matches at most one row on the right, so there is no row multiplication. The execution plan would be super boring and straightforward.

Third, note that we used both a filtering condition on aggregate data, and an ordering with a limit. Those two are going to have the most impact on the performance of this query. Depending on the data distribution, your database can decide to:

  • keep only the user_ids that satisfy the subscribers count condition;
  • order the remaining IDs by the registration date, and apply the limit;
  • execute the LEFT JOINs.

Another possible execution plan is the other way round:

  • traverse the “users” table in descending order by the registration date;
  • filter out the IDs that are not in “user_ids”;
  • filter out the IDs that do not satisfy the subscribers count condition;
  • execute the LEFT JOINs;
  • stop when we have 50 rows.

Your database system is probably best equipped to make this decision: a huge amount of effort has gone into optimizing such simple queries, and we made sure to help the optimization process as much as possible, by:

  • introducing primary keys;
  • introducing data indexes;
  • using the most performant joins possible;
  • using very narrow two-column tables.

As we said, filtering and ordering would be the heaviest operations here. You’d be lucky if you could move a filtering condition into a base unique key subquery. You’d be lucky if you did not have a filtering condition. You’d be lucky if you did not have to order the resulting dataset.

But this all depended on pre-materializing the sub-queries. Can we do that with a direct query?

Joining CTEs

We can trivially use the same query if we just define a CTE (common table expression) for each sub-query:

WITH user_ids AS (
  SELECT id AS user_id FROM users
),
posts_count AS (
  SELECT user_id, COUNT(*) AS posts_count FROM posts GROUP BY user_id
)
SELECT user_ids.user_id, users.name, users.created_at, posts_count
FROM user_ids
LEFT JOIN users ON user_ids.user_id = users.id
LEFT JOIN posts_count ON user_ids.user_id = posts_count.user_id
-- and so on, as in the query above, without any changes

We just copied all the subqueries without changes for now. It makes this query look a bit machine-generated, but maybe that’s fine.

Now we can try and run this mega-query, and see what happens. I’m pretty certain that the result would be correct, because we’ve used only the correctness-preserving constructs.

The performance of the mega-query would heavily depend on the database capabilities and data distribution. Hopefully you’ll be lucky and it will just work on your dataset, returning the result in acceptable time.

What if you’re unlucky?

Optimizing correct mega-queries

The general strategy for optimizing such mega-queries is to split them as much as possible.

Depending on your environment, your tooling and the data, the optimization problem can be trivial, if you can tolerate some intermediate steps. In most modern environments you probably can.

Double check subqueries

Before optimizing, check that your subqueries are correct. First, are they even needed? Maybe you don’t need a piece of data at all, or you can move it into a different query.

Maybe you can add a filtering condition on the base unique key subquery, so that fewer rows are considered. (Adding filtering conditions on aggregated subqueries would probably not help.)

Optimize subqueries

Second, confirm that execution plans for subqueries are optimal.

If the query uses a join, make sure that it’s a simple JOIN, one that uses only primary keys. If you use anything other than ID comparison in the JOIN condition, it may be a mistake, especially for LEFT JOINs.

If the query uses a GROUP BY, make sure that the grouping columns are covered by the index.

If the query uses a filtering condition, make sure that there is an index on this condition (or, confirm that the index does not help).
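The exact tooling differs between databases, but typically you would inspect the plan of each fragment in isolation, for example with EXPLAIN (output format varies):

EXPLAIN SELECT user_id, COUNT(*) AS posts_count FROM posts GROUP BY user_id;
-- we expect this to use index_posts_on_user_id from the schema in the appendix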

Building a pipeline

You can just implement our thought experiment directly. If you think about it, our mega-query is a combination of several parts. Each part can be calculated separately. This looks awfully like a data pipeline. Maybe you can have a data pipeline that pre-materializes sub-queries?
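For example, one pipeline step could materialize a single sub-query into its own narrow table (a sketch; the exact syntax and the choice between temporary and regular tables depends on your database and tooling):

CREATE TABLE posts_count AS
SELECT user_id, COUNT(*) AS posts_count FROM posts GROUP BY user_id;

ALTER TABLE posts_count ADD PRIMARY KEY (user_id);
-- repeat for each sub-query, possibly in parallel, then run the final LEFT JOIN query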

All sub-queries can be calculated in parallel, so the total time would be the time of the slowest query.

The intermediate results are very narrow, so they may take a negligible amount of space. A typical aggregate subquery returns two numbers per row (user_id and a count): that’s about 8 bytes per row, so even 100 million rows will take less than a gigabyte of space.

After all the subqueries are prepared, run the final query and here is your result.

Building and running a pipeline may take extra time and disk resources, but it has the benefit of predictability.

One problem with mega-queries is that they are executed all-or-nothing. Also, there is no way to tell how long the query is going to take. With pipelines you can easily see the incremental progress, and you can restart execution from a checkpoint.

You can even fix a single subquery and only re-run it and the final query.

So, compared to any other optimization approaches, splitting a query (or not joining it in the first place) may have an outsized impact.

Select/display split

Most queries can be split into two parts, roughly defined as select and display. Here is a trivial example. Suppose that we need to fetch the names of the users who were born on a certain day. The query is of course elementary:

SELECT name FROM users WHERE day_of_birth = '2000-03-11';

Imagine that the query is much more complicated, such as the one we’re working on. We could split the query in two parts:

SELECT id FROM users WHERE day_of_birth = '2000-03-11';

and

SELECT name FROM users WHERE id IN (...);

The select part returns only a set of IDs that correspond to the selection criteria. The display part returns all the data that corresponds to those IDs.

This is a natural split that cuts the query roughly in half. Where do we get the list of IDs? We can have it as a temporary table, for example. For debugging, we could use a hardcoded array of IDs to improve turn-around time.
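For the trivial birthday example above, it could look something like this (a sketch; temporary table syntax varies slightly between databases):

CREATE TEMPORARY TABLE selected_ids AS
SELECT id FROM users WHERE day_of_birth = '2000-03-11';

SELECT name FROM users WHERE id IN (SELECT id FROM selected_ids);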

In our example, the select-part of the split would be:

SELECT user_ids.user_id
FROM user_ids
LEFT JOIN users ON user_ids.user_id = users.id
LEFT JOIN subscribers_count ON user_ids.user_id = subscribers_count.user_id
WHERE subscribers_count > 100
ORDER BY users.created_at DESC
LIMIT 50;

It returns a list of IDs that correspond to the selection criteria.

The display-part query is the same as the original mega-query, but the WHERE and ORDER BY clauses are a bit different:

WHERE user_ids.user_id IN (...) ORDER BY users.created_at DESC;

(Do not forget to repeat the ORDER BY clause here.)

One benefit of this approach is that you could trivially paginate through the list of IDs, and run the display query incrementally. For many database servers, having a small list of IDs to search for would dramatically simplify the execution plan.

Also, you can process the IDs in parallel, if your task allows it.

Again, this select/display split is another, less dramatic way to split a mega-query.

Summary

Complex analytical queries involving data aggregation and JOINs are often challenging to write. We present a straightforward way to structure such queries.

Our approach breaks the query into a single “unique key query” and multiple “aggregating subqueries”. Each of those queries can be tested and optimized separately.

The final mega-query is conceptually a unique key query left-joined with all aggregating subqueries in a straightforward way. There is considerable flexibility in how this “left join” operation could be physically implemented.

The three most common left join implementations are considered:

  • the actual SQL LEFT JOIN statement, using CTEs or SQL subqueries;
  • pipeline-based approach that materializes each subquery in a separate temporary table, with trivial LEFT JOIN of all materialized tables;
  • avoiding the join altogether, using subquery results directly in the application.

A common sign of suboptimal query design is the use of the DISTINCT clause to avoid “data duplication”. Our approach does not require DISTINCT; it is correct by construction.

Another sign of suboptimal query design is what we call “sloppy GROUP BY”. We avoid it by joining non-aggregated subqueries separately.

Our approach works for all relational databases. It requires small syntactic adjustments according to the capabilities of a specific SQL dialect.

We discuss the general performance characteristics of our approach. It is designed to be generally optimization-friendly, and it explicitly avoids the more problematic constructs. Going into details is out of scope for this article, because of the different capabilities of different database query optimizers. Additionally, query performance would heavily depend on the specific data distribution for a particular database and query.

Appendix: abridged schema of the following tables: comments, likes, posts, timelines, subscriptions, and users.
CREATE TABLE `comments` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `post_id` int(11) NOT NULL,
  `user_id` int(11) NOT NULL,
  `text` text NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `index_comments_on_user_id` (`user_id`),
  KEY `index_comments_on_post_id_and_user_id` (`post_id`,`user_id`),
  KEY `index_comments_on_post_id_and_created_at` (`post_id`,`created_at`)
);

CREATE TABLE `likes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `post_id` int(11) NOT NULL,
  `user_id` int(11) NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index_likes_on_post_id_and_user_id` (`post_id`,`user_id`),
  KEY `index_likes_on_user_id` (`user_id`)
);

CREATE TABLE `posts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `published_at` datetime NOT NULL,
  `text` text NOT NULL,
  `comments_disabled` tinyint(1) DEFAULT '0',
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `index_posts_on_user_id` (`user_id`),
  KEY `index_posts_on_updated_at` (`updated_at`)
);

CREATE TABLE `subscriptions` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `timeline_id` int(11) NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index_u_sl_t` (`user_id`,`timeline_id`),
  KEY `index_subscriptions_on_timeline_id` (`timeline_id`)
);

CREATE TABLE `timelines` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `their_id` int(11) NOT NULL,
  `what` varchar(255) NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index_timelines_on_their_id_and_what` (`their_id`,`what`)
);

CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL,
  `display_name` varchar(50) NOT NULL,
  `status` varchar(255) NOT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime NOT NULL,
  `email` varchar(255) NOT NULL DEFAULT '',
  `encrypted_password` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `index_users_on_email` (`email`),
  KEY `index_users_on_name` (`name`)
);