COUNT(*) vs COUNT(1) vs COUNT(id)

COUNT(*) returns the number of rows, and COUNT(foo) returns the number of rows with non-null foo. But there are lots of myths regarding COUNT(*), mostly coming from its syntax resembling SELECT *. Let’s bust them!

In the SQL standard COUNT(*) semantics is considered separately from COUNT(anything_else). As per specification, they work differently. Also you cannot write e.g. SUM(*) or TO_CHAR(*). Asterisk instead of an argument is something that only COUNT can accept, I doubt the standard allows it in any other case. However, the standard is poor and never considers generic aggregate functions, including custom ones. Despite the latter being present in all modern relational DBMSes, at least in Postgres, Oracle and MS SQL Server.

Regarding Postgres implementation, COUNT(*) is a call to an aggregation function without parameters, and COUNT(something_else) is a call to COUNT aggregation function with one parameter. Thanks to polymorphism, these are different functions with separate code bodies.

An aggregate function is, roughly, a subroutine with two entry points. A multiply used entry point A that accepts a chunk of data, and an entry point B, which is used only once to ask for the result. In fact, there are more entry points and their usage patterns, but it’s out of scope here, as well as the internal structure of the subroutine.

For historical reasons, the syntax for an aggregate function call looks like aggregate_function_name(*) if entry point A accepts no parameters, like aggregate_function_name(foo) if one, or e.g. AGGREGATE_FUNCTION_NAME(foo, bar, baz) if it accepts 3 arguments. It’s quite inconsistent, as call of an ordinary function or a set-returning function (generator analogue) with no arguments is written as e.g. now(), not as now(*).

If we were to develop SQL language from scratch in 2019, instead of COUNT(*) syntax we would require COUNT(), for not to confuse anyone.

In theory, we can create another custom aggregate function with no parameters, but I don’t think anything but COUNT would make sense. What useful info can be returned by an aggregate function with no parameters, other than the number of times it’s been called? Here’s an example that doesn’t claim to be useful:

CREATE FUNCTION is_count_even_step(state bool) RETURNS bool AS $$
BEGIN
    RETURN NOT state;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE is_count_even(*) (
    SFUNC = is_count_even_step,
    STYPE = bool,
    INITCOND = true
);

CREATE TABLE table_with_2_rows AS SELECT generate_series(1, 2);
CREATE TABLE table_with_5_rows AS SELECT generate_series(1, 5);
SELECT COUNT(*) FROM table_with_2_rows; -- returns 2
SELECT is_count_even(*) FROM table_with_2_rows; -- returns true
SELECT COUNT(*) FROM table_with_5_rows; -- returns 5
SELECT is_count_even(*) FROM table_with_5_rows; -- returns false

Okay, now we know that COUNT(*) is innocent. Is it any better that COUNT(1) or COUNT(id)? Yes, it’s faster because Postgres won’t have to pass and examine the parameter value. Much faster?

On my installation, SELECT COUNT(*) FROM generate_series(1, 10000000) id and SELECT COUNT(1) FROM generate_series(1, 10000000) id take about 4000ms, while SELECT COUNT(ID) FROM generate_series(1, 10000000) id is rather 4500ms, and the overhead looks proportional to the number of rows. To see the difference between COUNT(*) and COUNT(1) we’ll have to execute each of them  in a subquery that runs multiple times, like this:

SELECT FROM (
    SELECT (SELECT COUNT(*))
    FROM generate_series(1, 10000000)
)_;

With COUNT(*) it typically completes in 4800ms, while COUNT(1) variant needs 5000ms. Not too much difference, but still a bit of.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s