COUNT(*) returns the number of rows, and
COUNT(foo) returns the number of rows with non-null
foo. But there are lots of myths regarding
COUNT(*), mostly coming from its syntax resembling
SELECT *. Let’s bust them!
In the SQL standard
COUNT(*) semantics is considered separately from
COUNT(anything_else). As per specification, they work differently. Also you cannot write e.g.
TO_CHAR(*). Asterisk instead of an argument is something that only
COUNT can accept, I doubt the standard allows it in any other case. However, the standard is poor and never considers generic aggregate functions, including custom ones. Despite the latter being present in all modern relational DBMSes, at least in Postgres, Oracle and MS SQL Server.
Regarding Postgres implementation,
COUNT(*) is a call to an aggregation function without parameters, and
COUNT(something_else) is a call to
COUNT aggregation function with one parameter. Thanks to polymorphism, these are different functions with separate code bodies.
An aggregate function is, roughly, a subroutine with two entry points. A multiply used entry point A that accepts a chunk of data, and an entry point B, which is used only once to ask for the result. In fact, there are more entry points and their usage patterns, but it’s out of scope here, as well as the internal structure of the subroutine.
For historical reasons, the syntax for an aggregate function call looks like
aggregate_function_name(*) if entry point A accepts no parameters, like
aggregate_function_name(foo) if one, or e.g.
AGGREGATE_FUNCTION_NAME(foo, bar, baz) if it accepts 3 arguments. It’s quite inconsistent, as call of an ordinary function or a set-returning function (generator analogue) with no arguments is written as e.g.
now(), not as
If we were to develop SQL language from scratch in 2019, instead of
COUNT(*) syntax we would require
COUNT(), for not to confuse anyone.
In theory, we can create another custom aggregate function with no parameters, but I don’t think anything but
COUNT would make sense. What useful info can be returned by an aggregate function with no parameters, other than the number of times it’s been called? Here’s an example that doesn’t claim to be useful:
CREATE FUNCTION is_count_even_step(state bool) RETURNS bool AS $$ BEGIN RETURN NOT state; END; $$ LANGUAGE plpgsql; CREATE AGGREGATE is_count_even(*) ( SFUNC = is_count_even_step, STYPE = bool, INITCOND = true ); CREATE TABLE table_with_2_rows AS SELECT generate_series(1, 2); CREATE TABLE table_with_5_rows AS SELECT generate_series(1, 5); SELECT COUNT(*) FROM table_with_2_rows; -- returns 2 SELECT is_count_even(*) FROM table_with_2_rows; -- returns true SELECT COUNT(*) FROM table_with_5_rows; -- returns 5 SELECT is_count_even(*) FROM table_with_5_rows; -- returns false
Okay, now we know that
COUNT(*) is innocent. Is it any better that
COUNT(id)? Yes, it’s faster because Postgres won’t have to pass and examine the parameter value. Much faster?
On my installation,
SELECT COUNT(*) FROM generate_series(1, 10000000) id and
SELECT COUNT(1) FROM generate_series(1, 10000000) id take about 4000ms, while
SELECT COUNT(ID) FROM generate_series(1, 10000000) id is rather 4500ms, and the overhead looks proportional to the number of rows. To see the difference between
COUNT(1) we’ll have to execute each of them in a subquery that runs multiple times, like this:
SELECT FROM ( SELECT (SELECT COUNT(*)) FROM generate_series(1, 10000000) )_;
COUNT(*) it typically completes in 4800ms, while
COUNT(1) variant needs 5000ms. Not too much difference, but still a bit of.