PostgreSQL — Slow count query
As you know PostgresSQL introduced index-only scan in version 9.2 like other DBMS to speed up queries by using index data only if possible, in other words PostgreSQL uses index to produce result for the queries without IO cost of accessing table tuples on disk. Before going to detail of slow count query let’s take a look on related concepts first.
- Heap is the main table data on disk. Usually the data is stored on multiple pages.
- MVCC stands for Multi version Concurrency Control. Basically each SQL statement sees a snapshot of data as it was some time ago, regardless of the current state of the underlying data.
- Visibility Map is a map of bits in which each bit indicates whether or not any tuple in a heap page has been changed recently by any transaction. The bits are only set by vacuum, but they are unset by any change operations on a page.
Now assume that table_a has an B-tree index on column_1 and we want to execute the following count query “SELECT count(*) AS row_count FROM table_a WHERE column_1 > 0”. Obviously, the planner will use index-only scan strategy for the query but sometimes the query is still slow if the table data is changed frequently and row_count is a big number.
WHY? Because statement of the above query works on a snapshot of the table data and the index can not tell if a tuple of table is deleted or updated recently by other transactions or not. So PostgreSQL can not be confident to answer row_count directly from what it sees in index data. What does it do in fact? Based on the index data, PostgreSQL will check Visibility Map to know whether the page of a tuple was changed or not. If any tuple in the page was changed, then PostgreSQL has to fetch the tuple from heap in order to get its actual state. Needless to say, if the data is changed frequently then many bits in Visibility Map are unset accordingly and PostgreSQL has to do a lot of fetches from heap.
HOW to speed the count query? The only way to speed up it is to update the Visibility Map periodically by Vacuum or right after a big update (actually, Vacuum also updates data statistics which is very useful for query planner). However, if the data is changed very frequently then the bits in Visibility Map will be unset frequently too and possibly the count query is still slow.
ALTERNATIVE? If you don’t really need an accurate count number, then you can get an estimated number from EXPLAIN result as suggested in https://wiki.postgresql.org/wiki/Count_estimate
CREATE FUNCTION count_estimate(query text) RETURNS integer AS
$func$
DECLARE
rec record;
rows integer;
BEGIN
FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
EXIT WHEN rows IS NOT NULL;
END LOOP;
RETURN rows;
END
$func$ LANGUAGE plpgsql;