Optimizing aggregate use using subselects

Mark Wong is learning about the database called Portal that Portland State University researchers maintain. It contains a bunch of interesting data about our roads and traffic, gathered from lots of sensors installed along our highways in Oregon.

Here was the original query he posted for optimization:

-- Show each detector, its state, and what percentage of time it is ok or not.
SELECT t1.detectorid, description, COUNT(*) AS count, total,
       CAST(COUNT(*) AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM loopdata_2007_01_01 t1, loop_status,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t1.detectorid = t3.detectorid
  AND t1.status = loop_status.status
GROUP BY t1.detectorid, description, t3.total
ORDER BY t1.detectorid, description, t3.total

So, my first approach to optimizing this was to just materialize the data from the aggregates. COUNT(*) across these tables was killing the performance.

Here was my quick and dirty mockup, plus an example of a trigger to add to the schema in case we wanted to add it to the load scripts:

BEGIN;
-- DROP TABLE selena.detectorid_count;
CREATE TABLE selena.detectorid_count (
        detectorid integer not null PRIMARY KEY,
        count integer not null
);

-- DROP FUNCTION update_count;
CREATE OR REPLACE FUNCTION update_count() RETURNS trigger as
$update_count$
DECLARE
        _count integer;
BEGIN
        if (TG_OP = 'UPDATE') THEN
                -- Don't care about updates
                RETURN NULL;
        ELSIF (TG_OP = 'INSERT') THEN
                SELECT count INTO _count FROM selena.detectorid_count WHERE detectorid = NEW.detectorid;
                IF _count IS NULL THEN
                        INSERT INTO selena.detectorid_count VALUES(NEW.detectorid, 1);
                ELSE
                        UPDATE selena.detectorid_count d SET d.count = _count + 1 where d.detectorid = NEW.detectorid;
                END IF;
        END IF;
        RETURN NULL;
END;
$update_count$ LANGUAGE plpgsql;

-- DROP TRIGGER update_count on loopdata_2007_01_01;
CREATE TRIGGER update_count AFTER INSERT ON loopdata_2007_01_01 FOR EACH ROW EXECUTE PROCEDURE update_count();

-- DROP TABLE selena.test.agg;
CREATE TABLE selena.test_agg (
        detectorid integer not null,
        description text not null,
        total integer not null,        PRIMARY KEY (detectorid, description, total)
);

INSERT INTO selena.test_agg SELECT t1.detectorid, description, count(*) as count from loopdata_2007_01_01 t1, loop_status WHERE t1.status = loop_status.status GROUP B
Y t1.detectorid, description;

explain analyze SELECT t1.detectorid, description, t1.total as count, t3.count,
       CAST(t1.total AS NUMERIC) / CAST(t3.count AS NUMERIC) * 100.0 AS percentage
FROM selena.test_agg t1,
      selena.detectorid_count t3
WHERE t1.detectorid = t3.detectorid
ORDER BY t1.detectorid, description, t3.count;

ROLLBACK;
-- COMMIT;

The resulting query only took 7 ms!  So, Mark kind of thought that was cheating – since obviously, if I pre-calculate the values, the queries will be faster.

Looking at the EXPLAIN ANALYZE:

explain analyze SELECT t4.detectorid, t4.description, t4.count, total,
       CAST(t4.count AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM  (SELECT t1.detectorid, description, count(*) as count
        from loopdata_2007_01_01 t1, loop_status
        where t1.status = loop_status.status
        GROUP BY t1.detectorid, description) t4,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t4.detectorid = t3.detectorid
ORDER BY t4.detectorid, t4.description, t3.total
;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=710083.07..711078.47 rows=398161 width=38) (actual time=22667.435..22668.159 rows=1386 loops=1)
   Sort Key: t1.detectorid, loop_status.description, t3.total
   Sort Method:  quicksort  Memory: 198kB
   ->  Merge Join  (cost=602939.70..651271.14 rows=398161 width=38) (actual time=17907.248..22665.778 rows=1386 loops=1)
         Merge Cond: (t1.detectorid = t3.detectorid)
         ->  GroupAggregate  (cost=544650.66..573487.36 rows=126200 width=22) (actual time=14246.234..18997.411 rows=1386 loops=1)
               ->  Sort  (cost=544650.66..551465.46 rows=2725920 width=22) (actual time=14242.307..17137.123 rows=2725920 loops=1)
                     Sort Key: t1.detectorid, loop_status.description
                     Sort Method:  external merge  Disk: 50000kB
                     ->  Hash Join  (cost=45.33..85556.32 rows=2725920 width=22) (actual time=0.076..5438.041 rows=2725920 loops=1)
                           Hash Cond: (t1.status = loop_status.status)
                           ->  Seq Scan on loopdata_2007_01_01 t1  (cost=0.00..44622.20 rows=2725920 width=4) (actual time=0.032..1685.744 rows=2725920 loops=1)
                           ->  Hash  (cost=25.70..25.70 rows=1570 width=22) (actual time=0.030..0.030 rows=6 loops=1)
                                 ->  Seq Scan on loop_status  (cost=0.00..25.70 rows=1570 width=22) (actual time=0.016..0.021 rows=6 loops=1)
         ->  Materialize  (cost=58289.03..58303.23 rows=631 width=10) (actual time=3660.990..3663.363 rows=1386 loops=1)
               ->  Subquery Scan t3  (cost=58289.03..58296.92 rows=631 width=10) (actual time=3660.983..3662.162 rows=631 loops=1)
                     ->  Sort  (cost=58289.03..58290.61 rows=631 width=2) (actual time=3660.980..3661.429 rows=631 loops=1)
                           Sort Key: t2.detectorid
                           Sort Method:  quicksort  Memory: 54kB
                           ->  HashAggregate  (cost=58251.80..58259.69 rows=631 width=2) (actual time=3659.995..3660.390 rows=631 loops=1)
                                 ->  Seq Scan on loopdata_2007_01_01 t2  (cost=0.00..44622.20 rows=2725920 width=2) (actual time=0.044..1633.167 rows=2725920 loops=1)
 Total runtime: 22686.007 ms
(22 rows)

We shaved 4000 ms off the original query!  Signficant savings, but not terrific performance. Ultimately, the solution will be to materialize the aggregate calculations ahead of time.

You must be logged in to post a comment.