<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Bacon and Tech &#187; Uncategorized</title>
	<atom:link href="http://www.baconandtech.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.baconandtech.com</link>
	<description>Because everything's better with bacon</description>
	<lastBuildDate>Wed, 28 Jul 2010 23:08:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Optimizing aggregate use using subselects</title>
		<link>http://www.baconandtech.com/2009/06/04/optimizing-aggregate-use-using-subselects/</link>
		<comments>http://www.baconandtech.com/2009/06/04/optimizing-aggregate-use-using-subselects/#comments</comments>
		<pubDate>Thu, 04 Jun 2009 17:13:21 +0000</pubDate>
		<dc:creator>selenamarie</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[portal]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://www.baconandtech.com/?p=64</guid>
		<description><![CDATA[Mark Wong is learning about the database called Portal that Portland State University researchers maintain. It contains a bunch of interesting data about our roads and traffic, gathered from lots of sensors installed along our highways in Oregon.
Here was the original query he posted for optimization:
-- Show each detector, its state, and what percentage of [...]]]></description>
			<content:encoded><![CDATA[<p>Mark Wong is learning about the database called Portal that Portland State University researchers maintain. It contains a bunch of interesting data about our roads and traffic, gathered from lots of sensors installed along our highways in Oregon.</p>
<p>Here was the original query he posted for optimization:</p>
<pre>-- Show each detector, its state, and what percentage of time it is ok or not.
SELECT t1.detectorid, description, COUNT(*) AS count, total,
       CAST(COUNT(*) AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM loopdata_2007_01_01 t1, loop_status,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t1.detectorid = t3.detectorid
  AND t1.status = loop_status.status
GROUP BY t1.detectorid, description, t3.total
ORDER BY t1.detectorid, description, t3.total</pre>
<p>So, my first approach to optimizing this was to just materialize the data from the aggregates. COUNT(*) across these tables was killing the performance.</p>
<p>Here was my quick and dirty mockup, plus an example of a trigger to add to the schema in case we wanted to add it to the load scripts:</p>
<pre>
BEGIN;
-- DROP TABLE selena.detectorid_count;
CREATE TABLE selena.detectorid_count (
        detectorid integer not null PRIMARY KEY,
        count integer not null
);

-- DROP FUNCTION update_count;
CREATE OR REPLACE FUNCTION update_count() RETURNS trigger as
$update_count$
DECLARE
        _count integer;
BEGIN
        if (TG_OP = 'UPDATE') THEN
                -- Don't care about updates
                RETURN NULL;
        ELSIF (TG_OP = 'INSERT') THEN
                SELECT count INTO _count FROM selena.detectorid_count WHERE detectorid = NEW.detectorid;
                IF _count IS NULL THEN
                        INSERT INTO selena.detectorid_count VALUES(NEW.detectorid, 1);
                ELSE
                        UPDATE selena.detectorid_count d SET d.count = _count + 1 where d.detectorid = NEW.detectorid;
                END IF;
        END IF;
        RETURN NULL;
END;
$update_count$ LANGUAGE plpgsql;

-- DROP TRIGGER update_count on loopdata_2007_01_01;
CREATE TRIGGER update_count AFTER INSERT ON loopdata_2007_01_01 FOR EACH ROW EXECUTE PROCEDURE update_count();

-- DROP TABLE selena.test.agg;
CREATE TABLE selena.test_agg (
        detectorid integer not null,
        description text not null,
        total integer not null,        PRIMARY KEY (detectorid, description, total)
);

INSERT INTO selena.test_agg SELECT t1.detectorid, description, count(*) as count from loopdata_2007_01_01 t1, loop_status WHERE t1.status = loop_status.status GROUP B
Y t1.detectorid, description;

explain analyze SELECT t1.detectorid, description, t1.total as count, t3.count,
       CAST(t1.total AS NUMERIC) / CAST(t3.count AS NUMERIC) * 100.0 AS percentage
FROM selena.test_agg t1,
      selena.detectorid_count t3
WHERE t1.detectorid = t3.detectorid
ORDER BY t1.detectorid, description, t3.count;

ROLLBACK;
-- COMMIT;
</pre>
<p>The resulting query only took 7 ms!  So, Mark kind of thought that was cheating &#8211; since obviously, if I pre-calculate the values, the queries will be faster.</p>
<p>Looking at the EXPLAIN ANALYZE:</p>
<pre>
explain analyze SELECT t4.detectorid, t4.description, t4.count, total,
       CAST(t4.count AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM  (SELECT t1.detectorid, description, count(*) as count
        from loopdata_2007_01_01 t1, loop_status
        where t1.status = loop_status.status
        GROUP BY t1.detectorid, description) t4,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t4.detectorid = t3.detectorid
ORDER BY t4.detectorid, t4.description, t3.total
;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=710083.07..711078.47 rows=398161 width=38) (actual time=22667.435..22668.159 rows=1386 loops=1)
   Sort Key: t1.detectorid, loop_status.description, t3.total
   Sort Method:  quicksort  Memory: 198kB
   ->  Merge Join  (cost=602939.70..651271.14 rows=398161 width=38) (actual time=17907.248..22665.778 rows=1386 loops=1)
         Merge Cond: (t1.detectorid = t3.detectorid)
         ->  GroupAggregate  (cost=544650.66..573487.36 rows=126200 width=22) (actual time=14246.234..18997.411 rows=1386 loops=1)
               ->  Sort  (cost=544650.66..551465.46 rows=2725920 width=22) (actual time=14242.307..17137.123 rows=2725920 loops=1)
                     Sort Key: t1.detectorid, loop_status.description
                     Sort Method:  external merge  Disk: 50000kB
                     ->  Hash Join  (cost=45.33..85556.32 rows=2725920 width=22) (actual time=0.076..5438.041 rows=2725920 loops=1)
                           Hash Cond: (t1.status = loop_status.status)
                           ->  Seq Scan on loopdata_2007_01_01 t1  (cost=0.00..44622.20 rows=2725920 width=4) (actual time=0.032..1685.744 rows=2725920 loops=1)
                           ->  Hash  (cost=25.70..25.70 rows=1570 width=22) (actual time=0.030..0.030 rows=6 loops=1)
                                 ->  Seq Scan on loop_status  (cost=0.00..25.70 rows=1570 width=22) (actual time=0.016..0.021 rows=6 loops=1)
         ->  Materialize  (cost=58289.03..58303.23 rows=631 width=10) (actual time=3660.990..3663.363 rows=1386 loops=1)
               ->  Subquery Scan t3  (cost=58289.03..58296.92 rows=631 width=10) (actual time=3660.983..3662.162 rows=631 loops=1)
                     ->  Sort  (cost=58289.03..58290.61 rows=631 width=2) (actual time=3660.980..3661.429 rows=631 loops=1)
                           Sort Key: t2.detectorid
                           Sort Method:  quicksort  Memory: 54kB
                           ->  HashAggregate  (cost=58251.80..58259.69 rows=631 width=2) (actual time=3659.995..3660.390 rows=631 loops=1)
                                 ->  Seq Scan on loopdata_2007_01_01 t2  (cost=0.00..44622.20 rows=2725920 width=2) (actual time=0.044..1633.167 rows=2725920 loops=1)
 Total runtime: 22686.007 ms
(22 rows)
</pre>
<p>We shaved 4000 ms off the original query!  Signficant savings, but not terrific performance. Ultimately, the solution will be to materialize the aggregate calculations ahead of time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.baconandtech.com/2009/06/04/optimizing-aggregate-use-using-subselects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
