Because everything’s better with bacon

comprar sildenafil viagra prijs pilule levitra tadalafil moins cher vendo viagra vendo cialis cialis te koop acheter cialis sur internet acquisto viagra senza ricetta medicament cialis levitra italia vardenafil generico prix de cialis compro levitra viagra sans prescription medicament levitra acquisto levitra vardenafil bestellen trouver du levitra cialis pharmacie levitra sur internet generique du viagra sildenafil bestellen compro viagra viagra kosten comprar cialis cialis venta libre commander kamagra compra viagra acheter cialis en belgique ordina levitra kamagra te koop viagra pharmacie pharmacie en ligne acquisto viagra on line levitra france impuissance erection cialis kauf achat cialis 20mg levitra sur le net viagra donne generische levitra comprar cialis generico viagra ricetta acheter tadalafil commander du cialis acheter cialis internet viagra farmacia costo levitra cialis ohne rezept cialis vente libre viagra quanto costa levitra en pharmacie posologia viagra acquisto viagra zithromax generique ordina viagra acheter isotretinoine viagra rezeptfrei tadalafil generique comprar vardenafil generique du cialis commander du viagra vente levitra acheter kamagra 100mg generische viagra achat cialis propecia prix viagra dosaggio tadalafil 10 mg levitra generico compro levitra achat viagra en ligne acheter kamagra kamagra pharmacie aquisto viagra acheter cialis en espagne trouble erection viagra ordonnance cialis donne vente viagra cialis receta cialis vente en ligne vendita levitra viagra recensioni acheter zyban kamagra oral jelly acheter cialis en pharmacie acheter finasteride viagra te koop levitra venta cialis ricetta medica vardenafil generique sildenafil receta acheter cialis pas cher pastilla viagra viagra ricetta medica medicament impuissance comprar levitra generica impotenza sessuale tadalafil precio achat cialis generique viagra svizzera cialis belgique acheter clomid en france viagra cialis differenze cialis livraison rapide levitra rezeptfrei dysfonction erectile acheter du cialis cialis generico vente de cialis acheter cialis sur la net cialis effet secondaire kamagra rezeptfrei levitra precio acquisto viagra svizzera impuissance homme compro sildenafil prozac sans ordonnance pastilla sildenafil comprar viagra em portugal compro cialis levitra pharmacie prezzi levitra kamagra generique acquista levitra vendo viagra milano sildenafil rezeptfrei viagra fur frauen viagra effet secondaire cialis prescrizione

Tuesday, January 12, 2010

Array Sorting

I have this little sorting problem that had been bothering me for a month, but as it’s only a minor annoyance, I hadn’t spared the cycles to work on it. Last Friday I decided I needed an instant gratification project, and set about solving it.

I have a table that contains hostnames, cards, and the slots on the host chassis those cards are installed in. All datatypes are varchars. The sorting problem arises because I have some cards that are actually installed on other cards; those are referenced by a slot/subslot designation.

Here’s some sample data:

testytest=# SELECT * FROM cards ORDER BY card_slot;
hostname | card_slot |   card_model
----------+-----------+----------------
zucchini | 0         | card
zucchini | 0/0       | daughter card
zucchini | 0/1       | daughter card
zucchini | 1         | card
zucchini | 1/0       | daughter card
zucchini | 1/1       | daughter card
zucchini | 1/15      | daughter card
zucchini | 1/2       | daughter card
zucchini | 17        | something else
zucchini | 18        | something else
zucchini | 2         | another card
zucchini | 3         | another card
zucchini | 4         | another card
(13 rows)

I’d like to see 1/15 come after 1/2, and 17 & 18 come after 2, 3, & 4. (Like I said, it’s a minor annoyance. But still an annoyance.) I need to be able to sort both pieces in numerical order. I went through some weird gymnastics writing functions to split out & return each piece, but each of my solutions introduced other problems. (These instant gratification projects so rarely are, eh.)

Before I wrapped things up for the day, I read the PostgreSQL docs about array functions & operators. That percolated around in my brain and a solution came to me while I was out hiking over the weekend:

First, I reCAST string_to_array & made it return a set of integers:
CREATE OR REPLACE function foo(varchar(15)) RETURNS integer[]
AS $$
SELECT CAST(string_to_array($1, '/') AS integer[])
$$
LANGUAGE SQL;

Since Pg includes < and > array functions, I expected to be able to sort by the array my function returned:

testytest=# SELECT hostname, card_slot, card_model, foo(card_slot) AS sort_value
FROM cards
ORDER BY sort_value;
hostname | card_slot |   card_model   | sort_value
----------+-----------+----------------+------------
zucchini | 0         | card           | {0}
zucchini | 0/0       | daughter card  | {0,0}
zucchini | 0/1       | daughter card  | {0,1}
zucchini | 1         | card           | {1}
zucchini | 1/0       | daughter card  | {1,0}
zucchini | 1/1       | daughter card  | {1,1}
zucchini | 1/2       | daughter card  | {1,2}
zucchini | 1/15      | daughter card  | {1,15}
zucchini | 2         | another card   | {2}
zucchini | 3         | another card   | {3}
zucchini | 4         | another card   | {4}
zucchini | 17        | something else | {17}
zucchini | 18        | something else | {18}
(13 rows)

Voila. Works as expected on my small (~2000 rows) data set. It also fails as expected when passed bad data (eg something with a text string).

posted by gabrielle at 12:03 am  

Thursday, June 4, 2009

Optimizing aggregate use using subselects

Mark Wong is learning about the database called Portal that Portland State University researchers maintain. It contains a bunch of interesting data about our roads and traffic, gathered from lots of sensors installed along our highways in Oregon.

Here was the original query he posted for optimization:

-- Show each detector, its state, and what percentage of time it is ok or not.
SELECT t1.detectorid, description, COUNT(*) AS count, total,
       CAST(COUNT(*) AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM loopdata_2007_01_01 t1, loop_status,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t1.detectorid = t3.detectorid
  AND t1.status = loop_status.status
GROUP BY t1.detectorid, description, t3.total
ORDER BY t1.detectorid, description, t3.total

So, my first approach to optimizing this was to just materialize the data from the aggregates. COUNT(*) across these tables was killing the performance.

Here was my quick and dirty mockup, plus an example of a trigger to add to the schema in case we wanted to add it to the load scripts:

BEGIN;
-- DROP TABLE selena.detectorid_count;
CREATE TABLE selena.detectorid_count (
        detectorid integer not null PRIMARY KEY,
        count integer not null
);

-- DROP FUNCTION update_count;
CREATE OR REPLACE FUNCTION update_count() RETURNS trigger as
$update_count$
DECLARE
        _count integer;
BEGIN
        if (TG_OP = 'UPDATE') THEN
                -- Don't care about updates
                RETURN NULL;
        ELSIF (TG_OP = 'INSERT') THEN
                SELECT count INTO _count FROM selena.detectorid_count WHERE detectorid = NEW.detectorid;
                IF _count IS NULL THEN
                        INSERT INTO selena.detectorid_count VALUES(NEW.detectorid, 1);
                ELSE
                        UPDATE selena.detectorid_count d SET d.count = _count + 1 where d.detectorid = NEW.detectorid;
                END IF;
        END IF;
        RETURN NULL;
END;
$update_count$ LANGUAGE plpgsql;

-- DROP TRIGGER update_count on loopdata_2007_01_01;
CREATE TRIGGER update_count AFTER INSERT ON loopdata_2007_01_01 FOR EACH ROW EXECUTE PROCEDURE update_count();

-- DROP TABLE selena.test.agg;
CREATE TABLE selena.test_agg (
        detectorid integer not null,
        description text not null,
        total integer not null,        PRIMARY KEY (detectorid, description, total)
);

INSERT INTO selena.test_agg SELECT t1.detectorid, description, count(*) as count from loopdata_2007_01_01 t1, loop_status WHERE t1.status = loop_status.status GROUP B
Y t1.detectorid, description;

explain analyze SELECT t1.detectorid, description, t1.total as count, t3.count,
       CAST(t1.total AS NUMERIC) / CAST(t3.count AS NUMERIC) * 100.0 AS percentage
FROM selena.test_agg t1,
      selena.detectorid_count t3
WHERE t1.detectorid = t3.detectorid
ORDER BY t1.detectorid, description, t3.count;

ROLLBACK;
-- COMMIT;

The resulting query only took 7 ms!  So, Mark kind of thought that was cheating – since obviously, if I pre-calculate the values, the queries will be faster.

Looking at the EXPLAIN ANALYZE:

explain analyze SELECT t4.detectorid, t4.description, t4.count, total,
       CAST(t4.count AS NUMERIC) / CAST(total AS NUMERIC) * 100.0 AS percentage
FROM  (SELECT t1.detectorid, description, count(*) as count
        from loopdata_2007_01_01 t1, loop_status
        where t1.status = loop_status.status
        GROUP BY t1.detectorid, description) t4,
     (SELECT detectorid, COUNT(*) AS total
      FROM loopdata_2007_01_01 t2
      GROUP BY detectorid
      ORDER BY detectorid) t3
WHERE t4.detectorid = t3.detectorid
ORDER BY t4.detectorid, t4.description, t3.total
;
                                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=710083.07..711078.47 rows=398161 width=38) (actual time=22667.435..22668.159 rows=1386 loops=1)
   Sort Key: t1.detectorid, loop_status.description, t3.total
   Sort Method:  quicksort  Memory: 198kB
   ->  Merge Join  (cost=602939.70..651271.14 rows=398161 width=38) (actual time=17907.248..22665.778 rows=1386 loops=1)
         Merge Cond: (t1.detectorid = t3.detectorid)
         ->  GroupAggregate  (cost=544650.66..573487.36 rows=126200 width=22) (actual time=14246.234..18997.411 rows=1386 loops=1)
               ->  Sort  (cost=544650.66..551465.46 rows=2725920 width=22) (actual time=14242.307..17137.123 rows=2725920 loops=1)
                     Sort Key: t1.detectorid, loop_status.description
                     Sort Method:  external merge  Disk: 50000kB
                     ->  Hash Join  (cost=45.33..85556.32 rows=2725920 width=22) (actual time=0.076..5438.041 rows=2725920 loops=1)
                           Hash Cond: (t1.status = loop_status.status)
                           ->  Seq Scan on loopdata_2007_01_01 t1  (cost=0.00..44622.20 rows=2725920 width=4) (actual time=0.032..1685.744 rows=2725920 loops=1)
                           ->  Hash  (cost=25.70..25.70 rows=1570 width=22) (actual time=0.030..0.030 rows=6 loops=1)
                                 ->  Seq Scan on loop_status  (cost=0.00..25.70 rows=1570 width=22) (actual time=0.016..0.021 rows=6 loops=1)
         ->  Materialize  (cost=58289.03..58303.23 rows=631 width=10) (actual time=3660.990..3663.363 rows=1386 loops=1)
               ->  Subquery Scan t3  (cost=58289.03..58296.92 rows=631 width=10) (actual time=3660.983..3662.162 rows=631 loops=1)
                     ->  Sort  (cost=58289.03..58290.61 rows=631 width=2) (actual time=3660.980..3661.429 rows=631 loops=1)
                           Sort Key: t2.detectorid
                           Sort Method:  quicksort  Memory: 54kB
                           ->  HashAggregate  (cost=58251.80..58259.69 rows=631 width=2) (actual time=3659.995..3660.390 rows=631 loops=1)
                                 ->  Seq Scan on loopdata_2007_01_01 t2  (cost=0.00..44622.20 rows=2725920 width=2) (actual time=0.044..1633.167 rows=2725920 loops=1)
 Total runtime: 22686.007 ms
(22 rows)

We shaved 4000 ms off the original query!  Signficant savings, but not terrific performance. Ultimately, the solution will be to materialize the aggregate calculations ahead of time.

posted by selenamarie at 10:13 am