Now that I've had a look at the database, I'm more concerned about the size than I was. The index size for the main table is way larger than memory on any reasonable DB instance, so we're essentially going to be doing index scans against the disk. I'm not sure how fast the disk will end up being, but the peak speeds I've seen are only 200MB/sec, which is 10 minutes to read the index and 2 hours to tablescan. (I was seeing 20mb/sec earlier, but apparently there's a first read issue that's really slow on a new instance).
I did a quick test on the dev sized database, and it looks like the main table can be shrunk to ~1/4 of it's size by normalizing the json documents in featureId and summary into numeric columns. This has the added benefit of not requiring json parse and numeric casts for filtering and summation, which can ease the load if the database winds up being processor constrained. (see below for the query to make the table, I've attached the patch to the api server as well.)
That doesn't affect the indexes, but there are a couple things that can be done with them:
- Convert the scenarioId field to an integer, referencing a separate table. This saves a few bytes per row, but it also makes comparisons faster since they don't have to deal with locale based string comparison. This saves about 25% on the small db.
- Change the 2 indexes on scenarios to one 2 column index: e.g.
create index scenarios_temp_scenario_id_idx on scenarios_temp (scenario_id, "featureId"); It's safe because the featureId is only ever queried with the scenario_id. This change makes a further 50% reduction in the index size, so it should probably be done regardless of any other changes.
The net change on the dev system is:
table_name | table_size | indexes_size | total_size
-------------------------------------+------------+--------------+------------
"public"."scenarios" | 5216 kB | 656 kB | 5872 kB
"public"."scenarios_temp" | 1136 kB | 240 kB | 1376 kB
(where scenarios_temp is the alternate version).
If this ratio holds, we'd be looking at something like 340GB of data and 50GB of indexes, for a total db size around 400GB.
For reference, this is what I'm seeing on the production RDS:
table_name | table_size | indexes_size | total_size
-------------------------------------+------------+--------------+------------
"public"."scenarios" | 1334 GB | 138 GB | 1471 GB
Sql to do the table follows:
-- query to make normalized table
select
"scenarioId",
"featureId",
"modelId",
("filterValues"->>'CurrentMvLineDist')::numeric as "CurrentMvLineDist",
("filterValues"->>'RoadDist')::numeric as "RoadDist",
("filterValues"->>'GridCellArea')::numeric as "GridCellArea",
("filterValues"->>'InvestmentCapita2025')::numeric as "InvestmentCapita2025",
("filterValues"->>'InvestmentCapita2030')::numeric as "InvestmentCapita2030",
(summary->>'Pop2018')::numeric as "Pop2018",
(summary->>'Pop2025')::numeric as "Pop2025",
(summary->>'Pop2030')::numeric as "Pop2030",
(summary->>'ElecCode2018')::numeric as "ElecCode2018",
(summary->>'FinalElecCode2025')::numeric as "FinalElecCode2025",
(summary->>'FinalElecCode2030')::numeric as "FinalElecCode2030",
(summary->>'InvestmentCost2025')::numeric as "InvestmentCost2025",
(summary->>'InvestmentCost2030')::numeric as "InvestmentCost2030",
(summary->>'NewCapacity2025')::numeric as "NewCapacity2025",
(summary->>'NewCapacity2030')::numeric as "NewCapacity2030",
(summary->>'ElecStatusIn2025')::numeric as "ElecStatusIn2025",
(summary->>'ElecStatusIn2030')::numeric as "ElecStatusIn2030"
into scenarios_temp
from scenarios;
-- queries to make a lookup table for scenario_id (index optimization 1)
create table scenario_id ( "scenerioId" text primary key, id serial );
insert into scenario_id ("scenerioId") select distinct "scenarioId" from scenarios_temp;
create unique index scenario_id_id_idx on scenario_id (id);
alter table scenarios_temp add column scenario_id integer references scenario_id(id);
update scenarios_temp set scenario_id=scenario_id.id from scenario_id where >scenario_id."scenerioId" = scenarios_temp."scenarioId";
-- and drop the other column, but I haven't coded support for this
alter table scenarios_temp drop column "scenerioId";
Regarding serializing JSONs fields, I believe we can't do this because the flexibility of setting any parameter from input data as a filter is a requirement. I think we could try to add a mapping table for parameters names, but I believe PostgreSQL already does that internally.
Analysis of the production database by Eric Sooros:
Eric, I agree that we can benefit from:
create index scenarios_temp_scenario_id_idx on scenarios_temp (scenario_id, "featureId");Regarding serializing JSONs fields, I believe we can't do this because the flexibility of setting any parameter from input data as a filter is a requirement. I think we could try to add a mapping table for parameters names, but I believe PostgreSQL already does that internally.