How we sped up a PostgreSQL procedure 50X
Introduction
What I'm about to explain may seem rudimentary to a DBA, but it was overlooked by our distinguished engineer. Once you learn the inner workings it's a "duh" kind of moment where you look back and wonder why you didn't see it.
What is a Postgres procedure
Let's start with the basics. There are PostgreSQL functions and procedures. The major difference between the two is that a function returns a result while a procedure does not. Beyond the return difference, you SELECT a function but CALL a procedure.
Diving into the specifics, a function is almost like a function in any other programming language. It takes inputs and can manipulate them or use them in a SQL statement.
Adding two numbers that are fed as inputs:
CREATE FUNCTION add_numbers (first_number integer, second_number integer)
RETURNS integer
LANGUAGE SQL
IMMUTABLE
AS $$
    SELECT $1 + $2;
$$;
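Since it's a function, you invoke it with SELECT:

SELECT add_numbers(2, 3);  -- returns 5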
Taking two inputs and inserting into the users table:
CREATE OR REPLACE PROCEDURE insert_into_user(username varchar(255), user_id integer)
LANGUAGE SQL
AS $$
    INSERT INTO users VALUES (username, user_id);
$$;
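A procedure, on the other hand, is invoked with CALL and returns nothing (this assumes the first two columns of the users table line up with the values):

CALL insert_into_user('some_username', 42);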
Why OFFSET is horrible for large datasets
OFFSET is exactly what it sounds like: the query skips however many rows you set as the offset before it starts returning results.
SELECT * FROM artists LIMIT 5 OFFSET 2;
This reads 7 rows but throws away the first 2, returning 5. That isn't a problem on smaller datasets and can be inconsequential, but on larger datasets in the millions of rows it's unsustainable and takes longer and longer.
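To make the cost concrete, compare an early page with a deep page of the same query. The offsets here are made up for illustration, but the pattern holds: the deeper the page, the more rows Postgres reads just to discard them.

-- Early page: reads 5 rows, returns 5
SELECT * FROM artists ORDER BY artist_id LIMIT 5 OFFSET 0;

-- Deep page: reads 1,000,005 rows, discards 1,000,000, returns 5
SELECT * FROM artists ORDER BY artist_id LIMIT 5 OFFSET 1000000;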
For example, take this semi-pseudocode Postgres function:
batch_size = 100000
cur_offset = 0
rows_inserted = 0
loop
    insert into new_table
        select * from artists
        where artist_id between 1 and 30000000
        limit batch_size offset cur_offset;
    get diagnostics rows_inserted_last = row_count;
    rows_inserted = rows_inserted + rows_inserted_last;
    cur_offset = cur_offset + rows_inserted_last;
    if rows_inserted_last < batch_size then
        return;
    end if;
end loop
This selects 100k rows at a time, looping until a batch returns fewer rows than the batch size, then exits. The first several iterations aren't that bad, but around the 6th or 7th one you have several seconds added, since Postgres has to select 700k rows and throw away the first 600k. Running this against nearly 100 million rows took around 8 hours.
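For reference, here is roughly what that pseudocode looks like as a real PL/pgSQL procedure. The artists and new_table names, the artist_id column, and the 30,000,000 bound come from the example above; the procedure name and the rest are a sketch, not our production code:

CREATE OR REPLACE PROCEDURE copy_artists_with_offset()
LANGUAGE plpgsql
AS $$
DECLARE
    batch_size         integer := 100000;
    cur_offset         bigint  := 0;
    rows_inserted_last bigint;
BEGIN
    LOOP
        -- Every batch re-reads all of the rows before cur_offset
        -- just to throw them away
        INSERT INTO new_table
        SELECT * FROM artists
        WHERE artist_id BETWEEN 1 AND 30000000
        ORDER BY artist_id
        LIMIT batch_size OFFSET cur_offset;

        GET DIAGNOSTICS rows_inserted_last = ROW_COUNT;
        cur_offset := cur_offset + rows_inserted_last;

        COMMIT;  -- a procedure (PostgreSQL 11+) can commit each batch

        EXIT WHEN rows_inserted_last < batch_size;
    END LOOP;
END;
$$;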
Optimizing the query
The simplest change that makes it worlds faster is to remove the OFFSET. The easiest way to do that is to increment the lower bound of the BETWEEN:
batch_size = 100000
lower_limit = 1
rows_inserted = 0
loop
    insert into new_table
        select * from artists
        where artist_id between lower_limit and 30000000
        limit batch_size;
    get diagnostics rows_inserted_last = row_count;
    rows_inserted = rows_inserted + rows_inserted_last;
    lower_limit = lower_limit + batch_size;
    if rows_inserted_last < batch_size then
        return;
    end if;
end loop
What if we took it a step further and removed the LIMIT too? We can do that by setting an upper bound of lower_limit + batch_size - 1 (BETWEEN is inclusive) and advancing both bounds each iteration:
batch_size = 100000
lower_limit = 1
upper_limit = lower_limit + batch_size - 1
rows_inserted = 0
loop
    insert into new_table
        select * from artists
        where artist_id between lower_limit and upper_limit;
    get diagnostics rows_inserted_last = row_count;
    rows_inserted = rows_inserted + rows_inserted_last;
    lower_limit = lower_limit + batch_size;
    upper_limit = upper_limit + batch_size;
    if rows_inserted_last < batch_size then
        return;
    end if;
end loop
Okay, it's looking much better. But we're still exiting when a batch returns fewer rows than the batch size. At first glance this appears to be fine, but what happens if rows have been deleted? A batch in the middle of the table can come up short and end the loop early, silently skipping everything after it. We should instead exit when we reach the max upper bound.
batch_size = 100000
max_size = 30000000
lower_limit = 1
upper_limit = lower_limit + batch_size - 1
rows_inserted = 0
loop
    insert into new_table
        select * from artists
        where artist_id between lower_limit and upper_limit;
    get diagnostics rows_inserted_last = row_count;
    rows_inserted = rows_inserted + rows_inserted_last;
    lower_limit = lower_limit + batch_size;
    upper_limit = upper_limit + batch_size;
    if lower_limit > max_size then
        return;
    end if;
end loop
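Written out as a runnable PL/pgSQL procedure (again a sketch: the procedure name is hypothetical, and the tables and bounds come from the example above), the final version looks something like this:

CREATE OR REPLACE PROCEDURE copy_artists_by_id_range()
LANGUAGE plpgsql
AS $$
DECLARE
    batch_size  integer := 100000;
    max_size    bigint  := 30000000;
    lower_limit bigint  := 1;
    upper_limit bigint;
BEGIN
    LOOP
        upper_limit := lower_limit + batch_size - 1;

        -- Each batch is addressed directly by its id range,
        -- so no rows are read only to be thrown away
        INSERT INTO new_table
        SELECT * FROM artists
        WHERE artist_id BETWEEN lower_limit AND upper_limit;

        COMMIT;  -- keep each batch in its own transaction

        lower_limit := lower_limit + batch_size;
        EXIT WHEN lower_limit > max_size;
    END LOOP;
END;
$$;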
Other Potential Solutions
We can instead use the seek method, which is a simple WHERE clause combined with a FETCH FIRST:
batch_size = 100000
max_size = 30000000
lower_limit = 0
rows_inserted = 0
loop
    insert into new_table
        select * from artists
        where artist_id > lower_limit
        order by artist_id
        fetch first batch_size rows only;
    get diagnostics rows_inserted_last = row_count;
    rows_inserted = rows_inserted + rows_inserted_last;
    lower_limit = lower_limit + batch_size;
    if lower_limit >= max_size then
        return;
    end if;
end loop
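Advancing lower_limit by a fixed step still assumes artist_id values are contiguous. When there are gaps, the seek method is usually implemented by remembering the highest id actually copied and seeking past it. A sketch of that variant, with a hypothetical procedure name and the same assumed tables:

CREATE OR REPLACE PROCEDURE copy_artists_seek()
LANGUAGE plpgsql
AS $$
DECLARE
    batch_size integer := 100000;
    last_id    bigint  := 0;
    next_id    bigint;
BEGIN
    LOOP
        -- Find the highest id within the next batch after last_id
        SELECT max(artist_id) INTO next_id
        FROM (
            SELECT artist_id
            FROM artists
            WHERE artist_id > last_id
            ORDER BY artist_id
            FETCH FIRST batch_size ROWS ONLY
        ) AS batch;

        EXIT WHEN next_id IS NULL;  -- nothing left to copy

        -- Copy exactly that id range, then seek past it
        INSERT INTO new_table
        SELECT * FROM artists
        WHERE artist_id > last_id AND artist_id <= next_id;

        last_id := next_id;
        COMMIT;
    END LOOP;
END;
$$;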