We use partitioned tables in our primary PostgreSQL database to efficiently expire old data, as deleting a bunch of data from a huge table can cause database performance to suffer. Prior to version 10, PostgreSQL didn't have native support for partitioned tables, so we used the pg_partman extension to implement partitioning. It works by using PostgreSQL's table inheritance to create child tables of the table that is to be partitioned and triggers to insert data into the child tables rather than the parent table. That extension has worked well for us, but it has a downside -- it isn't an option when you are using Amazon RDS, as it isn't supported. Now that that PostgreSQL has support for native partitions, I figured now was the time to see about dropping that extension so we'd have the option of using RDS.
Our partitioning use-case is fairly simple: we partition tables based on time, creating a new partition for every day, week, or month, depending on how many rows we want to store across all partitions and on how long we want to retain the data. All of our partitioned tables have a created_at
column that will be used to determine which partition stores each row. For example, we might have a table defined like this:
create table events (
project_id integer,
data jsonb,
created_at timestamp
)
partition by range (created_at);
And if we wanted to have weekly partitions, they would look like this:
create table events_p2019_10_28 partition of events for values from ('2019-10-28') to ('2019-11-04');
create table events_p2019_11_04 partition of events for values from ('2019-11-04') to ('2019-11-11');
With a time-based partitioning scheme, deleting old data is as simple as dropping one of the partitions. The regular maintenance we need to do, then, is create new partitions for date ranges as we approach them and delete old partitions containing data we no longer want. To make that maintenance a little easier, I have created the pg_partition_manager gem. Naturally, it's inspired by my experience with the pg_partman extension, which has served us so well.
Let's take a look at how you'd use this gem, given the events table and partitioning scheme described above. You would create a script or rake task that looks like this:
require "pg_partition_manager"
PgPartitionManager::Time.process([{parent_table: "public.events", period: "week", premake: 1, retain: 3}])
The parent_table
is defined as schema.table_name
(public
, the default schema, is often the only one Rails developers end up using). The period
can be day, week, or month. You can choose how many tables you want created in advance (after the current period) with premake
, and how many tables you want to keep (prior to the current period) with retain
. The gem defaults to pre-creating 4 tables if you don't specify premake
, and it defaults to retaining data for 7 days, 4 weeks, and 6 months if you don't specify retain
.
Invoke that script/task with a daily cron job, and you're all set -- it will create and drop the tables as needed.
All ActiveRecord queries act just as they would with a non-partitioned table, so there's nothing you need to change in your code. That is, Event.create
, Event.where
, etc., will work just as they always have, with PostgreSQL putting the data in the right partition for you when you insert it. There is one change you may notice if you have lots of data, though... when you include created_at
in your queries, PostgreSQL won't have to scan all the partitions -- just the ones the cover the range you specify in your where clause.
To recap, if you have a lot of time-based data that you want to delete as it expires, use PostgreSQL partitioned tables and the pg_partition_manager gem to make your app happy. :)