Here at Honeybadger, we have a lot of data, and that presents us with a few problems. One of the biggest challenges is data culling: removing old data that nobody uses anymore while keeping the good stuff.
Complicating matters is the fact that our data is structured hierarchically. There are Parents and Children, and to maintain referential integrity we have to delete the children before we delete the parent. The children, in turn, have associations of their own that need to be cleaned up.
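To make that ordering concrete, here's a toy sketch in plain Ruby. The `Parent`/`Child` structs are stand-ins for our real models, though the association table names (`tickets`, `ratings`, `facets`) are the real ones you'll see in the code below:

```ruby
# Toy stand-ins for the real hierarchy; only the association table
# names match our actual schema.
Parent = Struct.new(:id, :children)
Child  = Struct.new(:id, :association_tables)

parent = Parent.new(1, [
  Child.new(10, %w[tickets ratings]),
  Child.new(11, %w[facets])
])

# Bottom-up: a child's association rows first, then the child itself,
# and the parent only after every child is gone.
deletions = []
parent.children.each do |child|
  child.association_tables.each { |t| deletions << "#{t} rows for child #{child.id}" }
  deletions << "child #{child.id}"
end
deletions << "parent #{parent.id}"

deletions.last # => "parent 1"
```

Whatever technique we use to do the actual deleting, it has to respect this bottom-up order.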
If we had a smaller dataset, we could do something like this:
parents.each do |parent|
  parent.children.each do |child|
    child.destroy
  end
end
Or, heck. Maybe even:
parents.destroy_all
But given the size of our dataset, this approach would take forever. The last thing we want to do is instantiate a model for every row we're deleting, which is exactly what destroy does.
Instead, we're going to do things the hard way, using the magic of temporary tables. We'll use a temporary table as a "scratch pad" so that all of the data stays in Postgres.
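The trickiest part of what follows is the loop that drains the scratch pad a batch at a time. Here's that pattern in plain Ruby, with an Array standing in for the temporary table (the names `scratch` and `pop_batch` are ours, invented for this sketch):

```ruby
# Pretend these are the ids sitting in the temporary table.
scratch = (1..250).to_a

# Remove and return up to `size` ids from the front of the scratch pad,
# like a DELETE ... RETURNING against the temp table.
def pop_batch(scratch, size)
  scratch.shift(size)
end

batch_sizes = []
while (batch = pop_batch(scratch, 100)).any?
  batch_sizes << batch.size # real code would delete the batch's associations here
end

batch_sizes # => [100, 100, 50]; the scratch pad is now empty
```

Each iteration both consumes the batch and hands it to us, so when the loop ends there's nothing left to track. The real version does the same thing in SQL: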
def delete(ids)
  ActiveRecord::Base.transaction do
    conn = ActiveRecord::Base.connection

    #
    # Delete all of the children of these parents, putting their ids
    # in a temporary table as we go. `ids` is an Array of parent ids.
    #
    sql_to_delete_children = %[
      with deleted as (
        delete from errors where parent_id in (#{ids.join(',')}) returning id
      )
      select id into temp child_deletions from deleted
    ]
    conn.execute(sql_to_delete_children)

    #
    # Pop ids from the temporary table, 100 at a time
    #
    sql_to_pop_ids = %[
      with deleted as (
        delete from child_deletions
        where id in (select id from child_deletions order by id limit 100)
        returning id
      )
      select id from deleted
    ]

    while (children = conn.select_values(sql_to_pop_ids)).any?
      #
      # Delete any associations belonging to the children
      #
      %w(tickets ratings facets).each do |t|
        conn.execute("delete from #{t} where error_id in (#{children.join(',')})")
      end
    end

    conn.execute("drop table child_deletions")
  end
end