
Conversation

@juanmanzojr (Contributor) commented Oct 29, 2025

This PR changes how the limit parameter is used in the InactiveUserDeleter job to prevent a large number of users from being deleted in a single run.

Previously, a dry run of the job completely locked up the machine it ran on: because we have so many users considered inactive, the inactive_users.size check ran a very expensive query over the whole set. This PR removes that safety constraint and instead enforces the limit with a break inside the loop, so we no longer run the expensive count but still cap how many users can be deleted per run.
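Roughly, the shape of the change looks like this (a sketch based on the snippets quoted later in this thread; the exact behavior of the old guard when the limit was exceeded is not shown here):

# Before (sketch): the safety constraint counted every inactive user up front.
# .size on an unloaded relation runs a COUNT over the whole matching set.
total_size = inactive_users.size
if total_size > limit
  # bail out; exact behavior not shown in this thread
end

# After (sketch): the cap is applied in the query itself, so the full set of
# inactive users is never counted or loaded in one go.
inactive_users = Queries::User::Inactive.
  call(inactive_since: inactive_since).
  limit(@query_limit)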

This will only really occur for this first batch of inactive users since we have so many.

Follow-up work: bring back the safety constraint?


@carl-codeorg (Contributor) left a comment

Now that I look at it, having two different limit parameters feels confusing to me. What do you think about simplifying it to just having the single limit and removing the safety constraint limit? It also seems to me that if we're artificially limiting the query that would trigger the safety constraint to a number lower than what would actually trigger it, then the safety check becomes useless. What do you think?

@carl-codeorg (Contributor) left a comment

Also, if we were already calling account_batch = inactive_users.limit(BATCH_SIZE), I wonder if the limit in the query itself would even help? Starting to second guess whether it's the query, or if the logic in the loop needs to be refactored.

@juanmanzojr (Contributor, Author) commented Oct 29, 2025

> Now that I look at it, having two different limit parameters feels confusing to me. What do you think about simplifying it to just having the single limit and removing the safety constraint limit? It also seems to me that if we're artificially limiting the query that would trigger the safety constraint to a number lower than what would actually trigger it, then the safety check becomes useless. What do you think?

Yeah, I agree.
If we use just one limit, it effectively makes the safety constraint useless for sizes under 8,000. I know we mentioned an estimate of how many inactive users we'd delete per run, but I've forgotten it. If we remove the safety constraint and rely on the query limit, are you suggesting there would then be no point in account_batch = inactive_users.limit(BATCH_SIZE)? If we have a large limit, say 100,000, would batching not be beneficial?

@carl-codeorg (Contributor) commented

> Now that I look at it, having two different limit parameters feels confusing to me. What do you think about simplifying it to just having the single limit and removing the safety constraint limit? It also seems to me that if we're artificially limiting the query that would trigger the safety constraint to a number lower than what would actually trigger it, then the safety check becomes useless. What do you think?

> Yeah, I agree. If we use just one limit, it effectively makes the safety constraint useless for sizes under 8,000. I know we mentioned an estimate of how many inactive users we'd delete per run, but I've forgotten it. If we remove the safety constraint and rely on the query limit, are you suggesting there would then be no point in account_batch = inactive_users.limit(BATCH_SIZE)? If we have a large limit, say 100,000, would batching not be beneficial?

I think batching could still potentially be useful. I was more getting at the safety constraint not being beneficial if we have another limit on the query prior to batching. Before we process any users, the whole query would return millions of records, so if we use a reasonable safety limit it'll always fail, and if we artificially limit the query to something much lower, say 100k per run, it'll never hit that safety limit. I was thinking that, at least for the initial large group of users, it'd be more useful to just have one upper limit on the number of records processed per run of the job, but keep that limit low enough to avoid the resource consumption issue from the dry run last week.

But now I'm second guessing whether the limit approach is actually going to fix that issue.

@artem-vavilov (Member) left a comment

Could you please share more details about this change? Was it made due to query timeouts or another reason?

# @param limit [Integer] The maximum number of accounts to delete in a single run.
# This is a safety limit to prevent accidental deletion of too many accounts.
def initialize(dry_run: false, inactive_since: nil, limit: ACCOUNT_DELETION_LIMIT)
def initialize(dry_run: false, inactive_since: nil, query_limit: nil)
A reviewer (Member) commented:

What is the reason for renaming limit to query_limit?

@juanmanzojr (Contributor, Author) replied:

We renamed limit to query_limit because we wanted to apply the limit directly to the query rather than as a separate safety constraint after fetching all inactive users. Previously, the safety limit (limit) was meant to prevent accidentally deleting too many accounts, but in practice it wasn't very effective: the initial query could still return millions of records before that limit was enforced.
This change also makes the safety constraint somewhat redundant, since we're now controlling how many users can even be considered for deletion in the first place. I like the idea of having an upper limit on how many records are processed per run of the job.
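For illustration, a run capped this way might be kicked off roughly like this (a sketch; the entry-point method name and the 100,000 figure are assumptions, only the keyword arguments come from the new initialize signature):

# Hypothetical invocation (sketch): caps a single run at 100,000 accounts.
# The .run entry point is illustrative; only the keyword arguments match
# the initialize signature shown in the diff above.
InactiveUserDeleter.new(dry_run: true, query_limit: 100_000).run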

Queries::User::Inactive.
  call(inactive_since: inactive_since).
  where.not(id: processed_user_ids).
  limit(@query_limit)
@artem-vavilov (Member) commented Nov 3, 2025

The query_limit will be overridden by the BATCH_SIZE limit, meaning it has no real effect, and all matched accounts will be deleted in batches of 1,000 records:

loop do
  account_batch = inactive_users.limit(BATCH_SIZE)
  account_batch.each do |user|
    delete_user(user)
    self.num_accounts_deleted += 1
  rescue StandardError => exception
    self.num_errors += 1
    Honeybadger.notify(exception, context: {user_id: user.id})
    log_message("Error deleting user_id #{user.id}: #{exception.message}")
  ensure
    processed_user_ids << user.id
  end
  break if account_batch.size < BATCH_SIZE
end

We need to add another break condition based on num_accounts_deleted in the loop:

break if num_accounts_deleted >= query_limit
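Put together, the loop might end up looking roughly like this (a sketch; it assumes query_limit is stored in @query_limit as in the query above, and guards against the nil default from initialize):

loop do
  account_batch = inactive_users.limit(BATCH_SIZE)
  account_batch.each do |user|
    delete_user(user)
    self.num_accounts_deleted += 1
  rescue StandardError => exception
    self.num_errors += 1
    Honeybadger.notify(exception, context: {user_id: user.id})
    log_message("Error deleting user_id #{user.id}: #{exception.message}")
  ensure
    processed_user_ids << user.id
  end

  # New break condition (sketch): stop once the per-run cap is reached.
  # Guarded because query_limit defaults to nil in initialize.
  break if @query_limit && num_accounts_deleted >= @query_limit
  break if account_batch.size < BATCH_SIZE
end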

@juanmanzojr requested review from Copilot and removed request for carl-codeorg and Copilot, November 7, 2025 08:33
@juanmanzojr (Contributor, Author) commented

> Could you please share more details about this change? Was it made due to query timeouts or another reason?

Yeah, it was because the initial batch of inactive users is so large: when we ran this query, it locked up the machine it ran on. We need a way, at least for this first pass, to limit how many users we process per run, since the old check (total_size = inactive_users.size followed by if total_size > limit) was already running the expensive query. At least for the initial batch of inactive users, the limit constraint will not be very useful.

