
Improve Housekeeper for distributed execution of tasks #21

@lucasoares

Description


We have successfully operated all our Deckard instances with a single housekeeper pod for years. To make the housekeeper tasks scalable, I propose the following improvements to the housekeeper feature:

  1. Implement a distributed locking mechanism for each task so that multiple housekeeper pods can run simultaneously. While most tasks are atomic and can safely run concurrently, running the same task in parallel on different housekeeper instances wastes resources.

  2. Address potential issues such as Prometheus metrics duplication. Currently, we expose numerous queue metrics in the /metrics endpoint of a Deckard instance when the housekeeper is enabled. Since the housekeeper is responsible for measuring many of these metrics, duplication can occur if we deploy many housekeeper pods with /metrics enabled. This mainly affects gauge metrics (queue size, queue oldest elements, etc.). We must consider how to guarantee these metrics are not duplicated, since a single housekeeper may be responsible for generating metrics that summarize the whole environment.

  3. Some jobs, like unlocking, currently have a performance limitation: if too many elements are locked, elements to unlock start to pile up, which may hurt lock-time precision. Currently we prefer using score filters (max_score and min_score) as a locking mechanism (filtering with min_score as now() and then adding the lock time to the score when acking/adding a message), but we want a fully functional locking mechanism.
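For item 1, the per-task lock could follow a lease-based acquire/release pattern. The sketch below is illustrative only: `TaskLock`, `runExclusive`, and the in-memory backend are hypothetical names, and a real deployment would back `TryAcquire` with a shared store (e.g. a Redis `SET NX PX` key or a MongoDB document with an expiry) rather than process memory.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TaskLock is a hypothetical interface a distributed lock backend
// would implement. Names here are illustrative, not Deckard's API.
type TaskLock interface {
	// TryAcquire returns true if this instance now owns the lock
	// for the given task until the lease expires.
	TryAcquire(task string, lease time.Duration) bool
	Release(task string)
}

// memoryLock is an in-process stand-in used only to demonstrate
// the control flow; a real deployment needs a shared store.
type memoryLock struct {
	mu     sync.Mutex
	leases map[string]time.Time
}

func newMemoryLock() *memoryLock {
	return &memoryLock{leases: make(map[string]time.Time)}
}

func (m *memoryLock) TryAcquire(task string, lease time.Duration) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if exp, ok := m.leases[task]; ok && time.Now().Before(exp) {
		return false // another housekeeper still holds the lease
	}
	m.leases[task] = time.Now().Add(lease)
	return true
}

func (m *memoryLock) Release(task string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.leases, task)
}

// runExclusive skips the task when another pod already runs it,
// avoiding the duplicated work described in item 1.
func runExclusive(l TaskLock, task string, fn func()) bool {
	if !l.TryAcquire(task, 30*time.Second) {
		return false
	}
	defer l.Release(task)
	fn()
	return true
}

func main() {
	lock := newMemoryLock()
	ran := 0
	first := runExclusive(lock, "unlock", func() {
		ran++
		// Simulate a second pod trying the same task concurrently:
		// it must be rejected while the first lease is held.
		if runExclusive(lock, "unlock", func() { ran++ }) {
			fmt.Println("unexpected: both pods ran the task")
		}
	})
	fmt.Println(first, ran) // true 1
}
```

Using a lease (expiry) instead of a plain flag means a crashed housekeeper pod cannot wedge a task forever: the lock simply times out and another pod picks it up.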
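For item 2, one possible approach is to gate environment-wide gauges on leadership, so only one housekeeper pod publishes them at a time. This is a minimal sketch under that assumption; `collectGauges`, `gaugeSet`, and the leader flag are hypothetical, and in practice the flag would be derived from the same distributed lock proposed in item 1.

```go
package main

import (
	"fmt"
)

// gaugeSet stands in for a set of Prometheus gauge samples
// (queue size, queue oldest elements, etc.).
type gaugeSet map[string]float64

// collectGauges publishes environment-wide gauges only from the
// current metrics leader, so Prometheus never scrapes the same
// queue gauge from two housekeeper pods.
func collectGauges(isMetricsLeader bool, measure func() gaugeSet) gaugeSet {
	if !isMetricsLeader {
		// Followers export nothing for these gauges.
		return nil
	}
	return measure()
}

func main() {
	measure := func() gaugeSet { return gaugeSet{"deckard_queue_size": 42} }
	fmt.Println(collectGauges(true, measure))  // map[deckard_queue_size:42]
	fmt.Println(collectGauges(false, measure)) // map[]
}
```

Counters measured per pod (tasks executed, errors) can keep being exported everywhere, since Prometheus can sum them across instances; only the summarizing gauges need this single-publisher treatment.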
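For item 3, the score-filter workaround can be sketched as follows: acking with a lock pushes the message score into the future, and pulls filter out messages whose score has not been reached yet, so a message becomes visible again on its own once now() passes its score, with no unlock job piling up. The struct and function names below are hypothetical, not Deckard's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// message carries a priority score in milliseconds since epoch,
// mirroring the score-based ordering described in the issue.
type message struct {
	ID    string
	Score int64
}

// ackWithLock reschedules the message lockFor into the future,
// using the score itself as the lock.
func ackWithLock(m *message, now time.Time, lockFor time.Duration) {
	m.Score = now.Add(lockFor).UnixMilli()
}

// eligible applies the score filter at pull time: messages whose
// score is still in the future are effectively locked.
func eligible(msgs []message, now time.Time) []message {
	var out []message
	for _, m := range msgs {
		if m.Score <= now.UnixMilli() {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	now := time.Now()
	msgs := []message{
		{ID: "a", Score: now.UnixMilli()},
		{ID: "b", Score: now.UnixMilli()},
	}
	ackWithLock(&msgs[1], now, 5*time.Minute) // lock "b" for 5 minutes
	for _, m := range eligible(msgs, now) {
		fmt.Println(m.ID) // only "a" is pullable while "b" is locked
	}
}
```

The trade-off is precision versus throughput: the score filter unlocks lazily and exactly on time at pull, whereas a dedicated unlock job gives an explicit lock state but can fall behind when too many elements are locked, which is the limitation item 3 describes.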

By incorporating these enhancements, we aim to achieve better scalability, improved fault tolerance, and overall performance in our distributed Deckard setup.
