
Improve Housekeeper for distributed execution of tasks #21

@lucasoares

Description


We have successfully operated all our Deckard instances with a single housekeeper pod for years. To make the housekeeper tasks scalable, I propose the following improvements to the housekeeper feature:

  1. Implement a distributed locking mechanism for each task so that multiple housekeeper pods can run simultaneously. While most tasks are atomic and can safely run concurrently, running the same task in parallel on different housekeeper instances wastes resources.

  2. Address potential issues such as Prometheus metrics duplication. Currently, we expose numerous queue metrics in the /metrics endpoint of a Deckard instance when the housekeeper is enabled. Since the housekeeper is responsible for measuring many of these metrics, duplication can occur if we deploy many housekeeper pods with /metrics enabled. This mainly affects gauge metrics (queue size, queue oldest elements, etc.). We must consider how to guarantee these metrics are not duplicated, since a single housekeeper may be responsible for generating metrics that summarize the whole environment.

  3. Some jobs, like unlocking, currently have a performance limitation: if too many elements are locked, elements to unlock start to pile up, which may hurt lock-time precision. Currently we prefer using score filters (max_score and min_score) as a locking mechanism (filtering with min_score as now() and then adding the lock time to the score when acking/adding a message), but we want a fully functional locking mechanism.
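For item 1, the per-task lock could follow a lease-based acquire/release pattern. The sketch below is illustrative only: `TaskLock`, `runExclusive`, and the in-memory backend are hypothetical names, and a real deployment would back `TryAcquire` with a shared store (e.g. a Redis `SET NX PX` key or a MongoDB document with an expiry) rather than process memory.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TaskLock is a hypothetical interface a distributed lock backend
// would implement. Names here are illustrative, not Deckard's API.
type TaskLock interface {
	// TryAcquire returns true if this instance now owns the lock
	// for the given task until the lease expires.
	TryAcquire(task string, lease time.Duration) bool
	Release(task string)
}

// memoryLock is an in-process stand-in used only to demonstrate
// the control flow; a real deployment needs a shared store.
type memoryLock struct {
	mu     sync.Mutex
	leases map[string]time.Time
}

func newMemoryLock() *memoryLock {
	return &memoryLock{leases: make(map[string]time.Time)}
}

func (m *memoryLock) TryAcquire(task string, lease time.Duration) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if exp, ok := m.leases[task]; ok && time.Now().Before(exp) {
		return false // another housekeeper still holds the lease
	}
	m.leases[task] = time.Now().Add(lease)
	return true
}

func (m *memoryLock) Release(task string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.leases, task)
}

// runExclusive skips the task when another pod already runs it,
// avoiding the duplicated work described in item 1.
func runExclusive(l TaskLock, task string, fn func()) bool {
	if !l.TryAcquire(task, 30*time.Second) {
		return false
	}
	defer l.Release(task)
	fn()
	return true
}

func main() {
	lock := newMemoryLock()
	ran := 0
	first := runExclusive(lock, "unlock", func() {
		ran++
		// Simulate a second pod trying the same task concurrently:
		// it must be rejected while the first lease is held.
		if runExclusive(lock, "unlock", func() { ran++ }) {
			fmt.Println("unexpected: both pods ran the task")
		}
	})
	fmt.Println(first, ran) // true 1
}
```

Using a lease (expiry) instead of a plain flag means a crashed housekeeper pod cannot wedge a task forever: the lock simply times out and another pod picks it up.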
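For item 2, one possible approach is to gate environment-wide gauges on leadership, so only one housekeeper pod publishes them at a time. This is a minimal sketch under that assumption; `collectGauges`, `gaugeSet`, and the leader flag are hypothetical, and in practice the flag would be derived from the same distributed lock proposed in item 1.

```go
package main

import (
	"fmt"
)

// gaugeSet stands in for a set of Prometheus gauge samples
// (queue size, queue oldest elements, etc.).
type gaugeSet map[string]float64

// collectGauges publishes environment-wide gauges only from the
// current metrics leader, so Prometheus never scrapes the same
// queue gauge from two housekeeper pods.
func collectGauges(isMetricsLeader bool, measure func() gaugeSet) gaugeSet {
	if !isMetricsLeader {
		// Followers export nothing for these gauges.
		return nil
	}
	return measure()
}

func main() {
	measure := func() gaugeSet { return gaugeSet{"deckard_queue_size": 42} }
	fmt.Println(collectGauges(true, measure))  // map[deckard_queue_size:42]
	fmt.Println(collectGauges(false, measure)) // map[]
}
```

Counters measured per pod (tasks executed, errors) can keep being exported everywhere, since Prometheus can sum them across instances; only the summarizing gauges need this single-publisher treatment.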
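For item 3, the score-filter workaround can be sketched as follows: acking with a lock pushes the message score into the future, and pulls filter out messages whose score has not been reached yet, so a message becomes visible again on its own once now() passes its score, with no unlock job piling up. The struct and function names below are hypothetical, not Deckard's actual API.

```go
package main

import (
	"fmt"
	"time"
)

// message carries a priority score in milliseconds since epoch,
// mirroring the score-based ordering described in the issue.
type message struct {
	ID    string
	Score int64
}

// ackWithLock reschedules the message lockFor into the future,
// using the score itself as the lock.
func ackWithLock(m *message, now time.Time, lockFor time.Duration) {
	m.Score = now.Add(lockFor).UnixMilli()
}

// eligible applies the score filter at pull time: messages whose
// score is still in the future are effectively locked.
func eligible(msgs []message, now time.Time) []message {
	var out []message
	for _, m := range msgs {
		if m.Score <= now.UnixMilli() {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	now := time.Now()
	msgs := []message{
		{ID: "a", Score: now.UnixMilli()},
		{ID: "b", Score: now.UnixMilli()},
	}
	ackWithLock(&msgs[1], now, 5*time.Minute) // lock "b" for 5 minutes
	for _, m := range eligible(msgs, now) {
		fmt.Println(m.ID) // only "a" is pullable while "b" is locked
	}
}
```

The trade-off is precision versus throughput: the score filter unlocks lazily and exactly on time at pull, whereas a dedicated unlock job gives an explicit lock state but can fall behind when too many elements are locked, which is the limitation item 3 describes.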

By incorporating these enhancements, we aim to achieve better scalability, improved fault tolerance, and overall performance in our distributed Deckard setup.
