Feature: "watched_columns" speedup of more than 10,000x for millions of requests #606
TerionGVS5 wants to merge 8 commits into toluaina:main from
Conversation
@toluaina Good morning, Sir. Waiting for your review ☀️. I hope this feature will be useful for high-load systems.
Apologies for the delay. I needed some time to review this (I've recently spent a while fixing bugs with exclusions). Excluding the watched columns in this way has some implications. For example, with through tables you wouldn't receive updates and could miss records. Your approach gave me an idea, so I've started a feature branch to explore it. Could you take a look and share your thoughts?
Good evening! Thank you for answering.
We skip only the UPDATE operation. In the common case, m2m tables use INSERT for creating rows and DELETE for removing them. So even if such a table contains specific info like:
Or
We can set
Maybe I missed something, but in our case everything works fine with through tables.
Sure 👍🏻
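The through-table argument above can be sketched in Python. This is a toy model under stated assumptions, not pgsync code; the table and column names are made up:

```python
# Toy model of a many-to-many (through) table: rows are (book_id, author_id).
# Membership changes reach the replication stream as INSERT or DELETE,
# never as UPDATE, so a filter that only skips UPDATEs cannot hide them.
events = []           # captured (operation, row) pairs, newest last
book_authors = set()  # current rows of the through table

def link(book_id: int, author_id: int) -> None:
    book_authors.add((book_id, author_id))
    events.append(("INSERT", (book_id, author_id)))

def unlink(book_id: int, author_id: int) -> None:
    book_authors.discard((book_id, author_id))
    events.append(("DELETE", (book_id, author_id)))

link(1, 7)    # attach author 7 to book 1
unlink(1, 7)  # detach again
# Both membership changes produced events; no UPDATE was ever issued,
# so UPDATE-only filtering by watched_columns is a no-op for this table.
```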
@toluaina Good evening. I looked at your idea – it's very unusual. I didn't think something like this could be done with practically pure SQL. On the one hand, it's very cool and could work faster, but on the other hand, processing logic in Python feels more familiar to me.
Otherwise, foreign keys will be skipped unless the user explicitly specifies them. In my case, with the Python-side skipping, the NEW and OLD messages always contain the PK, FK, and watched_columns (if set).
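The Python-side skipping described here might look roughly like the following sketch. The function name and the payload shape (`tg_op`, `old`, `new` keys) are assumptions for illustration, not the PR's actual code:

```python
def should_reindex(payload: dict, watched_columns: set[str]) -> bool:
    """Decide whether a change payload should trigger a reindex.

    Assumes the payload carries "old" and "new" row snapshots, as in a
    typical trigger/logical-decoding message; only UPDATEs are filtered.
    """
    if payload.get("tg_op") != "UPDATE":
        return True  # INSERT/DELETE are always processed
    if not watched_columns:
        return True  # default behavior: every update reindexes
    new, old = payload.get("new", {}), payload.get("old", {})
    # Reindex only if at least one watched column actually changed.
    return any(new.get(col) != old.get(col) for col in watched_columns)


# Example: an update that only touched an unwatched column is skipped.
payload = {
    "tg_op": "UPDATE",
    "old": {"id": 1, "title": "Post", "view_count": 10},
    "new": {"id": 1, "title": "Post", "view_count": 11},
}
print(should_reindex(payload, {"title"}))  # False: no watched column changed
```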
@TerionGVS5 That’s an interesting approach! I implemented something similar in this PR. From what I understand of your PR, you’re handling the filtering at the log level instead. Did I get that right?
@DrRiyadh Greetings, something like that, yes. I tried not to change the existing logic too much, but simply to extend it. That also made it easier to see how everything worked: for example, making an update on the front end that doesn't actually change data, or that changes data we don't care about, and then checking the logs to confirm the change was recorded by us but allowed through according to our rules.
First of all, thank you for this amazing project — it has been incredibly useful! 🙌
In our high-load system, we handle millions of updates per day, and during peak times Redis queues can contain hundreds of thousands of payloads.
A significant portion of these updates turned out to be redundant — especially in two cases:
• Many-to-many or relation tables frequently updated with the same data (no real value change).
• Large tables where only a few columns are relevant for the OpenSearch document, but updates to any column currently trigger a full document recalculation.
I introduced a simple improvement that helps avoid unnecessary re-indexing and can significantly boost performance in similar high-throughput environments:
• You can now define a watched_columns list inside the document model.
Only updates to these columns will trigger a document refresh in OpenSearch.
• Previously, if a table had 20 columns but only 2 of them were defined in columns, any update to the remaining 18 still caused a full reindex — even when the relevant data didn’t change.
• With watched_columns, this overhead is eliminated.
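Based on the description above, the document model might look something like this. This is a hypothetical sketch: the database, index, table, and column names are invented, and the exact placement of the watched_columns key follows this PR, not released pgsync:

```json
[
  {
    "database": "mydb",
    "index": "books",
    "nodes": {
      "table": "book",
      "columns": ["id", "title"],
      "watched_columns": ["title"]
    }
  }
]
```

With a schema like this, an UPDATE that changes only columns outside watched_columns (say, a view counter) would no longer enqueue a reindex for the document.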
In our setup, this optimization reduced redundant updates by several orders of magnitude — even under peak load, the Redis queue rarely exceeds 10 messages.
It works especially well for frequently re-updated M2M tables.
Backward compatibility
• Default behavior remains unchanged.
• If watched_columns is not defined, all updates will continue to trigger document recalculations as before.
Tests
• Added.
Docs
• README.rst: updated.
• HISTORY.rst: updated.