[TNTP-2109] Switch to safe mode on vshard rebalance #467
Open: ita-sammann wants to merge 17 commits into master from TNTP-2109-rebalance-safe-mode
Commits:
2f630af [TNTP-2109] Switch to safe mode on vshard rebalance (ita-sammann)
9b626bb [TNTP-2109] Add safe mode with bucket_ref/unref (ita-sammann)
abbcfbd [TNTP-2109] Add tests for safe mode (ita-sammann)
757680b Add bucket ref/unref for crud operations (Satbek)
8ea36f9 Fix tests after fast/safe mode changes (Satbek)
a8352f2 Allow some fast methods to finish after rebalance started (ita-sammann)
bb11b86 Minor changes according to review comments (ita-sammann)
d937de4 make_bucket_ref_err: call BucketRefError only once (Satbek)
c7a913e fix double_buckets_not_applied test, now works for get (Satbek)
b8cf51d rm manual fast mode enable in select_readview_test (Satbek)
b552571 Fix typo in comment (ita-sammann)
6654ca2 [TNTP-2109] Use local space and on_commit trigger to toggle safe mode (ita-sammann)
15f2332 [TNTP-2109] Use `internal.trigger` instead of custom hooks (ita-sammann)
926adcd [TNTP-2109] Simplify rebalance metrics (ita-sammann)
eba0e35 [TNTP-2109] Remove `atomic` from single calls (ita-sammann)
9d81362 [TNTP-2109] Use correct unref function when mode has changed (ita-sammann)
f771c5a [TNTP-2109] Remove concats from fiber name to improve perf (ita-sammann)
@@ -0,0 +1,197 @@

```lua
local fiber = require('fiber')
local log = require('log')
local trigger = require('internal.trigger')
local vshard_consts = require('vshard.consts')
local utils = require('crud.common.utils')
local has_metrics_module, metrics = pcall(require, 'metrics')

local SAFE_MODE_SPACE = '_crud_rebalance_safe_mode_local'

local rebalance = {
    safe_mode = false,
    -- Trigger is run with one argument: true if safe mode is enabled and false if disabled.
    on_safe_mode_toggle = trigger.new('_crud.safe_mode_toggle'),
    _router_cache_last_clear_ts = 0 -- On module load we don't know when (and if) route cache was cleared.
}

local function safe_mode_bucket_trigger(_, new, space, op)
    if space ~= '_bucket' then
        return
    end
    -- We are interested only in two operations that indicate the beginning of bucket migration:
    -- * We are receiving a bucket (new bucket with status RECEIVING)
    -- * We are sending a bucket to another node (existing bucket status changes to SENDING)
    if (op == 'INSERT' and new.status == vshard_consts.BUCKET.RECEIVING) or
       (op == 'REPLACE' and new.status == vshard_consts.BUCKET.SENDING) then
        local stored_safe_mode = box.space[SAFE_MODE_SPACE]:get{ 'status' }
        if not stored_safe_mode or not stored_safe_mode.value then
            box.space[SAFE_MODE_SPACE]:replace{ 'status', true }
        end
    end
end

local function _safe_mode_enable()
    -- The trigger is needed to detect the beginning of rebalance process to enable safe mode.
    -- If safe mode is enabled we don't need the trigger anymore.
    for _, trig in pairs(box.space._bucket:on_replace()) do
        if trig == safe_mode_bucket_trigger then
            box.space._bucket:on_replace(nil, trig)
        end
    end
    rebalance.safe_mode = true

    -- This function is running inside on_commit trigger, need pcall to protect from errors in external code.
    pcall(rebalance.on_safe_mode_toggle.run, rebalance.on_safe_mode_toggle, true)

    log.info('Rebalance safe mode enabled')
end

local function _safe_mode_disable()
    -- We have disabled safe mode so we need to add the trigger to detect the beginning
    -- of rebalance process to enable safe mode again.
    box.space._bucket:on_replace(safe_mode_bucket_trigger)
    rebalance.safe_mode = false

    -- This function is running inside on_commit trigger, need pcall to protect from errors in external code.
    pcall(rebalance.on_safe_mode_toggle.run, rebalance.on_safe_mode_toggle, false)

    log.info('Rebalance safe mode disabled')
end

local function create_space()
    local safe_mode_space = box.schema.space.create(SAFE_MODE_SPACE, {
        engine = 'memtx',
        format = {
            { name = 'key', type = 'string' },
            { name = 'value', type = 'any' },
        },
        is_local = true,
        if_not_exists = true,
    })
    safe_mode_space:create_index('primary', { parts = { 'key' }, if_not_exists = true })
    safe_mode_space:insert{ 'status', false }
end

local function create_trigger()
    box.space[SAFE_MODE_SPACE]:on_replace(function()
        box.on_commit(function(rows_iter)
            local safe_space_id = box.space[SAFE_MODE_SPACE].id
            -- There may be multiple operations on safe mode status tuple in one transaction.
            -- We will take only the last action.
            -- 0 = do nothing, 1 = enable safe mode, -1 = disable safe mode
            local safe_mode_action = 0
            for _, old, new, sp in rows_iter() do
                if sp ~= safe_space_id then
                    goto continue
                end
                assert((old == nil or old.key == 'status') and (new.key == 'status'))

                if (not old or not old.value) and new.value then
                    safe_mode_action = 1
                elseif old.value and not new.value then
                    safe_mode_action = -1
                end

                ::continue::
            end

            if safe_mode_action == 1 then
                _safe_mode_enable()
            elseif safe_mode_action == -1 then
                _safe_mode_disable()
            end
        end)
    end)
end

function rebalance.init()
    local stored_safe_mode
    if not box.info.ro then
        if box.space[SAFE_MODE_SPACE] == nil then
            create_space()
            create_trigger()
        end
        stored_safe_mode = box.space[SAFE_MODE_SPACE]:get{ 'status' }
    else
        while box.space[SAFE_MODE_SPACE] == nil or box.space[SAFE_MODE_SPACE].index[0] == nil do
            fiber.sleep(0.05)
        end
        create_trigger()
        stored_safe_mode = box.space[SAFE_MODE_SPACE]:get{ 'status' }
    end

    if stored_safe_mode and stored_safe_mode.value then
        _safe_mode_enable()
    else
        _safe_mode_disable()
    end
end

function rebalance.safe_mode_status()
    return rebalance.safe_mode
end

function rebalance.safe_mode_enable()
    box.space[SAFE_MODE_SPACE]:replace{ 'status', true }
end

function rebalance.safe_mode_disable()
    box.space[SAFE_MODE_SPACE]:replace{ 'status', false }
end

--- Rebalance storage API
rebalance.storage_api = {
    rebalance_safe_mode_status = rebalance.safe_mode_status,
    rebalance_safe_mode_enable = rebalance.safe_mode_enable,
    rebalance_safe_mode_disable = rebalance.safe_mode_disable,
}

--- Rebalance router API
rebalance.router_api = {}

function rebalance.router_api.cache_clear()
    local router = utils.get_vshard_router_instance()
    if router == nil then
        log.warn("Router is not initialized yet")
        return
    end
    rebalance._router_cache_last_clear_ts = fiber.time()
    return router:_route_map_clear()
end

function rebalance.router_api.cache_length()
    local router = utils.get_vshard_router_instance()
    if router == nil then
        log.warn("Router is not initialized yet")
        return
    end
    return router.known_bucket_count
end

function rebalance.router_api.cache_last_clear_ts()
    return rebalance._router_cache_last_clear_ts
end

--- Rebalance related metrics
if has_metrics_module then
    local safe_mode_enabled_gauge = metrics.gauge(
        'tnt_crud_storage_safe_mode_enabled',
        "is safe mode enabled on this storage instance"
    )
    local router_cache_length_gauge = metrics.gauge(
        'tnt_crud_router_cache_length',
        "number of bucket routes in vshard router cache"
    )
    local router_cache_last_clear_ts_gauge = metrics.gauge(
        'tnt_crud_router_cache_last_clear_ts',
        "when vshard router cache was cleared last time"
    )

    metrics.register_callback(function()
        safe_mode_enabled_gauge:set(rebalance.safe_mode_status() and 1 or 0)
        router_cache_length_gauge:set(rebalance.router_api.cache_length())
        router_cache_last_clear_ts_gauge:set(rebalance.router_api.cache_last_clear_ts())
    end)
end

return rebalance
```
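For readers following the `on_commit` logic in the diff, the "last action wins" rule can be modelled in plain Lua without a running Tarantool instance. This is only an illustrative sketch: `resolve_safe_mode_action` and the `{ old = ..., new = ... }` row shape are hypothetical names introduced here, not part of the PR. Unlike the loop in the diff, the sketch checks `old` and `new` for nil before reading `.value` in both branches.

```lua
-- Hypothetical helper (not part of the PR): a pure-Lua model of the loop
-- that runs inside the on_commit trigger.
-- Each row is a table { old = tuple_or_nil, new = tuple_or_nil }, where a
-- tuple is modelled as { key = 'status', value = <boolean> }.
local function resolve_safe_mode_action(rows)
    -- 0 = do nothing, 1 = enable safe mode, -1 = disable safe mode
    local action = 0
    for _, row in ipairs(rows) do
        -- Treat a missing tuple as "safe mode off".
        local old_value = row.old ~= nil and row.old.value or false
        local new_value = row.new ~= nil and row.new.value or false
        if not old_value and new_value then
            action = 1
        elseif old_value and not new_value then
            action = -1
        end
        -- Rows that do not change the value leave the previous action intact.
    end
    return action
end

-- One transaction that enables safe mode and then disables it again:
-- only the last change is acted upon.
local rows = {
    { old = { key = 'status', value = false }, new = { key = 'status', value = true } },
    { old = { key = 'status', value = true }, new = { key = 'status', value = false } },
}
print(resolve_safe_mode_action(rows)) -- prints -1
```

Keeping the decision separate from the side effects (`_safe_mode_enable` / `_safe_mode_disable` in the diff) would also make this rule testable on its own, though whether that split suits the module is a judgment call for the maintainers.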
Guys, the commits are absolutely chaotic; this is not the way we develop open-source modules. We should rebase that branch onto master to keep the commit history linear, not just merge it. As things stand, all commits like "Minor changes according to review comments" will be visible to our users. You can check out how Georgy Moiseev did it: proper commits, proper commit messages, tests in every commit (e.g. 8d7cae0).
Now I'm forced to review more than 1k lines in one step, which is very inconvenient and increases the chance of missed bugs. And our users won't be able to check the individual commits if they want to. Of course, it's up to you, since I'm not the maintainer of that module, but it's just not nice to develop and merge code like that.
IMHO, the features should be properly split between commits; every commit must include the associated tests, a proper commit message, and a mention of the #448 ticket. Of course, refactoring or code moves should go in separate commits. All of the tests must pass between commits.
Of course these commits will not go to master; I will re-split them before merging the PR.
My bad. I never thought of this PR as multiple features that can be reviewed separately.