
Commit 9500eb1

Backport Automated Repair Inside Cassandra (CEP-37)
Includes bug fixes and features:
- Improved observability in AutoRepair (CASSANDRA-20581)
- Stop repair scheduler if two major versions detected (CASSANDRA-20048)
- Safeguard Full repair against disk protection (CASSANDRA-20045)
- Stop AutoRepair monitoring thread upon shutdown (CASSANDRA-20623)
- Fix race condition in auto-repair scheduler (CASSANDRA-20265)
- Minimum repair task duration setting (CASSANDRA-20160)
- Preview_repaired auto-repair type (CASSANDRA-20046)
- Gate auto-repair behind cassandra.autorepair.enable JVM property
- Add cassandra.autorepair.check_min_version to gate minimum version enforcement
- Prevent auto-repair from running if any node is below 5.0.7
- Make system_distributed auto-repair schema conditional on feature being enabled
- Add user-friendly errors for disabled auto-repair and schema incompatibility

patch by Paulo Motta; reviewed by Andy Tolbert, Jaydeepkumar Chovatia for CASSANDRA-21138

Co-Authored-By: Andy Tolbert <andy_tolbert@apple.com>
Co-Authored-By: Chris Lohfink <clohfink@netflix.com>
Co-Authored-By: Francisco Guerrero <frankgh@apache.org>
Co-Authored-By: Himanshu Jindal <himanshj@amazon.com>
Co-Authored-By: Jaydeepkumar Chovatia <jchovati@uber.com>
Co-Authored-By: Kristijonas Zalys <kzalys@uber.com>
Co-Authored-By: jaydeepkumar1984 <chovatia.jaydeep@gmail.com>
1 parent 0807210 commit 9500eb1

98 files changed: 14,524 additions and 120 deletions


.build/run-tests.sh

Lines changed: 5 additions & 3 deletions
@@ -188,21 +188,23 @@ _build_all_dtest_jars() {
     if [ -d ${TMP_DIR}/cassandra-dtest-jars/.git ] && [ "https://github.com/apache/cassandra.git" == "$(git -C ${TMP_DIR}/cassandra-dtest-jars remote get-url origin)" ] ; then
       echo "Reusing ${TMP_DIR}/cassandra-dtest-jars for past branch dtest jars"
       if [ "x" == "x${OFFLINE}" ] ; then
-        until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done
+        until git -C ${TMP_DIR}/cassandra-dtest-jars fetch --quiet --tags origin ; do echo "git -C ${TMP_DIR}/cassandra-dtest-jars fetch failed… trying again… " ; done
       fi
     else
       echo "Cloning cassandra to ${TMP_DIR}/cassandra-dtest-jars for past branch dtest jars"
       rm -fR ${TMP_DIR}/cassandra-dtest-jars
       pushd $TMP_DIR >/dev/null
-      until git clone --quiet --depth 1 --no-single-branch https://github.com/apache/cassandra.git cassandra-dtest-jars ; do echo "git clone failed… trying again… " ; done
+      until git clone --quiet --depth 1 --no-single-branch --tags https://github.com/apache/cassandra.git cassandra-dtest-jars ; do echo "git clone failed… trying again… " ; done
       popd >/dev/null
     fi

     # cassandra-4 branches need CASSANDRA_USE_JDK11 to allow jdk11
     [ "${java_version}" -eq 11 ] && export CASSANDRA_USE_JDK11=true

     pushd ${TMP_DIR}/cassandra-dtest-jars >/dev/null
-    for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0 ; do
+    # Note: cassandra-5.0.7 tag is used instead of cassandra-5.0 branch to enable
+    # testing upgrades from 5.0.7 to the current local build for autorepair feature
+    for branch in cassandra-4.0 cassandra-4.1 cassandra-5.0.7 ; do
       git clean -qxdff && git reset --hard HEAD || echo "failed to reset/clean ${TMP_DIR}/cassandra-dtest-jars… continuing…"
       git checkout --quiet $branch
       dtest_jar_version=$(grep 'property\s*name=\"base.version\"' build.xml |sed -ne 's/.*value=\"\([^"]*\)\".*/\1/p')
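The fetch and clone calls above both rely on the same shell retry idiom: `until` re-runs a command until it exits 0, echoing a message after each failure. A minimal standalone sketch, with a hypothetical `flaky` function standing in for the real `git fetch`/`git clone` calls:

```shell
#!/bin/sh
# Sketch of the retry pattern used in run-tests.sh. `flaky` is an
# illustrative stand-in that fails twice, then succeeds on the third call.
attempts=0
flaky() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]   # nonzero exit (failure) on the first two calls
}
# `until` keeps looping while the command fails; the body reports each retry.
until flaky ; do echo "command failed… trying again… " ; done
echo "succeeded after ${attempts} attempts"
```

The same shape works for any transient-failure command; in the script above the retried command is the git network operation itself.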

CHANGES.txt

Lines changed: 10 additions & 0 deletions
@@ -1,8 +1,18 @@
 5.0.8
+ * Backport Automated Repair Inside Cassandra for CEP-37 (CASSANDRA-21138)
  * Update cassandra-stress to support TLS 1.3 by default by auto-negotiation (CASSANDRA-21007)
  * Ensure schema created before 2.1 without tableId in folder name can be loaded in SnapshotLoader (CASSANDRA-21173)
 Merged from 4.1:
 Merged from 4.0:
+Backported from 6.0:
+ * Improved observability in AutoRepair to report both expected vs. actual repair bytes and expected vs. actual keyspaces (CASSANDRA-20581)
+ * Stop repair scheduler if two major versions are detected (CASSANDRA-20048)
+ * AutoRepair: Safeguard Full repair against disk protection (CASSANDRA-20045)
+ * Stop AutoRepair monitoring thread upon Cassandra shutdown (CASSANDRA-20623)
+ * Fix race condition in auto-repair scheduler (CASSANDRA-20265)
+ * Implement minimum repair task duration setting for auto-repair scheduler (CASSANDRA-20160)
+ * Implement preview_repaired auto-repair type (CASSANDRA-20046)
+ * Automated Repair Inside Cassandra for CEP-37 (CASSANDRA-19918)


 5.0.7

NEWS.txt

Lines changed: 39 additions & 0 deletions
@@ -65,6 +65,45 @@ restore snapshots created with the previous major version using the
 'sstableloader' tool. You can upgrade the file format of your snapshots
 using the provided 'sstableupgrade' tool.

+5.0.8
+======
+
+New features
+------------
+    - CEP-37 Auto Repair is a fully automated scheduler that provides repair orchestration within Apache Cassandra. This
+      significantly reduces operational overhead by eliminating the need for operators to deploy external tools to submit
+      and manage repairs. See
+      https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-37+Apache+Cassandra+Unified+Repair+Solution for more
+      details on the motivation and design.
+
+Upgrading
+---------
+    - The auto-repair feature requires enabling the JVM property `cassandra.autorepair.enable=true` (add
+      `-Dcassandra.autorepair.enable=true` to JVM options) before starting the node. This property creates the required
+      schema elements for auto-repair, including the auto_repair column in system_schema.tables and system_schema.views,
+      as well as the auto_repair_history and auto_repair_priority tables in system_distributed. After enabling this
+      property, you still need to enable auto-repair scheduling either in cassandra.yaml under the `auto_repair` section
+      or at runtime via JMX.
+
+      Users who do not intend to use auto-repair can leave this property disabled (the default) to maintain schema
+      compatibility with pre-5.0.8 nodes during rolling upgrades. This property must be set consistently across all
+      nodes before startup and cannot be changed at runtime.
+
+      WARNING: This property is non-reversible. Once enabled, it cannot be disabled. Attempting to start a node
+      with `cassandra.autorepair.enable=false` after it was previously enabled will cause the node to fail during
+      initialization due to schema incompatibility (the persisted schema contains auto-repair columns that are not
+      recognized when the property is disabled). To disable auto-repair scheduling after the property has been
+      enabled, use cassandra.yaml or JMX instead of changing the JVM property.
+
+      IMPORTANT: The `cassandra.autorepair.enable` property must be enabled consistently across all nodes in the
+      cluster before any schema changes are made. When some nodes have the property enabled and others do not, the
+      system_distributed keyspace schema generation will differ between nodes (generation 7 with auto-repair vs
+      generation 6 without), causing schema disagreement. This is similar to what happens during a major version
+      upgrade when new system tables are added. Any schema change (e.g. CREATE KEYSPACE) attempted while nodes
+      are in this inconsistent state will time out and schema versions will not converge until all nodes are
+      brought up with the same setting. Once all nodes have the property set consistently, schema will converge
+      automatically.
+
 5.0.7
 ======
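The upgrade note above boils down to adding one flag to every node's JVM options before restart. A hedged sketch of an idempotent way to do that; the stock location in recent Cassandra packages is conf/jvm-server.options, but a temp file is used here for illustration, so adjust the path for your install:

```shell
#!/bin/sh
# Sketch: append -Dcassandra.autorepair.enable=true to a node's JVM options
# file exactly once. Per NEWS.txt, once enabled this flag must never be
# removed, and it must be set consistently on every node in the cluster.
CONF=$(mktemp)   # stand-in for conf/jvm-server.options on a real node
FLAG='-Dcassandra.autorepair.enable=true'
# Only append if the flag is not already present (safe to re-run).
grep -q "^${FLAG}\$" "$CONF" || echo "$FLAG" >> "$CONF"
grep -q "^${FLAG}\$" "$CONF" || echo "$FLAG" >> "$CONF"   # second run is a no-op
```

After restarting with the flag on all nodes, scheduling still has to be turned on separately via the `auto_repair` section of cassandra.yaml or JMX, as the note says.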

conf/cassandra.yaml

Lines changed: 173 additions & 0 deletions
@@ -1951,6 +1951,13 @@ report_unconfirmed_repaired_data_mismatches: false
 # Materialized views are considered experimental and are not recommended for production use.
 materialized_views_enabled: false

+# Specify whether Materialized View mutations are replayed through the write path on streaming, e.g. repair.
+# When enabled, Materialized View data streamed to the destination node will be written into the commit log first. When set to
+# false, the streamed Materialized View data is written into SSTables just the same as normal streaming. The default is true.
+# If this is set to false, streaming will be considerably faster, however it is possible that, in extreme situations
+# (losing > quorum # nodes in a replica set), you may have data in your SSTables that never makes it to the Materialized View.
+# materialized_views_on_repair_enabled: true
+
 # Enables SASI index creation on this node.
 # SASI indexes are considered experimental and are not recommended for production use.
 sasi_indexes_enabled: false
@@ -2253,6 +2260,7 @@ drop_compact_storage_enabled: false
 # excluded_keyspaces: # comma separated list of keyspaces to exclude from the check
 # excluded_tables: # comma separated list of keyspace.table pairs to exclude from the check

+
 # This property indicates with what Cassandra major version the storage format will be compatible with.
 #
 # The chosen storage compatibility mode will determine the versions of the written sstables, commitlogs, hints, etc.
@@ -2281,3 +2289,168 @@ drop_compact_storage_enabled: false
 # compatibility mode would no longer toggle behaviors as when it was running in the UPGRADING mode.
 #
 storage_compatibility_mode: CASSANDRA_4
+
+
+# Prevents preparing a repair session or beginning a repair streaming session if the number of pending compactions is over
+# the given value. Defaults to disabled.
+# reject_repair_compaction_threshold: 1024
+
+# Ratio of disk that must be unused to run repair. It is useful to avoid disks filling up during
+# repair as anti-compaction during repair may contribute to additional space temporarily.
+# For example, setting this to 0.2 means at least 20% of disk must be unused.
+# Set to 0.0 to disable this check. Defaults to 0.0 (disabled) on 5.0 for backward-compatibility.
+# repair_disk_headroom_reject_ratio: 0.0
+
+# Configuration for the Auto Repair Scheduler.
+#
+# This feature is disabled by default.
+#
+# NOTE: The auto-repair feature requires enabling the JVM property `cassandra.autorepair.enable=true`.
+#
+# See: https://cassandra.apache.org/doc/latest/cassandra/managing/operating/auto_repair.html for an overview of this
+# feature.
+#
+# auto_repair:
+#   # Enable/Disable the auto-repair scheduler.
+#   # If set to false, the scheduler thread will not be started.
+#   # If set to true, the repair scheduler thread will be created. The thread will
+#   # check for secondary configuration available for each repair type (full, incremental,
+#   # and preview_repaired), and based on that, it will schedule repairs.
+#   enabled: true
+#   repair_type_overrides:
+#     full:
+#       # Enable/Disable full auto-repair
+#       enabled: true
+#       # Minimum duration between repairing the same node again. This is useful for tiny clusters,
+#       # such as clusters with 5 nodes that finish repairs quickly. This means that if the scheduler completes one
+#       # round on all nodes in less than this duration, it will not start a new repair round on a given node until
+#       # this much time has passed since the last repair completed. Consider increasing to a larger value to reduce
+#       # the impact of repairs; however, note that one should attempt to run repairs at a smaller interval than
+#       # gc_grace_seconds to avoid potential data resurrection.
+#       min_repair_interval: 24h
+#       token_range_splitter:
+#         # Implementation of IAutoRepairTokenRangeSplitter; responsible for splitting token ranges
+#         # for repair assignments.
+#         #
+#         # Out of the box, Cassandra provides org.apache.cassandra.repair.autorepair.{RepairTokenRangeSplitter,
+#         # FixedTokenRangeSplitter}.
+#         #
+#         # - RepairTokenRangeSplitter (default) attempts to intelligently split ranges based on data size and partition
+#         #   count.
+#         # - FixedTokenRangeSplitter splits into fixed ranges based on the 'number_of_subranges' option.
+#         # class_name: org.apache.cassandra.repair.autorepair.RepairTokenRangeSplitter
+#
+#         # Optional parameters can be specified in the form of:
+#         # parameters:
+#         #   param_key1: param_value1
+#         parameters:
+#           # The target and maximum amount of compressed bytes that should be included in a repair assignment.
+#           # This scopes the amount of work involved in a repair and includes the data covering the range being
+#           # repaired.
+#           bytes_per_assignment: 50GiB
+#           # The maximum number of bytes to cover in an individual schedule. This serves as
+#           # a mechanism to throttle the work done in each repair cycle. You may reduce this
+#           # value if the impact of repairs is causing too much load on the cluster, or increase it
+#           # if writes outpace the amount of data being repaired. Alternatively, adjust the
+#           # min_repair_interval.
+#           # This is set to a large value for full repair to attempt to repair all data per repair schedule.
+#           max_bytes_per_schedule: 100000GiB
+#     incremental:
+#       enabled: false
+#       # Incremental repairs operate over unrepaired data and should finish quickly. Running incremental repair
+#       # frequently keeps the unrepaired set smaller and thus causes repairs to operate over a smaller set of data,
+#       # so a more frequent schedule such as 1h is recommended.
+#       # NOTE: Please consult
+#       # https://cassandra.apache.org/doc/latest/cassandra/managing/operating/auto_repair.html#enabling-ir
+#       # for guidance on enabling incremental repair on an existing cluster.
+#       min_repair_interval: 24h
+#       token_range_splitter:
+#         parameters:
+#           # Configured to attempt repairing 50GiB of compressed data per repair.
+#           # This throttles the amount of incremental repair and anticompaction done per schedule after incremental
+#           # repairs are turned on.
+#           bytes_per_assignment: 50GiB
+#           # Restricts the maximum number of bytes to cover in an individual schedule to the configured
+#           # max_bytes_per_schedule value (defaults to 100GiB for incremental).
+#           # Consider increasing this value if more data is written than this limit within the min_repair_interval.
+#           max_bytes_per_schedule: 100GiB
+#     preview_repaired:
+#       # Performs preview repair over repaired SSTables, useful to detect possible inconsistencies in the repaired
+#       # data set.
+#       enabled: false
+#       min_repair_interval: 24h
+#       token_range_splitter:
+#         parameters:
+#           bytes_per_assignment: 50GiB
+#           max_bytes_per_schedule: 100000GiB
+#   # Time interval between successive checks to see if ongoing repairs are complete or if it is time to schedule
+#   # repairs.
+#   repair_check_interval: 5m
+#   # Minimum duration for the execution of a single repair task. This prevents the scheduler from overwhelming
+#   # the node by scheduling too many repair tasks in a short period of time.
+#   repair_task_min_duration: 5s
+#   # The scheduler needs to adjust its order when nodes leave the ring. Deleted hosts are tracked in metadata
+#   # for a specified duration to ensure they are indeed removed before adjustments are made to the schedule.
+#   history_clear_delete_hosts_buffer_interval: 2h
+#   # By default repair is disabled if there are mixed major versions detected - which would happen
+#   # if a major version upgrade is being performed on the cluster - but a user can enable it using this flag.
+#   mixed_major_version_repair_enabled: false
+#   # NOTE: Each of the below settings can be overridden per repair type under repair_type_overrides
+#   global_settings:
+#     # If true, attempts to group tables in the same keyspace into one repair; otherwise, each table is repaired
+#     # individually.
+#     repair_by_keyspace: true
+#     # Number of threads to use for each repair job scheduled by the scheduler. Similar to the -j option in nodetool
+#     # repair.
+#     number_of_repair_threads: 1
+#     # Number of nodes running repair in parallel. If parallel_repair_percentage is set, the larger value is used.
+#     parallel_repair_count: 3
+#     # Percentage of nodes in the cluster running repair in parallel. If parallel_repair_count is set, the larger value
+#     # is used.
+#     parallel_repair_percentage: 3
+#     # Whether to allow a node to take its turn running repair while one or more of its replicas are running repair.
+#     # Defaults to false, as running repairs concurrently on replicas can increase load and also cause anticompaction
+#     # conflicts while running incremental repair.
+#     allow_parallel_replica_repair: false
+#     # An addition to allow_parallel_replica_repair that also blocks repairs when replicas (including this node itself)
+#     # are repairing in any schedule. For example, if a replica is executing full repairs, a value of false will
+#     # prevent starting incremental repairs for this node. Defaults to true and is only evaluated when
+#     # allow_parallel_replica_repair is false.
+#     allow_parallel_replica_repair_across_schedules: true
+#     # Repairs materialized views if true.
+#     materialized_view_repair_enabled: false
+#     # Delay before starting repairs after a node restarts, to avoid repairs starting immediately after a restart.
+#     initial_scheduler_delay: 5m
+#     # Timeout for retrying stuck repair sessions.
+#     repair_session_timeout: 3h
+#     # Force immediate repair on new nodes after they join the ring.
+#     force_repair_new_node: false
+#     # Threshold to skip repairing tables with too many SSTables. Defaults to 10,000 SSTables to avoid penalizing good
+#     # tables.
+#     sstable_upper_threshold: 50000
+#     # Maximum time allowed for repairing one table on a given node. If exceeded, the repair proceeds to the
+#     # next table.
+#     table_max_repair_time: 6h
+#     # Avoid running repairs in specific data centers. By default, repairs run in all data centers. Specify data
+#     # centers to exclude in this list. Note that repair sessions will still consider all replicas from excluded
+#     # data centers. Useful if you have keyspaces that are not replicated in certain data centers and you do not
+#     # want to run the repair schedule there.
+#     ignore_dcs: []
+#     # Repair only the primary ranges owned by a node. Equivalent to the -pr option in nodetool repair. Defaults
+#     # to true. General advice is to keep this true.
+#     repair_primary_token_range_only: true
+#     # Maximum number of retries for a repair session.
+#     repair_max_retries: 3
+#     # Backoff time before retrying a repair session.
+#     repair_retry_backoff: 30s
+#     token_range_splitter:
+#       # Splitter implementation to generate repair assignments. Defaults to RepairTokenRangeSplitter.
+#       class_name: org.apache.cassandra.repair.autorepair.RepairTokenRangeSplitter
+#       parameters:
+#         # Maximum number of partitions to include in a repair assignment. Used to reduce the number of partitions
+#         # present in merkle tree leaf nodes to avoid overstreaming.
+#         partitions_per_assignment: 1048576
+#         # Maximum number of tables to include in a repair assignment. This reduces the number of repairs,
+#         # especially in keyspaces with many tables. The splitter avoids batching tables together if they
+#         # would exceed other configuration parameters like bytes_per_assignment or partitions_per_assignment.
+#         max_tables_per_assignment: 64
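The yaml comments state that between `parallel_repair_count` and `parallel_repair_percentage`, the larger resulting value wins. A hedged sketch of that rule (not Cassandra source; whether the percentage is rounded up is an assumption here):

```python
# Illustrative calculation of the documented "larger value is used" rule for
# parallel repair concurrency. Function name and rounding are assumptions,
# not taken from the Cassandra codebase.
import math

def effective_parallel_repairs(cluster_size: int,
                               parallel_repair_count: int = 3,
                               parallel_repair_percentage: int = 3) -> int:
    """How many nodes may repair in parallel under the larger-value rule."""
    by_percentage = math.ceil(cluster_size * parallel_repair_percentage / 100)
    return max(parallel_repair_count, by_percentage)

print(effective_parallel_repairs(50))    # small cluster: the fixed count (3) dominates
print(effective_parallel_repairs(500))   # large cluster: 3% of 500 nodes (15) dominates
```

With the defaults above, the fixed count acts as a floor on small clusters while the percentage scales concurrency on large ones, which is presumably why both knobs exist.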
