CNDB-15669: Add Mutator.getXXXTailTrie tests#5
Conversation
Use explicit class names when instantiating them via ParameterizedClass instead of using Strings with the name.
…Graph (apache#1586) CNDB-12980 The postingsMap in CompactionGraph reuses the VectorFloat object when reading from storage. We need to copy this vector before storing it in another data structure if we want it to hold the right information. --------- Co-authored-by: Joel Knighton <joel.knighton@datastax.com>
This patch introduces two changes: - it adds a reading group to guard against sweeping the memtable which the metric is going to potentially iterate through (preventing crashes). - changes the metric calculation by using an estimate (used already by SAI query planner) instead of iterating through the whole memtable (which is quite a heavy operation).
### What is the issue https://github.com/riptano/cndb/issues/11725 # New functionality ORDER BY BM25, enabled by SAI Version.EC # Enhanced functionality ORDER BY ANN can also used the synthetic score column so that coordinator does not need to recompute similarity for every row returned by the different replicas. Controlled by SelectStatement.ANN_USE_SYNTHETIC_SCORE (default false)
)" This reverts commit b6930b5 adding back apache#1578
…Internal in CQL utests (apache#1564) Add a new `CQLTester#executeWithCoordinator` method that makes the query go through `ReadQuery#execute` rather than `ReadQuery#executeInternal`, so it goes through coordinator validation and reconciliation. Add a new `CQLTester#executeInternal` method that runs the query with `ReadQuery#executeInternal`, same as `CQLTester#execute` used to do. Make `CQLTester#execute` internally call `CQLTester#executeInternal` by default to preserve the beheviour of most utests. Add a new `CQLTester#enableCoordinatorExecution` method that makes `CQLTester#execute` use `CQLTester#executeWithCoordinator` instead. Make `SAITester` initialization call `CQLTester#enableCoordinatorExecution` so SAI tests extending it use coordinator validation and reconciliation. The latter is particularly useful to make sure that SAI behaves the same as `RowFilter` so SAI results are not wrongly filtered out by the coordinator. Some SAI tests have been excluded from this behaviour because the tested cases don't match this requirement, and follow-up issues have been created to fix this.
What was ported: - current compaction throughput measurement by CompactionManager - exposing current compaction throughput in StorageService and CompactionMetrics - nodetool getcompactionthroughput, including tests Not ported: - changes to `nodetool compactionstats`, because that would require porting also the tests which are currently missing in CC and porting those tests turned out to be a complex task without porting the other changes in the CompactionManager API - Code for getting / setting compaction throughput as double
This commit introduces a new AdaptiveCompressor class.
AdaptiveCompressor uses ZStandard compression with a dynamic
compression level based on the current write load. AdaptiveCompressor's
goal is to provide similar write performance as LZ4Compressor
for write heavy workloads, but a significantly better compression ratio
for databases with a moderate amount of writes or on systems
with a lot of spare CPU power.
If the memtable flush queue builds up, and it turns out the compression
is a significant bottleneck, then the compression level used for
flushing is decreased to gain speed. Similarly, when pending
compaction tasks build up, then the compression level used
for compaction is decreased.
In order to enable adaptive compression:
- set `-Dcassandra.default_sstable_compression=adaptive` JVM option
to automatically select `AdaptiveCompressor` as the main compressor
for flushes and new tables, if not overriden by specific options in
cassandra.yaml or table schema
- set `flush_compression: adaptive` in cassandra.yaml to enable it
for flushing
- set `AdaptiveCompressor` in Table options to enable it
for compaction
Caution: this feature is not turned on by default because it
may impact read speed negatively in some rare cases.
Fixes riptano/cndb#11532
### What is the issue
The `NON_DAEMON` IndexRegistry returns `true` for all calls to
`supportsExpression`. This leads to an incorrect result from the
`getEqBehavior` method where we get `MATCH` instead of `EQ` because the
index indicates that it can handle `ANALYZER_MATCHES` expressions and
`EQ` expressions.
It is an odd edge case because the javadoc for the `NON_DAEMON` object
is:
```java
/**
* An {@code IndexRegistry} intended for use when Cassandra is initialized in client or tool mode.
* Contains a single stub {@code Index} which possesses no actual indexing or searching capabilities
* but enables query validation and preparation to succeed. Useful for tools which need to prepare
* CQL statements without instantiating the whole ColumnFamilyStore infrastructure.
*/
```
This presents a problem for the eq/match logic where we want to find a
nuanced solution to the different solutions. My proposal here is to just
make it use the EQ behavior, but that might have adverse side effects in
untested code.
### What does this PR fix and why was it fixed
The original fix used in apache#1434 was just to avoid the NPE we hit, but it
allowed for the wrong result in eq behavior. This fix is to say that we
should just return `EQ` in that case. It's possible this fix has
negative consequences that we haven't seen yet, but at the very least,
the CNDB tests pass with it.
The Ford Fulkerson optimization may take too long in some configs ### What is the issue Some configs make the FF computation take too long ### What does this PR fix and why was it fixed This PR adds a feature flag so you can workaround it
This patch updates Netty to address CVE-2025-24970. It also adds the Netty native epoll dependency that was missing when Netty was upgraded from 4.1.58 to 4.1.117. Since 4.1.58, the native libraries are now separated out into new dependencies and must be explicitly added for them to be bundled into the tarball. ### What is the issue When Netty was upgraded to 4.1.117.Final, the native epoll libraries that used to be shipped in the netty-all.jar were not included. At some point since Netty 4.1.58, the native epoll and native kqueue libraries were split inot their own architecture dependent jarfiles/artifacts and must be included explicitly to pick them up. ### What does this PR fix and why was it fixed This patch adds the missing native epol dependencies. This is similar to what was done in OSS Cassandra when they upgraded to Netty 4.1.96 here: apache@53d1644
### What is the issue While working on CNDB-12382 I have seen that increasing the number of stripes helps to solve the timeout problems. ### What does this PR fix and why was it fixed This PR introduces an abstraction over the LOCKS mechanism in CounterMutation and provides a different implementation. The new implementation is not based on striping the locks (that is about sharing the same lock for different keys) but it maintains a set of reference counted objects that wrap the actual locks. This implementation also makes it clearer that the key for the locks is an Integer, computed by summing some hash codes, and this already creates some kind of bucketing. It is still possible that two mutations fall into the same lock key. When do the new implementation works better ? - it should generally works better for most of the workloads - when there are few active counter mutations (the number of locks is usually expected to be like the size of the Counters Mutation Stage thread pool) Rollback: Add -Dcassandra.use_striped_counter_lock_manager=true to the writers JVM options Licensing: this code is copied and adapted from [LocalLockManager](https://github.com/diennea/herddb/blob/master/herddb-utils/src/main/java/herddb/utils/LocalLockManager.java) from the HerdDB project (Apache 2 licensed).
…che#1605) ### What is the issue The functions validating schema object names are misleading and placed in different files. This led to incorrect use of them and failures on creating tables with long names. ### What does this PR fix and why was it fixed This PR reorganizes the name validation functions, so they are not confused and misused. The error messages are fixed accordantly. This PR also makes minor change to a long table name test to use the keyspace name reported in an incident. New test is added on non-alpha-numeric table name to improve the test coverage.
### What is the issue Fixes https://github.com/riptano/cndb/issues/12922 ### What does this PR fix and why was it fixed This PR follows up on apache#1525 by integrating the CQL configured `rerank_k` in to SAI so that the ANN query can take the user input and let it influence query execution. Key changes: * Bump `MessagingService` to `VERSION_DS_11` * Fixes implementation of `MessagingService.instance().endpointsWithVersionBelow` by using the `keyspace` variant, which is necessary for CNDB. * When configured, the `rerank_k` is used as the graph's `rerankK` parameter. The only detail worth mentioning here is that we ignore the segment proportionality computation if `rerank_k` is provided. This diverges from the smart default design, but not too greatly. * Added a guardrail for `rerank_k` to prevent it from exceeding 4 times the `cassandra.sai.vector_search.max_top_k`. This is debatable and easily changed, so please let me know if we want something different. * Made the in memory brute force cost comparison logic use the `rerankK` value instead of the `limit`. I am pretty sure this is correct, but please review closely. Instead of validating `rerank_k`'s value within the query, I added a test that ensures that as we increase the rerank k, we increase the recall. CNDB pr: https://github.com/riptano/cndb/pull/13095
…pache#1618) CNDB-11460 another fix for cqlsh unicode tests in TestCqlshUnicode - make a second read in read_until to get UTF-8 characters that are split across reads.
…#1617) ### What is the issue Node crashes during node replacements result in hibernated nodes that cannot join the cluster anymore due to a lack of SYN messages from seeds. ### What does this PR fix and why was it fixed Port DB-1482, which allows the use a jmx endpoint on a seed to bring the hibernated node back to the gossiping candidate list. Tested via: datastax/cassandra-dtest#75.
Fixes: https://github.com/riptano/cndb/issues/13196 ### What is the issue We were incorrectly sharing a file reader across threads in BM25 queries, which can lead to invalid results, as I reproduced in the test. ### What does this PR fix and why was it fixed The PR fixes the issue by creating a `DocLengthsReader` object per query.
### What is the issue Fixes https://github.com/riptano/cndb/issues/13203 The underlying problem is that the `OutboundConnection` needs to store the message version from the peer. Otherwise, when the local `current_version` is higher than the remove `current_version`, we store the wrong value (we were storing the local value, not the agreed upon value). The PR also fixes an issue in the ANN_OPTIONS serialization logic that we can hit in rare cases where the peers haven't yet connected, so the validation logic that would prevent serialization due to a peer having too low of a message version is skipped. ### What does this PR fix and why was it fixed This PR fixes several issues related to upgrades by improving handshake logic. --------- Co-authored-by: Sergio Bossa <sergio.bossa@gmail.com>
### What is the issue CNDB-12599 ### What does this PR fix and why was it fixed Exposes the space requirements of compaction picks and compaction tasks for CNDB's compaction space management.
Fixes riptano/cndb#12683 Table names are used in file names. Since the table names were not validated on its length, creating or flushing a table with too long name fails on too long file name. There are two cases with using table names in file names: - keyspace_name .table_name-controller-config.JSON - table_name-32chars_table_id The maximum allowed file name size is 255 chars. Thus, - keyspace and table names, which are together longer than 231 chars, will fail. - a table name, which is longer than 222 chars, will fail. This PR checks that creating new table name should not have too long table name. New tables are identified through the query's client state, which should not be internal. The limit is 222 chars for combined keyspace and table names. This is more restrictive than necessary, but is easier to explain in the documentation. The change is tested in CNDB that creating new tables with long names are prevented, but existing tables still work.
…o datastax/cassandra-ccm
…essaging version (apache#1624) This makes sure the proper version is used, as following CNDB-13203, each connection maintains its own version, separated from the one recorded in the MessagingService. ### What is the issue `ANNOptions#validate()` was previously checking endpoint versions via `MessagingService#versions`: but following #13203, in order to resolve a race condition, each endpoint connection tracks its own `EndpointMessagingVersions` object, with the version negotiated during handshake, while `MessagingService#versions` is updated with the max version of each endpoint. It follows checking `MessagingService#versions` might not rely on the right version used for the connection. ### What does this PR fix and why was it fixed This PR introduces a new `MessagingService` method to check the connection version, and uses it inside `ANNOptions#validate()`.
Trying to create an SAI or other index on a column, which name is either long or has non-alphanumeric characters, fails, while the table was successfully created with such name. This happens if an index name is not provided. The reason for this limitation is that the column name is used for generating the default index name, which is used in the file names. Fixes riptano/cndb#12503, fixes riptano/cndb#12609 This PR removes the check for column names during an index creation. Thus it allows to create indexes on existing columns with any names including qualified names and long names. 1. The check was preventing qualified column names with characters, which cannot be used in file names. This was never an issue as the index name construction from table and column names already removes any non-alphanumeric or non-underscore characters. 2. The check was preventing long column names. This limitation is removed by - Adding calculation of maximum amount characters, which can be used in index name, so it will allow to construct file names without exceeding the limit of 255 chars (255 is added as a constant). - Then the index name is truncated. 3. The uniqueness of index names is already fixed by existing implementation, which adds counter prefix when needed. This change to generated index name does not affect existing indexes as the existing name from `IndexMetadata` is always serialized to JSON, thus avoiding to re-generating the name. This fix required to fix CQL tester, since it was assuming that the restriction is in place. So column name regex pattern matcher is fixed in `createIndex` to allow non-restricted column names. There might still be possible corner cases on some column names used for generating index names, which will not work with `createIndex`. In such case `execute` should be used. It also includes several fixes in affected files: - Improve the name and usage of constants. - Fix warnings with minor improvements in affected files.
The goal of this commit is to allow low-level code, typically at the level of file reads (say, a custom `FileChannel` implementation), to obtain context on which higher level operation they correspond to. A typical use case may for tiered storage where `FileChannel` implementations may forward reads to remote storage, and where having context on the overall operation could help, say, take prior work done to set timeout, or aggregate metrics per "user-level" operations (rather than per `FileChannel` operation). To do this, we reuse the existing `ExecutorLocals` mechanism, and simply have the top-level operation setup the proper context as an `ExecutorLocal`, allowing it to be accessed by any low-level operations operating on behalf of that operation. This is, in many way, similar to tracing, but instead of a `TraceState` that collect what happens during the operation, it is a relatively flexible notion of `OperationContext`. As of this patch, this feature is fairly barebone and mostly exists for extensions in the following sense: 1. only user reads (`ReadCommand` execution) currently sets up such a context. The code is written in such a way that adding support for other operations should be easy, but this is not done. 2. the context set by reads by default has barely any information: it merely has a `toString()` method that can roughly tell what the operation itself is, and so could have use for debugging, but not much more. Further, that context is not read by anything in this patch. However, said context are created through a "factory" and the factory class is configurable through a system property. So extension can override the factory in order to create contexts with more information/state, and they fetch/use those context where they see fit.
patch by Berenguer Blasi; reviewed by Andres de la Peña for CASSANDRA-17333
…as causing test flakiness.
### What is the issue When the node gets replaced, it announces itself as hibernated (one of the silent-shutdown states). When the node replacement fails, as the other nodes see the replacing node in that state then the replacing node won't receive any gossip messages from the seed at subsequent startup - which ends up with an exception. ### What does this PR fix and why was it fixed This patch adds an explicit shutdown announcement via gossip to let other nodes know that the node was explicitly shut down - as it was due to the exception. That allows other nodes (seeds in particular) to contact the replacing node at its next startup, thus allowing it to retry the replacement.
Fix Cursor.skipToWhenAhead for reverse iteration Add Cursor.dumpBranch for debugging Fix various methods to return Preencoded byte-comparables Fix deletion-aware collection merge cursor's reporting of deletion branch at tail points Fix deletion-aware collection merge cursor's failure on one deletion branch
… relevant deletions
in a combined `encodedState` returned by advancing methods. This saves megamorphic calls to `incomingTransition` and can be augmented by further information at no cost.
This functionality has two main applications: - it allows reverse walks that present prefix content in the correct byte-comparable order (i.e. prefixes after children) - it makes it possible to have full control over what is and isn't included in a trie ranges (e.g. making it possible to have a branch set and nested ranges)
…and TrieMemtable to Stage3 version Remove duplicate configuration object and add tests for stage 3
This change extends the coverage of the memtable trie to the cell level, defining mappings of trie branches to and from the legacy concepts of complex columns and rows.
This makes it possible to have completely off-heap trie memtable, where cell data is stored inside the trie structure if it is small enough to fit, or placed in natively-allocated memory and referenced by memory address.
… by pre-computing branch cursors before the hook fires
|
Well, I did not intend to enable these methods to be called through The force copy predicate is supposed to be a very quick check that should not run anything as complex as this, and also something we stop calling as soon as it returns true; in some cases we start force-copying even without consulting it. It is not a suitable or reliable vehicle for whatever may want to use the tails. Can we pause this one until I add the proper documentation for these methods and decide if we shouldn't have users specify the upserters and predicate in a subclass of the mutators so that they have access to these methods? This should happen by the end of next week. |
6cb2af2 to
952e794
Compare
cfce0f0 to
7901ada
Compare
What is the issue
getExistingDeletionTailTrie()andgetMutationDeletionTailTrie()read state that is only initialised after theneedsForcedCopyhook fires, making both methods unusable from within the predicate — the former crashes withArrayIndexOutOfBoundsException, the latter silently returns null.What does this PR fix and why was it fixed
Moves the computation of
existingAlternateBranchandincomingAlternateBranchto before the hook and caches them in two new fields onMutator, so both getters return correct results when called fromneedsForcedCopy. The mutation logic and performance are unaffected — the two lines were already executing on every iteration, just after the hook. Tests covering all four tail trie getters are added inMutatorTailTrieTest.