Skip to content

CNDB-15669: Add Mutator.getXXXTailTrie tests#5

Open
lesnik2u wants to merge 1940 commits into
blambov:CNDB-15669from
lesnik2u:CNDB-15669-TestTailTrie
Open

CNDB-15669: Add Mutator.getXXXTailTrie tests#5
lesnik2u wants to merge 1940 commits into
blambov:CNDB-15669from
lesnik2u:CNDB-15669-TestTailTrie

Conversation

@lesnik2u

Copy link
Copy Markdown

What is the issue

getExistingDeletionTailTrie() and getMutationDeletionTailTrie() read state that is only initialised after the needsForcedCopy hook fires, making both methods unusable from within the predicate — the former crashes with ArrayIndexOutOfBoundsException, the latter silently returns null.

What does this PR fix and why was it fixed

Moves the computation of existingAlternateBranch and incomingAlternateBranch to before the hook and caches them in two new fields on Mutator, so both getters return correct results when called from needsForcedCopy. The mutation logic and performance are unaffected — the two lines were already executing on every iteration, just after the hook. Tests covering all four tail trie getters are added in MutatorTailTrieTest.

tiagomlalves and others added 30 commits February 25, 2025 15:55
Use explicit class names when instantiating them via ParameterizedClass
instead of using Strings with the name.
…Graph (apache#1586)

CNDB-12980

The postingsMap in CompactionGraph reuses the VectorFloat object when
reading from storage. We need to copy this vector before storing it in
another data structure if we want it to hold the right information.

---------

Co-authored-by: Joel Knighton <joel.knighton@datastax.com>
This patch introduces two changes:
- it adds a reading group to guard against sweeping the memtable which
the metric
is going to potentially iterate through (preventing crashes).
- changes the metric calculation by using an estimate (used already by
SAI query planner) instead of iterating
through the whole memtable (which is quite a heavy operation).
)

For compressed sstables, fix the partition ending position calculation to prevent
going out of bounds when the partition end falls on a chunk boundary.
### What is the issue
https://github.com/riptano/cndb/issues/11725

# New functionality

ORDER BY BM25, enabled by SAI Version.EC

# Enhanced functionality

ORDER BY ANN can also used the synthetic score column so that coordinator does not need to recompute similarity for every row returned by the different replicas.  Controlled by SelectStatement.ANN_USE_SYNTHETIC_SCORE (default false)
…Internal in CQL utests (apache#1564)

Add a new `CQLTester#executeWithCoordinator` method that makes the query go 
through `ReadQuery#execute` rather than `ReadQuery#executeInternal`, so it 
goes through coordinator validation and reconciliation. 

Add a new `CQLTester#executeInternal` method that runs the query with 
`ReadQuery#executeInternal`, same as `CQLTester#execute` used to do.

Make `CQLTester#execute` internally call `CQLTester#executeInternal` by default
to preserve the beheviour of most utests.

Add a new `CQLTester#enableCoordinatorExecution` method that makes 
`CQLTester#execute` use `CQLTester#executeWithCoordinator` instead. 

Make `SAITester` initialization call `CQLTester#enableCoordinatorExecution` so
SAI tests extending it use coordinator validation and reconciliation. The latter
is particularly useful to make sure that SAI behaves the same as `RowFilter`
so SAI results are not wrongly filtered out by the coordinator. Some SAI tests 
have been excluded from this behaviour because the tested cases don't match 
this requirement, and follow-up issues have been created to fix this.
What was ported:
- current compaction throughput measurement by CompactionManager
- exposing current compaction throughput in StorageService
  and CompactionMetrics
- nodetool getcompactionthroughput, including tests

Not ported:
- changes to `nodetool compactionstats`, because that would
  require porting also the tests which are currently missing in CC
  and porting those tests turned out to be a complex task without
  porting the other changes in the CompactionManager API
- Code for getting / setting compaction throughput as double
This commit introduces a new AdaptiveCompressor class.

AdaptiveCompressor uses ZStandard compression with a dynamic
compression level based on the current write load. AdaptiveCompressor's
goal is to provide similar write performance as LZ4Compressor
for write heavy workloads, but a significantly better compression ratio
for databases with a moderate amount of writes or on systems
with a lot of spare CPU power.

If the memtable flush queue builds up, and it turns out the compression
is a significant bottleneck, then the compression level used for
flushing is decreased to gain speed. Similarly, when pending
compaction tasks build up, then the compression level used
for compaction is decreased.

In order to enable adaptive compression:
  - set `-Dcassandra.default_sstable_compression=adaptive` JVM option
    to automatically select `AdaptiveCompressor` as the main compressor
    for flushes and new tables, if not overriden by specific options in
    cassandra.yaml or table schema
  - set `flush_compression: adaptive` in cassandra.yaml to enable it
    for flushing
  - set `AdaptiveCompressor` in Table options to enable it
    for compaction

Caution: this feature is not turned on by default because it
may impact read speed negatively in some rare cases.

Fixes riptano/cndb#11532
### What is the issue
The `NON_DAEMON` IndexRegistry returns `true` for all calls to
`supportsExpression`. This leads to an incorrect result from the
`getEqBehavior` method where we get `MATCH` instead of `EQ` because the
index indicates that it can handle `ANALYZER_MATCHES` expressions and
`EQ` expressions.

It is an odd edge case because the javadoc for the `NON_DAEMON` object
is:

```java
    /**
     * An {@code IndexRegistry} intended for use when Cassandra is initialized in client or tool mode.
     * Contains a single stub {@code Index} which possesses no actual indexing or searching capabilities
     * but enables query validation and preparation to succeed. Useful for tools which need to prepare
     * CQL statements without instantiating the whole ColumnFamilyStore infrastructure.
     */
```

This presents a problem for the eq/match logic where we want to find a
nuanced solution to the different solutions. My proposal here is to just
make it use the EQ behavior, but that might have adverse side effects in
untested code.

### What does this PR fix and why was it fixed
The original fix used in apache#1434 was just to avoid the NPE we hit, but it
allowed for the wrong result in eq behavior. This fix is to say that we
should just return `EQ` in that case. It's possible this fix has
negative consequences that we haven't seen yet, but at the very least,
the CNDB tests pass with it.
The Ford Fulkerson optimization may take too long in some configs

### What is the issue
Some configs make the FF computation take too long

### What does this PR fix and why was it fixed
This PR adds a feature flag so you can workaround it
This patch updates Netty to address CVE-2025-24970.
It also adds the Netty native epoll dependency
that was missing when Netty was upgraded from
4.1.58 to 4.1.117. Since 4.1.58, the native
libraries are now separated out into new dependencies
and must be explicitly added for them to be bundled into the tarball.

### What is the issue
When Netty was upgraded to 4.1.117.Final, the native epoll libraries
that used to be shipped in the netty-all.jar were not included. At some
point since Netty 4.1.58, the native epoll and native kqueue libraries
were split inot their own architecture dependent jarfiles/artifacts and
must be included explicitly to pick them up.

### What does this PR fix and why was it fixed
This patch adds the missing native epol dependencies. This is similar to
what was done in OSS Cassandra when they upgraded to Netty 4.1.96 here:

apache@53d1644
### What is the issue
While working on CNDB-12382 I have seen that increasing the number of
stripes helps to solve the timeout problems.

### What does this PR fix and why was it fixed
This PR introduces an abstraction over the LOCKS mechanism in
CounterMutation and provides a different implementation.

The new implementation is not based on striping the locks (that is about
sharing the same lock for different keys) but it maintains a set of
reference counted objects that wrap the actual locks.


This implementation also makes it clearer that the key for the locks is
an Integer, computed by summing some hash codes, and this already
creates some kind of bucketing.
It is still possible that two mutations fall into the same lock key.

When do the new implementation works better ?
- it should generally works better for most of the workloads
- when there are few active counter mutations (the number of locks is
usually expected to be like the size of the Counters Mutation Stage
thread pool)

Rollback:
Add -Dcassandra.use_striped_counter_lock_manager=true to the writers JVM
options


Licensing: this code is copied and adapted from [LocalLockManager](https://github.com/diennea/herddb/blob/master/herddb-utils/src/main/java/herddb/utils/LocalLockManager.java)
from the HerdDB project (Apache 2 licensed).
…che#1605)

### What is the issue

The functions validating schema object names are misleading and placed
in different files. This led to incorrect use of them and failures on creating
tables with long names.

### What does this PR fix and why was it fixed

This PR reorganizes the name validation functions, so they are not
confused and misused. The error messages are fixed accordantly.

This PR also makes minor change to a long table name test to use the
keyspace name reported in an incident. New test is added on
non-alpha-numeric table name to improve the test coverage.
### What is the issue
Fixes https://github.com/riptano/cndb/issues/12922

### What does this PR fix and why was it fixed

This PR follows up on apache#1525 by integrating the CQL configured `rerank_k`
in to SAI so that the ANN query can take the user input and let it
influence query execution.

Key changes:
* Bump `MessagingService` to `VERSION_DS_11`
* Fixes implementation of
`MessagingService.instance().endpointsWithVersionBelow` by using the
`keyspace` variant, which is necessary for CNDB.
* When configured, the `rerank_k` is used as the graph's `rerankK`
parameter. The only detail worth mentioning here is that we ignore the
segment proportionality computation if `rerank_k` is provided. This
diverges from the smart default design, but not too greatly.
* Added a guardrail for `rerank_k` to prevent it from exceeding 4 times
the `cassandra.sai.vector_search.max_top_k`. This is debatable and
easily changed, so please let me know if we want something different.
* Made the in memory brute force cost comparison logic use the `rerankK`
value instead of the `limit`. I am pretty sure this is correct, but
please review closely.

Instead of validating `rerank_k`'s value within the query, I added a
test that ensures that as we increase the rerank k, we increase the
recall.

CNDB pr: https://github.com/riptano/cndb/pull/13095
…pache#1618)

CNDB-11460 another fix for cqlsh unicode tests in TestCqlshUnicode -
make a second read in read_until to get UTF-8 characters that are split
across reads.
…#1617)

### What is the issue
Node crashes during node replacements result in hibernated nodes that
cannot join the cluster anymore due to a lack of SYN messages from seeds.

### What does this PR fix and why was it fixed
Port DB-1482, which allows the use a jmx endpoint on a seed to bring the
hibernated node back to the gossiping candidate list.

Tested via: datastax/cassandra-dtest#75.
Fixes: https://github.com/riptano/cndb/issues/13196

### What is the issue
We were incorrectly sharing a file reader across threads in BM25
queries, which can lead to invalid results, as I reproduced in the test.

### What does this PR fix and why was it fixed
The PR fixes the issue by creating a `DocLengthsReader` object per
query.
### What is the issue
Fixes https://github.com/riptano/cndb/issues/13203

The underlying problem is that the `OutboundConnection` needs to store
the message version from the peer. Otherwise, when the local
`current_version` is higher than the remove `current_version`, we store
the wrong value (we were storing the local value, not the agreed upon
value).

The PR also fixes an issue in the ANN_OPTIONS serialization logic that
we can hit in rare cases where the peers haven't yet connected, so the
validation logic that would prevent serialization due to a peer having
too low of a message version is skipped.

### What does this PR fix and why was it fixed
This PR fixes several issues related to upgrades by improving handshake
logic.

---------

Co-authored-by: Sergio Bossa <sergio.bossa@gmail.com>
### What is the issue
CNDB-12599

### What does this PR fix and why was it fixed
Exposes the space requirements of compaction picks and compaction tasks
for CNDB's compaction space management.
Fixes riptano/cndb#12683

Table names are used in file names. Since the table names were not
validated on its length, creating or flushing a table with too long name
fails on too long file name.
There are two cases with using table names in file names:
- keyspace_name .table_name-controller-config.JSON
- table_name-32chars_table_id 

The maximum allowed file name size is 255 chars. Thus, 
- keyspace and table names, which are together longer than 231 chars,
will fail.
- a table name, which is longer than 222 chars, will fail.

This PR checks that creating new table name should not have too long
table name.
New tables are identified through the query's client state, which should
not be internal.
The limit is 222 chars for combined keyspace and table names. This is
more restrictive than necessary, but is easier to explain in the
documentation.

The change is tested in CNDB that creating new tables with long names
are prevented, but existing tables still work.
…essaging version (apache#1624)

This makes sure the proper version is used, as
following CNDB-13203, each connection maintains its own version,
separated from the one recorded in the MessagingService.

### What is the issue
`ANNOptions#validate()` was previously checking endpoint versions via
`MessagingService#versions`: but following #13203, in order to resolve a
race condition, each endpoint connection tracks its own
`EndpointMessagingVersions` object, with the version negotiated during
handshake, while `MessagingService#versions` is updated with the max
version of each endpoint. It follows checking
`MessagingService#versions` might not rely on the right version used for
the connection.

### What does this PR fix and why was it fixed
This PR introduces a new `MessagingService` method to check the
connection version, and uses it inside `ANNOptions#validate()`.
Trying to create an SAI or other index on a column, which name is either
long or has
non-alphanumeric characters, fails, while the table was successfully
created with such name. This happens if an index name is not provided.
The reason for this limitation is that the column name is used for
generating the default index name, which is used in the file names.

Fixes riptano/cndb#12503, fixes
riptano/cndb#12609

This PR removes the check for column names during an index creation.
Thus it allows to create indexes on existing columns with any names
including qualified names and long names.

1. The check was preventing qualified column names with characters,
which cannot be used in file names. This was never an issue as the index
name construction from table and column names already removes any
non-alphanumeric or non-underscore characters.
2. The check was preventing long column names. This limitation is
removed by
  - Adding calculation of maximum amount characters, which can be used in
index name, so it will allow to construct file names without exceeding
the limit of 255 chars (255 is added as a constant).
  - Then the index name is truncated. 
3. The uniqueness of index names is already fixed by existing
implementation, which adds counter prefix when needed.

This change to generated index name does not affect existing indexes as
the existing name from `IndexMetadata` is always serialized to JSON,
thus avoiding to re-generating the name.

This fix required to fix CQL tester, since it was assuming that the
restriction is in place. So column name regex pattern matcher is fixed
in `createIndex` to allow non-restricted column names. There might still
be possible corner cases on some column names used for generating index
names, which will not work with `createIndex`. In such case `execute`
should be used.

It also includes several fixes in affected files:
- Improve the name and usage of constants.
- Fix warnings with minor improvements in affected files.
The goal of this commit is to allow low-level code, typically at the
level of file reads (say, a custom `FileChannel` implementation), to
obtain context on which higher level operation they correspond to. A
typical use case may for tiered storage where `FileChannel`
implementations may forward reads to remote storage, and where having
context on the overall operation could help, say, take prior work done
to set timeout, or aggregate metrics per "user-level" operations (rather
than per `FileChannel` operation).

To do this, we reuse the existing `ExecutorLocals` mechanism, and simply
have the top-level operation setup the proper context as an
`ExecutorLocal`, allowing it to be accessed by any low-level operations
operating on behalf of that operation. This is, in many way, similar to
tracing, but instead of a `TraceState` that collect what happens during
the operation, it is a relatively flexible notion of `OperationContext`.

As of this patch, this feature is fairly barebone and mostly exists for
extensions in the following sense:
1. only user reads (`ReadCommand` execution) currently sets up such a
   context. The code is written in such a way that adding support for
   other operations should be easy, but this is not done.
2. the context set by reads by default has barely any information: it
   merely has a `toString()` method that can roughly tell what the
   operation itself is, and so could have use for debugging, but not
   much more. Further, that context is not read by anything in this
   patch. However, said context are created through a "factory" and
   the factory class is configurable through a system property. So
   extension can override the factory in order to create contexts
   with more information/state, and they fetch/use those context where
   they see fit.
patch by Berenguer Blasi; reviewed by Andres de la Peña for CASSANDRA-17333
### What is the issue
When the node gets replaced, it announces itself as hibernated (one of
the silent-shutdown states). When the node replacement fails, as the
other nodes see the replacing node in that state then the replacing node
won't receive any gossip messages from the seed at subsequent startup -
which ends up with an exception.

### What does this PR fix and why was it fixed
This patch adds an explicit shutdown announcement via gossip to let
other nodes know that the node was explicitly shut down - as it was due
to the exception. That allows other nodes (seeds in particular) to
contact the replacing node at its next startup, thus allowing it to retry
the replacement.
blambov and others added 20 commits November 6, 2025 13:33
Fix Cursor.skipToWhenAhead for reverse iteration
Add Cursor.dumpBranch for debugging
Fix various methods to return Preencoded byte-comparables
Fix deletion-aware collection merge cursor's reporting of deletion branch at tail points
Fix deletion-aware collection merge cursor's failure on one deletion branch
in a combined `encodedState` returned by advancing methods.
This saves megamorphic calls to `incomingTransition` and can
be augmented by further information at no cost.
This functionality has two main applications:
- it allows reverse walks that present prefix content in the correct
  byte-comparable order (i.e. prefixes after children)
- it makes it possible to have full control over what is and isn't
  included in a trie ranges (e.g. making it possible to have a branch
  set and nested ranges)
…and TrieMemtable to Stage3 version

Remove duplicate configuration object and add tests for stage 3
This change extends the coverage of the memtable trie to the
cell level, defining mappings of trie branches to and from the
legacy concepts of complex columns and rows.
This makes it possible to have completely off-heap trie memtable,
where cell data is stored inside the trie structure if it is small
enough to fit, or placed in natively-allocated memory and referenced
by memory address.
… by pre-computing branch cursors before the hook fires
@blambov

blambov commented Mar 18, 2026

Copy link
Copy Markdown
Owner

Well, I did not intend to enable these methods to be called through NodeFeatures, but rather be something that a mutation can do as part of the operation of an upserter, as done by TriePartitionUpdaterLegacyIndex e.g. here.

The force copy predicate is supposed to be a very quick check that should not run anything as complex as this, and also something we stop calling as soon as it returns true; in some cases we start force-copying even without consulting it. It is not a suitable or reliable vehicle for whatever may want to use the tails.

Can we pause this one until I add the proper documentation for these methods and decide if we shouldn't have users specify the upserters and predicate in a subclass of the mutators so that they have access to these methods? This should happen by the end of next week.

@blambov blambov force-pushed the CNDB-15669 branch 6 times, most recently from 6cb2af2 to 952e794 Compare March 27, 2026 14:24
@blambov blambov force-pushed the CNDB-15669 branch 2 times, most recently from cfce0f0 to 7901ada Compare April 6, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.