Smartly Apply Constraints During Cartesian Product#773
Conversation
Force-pushed from 6a33c52 to 503fef9
Pull request overview
This PR optimizes discrete search space construction by applying discrete constraints incrementally during Cartesian product generation (including improved Polars/Pandas interop), aiming to reduce intermediate memory use and runtime for highly constrained spaces.
Changes:
- Added `baybe.searchspace.utils` with shared Cartesian product helpers and a new incremental constrained-product builder.
- Extended discrete constraint interfaces to support (or explicitly refuse) early filtering via `UnsupportedEarlyFilteringError`, plus a `has_polars_implementation` capability flag.
- Updated discrete search space constructors and tests to use the new incremental filtering path (and added parity tests vs. the naive approach).
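The core idea of the incremental constrained-product builder — apply each constraint as soon as all of its columns exist in the partial cross join — can be sketched in plain pandas. All names below (`incremental_product`, the predicate tuples) are illustrative, not BayBE's actual API:

```python
import pandas as pd


def incremental_product(columns: dict, predicates: list) -> pd.DataFrame:
    """Cross-join columns one at a time, applying each row predicate as soon
    as all of its required columns are present in the partial product."""
    df = pd.DataFrame({"_key": [1]})
    pending = list(predicates)  # items: (required_column_names, row_predicate)
    for name, values in columns.items():
        df = df.merge(pd.DataFrame({name: values, "_key": 1}), on="_key")
        for item in list(pending):
            cols, pred = item
            if set(cols) <= set(df.columns) and not df.empty:
                # Filter now; the constraint never needs to be applied again.
                df = df[df.apply(pred, axis=1)]
                pending.remove(item)
    return df.drop(columns="_key").reset_index(drop=True)
```

Because satisfied constraints shrink the left dataframe before later merges, highly constrained spaces never materialize anywhere near the full product.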
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `baybe/searchspace/utils.py` | New utilities: parameter ordering, pandas/polars Cartesian product, and incremental constrained Cartesian product builder. |
| `baybe/searchspace/discrete.py` | Switches discrete space construction to incremental filtering; Polars path builds partial product and merges remainder via pandas. Adds new `from_simplex` validation. |
| `baybe/constraints/base.py` | Adds `_required_filtering_parameters` and `has_polars_implementation`; updates docs for partial-dataframe filtering semantics. |
| `baybe/constraints/discrete.py` | Updates discrete constraints to support early/partial filtering and to raise `UnsupportedEarlyFilteringError` when unsupported. |
| `baybe/exceptions.py` | Adds `UnsupportedEarlyFilteringError`. |
| `tests/constraints/test_constrained_cartesian_product.py` | New test ensuring naive vs. incremental constrained product results match across several scenarios. |
| `tests/constraints/test_constraints_polars.py` | Updates imports for moved Cartesian product helpers. |
| `tests/test_searchspace.py` | Updates imports for moved Cartesian product helpers. |
| `tests/hypothesis_strategies/alternative_creation/test_searchspace.py` | Adjusts simplex-related tests to reflect new `from_simplex` constraints. |
| `CHANGELOG.md` | Documents incremental filtering and new constraint capability/exception additions. |
AdrianSosic left a comment
Hi @Scienfitz, I'll need some more time for the review but wanted to already share some comments so that you can start to think about it / we can discuss. More will follow 🙃
```python
@override
def get_invalid(self, data: pd.DataFrame) -> pd.Index:
    if not set(self.parameters) <= set(data.columns):
```
Two things:
- Wasn't the `_required_filtering_parameters` property exactly intended for this purpose, i.e. so we don't have to hardcode `set(self.parameters)` in each constraint but can dynamically read it from the property?
- I thought the reason to add the property in the first place was that we could automate this check, but now you've manually implemented it in each constraint!? I would have expected a template-design approach, where the check is implemented in the base class and the subclasses only implement a `_get_invalid` or similar.
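The requested template-design approach could look roughly as follows. Class and method names here are illustrative sketches, not the actual BayBE interfaces, and a plain `ValueError` stands in for the PR's `UnsupportedEarlyFilteringError`:

```python
from abc import ABC, abstractmethod

import pandas as pd


class DiscreteConstraintSketch(ABC):
    """Template method: the base class owns the column check once."""

    def __init__(self, parameters: list):
        self.parameters = parameters

    @property
    def _required_filtering_parameters(self) -> set:
        # Subclasses with special semantics could narrow this set.
        return set(self.parameters)

    def get_invalid(self, data: pd.DataFrame) -> pd.Index:
        missing = self._required_filtering_parameters - set(data.columns)
        if missing:
            raise ValueError(f"Missing columns: {sorted(missing)}")
        return self._get_invalid(data)

    @abstractmethod
    def _get_invalid(self, data: pd.DataFrame) -> pd.Index:
        """Constraint-specific logic, free of boilerplate checks."""


class NoDuplicatesSketch(DiscreteConstraintSketch):
    def _get_invalid(self, data: pd.DataFrame) -> pd.Index:
        # Rows with fewer unique values than columns contain a duplicate.
        n = len(self.parameters)
        return data.index[data[self.parameters].nunique(axis=1) < n]
```

This way, every subclass gets the availability check for free and only implements its own filtering logic.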
It depends.
I actually only wanted to use the helper in the two constraints that are more complex, i.e. where `self.parameters` is NOT the full set of affected parameters.
But I can understand your request, and it's definitely strange to use it in the column checks for some and not others. But then consider the many other applications of `self.parameters` that I would have to change for consistency as well. Here an example:

Should I apply it here at the top and in `evaluate_data` as well? That would be way more invasive than I intended.
A variant to clean this up a bit could be: do not make `_required_parameters` a base class property but only give it to the two problematic constraints. That would enforce a bit better that I am only using it as a workaround in the two problematic constraints. But it would revert the utility the `_required_parameters` property has in things like the new utils and things outside of the constraint class itself. Opinion?
@AdrianSosic with the changes to the control flow, the template workflow now made much more sense and I added it there
Re "should I apply it everywhere": I think to be consistent, you need to apply it in all places where `self.parameters` carries the same semantic meaning. You certainly know: there are cases where a quantity just happens to carry the same value but is semantically different! In these cases, `self.parameters` needs to stay, but in all others, I suggest changing it to `self._required_parameters`. You would indeed have to check case by case 😬
so what do you want further changed?
Exactly what I wrote: whenever there is a `self.parameters` that is semantically equivalent to `self._required_parameters` --> replace the former with the latter to make this semantic coupling explicit!
Hmm, the only ones I could find were in validation, where it was indeed useful.
This required lifting the property to `Constraint`, which should be fine.
3360cd5
Force-pushed from 78eb87b to 66c39c4
AVHopp left a comment
Incomplete review. Only had a look at the changes made to the constraints so far. Tried to comprehend the constraints, and the logic for them seems to check out. Will give a more in-depth review after some of the general issues pointed out by the others have been addressed.
```markdown
- Polars path in discrete search space construction now builds the Cartesian product
  only for parameters involved in Polars-capable constraints, merging the rest
  incrementally via pandas
```

Suggested change:

```markdown
- `Polars` path in discrete search space construction now builds the Cartesian product
  only for parameters involved in `Polars`-capable constraints, merging the rest
  incrementally via `pandas`
```
that would be quite inconsistent with the existing CHANGELOG fyi
```python
cols = set(data.columns)
pairs = [(p, c) for p, c in zip(self.parameters, self.conditions) if p in cols]
if not pairs:
    raise UnsupportedEarlyFilteringError(
```
The docstring clearly states what this function should do: "Get the indices of dataframe entries that are invalid under the constraint." No indices are invalid, so it should return an empty index and not raise an error, in my opinion.
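The reviewer's suggested behavior — return an empty index when none of the constraint's columns are present yet — could look like this. This is a standalone sketch; `conditions` are simplified here to plain sets of allowed values, which is not how BayBE's condition objects actually work:

```python
import pandas as pd


def get_invalid(data: pd.DataFrame, parameters: list, conditions: list) -> pd.Index:
    """Return indices of rows violating any applicable (parameter, condition) pair."""
    cols = set(data.columns)
    pairs = [(p, c) for p, c in zip(parameters, conditions) if p in cols]
    if not pairs:
        # None of the constraint's columns exist yet, so no row can be invalid.
        return pd.Index([])
    mask = pd.Series(False, index=data.index)
    for param, allowed in pairs:
        # A row is invalid if any present parameter takes a disallowed value.
        mask |= ~data[param].isin(allowed)
    return data.index[mask]
```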
Co-authored-by: AdrianSosic <adrian.sosic@merckgroup.com>
Force-pushed from 66c39c4 to 2c634b5
```python
if not simplex_parameters:
    return cls(parameters=product_parameters, exp_rep=product_space)
# Validate minimum number of simplex parameters
if len(simplex_parameters) < 2:
```
Can you elaborate why you changed this? Of course, we both agree that calling `from_simplex` with just one parameter is sort of meaningless, but:
- It's not "wrong", i.e. it still works. So should we really forbid it?
- If we really do want to forbid it (which I'm not sure about; perhaps a warning is more appropriate), I think the message needs to change. You are just saying what is not allowed, but it's missing what you should do instead.
- Despite it "working" and not causing any problems, I think it is most likely always a mistake if this is called with just 1 simplex parameter (can you name me a use case?). So it's simply safe to forbid it.
- The concepts of cardinality and sum constraints (which this method implements in a special way) do not make sense if there is just 1 parameter, agree? So it's safe to forbid.

Yes, I can change the message.
Agree that there is no original use case, but one may argue from a different angle:
- Errors are intended for "not compatible / something is wrong"
- Warnings are intended for "makes no particular sense / is unusual, but the logic is still valid"

The use case I could possibly see is when stuff is set up programmatically: you define your specs e.g. via list comprehensions and vary (for whatever reason) the number of numerical parameters that go into the simplex rule. If we go with an error, the user would need to change the code logic for the edge case of 1 parameter, while with a warning, that case would simply run through as well.
But if you have a strong preference for an error, I'm also fine with it.
OK, I turned it into warnings for `len < 2`.
In the case of 0, the rest is piped through to `from_product`, while for the case of 1, the logic of the function is called as is.
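The warning-based resolution could be sketched like this — a hypothetical standalone helper illustrating the agreed behavior, not the actual `from_simplex` code:

```python
import warnings


def check_simplex_parameter_count(n_simplex_parameters: int) -> None:
    """Warn, rather than raise, for the degenerate case of fewer than 2
    simplex parameters; the caller then runs the ordinary logic anyway."""
    if n_simplex_parameters < 2:
        warnings.warn(
            f"Got {n_simplex_parameters} simplex parameter(s), but a simplex "
            "constraint only makes sense for 2 or more. Consider using "
            "from_product instead.",
            UserWarning,
            stacklevel=2,
        )
```

This follows the error-vs-warning distinction from the discussion: the call is unusual but still valid, so it runs through with a warning instead of forcing users of programmatic setups to special-case it.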
5e5d731
```diff
-# Remove entries that violate parameter constraints:
-_apply_constraint_filter_pandas(exp_rep, constraints)
+# Merge product parameters and apply constraints incrementally
+exp_rep = parameter_cartesian_prod_pandas_constrained(
```
Not caused by your change, but a general question: why are we using hard-coded pandas here? Could it be that we simply forgot to add the polars path?
yes there was never a polars path here imo
```python
polars_param_names: set[str] = set()
for c in polars_constraints:
    polars_param_names.update(c._required_parameters)
polars_params = [p for p in parameters if p.name in polars_param_names]
```
Perhaps a stupid question, but what happens downstream from here when the set of parameters is only a subset of the polars parameters? For example, I could have a polars constraint on ["A", "B", "C"] while my parameters contain only ["A", "B"]. Then polars_params will be "incorrectly" filtered down to just ["A", "B"]
```python
)
initial_df = lazy_df.collect().to_pandas()
# Apply Polars constraints that failed back via pandas
_apply_constraint_filter_pandas(
```
Hm, could it be that this is now also just dead code because we filter to polars-compatible constraints upfront? Or what am I not seeing here?
```python
# (smallest expansion factor during cross-merging).
ordered: list[DiscreteParameter] = []
available: set[str] = set()
remaining = list(constrained)
```
The `remaining` variable is a bit unlucky, I think:
- It initially has the same content as `unconstrained` and is mutated, but `unconstrained` is never used again --> no need for a second variable.
- I think all of them should be sets, not lists, right?
```python
    parameters: Sequence[DiscreteParameter],
    constraints: Sequence[DiscreteConstraint],
) -> list[DiscreteParameter]:
    """Compute an optimal parameter ordering for incremental space construction.
```
Just out of curiosity: have you considered if a true optimal solution is feasible?
If not, I think the docstring needs to be adjusted (i.e. it's not optimal but a greedy approximation)
Not considered, and "optimal" was not intended in the mathematical sense.
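The greedy flavor of such an ordering — at each step pick whichever parameter lets a constraint fire soonest, tie-breaking on the smallest value count — can be illustrated as follows. This is a toy sketch of the heuristic under discussion, not the actual `compute_parameter_order` implementation, and as noted it is a greedy approximation, not provably optimal:

```python
def greedy_order(sizes: dict, constraint_groups: list) -> list:
    """Greedily order parameters: maximize the number of constraint groups
    completed at each step, tie-breaking on the fewest parameter values."""
    ordered = []
    available = set()
    remaining = set(sizes)
    while remaining:
        def score(name):
            would_have = available | {name}
            completed = sum(group <= would_have for group in constraint_groups)
            # Negative count so min() prefers more completed groups;
            # name as final key makes the result deterministic.
            return (-completed, sizes[name], name)
        best = min(remaining, key=score)
        ordered.append(best)
        available.add(best)
        remaining.remove(best)
    return ordered
```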
8f2e521
```python
import polars as pl


def compute_parameter_order(
```
Really minor, so just resolve if you disagree: perhaps the name of the function could be a bit less generic, i.e. `compute` says very little (it's like `get`). How about something with `optimize`, to make clear what is happening?
```python
# Initialize the dataframe
if initial_df is not None:
    df = initial_df
else:
    df = pd.DataFrame()
```
How about

Suggested change:

```diff
-# Initialize the dataframe
-if initial_df is not None:
-    df = initial_df
-else:
-    df = pd.DataFrame()
+# Initialize the dataframe
+df = initial_df or pd.DataFrame()
```
Seems explicitly forbidden by pandas:
`ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().`

This PR implements a more optimized Cartesian product creation in the presence of constraints which can result in memory and time gains of many orders of magnitude (see mini benchmark below).
Rationale
- Previously, the full Cartesian product was built and filtered only afterwards, unless the `from_simplex` constructor was used.
- **As soon as possible filter:** A constraint can be applied as soon as all of its affected parameters are in the current crossjoin-df. After this application the constraint is fully ensured and does not have to be applied again. If the order in which the cross join goes over the parameters is optimized, this alone already leads to an improvement, as subsequent operations "see" much smaller left dataframes.
- **Partial/early filter:**
- **Look ahead:** Some constraints can look ahead based on the possible parameter values that might be incoming and recognize that the constraint cannot be fulfilled even in future crossjoin iterations. This is what `from_simplex` implements for the very special case of 1 global sum constraint and 1 cardinality constraint. If we ever implement look-ahead filters for all constraints, the `from_simplex` constructor might become obsolete.
- Added `IMPROVE` notes to remember about tier 3.
- To achieve this, `Constraint.get_invalid` was extended to handle situations where not all parameters are in the df to be filtered. The constraint can then decide whether it can apply early filtering or raise the new `UnsupportedEarlyFilteringError` if it needs all parameters present.
- Added `parameter_cartesian_prod_pandas_constrained`, which itself performs the process described above after deciding on a smart parameter order for the crossjoin.
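The look-ahead idea for the sum-constraint case can be sketched as follows. This is a toy illustration of the principle, not the actual `from_simplex` implementation; it assumes all parameter values are non-negative, so a partial sum can never decrease in later merges:

```python
import pandas as pd


def simplex_product(param_values: dict, total: float) -> pd.DataFrame:
    """Cross-join parameters one by one, pruning partial rows whose running
    sum already exceeds `total` (valid because values are non-negative)."""
    df = pd.DataFrame({"_sum": [0.0]})
    for name, values in param_values.items():
        df = df.merge(pd.DataFrame({name: values}), how="cross")
        df["_sum"] = df["_sum"] + df[name]
        df = df[df["_sum"] <= total + 1e-9]  # look-ahead prune
    # Keep only rows that hit the target exactly (the simplex surface)
    df = df[(df["_sum"] - total).abs() < 1e-9]
    return df.drop(columns="_sum").reset_index(drop=True)
```

Pruning doomed partial rows at every step is what turns product sizes like the ~12B-row benchmark case below into something that fits in memory.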
Good To Know
- `has_polars_implementation`, discussion here
- `_filtering_parameters`, discussion here
- `DiscreteNoLabelDuplicatesConstraint` in `DiscretePermutationInvarianceConstraint`, `.get_invalid` explained here

Mini Benchmark:
- `from_product`, 7×8 cat, `NoLabelDuplicates` (2M→40K rows)
- `from_simplex`, 6-slot mixture + 3 extras (~12B→22K rows)