Add regexp raw pattern API #1061

Open
janhartman wants to merge 1 commit into main from jan/regexp-string-api

@janhartman janhartman commented May 14, 2026

Some downstream callers need to marshal or recompile query regexps and were reaching into q.Regexp.String(), which is the underlying syntax tree debug string rather than an official query API.

This adds query.Regexp.RegexpString() as the public API for getting the marshaled raw regexp pattern. Existing query.Regexp string and gob serialization paths now route through that method.

query.Regexp.String() remains query/debug formatting and still includes wrappers like regex:"...".
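The split between the two methods can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Regexp` type here is a hypothetical stand-in for `query.Regexp`, the `mustParse` helper is invented for the example, and the raw pattern comes from the standard library's `syntax.Regexp.String()` rather than from `internal/syntaxutil`.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// Regexp is a hypothetical stand-in for query.Regexp; the real type
// carries more fields than the parsed syntax tree.
type Regexp struct {
	Regexp *syntax.Regexp
}

// RegexpString returns the raw pattern, suitable for marshaling or
// recompiling. Assumption: this sketch uses the stdlib String() method;
// the PR may route through a different marshaling function.
func (q *Regexp) RegexpString() string {
	return q.Regexp.String()
}

// String returns the query/debug form, which wraps the raw pattern.
func (q *Regexp) String() string {
	return `regex:"` + q.RegexpString() + `"`
}

// mustParse is a helper for this example only; it panics on a bad pattern.
func mustParse(pattern string) *syntax.Regexp {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		panic(err)
	}
	return re
}

func main() {
	q := &Regexp{Regexp: mustParse(`abc`)}
	fmt.Println(q.RegexpString()) // raw pattern, safe to recompile
	fmt.Println(q.String())       // debug form with the regex:"..." wrapper
}
```

Callers that need to recompile or log the pattern would use `RegexpString()`, while `String()` stays reserved for query formatting.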

Member

@keegancsmith keegancsmith left a comment

I'm a bit unsure about this, so I haven't deeply reviewed the regexp.go code change. As such I am requesting changes. If you are keen on this direction, then re-request review and I will review properly.

internal/syntaxutil is a part of the stdlib's regexp/syntax code, with a commit reverted that gave us a performance problem. See the README.md in internal/syntaxutil.

We shouldn't just be modifying it. If we do, we need to use a pattern which makes it clear how we changed it, e.g. put the new functions in a different file and include comments in the original file indicating changes. Finally, you also need to update the README.

Out of interest, why do you need to make this change? In the original thread about this issue I noted we had OOMs in Sourcegraph due to not using this function. Did you find that even after using this function we still had OOMs and needed to "unsimplify"?

How did you test this? Did you have a reproduction before and after? In particular, I wonder if our issue is in fact that Simplify does a simplification we don't like or don't think makes sense. Is the issue then not in Simplify? E.g. we have other bits of code which take a syntax.Regexp and construct matchtrees. I worry those bad patterns affect that code path as well.

Comment thread internal/syntaxutil/regexp_test.go Outdated
}
}

func TestRegexpStringCompactOptionalRepeatNonGreedy(t *testing.T) {
Member

This test and the one above are so similar that it is hard to tell the difference. Can you maybe add a table test which tests what happens after simplify? Or maybe in the table test below you have want and wantSimplify?
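A table with both `want` and `wantSimplify` columns could look roughly like the sketch below. It is an illustration under assumptions: it uses the standard library's `regexp/syntax` instead of `internal/syntaxutil`, the `checkCases` function name is made up, and the two cases are deliberately simple inputs rather than the patterns from this PR. In a real test the failure messages would go to `t.Errorf` instead of being collected in a slice.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// checkCases parses each input and compares the marshaled pattern before
// and after Simplify. It returns one message per mismatch.
func checkCases() []string {
	cases := []struct {
		input        string // the regex string is the only input
		want         string // marshaled pattern of the parsed tree
		wantSimplify string // marshaled pattern after Simplify
	}{
		{input: `abc`, want: `abc`, wantSimplify: `abc`},
		{input: `a{2}`, want: `a{2}`, wantSimplify: `aa`},
	}

	var failures []string
	for _, tt := range cases {
		re, err := syntax.Parse(tt.input, syntax.Perl)
		if err != nil {
			failures = append(failures, fmt.Sprintf("parse %q: %v", tt.input, err))
			continue
		}
		if got := re.String(); got != tt.want {
			failures = append(failures, fmt.Sprintf("%q: got %q, want %q", tt.input, got, tt.want))
		}
		if got := re.Simplify().String(); got != tt.wantSimplify {
			failures = append(failures, fmt.Sprintf("%q simplified: got %q, want %q", tt.input, got, tt.wantSimplify))
		}
	}
	return failures
}

func main() {
	for _, f := range checkCases() {
		fmt.Println(f)
	}
}
```

Keeping both expectations in one row makes the effect of Simplify on each input visible at a glance, and the input itself appears in every failure message.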

Comment thread query/query.go
}

// RegexpString returns the marshaled raw regexp pattern for q.
func (q *Regexp) RegexpString() string {
Member

This change makes sense no matter what. We want a way to get at this. I do think in Sourcegraph, though, we want to be able to directly use syntaxutil.RegexpString. We had a need to marshal the string for observability, which is outside of the need of marshalling a query.Regexp.

Comment thread query/regexp_test.go Outdated
}{
{
name: "simple literal",
q: &Regexp{Regexp: mustParseRE(`abc`)},
Member

Minor: your input table should just be the regex string. The extra noise of constructing the regexp here makes the case harder to read. Also, the subtest name seems not that useful over the actual input. Just include the input in the t.Fatal.

Comment thread query/regexp_test.go Outdated
if got := tt.q.RegexpString(); got != tt.want {
t.Fatalf("RegexpString() = %q, want %q", got, tt.want)
}
if got, want := tt.q.String(), `regex:"`+tt.want+`"`; got != want {
Member

Low-value assertion, just focus on RegexpString.

Comment thread query/regexp_test.go Outdated
}
}

func TestRegexpStringIsRawPattern(t *testing.T) {
Member

low value test

Comment thread query/regexp_test.go Outdated
}
}

func TestRegexpStringBoundedRepeatCompiles(t *testing.T) {
Member

Ok, so this test better demonstrates the fix you are adding. Are you sure the better fix isn't "undoing" the optimize step which creates the large pattern? Or, I suppose, does that have potential perf impacts, so we should just focus on marshalling?
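To make the "large pattern" concern concrete, the sketch below measures how much a bounded repeat grows when it is marshaled after Simplify. It is only an illustration with the standard library's `regexp/syntax`: the `marshaledSizes` helper and the `a{2,50}` input are assumptions for the example, and it shows the size growth in general rather than the specific failure this PR addresses.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// marshaledSizes returns the length of the marshaled pattern before and
// after Simplify, which expands bounded repeats into explicit copies.
func marshaledSizes(pattern string) (before, after int, err error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, 0, err
	}
	before = len(re.String())
	after = len(re.Simplify().String())
	return before, after, nil
}

func main() {
	before, after, err := marshaledSizes(`a{2,50}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("before Simplify: %d bytes, after Simplify: %d bytes\n", before, after)
}
```

The compact `{min,max}` form stays a few bytes long, while the simplified tree spells out every repetition, so marshaling the simplified tree directly produces a much longer string.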

@janhartman janhartman force-pushed the jan/regexp-string-api branch from eac2b70 to 37db178 Compare May 14, 2026 13:47
@janhartman
Author

@keegancsmith Thanks for the review. I scaled this back to the minimal Zoekt change.

@janhartman janhartman requested a review from keegancsmith May 14, 2026 13:49