Add like_regex predicate to JSON path #24616
Open
Description
This PR introduces the `like_regex` predicate to JSON path. It aims to comply with the ISO/IEC 9075-2:2016(E) standard. The only feature defined in the standard that is not implemented here is the case-insensitive mode (the `i` flag). The reason for this is described in issue #24615.

Relevant links to the specifications referenced by the standard:
7.6.1 Regular Expression Syntax
F Regular Expressions
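For readers unfamiliar with the predicate, here is a rough sketch of the behavior a `like_regex` filter is expected to have. It is not the code in this PR. It assumes unanchored matching (the predicate holds if the pattern matches anywhere in the string) and assumes that non-string items simply do not satisfy the filter. Python's `re` module stands in for the regular expression dialect referenced above, which differs in details, and the helper names `like_regex` and `filter_like_regex` are hypothetical.

```python
import re


def like_regex(value, pattern: str) -> bool:
    """Hypothetical stand-in for the JSON path like_regex predicate."""
    # Assumption: items that are not strings never satisfy the predicate.
    if not isinstance(value, str):
        return False
    # Assumption: unanchored search, so a match anywhere in the string counts.
    return re.search(pattern, value) is not None


def filter_like_regex(items, pattern: str):
    """Keep only the items for which the predicate holds, in order."""
    return [item for item in items if like_regex(item, pattern)]
```

With these assumptions, `filter_like_regex(["foo", "bar", 7, "food"], "^fo")` keeps `"foo"` and `"food"`. Case-insensitive matching is deliberately left out, mirroring the missing `i` flag described above.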
I want to highlight a specific part of the specification for reviewers to verify if my interpretation is correct.
The ISO/IEC 9075-2:2016(E) standard, in Section 9.22, specifies the following:
Points 2 and 3 specify that if the provided arguments contain characters outside the UCS repertoire, they should be transliterated to UCS. My understanding is that if the arguments are valid UTF-8 sequences, they contain only characters within the UCS repertoire. Even if the sequence includes code points that do not have characters currently assigned to them, they are still part of the repertoire, as specified in Unicode Technical Report #17:
Point 9 states that the definition of a regular expression match is implementation-defined when the character repertoire of the provided subject string is not UCS. The approach in this PR is similar to what we do for string functions: if the string is not valid UTF-8, the result is undefined.
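The interpretation above can be illustrated with a small sketch. It is only an illustration, not the implementation in this PR. It assumes the subject arrives as raw bytes, uses strict UTF-8 decoding as the validity check, and models "undefined" by returning `None`; a real implementation may not detect invalid input at all. The names `is_valid_utf8` and `like_regex_checked` are hypothetical.

```python
import re
from typing import Optional


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def like_regex_checked(subject: bytes, pattern: str) -> Optional[bool]:
    """Match only when the subject is valid UTF-8.

    Returning None is an assumption of this sketch: it makes the
    "result is undefined" case visible instead of producing an
    arbitrary answer.
    """
    if not is_valid_utf8(subject):
        return None
    return re.search(pattern, subject.decode("utf-8")) is not None
```

Under this check, the two-byte sequence `b"\xcd\xb8"` (the encoding of U+0378, a code point that may have no assigned character) is accepted as valid, which matches the reading that such code points are still part of the repertoire. A stray byte such as `b"\xff"` is rejected.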
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: