
Regular expression support builtin to Cobra

Posted: Tue Jan 04, 2011 1:56 am
by hopscc
I've added a patch to ticket:174 that makes Cobra support regular expressions as built-ins in the language.

Basically it adds a new type for a regular expression literal (regexp), supporting either just an RE pattern or an RE pattern with flags.
The literal is written as a string (like a raw string), single- or double-quoted, with an 're' prefix. Patterns with flags are '/' delimited; patterns without flags may be '/' delimited but don't need to be.
re = re'\s([a-zA-Z]+)'   # simple regexp literal
re = re'/\s([a-z]+)/i'   # regexp with flags - case insensitive


Flags supported are:
i - Ignorecase = case-insensitive matching.
c - Compile = compile the RE. This yields faster execution but increases startup time.
s - Singleline = single-line mode. Changes '.' to match every character instead of '[^\n]' (every character except \n).
m - Multiline = multiline mode. Changes '^' and '$' to match the start and end of each line instead of the start and end of the entire string.
x - ExplicitCaptures = the only valid captures are explicitly named or numbered groups of the form (?<name>…).
W - Ignores unescaped whitespace in the pattern and enables comments marked with '#'. (Not very useful without multiline string support.)
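For anyone unsure what the i, s and m flags change in practice, here is a minimal sketch. It uses Python's re module purely as a stand-in (an assumption, not the patch's implementation): re.I, re.S and re.M behave like the ignore-case, single-line and multiline flags listed above. The sample text and variable names are made up for illustration.

```python
import re

text = "first line\nsecond line"  # hypothetical sample input

# No flags: '.' stops at '\n', so '.+' covers only the first line.
dot_default = re.match(r".+", text).group(0)

# 's' (single-line): '.' also matches '\n', so '.+' covers the whole string.
dot_singleline = re.match(r".+", text, re.S).group(0)

# No flags: '^' anchors only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)

# 'm' (multiline): '^' also anchors right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# 'i' (ignore case): the upper-case pattern still finds 'second'.
found_ignorecase = re.search(r"SECOND", text, re.I) is not None
```

The flag letters in the patch would map onto the .NET options rather than these Python constants, so treat this only as a picture of the semantics.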

Plus support for 3 new binary operators:
~

    - operator name is 'RE_match'
    - generally equivalent to .NET Match (overloaded in an enumerable context, such as a for statement, to .NET Matches)
    - generates nil (failure) or a Match (or, when overloaded, a MatchCollection)

~=

    - operator name is 'RE_hasMatch'
    - tests if an RE matches a string (.NET IsMatch)
    - generates a bool

~|

    - operator name is 'RE_splits'
    - splits a string on the RE pattern
    - generates a List<of String>

In all three cases the operators expect the LHS operand to be a regular expression (regexp) and the RHS operand to be a string.

These are supported both for typed and dynamic operands.
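As a rough picture of what the three operators compute, here is a sketch written as plain functions. Python's re module stands in for the .NET Regex class (an assumption; the patch itself targets Cobra), and the function names re_match, re_has_match and re_splits are hypothetical, chosen only to mirror the operator names above.

```python
import re
from typing import List, Optional


def re_match(pattern: re.Pattern, text: str) -> Optional[re.Match]:
    """Stand-in for `regexp ~ text`: the first match, or None (nil) on failure."""
    return pattern.search(text)


def re_has_match(pattern: re.Pattern, text: str) -> bool:
    """Stand-in for `regexp ~= text`: True if the pattern matches anywhere."""
    return pattern.search(text) is not None


def re_splits(pattern: re.Pattern, text: str) -> List[str]:
    """Stand-in for `regexp ~| text`: the pieces of text between matches."""
    return pattern.split(text)


# Hypothetical sample data for illustration.
text = "@param fare param1\n @param fare param 2\n@param fare last param"
param_re = re.compile(r"^\s*@param\s+(.*)$", re.M)
no_match_re = re.compile(r"^no.match.evah$", re.M)

first_group = re_match(param_re, text).group(1)
missing = re_match(no_match_re, text)
has_param = re_has_match(param_re, text)
pieces = re_splits(re.compile(r"\n?\s?@param "), text)
```

In each call the pattern plays the role of the LHS operand and the string the RHS operand, matching the operand order described above.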

There's a longer description on the ticket.

Here's an example:
# Contrived example
str = '@param fare param1\n @param fare param 2\n@param fare last param'

re = re'/^\s*@param\s+(.*)$/m'
reX = re'/^no.match.evah$/m'
assert 'Regex' in re.typeOf.toString

# isMatch
if re ~= str, assert true
else, assert false, 'str match re - ismatch FAIL'
assert re ~= str
assert not reX ~= str

# Match
if re ~ str, assert true
else, assert false, 'str match re - match FAIL'
m = re ~ str
assert 'Match' in m.typeOf.toString
assert m and m.success
#print m

m = reX ~ str
assert not m
assert not reX ~ str

#Matches/MatchCollection
for m in re ~ str
    assert m.groups[1].value.startsWith('fare')


# split
reSplit = re'\n?\s?@param '
#split = reSplit.split(str)
split = reSplit ~| str
assert split.count == 4
assert split[0] == ''
for i in 1 : split.count
    assert split[i].startsWith('fare')


I'm not sure it's any clearer than using the library functions, but it is less wordy.

Re: Regular expression support builtin to Cobra

Posted: Tue Jan 04, 2011 4:39 am
by Charles
Eh, I've had some thoughts on regexes for a while, but I really didn't want to get into a new major feature right now because it detracts from bug fixes and refinements, which I think we need more than regexes. Plus, major features often introduce new bugs and new, sorely needed refinements.

In any case, given that Cobra is keyword and method oriented, why would we use ~| instead of a method name or existing keyword operator?:

# "in" works for strings:
what = 'fox'
text = 'The quick brown fox jumps over the lazy dog.'
assert what in text

# why not for regexes?
assert someRE in text


The cryptic ~| could be done with a .split method which also opens up the possibility of an overload that takes options (max splits, etc.).
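To make the method-based alternative concrete, here is a rough sketch of a split that takes a max-splits option, plus a plain containment-style test. Python's re module is used as a stand-in (an assumption; nothing here is Cobra or the patch), maxsplit is Python's name for the option, and the pattern and text are borrowed from the example earlier in the thread.

```python
import re

text = "@param fare param1\n @param fare param 2\n@param fare last param"
splitter = re.compile(r"\n?\s?@param ")

# Unlimited split: every occurrence of the pattern is a cut point.
all_parts = splitter.split(text)

# Limited split: stop after two cuts and keep the remainder intact.
first_two = splitter.split(text, maxsplit=2)

# Boolean "is there a match in the text" test, the role `in` would play.
contains = splitter.search(text) is not None
```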

Re: Regular expression support builtin to Cobra

Posted: Tue Jan 04, 2011 8:24 am
by torial
Charles wrote: Eh, I've had some thoughts on regexes for a while, but I really didn't want to get into a new major feature right now because it detracts from bug fixes and refinements, which I think we need more than regexes. Plus, major features often introduce new bugs and new, sorely needed refinements.

In any case, given that Cobra is keyword and method oriented, why would we use ~| instead of a method name or existing keyword operator?:

# "in" works for strings:
what = 'fox'
text = 'The quick brown fox jumps over the lazy dog.'
assert what in text

# why not for regexes?
assert someRE in text


The cryptic ~| could be done with a .split method which also opens up the possibility of an overload that takes options (max splits, etc.).


Of the three operators added, as you said, ~| can simply be .split, and "in" covers ~= (i.e. the boolean hasMatch), but what about ~ (i.e. the matches operation)? Are you open to ~ as a new operator? If a keyword is preferred, are there any existing ones that would be suitable and reusable for such a purpose?

Re: Regular expression support builtin to Cobra

Posted: Wed Jan 05, 2011 3:45 am
by hopscc
Whoops - thought I'd posted this last night

Interesting... I have many questions

What bug fixes (tickets) and refinements (enhancements) do we need (more)?
A specific listing, annotation and/or augmentation of any of the tickets might be useful.

Are you saying we can't add major features cos they may cause bugs or need further refinement?
Are 'bug fixes and refinements' somehow immune from this possibility?
- Why are we not equally paralysed from making changes for them also?

If Cobra is keyword and method oriented, why do we use any operators at all rather than all method names and keywords?

Re ~| (RE_splits): it is already available as a method on the .NET class (Regex.Split), as are all the other regex capabilities (Match ~, IsMatch ~=). These of course have different names on other platforms and in other languages, though the use of '~' (with modifiers) seems a reasonably common choice.
The ticket mentions the convenience of having them built in;
making them built in hoists that support away from the library implementation and is more succinct (and arguably more readable).

What existing keyword would be an intelligible/natural substitute for ~| (RE_splits)?

How would overloading existing keywords be any clearer than a small related set of additional operators (using the same prefix ~)?


I can't say that I see an RE as being in a string; it may or may not match some, part or all of the string, though...
Leaving that aside, perhaps that's marginally OK for a boolean match/no-match test, but what happens for the rest of the capabilities (groups and captures)?

All operators start off being cryptic (but succinct); it's only familiarity through common usage or experience that makes them less so
(<> vs !=, ?=, >>, ...)


I think I'll defer mentioning use of augmented RE pattern literals and a ~: operator for a RE substitution/replacement. :)

Re: Regular expression support builtin to Cobra

Posted: Thu Jan 06, 2011 6:33 pm
by todd.a
Great addition hops.

I can see 'in' working for isMatch and even in the 'for' statement (provided it returns a MatchCollection in that context), but as mentioned before I don't see how matches fits into the picture so that groups and captures can be sucked out. This keyword would probably be the better fit for the 'for' statement anyway. I'm not really a fan of the operators either (they tend to be forgotten unless you use them daily), and I can't say they're intuitive from the point of view of traditional operator use (neither bitwise nor logical).

Another common operation would be replace. Seems fitting for an overload for methods in the String class (where applicable).

What about the syntax not requiring re'...' and just /../ like JavaScript? I'm not pushing for additional parsing headaches :). It would also be nice for the compiler to provide errors/warnings for the expression (dunno if this is already done) so that we don't have to wait till run-time.

Anyway, it's a great addition and it's nice seeing Regexes make their way in the language as a first-class citizen.