
Regular expression support builtin to Cobra


Postby hopscc » Tue Jan 04, 2011 1:56 am

I've added a patch to ticket:174 that makes Cobra support regular expressions as builtins to the language.

Basically it adds a new type for a regular expression literal (regexp), supporting either just an RE pattern or an RE pattern with flags.
The literal is written as a string (like a raw string), single or double quoted, with an 're' prefix. Patterns with flags are '/' delimited; patterns without flags may be '/' delimited but don't need to be.
re = re'\s([a-zA-Z]+)'   # simple Regexp literal
re = re'/\s([a-z]+)/i' # Regexp with flags - case insensitive


Flags supported are
i - Ignorecase = case-insensitive matching.
c - Compile = compile the RE. This yields faster execution but increases startup time
s - Singleline = Single-line mode. Change so '.' matches every char instead of '[^\n]' (every character except \n).
m - Multiline = Multiline mode. Change so '^' and '$' match start and end of lines instead of start and end of entire string
x - ExplicitCaptures - the only valid captures are explicitly named or numbered groups of the form (?<name>…).
W - Ignores unescaped white space in the pattern and enables comments marked with '#'. (Not very useful without multiline (string) support)
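
For readers who want to try the flag semantics outside the patch, here is a rough analogue using Python's re module. This is an assumption for illustration only: it is not the Cobra syntax from the ticket, the sample text and variable names are made up, and Python has no direct counterpart to the 'c' (compile to IL) or 'x' (explicit captures) flags.

```python
import re

text = "Alpha\nbeta\nGAMMA"  # hypothetical sample input

# 'i': case-insensitive matching
ignore_case = re.search(r"gamma", text, re.IGNORECASE)

# 's': single-line mode, '.' also matches '\n'
dot_default = re.search(r"Alpha.beta", text)             # '.' stops at '\n'
dot_singleline = re.search(r"Alpha.beta", text, re.DOTALL)

# 'm': multiline mode, '^' and '$' match at line boundaries
whole_string_only = re.findall(r"^\w+$", text)
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# 'W': ignore unescaped whitespace and allow '#' comments in the pattern
verbose = re.compile(r"""
    ^\s* @param   # tag at the start of a line
    \s+ (.*) $    # capture the rest of the line
""", re.VERBOSE | re.MULTILINE)
params = verbose.findall("@param a\n  @param b")
```

The last pattern also shows why the whitespace-ignoring flag is mostly useful with multi-line string literals: the comments only pay off when the pattern can span several lines.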

plus support for 3 new binary operators
~

    - operator name is 'RE_match'
    - generally equivalent to .NET Match (overloaded in an enumerable context, such as a for loop, to .NET Matches)
    - generates nil (failure) or a Match (or, when overloaded, a MatchCollection)

~=

    - operator name is 'RE_hasMatch'
    - tests if a RE matches a string ( .Net IsMatch).
    - generates a bool

~|

    - operator name is 'RE_splits'
    - splits a string on the RE pattern
    - Generates a List<of String>
In all three cases the operators expect the LHS operand to be a regular expression (Regexp) and the RHS operand to be a string.

These are supported both for typed and dynamic operands.
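
As a point of comparison, the three operations map onto ordinary library calls in most regex libraries. The sketch below uses Python's re module purely as an illustration (an assumption, not the patch's implementation); the operator names in the comments are the ones from the list above, and the variable names are hypothetical.

```python
import re

text = "@param fare param1\n @param fare param 2\n@param fare last param"
pattern = re.compile(r"^\s*@param\s+(.*)$", re.MULTILINE)
no_match = re.compile(r"^no.match.evah$", re.MULTILINE)

# '~' (RE_match): the first match object, or None on failure
first = pattern.search(text)
missing = no_match.search(text)

# '~' in an enumerable context: iterate over every match
captures = [m.group(1) for m in pattern.finditer(text)]

# '~=' (RE_hasMatch): reduce the result to a plain bool
has_match = pattern.search(text) is not None

# '~|' (RE_splits): split the string on the pattern, giving a list of strings
pieces = re.split(r"\n?\s?@param ", text)
```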

There's a longer description on the ticket.

Here's an example:
# Contrived example
str = '@param fare param1\n @param fare param 2\n@param fare last param'

re = re'/^\s*@param\s+(.*)$/m'
reX = re'/^no.match.evah$/m'
assert 'Regex' in re.typeOf.toString

# isMatch
if re ~= str, assert true
else, assert false, 'str match re - ismatch FAIL'
assert re ~= str
assert not reX ~= str

# Match
if re ~ str, assert true
else, assert false, 'str match re - match FAIL'
m = re ~ str
assert 'Match' in m.typeOf.toString
assert m and m.success
#print m

m = reX ~ str
assert not m
assert not reX ~ str

#Matches/MatchCollection
for m in re ~ str
    assert m.groups[1].value.startsWith('fare')


# split
reSplit = re'\n?\s?@param '
#split = reSplit.split(str)
split = reSplit ~| str
assert split.count == 4
assert split[0] == ''
for i in 1 : split.count
    assert split[i].startsWith('fare')


I'm not sure it's any clearer than using the lib functions but it is less wordy.
hopscc
 
Posts: 632
Location: New Plymouth, Taranaki, New Zealand

Re: Regular expression support builtin to Cobra

Postby Charles » Tue Jan 04, 2011 4:39 am

Eh, I've had some thoughts on regexes for a while, but I really didn't want to get into a new major feature right now because it detracts from bug fixes and refinements, which I think we need more than regexes. Plus major features often introduce new bugs and new sorely needed refinements.

In any case, given that Cobra is keyword and method oriented, why would we use ~| instead of a method name or existing keyword operator?:

# "in" works for strings:
what = 'fox'
text = 'The quick brown fox jumps over the lazy dog.'
assert what in text

# why not for regexes?
assert someRE in text


The cryptic ~| could be done with a .split method which also opens up the possibility of an overload that takes options (max splits, etc.).
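
To illustrate the kind of option a method form leaves room for, here is how a max-splits argument looks in Python's re module. This is only an analogue; the names and sample text are hypothetical and nothing here is Cobra or part of the ticket.

```python
import re

text = "@param a\n@param b\n@param c"   # hypothetical sample input
separator = re.compile(r"\n?@param ")

# Split on every occurrence of the separator
all_pieces = separator.split(text)

# Same call with an option: stop after at most two splits
limited = separator.split(text, maxsplit=2)
```

With the limit in place the remainder of the string comes back unsplit as the last element, which an operator with only two operands has no obvious way to express.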
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Regular expression support builtin to Cobra

Postby torial » Tue Jan 04, 2011 8:24 am

Charles wrote: Eh, I've had some thoughts on regexes for awhile, but I really didn't want to get into a new major feature right now because it detracts from bug fixes and refinements which I think we need more than regexes. Plus major features often introduce new bugs and new sorely needed refinements.

In any case, given that Cobra is keyword and method oriented, why would we use ~| instead of a method name or existing keyword operator?:

# "in" works for strings:
what = 'fox'
text = 'The quick brown fox jumps over the lazy dog.'
assert what in text

# why not for regexes?
assert someRE in text


The cryptic ~| could be done with a .split method which also opens up the possibility of an overload that takes options (max splits, etc.).


Of the three operators added, as you said, ~| can simply be .split, and "in" covers ~= (i.e. the boolean hasMatch), but what about ~ (i.e. the matches operation)? Are you open to ~ as a new operator? If a keyword is preferred, are there any existing ones that would be suitable and reusable for such a purpose?
torial
 
Posts: 229
Location: IA

Re: Regular expression support builtin to Cobra

Postby hopscc » Wed Jan 05, 2011 3:45 am

Whoops - thought I'd posted this last night

Interesting... I have many questions

What bug fixes (tickets) and refinements (enhancements) do we need (more)?
A specific listing, annotation and/or augmentation of any of the tickets might be useful.

Are you saying we can't add major features because they may cause bugs or need further refinement?
Are 'bug fixes and refinements' somehow immune from this possibility?
- Why are we not equally paralysed from making changes for them also?

If Cobra is keyword and method oriented, why do we use any operators at all rather than all method names and keywords?

Re ~| (RE_splits): it is already available as a method on the .NET class (Regex.Split), as are all the other regex capabilities (Match for ~, IsMatch for ~=). These of course have different names on other platforms and in other languages, though the use of '~' (with modifiers) seems a reasonably common choice.
The ticket mentions the convenience of having them built in,
and making them built in hoists that support away from the library implementation and is more succinct (and arguably more readable).

What existing keyword would be an intelligible/natural substitute for ~| (RE_splits)?

How would overloading existing keywords be any clearer than a small related set of additional operators (using the same prefix ~)?


I can't say that I see an RE as being in a string; it may or may not match some, parts or all of the string though....
Leaving that aside, perhaps that's marginally OK for a boolean match/not match test, but what happens for the rest of the capabilities (groups and captures)?

All operators start off being cryptic (but succinct); it's only familiarity with common usage or experience that makes them less so
( <> vs != , ?=, >>,...)


I think I'll defer mentioning use of augmented RE pattern literals and a ~: operator for a RE substitution/replacement. :)
hopscc
 
Posts: 632
Location: New Plymouth, Taranaki, New Zealand

Re: Regular expression support builtin to Cobra

Postby todd.a » Thu Jan 06, 2011 6:33 pm

Great addition hops.

I can see 'in' for isMatch and even in the for statement (provided it returns a MatchCollection in that context), but as mentioned before I don't see how matches fits into the picture so that groups and captures can be sucked out. This keyword would probably be the better fit for the for statement anyway. I'm not really a fan of the operators either (they tend to be forgotten unless you use them daily) and I can't say they're intuitive from the point of view of traditional operator use (neither bitwise nor logical).

Another common operation would be replace. Seems fitting for an overload for methods in the String class (where applicable).
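
For reference, this is what a regex-based replace looks like as a plain library call, using Python's re.sub as a stand-in. It is an illustrative assumption only, not a proposal for the Cobra String API, and the sample text is made up.

```python
import re

text = "@param fare one\n@param fare two"   # hypothetical sample input

# Replace each leading '@param' tag (and the whitespace after it) with a bullet
replaced = re.sub(r"^@param\s+", "- ", text, flags=re.MULTILINE)
```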

What about the syntax not requiring re'...' and just /../ like JavaScript? I'm not pushing for additional parsing headaches :). It would also be nice for the compiler to provide errors/warnings for the expression (dunno if this is already done) so that we don't have to wait till run-time.

Anyway, it's a great addition and it's nice seeing Regexes make their way in the language as a first-class citizen.
todd.a
Site Admin
 
Posts: 81
Location: Chicago, IL

