Sunday, February 13, 2011

Are regex tools (like RegexBuddy) a good idea?

One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. - Jamie Zawinski

And:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. - Brian Kernighan

My concerns are (respectively):

  • That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).

  • That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.

Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?

  • Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.

    glenatron : I almost used exactly this analogy in my reply :)
    Kris : I'm not gonna bother posting this myself, just upvote this.
    Keng : wow...congrats on the rep max out (200/day max increase).
    EBGreen : Heh...that appears to be the way it works. I just answered to simply express my opinion. I didn't expect this response at all.
    Adam Bellaire : A succinct answer is often best. In this case the consensus seems to be: yes, I am being paranoid, let them learn and use whatever tools are most appropriate to do the job best.
    EBGreen : You're not paranoid if people really are out to get you. Programmers do make things too complex sometimes. But as long as they don't go dark, someone will usually point it out to them and they hopefully learn from it.
    From EBGreen
  • You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.

  • Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.

  • I'm not sure why there is so much distrust of regex.

    Yes, they can become messy and obscure, exactly like any other piece of code somebody may write, but they have an advantage over code: they represent, in a formally specified way, the set of strings one is interested in (at least as specified by your language, if there are extensions). Understanding which set of strings is accepted by a piece of hand-written code requires "reverse engineering" the code.

    Sure, you could discourage the use of regex, as has already been done with recursion and gotos, but this would be justified to me only if there's a good alternative.

    I would prefer to maintain a single-line regex than a convoluted hand-made function that tries to capture a set of strings.

    On using a tool to understand a regex (or write a new one), I think it's perfectly fine! If somebody wrote it with the tool, somebody else could understand it with a tool! Actually, if you are worried about this, I would see tools like RegexBuddy as your best insurance that the code will not be unmaintainable just because of the regexes.

    Adam Bellaire : Hi Remo, to clarify, I don't discourage regexes. Like you say, a single-line regex is often the best solution to a problem that otherwise might require 20 lines of code. It's giant, monstrous regexes that can't be deciphered outside RegexBuddy that I'm worried about. :)
    EBGreen : Do you do code review? If so, that would hopefully catch any regex that would be a problem.
    Kibbee : What kind of code would be needed to replace the giant, monstrous regex that you are worried about? A large regex with good comments, which is broken apart with the ignore pattern whitespace option, is much easier to comprehend than the equivalent substring/indexof solution.
    Adam Bellaire : That's a good point. I guess I was thinking along the lines of /^1?$|^(11+?)\1+$/ but bigger. Sure, it'll test whether a number is prime, but it's inefficient in both time and space, and it's also difficult to determine what the heck it's actually for (if it weren't as famous as this example, that is).
    From Remo.D
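Kibbee's point about the ignore-pattern-whitespace option can be sketched roughly as follows. This is only an illustration in Python, where the equivalent flag is `re.VERBOSE`; the date pattern and the names `DATE_RE` and `parse_date` are made up for the example and do not come from the thread.

```python
import re

# Hypothetical pattern: a loosely validated date such as 2011-02-13.
# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment, so each piece can be documented on its own line.
DATE_RE = re.compile(
    r"""
    ^
    (?P<year>\d{4})                 # four-digit year
    -
    (?P<month>0[1-9]|1[0-2])        # month, 01 through 12
    -
    (?P<day>0[1-9]|[12]\d|3[01])    # day, 01 through 31 (no per-month check)
    $
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if text does not match."""
    m = DATE_RE.match(text)
    if m is None:
        return None
    return int(m.group("year")), int(m.group("month")), int(m.group("day"))
```

The one-line form of the same pattern is `^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$`; whether the commented version is easier to maintain is the judgment call being debated above.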
  • Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.

    Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance and writing Regular Expressions is not the whole of coding so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyways.

    If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.

    Regexbuddy is a useful shortcut to make sure that the regular expressions you are writing do what you expect; it certainly makes life easier, but what seems important to me about your question is whether to use regular expressions at all.

    Adam Bellaire : As I commented to Remo, I fully agree with and embrace the use of regular expressions in general. I'm starting to see (based on the responses) that since that's already a given, worrying about which tool is being used is, in fact, being overly paranoid. :-)
    From glenatron
  • Like others have said, I think using or not using such a tool is a neutral issue. More to the point: If a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking them down into several steps of matching, either with multiple match statements (=~), or by building up a regexp with sub regexps.

    Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.

    From Pistos
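Building a regexp up from sub-regexps, as Pistos describes, might look something like this. The answer refers to Perl-style `=~` matching; this is a Python sketch instead, and the assignment format and the names `IDENT`, `NUMBER`, `ASSIGNMENT`, and `parse_assignment` are all invented for illustration.

```python
import re

# Small, individually readable pieces (hypothetical grammar).
IDENT = r"[A-Za-z_]\w*"        # an identifier such as "rate" or "_x1"
NUMBER = r"-?\d+(?:\.\d+)?"    # an integer or simple decimal, optional sign

# The larger pattern is assembled from the named pieces.
ASSIGNMENT = re.compile(rf"^\s*({IDENT})\s*=\s*({NUMBER})\s*$")


def parse_assignment(line):
    """Return (name, value) for a line like 'x = 3.5', or None."""
    m = ASSIGNMENT.match(line)
    if m is None:
        return None
    name, value = m.groups()
    return name, float(value)
```

Each named piece can be tried on its own before it is combined, which is one way to keep the final expression short enough to read without comments.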
  • I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.

    It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.

    From rmeador
  • What you should be doing is getting your other devs hooked up with RB.

    Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.

    Dan : Perl is one of the few languages that makes working with regular expressions relatively painless.
    Kibbee : I agree. I would love another language to support REGEX as a first class citizen, so you didn't have to create objects, and initialize 10 different things just to create a simple regex.
    From Keng
  • Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote; as with any technology or technique, there are both good and bad ways to apply them.

    Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.

    As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:

    • Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
    • Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array or hash (whatever your language provides) and test the string against those.
    • Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
    • Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
    • Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). It also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.
    From Dan
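Dan's guidelines can be sketched together in a few lines. Dan writes as a Perl programmer; the following is a Python illustration only, and the "length with unit" format plus every name in it (`ALLOWED_UNITS`, `same_ignoring_case`, `is_allowed_unit`, `parse_length`) are hypothetical.

```python
import re

# Possible values kept in a set rather than in a regex alternation.
ALLOWED_UNITS = {"px", "em", "pt"}


def same_ignoring_case(a, b):
    """Case-insensitive comparison with plain string operations, no regex."""
    return a.casefold() == b.casefold()


def is_allowed_unit(unit):
    """Membership test against the set of allowed values."""
    return unit.lower() in ALLOWED_UNITS


# The regex is built from small named parts and kept private to one function.
_NUMBER = r"\d+(?:\.\d+)?"
_UNIT = r"[A-Za-z]+"
_LENGTH_RE = re.compile(rf"({_NUMBER})\s*({_UNIT})")


def parse_length(text):
    """Return (value, unit) for input like '12px' or '1.5 em', else None."""
    m = _LENGTH_RE.fullmatch(text.strip())
    if m is None or not is_allowed_unit(m.group(2)):
        return None
    return float(m.group(1)), m.group(2).lower()
```

Because the regex lives behind `parse_length`, a handful of assertions such as `parse_length("12px") == (12.0, "px")` and `parse_length("12 miles") is None` can exercise it in isolation, which covers the "write tests" guideline as well.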
  • Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.

    So, documentation can help.

    A really trivial example is a table like the following (just a suggestion):

    Expression        Match     Reason
    ^                 Pos 0     Start of input
    \s+               "      "  At least one space
    (abs|floor|ceil)  ceil      One of "abs", "floor", or "ceil"
    ...
    

    I see the issue, though. You probably want to discourage people from building more complex regular expressions than they can parse. I think standards can address this, by always requiring expanded REs and checking that the annotation is proper.

    However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.

    It's relative.

    From Axeman
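A table like the one Axeman suggests could be produced by something along these lines. This is a rough Python sketch, not a description of any existing tool: `PARTS` and `annotate` are invented names, and matching the parts one after another is a simplification, since a real regex engine can backtrack across parts where this loop cannot.

```python
import re

# (sub-expression, reason) pairs taken from the table above; the "..." row
# is left out. Note that \s matches any whitespace, not only spaces.
PARTS = [
    (r"^", "Start of input"),
    (r"\s+", "At least one space"),
    (r"(abs|floor|ceil)", 'One of "abs", "floor", or "ceil"'),
]


def annotate(parts, text):
    """Match each part in sequence and return (expression, match, reason) rows.

    Stops at the first part that fails; that row has None as its match.
    """
    rows = []
    pos = 0
    for expr, reason in parts:
        m = re.compile(expr).match(text, pos)
        if m is None:
            rows.append((expr, None, reason))
            break
        rows.append((expr, m.group(0), reason))
        pos = m.end()
    return rows


if __name__ == "__main__":
    for expr, matched, reason in annotate(PARTS, "   ceil(x)"):
        print(f"{expr:<18}{matched!r:<10}{reason}")
```

A failing row (match of `None`) shows exactly which piece of the expression the input stopped satisfying, which is the kind of annotation a reviewer could check against the author's stated intent.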
