Monday, April 11, 2011

Regex to find the code blocks in C#

I need to find the code blocks in a given piece of source code using Regex in C#.

For example, I need to find the For loop blocks in the following code:

For A in 1..10 
Loop
stmt1;
For C in cur_op
Loop
stmt2;
end loop;
end loop;
For T in 4..8
loop
stmt3;
end loop;

I want to retrieve the code blocks as

For A in 1..10 
Loop
stmt1;
For C in cur_op
Loop
stmt2;
end loop;
end loop;

and

For T in 4..8
loop
stmt3;
end loop;

Can anyone suggest a regex for this?

Thanks.

From stackoverflow
  • I don't think it's possible. You're asking for a regex that parses a context-free language, and while Perl regexes actually can parse CFLs, I'm not sure C# regular expressions can, and using them that way is not much of a pleasure.

    The natural solution to your problem would be to write a parser for the language and get the information from that. You could use a parser generator like Coco/R or ANTLR.

    Archie : I had thought of using a parser, but for a few reasons I decided to implement the parsing in C# only. One of those reasons is that I need to parse multiple languages, so I would have to write a grammar for each of them. Anyway, thanks for the suggestion.
  • Balancing with a regex is not that obvious, since you need to properly match the outermost For ... end loop;.

    It could be simpler to use a basic loop that goes through each line, incrementing a counter when the line contains For and decrementing it when the line contains end loop;. While the counter is positive, you append the current line to your result. When the counter drops back to 0, you append that closing end loop; line as well, and you've got your block.

    I've proposed something similar recently for another SO question.

    This is quick and easy if all you need is to get the blocks between For and end loop;.

    As jpalecek said, if you need to do more, you would be better off using a more appropriate parsing method or your job will become increasingly difficult.

    Archie : I have considered this kind of approach, but I thought that if a single regex could get me what I want, it would be far easier. Anyway, thanks for the suggestion.
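
For readers who want to try the line-counting idea from this answer, here is a minimal sketch in C#. It is not the answerer's code: the names LoopBlockScanner and FindBlocks are made up for illustration, and it assumes that a block opens on a line starting with For and closes on a line starting with end loop; (case-insensitive), and that neither keyword appears inside comments or string literals.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects outermost For ... end loop; blocks
// by counting nesting depth line by line.
public static class LoopBlockScanner
{
    // Assumption: an opening line starts with the word "For".
    private static readonly Regex Opens =
        new Regex(@"^\s*for\b", RegexOptions.IgnoreCase);

    // Assumption: a closing line starts with "end loop;".
    private static readonly Regex Closes =
        new Regex(@"^\s*end\s+loop\s*;", RegexOptions.IgnoreCase);

    public static List<string> FindBlocks(string source)
    {
        var blocks = new List<string>();
        var current = new List<string>();
        int depth = 0;

        foreach (string rawLine in source.Split('\n'))
        {
            // Drop trailing whitespace, including a '\r' from CRLF input.
            string line = rawLine.TrimEnd('\r', ' ', '\t');

            if (Opens.IsMatch(line))
                depth++;

            // Keep every line while we are inside at least one loop.
            if (depth > 0)
                current.Add(line);

            if (depth > 0 && Closes.IsMatch(line))
            {
                depth--;
                if (depth == 0)
                {
                    // The outermost loop just closed: emit the block.
                    blocks.Add(string.Join("\n", current));
                    current.Clear();
                }
            }
        }

        return blocks;
    }
}
```

Calling LoopBlockScanner.FindBlocks on the sample from the question should give two strings, one for the nested For A block and one for the For T block. Lines outside any loop are skipped, and an unterminated loop at the end of the input is silently dropped, which may or may not be what you want.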
  • Well, it is possible to do that with .NET Regex, and if you really don't need a real parser, you can go for this solution. It is nicely explained in this article on codeproject.com, and I can confirm that it works well (I used it to implement a simple BBCode parser).

    Your pattern might look something like this:

    String pattern = @"
    (?# line 01) For ... in ...
    (?# line 02) (?>
    (?# line 03)   For ... in ... (?<DEPTH>)
    (?# line 04)   |
    (?# line 05)   end loop; (?<-DEPTH>)
    (?# line 06)   |
    (?# line 07)   .?
    (?# line 08) )*
    (?# line 09) (?(DEPTH)(?!))
    (?# line 10) end loop;
    ";
    
    Archie : Thanks for the suggestion, but it is not quite useful, since the structure of the code is not fixed.
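
As a concrete take on the balancing-group idea from this answer, the sketch below fills in the schematic pattern for the sample in the question. It is one possible reading, not the answerer's actual pattern: the class name LoopBlockRegex and the method FindBlocks are hypothetical, and it assumes that the words For and end loop; never occur inside comments or string literals. The pattern relies on RegexOptions.IgnorePatternWhitespace so that it can be laid out over several lines, and on RegexOptions.Singleline so that the dot also matches newlines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: matches outermost For ... end loop; blocks
// with .NET balancing groups.
public static class LoopBlockRegex
{
    private const string Pattern = @"
        \bfor\b                            # outermost For opens the block
        (?>
            \bfor\b (?<DEPTH>)             # nested For: push onto DEPTH
          |
            \bend\s+loop\s*; (?<-DEPTH>)   # nested end loop: pop from DEPTH
          |
            (?!\bfor\b|\bend\s+loop\s*;) . # anything that is not a keyword
        )*
        (?(DEPTH)(?!))                     # fail if a nested For is still open
        \bend\s+loop\s*;                   # closes the outermost For
    ";

    private static readonly Regex Blocks = new Regex(
        Pattern,
        RegexOptions.IgnorePatternWhitespace
            | RegexOptions.IgnoreCase
            | RegexOptions.Singleline);

    public static List<string> FindBlocks(string source)
    {
        var result = new List<string>();
        foreach (Match match in Blocks.Matches(source))
        {
            result.Add(match.Value);
        }
        return result;
    }
}
```

The pop in the second alternative fails when the DEPTH stack is empty, so the repeated group stops right before the end loop; that belongs to the outermost For, and the last line of the pattern consumes it. On the sample from the question this should yield the nested For A block and the For T block as two separate matches.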
