Wednesday, April 13, 2011

Regex word boundary for multi-byte strings

I am using the POSIX C regex library (regcomp/regexec) in my search application. My application supports different languages, including those that use multi-byte characters. I'm encountering a problem when using the word boundary metacharacter (\b). For single-byte strings, it works just fine, e.g.:

"\bpaper\b" matches "paper"

However, if the regex and query strings are multi-byte, it doesn't seem to work correctly, e.g.:

"\b紙張\b" doesn't match "紙張"

Am I missing something? Any help would be highly appreciated.

Requested Info:

  • Programming Language: C
  • Regex Library: GNU C (regex.h)

Thanks.

From stackoverflow
  • I think this depends on the library / programming language you are using and on how your regex library is configured. You probably have to turn on multibyte support, tell the library which character encoding you are using, or adjust the locale settings accordingly. Some special operators like \b or \w depend on these settings.

  • if the regex and query strings are multi-byte, it doesn't seem to work correctly

    What is “multi-byte” in this context? A string encoded into UTF-8 bytes? A locale-specific multibyte encoding such as GB?

    If you're not dealing with wide (Unicode) strings natively, you can't expect any more support for non-ASCII characters than just detecting they're there. POSIX regex doesn't specify any character classes for bytes outside the ASCII range, so it doesn't know that any of the bytes in "\xe7\xb4\x99" (the UTF-8 representation of '紙') could be considered word-letters; hence it sees no word boundaries.

    What constitutes a letter or a word in Unicode is a more involved question than simple ASCII regex can cope with. (And obviously, what constitutes a ‘word’ in Chinese is arguable in itself.) If all you want to detect is plain old spaces, you could do that explicitly:

    (\s|^)紙張(\s|$)
    
    teriz: I meant UTF-8 bytes. I just realized that the word boundary metacharacter only works for word characters, which technically means alphanumeric characters plus _. This worked for me! Thanks! =)
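
For reference, here is a minimal sketch of the explicit-delimiter approach from the answer above, using regcomp/regexec. The helper name contains_word and the 256-byte pattern buffer are made up for illustration, and it assumes the word contains no regex metacharacters. It uses the POSIX class [[:space:]] in place of \s, since \s is an extension and not part of POSIX ERE. The UTF-8 bytes of the word are written as hex escapes so the example does not depend on the source file's encoding.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper (not part of regex.h): returns 1 if `word` occurs in
 * `text` delimited by whitespace or the start/end of the string, 0 if it
 * does not, and -1 on error.
 * Assumption: `word` contains no regex metacharacters. */
int contains_word(const char *text, const char *word)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(text != NULL && word != NULL);

    /* Spell out the delimiters instead of relying on \b. */
    n = snprintf(pattern, sizeof pattern,
                 "(^|[[:space:]])%s([[:space:]]|$)", word);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;                      /* word too long for the buffer */

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                      /* pattern failed to compile */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

Called as contains_word("paper \xe7\xb4\x99\xe5\xbc\xb5 ink", "\xe7\xb4\x99\xe5\xbc\xb5"), this should report a match, because the word is surrounded by ASCII spaces. Keep in mind that [[:space:]] in the default C locale only covers ASCII whitespace, so a delimiter such as the ideographic space (U+3000) would not count as a boundary here.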
