4/25/2009

Unicode Support in Flex

Flex and Yacc are widely used in programming language design and implementation projects, especially in the academic community. It is a pity, though, that many serious projects that need a compiler front end do not use Flex even when they use Yacc. For example, MySQL uses Yacc with a hand-written lexer to parse SQL. One reason is that the original Flex has no direct Unicode support, which is unacceptable for software aimed at today's flat world.

A lively discussion with useful information on this problem can be found here.

Although Flex doesn't have direct Unicode support, there are some solutions to this problem:

Solution 1 - Use UTF-8 Encoding

The basic idea is that UTF-8 encoded source code can be treated as a byte stream that is compatible with ASCII, which the current Flex handles well.

For a single Unicode character outside a range expression, just substitute its UTF-8 encoding (most likely a multi-byte sequence). For a range of Unicode characters, you may need to rewrite the expression as several regular expressions that use explicit byte values (such as \x1f or \xa5).

Let's look at some concrete examples:

1. Suppose you plan to use the Chinese word "转移" as the language keyword for a branch statement. You can write the rule as:
"\xe8\xbd\xac\xe7\xa7\xbb" { return KW_BRANCH; }
Here, "e8 bd ac e7 a7 bb" is the UTF-8 representation of "转移".
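
If you do not want to look up the bytes by hand, a few lines of C can produce them. The following is only a minimal sketch: the function names utf8_encode_bmp and flex_keyword_pattern are made up for this illustration, and it assumes every code point lies in U+0000..U+FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Encode one code point in U+0000..U+FFFF as UTF-8.
 * Writes 1 to 3 bytes to out and returns the byte count.
 * (Hypothetical helper, not part of Flex.) */
static size_t utf8_encode_bmp(unsigned cp, unsigned char out[3])
{
    assert(cp <= 0xFFFFu);
    if (cp < 0x80u) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0u | (cp >> 12));
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}

/* Write the code points as a quoted string of \xNN escapes that can be
 * pasted into a lex rule. Returns 0 on success, -1 if buf is too small. */
static int flex_keyword_pattern(const unsigned *cps, size_t n,
                                char *buf, size_t cap)
{
    size_t pos = 0, i, j;

    if (cap < 3)
        return -1;
    buf[pos++] = '"';
    for (i = 0; i < n; i++) {
        unsigned char b[3];
        size_t len = utf8_encode_bmp(cps[i], b);
        for (j = 0; j < len; j++) {
            char esc[5];
            /* keep room for this escape, the closing quote and the NUL */
            if (pos + 4 + 2 > cap)
                return -1;
            snprintf(esc, sizeof esc, "\\x%02x", (unsigned)b[j]);
            memcpy(buf + pos, esc, 4);
            pos += 4;
        }
    }
    buf[pos++] = '"';
    buf[pos] = '\0';
    return 0;
}
```

Called with the two code points U+8F6C and U+79FB (the characters of "转移"), flex_keyword_pattern fills buf with "\xe8\xbd\xac\xe7\xa7\xbb", quotes included, which is the pattern used in the rule above.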

2. Suppose you plan to define a Unicode version of "any char" (like '.' in the ANSI version), that is, to give the name UniAnyChar to the characters in the range '\u0000' - '\uffff'. You can write the definition as:

UniAnyChar [\0-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEF][\x80-\xBF][\x80-\xBF]
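
The four alternatives in this definition do not have to be worked out by hand. The sketch below splits a code point range into byte-level alternatives: first at the points where the UTF-8 length changes, then wherever the continuation bytes would not cover a full \x80-\xBF block. This is a commonly used splitting approach, not something Flex provides; the names encode_bmp, emit, split and utf8_range_regex are invented here, and the code assumes ranges inside U+0000..U+FFFF.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* UTF-8 encode a code point in U+0000..U+FFFF; returns the byte count. */
static int encode_bmp(unsigned cp, unsigned char b[3])
{
    if (cp < 0x80u) { b[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800u) {
        b[0] = (unsigned char)(0xC0u | (cp >> 6));
        b[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    b[0] = (unsigned char)(0xE0u | (cp >> 12));
    b[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    b[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}

/* Append one alternative, e.g. [\xC2-\xDF][\x80-\xBF], to out. */
static void emit(unsigned lo, unsigned hi, char *out, size_t cap)
{
    unsigned char a[3], b[3];
    int n = encode_bmp(lo, a), m = encode_bmp(hi, b), i;
    size_t pos = strlen(out);

    assert(n == m);                 /* split() guarantees equal lengths */
    if (pos > 0) {
        snprintf(out + pos, cap - pos, "|");
        pos = strlen(out);
    }
    for (i = 0; i < n; i++) {
        if (a[i] == b[i])
            snprintf(out + pos, cap - pos, "\\x%02X", (unsigned)a[i]);
        else
            snprintf(out + pos, cap - pos, "[\\x%02X-\\x%02X]",
                     (unsigned)a[i], (unsigned)b[i]);
        pos = strlen(out);
    }
}

static void split(unsigned lo, unsigned hi, char *out, size_t cap)
{
    static const unsigned last_of_len[] = { 0x7Fu, 0x7FFu };
    int i;

    /* 1. Never let a range cross a change in encoded length. */
    for (i = 0; i < 2; i++) {
        unsigned max = last_of_len[i];
        if (lo <= max && max < hi) {
            split(lo, max, out, cap);
            split(max + 1, hi, out, cap);
            return;
        }
    }
    /* 2. For multi-byte ranges, align on continuation-byte boundaries. */
    for (i = 1; hi >= 0x80u && i <= 2; i++) {
        unsigned m = (1u << (6 * i)) - 1;   /* low bits held by i trailing bytes */
        if ((lo & ~m) != (hi & ~m)) {
            if ((lo & m) != 0) {
                split(lo, lo | m, out, cap);
                split((lo | m) + 1, hi, out, cap);
                return;
            }
            if ((hi & m) != m) {
                split(lo, (hi & ~m) - 1, out, cap);
                split(hi & ~m, hi, out, cap);
                return;
            }
        }
    }
    emit(lo, hi, out, cap);
}

/* Fill out with a byte-level regex matching UTF-8 for lo..hi. */
static void utf8_range_regex(unsigned lo, unsigned hi, char *out, size_t cap)
{
    assert(lo <= hi && hi <= 0xFFFFu && cap > 0);
    out[0] = '\0';
    split(lo, hi, out, cap);
}
```

For the range 0x0000 to 0xFFFF this produces the same four alternatives as the UniAnyChar definition, except that NUL is written \x00 rather than \0. Like the definition, it makes no attempt to exclude the surrogate range U+D800..U+DFFF.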

To fully understand this example, you need to know how UTF-8 works and how regular expressions are used in Flex. The following links have further information:
http://lists.gnu.org/archive/html/help-flex/2005-01/msg00030.html
http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html
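
It may also help to see what UniAnyChar accepts when written as ordinary C. The function below is a hand-written mirror of the four alternatives, one if statement per alternative. It is not code generated by Flex, and the name uni_any_char_len is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* Return the length in bytes of the UniAnyChar match at the start of s
 * (n bytes are available), or 0 if none of the alternatives matches. */
static size_t uni_any_char_len(const unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);

    /* [\0-\x7F] : U+0000..U+007F */
    if (n >= 1 && s[0] <= 0x7F)
        return 1;

    /* [\xC2-\xDF][\x80-\xBF] : U+0080..U+07FF
     * (lead bytes C0 and C1 would be overlong encodings, so they are out) */
    if (n >= 2 && in_range(s[0], 0xC2, 0xDF) && in_range(s[1], 0x80, 0xBF))
        return 2;

    /* \xE0[\xA0-\xBF][\x80-\xBF] : U+0800..U+0FFF
     * (a second byte below A0 would be an overlong encoding) */
    if (n >= 3 && s[0] == 0xE0 &&
        in_range(s[1], 0xA0, 0xBF) && in_range(s[2], 0x80, 0xBF))
        return 3;

    /* [\xE1-\xEF][\x80-\xBF][\x80-\xBF] : U+1000..U+FFFF */
    if (n >= 3 && in_range(s[0], 0xE1, 0xEF) &&
        in_range(s[1], 0x80, 0xBF) && in_range(s[2], 0x80, 0xBF))
        return 3;

    return 0;
}
```

A function like this can double as a quick check when you edit the byte ranges: feed it the encodings you expect to match or fail and compare with what the scanner does.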

In effect, with this solution you are writing regular expressions against bit patterns rather than against a more readable character set. If you adopt this method, the lex rules become very hard to understand and maintain.

Solution 2 - Unicode Version of Flex

There is a patch that adds Unicode support to Flex. After applying the patch and building the binary, you can use the -U option to process a Unicode lex file, in which regular expressions can be written using Unicode characters directly. Here is an example of a Unicode lex file.

Here are the instructions on how to apply this patch.

However, this patch is NOT an official release and only works with Flex version 2.5.4a. Furthermore, it is designed for Linux; you may need Cygwin or MinGW to apply it on Windows.

Other Tools

Of course, you have other choices, such as using a language with built-in Unicode support and its corresponding lex/yacc tools.

Or you can use another fantastic tool, ANTLR, which works with C/C++.

Some papers have also reported Unicode Flex projects.

NOTE:
- On Windows, you can use WideCharToMultiByte()/MultiByteToWideChar() with CodePage = CP_UTF8 to do UTF-8 encoding/decoding.

2 comments:

Eric Prud'hommeaux said...

Adding utf-8 Encoding to Lex describes Solution 1 in some detail. The yacker web demo provides an easy way to play around with this.

The need for Unicode shouldn't drive people away from great tools like flex.

Anonymous said...

gcc return

error: expected expression before ‘[’ token
UniAnyChar [\0-\x7F] | [\xC2-\xDF][\x80-\xBF] | \xE0[\xA0-\xBF][\x80-\xBF] | [\xE1-\xEF][\x80-\xBF][\x80-\xBF] ;

.....