Tokenization with Flex and Ruby - Format Express Blog

This article explains how to integrate a Flex scanner into Ruby code. To illustrate this, I show step by step how to create a gem containing a basic JSON parser using Flex rules.

My use case: the first step for a formatter like Format Express is to split the input into significant tokens. For example, the JSON formatter must isolate the syntactic symbols ({, :, ,, ]...), identify JSON keys and values ("foo", 42, ...), ignore insignificant whitespace, and so on.
I use Flex to extract the tokens, given a set of rules (regular expressions). From the sequence of tokens, the JSON can be nicely reconstructed.

This article is split into 4 sections:

Create a gem project with C extension
Set up a Flex file and test it
Integrate Flex into Ruby code
What's next? & FAQ

Prerequisite

I make sure flex is installed.


          $ flex --version
          flex 2.6.4

I will also be using ruby, rake, bundle and gcc in this article.


          $ ruby --version; rake --version; bundle --version; gcc --version
          ruby 2.7.1p83
          rake, version 13.0.3
          Bundler version 2.1.4
          gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

A note on how Flex works: the flex command is a scanner generator. It is not used directly to tokenize input; instead, it generates the C source code of a scanner that matches the provided rules. To be usable from Ruby code, this C code must be compiled and wrapped inside a gem.

Create a new gem project with C extension

So I create a dedicated gem named json_flex_parser. I run the bundle gem command to create the gem project skeleton, with the --ext option to generate a gem with a C extension.


          $ bundle gem json_flex_parser --ext --no-coc --mit
          MIT License enabled in config
          create  json_flex_parser/Gemfile
          create  json_flex_parser/lib/json_flex_parser.rb
          create  json_flex_parser/lib/json_flex_parser/version.rb
          create  json_flex_parser/json_flex_parser.gemspec
          create  json_flex_parser/Rakefile
          create  json_flex_parser/README.md
          create  json_flex_parser/bin/console
          create  json_flex_parser/bin/setup
          create  json_flex_parser/.gitignore
          create  json_flex_parser/.travis.yml
          create  json_flex_parser/test/test_helper.rb
          create  json_flex_parser/test/json_flex_parser_test.rb
          create  json_flex_parser/LICENSE.txt
          create  json_flex_parser/ext/json_flex_parser/extconf.rb
          create  json_flex_parser/ext/json_flex_parser/json_flex_parser.h
          create  json_flex_parser/ext/json_flex_parser/json_flex_parser.c
          Initializing git repo in ./json_flex_parser
          Gem 'json_flex_parser' was successfully created.
          For more information on making a RubyGem visit https://bundler.io/guides/creating_gem.html

A new directory json_flex_parser has been created, with many files. I'll look at some of them later in this article, but many won't be modified. You can check the Bundler documentation for more options.

For now, I just validate that everything is fine by compiling the gem, using the rake command.


          $ cd json_flex_parser
          json_flex_parser$ rake compile
          mkdir -p tmp/i686-linux/json_flex_parser/2.7.1
          cd tmp/i686-linux/json_flex_parser/2.7.1
          ruby -I. ../../../../ext/json_flex_parser/extconf.rb
          creating Makefile
          cd -
          cd tmp/i686-linux/json_flex_parser/2.7.1
          /usr/bin/make
          compiling ../../../../ext/json_flex_parser/json_flex_parser.c
          linking shared-object json_flex_parser/json_flex_parser.so
          cd -
          mkdir -p tmp/i686-linux/stage/lib/json_flex_parser
          install -c tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so lib/json_flex_parser/json_flex_parser.so
          cp tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so tmp/i686-linux/stage/lib/json_flex_parser/json_flex_parser.so

The bundle command generated an empty C extension in ./ext/json_flex_parser/json_flex_parser.c and the rake command just compiled it without errors. In the third section of this article, I'll replace this file with the C scanner generated by Flex. But for now, let's just set up a Flex file and test it, without any Ruby integration.

Set up the Flex rules

The Flex rules are specified in a dedicated file json_flex_parser.l, which I put in a new directory src/json_flex_parser, with the following content:


          %{
            #include <stdlib.h>
            #include <stdio.h>
          %}
           
          /* Some Flex options */
          %option reentrant
          %option fast
          %option nounput noyywrap noinput
           
          /* Definitions that will be used in the rules */
          DELIMITERS         [,:{}\[\]]
          DIGIT              [0-9]
           
          %%
           
           /* ---  The rules (simplified JSON subset) --- */
           
          [ \t]+          { }                              /* Ignore spaces and tabs */
          {DELIMITERS}    { printf("=> %s\n", yytext); }   /* JSON syntax: commas, colons, braces... */
          \"[^"]*\"       { printf("=> %s\n", yytext); }   /* Strings */
          -?{DIGIT}+      { printf("=> %s\n", yytext); }   /* Integer numbers */
          .               { printf("=> %s\n", yytext); }   /* Anything else */
           
          %%
           
          /* The C main function that will be used to test the rules with stdin */
          int main(int argc, char *argv[]) {
            printf("Enter JSON\n");
            yyscan_t scanner;
            yylex_init ( &scanner );
            yylex ( scanner );
            yylex_destroy ( scanner );
            return EXIT_SUCCESS;
          }

Each rule (the five lines between the two %% markers) is composed of a regular expression that identifies a token and a C block that handles it. For brevity, I declared only 5 rules. That is not enough for a complete JSON parser, but it is sufficient for this article. Links to more complete sets of rules for JSON can be found in the FAQ at the end of the article.

Quick look at the rules:

the first rule identifies spaces and tabs and does nothing with them (they're just discarded).
the next 3 rules identify lexical JSON tokens (delimiters, strings, numbers) and print each token on a separate line.
the last rule handles "anything else" (because the input may not be valid JSON), and also prints it.
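To make the behaviour of these rules more concrete, here is a rough pure-Ruby imitation built on StringScanner. It is only an illustration and not the Flex scanner itself: the RULES table and the tokenize method are hypothetical names, and the sketch assumes Flex's usual matching policy (longest match wins, and the rule listed first wins a tie).

```ruby
require "strscan"

# Pure-Ruby imitation of the five Flex rules (illustration only).
RULES = [
  [/[ \t]+/,     :skip],   # spaces and tabs are discarded
  [/[,:{}\[\]]/, :token],  # JSON delimiters
  [/"[^"]*"/,    :token],  # strings (no escape handling, as in the Flex rule)
  [/-?[0-9]+/,   :token],  # integer numbers
  [/./,          :token]   # anything else, one character at a time
].freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # Keep the longest match; on a tie, the rule listed first stays selected.
    best = nil
    RULES.each do |regex, action|
      length = scanner.match?(regex) # match length at the current position, or nil
      next if length.nil?
      best = [regex, action, length] if best.nil? || length > best[2]
    end
    if best.nil?
      scanner.getch # no rule matches (e.g. a newline): drop the character
      next
    end
    text = scanner.scan(best[0])
    tokens << text if best[1] == :token
  end
  tokens
end

p tokenize('{"code": -5}')
```

The sketch returns the tokens in an array instead of printing them, which is also what the Ruby integration later in this article does.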

The file ends with the declaration of a main function, so I can run the scanner standalone: it reads the standard input and applies the rules to it. There is no Ruby integration yet; that comes in the next section.

But first, I want to validate the Flex file. I run the flex command to turn it into a C program. The C source will be generated into a temporary tmp directory that I create first.


          src/json_flex_parser$ mkdir tmp
          src/json_flex_parser$ flex --outfile=tmp/lexer.c --header-file=tmp/lexer.h json_flex_parser.l

I check the generated C files in the tmp directory, and compile them with gcc.


          src/json_flex_parser$ cd tmp
          src/json_flex_parser/tmp$ ls
          lexer.c  lexer.h
          src/json_flex_parser/tmp$ gcc -o lexer.out lexer.c
          src/json_flex_parser/tmp$ ls
          lexer.c  lexer.h  lexer.out

So far, so good. Now let's try it! I execute the program, and provide it with an example JSON like
{"name":"foo", "result":{"code":0} }


          src/json_flex_parser/tmp$ ./lexer.out
          Enter JSON
          {"name":"foo", "result":{"code":0} }
          => {
          => "name"
          => :
          => "foo"
          => ,
          => "result"
          => :
          => {
          => "code"
          => :
          => 0
          => }
          => }
          ^C

As expected, the JSON has been split into significant tokens (braces, strings, numbers), the spaces have been discarded and each token is printed on a separate line. The basic Flex rules work, so now it's time to integrate this Flex file into a Ruby library. The directory tmp can be deleted.

Integrate Flex into Ruby code

First, I update the Rakefile to declare 3 tasks:

one for the compilation of the Flex file, to generate the C files into the ext/json_flex_parser directory
one for the compilation of those C files, to build the C extension lib/json_flex_parser/json_flex_parser.so
one final task for the execution of tests.


          require "bundler/gem_tasks"
          require "rake/testtask"
          require "rake/extensiontask"
           
          DIR_SRC = "src/json_flex_parser"
          DIR_EXT = "ext/json_flex_parser"
          DIR_LIB = "lib/json_flex_parser"
           
          task :default => 'json_flex_parser:compile'
           
          namespace 'json_flex_parser' do
           
            # Task for the compilation of Flex file into C files
            desc 'Compile Flex file'
            task :flex do
              exec "flex -8 -P json_flex_ --outfile=#{DIR_EXT}/lexer.c --header-file=#{DIR_EXT}/lexer.h #{DIR_SRC}/json_flex_parser.l"
            end
            CLEAN.include("#{DIR_EXT}/*.c")
            CLEAN.include("#{DIR_EXT}/*.h")
            task :compile => :flex
           
            # Task for the compilation of the C extension
            Rake::ExtensionTask.new do |ext|
              ext.name = "json_flex_parser"
              ext.ext_dir = DIR_EXT
              ext.lib_dir = DIR_LIB
            end
            CLEAN.include('tmp')
            CLEAN << "#{DIR_LIB}/*.so"
           
            # Task for the execution of tests
            Rake::TestTask.new do |t|
              t.name = :test
              t.libs << "lib"
              t.test_files = FileList["test/**/*_test.rb"]
            end
            task :test => :compile
           
            private
           
            # Execute a system command
            def exec(command)
              puts command
              puts %x[#{command}]
              $?.success? || exit($?.exitstatus)
            end
          end

I validate that the Rakefile is correct by listing the tasks.


          json_flex_parser$ rake --tasks
          rake build                                      # Build json_flex_parser-0.1.0.gem into the pkg directory
          rake clean                                      # Remove any temporary products
          rake clobber                                    # Remove any generated files
          rake install                                    # Build and install json_flex_parser-0.1.0.gem into system gems
          rake install:local                              # Build and install json_flex_parser-0.1.0.gem into system gems without network access
          rake json_flex_parser:compile                   # Compile Flex file
          rake json_flex_parser:compile:json_flex_parser  # Compile json_flex_parser
          rake json_flex_parser:test                      # Run tests
          rake release[remote]                            # Create tag v0.1.0 and build and push json_flex_parser-0.1.0.gem to TODO: Set to 'http://mygemserver.com'

The first version of json_flex_parser.l must be modified for the integration with the Ruby code: instead of a main function that reads from standard input and prints tokens, I want to declare a new Ruby class with a parse method that accepts a Ruby string as input, submits it to the C scanner, and returns an array containing all the tokens.
Let's see how to do each of these tasks individually, and then all together in the final version of json_flex_parser.l

Declare a new Ruby class from C code


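          // Forward declaration, so parse can be registered before it is defined
          static VALUE parse(int argc, VALUE *argv, VALUE self);
           
          // Called by Ruby when the extension is loaded
          void Init_json_flex_parser(void) {
            VALUE cExpressParserModule = rb_define_module("ExpressParser");
            VALUE cExpressParserClass = rb_define_class_under(cExpressParserModule, "JSONFlexParser", rb_cObject);
            rb_define_singleton_method(cExpressParserClass, "parse", parse, -1);
          }
           
          // Implementation of ExpressParser::JSONFlexParser.parse
          static VALUE parse(int argc, VALUE *argv, VALUE self) {
            ...
          }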

These C lines create a Ruby module ExpressParser and a Ruby class JSONFlexParser with a class method parse which will be implemented in C.
It can be called from Ruby code with ExpressParser::JSONFlexParser.parse(input).
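For readers more at ease with Ruby than with its C API, the declarations above correspond roughly to the plain Ruby below. This is only a sketch for comparison: the body of parse is a placeholder, since the real implementation lives in the C extension.

```ruby
# What the three C calls declare, written in plain Ruby (illustration only).
module ExpressParser     # rb_define_module("ExpressParser")
  class JSONFlexParser   # rb_define_class_under(..., "JSONFlexParser", rb_cObject)
    # rb_define_singleton_method(..., "parse", parse, -1):
    # with -1, the C function receives all its arguments as argc/argv,
    # which is comparable to a splat parameter in Ruby.
    def self.parse(*args)
      raise NotImplementedError, "placeholder: the real body is written in C"
    end
  end
end

p ExpressParser::JSONFlexParser.method(:parse).arity # prints -1
```

Passing -1 is what later allows parse to accept an optional second argument through rb_scan_args.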

Instantiate a Ruby array from C code


          struct extra_data_type {
            VALUE tokens;
          };
           
          // Implementation of ExpressParser::JSONFlexParser.parse
          static VALUE parse(int argc, VALUE *argv, VALUE self) {
           
            // Instantiate new array
            struct extra_data_type extra_data;
            extra_data.tokens = rb_ary_new();
            ...
            json_flex_lex_init_extra(&extra_data, &scanner);
          }

The Ruby array must be available to every Flex rule, so first I declare a C structure extra_data_type for holding the array. In the parse method, a new struct is created and the Ruby array is initialized with rb_ary_new(). Finally, the struct is given to json_flex_lex_init_extra, so every rule can access the Ruby array with yyextra->tokens.

Push a string to the Ruby array


          -?{DIGIT}+   {
              VALUE rb_string = rb_str_new(yytext, yyleng);
              rb_ary_push(yyextra->tokens, rb_string);
          }

Inside a rule, the C string yytext is converted into a Ruby string with rb_str_new, which is then pushed into the array with rb_ary_push.

The full implementation of json_flex_parser.l

Finally, the main function is removed: yylex is now called from the parse method. I also add some code to feed the Ruby input to the Flex scanner as C bytes, by redefining YY_INPUT:


          %{
          #include <ruby.h>
           
          /* Utility methods to read Ruby input */
          #define YY_INPUT(buf, result, max_size) do {\
              VALUE arg[2];\
              VALUE str;\
              result = 0U;\
              arg[0] = *((VALUE *)yyin);\
              arg[1] = SIZET2NUM(max_size);\
              str = rb_rescue2(read_wrapper, (VALUE)arg, read_wrapper_rescue, Qnil, rb_eEOFError, 0);\
              if (str != Qnil) {\
                  (void)memcpy(buf, RSTRING_PTR(str), RSTRING_LEN(str));\
                  result = (size_t)RSTRING_LEN(str);\
              }\
          } while(0)
           
          static VALUE read_wrapper(VALUE arg) {
              VALUE *aarg = (VALUE *)arg;
              return rb_funcall(aarg[0], rb_intern("readpartial"), 1, aarg[1]);
          }
          static VALUE read_wrapper_rescue(VALUE arg, VALUE unknown) { return Qnil; }
           
          /* User-specific data for the rules: yyextra */
          struct extra_data_type {
            VALUE tokens;
          };
          #define YY_EXTRA_TYPE struct extra_data_type *
           
          %}
           
          /* Some Flex options */
          %option reentrant
          %option fast
          %option nounput noyywrap noinput
           
          /* Definitions that will be used in the rules */
          DELIMITERS         [,:{}\[\]]
          DIGIT              [0-9]
           
          %%
           
           /* ---  The rules (simplified JSON subset) --- */
           
          [ \t]+       { }  /* Ignore spaces and tabs */
          {DELIMITERS} { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Comma, colon... */
          \"[^"]*\"    { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Strings */
          -?{DIGIT}+   { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Integer numbers */
          .            { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Anything else */
           
          %%
           
          static VALUE parse(int argc, VALUE *argv, VALUE self);
           
          void Init_json_flex_parser(void) {
              // Declaration of the Ruby module and class
              VALUE cExpressParserModule = rb_define_module("ExpressParser");
              VALUE cExpressParserClass = rb_define_class_under(cExpressParserModule, "JSONFlexParser",
                                                                rb_cObject);
              rb_define_singleton_method(cExpressParserClass, "parse", parse, -1);
          }
           
          // Implementation of ExpressParser::JSONFlexParser.parse
          static VALUE parse(int argc, VALUE *argv, VALUE self) {
              yyscan_t scanner;
              VALUE source = Qnil;
              VALUE opts = rb_hash_new();
              VALUE input = Qnil;
              VALUE cStringIO = rb_const_get(rb_cObject, rb_intern("StringIO"));
           
              // Instantiate new Ruby array
              struct extra_data_type extra_data;
              extra_data.tokens = rb_ary_new();
           
              // Read input
              rb_scan_args(argc, argv, "11", &source, &opts);
              if (rb_respond_to(source, rb_intern("read"))) {
                  input = source;
              } else if (rb_obj_is_kind_of(source, rb_cString) == Qtrue) {
                  input = rb_funcall(cStringIO, rb_intern("new"), 1, source);
              } else {
                  rb_raise(rb_eTypeError, "no implicit conversion to String");
              }
           
              // Tokenize input
              if (json_flex_lex_init_extra(&extra_data, &scanner) == 0) {
                  json_flex_set_in((FILE*)(&input), scanner);
                  yylex(scanner);
                  json_flex_lex_destroy(scanner);
              }
           
              return extra_data.tokens;
          }
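The input handling is the least obvious part of this file, so here is a rough Ruby equivalent of what the "Read input" step and the YY_INPUT macro do. It is a sketch under assumptions, not code used by the gem: wrap_source and read_chunks are hypothetical names, and max_size is assumed to be greater than zero (in the C code, Flex provides that value).

```ruby
require "stringio"

# Mirrors the "Read input" step of parse: accept an IO-like object or a String.
def wrap_source(source)
  if source.respond_to?(:read)
    source
  elsif source.is_a?(String)
    StringIO.new(source)
  else
    raise TypeError, "no implicit conversion to String"
  end
end

# Mirrors YY_INPUT: ask for at most max_size bytes at a time.
# EOFError plays the role of "result = 0", which tells the scanner to stop.
def read_chunks(input, max_size)
  chunks = []
  loop do
    chunk = begin
      input.readpartial(max_size)
    rescue EOFError
      nil
    end
    break if chunk.nil?
    chunks << chunk
  end
  chunks
end

p read_chunks(wrap_source('{"hello": "world"}'), 8)
```

Note that the C code looks up the StringIO constant at call time, so stringio must already be loaded when parse receives a String.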

Earlier in the Rakefile, I declared a task for running the tests. It's time to write one and validate that the C extension works. In test/json_flex_parser_test.rb, I send a JSON string to ExpressParser::JSONFlexParser and check that it returns the expected list of tokens:


          require_relative 'test_helper'
          require 'json_flex_parser'
           
          class JsonFlexParserTest < Minitest::Test
           
            def test_basic_json
              input = '{"hello": "world"}'
              expected = ['{', '"hello"', ':', '"world"', '}']
              assert_equal(expected, ExpressParser::JSONFlexParser.parse(input))
            end
           
          end

Now I can run the test with rake json_flex_parser:test: first the Flex file is compiled, next the C extension is built from the generated C files, and finally the test class is executed.
Before that, I run rake clean to remove the extension built during the first section.


          json_flex_parser$ rake clean
          json_flex_parser$ rake json_flex_parser:test
          flex -8 -P json_flex_ --outfile=ext/json_flex_parser/lexer.c --header-file=ext/json_flex_parser/lexer.h src/json_flex_parser/json_flex_parser.l
           
          cd tmp/i686-linux/json_flex_parser/2.7.1
          /usr/bin/make
          compiling ../../../../ext/json_flex_parser/lexer.c
          linking shared-object json_flex_parser/json_flex_parser.so
          cd -
          install -c tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so lib/json_flex_parser/json_flex_parser.so
          cp tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so tmp/i686-linux/stage/lib/json_flex_parser/json_flex_parser.so
          Run options: --seed 52840
           
          # Running:
          Finished in 0.000717s, 2789.7310 runs/s, 2789.7310 assertions/s.
          1 runs, 1 assertions, 0 failures, 0 errors, 0 skips

Success!

What's next? & FAQ

I've created a tokenizer usable from Ruby using Flex rules. For the formatting process, the next step is to loop through the tokens to insert new lines and indentation at the correct places and recreate a human-readable version of the original input. In the next article, I show how to embed the gem into a Rails project.
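To give an idea of that next step, here is a minimal sketch of a formatter working on the token array. It is not the Format Express implementation: pretty_print and its indent option are hypothetical, and the layout rules are deliberately naive (for example, an empty object still gets a line break inside its braces).

```ruby
# Rebuild an indented text from a flat token list (illustrative sketch only).
def pretty_print(tokens, indent: "  ")
  depth = 0
  out = +""
  tokens.each do |token|
    case token
    when "{", "["
      depth += 1
      out << token << "\n" << indent * depth
    when "}", "]"
      depth = [depth - 1, 0].max # tolerate unbalanced input
      out << "\n" << indent * depth << token
    when ","
      out << token << "\n" << indent * depth
    when ":"
      out << ": "
    else
      out << token
    end
  end
  out
end

puts pretty_print(['{', '"hello"', ':', '"world"', '}'])
```

Because the method only reacts to individual tokens, it never needs the input to be valid JSON, which matches the tolerance goal discussed in the FAQ below.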

Why not use an already existing Ruby parser like JSON.parse()?
First, because these parsers usually expect valid input, whereas I want my formatter to be tolerant and able to format any input, even if it is invalid or incomplete. Also, Format Express is not limited to JSON: it also supports input that looks like JSON, such as HJSON, the dump of a JavaScript object, Groovy maps, ... So I need full control over the tokenizer.
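As a small illustration of the first point, the sketch below feeds an incomplete document to the JSON library from Ruby's standard distribution. The helper name strict_parse_outcome is hypothetical; it only reports whether JSON.parse accepted the text.

```ruby
require "json"

# Report whether the strict parser accepts the text (hypothetical helper).
def strict_parse_outcome(text)
  JSON.parse(text)
  :parsed
rescue JSON::ParserError
  :rejected
end

incomplete = '{"name": "foo", "result": {"code": 0'
p strict_parse_outcome(incomplete)
p strict_parse_outcome('{"name": "foo"}')
```

A strict parser stops at the first problem, while a tokenizer with an "anything else" rule always produces a token list that the formatter can work with.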

How to build a full JSON parser?
The rules used in this article need to be expanded to match the complete JSON syntax (my rules do not handle escaped characters in strings, floating-point numbers, booleans, ...). There are several examples on the internet.

Why don't you use Bison?
Flex is usually paired with Bison, whose role is to check the grammar of the tokens extracted by Flex (for example, validate matching braces, the pairs structure, ...). In my case, Bison is not required, because Format Express is not a validator and must be able to format even malformed input.
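For comparison, this is the kind of structural check that a grammar layer would add on top of the tokens, reduced here to brace and bracket matching. It is a hand-written Ruby sketch with hypothetical names (PAIRS, balanced?), not something Bison generates and not part of the gem.

```ruby
# Closing delimiter => expected opening delimiter
PAIRS = { "}" => "{", "]" => "[" }.freeze

# True when every closing brace or bracket matches the most recent opening one.
def balanced?(tokens)
  stack = []
  tokens.each do |token|
    if PAIRS.value?(token)
      stack.push(token)
    elsif PAIRS.key?(token)
      return false unless stack.pop == PAIRS[token]
    end
  end
  stack.empty?
end

p balanced?(['{', '"a"', ':', '[', '1', ']', '}'])
p balanced?(['{', '"a"', ':', '1'])
```

A formatter that must accept the second input has no use for rejecting it, which is why this layer is left out.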