This article explains how to integrate a Flex scanner into Ruby code. To illustrate this, I show step by step how to create a gem containing a basic JSON parser using Flex rules.
My use case: the first step for a formatter like Format Express is to split the input into significant tokens. For example, the JSON formatter must isolate the syntactic symbols ({ : , ] ...), identify JSON keys and values ("foo", 42, ...), ignore the insignificant white spaces, and so on.
I use Flex to extract the tokens, given a set of rules (regular expressions). From the sequence of tokens, the JSON can be nicely reconstructed.
This article is split into 4 sections:
- Create a gem project with C extension
- Setup a Flex file and test it
- Integrate Flex into Ruby code
- What's next? & FAQ
Prerequisite
I make sure flex is installed.
$ flex --version
flex 2.6.4
I will also be using ruby, rake, bundle and gcc in this article.
$ ruby --version; rake --version; bundle --version; gcc --version
ruby 2.7.1p83
rake, version 13.0.3
Bundler version 2.1.4
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
A note on how Flex works: the flex command is a scanner generator. That means the flex command is not used directly to parse input; it is a tool that generates the source code of a C program containing the scanner for the provided rules. To be usable from Ruby code, that C program must be compiled and wrapped inside a gem.
Create a new gem project with C extension
So I create a dedicated gem named json_flex_parser. I run the bundle gem command to create the gem project skeleton, with the --ext option to generate a gem with a C extension.
$ bundle gem json_flex_parser --ext --no-coc --mit
MIT License enabled in config
create json_flex_parser/Gemfile
create json_flex_parser/lib/json_flex_parser.rb
create json_flex_parser/lib/json_flex_parser/version.rb
create json_flex_parser/json_flex_parser.gemspec
create json_flex_parser/Rakefile
create json_flex_parser/README.md
create json_flex_parser/bin/console
create json_flex_parser/bin/setup
create json_flex_parser/.gitignore
create json_flex_parser/.travis.yml
create json_flex_parser/test/test_helper.rb
create json_flex_parser/test/json_flex_parser_test.rb
create json_flex_parser/LICENSE.txt
create json_flex_parser/ext/json_flex_parser/extconf.rb
create json_flex_parser/ext/json_flex_parser/json_flex_parser.h
create json_flex_parser/ext/json_flex_parser/json_flex_parser.c
Initializing git repo in ./json_flex_parser
Gem 'json_flex_parser' was successfully created.
For more information on making a RubyGem visit https://bundler.io/guides/creating_gem.html
A new directory json_flex_parser has been created, with many files. I'll look at some of them later in this article, but many won't be modified.
You can check the Bundler documentation for more options.
For now, I just validate that everything is fine by compiling the gem, using the rake command.
$ cd json_flex_parser
json_flex_parser$ rake compile
mkdir -p tmp/i686-linux/json_flex_parser/2.7.1
cd tmp/i686-linux/json_flex_parser/2.7.1
ruby -I. ../../../../ext/json_flex_parser/extconf.rb
creating Makefile
cd -
cd tmp/i686-linux/json_flex_parser/2.7.1
/usr/bin/make
compiling ../../../../ext/json_flex_parser/json_flex_parser.c
linking shared-object json_flex_parser/json_flex_parser.so
cd -
mkdir -p tmp/i686-linux/stage/lib/json_flex_parser
install -c tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so lib/json_flex_parser/json_flex_parser.so
cp tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so tmp/i686-linux/stage/lib/json_flex_parser/json_flex_parser.so
The bundle command generated an empty C extension in ./ext/json_flex_parser/json_flex_parser.c and the rake command just compiled it without errors. In the third section of this article, I'll replace this file with the C parser generated by Flex. But for now, let's just set up a Flex file and test it, without any Ruby integration.
Setup the Flex rules
The Flex rules are specified in a dedicated file json_flex_parser.l, which I put in a new directory src/json_flex_parser, with the following content:
%{
#include <stdlib.h>
#include <stdio.h>
%}
/* Some Flex options */
%option reentrant
%option fast
%option nounput noyywrap noinput
/* Definitions that will be used in the rules */
DELIMITERS [,:{}\[\]]
DIGIT [0-9]
%%
  /* --- The rules (simplified JSON subset); this comment is indented because Flex expects a pattern at the start of a line here --- */
[ \t]+ { } /* Ignore spaces and tabs */
{DELIMITERS} { printf("=> %s\n", yytext); } /* JSON syntax: commas, colons, braces... */
\"[^"]*\" { printf("=> %s\n", yytext); } /* Strings */
-?{DIGIT}+ { printf("=> %s\n", yytext); } /* Integer numbers */
. { printf("=> %s\n", yytext); } /* Anything else */
%%
/* The C main function that will be used to test the rules with stdin */
int main(int argc, char *argv[]) {
printf("Enter JSON\n");
yyscan_t scanner;
yylex_init ( &scanner );
yylex ( scanner );
yylex_destroy ( scanner );
return EXIT_SUCCESS;
}
Each rule (the five lines between the two %% markers) is composed of a regular expression that identifies a token and a C block that deals with the token. For brevity, I declared only 5 rules. That is not enough for a complete JSON parser, but it is sufficient for this article. More complete sets of rules for JSON are discussed in the FAQ at the end of the article.
Quick look at the rules:
- the first rule identifies spaces and tabs and does nothing with them (they're just discarded).
- the next 3 rules identify lexical JSON tokens (delimiters, strings, numbers) and print each token on a separate line.
- the last rule handles "anything else" (because the input may not be a valid JSON), and also prints it.
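To experiment with rules like these before running flex, they can be prototyped in plain Ruby. The sketch below is only an illustration and not part of the gem: RULES and tokenize are hypothetical names, and StringScanner takes the first rule that matches, whereas Flex picks the longest match (for these five rules the result is the same). Flex's . does not match a newline, so the sketch skips newlines explicitly.

```ruby
require "strscan"

# Pure-Ruby prototype of the five Flex rules (illustration only, hypothetical names).
RULES = [
  [/[ \t]+/,     :skip], # spaces and tabs are discarded
  [/[,:{}\[\]]/, :emit], # JSON delimiters
  [/"[^"]*"/,    :emit], # strings (no escape handling, like the Flex rule)
  [/-?[0-9]+/,   :emit], # integer numbers
  [/./,          :emit]  # anything else, one character at a time (not newlines)
].freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # Try the rules in order and stop at the first one matching at the current position.
    rule = RULES.find { |pattern, _action| scanner.scan(pattern) }
    if rule.nil?
      scanner.getch # no rule matched (a newline): skip one character
    elsif rule[1] == :emit
      tokens << scanner.matched
    end
  end
  tokens
end

p tokenize('{"name":"foo", "result":{"code":0} }')
```

Because the patterns are ordinary regular expressions, a rule that behaves well here can usually be carried over to the Flex file with only syntax adjustments.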
After the second %% marker comes the declaration of a main function, so I can execute the parser standalone: the parser reads the standard input and applies the rules to it. There is no Ruby integration yet; that is done in the next section.
But first, I want to validate the Flex file. I run the flex command to turn it into a C program. The C source will be generated into a temporary tmp directory that I create first.
src/json_flex_parser$ mkdir tmp
src/json_flex_parser$ flex --outfile=tmp/lexer.c --header-file=tmp/lexer.h json_flex_parser.l
I check the generated C files in the tmp directory, and compile lexer.c with gcc.
src/json_flex_parser$ cd tmp
src/json_flex_parser/tmp$ ls
lexer.c lexer.h
src/json_flex_parser/tmp$ gcc -o lexer.out lexer.c
src/json_flex_parser/tmp$ ls
lexer.c lexer.h lexer.out
So far, so good. Now let's try it! I execute the program and give it an example JSON like
{"name":"foo", "result":{"code":0} }
src/json_flex_parser/tmp$ ./lexer.out
Enter JSON
{"name":"foo", "result":{"code":0} }
=> {
=> "name"
=> :
=> "foo"
=> ,
=> "result"
=> :
=> {
=> "code"
=> :
=> 0
=> }
=> }
^C
As expected, the JSON has been split into significant tokens (braces, strings, numbers), the spaces have been discarded and each token is printed on a separate line.
The basic Flex rules are working as expected, so now it's time to integrate this Flex file into a Ruby library. The directory tmp can be deleted.
Integrate Flex into Ruby code
First, I update the Rakefile to declare 3 tasks:
- one for the compilation of the Flex file, to generate the C files into the ext/json_flex_parser directory
- one for the compilation of those C files, to build the C extension lib/json_flex_parser/json_flex_parser.so
- one final task for the execution of tests.
require "bundler/gem_tasks"
require "rake/testtask"
require "rake/extensiontask"
DIR_SRC = "src/json_flex_parser"
DIR_EXT = "ext/json_flex_parser"
DIR_LIB = "lib/json_flex_parser"
task :default => 'json_flex_parser:compile'
namespace 'json_flex_parser' do
# Task for the compilation of Flex file into C files
desc 'Compile Flex file'
task :flex do
exec "flex -8 -P json_flex_ --outfile=#{DIR_EXT}/lexer.c --header-file=#{DIR_EXT}/lexer.h #{DIR_SRC}/json_flex_parser.l"
end
CLEAN.include("#{DIR_EXT}/*.c")
CLEAN.include("#{DIR_EXT}/*.h")
task :compile => :flex
# Task for the compilation of the C extension
Rake::ExtensionTask.new do |ext|
ext.name = "json_flex_parser"
ext.ext_dir = DIR_EXT
ext.lib_dir = DIR_LIB
end
CLEAN.include('tmp')
CLEAN << "#{DIR_LIB}/*.so"
# Task for the execution of tests
Rake::TestTask.new do |t|
t.name = :test
t.libs << "lib"
t.test_files = FileList["test/**/*_test.rb"]
end
task :test => :compile
private
# Execute a system command
def exec(command)
puts command
puts %x[#{command}]
$?.success? || exit($?.exitstatus)
end
end
I validate that the Rakefile is correct by listing the tasks.
json_flex_parser$ rake --tasks
rake build # Build json_flex_parser-0.1.0.gem into the pkg directory
rake clean # Remove any temporary products
rake clobber # Remove any generated files
rake install # Build and install json_flex_parser-0.1.0.gem into system gems
rake install:local # Build and install json_flex_parser-0.1.0.gem into system gems without network access
rake json_flex_parser:compile # Compile Flex file
rake json_flex_parser:compile:json_flex_parser # Compile json_flex_parser
rake json_flex_parser:test # Run tests
rake release[remote] # Create tag v0.1.0 and build and push json_flex_parser-0.1.0.gem to TODO: Set to 'http://mygemserver.com'
The first version of json_flex_parser.l must be modified for the integration with the Ruby code: instead of the main function that reads from standard input and prints tokens, I want to declare a new Ruby class with a parse method that accepts a Ruby string as input, submits it to the C parser, and returns an array containing all the tokens.
Let's see how to do each of these tasks individually, and then all together in the final version of json_flex_parser.l
Declare a new Ruby class from C code
VALUE cExpressParserModule = rb_define_module("ExpressParser");
VALUE cExpressParserClass = rb_define_class_under(cExpressParserModule, "JSONFlexParser", rb_cObject);
rb_define_singleton_method(cExpressParserClass, "parse", parse, -1);
// Implementation of ExpressParser::JSONFlexParser#parse
static VALUE parse(int argc, VALUE *argv, VALUE self) {
...
}
These C lines create a Ruby module ExpressParser and a Ruby class JSONFlexParser with a class method parse, which will be implemented in C.
It can be called from Ruby code with ExpressParser::JSONFlexParser.parse(input).
Instantiate a Ruby array from C code
struct extra_data_type {
VALUE tokens;
};
// Implementation of ExpressParser::JSONFlexParser#parse
static VALUE parse(int argc, VALUE *argv, VALUE self) {
// Instantiate new array
struct extra_data_type extra_data;
extra_data.tokens = rb_ary_new();
...
json_flex_lex_init_extra(&extra_data, &scanner);
}
The Ruby array must be available to every Flex rule, so first I declare a C structure extra_data_type for holding the array. In the parse method, a new struct is created and the Ruby array is initialized with rb_ary_new(). Finally, the struct is given to json_flex_lex_init_extra, so every rule can access the Ruby array with yyextra->tokens.
Push a string to the Ruby array
-?{DIGIT}+ {
VALUE rb_string = rb_str_new(yytext, yyleng);
rb_ary_push(yyextra->tokens, rb_string);
}
Inside a rule, the C string yytext is converted into a Ruby string with rb_str_new, which is then pushed into the array with rb_ary_push.
The full implementation of json_flex_parser.l
Finally, the main function is removed: yylex is now called from the parse method. I also add some code that feeds the Ruby input to the Flex scanner chunk by chunk, by redefining YY_INPUT:
%{
#include <ruby.h>
/* Utility methods to read Ruby input */
#define YY_INPUT(buf, result, max_size) do {\
VALUE arg[2];\
VALUE str;\
result = 0U;\
arg[0] = *((VALUE *)yyin);\
arg[1] = SIZET2NUM(max_size);\
str = rb_rescue2(read_wrapper, (VALUE)arg, read_wrapper_rescue, Qnil, rb_eEOFError, 0);\
if (str != Qnil) {\
(void)memcpy(buf, RSTRING_PTR(str), RSTRING_LEN(str));\
result = (size_t)RSTRING_LEN(str);\
}\
} while(0)
static VALUE read_wrapper(VALUE arg) {
VALUE *aarg = (VALUE *)arg;
return rb_funcall(aarg[0], rb_intern("readpartial"), 1, aarg[1]);
}
static VALUE read_wrapper_rescue(VALUE arg, VALUE unknown) { return Qnil; }
/* User-specific data for the rules: yyextra */
struct extra_data_type {
VALUE tokens;
};
#define YY_EXTRA_TYPE struct extra_data_type *
%}
/* Some Flex options */
%option reentrant
%option fast
%option nounput noyywrap noinput
/* Definitions that will be used in the rules */
DELIMITERS [,:{}\[\]]
DIGIT [0-9]
%%
  /* --- The rules (simplified JSON subset); this comment is indented because Flex expects a pattern at the start of a line here --- */
[ \t]+ { } /* Ignore spaces and tabs */
{DELIMITERS} { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Comma, colon... */
\"[^"]*\" { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Strings */
-?{DIGIT}+ { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Integer numbers */
. { rb_ary_push(yyextra->tokens, rb_str_new(yytext, yyleng)); } /* Anything else */
%%
static VALUE parse(int argc, VALUE *argv, VALUE self);
void Init_json_flex_parser(void) {
// Declaration of the Ruby module and class
VALUE cExpressParserModule = rb_define_module("ExpressParser");
VALUE cExpressParserClass = rb_define_class_under(cExpressParserModule, "JSONFlexParser",
rb_cObject);
rb_define_singleton_method(cExpressParserClass, "parse", parse, -1);
}
// Implementation of ExpressParser::JSONFlexParser#parse
static VALUE parse(int argc, VALUE *argv, VALUE self) {
yyscan_t scanner;
VALUE source = Qnil;
VALUE opts = rb_hash_new();
VALUE input = Qnil;
VALUE cStringIO = rb_const_get(rb_cObject, rb_intern("StringIO"));
// Instantiate new Ruby array
struct extra_data_type extra_data;
extra_data.tokens = rb_ary_new();
// Read input
rb_scan_args(argc, argv, "11", &source, &opts);
if (rb_respond_to(source, rb_intern("read"))) {
input = source;
} else if (rb_obj_is_kind_of(source, rb_cString) == Qtrue) {
input = rb_funcall(cStringIO, rb_intern("new"), 1, source);
} else {
rb_raise(rb_eTypeError, "no implicit conversion to String");
}
// Tokenize input
if (json_flex_lex_init_extra(&extra_data, &scanner) == 0) {
json_flex_set_in((FILE*)(&input), scanner);
yylex(scanner);
json_flex_lex_destroy(scanner);
}
return extra_data.tokens;
}
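The input handling in this listing may be easier to follow when written in Ruby. The sketch below mirrors, under my reading of the C code, the two steps involved: parse keeps any object that responds to read and wraps a String in a StringIO, and YY_INPUT then pulls chunks with readpartial, treating EOFError as the end of input. The names to_input and read_chunk are hypothetical and do not exist in the gem.

```ruby
require "stringio"

# What YY_INPUT does on each buffer refill (illustration only, hypothetical name).
# Returning nil at end of input corresponds to the macro leaving result at 0.
def read_chunk(io, max_size)
  io.readpartial(max_size)
rescue EOFError
  nil
end

# Normalise the argument the way the C parse function does:
# keep readable objects, wrap strings in a StringIO, reject everything else.
def to_input(source)
  return source if source.respond_to?(:read)
  return StringIO.new(source) if source.is_a?(String)

  raise TypeError, "no implicit conversion to String"
end

input = to_input('{"hello": "world"}')
chunks = []
while (chunk = read_chunk(input, 8)) # 8 is an arbitrary small size for the demo
  chunks << chunk
end
p chunks
```

Reading through readpartial means the scanner never needs the whole input as one C string, which is why the same parse method can accept either a String or an IO-like object.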
Earlier in the Rakefile, I declared a task for a test. It's time to implement it and validate that the C extension is working.
In test/json_flex_parser_test.rb, I send a JSON string to ExpressParser::JSONFlexParser and check that it returns the expected list of tokens:
require_relative 'test_helper'
require 'json_flex_parser'
class JsonFlexParserTest < Minitest::Test
def test_basic_json
input = '{"hello": "world"}'
expected = ['{', '"hello"', ':', '"world"', '}']
assert_equal(expected, ExpressParser::JSONFlexParser.parse(input))
end
end
Now I can run the test with rake json_flex_parser:test: first the Flex file is compiled, next the C extension is built from the generated C files, and finally the test class is executed.
Before that, I run rake clean to remove the extension built during the first section.
json_flex_parser$ rake clean
json_flex_parser$ rake json_flex_parser:test
flex -8 -P json_flex_ --outfile=ext/json_flex_parser/lexer.c --header-file=ext/json_flex_parser/lexer.h src/json_flex_parser/json_flex_parser.l
cd tmp/i686-linux/json_flex_parser/2.7.1
/usr/bin/make
compiling ../../../../ext/json_flex_parser/lexer.c
linking shared-object json_flex_parser/json_flex_parser.so
cd -
install -c tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so lib/json_flex_parser/json_flex_parser.so
cp tmp/i686-linux/json_flex_parser/2.7.1/json_flex_parser.so tmp/i686-linux/stage/lib/json_flex_parser/json_flex_parser.so
Run options: --seed 52840
# Running:
Finished in 0.000717s, 2789.7310 runs/s, 2789.7310 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips
Success!
What's next? & FAQ
I've created a tokenizer usable from Ruby using Flex rules. For the formatting process, the next step is to loop through the tokens to insert new lines and indentation at the correct places and recreate a human-readable version of the original input. In the next article, I show how to embed the gem into a Rails project.
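As a rough idea of that next step, here is a minimal sketch of a pretty-printer that walks a token array like the one returned by the tokenizer. It is not the algorithm used by Format Express: pretty_print, INDENT, OPENERS and CLOSERS are hypothetical names, and the layout choices (two-space indent, empty containers kept on one line) are assumptions made for the example.

```ruby
# Minimal pretty-printer over a token array (a sketch with hypothetical names).
INDENT  = "  "
OPENERS = ["{", "["].freeze
CLOSERS = ["}", "]"].freeze

def pretty_print(tokens)
  out = +""
  depth = 0
  previous = nil
  tokens.each_with_index do |token, index|
    if OPENERS.include?(token)
      depth += 1
      out << token
      # Break the line after an opener, unless the container is empty.
      out << "\n" << INDENT * depth unless CLOSERS.include?(tokens[index + 1])
    elsif CLOSERS.include?(token)
      depth = [depth - 1, 0].max # never go negative on unbalanced input
      out << "\n" << INDENT * depth unless OPENERS.include?(previous)
      out << token
    elsif token == ","
      out << ",\n" << INDENT * depth
    elsif token == ":"
      out << ": "
    else
      out << token
    end
    previous = token
  end
  out
end

puts pretty_print(['{', '"name"', ':', '"foo"', ',', '"result"', ':', '{', '"code"', ':', '0', '}', '}'])
```

The loop never rejects a token, so malformed input still produces some output, which matches the goal of a tolerant formatter.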
Why not use an already existing Ruby parser like JSON.parse()?
First, because such parsers usually expect valid input, whereas I want my formatter to be tolerant and able to format any input, even if it is invalid or incomplete.
Also, Format Express is not limited to JSON, but also supports input that looks like JSON, like HJSON,
the dump of a JavaScript object, Groovy maps, ... So I need full control over the tokenizer.
How to build a full JSON parser?
The rules used in this article need to be expanded to match the complete JSON syntax (my rules do not handle escaped characters in strings, float numbers, booleans, ...).
There are several examples on the internet.
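To give an idea of what the expanded rules have to cover, here is a sketch of token patterns that follow the JSON grammar more closely (escapes in strings, fractions and exponents in numbers, the three literals). They are written as Ruby regular expressions so they can be tried quickly. The constant and method names are hypothetical, and the patterns would still need to be translated into Flex syntax.

```ruby
# Token patterns closer to the JSON grammar (a sketch with hypothetical names).
# Strings: plain characters, simple escapes, or \u followed by 4 hex digits.
JSON_STRING  = /"(?:[^"\\\x00-\x1F]|\\["\\\/bfnrt]|\\u\h{4})*"/
# Numbers: optional minus, integer part without leading zeros, optional fraction and exponent.
JSON_NUMBER  = /-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?/
JSON_LITERAL = /true|false|null/

# True when the whole text is exactly one token of the given kind.
def full_match?(pattern, text)
  /\A(?:#{pattern})\z/.match?(text)
end

p full_match?(JSON_STRING, '"a\"b"')
p full_match?(JSON_NUMBER, "-12.5e+3")
p full_match?(JSON_NUMBER, "01")
p full_match?(JSON_LITERAL, "null")
```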
Why don't you use Bison?
Flex is usually paired with Bison, whose role is to check the grammar of the tokens extracted by Flex
(for example, validate matching braces, the pairs structure, ...). In my case, Bison is not required, because Format Express is not a validator
and must be able to format even malformed input.
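To make the distinction concrete, the sketch below shows the kind of structural check that belongs to a grammar layer rather than to the tokenizer: verifying that braces and brackets match. It is a hand-written illustration in Ruby, not what Bison would generate, and balanced? and PAIRS are hypothetical names.

```ruby
# Structural check over a token array (illustration only, hypothetical names).
PAIRS = { "}" => "{", "]" => "[" }.freeze

def balanced?(tokens)
  stack = []
  tokens.each do |token|
    if PAIRS.value?(token)
      stack.push(token) # remember every opener
    elsif PAIRS.key?(token)
      # A closer must match the most recent opener.
      return false unless stack.pop == PAIRS[token]
    end
  end
  stack.empty? # leftover openers mean the input is incomplete
end

p balanced?(['{', '"a"', ':', '[', '1', ']', '}'])
p balanced?(['{', '"a"', ':', '[', '1', '}'])
```

A formatter can skip this check entirely and still indent the tokens it receives, which is the reason a grammar tool is not needed here.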