Allowing CMake functions to return(value)
Introduction
This is the story of implementing a CMake feature that I call command reference (similar to the existing variable reference), i.e., using the result of a command invocation as an argument. I had this idea for a long time but never had enough time to dig into it. Now, being unemployed, I decided to at least try it before looking for my next job. It was not as easy as I expected, but I’m pretty satisfied with the result.
This article consists of two parts:
- The first part contains motivation, design and results.
- The second part explains some implementation details, such as why a new lexer and parser are needed.
Motivation
For most of my career I used Visual Studio, and when I switched to Linux I was slightly shocked. Compared to MSVS, makefiles felt like bows and arrows against a machine gun. Then I discovered CMake and it felt much better: instead of cryptic makefiles we got a distinct language with commands and variables. And since it’s just another language, the same rules apply to its code: meaningful names, small functions, separation of abstractions, etc. Unfortunately, many CMake files look like one very big function that mixes everything together. CMake allows us to handle almost all of those things right except one: it doesn’t have return values, which limits the usefulness of functions as an abstraction. As a result, some parts of your CMakeLists look bad.
Let’s look at some examples.
if(${CMAKE_CURRENT_LIST_DIR} STREQUAL ${CMAKE_SOURCE_DIR}) # is top level list?
if(WIN32) # surprisingly short name compared to other CMake vars
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
if((CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
OR (CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang"))
if(CMAKE_SIZEOF_VOID_P EQUAL 8) # my favorite one, x64 check
The problem with such code is that it doesn’t express logic, only implementation details. It requires you to remember all those long and tricky variable names, magic values and the relations between them.
We can move that into a function, but it doesn’t solve the problem. We are lazy, and nobody wants to write two lines instead of one:
check_is_top_level_list(is_top_level_list)
if(is_top_level_list)...
# is it better than one-liner?
if(${CMAKE_CURRENT_LIST_DIR} STREQUAL ${CMAKE_SOURCE_DIR})
Almost all feature tests become:
check_feature_available(RESULT is_feature_available)
if(is_feature_available)
It’s better than manipulating values directly, but it’s ugly. Now you need to check the documentation for the name of this output argument; intuitive candidates are RESULT, RESULT_VAR, OUTPUT, or no such argument at all: check_feature_available(is_feature_available) (or check_feature_available() with FATAL_ERROR).
It also forces you to think up names for variables, many of which are used only once. In any popular language we can write all of the above in a clear manner:
if(IsTopLevelProject()){}
if(IsWindowsBuild()){}
if(IsClang() || IsGcc()){}
if(IsX64Build()){}
if(IsFeatureAvailable()){}
Why should I know how those checks are performed?
Again, you can handle all that stuff, CMake has been successfully used for years. But why not make it simpler? CMake is a big part of C++ so why shouldn’t we make it easier for new people as we do with C++ itself?
Design
Why f(h()) isn’t possible
The initial idea was to allow something like get_name_by_id(get_id()), but it quickly turned out to be wrong for two reasons:
- CMake syntax is too simple: it doesn’t have keywords, command names are not restricted, and anything is a string (including parens). For example, the expression if(x AND (y OR z)) means the call if_impl("x", "AND", "(", "y", "OR", "z", ")"); where AND, OR and the parens are just plain strings that are handled in a specific way by if_impl. The only requirement here is that parens should match, e.g. if(x AND (y)))) isn’t allowed. Because of that, this is ambiguous:
function(AND a b c)
endfunction()
if(x AND (y OR z)) # if_impl(x, AND, (, y, OR, z, )) or
                   # if_impl(x, AND_impl(y, OR, z)) ?
You can extend this case to named arguments which, unlike bool operations, can have non-trivial names.
- But the main reason is that the above form isn’t flexible enough. How would you use it within a quoted argument or mix it with plain strings?
function(h)
endfunction()
f(a_h() "b h()_c") # not possible
Meet the command reference
The syntax mimics variable reference: ${command_name( args... )} (notice that there are no spaces before the command name or after the final paren). It works just like you expect:
function(get_name)
  return("Alex")
endfunction()
message(${get_name()}) # prints "Alex"
More generally:
function(f)
  return(return-value-expr)
endfunction()
use_f(${f()})
# is equal to
set(__ret_var_name return-value-expr)
use_f(${__ret_var_name})
It can be used wherever a variable reference can. Comments, nested calls and lists are also allowed. Let’s mix it all together:
function(format_name first last)
  return("First: ${first}, last: ${last}")
endfunction()
function(get_first_name)
  return("John") # return quoted
endfunction()
function(get_last_name)
  return(Doe) # return unquoted
endfunction()
function(get_first_and_last)
  return([[John]] Doe) # return list
endfunction()
message(
  ${format_name( # pass separate args
    ${get_first_name()} # comments
    ${get_last_name()} #[[ inside
    command
    reference ]]
  )}
) # First: John, last: Doe
message(
  ${format_name( # pass as a list, expands in two arguments
    ${get_first_and_last()}
  )}
) # First: John, last: Doe
# return() becomes a function that returns its arguments
message(${CMAKE_${return("VERSION")}}) # 3.18.1-...
Downloads
The current implementation is based on the CMake 3.18.1 release. You can build it from sources or use pre-built binaries:
Warning
Some CMake features or policies, especially those related to syntax or variable expansion, might not work. One such policy I’m aware of is the OLD behavior of CMP0053. Syntax-related error messages are also slightly different. Everything else should work: I’ve successfully built Google Test, Google Benchmark and fmt with it. I’m not a CMake developer and the integration is quite dirty in some places, so don’t expect it to be production ready right away.
Part 2. Implementation details
At the beginning I naively supposed that since CMake can already parse a single command invocation, it would be enough to call that function recursively on every argument :) But it turned out to be way more complex and required a completely new lexer and parser. I’ve created them using Flex and Bison; you can find a separate project that does parsing and pseudo-evaluation here.
Existing CMake parser and why it’s not enough
The current implementation is relatively simple (though its code is not). It consists of a Flex-based scanner and a hand-written parser. The scanner detects separate arguments and their kinds; that’s easy since we know how each argument starts and ends. The current parser mostly verifies basic syntax rules like valid separation, paren matching, etc. For example, command(a "${b}") is parsed as call("command").with_args(unquoted_arg{"a"}, quoted_arg{"${b}"}). Notice that the variable reference ${b} is passed as plain text.
During command execution each argument is parsed again with another parser that can detect, verify and evaluate variable references. If such a command appears in a loop, this additional parsing is done on every iteration.
Also, if you make a mistake inside a reference, it won’t be detected until the expression is evaluated:
if(${ALWAYS_TRUE_IN_YOUR_ENV})
  # no errors or warnings on your machine
  message("hello world")
else()
  # syntax error at run-time on another machine
  message(${@:-:@})
endif()
Now, when we allow another command to appear inside an argument, argument separation is not so easy:
command("result: ${get_result("a" b)}")
You can see that the highlighter marks “a” in black because it thinks that the arguments are "result: ${get_result(", a, and " b)}". The existing CMake parser sees it the same way.
To separate arguments correctly we have to be able to parse recursively when we meet a command reference.
In terms of BNF, the existing syntax looks like this (simplified):
command_invocation ::= identifier '(' argument* ')'
With command reference we get:
command_invocation ::= identifier '(' (argument | command_invocation)* ')'
The only difference is that a command reference might appear inside an argument, not only as a separate one.
As you can see, we now need to parse much deeper than the existing parser does, so there’s no sense in trying to extend it. Writing a parser for recursive rules by hand is not trivial either, so I had no choice but to write both the scanner and the parser from scratch. Flex and Bison were chosen because they’re already used in CMake.
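To make the recursion concrete, here is a minimal Python sketch of argument separation that descends into a command reference whenever it meets one. It is not the Flex/Bison implementation: the function names are hypothetical, unbalanced plain parens are ignored, and bracket arguments and comments are not handled.

```python
import re

# "${name(" opens a command reference (assumed identifier syntax).
CMD_REF_OPEN = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\(")

def skip_command_ref(text, i):
    """Return the index just past the ')}' that closes the reference at i."""
    j = CMD_REF_OPEN.match(text, i).end()
    while j < len(text):
        if text.startswith(")}", j):
            return j + 2
        if text[j] == '"':
            j = skip_quoted(text, j)          # quotes nest inside a reference
        elif CMD_REF_OPEN.match(text, j):
            j = skip_command_ref(text, j)     # references nest too
        else:
            j += 1
    raise ValueError("unterminated command reference")

def skip_quoted(text, i):
    """Return the index just past the closing quote of the argument at i."""
    j = i + 1
    while j < len(text):
        if text[j] == "\\":
            j += 2                            # skip an escaped character
        elif text[j] == '"':
            return j + 1
        elif CMD_REF_OPEN.match(text, j):
            j = skip_command_ref(text, j)
        else:
            j += 1
    raise ValueError("unterminated quoted argument")

def skip_unquoted(text, i):
    """Return the index just past the unquoted argument starting at i."""
    j = i
    while j < len(text) and not text[j].isspace():
        if CMD_REF_OPEN.match(text, j):
            j = skip_command_ref(text, j)
        else:
            j += 1
    return j

def split_arguments(text):
    """Split the text between a command's parens into raw arguments."""
    args, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        j = skip_quoted(text, i) if text[i] == '"' else skip_unquoted(text, i)
        args.append(text[i:j])
        i = j
    return args

print(split_arguments('"result: ${get_result("a" b)}"'))
```

With this sketch the quoted argument containing get_result("a" b) stays in one piece instead of being cut at the inner quotes.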
BNF for a new syntax
Let’s slightly update the official BNF according to the new syntax:
command_invocation ::= identifier space* '(' arguments ')'
quoted_argument ::= '"' (quoted_element | reference)* '"'
unquoted_argument ::= (unquoted_element | reference)+
reference ::= var_reference | command_reference
var_reference ::= var_ref_open (variable_name | reference)* ref_close
command_reference ::= cmd_ref_open command_invocation ref_close
var_ref_open ::= "${" | "$ENV{" | "$CACHE{"
cmd_ref_open ::= "${"
ref_close ::= "}"
quoted_element ::= <check official docs>
unquoted_element ::= <check official docs>
variable_name ::= <check official docs>
Unlike the existing implementation, I want to avoid parsing during execution and get all details in one pass. Now, each quoted/unquoted argument consists of strings (quoted/unquoted_element+) and references. To get its real value at run-time we need to evaluate and concatenate all its parts. For example, a_${b}_c has 3 elements: string("a_"), var_ref("b"), string("_c"). At run-time we get the value of b and concatenate them together: a_B_VALUE_c.
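The concatenation step can be sketched in a few lines of Python. The Str and VarRef classes and the evaluate_argument function are hypothetical names used only for illustration; the real implementation is in C++ and also handles command references.

```python
from dataclasses import dataclass

@dataclass
class Str:
    text: str          # literal piece of the argument

@dataclass
class VarRef:
    name: str          # name of the referenced variable

def evaluate_argument(parts, variables):
    """Concatenate literal strings and looked-up variable values."""
    out = []
    for part in parts:
        if isinstance(part, Str):
            out.append(part.text)
        else:
            # An unset variable evaluates to an empty string in CMake.
            out.append(variables.get(part.name, ""))
    return "".join(out)

# a_${b}_c, already split into parts at parse time
parts = [Str("a_"), VarRef("b"), Str("_c")]
print(evaluate_argument(parts, {"b": "B_VALUE"}))  # a_B_VALUE_c
```

Because the parts are produced once by the parser, evaluating the same argument inside a loop only repeats the lookup and the join, not the parsing.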
Expression representation and evaluation
Here’s a brief overview of the key expressions:
- call expression is a list of arguments.
- quoted/unquoted argument expression is a list of strings and references.
- variable reference expression is a list of strings and references.
- command reference expression is similar to call expression.
Now we need a good representation that can store and evaluate such expressions efficiently.
AST
The first approach was to use the classic Interpreter pattern and compose expressions into a tree. Since each expression is list-like, we can represent them all as a std::vector<std::unique_ptr<IExpression>>.
It works, but even a simple command becomes quite involved; command(a b) is represented roughly with
vector{ // vector of arguments
  "command", // command name
  vector{ // each argument is a vector itself
    "a"
  },
  vector{
    "b"
  }
}
Things get worse when we add a reference, command(a ${b}_c):
vector{
  "command",
  vector{
    "a"
  },
  vector{
    vector{ // reference is also a vector
      "b"
    },
    "_c"
  }
}
Too many vectors %)
RPN
Reverse Polish (or postfix) Notation is a notation in which arguments come before the operator. It shines when you need to represent a “linear” expression without branches, and it doesn’t need parens to express precedence:
Normal(infix) notation: a + b
RPN: a b +
Normal: (a + b) * c
RPN: a b + c *
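For reference, classical RPN evaluation with a single stack looks roughly like this in Python. It is a generic textbook sketch limited to integer operands and the binary + and * operators, not code from the CMake patch.

```python
def eval_rpn(tokens):
    """Evaluate a postfix expression with binary + and * over integers."""
    stack = []
    for tok in tokens:
        if tok in ("+", "*"):
            rhs = stack.pop()      # operands come off in reverse order
            lhs = stack.pop()
            stack.append(lhs + rhs if tok == "+" else lhs * rhs)
        else:
            stack.append(int(tok))
    return stack.pop()

# (1 + 2) * 3 written as "a b + c *"
print(eval_rpn("1 2 + 3 *".split()))  # 9
```

Every operator here has a fixed arity of 2, which is exactly the assumption that list expansion in CMake breaks.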
Now command(a ${b}_c) is represented with:
vector<IExpression>{
  StringExpr{"command"}, // command name
  StringExpr{"a"}, UnquotedArg{1}, // 1 means number of subexpressions to concat
  StringExpr{"b"}, VarRefExpr{1}, // same for VarRefExpr
  StringExpr{"_c"}, UnquotedArg{2}, // the reference and "_c" are concatenated
  CallExpr{3} // 3 means number of arguments including name
}
One vector instead of four with the AST approach, regardless of how complex the expression is. Win :)
It also nicely fits a bottom-up parser like the one Bison generates, because of the order in which symbols are discovered. In the example above, Bison will discover symbols exactly in their order in that vector, so you can just push expressions without any knowledge of previous symbols or other context.
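The order in which a bottom-up parser completes its rules is a post-order walk of the expression tree, so flattening a nested tree into RPN can be sketched as below. The tuple encoding and the flatten function are assumptions made for this illustration; in the real parser the items would be pushed from Bison actions instead.

```python
def flatten(node, out):
    """Post-order walk: emit children first, then the node with its arity."""
    if isinstance(node, str):
        out.append(("str", node))
        return out
    kind, children = node
    for child in children:
        flatten(child, out)
    out.append((kind, len(children)))
    return out

# command(a ${b}_c) as a nested tree
tree = ("call", [
    "command",
    ("unquoted", ["a"]),
    ("unquoted", [("var_ref", ["b"]), "_c"]),
])

rpn = flatten(tree, [])
for item in rpn:
    print(item)
```

The resulting flat list is str command, str a, unquoted 1, str b, var_ref 1, str _c, unquoted 2, call 3: each operator follows its operands and carries the number of subexpressions it consumes.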
Evaluation
RPN is evaluated using a stack. Each expression knows its arity (number of arguments); it pops them from the stack and pushes back the result. But there’s a little problem here. CMake expands list strings into multiple arguments:
set(my_list a;b;c)
command(${my_list}) # called with 3 args: a, b, c
It means that if our CallExpr has arity = 1, at run-time it might become any number, including zero. Classical RPN evaluation doesn’t work here. To overcome this we need to adjust the definition of arity: now arity means the number of expressions whose results should be taken as arguments. And we need an additional stack to track these result counts. Consider the RPN representation of the above example:
{
StringExpr{"command"},
StringExpr{"my_list"}, VarRefExpr{1}, UnquotedArgExpr{1},
CallExpr{2}
}
Take a look at both stacks before CallExpr evaluation for two cases:
- my_list expands into 3 arguments
results: {"command", "a", "b", "c"} results_count: {1, 3}
CallExpr arity is 2, thus the actual arity is the sum of the last two elements in the results_count stack, and that will be the final number of its arguments: 1 + 3 = 4.
- my_list expands into 0 arguments
results: {"command"} results_count: {1, 0}
Here, the actual arity is 1 + 0 = 1.
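The two-stack scheme can be sketched in Python as follows. The tuple encoding of RPN items and the names pop_n and evaluate are hypothetical; list splitting is simplified to a plain split on ';' that drops empty elements, and the call is returned as a (name, arguments) pair rather than executed.

```python
def pop_n(stack, n):
    """Remove and return the last n items of stack, preserving their order."""
    if n == 0:
        return []              # stack[-0:] would be the whole list
    items = stack[-n:]
    del stack[-n:]
    return items

def evaluate(rpn, variables):
    results, counts = [], []   # values, and how many values each expression produced
    for kind, value in rpn:
        if kind == "str":
            results.append(value)
            counts.append(1)
            continue
        # 'value' is the arity: the number of preceding expressions to consume.
        total = sum(pop_n(counts, value))
        operands = pop_n(results, total)
        if kind == "var_ref":
            results.append(variables.get("".join(operands), ""))
            counts.append(1)
        elif kind == "unquoted":
            # An unquoted argument expands into zero or more list elements.
            elements = [e for e in "".join(operands).split(";") if e]
            results.extend(elements)
            counts.append(len(elements))
        elif kind == "call":
            return operands[0], operands[1:]

rpn = [
    ("str", "command"),
    ("str", "my_list"), ("var_ref", 1), ("unquoted", 1),
    ("call", 2),
]
print(evaluate(rpn, {"my_list": "a;b;c"}))  # ('command', ['a', 'b', 'c'])
print(evaluate(rpn, {"my_list": ""}))       # ('command', [])
```

In the first run the counts stack holds 1 and 3 just before the call, so the call consumes four results; in the second it holds 1 and 0, so only the command name is consumed.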
Other small benefits of this implementation
Easy to change
Writing syntax rules in Bison makes them much easier to change, understand, review and support than a hand-written parser.
Symbol locations
Bison makes symbol locations tracking almost automatic. With simple action you only need to track lines manually.
Error messages
Bison’s out-of-the-box error messages are pretty good:
f(${@}) # 1.5 : syntax error, unexpected invalid token, expecting command name or
# reference opening or reference closing or variable name
BOM and line breaks handling
CMake supports a BOM header, but only UTF-8 is allowed. Instead of reading it by hand we can handle it easily with another rule in the parser.
CMake converts all \r\n into \n during file reading by replacing Flex’s input routine. Honestly, I can’t fully understand that code. Supposedly it just replaces \r\n with \n and memcpy()s the rest; I want something better. In many places we can just use \r?\n regexp endings in scanner rules. In theory it’s possible that a string literal contains \r\r\n, which should become \r\n (I’m talking about raw bytes 0x0D 0x0A, not escapes). To handle this I remove the trailing \r (if any) on the fly in the scanner when \n is met in a string literal. Since rules are written to take input line by line, it doesn’t involve much overhead. These simple solutions allow eliminating custom reading routines and tons of memcpy() calls.
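The rule for string literals (drop one trailing \r when \n is met) can be sketched on raw bytes in Python. The function name normalize_newlines is hypothetical; the real code does this inside scanner actions rather than over a whole buffer.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Drop a single CR directly preceding each LF; leave other CR bytes alone."""
    out = bytearray()
    for byte in data:
        if byte == 0x0A and out and out[-1] == 0x0D:
            out.pop()          # remove the trailing \r of this line
        out.append(byte)
    return bytes(out)

print(normalize_newlines(b"a\r\nb"))     # b'a\nb'
print(normalize_newlines(b"a\r\r\nb"))   # b'a\r\nb'
```

Only the \r immediately before \n is removed, so \r\r\n keeps one \r, as described above.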
Afternotes
It’s not an official CMake feature, of course. If you like it, let me or the CMake devs know, to increase the chances of having it in future CMake versions.