pega-texto
A Parsing Expression Grammars (PEG) runtime engine in C

Single-file Parsing Expression Grammars (PEG) runtime engine for C.

To use it, copy pega-texto.h file to your project and #define PEGA_TEXTO_IMPLEMENTATION before including it in one C or C++ source file to create the implementation.

Usage example

#include <stdio.h>
// 1. #define PEGA_TEXTO_IMPLEMENTATION on exactly one C/C++ file and include pega-texto.h
// Optionally #define other compile-time options, check out pega-texto.h for documentation
#define PT_DEFINE_SHORTCUTS // shortcuts make creating grammars more readable
#define PT_DATA size_t // define action result data
#define PEGA_TEXTO_IMPLEMENTATION
#include "pega-texto.h"
// This will return the length for each line we parse
PT_DATA line_length(const char *str, size_t size, int argc, PT_DATA *argv, void *userdata) {
return size;
}
// This will return the longest line length
PT_DATA longest_line(const char *str, size_t size, int argc, PT_DATA *argv, void *userdata) {
size_t longest = 0;
for(int i = 0; i < argc; i++) {
if(argv[i] > longest) {
longest = argv[i];
}
}
return longest;
}
// Helpers for getting error locations
typedef struct text_location { int line; int column; } text_location;
text_location get_text_location(const char *str, size_t where) {
text_location location = { 1, 1 };
for(size_t i = 0; i < where; i++) {
if(str[i] == '\n') {
location.line++;
location.column = 1;
}
else {
location.column++;
}
}
return location;
}
// This will be called when matching an invalid quotation mark
void invalid_quote(const char *str, size_t where, void *userdata) {
text_location location = get_text_location(str, where);
printf("Error: invalid quotation mark in middle of cell at %d:%d\n", location.line, location.column);
}
// This will be called when no quotation mark is found closing an open quoted cell
void unmatched_quote(const char *str, size_t where, void *userdata) {
while(where > 1 && !(str[where] == '"' && str[where - 1] != '"')) where--;
text_location location = get_text_location(str, where);
printf("Error: Quotation mark starting cell at %d:%d is not closed\n", location.line, location.column);
}
// 2. Create some grammar rules, which are arrays of expressions
// Defining indices in an `enum` makes referencing rules clearer
enum rule_indices {
R_CSV,
R_LINE,
R_CELL,
R_QUOTED,
R_EOL,
};
// Rules must have a terminating `PT_END()` operation,
// using `PT_RULE(...)` ensures we don't forget about it
pt_rule CSV = PT_RULE(
// after the whole match succeeds, `longest_line` will be called,
// receiving all inner `line_length` results on argv
ACTION(longest_line,
ZERO_OR_MORE(CALL(R_LINE))
)
);
pt_rule LINE = PT_RULE(
// after the whole match succeeds, `line_length` will be called for each line
ACTION(line_length,
CALL(R_CELL),
ZERO_OR_MORE(B(','), CALL(R_CELL))
),
EITHER(
CALL(R_EOL),
B('\0') // EOF
)
);
pt_rule CELL = PT_RULE(
EITHER(
CALL(R_QUOTED),
ONE_OR_MORE(
ERROR_IF(invalid_quote, B('"')),
ANY_BUT(S("\",\r\n"))
)
)
);
pt_rule QUOTED = PT_RULE(
B('"'),
ZERO_OR_MORE(
EITHER(
L("\"\""),
ANY_BUT(B('"'))
)
),
EITHER(
B('"'),
ERROR(unmatched_quote)
)
);
pt_rule EOL = PT_RULE(
OPTIONAL(B('\r')),
B('\n')
);
// 3. Create your grammar
// Grammars are just arrays of rules
[R_CSV] = CSV,
[R_LINE] = LINE,
[R_CELL] = CELL,
[R_QUOTED] = QUOTED,
[R_EOL] = EOL,
};
int main(int argc, const char **argv) {
const char *csv_text = "first,second,third\n1,\"2\",3";
// 4. Call the match algorithm!
pt_match_result result = pt_match(G, csv_text, NULL);
switch(result.matched) {
printf("Match failed! Invalid CSV content\n");
break;
printf("Match failed! Out of memory\n");
break;
printf("Match failed! Error matched\n");
break;
printf("Match failed! NULL CSV content\n");
break;
default:
printf("Matched CSV content with %d bytes.\n", result.matched);
printf("Longest line is %lu bytes long\n", result.data);
break;
}
return 0;
}
pega-texto.h – Parsing Expression Grammar (PEG) runtime engine
PT_DECL pt_match_result pt_match(const pt_grammar grammar, pt_element_string str, const pt_match_options *const opts)
Try to match the string str with a PEG.
pt_expr * pt_grammar[]
Grammar typedef, a 2D array of expressions, or array of Rules.
Definition: pega-texto.h:217
@ PT_NULL_INPUT
Provided string is a NULL pointer.
Definition: pega-texto.h:127
@ PT_NO_MATCH
Subject string didn't match the given PEG.
Definition: pega-texto.h:121
@ PT_NO_STACK_MEM
Error while allocating memory for the Action stack.
Definition: pega-texto.h:123
@ PT_MATCHED_ERROR
Matched an Error Expression.
Definition: pega-texto.h:125
pt_expr pt_rule[]
Rule typedef, an array of expressions.
Definition: pega-texto.h:215
Match result: a {number of matched chars/match error code, action result} pair.
Definition: pega-texto.h:339
int matched
If non-negative, represents the number of characters matched; otherwise, it's an error code.
Definition: pega-texto.h:342
PT_DATA data
Resulting data from the last top-level Action.
Definition: pega-texto.h:348
Default data type for Actions to return.
Definition: pega-texto.h:138

Change log

  • 4.0.0 - Refactor project as a single header implementation, split Quantifier Expressions into At Least and At Most expressions, make PT_DATA type configurable, add Action Expressions to simplify implementation and make code more readable, reimplement match algorithm with a recursive approach, make Expressions be layed out contiguously in memory and remove heap based creation and destruction of them, make grammar literal definable at compile-time.
  • 3.0.0 - Change actions to receive the capture with pointer and size, instead of pointer to input string, start and end of capture; add Byte expression, change Character Class Expressions to use only functions defined in ctype.h, change Range Expressions to use 2 bytes instead of a NULL terminated string; use Grammars without malloc'ing them.
  • 2.1.0 - Populate pt_match_result.data.i with the first syntactic error code when syntactic errors occur.
  • 2.0.1 - Put extern "C" declarations in inner headers.
  • 2.0.0 - ABI change on pt_match_options, included Case Insensitive and Character Class Expressions (the old Custom Matcher), changed Custom Matcher Expressions to allow operating on strings, also receiving userdata.
  • 1.2.7 - Removed all the Action sequence computation, as Actions are already stacked in the right sequence. Running actions is now iterative, O(n) and use far less memory.
  • 1.2.6 - Fixed SEQ and OR expression macros on C++, turns out they behave differently about temporary lifetime of arrays.
  • 1.2.5 - Fixed SEQ and OR expression macros to compile on both C and C++ using preprocessor macros and initializer_list directly on macro-on.h.
  • 1.2.4 - Added extern "C" declaration on pega-texto.h for using in C++.
  • 1.2.3 - Fixed validation error code emmited when pt_is_nullable returned true, as it may find an error other than PT_VALIDATE_LOOP_EMPTY_STRING.
  • 1.2.2 - Added NULL string check on match.
  • 1.2.1 - Fixed validation error on empty SEQ and OR Expressions, which are valid with a NULL pointer.
  • 1.2.0 - Macros for Expressions to not own memory buffers, empty SEQ and OR Expressions don't allocate a 0-byte buffer anymore, fixed validation error on Non-terminal cycles.
  • 1.1.1 - Fixed validation error on Non-terminal cycles.
  • 1.1.0 - Added basic error handling support.
  • 1.0.0 - Expressions, Grammars, parsing, validation, actions.