Skip to content

ofabricio/bnf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

73 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BNF

A Go library for parsing text using a BNF-like grammar.

playground

Note

This is still in a very early, unusable state.

Install

go get github.com/ofabricio/bnf

Example 1

Parsing a simple expression. Go Playground

import "github.com/ofabricio/bnf"

func main() {

    INP := `6+5*(4+3)*2`

    BNF := `
        expr = ROOT(term '+' expr) | term
        term = ROOT(fact '*' term) | fact
        fact = '('i expr ')'i | numb
        numb = '\d+'r
    `

    b := bnf.Compile(BNF)
    v := bnf.Parse(b, INP)

    bnf.Print(v)

    // Output:
    // [Ident] +
    //     [Ident] 6
    //     [Ident] *
    //         [Ident] 5
    //         [Ident] *
    //             [Ident] +
    //                 [Ident] 4
    //                 [Ident] 3
    //             [Ident] 2
}

Homework: try adding support for whitespaces, minus, division and negative numbers.

Example 2

Parsing a simple JSON that supports only strings and no whitespaces. Go Playground

import "github.com/ofabricio/bnf"

func main() {

    INP := `{"name":"John","addresses":[{"zip":"111"},{"zip":"222"}]}`

    BNF := `
        val = obj | arr | str
        obj = GROUP( '{'i okv (','i okv)* '}'i ):Object
        arr = GROUP( '['i val (','i val)* ']'i ):Array
        okv = str ':'i val
        str = MATCH( '"' ( ANYNOT('"' | '\\') | '\\' any )* '"' )
    `

    b := bnf.Compile(BNF)
    v := bnf.Parse(b, INP)

    bnf.Print(v)

    // Output:
    // [Object]
    //     [Ident] "name"
    //     [Ident] "John"
    //     [Ident] "addresses"
    //     [Array]
    //         [Object]
    //             [Ident] "zip"
    //             [Ident] "111"
    //         [Object]
    //             [Ident] "zip"
    //             [Ident] "222"
}

Homework: try adding support for whitespaces, numbers, booleans and null.

Introduction

BNF is a notation used to describe the syntax of programming languages or other formal languages. It describes how to combine different symbols to produce a syntactically correct sequence of tokens.

It consists of terminal and nonterminal symbols. Nonterminal symbols are identifiers that identify a set of terminal and/or nonterminal symbols.

Consider this possible BNF for a math expression:

expr = num '+' num
num  = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Here the strings ('+', '0', ...) are terminal symbols; num is a nonterminal symbol.

Now consider this expression as input:

6+5*(4+3)

By applying the previous BNF to the input above we get as result:

[Group]
    [Ident] 6
    [Ident] +
    [Ident] 5

Not very exciting at first, but if we get a result it means the parsing succeeded and we have a syntactically correct sequence of tokens. Though we haven't exhausted the whole input here.

Noteworthy points:

  • The result is always a flat tree like the above. Functions are used to compose a tree; they are described in the next section.
  • The parser starts from upper left (expr = num ...) and goes rightward.
  • Terminal symbols emit tokens (a node in the tree).
  • When the parsing fails, a node of type Error is returned; it has the position where the parser failed.
  • There can be recursive calls, when a statement calls itself (see Example 1).

This is the struct returned by the bnf.Parse(b, src) function:

type AST struct {
    Type string // Identifies a group of tokens.
    Text string // Token text.
    Pos  int    // Token position.
    Next []AST  // Children nodes.
}

Quick Reference

Operator Description
'...' Match a plain text string. This emits a token.
'...'r The string is a Regular Expression.
'...'i Ignore the token (do not emit it).
ROOT(a b c) Make b a root token.
GROUP(a) Group the tokens.
ANYNOT(a) Match any character that is not a.
JOIN(a) Join nodes into one.
MATCH(a) Match nodes into one.
TEXT(a) Add a text node in the tree.
SAVE(a) Save a token to be loaded with LOAD().
LOAD() Load a token saved with SAVE().
FIND(a) Scan through the input text until a match is found and emit it.
REVERSE(a b) Reverse the token positions.
SUM(a) Sum aggregation.

Note: there is support for custom functions by adding them to bnf.DefaultFuncs map.

Default identifiers

These identifiers don't need to be defined in the BNF. If defined they will be overridden.

Identifier Description
WS Match a whitespace character '\s'ri
SP Match a space character ' 'i
ST Match a space or tab character '[ \t]'ri
NL Match a newline character '\n'ri
TB Match a tab character '\t'ri
ANY Match any character '.'ri
any Match any character '.'r
EOF Match if the scanner is at the end.
MORE Match if the scanner has more to scan.
SKIPLINE Skip a line.

Documentation

String

Strings '...' emit nodes whenever they match.

INP := `Hello World`

BNF := `
    root = 'Hello' ' ' 'World'
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] Hello
//     [Ident]  
//     [Ident] World

Regex

The r flag makes a String a regular expression instead.

INP := `Hello World`

BNF := `
    root = '\w+'r '\s'r '\w+'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] Hello
//     [Ident]  
//     [Ident] World

Ignore

The i flag prevents a node from being emitted.

INP := `Hello World`

BNF := `
    root = 'Hello' ' 'i 'World'
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] Hello
//     [Ident] World

And

Any sequence of symbols is considered an And expression. For example a b c reads as match a AND b AND c. All of them need to succeed for the matching to continue. And has precedence over Or.

INP := `abc`

BNF := `
    root = 'a' 'b' 'c'
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] a
//     [Ident] b
//     [Ident] c

Or

Any sequence of symbols separeted by a | is considered an Or expression. For example a | b | c reads as match a OR b OR c. If one of them succeeds the matching continues. And has precedence over Or.

INP := `cba`

BNF := `
    root = 'a' | 'b' | 'c'
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Ident] c

Quantifiers

Quantifiers have the same behavior as in a regular expression.

  • ? matches zero or one token.
  • * matches zero or many tokens.
  • + matches one or many tokens.
INP := `OneOneTwo`

BNF := `
    root = 'One'+
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] One
//     [Ident] One

Type

Use :type suffix in any node to rename its type.

INP := `2+3`

BNF := `
    root = num '+':Opr num
     num = '\w+'r:Val
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Val] 2
//     [Opr] +
//     [Val] 3

Functions

ANYNOT

This function matches any character that is not the argument.

INP := `A`

BNF := `
    root = ANYNOT('B')
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Ident] A

FIND

This function scans through the input text until a match is found and emit it.

INP := `The standard chunk of Lorem Ipsum used since the 1500s
is reproduced below for those interested. Sections 1.10.32 and
1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also
reproduced in their exact original form, accompanied by English
versions from the 1914 translation by H. Rackham.`

BNF := `
    root = FIND(ver | num)+
     num = '\d+'r
     ver = MATCH(num '.' num '.' num)
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] 1500
//     [Ident] 1.10.32
//     [Ident] 1.10.33
//     [Ident] 1914

GROUP

By default all matching tokens are emitted separatelly, one node for each matching token. This function groups these nodes together.

Try removing the GROUP functions in the example below and see how the result changes.

INP := `OneTwoThreeFourFiveSixSevenEight`

BNF := `
    root = 'One' GROUP('Two' 'Three') GROUP('Four' GROUP('Five' 'Six') 'Seven') 'Eight'
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] One
//     [Group]
//         [Ident] Two
//         [Ident] Three
//     [Group]
//         [Ident] Four
//         [Group]
//             [Ident] Five
//             [Ident] Six
//         [Ident] Seven
//     [Ident] Eight

JOIN

By default all matching tokens are emitted separatelly, one node for each matching token. This function joins these nodes into one. The order of the nodes matter.

In the example below, note:

  • How 1+1 are emitted separatelly by default.
  • How 2+2 are joined into one node.
  • How 3+3 are joined in a strange order due to the ROOT function, that changes the order of the nodes.
  • How the + gets removed from 4+4 due to the Ignore flag i.
INP := `1+1 2+2 3+3 4+4`

BNF := `
    root = (num '+' num) SP JOIN(num '+' num) SP JOIN(ROOT(num '+' num)) SP JOIN(num '+'i num)
     num = '\d+'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] 1
//     [Ident] +
//     [Ident] 1
//     [Ident] 2+2
//     [Ident] +33
//     [Ident] 44

MATCH

By default all matching tokens are emitted separatelly, one node for each matching token. This function matches these nodes into one. It takes the text between the first and the last matching token, returning a node with it.

It is more efficient than JOIN, but some operators don't work with it, for example the Ignore flag.

In the example below, note:

  • How 1+1 are emitted separatelly by default.
  • How 2+2 are joined into one node.
  • How 3+3 are joined into one node even though ROOT changes the order of the nodes.
  • How the + is not removed from 4+4 even though Ignore flag i is present.
INP := `1+1 2+2 3+3 4+4`

BNF := `
 root = (num '+' num) SP MATCH(num '+' num) SP MATCH(ROOT(num '+' num)) SP MATCH(num '+'i num)
  num = '\d+'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] 1
//     [Ident] +
//     [Ident] 1
//     [Ident] 2+2
//     [Ident] 3+3
//     [Ident] 4+4

REVERSE

This function reverses the token positions.

INP := `OneTwoThree`

BNF := `
    root = REVERSE('One' 'Two' 'Three')
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] Three
//     [Ident] Two
//     [Ident] One

ROOT

This function takes 3 arguments and when all of them match it makes the second argument a root node.

INP := `2+3`

BNF := `
    root = ROOT('2' '+' '3')
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Ident] +
//     [Ident] 2
//     [Ident] 3

SAVE and LOAD (Backreference)

These functions work together to allow for Backreference.

In this example we guarantee that the closing tags have the same name of the opening tags.

INP := `<a>hello<b>world</b></a>`

BNF := `
    tag = GROUP( MATCH('<' SAVE(w) '>') ( w | tag )* MATCH('</' LOAD() '>') )
      w = '\w+'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] <a>
//     [Ident] hello
//     [Group]
//         [Ident] <b>
//         [Ident] world
//         [Ident] </b>
//     [Ident] </a>

SUM

This function performs sum aggregation on numeric token values.

INP := `It is $2 for the tomatos and $3.55 for the potatos.`

BNF := `
    root = SUM(FIND(v)+)
       v = '$'i '\d+(\.\d+)?'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Ident] 5.55

TEXT

This function adds a text node in the tree; this is useful to add a default value.

It accepts either a plain string argument TEXT('example') or no argument TEXT() to add an empty node.

INP := `
    Key1=Value1
    Key2=
    Key3=Value3
`

BNF := `
    root = FIND( pair )+
    pair = val '='i ( val | TEXT('Default Value') )
     val = '\w+'r
`

b := bnf.Compile(BNF)
v := bnf.Parse(b, INP)

bnf.Print(v)

// Output:
// [Group]
//     [Ident] Key1
//     [Ident] Value1
//     [Ident] Key2
//     [Ident] Default Value
//     [Ident] Key3
//     [Ident] Value3

Releases

No releases published

Contributors

Languages