Thursday, October 15, 2020

Racket: Parsing propositional logic in 33 lines

Parsing propositional logic in 33 lines of Racket

I recently became interested in two seemingly disparate things: Scheme/Racket and symbolic logic, so I decided to have some fun by combining those two into a little project. The marriage is actually a little less weird than it seems, since Racket has very robust tools for lexing and parsing due to its focus on meta-programming. In this post, we'll go over some strategies for implementing a simple parser, with concepts that could easily be extended to parse other things like JSON or even your own programming language.

What is propositional logic?

Before we go any further, it would be good to introduce what we'll be parsing. Propositional logic is a branch of symbolic logic that defines a grammar and set of rules for stating propositions. With this, we can define an argument such as:

John goes to the park.Maria goes to the park.Therefore, John and Maria go to the park.\textnormal{John goes to the park.} \newline \textnormal{Maria goes to the park.} \newline \textnormal{Therefore, John and Maria go to the park.} \newlineJohn goes to the park.Maria goes to the park.Therefore, John and Maria go to the park.

Or symbolically:

P:=John goes to the parkQ:=Maria goes to the parkP  .  Q  ∴P∧QP := \textnormal{John goes to the park} \newline Q := \textnormal{Maria goes to the park} \newline P \; \ldotp \; Q \; \therefore P \land QP:=John goes to the parkQ:=Maria goes to the parkP.QPQ

Propositional logic is constrained to sentences that are observational statements like "John goes to the park." There are no variables, and no quantifiers like all or some.

The Symbols

With propositional logic informally introduced, let's look at the symbols we have at our disposal, which will need to be worked into our parser:

Name Symbol
Sentence Letters P,Q,…,ZP, Q, \ldots , ZP,Q,,Z
NOT ∼\sim
AND, OR ∧  ,∨\land \;, \lor,
IF, IFF →  ,  ⟺  \rarr \;, \iff,

Note that P→Q\text{P} \rarr \text{Q}PQ is read as "If P, then Q\text{If P, then Q}If P, then Q".

Parsing in Racket

Okay, that's enough theory, let's get to coding. The main libraries at our disposal are found in the parser-tools package, but the documentation for these tools was confusing to me, and setting up a simple parser was not very ergonomic. Fortunately, there is another way of building a parser: the Brag package.

Brag (Better Racket AST Generator) is a package that allows us to define a grammar in standard BNF form, then easily lex and parse that grammar. Simply install with raco pkg install brag so we can use it.

Step One: Defining our grammar

Let's create a new file in our directory called grammar.rkt and put #lang brag at the top. This is the file we will use to define our BNF grammar. For propositional logic, that grammar can be represented elegantly, like this:

#lang brag
sentence: ATOMIC | complex
complex: LPAR sentence RPAR | sentence connective sentence | NOT sentence
connective: AND | OR | IF | IFF

In Brag, any string in all caps is interpreted as a token to be included in our lexer, while those not in caps are types of expressions that we must define elsewhere in the grammar. We define the tokens ATOMIC, NOT, AND, OR, IF, IFF as analogues for the symbols we discussed earlier. We also have the tokens LPAR and RPAR to represent parentheses.

Here, we are successively breaking down the syntax of the language into its constituent parts. A sentence can be either ATOMIC (e.g PPP or QQQ) or complex. A complex expression is any sentence within parentheses, or two sentences joined by a connective, or a negation of a sentence. Finally, a connective can be AND, OR, IF, or IFF.

Step Two: Creating the lexer

Let's create another racket file in the same directory called parser.rkt. We also need to import brag/support, our grammar from grammar.rkt and the package br-parser-tools/lex, which is a fork of the default racket parser tools used by brag. To install this, simply raco pkg install br-parser-tools. You should now have this:

#lang racket
(require brag/support)
(require br-parser-tools/lex)
(require "grammar.rkt")

Now we can now define our tokenize function, which will take in an input string, and output a list of tokens. It will have the form of something like this:

(define (tokenize ip)
    (define lexer
        (lexer-src-pos
            
                [(eof) (void)]))
    (define (next-token) (lexer ip))
    next-token)

All we are doing here is setting up the imported function lexer-src-pos which will take a list of rules for matching tokens and return a function that lexes them. Then, by calling next-token we run this lexer function recursively on the input port until we reach the special token eof, at which point we return void.

We can now add in the rules for the tokens we want to lex for:

(define (tokenize ip)
    (define lexer
        (lexer-src-pos
                [(char-range #\P #\Z) (token 'ATOMIC lexeme)] 
                ["^" (token 'AND lexeme)] 
                        ["v" (token 'OR lexeme)] 
                        ["~" (token 'NOT lexeme)] 
                        ["->" (token 'IF lexeme)] 
                        ["<->" (token 'IFF lexeme)] 
                        ["(" (token 'LPAR lexeme)] 
                        (")" (token 'RPAR lexeme)) 
                [whitespace (token 'WHITESPACE lexeme #:skip? #t)] 
                [(eof) (void)])) 
    (define (next-token) (lexer ip)) 
    next-token)

Notice how each of these rules are enclosed in brackets, where the first expression is the pattern to lex, and the second expression is what to return with that pattern. You can return anything you want (or add side effects) but we will return the token struct built into Brag, since this will integrate well with the package's parsing capabilities.

The token struct takes two inputs: the name (which we defined in our grammar!), and the lexeme, which is the string returned from the pattern we just lexed. So, when we add ["^" (token 'AND lexeme)], we are lexing for the string "^" under the syntactic rules for the token AND defined in our grammar file.

Also note that the lex package comes with some functions useful for matching tokens like char-range and whitespace which we use here too.

Step Three: Parse and Display

We now have a function that can tokenize any input string, so let's parse those tokens now, which is really as simple as:

(define stx
        (parse (tokenize (open-input-string "P^Q")))) 
(syntax->datum stx) 

Parser complete! And as promised, the grammar and parser files amount to just 33 lines.

This produces a list of tokens, which for the input "P^Q" is

'(sentence 
    (complex 
      (sentence "P") 
      (connective "^") 
      (sentence "Q")))

This is exactly the data structure we want. The entire string is a sentence, with a complex sentence within it, containing the tokens P, ^, and Q.

Let's try it on something a bit more complicated: "(P->Q) <-> ~R":

'(sentence 
    (complex (sentence (complex 
      "(" (sentence (complex (sentence "P") (connective "->") (sentence "Q"))) ")"))
      (connective "<->") 
      (sentence (complex "~" (sentence "R")))))

Again, this nested structure is exactly what we want.

Where to go from here?

We can now parse any symbolic sentence in propositional logic. Parsing is a great first step, but on it's own, it doesn't do very much. With the data structures we produce though, we could create programs that visualize the logical tree, derive theorems and arguments with a solver, among others. The same strategies we used here could also be applied to parsing JSON or other structured languages.

And, for a more in-depth look at Brag, make sure to read the docs.

Questions or comments? Email me.



from Hacker News https://ift.tt/33SzD7b

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.