Parsing Simple Grammars in Scala With parboiled2

parboiled2 is a Macro-Based PEG Parser Generator written in Scala. It has become our preferred tool for creating parsers for simple grammars. It offers a fairly simple syntax for creating parsers and boasts significantly better performance over Scala Combinators.

To illustrate its use, we will use this post to build a parser for a simplified version of the Slack Assisted Search.

The simplifying assumptions that we will make about the language are as follows:

  • The search term will always be last.
  • It supports “from”, “in”, and “on”.
  • To start with, we will only allow one filter, and it must be the first element.

The Basic Parser

To begin, we will show setting up the parser and parsing out the search term. We will start by defining a simple Algebraic Data Type (ADT) for our Search Filters:

import org.joda.time.DateTime

sealed trait SlackSearch
case class SearchTerm(term: String) extends SlackSearch
case class SearchInChannel(search: SearchTerm, channel: String) extends SlackSearch
case class SearchMessagesFrom(searchTerm: SearchTerm, user: String) extends SlackSearch
case class SearchOnDate(searchTerm: SearchTerm, dateTime: DateTime) extends SlackSearch

This sealed trait hierarchy encodes the four cases that we are trying to support:

  • The SearchTerm represents a search for a string with no filters.
  • The SearchInChannel case class represents a search in the Channel whose name is represented by the channel string.
  • The SearchMessagesFrom case class represents a search in Messages from the user with the name in user.
  • The SearchOnDate case class represents a search for everything that matches the term on the given date.

With these definitions in hand, we can implement a simple parser that will extract a SearchTerm.

import org.parboiled2._

class SlackAssistedSearchParser(val input: ParserInput) extends Parser {
 def InputLine = rule(Search ~ EOI)
 def Search = rule(Term ~> SearchTerm)
 def Term: Rule1[String] = rule(capture(oneOrMore(CharPredicate.AlphaNum)))
}

This is a fairly simple parser. As the code suggests, the Term rule captures a string that consists of one or more alphanumeric characters. The Search rule matches a Term and then uses the action operator aka ~> to add a SearchTerm to parboiled’s value stack. The action operator takes a function whose arguments match the number of matched values. The result of this function is then placed on parboiled’s value stack to be returned. So in this case, our function is SearchTerm.apply, which will take the string captured by Term and return a SearchTerm.

We can run this parser in the REPL on some valid input and some invalid input:

scala> new SlackAssistedSearchParser("searchterm").InputLine.run()
res1: scala.util.Try[com.threatstack.blog.SlackSearchAst] = Success(SearchTerm(searchterm))

scala> new SlackAssistedSearchParser("search term").InputLine.run()
res2: scala.util.Try[com.threatstack.blog.SlackSearchAst] = Failure(ParseError(Position(7,1,8), Position(7,1,8), <8 traces>)) 

The first attempt at running the parser is successful and successfully parses a SearchTerm. The second attempt, which is invalid because of the whitespace, returns an error that indicates the source position of where the parser failed to match.

Whitespace

One thing to note is that this parser does not handle any whitespace that the user may accidentally insert. As parboiled’s readme points out, this is a distinct disadvantage when using PEG Parsers. The documentation recommends that users “match whitespace always immediately after a terminal”.

We can see how this looks when we add a simple whitespace rule to Term:

def Term: Rule1[String] = 
  rule { capture(oneOrMore(CharPredicate.AlphaNum)) ~ WhiteSpace }
 def WhiteSpace = rule(zeroOrMore(WhiteSpaceChar))
 def WhiteSpaceChar = CharPredicate(" ")

Channel and From Searches

Now that we have a simple parser, we can handle the next two cases: SearchInChannel and SearchMessagesFrom. We can add the following to the SlackAssistedSearchParser:

def SearchFilter: Rule1[SlackSearch] = rule {
 Search ~ InChannel  ~> SearchInChannel    |
 Search ~ FromFilter ~> SearchMessagesFrom |
 Search
}

def InChannel: Rule1[String] = rule("in" ~ ":" ~ Term)
def FromFilter: Rule1[String] = rule("from" ~ ":" ~ Term)

The InChannel rule must match the string “in” followed by the string “:” and finally followed by a Term. The FromFilter is basically the same — it just matches “from” instead of “in”. We also amend SearchFilter to handle these two new cases.To accomplish this, we’ll make use of the ordered choice operator also known as |. This operator attempts to match the first clause, and if it cannot, it tries again with the second clause.

We can construct some rules in the console and see how they parse:

scala>  new SlackAssistedSearchParser("elasticsearch in:bigthings").InputLine.run()
res3: scala.util.Try[com.threatstack.blog.SlackSearchAst] = Success(SearchInChannel(SearchTerm(elasticsearch),bigthings))

scala>  new SlackAssistedSearchParser("xcom from:wayne").InputLine.run()
res4: scala.util.Try[com.threatstack.blog.SlackSearchAst] = Success(SearchMessagesFrom(SearchTerm(xcom),wayne))

Sub Parsers

To support searching for messages on a given date, we’ll implement our own rudimentary date parser. It will handle the two formats that Slack handles: MM/DD/YYYY and YYYY/MM/DD.

While this is not going to be a huge addition, parsing dates may be useful to us elsewhere, or maybe we just feel that it clutters our existing parser. So we’ll create another parser, and use parboiled’s ability to run a parser inside a parser.

We’ll start by defining a parser for our two date formats:

class DateParser(val input: ParserInput) extends Parser {
  import CharPredicate._

  def InputLine = rule(Date ~ EOI)

  def Date: Rule1[DateTime] = rule {
    Month ~ "/" ~ Day ~ "/" ~ Year ~> dateTime _ |
    Year ~ "/" ~ Month ~ "/" ~ Day ~> ((year: Int, month: Int, day: Int) => dateTime(month, day, year))
  }

  def Day = Digit2
  def Month = Digit2
  def Year = Digit4

  def Digit2: Rule1[Int] = rule(capture(2.times(Digit)) ~> ((_: String).toInt))
  def Digit4: Rule1[Int] = rule(capture(4.times(Digit)) ~> ((_: String).toInt))
  
  private def dateTime(month: Int, day: Int, year: Int): DateTime =
    new DateTime(year, month, day, 0, 0, 0)
}

The meat of this parser is in “Date” where we differentiate the two date formats that we have to support. We also define the rules Day, Month, and Year to make the intent of the parser clearer.

We can run this parser on some date inputs by themselves:

scala> new DateParser("10/24/2017").InputLine.run()
res5: scala.util.Try[org.joda.time.DateTime] = Success(2017-10-24T00:00:00.000-04:00)

scala> new DateParser("2017/10/24").InputLine.run()
res6: scala.util.Try[org.joda.time.DateTime] = Success(2017-10-24T00:00:00.000-04:00)

We can then take this parser and use it in our SlackAssistedSearchParser:

def SearchFilter: Rule1[SlackSearchAst] = rule {
 Search ~ DateFilter ~> SearchOnDate       |
 Search ~ InChannel  ~> SearchInChannel    |
 Search ~ FromFilter ~> SearchMessagesFrom |
 Search
}

def DateFilter: Rule1[DateTime] = rule {
 "on" ~ ':' ~ runSubParser(new DateParser(_).InputLine)
}

This parser will attempt to use the sub parser (in our case, the date parser) to match input. This can be useful for breaking out parsers to be reusable or even just to separate the concerns of various parsers. We can try it out in the REPL and see that it will use the DateParser to parse dates:

scala> new SlackAssistedSearchParser("hello on:2016/12/10").InputLine.run()
res9: scala.util.Try[com.threatstack.blog.SlackSearchAst] = Success(SearchOnDate(SearchTerm(hello),2016-12-10T00:00:00.000-05:00))

Conclusion

In this post we have provided a very simple example of how to use parboiled. The domain was incredibly simplified, but because of parboiled’s flexibility, it would be possible to implement the full spec. This is actually a nice exercise for the reader, because it would leverage some of the parboiled operators that have not been covered in this post. It also shows some promise for sharing code with the front-end, because parboiled can be used with Scalajs.

In general, we’ve found parboiled2 to be a great way to handle little expression languages that need to be exposed either internally or externally. It’s performance and flexibility make it a valuable library. The full source code is available on Github.