Designing a Parsing Library in Scala

Parsing is something many programmers dread. Many prefer to cram all their use cases into existing formats such as JSON or YAML, even when they are effectively building a DSL in those formats. If one must parse, many prefer parser generators such as ANTLR. But for those who are already familiar with functional programming, a functional parser combinator library can be a convenient and high-performance alternative.

This post will cover two aspects of a parsing library that may be unfamiliar to some programmers. First, I will describe a pattern that improves the performance of functional libraries by safely using mutability that is internal to an implementation yet opaque to users. Second, once we have fast parsers, I will introduce some design changes that make the parsing library safer to use and better able to give precise error messages.

How to go fast?

In a 2015 talk on Fastparse, Li Haoyi made much the same argument for parsing as above: it can be fast, and it can be written clearly in normal Scala code. Haoyi presented Fastparse, which has a nice API and is also fast. There have been a number of Scala parsing libraries with nice APIs, but often their performance was very poor. Haoyi makes the point that you may say you don't care about performance, but he showed examples of parsers that took minutes elsewhere and ran in seconds with Fastparse, and it is hard to imagine not caring about that difference. His talk does not detail exactly how Fastparse is implemented, but he mentions that internally it uses a number of mutable hacks.

Three years later in 2018 I gave a talk summarizing the main approach of this post. A full realization of this approach can be found in a single file implementation. The core idea of this design is the following:

  1. Encapsulate the mutable state in a private class hidden from users.
  2. Parsers have a protected method that receives the state and mutates it to reflect the result of running the current parser.
  3. The Parser type is sealed, so all of its implementations live in the same file; this lets us ensure that every parser is lawful and that the mutation is never exposed to the user.
  4. The user sees a single parse method that accepts an input and returns an output; internally, that method allocates a State instance which is mutated as parsing runs.

A simplified version of this pattern is sketched below:

sealed trait Parser[A] {
  protected def parseInternal(s: State): A

  final def parse(str: String): Either[Error, (String, A)] = {
    val state = new State(str)
    val res = parseInternal(state)
    if (state.isError) Left(state.getError)
    else Right((str.substring(state.offset), res))
  }
}

I want to stress why this design is safe: first, the method that has access to the mutable State value is protected, so only subclasses can call it; second, the trait is sealed, so all of its instances live in the current file. Hence there is no way users can accidentally violate the invariants required of modifications to State. For instance, parsers never move backwards on success, but an incorrect implementation could update the mutable offset in State incorrectly and break that invariant if the mutation were exposed.
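To make the pattern concrete, here is a hypothetical, self-contained miniature; the names State, Error, and Str are illustrative, not the real library's API:

```scala
// A case class for parse errors, used by the mutable State below.
final case class Error(offset: Int, message: String)

// The mutable state: hidden from users, only threaded through parseInternal.
final class State(val input: String) {
  var offset: Int = 0
  var error: Error = null
  def isError: Boolean = error != null
  def getError: Error = error
}

sealed trait Parser[A] {
  // Only the implementations in this file can see or mutate State.
  protected def parseInternal(s: State): A

  final def parse(str: String): Either[Error, (String, A)] = {
    val state = new State(str)
    val res = parseInternal(state)
    if (state.isError) Left(state.getError)
    else Right((str.substring(state.offset), res))
  }
}

// A sample parser: match an exact literal, advancing the offset on success.
final case class Str(expected: String) extends Parser[Unit] {
  protected def parseInternal(s: State): Unit =
    if (s.input.startsWith(expected, s.offset)) s.offset += expected.length
    else s.error = Error(s.offset, s"expected '$expected'")
}
```

Because parseInternal is protected and the trait is sealed, code outside this file can only call parse, which never exposes the mutable State.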

Note that this pattern of a mutable inner state also works well for random-generation monads, or for modeling mutable references. See for example this code, which uses a similar pattern to build a monad with faster mutable references than an immutable map inside a state monad.

While this pattern is very useful for writing fast code that, from a user's perspective, remains purely functional, I was dissatisfied with using Fastparse. I had grown concerned about a few issues:

  1. The Haskell community taught us that parsers compose in many ways. Specifically, parsers have instances of Defer, Alternative and Monad, yet out of the box Fastparse doesn't provide those instances, nor does it always reuse the combinator names a Cats user might be used to. This presents a barrier to learning and reading this code when used in a Cats-style codebase.
  2. Moving from Fastparse 1 to 2 requires a fairly major change and relies on macros. Scala 2 macros are slated for removal or deprecation in Scala 3, so it seemed that porting code from Fastparse 1 to 2 would be signing up for another major migration soon.
  3. By default, Fastparse parsers backtrack. While backtracking may seem like a feature, parsers that use too much of it can easily become exponential in runtime. Backtracking also makes it harder to report exactly where an error occurred, because we rewind on error. Fastparse supports cuts, which prevent backtracking, but in my experience the design should be inverted: cut by default, and opt in to backtracking as needed.
  4. One of the most commonly used parser combinators is repetition: a given thing repeated n or more times. Unfortunately, this combinator is not safe to apply to a parser that can succeed after consuming no input. If one does so, at runtime you will run out of memory trying to allocate an infinitely long list. This kind of mistake is common when writing complex parsers, yet most parsing libraries offer no static protection against it.
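The repetition hazard in the last point can be sketched directly. This toy model (all names hypothetical) bounds the loop so the failure mode is observable rather than hanging:

```scala
// A toy parser: given input, return the remaining input and a value, or None.
type P[A] = String => Option[(String, A)]

// Succeeds while consuming nothing: an "epsilon success".
val epsilon: P[Unit] = s => Some((s, ()))

// Succeeds only by consuming exactly one matching character.
def char(c: Char): P[Char] =
  s => if (s.nonEmpty && s.head == c) Some((s.tail, c)) else None

// Naive repetition, with a step budget standing in for "runs forever".
def repBounded[A](p: P[A], input: String, maxSteps: Int): Either[String, Int] = {
  var rest = input
  var count = 0
  var done = false
  while (!done && count < maxSteps) {
    p(rest) match {
      case Some((r, _)) => rest = r; count += 1
      case None         => done = true
    }
  }
  if (done) Right(count)
  else Left("budget exhausted: this repetition would loop forever")
}
```

A consuming parser terminates with a count; the epsilon parser never advances the input, so only the budget stops it.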

Parsing Fast and Safe

While working in Haskell on a project a few years ago, I learned of the Trifecta parsing library. A main goal of Trifecta is to produce superior error messages that point more clearly to where parsing has failed and to what was expected. In that design, Trifecta defines four possible outputs of a parser:

  1. “epsilon failure”: failing to parse without consuming any input.
  2. “committed failure”: failing to parse after consuming 1 or more characters of the input.
  3. “epsilon success”: success after consuming no input, e.g. a parser that returns a constant value.
  4. “committed success”: success after consuming 1 or more characters from input.

One of the most interesting combinators on parsers is alternation: either this parser or that one. We first try parser 1 and, if it fails, try parser 2. In many parsing libraries this automatically backtracks on the failure of parser 1. Trifecta introduced to me the idea of alternation moving to the second parser only on an “epsilon failure”. Experienced readers might relate this to LL(1) parsers: we can dispatch alternation by looking at only a single next character. Consider parsing JSON: to see whether the next item is null, a string, a boolean, a number, an array or a mapping, you only need to peek at the next single character. If it does not match what you expect, you don't consume it. In this case, it is easy to write a JSON parser without any backtracking, alternating to the next parser only on an epsilon failure.
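This single-character JSON dispatch can be sketched as a plain function; the name jsonBranch and the string encoding of the branches are purely illustrative:

```scala
// LL(1)-style dispatch: one character of lookahead is enough to pick the
// branch, so no backtracking is ever needed.
def jsonBranch(input: String): Either[String, String] =
  input.headOption match {
    case Some('n')                        => Right("null")
    case Some('t') | Some('f')            => Right("boolean")
    case Some('"')                        => Right("string")
    case Some('[')                        => Right("array")
    case Some('{')                        => Right("object")
    case Some(c) if c.isDigit || c == '-' => Right("number")
    // epsilon failure: nothing matched and nothing was consumed
    case _                                => Left("no alternative matched")
  }
```

Each branch can then hand off to a dedicated parser, confident that no other branch could have matched.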

By contrast, if we start consuming characters to parse a mapping, and the first few characters are “[ ==”, we know immediately on parsing the “=” character that there has been an error: we do not want to backtrack and try something new. This happens naturally in Trifecta because you have consumed some characters to reach that point.

Sometimes we do want backtracking. It is unfortunate, but some parsing does require a certain amount of look-ahead, or can be expressed much more conveniently with it. For those cases, Trifecta provides a combinator that converts all “committed failure” cases into “epsilon failure” cases, so that normal alternation can work. In Trifecta this is called try, but since try is a keyword in Scala, I suggest backtrack instead.
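A minimal sketch of these semantics, assuming a hypothetical encoding of the four outcomes (these names are not Trifecta's or any real library's API):

```scala
// The possible results of running a parser.
sealed trait Outcome[+A]
case class EpsilonFail(expected: String) extends Outcome[Nothing]
case class CommittedFail(offset: Int, expected: String) extends Outcome[Nothing]
case class Success[+A](value: A, consumed: Int) extends Outcome[A]

// backtrack downgrades a committed failure to an epsilon failure,
// pretending no input was consumed.
def backtrack[A](o: Outcome[A]): Outcome[A] =
  o match {
    case CommittedFail(_, expected) => EpsilonFail(expected)
    case other                      => other
  }

// Alternation only tries the second branch on an epsilon failure.
def orElse[A](first: Outcome[A], second: => Outcome[A]): Outcome[A] =
  first match {
    case EpsilonFail(_) => second
    case other          => other
  }
```

Without backtrack, a committed failure in the first branch short-circuits alternation; wrapping the first branch in backtrack restores the try-everything behavior where it is genuinely needed.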

Even before alternation, perhaps the most important combinator for parsing is sequencing, or product: parse A then B. While porting my parsers from backtrack-by-default à la Fastparse to opt-in backtracking à la Trifecta, I noticed a kind of combinator I wanted that I didn't see how to get in Trifecta: a soft product.

Given parser A and parser B: if A fails, return that failure. If A succeeds, run B. If B returns an epsilon failure, backtrack to before A ran; if B returns a committed failure, return that failure; if B succeeds, return the result of both parsers.

I call this a kind of soft-product. As an example, one might use this with A parsing some whitespace or optional connector to a postfix operator parsed by B. Succeeding to parse whitespace but failing to parse the operator shouldn’t fail the entire parse (although potentially this example is better dealt with by making parsers consume trailing whitespace).
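A minimal sketch of the soft product over a toy parser model, where a failure carries the number of characters consumed before failing (0 meaning an epsilon failure); all names here are hypothetical:

```scala
// A toy parser: Right((rest, value)) on success, Left(consumed) on failure.
type P[A] = String => Either[Int, (String, A)]

// Match an exact literal; simplification: always fail without consuming.
def string(expected: String): P[Unit] =
  s =>
    if (s.startsWith(expected)) Right((s.drop(expected.length), ()))
    else Left(0)

def softProduct[A, B](pa: P[A], pb: P[B]): P[(A, B)] =
  s =>
    pa(s) match {
      case Left(n) => Left(n) // A failed: just fail
      case Right((rest, a)) =>
        pb(rest) match {
          // epsilon failure in B: roll back to before A as well
          case Left(0) => Left(0)
          // committed failure in B: report total characters consumed
          case Left(n) => Left((s.length - rest.length) + n)
          case Right((rest2, b)) => Right((rest2, (a, b)))
        }
    }

// The whitespace-then-postfix-operator example from above.
val ws        = string(" ")  // A: a connector
val inc       = string("++") // B: a postfix operator
val wsThenInc = softProduct(ws, inc)
```

Failing to find the operator after the whitespace rolls the whole product back to an epsilon failure, so an enclosing alternation can still try something else.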

To address repetition safety, the ability to safely write code that parses N or more items, I introduced a subtype of Parser which I imaginatively call Parser1. A Parser1 is a parser that consumes 1 or more characters on success; a Parser1 can never have an epsilon success. We can then require a Parser1 in order to use repetition combinators. It is interesting to note how Parser1 composes with Parser: if you sequence one of each, you get a Parser1 as a result, since once characters are consumed they are never unconsumed. The same is true of flatMap: if either the first or the second parser is a Parser1, the result is a Parser1. Leveraging these properties, I could easily rewrite most of my parsers as Parser1 values.
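A toy model of this split (hypothetical names throughout): P1 wraps a run function that by construction consumes at least one character on success, so rep is offered only on P1 and cannot loop:

```scala
// A general parser: may succeed while consuming nothing.
final case class P[A](run: String => Option[(String, A)])

// A parser that, by construction, consumes >= 1 character on success.
final case class P1[A](run: String => Option[(String, A)]) {
  // Sequencing a P1 with any P still consumes at least one character,
  // so the product is again a P1.
  def ~[B](that: P[B]): P1[(A, B)] =
    P1 { s =>
      run(s).flatMap { case (r1, a) =>
        that.run(r1).map { case (r2, b) => (r2, (a, b)) }
      }
    }

  // Safe repetition: each successful iteration strictly shrinks the input.
  def rep: P[List[A]] =
    P { s =>
      var rest = s
      val buf = List.newBuilder[A]
      var done = false
      while (!done) {
        run(rest) match {
          case Some((r, a)) => buf += a; rest = r
          case None         => done = true
        }
      }
      Some((rest, buf.result()))
    }
}

def char(c: Char): P1[Char] =
  P1(s => if (s.nonEmpty && s.head == c) Some((s.tail, c)) else None)

val anything: P[Unit] = P(s => Some((s, ())))  // may consume nothing
```

Because rep lives only on P1, a repetition over an epsilon-succeeding parser like anything simply does not type-check, which is exactly the static guarantee described above.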

In the Scala functional programming community we have embraced more fine-grained divisions of our typeclasses than seems common in the Haskell community. For instance, in Cats, if we remove the function called pure, the ability to lift a value via A => F[A], from Monad we get FlatMap, and if we do the same to Applicative we get Apply. Similarly, if we weaken Alternative we get MonoidK. The function A => Parser[A] is easy to write: consume no input and just return the argument of type A. Since a Parser1 must consume input, because it never has an epsilon success, we cannot write such a function A => Parser1[A] that satisfies the properties we want in relation to the Monad laws. So, while for Parser we have instances of Alternative and Monad, for Parser1 we only have instances of MonoidK and FlatMap. It is interesting that we still have lawful abstractions for these more restricted parsers, and we can use them in repetitions, which are ubiquitous in parsing, with an added guarantee: we will never hit the runtime error of an infinite loop on an epsilon success.

Summary and Invitation for Collaboration

The main points I want to make here are:

  1. Reemphasize the performance wins of private mutability in code with pure and immutable APIs (see my talk).
  2. Call out the value of non-backtracking by default for faster parsers and better error messages (thanks, Trifecta)!
  3. Call out the value of a Parser subtype which always consumes one or more characters on success for writing safer parsers: these parsers have interesting typeclass instances and compose well with fully general parsers (yay Parser1)!

I have implemented this parsing library with full test coverage and used it to replace my usage of Fastparse in a non-trivial case: parsing my Python-like, statically typed, pure and total functional language Bosatsu. Porting my parsing code took a few days, since I had to locate all the places where I was using backtracking without realizing it. With a few changes I was able to remove several of those, and I am now down to only two cases that require non-trivial backtracking.

My library seems to match the performance of Fastparse 1, compiles with Scala.js, runs with GraalVM native-image, and does not use Scala macros. Additionally, it follows Cats naming conventions and includes Cats typeclass instances. I think it could be an excellent start for a Typelevel cats-parse project if anyone is interested in collaborating with me. The best way to reach me is to find me on Twitter.