---
title: "Write You a Parser for Fun and Win"
date: 2013-11-26T00:00:00+00:00
draft: false
canonical_url: https://www.viget.com/articles/write-you-a-parser-for-fun-and-win/
---

As a software developer, you're probably familiar with the concept of a parser, at least at a high level. Maybe you took a course on compilers in school, or downloaded a copy of [*Create Your Own Programming Language*](http://createyourproglang.com), but this isn't the sort of thing many of us get paid to work on. I'm writing this post to describe a real-world web development problem to which creating a series of parsers was the best, most elegant solution. This is more in-the-weeds than I usually like to go with these things, but stick with me -- this is cool stuff.

## The Problem

Our client, the [Chronicle of Higher Education](http://chronicle.com/), [hired us](https://viget.com/work/chronicle-vitae) to build [Vitae](http://chroniclevitae.com/), a series of tools for academics to find and apply to jobs, chief among which is the *profile*, an online résumé of sorts. I'm not sure when the last time you looked at a career academic's CV was, but these suckers are *long*, packed with degrees, publications, honors, etc. We created some slick [Backbone-powered interactions](https://viget.com/extend/backbone-js-on-vitae) for creating and editing individual items, but a user with 70 publications still faced a long road to create her profile.

Since academics are accustomed to following well-defined formats (e.g. bibliographies), [KV](https://viget.com/about/team/kvigneault) had the idea of creating formats for each profile element, and giving users the option to create and edit all their data of a given type at once, as text.
So, for example, a user might enter his degrees in the following format:

    Duke University
    ; Ph.D.; Biomedical Engineering

    University of North Carolina
    2010; M.S.; Biology
    2007; B.S.; Biology

That is to say, the user has a bachelor's and a master's in Biology from UNC, and is working on a Ph.D. in Biomedical Engineering at Duke.

## The Solution

My initial, naïve approach to processing this input involved splitting it up by line and attempting to suss out what each line was supposed to be. It quickly became apparent that this was untenable for even one model, let alone the 15+ that we eventually needed. [Chris](https://viget.com/about/team/cjones) suggested creating custom parsers for each resource, an approach I'd initially written off as being too heavy-handed for our needs.

What is a parser, you ask? [According to Wikipedia](https://en.wikipedia.org/wiki/Parsing#Computer_languages), it's

> a software component that takes input data (frequently text) and
> builds a data structure -- often some kind of parse tree, abstract
> syntax tree or other hierarchical structure -- giving a structural
> representation of the input, checking for correct syntax in the
> process.

Sounds about right. I investigated [Treetop](http://treetop.rubyforge.org/), the best-known Ruby library for creating parsers, but I found it to be targeted more toward building standalone tools than toward use inside a larger application. Searching further, I found [Parslet](http://kschiess.github.io/parslet/), a "small Ruby library for constructing parsers in the PEG (Parsing Expression Grammar) fashion." Parslet turned out to be the perfect tool for the job.
Here, for example, is a basic parser for the above degree input:

```ruby
class DegreeParser < Parslet::Parser
  root :degree_groups

  rule(:degree_groups) { degree_group.repeat(0, 1) >>
    additional_degrees.repeat(0) }

  rule(:degree_group) { institution_name >>
    (newline >> degree).repeat(1).as(:degrees_attributes) }

  rule(:additional_degrees) { blank_line.repeat(2) >> degree_group }

  rule(:institution_name) { line.as(:institution_name) }

  rule(:degree) { year.as(:year).maybe >>
    semicolon >>
    name >>
    semicolon >>
    field_of_study }

  rule(:name) { segment.as(:name) }
  rule(:field_of_study) { segment.as(:field_of_study) }

  rule(:year) { spaces >>
    match("[0-9]").repeat(4, 4) >>
    spaces }

  rule(:line) { spaces >>
    match('[^ \r\n]').repeat(1) >>
    match('[^\r\n]').repeat(0) }

  rule(:segment) { spaces >>
    match('[^ ;\r\n]').repeat(1) >>
    match('[^;\r\n]').repeat(0) }

  rule(:blank_line) { spaces >> newline >> spaces }
  rule(:newline) { str("\r").maybe >> str("\n") }
  rule(:semicolon) { str(";") }
  rule(:space) { str(" ") }
  rule(:spaces) { space.repeat(0) }
end
```

Let's take this line-by-line:

**2:** the `root` directive tells the parser what rule to start parsing with.

**4-5:** `degree_groups` is a Parslet rule. It can reference other rules, Parslet instructions, or both. In this case, `degree_groups`, our parsing root, is made up of zero or one `degree_group` followed by any number of `additional_degrees`.

**7-8:** a `degree_group` is defined as an institution name followed by one or more newline + degree combinations. The `.as` method defines the keys in the resulting output hash. Use names that match up with your ActiveRecord objects for great justice.

**10:** `additional_degrees` is just a blank line followed by another `degree_group`.

**12:** `institution_name` makes use of our `line` rule (which we'll discuss in a minute) and simply gives it a name.

**14-18:** Here's where a degree (e.g. "1997; M.S.; Psychology") is defined.
We use the `year` rule, defined on line 23 as four digits in a row, give it the name "year," and make it optional with the `.maybe` method. `.maybe` is similar to the `.repeat(0, 1)` we used earlier, the difference being that the latter will always put its results in an array. After that, we have a semicolon, the name of the degree, another semicolon, and the field of study.

**20-21:** `name` and `field_of_study` are segments, text content terminated by semicolons.

**23-25:** a `year` is exactly four digits with optional whitespace on either side.

**27-29:** a `line` (used here for our institution name) is at least one non-newline, non-whitespace character plus everything up to the next newline.

**31-33:** a `segment` is like a `line`, except it also terminates at semicolons.

**35-39:** here we put names to some literal string matches, like semicolons, spaces, and newlines.

In the actual app, the rules shared between parsers (year, segment, newline, etc.) live in a parent class, so each parser contains only its resource-specific rules.

Here's what we get when we pass our degree info to this new parser:

```ruby
[{:institution_name=>"Duke University"@0,
  :degrees_attributes=>
   [{:name=>" Ph.D."@17,
     :field_of_study=>" Biomedical Engineering"@24}]},
 {:institution_name=>"University of North Carolina"@49,
  :degrees_attributes=>
   [{:year=>"2010"@78,
     :name=>" M.S."@83,
     :field_of_study=>" Biology"@89},
    {:year=>"2007"@98,
     :name=>" B.S."@103,
     :field_of_study=>" Biology"@109}]}]
```

The values are Parslet slices (`Parslet::Slice` objects), and the `@XX` indicates the offset in the input at which the rule matched. With a little bit of string coercion, this output can be fed directly into an ActiveRecord model. If the user's input is invalid, Parslet makes it similarly straightforward to point out the offending line.
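
To make the "string coercion" step a bit more concrete, here's a rough sketch of one way it could look. This isn't the code from the app: the `coerce` helper is a hypothetical name, and plain strings stand in for Parslet's slices (which respond to `to_s`) so the snippet runs without the gem.

```ruby
# Hypothetical helper: walk a Parslet-style result tree and turn every
# leaf into a plain, whitespace-stripped String.
def coerce(node)
  case node
  when Array then node.map { |child| coerce(child) }
  when Hash  then node.transform_values { |value| coerce(value) }
  else node.to_s.strip # a Parslet slice responds to #to_s as well
  end
end

# Plain strings standing in for the slices in the parser output.
parsed = [
  { institution_name: "Duke University",
    degrees_attributes: [
      { name: " Ph.D.", field_of_study: " Biomedical Engineering" }
    ] }
]

attributes = coerce(parsed)
```

From there, a hypothetical `Institution` model that declares `accepts_nested_attributes_for :degrees` could take these hashes via something like `Institution.create!(attributes)`. The model name and that setup are assumptions for illustration only.
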

------------------------------------------------------------------------

This component of Vitae was incredibly satisfying to work on, because it solved a real-world issue for our users while scratching a nerdy personal itch. I encourage you to learn more about parsers (and [Parslet](http://kschiess.github.io/parslet/) specifically) and to look for ways to use them in projects both personal and professional.