headshot of Ryan Ryan Marcus, assistant professor at the University of Pennsylvania (Fall '23). Using machine learning to build the next generation of data systems.
      
    ____                       __  ___                          
   / __ \__  ______ _____     /  |/  /___ _____________  _______
  / /_/ / / / / __ `/ __ \   / /|_/ / __ `/ ___/ ___/ / / / ___/
 / _, _/ /_/ / /_/ / / / /  / /  / / /_/ / /  / /__/ /_/ (__  ) 
/_/ |_|\__, /\__,_/_/ /_/  /_/  /_/\__,_/_/   \___/\__,_/____/  
      /____/                                                    
        
   ___                   __  ___                    
  / _ \__ _____ ____    /  |/  /__ ___________ _____
 / , _/ // / _ `/ _ \  / /|_/ / _ `/ __/ __/ // (_-<
/_/|_|\_, /\_,_/_//_/ /_/  /_/\_,_/_/  \__/\_,_/___/
     /___/                                          
        
   ___  __  ___                    
  / _ \/  |/  /__ ___________ _____
 / , _/ /|_/ / _ `/ __/ __/ // (_-<
/_/|_/_/  /_/\_,_/_/  \__/\_,_/___/                                   
        

A Dirty JSON Parser

I’ve written a parser that does its best to parse invalid and malformed JSON. It is written in JavaScript, and is available on NPM. There’s also an online demo.

Why?

Recently, I had the unfortunate job of parsing some malformed JSON. This particular JSON had a lot of embedded HTML and newlines, and looked something like this:

{ "key": "<div class="item">an item
with newlines <span class="attrib"> and embedded
</span>html</div>",
  "key2": "<div class="item">..."
}

Since this was a very large JSON file, I did not want to correct it by hand. I noticed that some of the other JSON files I had to parse were also malformed, so I began to make a list of common JSON mistakes I’d seen.

Some of these errors made the JSON ambiguous, but generally a human being could see what the JSON was “supposed” to be. Of course, no parser will be able to correctly disambiguate ambiguous JSON all the time, but I thought it was worth a try.

The result is a hand-written LR(1) parser that can handle JSON with a mix of the above errors. The parsed result will be a JavaScript object, but it might not always be the JavaScript object one expects. You can play around with the demo that I wrote to investigate yourself.

I have spent no time optimizing the parser. In fact, it checks every rule whenever it finds a match regardless of the current terminal. So it’s probably pretty slow.

While I’m pretty sure that this parser will correctly handle all valid JSON (you can investigate my test cases in the GitHub repo), I’m not completely convinced. This parser should probably only be used in one-time-use programs that absolutely need to handle malformed JSON.

If you find a bug (or an interesting use case), let me know via the GitHub repo or Twitter.