Fast JSON Parsing with C and SIMD - 1 January 1, 2020 on tomleb's blog

This is a series that goes over the paper Parsing Gigabytes of JSON per Second by Geoff Langdale and Daniel Lemire.

The paper presents a very fast JSON parser and the techniques used to obtain high speed on a single core. The reference implementation simdjson is open source and is being developed on Github.

Introduction to SIMD instructions
Stage 1 - SIMD Wrapper
Stage 1 - Finding quotes and quoting regions
Stage 1 - Finding structural and whitespace characters
Stage 1 - Finding all atoms (structural and pseudo-structural characters)

The series will go over the most interesting part of the paper and will show the implementation of jsonic, a basic C implementation of the paper.

The series is not focused on the performance of jsonic and does not present benchmark analysis.

Introduction to SIMD instructions

This project was my first time working with SIMD code, and I’m definitely not an expert in optimization.

I have found Intel Intrinsics Guide extremely useful during the implementation and exploration phase of the code. I suggest keeping a tab opened for the rest of the series. It provides a list of the intrinsics with a comprehensive description and operation code. It also specifies which technology is required to use the instructions (eg: SSE3, AVX, AVX2, etc). If you’re not familiar with them, I will be explaining them below.

SIMD and intrinsics

What are those anyway ?

SIMD

SIMD stands for Single instruction, multiple data, which you can intuitively guess by its name what it does. It allows us to do a single operation on multiple operands.

As a simple example, say you have two arrays of 8-bit integers with 16 elements each, uint8_t a[16] and uint8_t b[16]. Now, you want to add the ith element of a with the ith element of b and put the result in a destination array c. In a non-SIMD world, we could do a simple for loop over the 16 elements, and during each iteration add a[i] to b[i] and put the result in c[i] as shown below. This would take 16 instructions for the addition.¹

int main(int argc, char *argv[]) {
	uint8_t a[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
	uint8_t b[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
	uint8_t c[16] = {0};
	for (int i = 0; i < 16; i++) {
		c[i] = a[i] + b[i]; // Each addition is an instruction
	}
}

However, in SIMD land, we can do this in a single instruction. For example, Intel’s SSE2 paddb instruction adds 16 8-bit integers to 16 other 8-bit integers in a single instruction.

Note: SIMD instructions depends on the CPU architecture and microarchitecture. This means that paddb is available only on x86 architecture and on CPU that supports the SSE2 technology. On ARM, a different instruction would be needed, as paddb is x86 only.

Intrinsics

When coding in high-level languages, there is no need to write the x86 instructions to, for example, add two integers. Instead, the compiler provides us a syntax² to do this.

Intrinsics follow the same concept: instead of writing instructions directly, we can use functions that do make use of those instructions.

Most of Intel’s intrinsics start with _mm, _mm256 and _mm512.

Below is the same example as above, but using SIMD.

#include <stdint.h> /* for uint32_t */
#include <immintrin.h> /* necessary to use the intrinsics */
int main(int argc, char *argv[]) {
	uint8_t a[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
	uint8_t b[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
	uint8_t c[16] = {0};

	__m128i vector_a = _mm_load_si128((const __m128i *)a);
	__m128i vector_b = _mm_load_si128((const __m128i *)b);
	__m128i vector_c = _mm_add_epi8(vector_a, vector_b);
	_mm_store_si128((__m128i *)c, vector_c);
}

Above, we used the _mm_load_si128 and _mm_add_epi8 intrinsics to execute the movdqa and paddb instructions respectively.

Technology support

Over the years, CPU makers added new technologies to CPU microarchitecture. For example, Intel MMX, SSE, SSE2, SSE3, AVX and AVX2, in chronological order. This means that older CPUs might only support a few of the first, while newest CPUs supports all of them.

As an example, my daily workstation is a Thinkpad T420. This laptop is getting old, as it was first released around 2008. However, it supports Intel technology up to and including AVX.

You can find out what your machine supports by running the following command:

grep flags /proc/cpuinfo

This series will use only AVX, as I don’t have a more modern computer currently available.

Fast JSON Parsing

The parsing algorithm is separated into two stages: finding the atoms and building the json data structure.

The first stage of the paper is about finding the atoms in the JSON document. The atoms are all the structural elements (curly and square brackets, quotes, commas and quotes) and the pseudo-structural elements (numbers, boolean values, null values). We will see all the steps required to do this using a SIMD approach. We will skip Unicode validation however.

The second stage of the paper is building the json data structure. This stage uses the result from the first stage to build a json structure.

Before beginning to implement the first stage, we will build a simple SIMD wrapper over the various intrinsics that we are going to use. This will also act as an introduction to SIMD programming.

You can read the next post in the series by following this link: Part 2

Let’s keep this simple. ↩︎
Functions, really. ↩︎

Contribute to the discussion in my public inbox by sending an email to ~tomleb/public-inbox@lists.sr.ht [mailing list etiquette]