Fast JSON Parsing with C and SIMD - 2 on tomleb's blog

This is part 2 of the series that goes over the paper Parsing Gigabytes of JSON per Second by Geoff Langdale and Daniel Lemire.

The paper presents a very fast JSON parser and the techniques used to obtain high speed on a single core. The reference implementation simdjson is open source and is being developed on Github.

  1. Introduction to SIMD instructions
  2. Stage 1 - SIMD Wrapper
  3. Stage 1 - Finding quotes and quoting regions
  4. Stage 1 - Finding structural and whitespace characters
  5. Stage 1 - Finding all atoms (structural and pseudo-structural characters)

The series will go over the most interesting part of the paper and will show the implementation of jsonic, a basic C implementation of the paper.

The series is not focused on the performance of jsonic and does not present benchmark analysis.

SIMD Wrapper

As discussed in part 1, the JSON parser is separated into two stages. The first stage finds the structural and pseudo-structural elements in the JSON document using SIMD.

Before going into the actual implementation, we will be building a small SIMD wrapper. This will help us clarify the code and allow us to easily port the code to other architectures.

However, before writing any code, let’s get familiar with Intel’s intrinsics naming convention. Again, I highly suggest opening Intel Intrinsics Guide to look at the intrinsics while reading this series.

Naming convention

The data types __m128i, __m256i, __m512i are vector types that contain integer data with a total of 128, 256 and 512 bits respectively. The postfix i means that it is integer data. The integer data can be split logically in a few ways. For example, __m256i can contain either 32 8-bit integers, 16 16-bit integers, 8 32-bit integers or 4 64-bit integers. Depending on which intrinsics you use, it will consider the 256-bit as one of the options listed previously.

The intrinsics have a prefix of either _mm, _mm256 or _mm512 for their respective data types described above. Then, it is followed by the operation and finally by either epi8, epi16, epi32, epi64 or even epi128. The p means that it is packed data, the i means that we are operating on integer data and the number after describes how many bits each integers have.

For example, the intrinsic _mm_add_epi8 operates on data types 128-bit wide where each integers are 8-bit wide. In contrast, _mm256_add_epi32 operates on data types 256-bit wide where each integers are 32-bit wide.

Note that being able to read a binary number will be useful here so let’s have a simple example.

The value right after the b is the most significant bit (MSB for short) and the value completely on the right is the least significant bit (LSB for short). In this case, both are 0s.

Coding time

Let’s start by creating the directories and a simple build configuration. For the build system, I use meson but you are free to use whatever you are most comfortable with.

mkdir include/ src/ examples/
cat <<EOF >
  default_options: [

cc = meson.get_compiler('c')

]), language: 'c')

jsonic_lib = shared_library(
  include_directories: ['include'],

jsonic_dep = declare_dependency(
  link_with: jsonic_lib,
  include_directories: ['include'],

cat <<EOF > examples/
  dependencies: [jsonic_dep],

Without going into detail, we are building a library that resides in src/ and include/ and we have an executable in examples/ that makes use of that library. Note that I needed the argument -maxv to be able to use AVX intriscs.

Let’s move on to the header file include/simd.h defining some types that will help us later on.

We will interact mostly with 16 8-bit integers and 32 8-bit integers, so we make type aliases for them. We also define type aliases for masks where each 16 and 32 elements are represented by a single bit.

#ifndef ARCH_X86_SIMD_H
#define ARCH_X86_SIMD_H

#include <immintrin.h>
#include <stdint.h>

typedef uint16_t mask_16bit;
typedef uint32_t mask_32bit;
typedef __m256i u8x32;
typedef __m128i u8x16;

Next, we declare a few functions that we will use later on.

u8x16 u8x16_set1(char c);
u8x32 u8x32_set1(char c);

u8x16 u8x16_cmpeq(u8x16 a, u8x16 b);
u8x32 u8x32_cmpeq(u8x32 a, u8x32 b);

u8x16 u8x16_load_string(char *string);
u8x32 u8x32_load_string(char *string);

mask_16bit u8x16_bitmask(u8x16 a);
mask_32bit u8x32_bitmask(u8x32 a);

void u8x16_debug_print(u8x16 vector);
void u8x32_debug_print(u8x32 vector);

void mask_16bit_debug_print(mask_16bit mask);
void mask_32bit_debug_print(mask_32bit mask);

#endif /* ARCH_X86_SIMD_H */

The function u8x16_set1 sets the value of the 16 8-bit integers to the value c.

The function u8x16_cmpeq compares all ith 8-bit integers from a with the ith 8-bit integers from b. For each integers, the result is 0xFF if equal, otherwise it is 0x00.

The function u8x16 u8x16_load_string loads the first 16 characters (bytes) of string into a u8x16.

The function u8x16_bitmask maps every 16 8-bit integers into a single bit value. The ith bit is set to 1 if the most significant bit (MSB) of the ith integer is set to 1, otherwise the bit is set to 0.

The *_debug_print functions are incredibly useful to understand what is going on with the data.

Finally, let’s define the functions in src/simd.c. The PRINTF_ macros are used for the debug print statements. They were taken from William Whyte’s answer from Stack Overflow.

#include <stdint.h>
#include <stdio.h>

#include "simd.h"

#define PRINTF_BINARY_PATTERN_INT8 "%c%c%c%c%c%c%c%c"
#define PRINTF_BYTE_TO_BINARY_INT8(i)    \
    (((i) & 0x80ll) ? '1' : '_'), \
    (((i) & 0x40ll) ? '1' : '_'), \
    (((i) & 0x20ll) ? '1' : '_'), \
    (((i) & 0x10ll) ? '1' : '_'), \
    (((i) & 0x08ll) ? '1' : '_'), \
    (((i) & 0x04ll) ? '1' : '_'), \
    (((i) & 0x02ll) ? '1' : '_'), \
    (((i) & 0x01ll) ? '1' : '_')


We introduce a struct to be able to emulate some AVX2 instructions that are not available on my machine. It will allow us to perform operations on 256-bit data by splitting them into 128-bit, performing the operation, and then merging them again into 256-bit1.

struct hi_and_lo {
	u8x16 hi;
	u8x16 lo;

static struct hi_and_lo u8x32_extract(u8x32 a) {
	struct hi_and_lo hal;
	hal.hi = _mm256_extractf128_si256(a, 1); // a[255:128]
	hal.lo = _mm256_extractf128_si256(a, 0); // a[127:0]
	return hal;

static u8x32 hi_and_lo_pack(struct hi_and_lo hal) {
	return _mm256_set_m128i(hal.hi, hal.lo);

Finally, we define the functions we declared earlier. Note the use of the struct hi_and_lo for u8x32_cmpeq and u8x32_bitmask. If we had access to AVX2, we would be able to use the intrinsics directly.

u8x32 u8x32_set1(char c) {
	return _mm256_set1_epi8(c);

u8x16 u8x16_cmpeq(u8x16 a, u8x16 b) {
	return _mm_cmpeq_epi8(a, b);

u8x32 u8x32_cmpeq(u8x32 a, u8x32 b) {
	struct hi_and_lo hal_a = u8x32_extract(a);
	struct hi_and_lo hal_b = u8x32_extract(b);
	struct hi_and_lo hal_result;
	hal_result.hi = u8x16_cmpeq(hal_a.hi, hal_b.hi);
	hal_result.lo = u8x16_cmpeq(hal_a.lo, hal_b.lo);
	return hi_and_lo_pack(hal_result);

u8x32 u8x32_load_string(char *string) {
	return _mm256_load_si256((__m256i*)string);

mask_16bit u8x16_bitmask(u8x16 a) {
	return _mm_movemask_epi8(a);

mask_32bit u8x32_bitmask(u8x32 a) {
	struct hi_and_lo hal = u8x32_extract(a);
	mask_16bit hi = u8x16_bitmask(hal.hi);
	mask_16bit lo = u8x16_bitmask(hal.lo);
	uint32_t bitmask = (hi<<16)|lo;
	return bitmask;

The implementation of these functions are quite straightforward and map one to one directly with the intrinsics. One interesting thing to note, and unintuitive for me at least, is that _mm256_load_si256 (and by extension, u8x32_load_string) loads the bytes in a reverse order. The first byte will be in the least significant octet of the 256-bit, the second in the second, etc. This is confusing when printing bitmasks.

Below is a table that demonstrates the use of most functions.

For the last functions in our SIMD wrapper, we simply use the macros defined above. Note that most intrinsics operations require specific memory alignment. Intrinsics with 128-bit operands requires 16 byte aligned data and intrinsics with 256-bit operands requires 32 byte aligned data. We use the alignas specifier to force necessary alignment.

void u8x16_debug_print(u8x16 vector) {
	alignas(16) uint8_t integers[16];
	_mm_store_si128((__m128i*)integers, vector);
	for (size_t i = 0; i < 15; i++) {

void u8x32_debug_print(u8x32 vector) {
	alignas(32) uint8_t integers[32];
	_mm256_store_si256((__m256i*)integers, vector);
	for (size_t i = 0; i < 31; i++) {

void mask_16bit_debug_print(mask_16bit mask) {

void mask_32bit_debug_print(mask_32bit mask) {

For the sake of completeness, let’s implement a simple program in examples/main.c that make use of this wrapper.

#include <stdalign.h>
#include <stdio.h>

#include "simd.h"

int main(int argc, char *argv[]) {
        alignas(32) char _string[32] = "Hello, world!";
        u8x32 string = u8x32_load_string(_string);
        u8x32 o_cmpeq = u8x32_cmpeq(string, u8x32_set1('o'));
        mask_32bit o_mask = u8x32_bitmask(o_cmpeq);

You can now compile and run with the following commands:

# compile
meson build
ninja -C build
# run

Phew, we are done with the boilerplate for now. In the next part, we will use this simple SIMD wrapper to quickly find the quotes and quoted region of a JSON document without using branches, processing 32 bytes at a time.

The example below illustrate the problem and the end result. Note that we take into account the escaped quotes.

string:        { "key": "\"value\"" }
quotes:        0010001001000000000100
quoted region: 0011110001111111111000

  1. Honestly not sure why some 256-bit register operations were defined in AVX and some other in AVX2.. ↩︎

Contribute to the discussion in my public inbox by sending an email to ~tomleb/ [mailing list etiquette]