NaN Boxing in C#

This article shows how to use NaN Boxing to improve the performance of a JSON parser for C# named DynaJson.

About DynaJson

DynaJson is a fast JSON parser for C# compatible with DynamicJson. I feel DynamicJson is very useful but lack sufficient performance, so I developed a faster replacement.

DynaJson acts the same as DynamicJson, as shown below. It returns a dynamic type value representing a JSON object resulting from parsing a JSON string. We can access each member of the JSON object with the member access operator like a C# object. This manner is useful to process a large JSON with duck typing.

A benchmark result of DynaJson is shown below. The benchmark measured the time to process a large JSON file named citm_catalog.json1, whose size is about 1.7MB. The benchmark compared DynaJson with Utf8Json, Jil, Newtonsoft.Json, DynamicJson. The measurement ran on .NET Core 3.1 on Ubuntu 18.04 on Azure D2Ds_v4.

The result shows DynaJson is quite faster than others. DynaJson is optimized for this use case, so the result is natural.

An optimizati0n by NaN Boxing

JSON parsers generate a large number of objects representing JSON values on parsing a large JSON file. So DynaJson represents a JSON value as a 16 bytes structure type. Copying a 16 bytes structure is much faster than larger ones because it needs only one load and store instruction for 128bit SIMD operations.

The definition of the structure is shown below. It folds members into 16 bytes with an Explicit layout. The parser uses each of String, Array, and Dictionary alternately based on the value of Type, so the members can be laid out at the same offset. How about Number and Type? Type of an enum type overlaps the upper 4 bytes of Number. It can seemingly break Number.

NaN Boxing can realize the overlap safely. NaN, standing for Not a Number, represents zero divided by zero and the square root of a negative number in floating-point arithmetic See the Wikipedia entry for more details. One of the binary representations of NaN defined in IEEE 754 is 0xfff8000000000000. The lower 51 bits can have an arbitrary number, so NaN Boxing boxes data in the bits.

The definition of an enum type JsonType is shown below. JsonType has not only types of JSON values but also the values themselves for null, true, and false. The definition assigns a legitimate value as the upper 4 bytes of NaN to each member. The values never collide with the upper part of ordinary doubles. Any conjunctions with any lower 4 bytes become a legitimate double.

A usage of JsonType is shown below. ToValue is a method to extract a value JsonValue represents. If the value of obj.Type is not anything in the defined members, it must be an upper 4 bytes of an ordinary double, so ToValue returns obj.Number. This manner realizes the clear separation of Number and Type.

The following benchmark result shows the effect of NaN Boxing. The benchmark measured parsing time in each definition of InternalObject as a class, 20 bytes (8 + 8 + 4) structure without NaN Boxing, and 16 bytes one with NaN Boxing. The 20 bytes structure made the parser 20% faster than the class, and NaN Boxing improved 6.8% more.

Other usages of NaN Boxing

JavaScript engines such as Mozilla's SpyderMonkey and WebKit's JavaScriptCore use NaN Boxing, but Google's V8 doesn't use it. LuaJIT and mruby also use NaN Boxing. These implementations overlap a pointer with a double and compact the size of all values into 8 bytes. C# doesn't allow a managed pointer to overlap a double, so it is impossible to compact InternalObject into 8 bytes.

  1. This is one of the files used in Native JSON Benchmark. Many benchmarks for JSON parsers use this file.

Leave a Reply

Your email address will not be published.