Attio's CSV importer writes millions of records into customers' workspaces every month. Not only is there a large volume of data, but it's also often varied and unpredictable. Users of the importer are generally implementing custom processes that write data from outside our usual sources. When working with such processes, you'll eventually run into some edge cases.
The bug: gRPC errors
We spotted one such edge case when Attio's error monitoring service reported an error coming from a gRPC call to Google’s Spanner, our primary database.
The first step for any bug is to try and replicate it. After some quick log queries and a conversation with the customer in question, we had a small CSV file we could use to reproduce the bug.
I fired up my debugger and began stepping through the code. Although it didn't throw directly when called, a truncation function named truncateSortableValue caught my eye.
Looking at the output of truncateSortableValue for my CSV file, I found an odd value, 🇬\uD83C, in a row that contained a flag emoji. When .slice() landed in the middle of a flag emoji, it could return only part of the character.
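To make the failure concrete, here is a minimal sketch of a length-limited truncation built on .slice(). The helper name truncateNaively and the limit of 3 are assumptions for illustration; this is not the actual truncateSortableValue implementation.

```javascript
// Hypothetical stand-in for a length-limiting helper; not Attio's actual code.
function truncateNaively(value, maxCodeUnits) {
  // .slice() counts UTF-16 code units, so it can cut a character in half.
  return value.slice(0, maxCodeUnits);
}

// The GB flag, written as its two regional indicator code points.
const flag = "\u{1F1EC}\u{1F1E7}";
const truncated = truncateNaively(flag, 3);

// The first regional indicator survives, followed by half of the second one.
console.log(truncated === "\u{1F1EC}\uD83C"); // true
console.log(truncated.length); // 3
```

The trailing \uD83C is a code unit that only means something when it is followed by its partner, which the truncation removed.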
Indeed, if I re-ran the import without the flag rows, I no longer saw the error. But what's really going on here? What is 🇬\uD83C?
Understanding JavaScript strings and UTF-16
On a day-to-day basis, you only need to interact with JavaScript strings at a surface level. Dig a little deeper, however, and you'll find there's lots more going on.
The first thing to understand is that JavaScript strings use Unicode. Unicode allows you to reference hundreds of thousands of characters and emojis using a unique number known as a code point. For example, 'a' has the code point U+0061 whereas '🙂' is U+1F642.
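As a quick illustration, codePointAt and String.fromCodePoint convert between characters and their code points. The toCodePointLabel helper is a hypothetical name used only for this sketch.

```javascript
// Format the first code point of a string in the usual U+XXXX notation.
function toCodePointLabel(char) {
  const hex = char.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0");
}

const smiley = "\u{1F642}"; // the slightly smiling face emoji

console.log(toCodePointLabel("a")); // U+0061
console.log(toCodePointLabel(smiley)); // U+1F642

// Going the other way: build a string from a code point.
console.log(String.fromCodePoint(0x1f642) === smiley); // true
```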
More specifically than just using Unicode, JavaScript strings use the UTF-16 encoding. This specific encoding (UTF-8 and UTF-32 are also available) determines how the numerical code points are encoded and decoded as bytes. The unit of storage for a particular encoding is known as a code unit. The 16 in UTF-16 refers to the fact that the code units are 16-bit values, and one or two code units are used for each code point. UTF-16 is designed so that more common characters use less space, i.e. they can be encoded in a single 16-bit code unit. Other characters use two units, known as a surrogate pair.
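The difference between one and two code units is easy to inspect with .length and charCodeAt, both of which work on code units. The codeUnits helper below is a hypothetical name for this sketch.

```javascript
// List the UTF-16 code units of a string as four-digit hex values.
function codeUnits(value) {
  const units = [];
  for (let i = 0; i < value.length; i++) {
    units.push(value.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

console.log(codeUnits("a")); // [ '0061' ]

// U+1F642 does not fit in 16 bits, so it is stored as a surrogate pair.
console.log(codeUnits("\u{1F642}")); // [ 'D83D', 'DE42' ]
console.log("\u{1F642}".length); // 2
```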
The last point to be aware of is that multiple Unicode characters (with multiple code points) can be joined together to make a single readable glyph known as a grapheme cluster. Emojis are a common example of grapheme clusters, and many emoji variants are formed by combining multiple sub-emojis together, usually with the zero width joiner (U+200D) character. You also see grapheme clusters used in accented characters (e.g. e + ´ = é).
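The three levels (code units, code points, grapheme clusters) can be compared side by side. This sketch assumes a runtime that ships Intl.Segmenter, such as a recent Node.js version; the describe helper is a hypothetical name.

```javascript
// Count a string at each of the three levels discussed above.
function describe(value) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: value.length,
    codePoints: [...value].length,
    graphemes: [...segmenter.segment(value)].length,
  };
}

// "e" followed by the combining acute accent (U+0301).
const accented = "e\u0301";
// Man + zero width joiner + woman + zero width joiner + girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

console.log(describe(accented)); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(describe(family)); // { codeUnits: 8, codePoints: 5, graphemes: 1 }
```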
When rendering human readable strings (e.g. for UI truncation), you probably want to think in terms of grapheme clusters. For example, if you separated '🇬🇧' in half you could end up rendering '🇬' (regional indicator symbol G) to a user.
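For the UI case, a grapheme-aware truncation could look something like the sketch below. It again assumes Intl.Segmenter is available, and truncateGraphemes is a hypothetical helper rather than anything from Attio's codebase.

```javascript
// Keep at most maxGraphemes user-perceived characters.
function truncateGraphemes(value, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(value)) {
    if (count >= maxGraphemes) break;
    result += segment;
    count++;
  }
  return result;
}

const flag = "\u{1F1EC}\u{1F1E7}"; // the GB flag
const label = flag + " London";

// The flag is kept whole instead of being split into a lone regional indicator.
console.log(truncateGraphemes(label, 1) === flag); // true
console.log(truncateGraphemes(label, 3) === flag + " L"); // true
```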
For our database case, we’re more concerned with not chopping a string in the middle of a code point before transmitting the data. For our gRPC call, we're using protobufs, which encode strings using UTF-8. If we try to encode half a surrogate pair (a lone UTF-16 code unit) into a protobuf message, we could hit difficulties on the other end when we try to decode.
Unfortunately, the string .slice() method we've been using thus far works in terms of code units, not code points. As a result, it's not entirely clear how something like '🇬🇧'.slice(0, 1) will be handled.
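One way to check whether a slice has left half a surrogate pair behind is to look for lone surrogates. The regular expression and the hasLoneSurrogate helper below are an illustrative sketch; newer runtimes also provide String.prototype.isWellFormed for the same purpose.

```javascript
// Matches a high surrogate with no low surrogate after it,
// or a low surrogate with no high surrogate before it.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(value) {
  return LONE_SURROGATE.test(value);
}

const flag = "\u{1F1EC}\u{1F1E7}"; // the GB flag

console.log(flag.slice(0, 1) === "\uD83C"); // true: just the first half of a pair
console.log(hasLoneSurrogate(flag)); // false
console.log(hasLoneSurrogate(flag.slice(0, 1))); // true
console.log(hasLoneSurrogate(flag.slice(0, 2))); // false: a whole code point
```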
Our Spanner database runs fully on Google servers and is closed source, meaning we can't use a step-through debugger to find the source of this error. However, we do know requests are encoded using the protobufjs library.
If we build up a toy example, we'll see that protobufjs and Node's native Buffer.from make attempts at encoding the split string. However, they both fail to decode the value back into a meaningful string on the other end.
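A toy round trip along these lines can be sketched with Node's Buffer alone. This covers only the Buffer.from half of the comparison; protobufjs is left out to keep the example free of dependencies.

```javascript
const flag = "\u{1F1EC}\u{1F1E7}"; // the GB flag
const split = flag.slice(0, 3); // first regional indicator plus a lone high surrogate

// Encoding succeeds, but the lone surrogate has no valid UTF-8 form,
// so Node substitutes the replacement character U+FFFD (bytes ef bf bd).
const bytes = Buffer.from(split, "utf8");
console.log(bytes.toString("hex")); // f09f87acefbfbd

// Decoding therefore does not give back the string we started with.
const decoded = bytes.toString("utf8");
console.log(decoded === split); // false
console.log(decoded.endsWith("\uFFFD")); // true
```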
While the Spanner decoder on the other end won't be using the JavaScript protobufjs library, this failure to decode is almost certainly the thing causing our error.
The fix
So, how do we fix this?
Well, it turns out that while string indexes, .split() and .slice() use code units, [Symbol.iterator]() uses code points.
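The difference shows up as soon as you compare the two approaches on a flag emoji. Spread syntax and Array.from both go through the string iterator.

```javascript
const flag = "\u{1F1EC}\u{1F1E7}"; // the GB flag

// Index access, .split("") and .length all see four code units.
console.log(flag.length); // 4
console.log(flag.split("").length); // 4
console.log(flag[0] === "\uD83C"); // true: half of a surrogate pair

// The iterator yields two whole code points instead.
console.log([...flag].length); // 2
console.log(Array.from(flag)[0] === "\u{1F1EC}"); // true: regional indicator G
```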
With this knowledge in hand, adding a new function to safely get the first n characters of a string is trivial.
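A function along those lines might look like the following sketch. The name takeCodePoints is an assumption for illustration and is not the function from the actual pull request.

```javascript
// Return the first maxCodePoints code points of a string,
// never splitting a surrogate pair.
function takeCodePoints(value, maxCodePoints) {
  let result = "";
  let count = 0;
  // for...of uses the string iterator, which yields whole code points.
  for (const char of value) {
    if (count >= maxCodePoints) break;
    result += char;
    count++;
  }
  return result;
}

const flag = "\u{1F1EC}\u{1F1E7}"; // the GB flag

console.log(takeCodePoints(flag, 1) === "\u{1F1EC}"); // true: regional indicator G
console.log(takeCodePoints("abc", 2)); // ab
```

Note that this can still cut a grapheme cluster such as a flag in half. That is acceptable here because the goal is a string that encodes cleanly, not one that renders perfectly.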
One small pull request later and our importer is error-free.
Interested in redefining one of the world’s most important software categories? Check out our careers page.