How-To: Decode Byte-Code to UTF-8

What Does This Article Cover?

Decoding text byte code
Example
Considerations
Other related material

Decoding text byte code

The Intelligence Hub expects UTF-8 encoding when reading File data. If the file is not encoded in UTF-8, an alternative is to read in the raw bytes and implement a custom decoder. The following example reads a file in UTF-16 Big-Endian format, merges each pair of bytes into a 16-bit code unit, and then decodes the code units with JavaScript's built-in String.fromCharCode() method.
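To make the byte-merging step concrete, here is a minimal sketch on two hand-picked characters. The helper name combineBigEndian is hypothetical and not part of the Intelligence Hub; the masking with 0xFF is an added assumption so that signed byte values are also handled.

```javascript
// Hypothetical helper: merge two bytes (high-order byte first) into one UTF-16 code unit
function combineBigEndian(highByte, lowByte) {
    // Mask to 0-255 so that signed byte values (-128 to 127) are handled too
    return ((highByte & 0xFF) << 8) | (lowByte & 0xFF);
}

// "A" is U+0041, stored as the bytes 0x00 0x41 in UTF-16 Big-Endian
const codeUnitA = combineBigEndian(0x00, 0x41);
// The euro sign is U+20AC, stored as the bytes 0x20 0xAC
const codeUnitEuro = combineBigEndian(0x20, 0xAC);

// Build a string from the two code units
const text = String.fromCharCode(codeUnitA, codeUnitEuro);
```

In Big-Endian order the high-order byte comes first, which is why it is the one shifted left by 8 bits.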

Example

The following decoder can be pasted into a Custom Condition where the Source reads in UTF-16 Big-Endian file data.

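function decodeUnicodeBigEndianToString(unicodeBigEndianBytes, startIndex = 2) {
    // startIndex defaults to 2, which skips the first two bytes (typically the
    // byte order mark 0xFE 0xFF); pass 0 if the file has no byte order mark
    const utf16CodeUnits = [];

    // Iterate over the byte array in pairs, ignoring a trailing unpaired byte
    for (let i = startIndex; i + 1 < unicodeBigEndianBytes.length; i += 2) {
        // Extract the current code unit's bytes (high-order byte first)
        const byte1 = unicodeBigEndianBytes[i];
        const byte2 = unicodeBigEndianBytes[i + 1];

        // Combine the bytes to form a 16-bit unsigned integer (UTF-16 code unit)
        const utf16CodeUnit = (byte1 << 8) | byte2;

        // Add the UTF-16 code unit to the array
        utf16CodeUnits.push(utf16CodeUnit);
    }

    // Create the string in chunks: passing a very large array to
    // String.fromCharCode.apply() can exceed the engine's argument limit
    const chunkSize = 8192;
    let utf16String = "";
    for (let j = 0; j < utf16CodeUnits.length; j += chunkSize) {
        utf16String += String.fromCharCode.apply(null, utf16CodeUnits.slice(j, j + chunkSize));
    }

    return utf16String;
}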

// Load the array of raw file bytes; the .numberArray property is unique to File inputs
var ig = this.currentValue.numberArray;

// Convert to unsigned 8-bit integers (0-255)
var unicodeBigEndianBytes = new Uint8Array(ig);

// Decode the UTF-16 big-endian bytes to a string
var decodedString = decodeUnicodeBigEndianToString(unicodeBigEndianBytes);

// Return decodedString
decodedString;
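Because the decoder skips the first two bytes by default, it can be worth confirming that the data really starts with a UTF-16 Big-Endian byte order mark before decoding. The following is a minimal sketch of such a check; detectUtf16ByteOrderMark is a hypothetical helper, not an Intelligence Hub function, and the sample bytes are made up for illustration.

```javascript
// Hypothetical helper: report which UTF-16 byte order mark, if any, starts the data
function detectUtf16ByteOrderMark(bytes) {
    if (bytes.length < 2) {
        return null;
    }
    // Mask to 0-255 so that signed byte values are handled too
    const first = bytes[0] & 0xFF;
    const second = bytes[1] & 0xFF;
    if (first === 0xFE && second === 0xFF) return "UTF-16BE";
    if (first === 0xFF && second === 0xFE) return "UTF-16LE";
    return null;
}

// Made-up sample: byte order mark followed by the letters "H" and "i"
const sampleBytes = new Uint8Array([0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69]);
const detectedBom = detectUtf16ByteOrderMark(sampleBytes);

// Skip the two byte order mark bytes only when one is actually present
const startIndexToUse = detectedBom === null ? 0 : 2;
```

The resulting startIndexToUse could then be passed as the second argument to decodeUnicodeBigEndianToString(). If the check reports "UTF-16LE", the byte order in the decoder would need to be swapped.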

Considerations

Note whether "Include Metadata" is toggled on, as this alters how you reference the file data. In this example "Include Metadata" is off, so the file content is referenced as this.currentValue.numberArray in the Condition Expression. If "Include Metadata" were toggled on, the reference would instead be this.currentValue.value.numberArray.
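If the same expression has to work with either setting, one option is to check both locations. The sketch below assumes the two object shapes described above; getNumberArray is a hypothetical helper, and the mock objects only stand in for this.currentValue.

```javascript
// Hypothetical helper: return the file bytes whether or not "Include Metadata" is on
function getNumberArray(currentValue) {
    // "Include Metadata" off: the bytes sit directly on the current value
    if (currentValue.numberArray != null) {
        return currentValue.numberArray;
    }
    // "Include Metadata" on: the bytes are nested under .value
    if (currentValue.value != null && currentValue.value.numberArray != null) {
        return currentValue.value.numberArray;
    }
    throw new Error("No numberArray found on the current value");
}

// Mock objects standing in for this.currentValue in the two configurations
const withoutMetadata = { numberArray: [-2, -1, 0, 72] };
const withMetadata = { value: { numberArray: [-2, -1, 0, 72] } };

const bytesWithoutMetadata = getNumberArray(withoutMetadata);
const bytesWithMetadata = getNumberArray(withMetadata);
```

In a Condition, the call would be getNumberArray(this.currentValue) in place of the direct property reference.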

The example in this article only covers the BMP (Basic Multilingual Plane), i.e. the characters that fit in a single 2-byte code unit (code points 0-65535). If you expect characters outside the BMP, which UTF-16 encodes as surrogate pairs (two code units, 4 bytes), then decodeUnicodeBigEndianToString() may need to be amended. Feel free to reach out to us if you need to handle other text encodings or have any questions.
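One possible amendment for surrogate pairs is sketched below. It is an illustration under assumptions rather than a tested Intelligence Hub recipe: decodeUtf16BeWithSurrogates is a hypothetical name, the sketch assumes the script engine supports String.fromCodePoint(), and replacing unpaired surrogates with U+FFFD is a design choice made here.

```javascript
// Hypothetical variant: decode UTF-16 Big-Endian bytes, combining surrogate pairs
function decodeUtf16BeWithSurrogates(bytes, startIndex = 2) {
    let result = "";
    for (let i = startIndex; i + 1 < bytes.length; i += 2) {
        // Merge two bytes into one code unit (masking handles signed bytes)
        const unit = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
        const isHighSurrogate = unit >= 0xD800 && unit <= 0xDBFF;

        if (isHighSurrogate && i + 3 < bytes.length) {
            const next = ((bytes[i + 2] & 0xFF) << 8) | (bytes[i + 3] & 0xFF);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Combine the high and low surrogates into a single code point
                const codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                result += String.fromCodePoint(codePoint);
                i += 2; // skip the low surrogate's two bytes
                continue;
            }
        }

        // BMP character; an unpaired surrogate becomes U+FFFD (replacement character)
        const isSurrogate = unit >= 0xD800 && unit <= 0xDFFF;
        result += String.fromCharCode(isSurrogate ? 0xFFFD : unit);
    }
    return result;
}

// Byte order mark, "H" (U+0048), then U+1F600 stored as the pair 0xD83D 0xDE00
const pairBytes = new Uint8Array([0xFE, 0xFF, 0x00, 0x48, 0xD8, 0x3D, 0xDE, 0x00]);
const decodedWithPair = decodeUtf16BeWithSurrogates(pairBytes);
```

Here decodedWithPair holds "H" followed by the single character U+1F600.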
