How-To: Decode Byte-Code to UTF-8
What Does This Article Cover?
- Decoding raw text byte code
- Example
- Considerations
- Other related material
Decoding raw text byte code
The Intelligence Hub expects UTF-8 encoding when reading in File data. If the file data is not encoded in UTF-8, an alternative is to read in the raw bytes and implement a custom decoder. The following example reads in a file in UTF-16 Big-Endian format, merges the bytes into 2-byte elements (UTF-16 code units), and then decodes them with JavaScript's built-in String.fromCharCode() method.
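As a quick illustration of the merge step, the sketch below combines one Big-Endian byte pair into a 16-bit code unit and converts it to a character. The helper name mergeBigEndianPair and the sample bytes are illustrative only and are not part of the Intelligence Hub.

```javascript
// Hypothetical helper (not an Intelligence Hub API): merge a Big-Endian byte pair
// into a single 16-bit UTF-16 code unit.
function mergeBigEndianPair(highByte, lowByte) {
  // Mask to 0-255 so signed byte values (e.g. -84) are handled as well
  return ((highByte & 0xff) << 8) | (lowByte & 0xff);
}

// The euro sign (U+20AC) is stored as the bytes 0x20 0xAC in UTF-16 Big-Endian
const codeUnit = mergeBigEndianPair(0x20, 0xac);
const character = String.fromCharCode(codeUnit);

console.log(codeUnit.toString(16)); // "20ac"
console.log(character);             // "€"
```

The high byte is shifted left by 8 bits and the low byte fills the lower 8 bits, which is the same arithmetic the full decoder applies to every pair.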
Example
The following decoder can be pasted into a Custom Condition where the Source reads in UTF-16 Big-Endian file data.
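function decodeUnicodeBigEndianToString(unicodeBigEndianBytes, startIndex = 2) {
  // startIndex defaults to 2, which skips a leading 2-byte byte order mark (0xFE 0xFF);
  // pass 0 if the file has none
  const utf16CodeUnits = [];
  // Iterate over the byte array in pairs; a trailing unpaired byte is ignored
  for (let i = startIndex; i + 1 < unicodeBigEndianBytes.length; i += 2) {
    // Extract the current code unit's bytes (high byte first in Big-Endian)
    const byte1 = unicodeBigEndianBytes[i];
    const byte2 = unicodeBigEndianBytes[i + 1];
    // Combine the bytes to form a 16-bit unsigned integer (UTF-16 code unit)
    const utf16CodeUnit = (byte1 << 8) | byte2;
    // Add the UTF-16 code unit to the array
    utf16CodeUnits.push(utf16CodeUnit);
  }
  // Create a string from the UTF-16 code units in chunks, because passing a very
  // large array to apply() can exceed the script engine's argument limit
  const chunkSize = 8192;
  let utf16String = "";
  for (let j = 0; j < utf16CodeUnits.length; j += chunkSize) {
    utf16String += String.fromCharCode.apply(null, utf16CodeUnits.slice(j, j + chunkSize));
  }
  return utf16String;
}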
// Load the raw byte array; the .numberArray property is unique to File inputs
var rawBytes = this.currentValue.numberArray;
// Convert to unsigned 8-bit values (0-255)
var unicodeBigEndianBytes = new Uint8Array(rawBytes);
// Decode the UTF-16 Big-Endian byte array to a string
var decodedString = decodeUnicodeBigEndianToString(unicodeBigEndianBytes);
// Return the decoded string as the result of the expression
decodedString;
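To try the decoder outside the Intelligence Hub, the sketch below feeds it a hand-built byte array. It assumes the bytes may arrive as signed values (-128 to 127), which is why the conversion to Uint8Array matters; whether numberArray actually holds signed values is an assumption, not something confirmed here. The sample data and the simplified decodeUtf16Be function are illustrative stand-ins for this.currentValue.numberArray and the decoder above.

```javascript
// Simplified stand-in for the decoder above (illustrative only)
function decodeUtf16Be(bytes, startIndex = 2) {
  const codeUnits = [];
  for (let i = startIndex; i + 1 < bytes.length; i += 2) {
    codeUnits.push((bytes[i] << 8) | bytes[i + 1]);
  }
  return String.fromCharCode.apply(null, codeUnits);
}

// Sample data (assumption): the byte order mark 0xFE 0xFF written as signed bytes (-2, -1),
// followed by "Hi" and the euro sign (0x20 0xAC, with 0xAC as the signed value -84)
const sampleNumberArray = [-2, -1, 0, 72, 0, 105, 32, -84];

// Uint8Array wraps each value into the range 0-255, so -84 becomes 172 (0xAC)
const sampleBytes = new Uint8Array(sampleNumberArray);

const decoded = decodeUtf16Be(sampleBytes);
console.log(decoded); // "Hi€"
```

Without the Uint8Array conversion, a negative low byte such as -84 would set every upper bit in the bitwise OR and produce the wrong code unit.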
Considerations
Note whether "Include Metadata" is toggled on, as this alters how you reference the file data. In this example "Include Metadata" is off, so the file content is referenced as this.currentValue.numberArray in the Condition Expression. If "Include Metadata" were toggled on, the reference would instead be this.currentValue.value.numberArray.
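If the same expression might run with either setting, one option is to check both locations. This is a minimal sketch: getFileBytes is a hypothetical helper, not an Intelligence Hub function, and it relies only on the two property paths described above.

```javascript
// Hypothetical helper: return the raw byte array regardless of the "Include Metadata" setting
function getFileBytes(currentValue) {
  // "Include Metadata" off: the bytes sit directly on the current value
  if (currentValue && currentValue.numberArray != null) {
    return currentValue.numberArray;
  }
  // "Include Metadata" on: the bytes are nested under .value
  if (currentValue && currentValue.value && currentValue.value.numberArray != null) {
    return currentValue.value.numberArray;
  }
  throw new Error("No numberArray found on the current value");
}

// Stand-in objects for this.currentValue (illustrative shapes only)
const withoutMetadata = { numberArray: [-2, -1, 0, 72] };
const withMetadata = { value: { numberArray: [-2, -1, 0, 72] } };

console.log(getFileBytes(withoutMetadata).length); // 4
console.log(getFileBytes(withMetadata).length);    // 4
```

In a Custom Condition, the helper would be called as getFileBytes(this.currentValue) before the conversion to Uint8Array.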
The example in this article treats each 2-byte pair as one UTF-16 code unit (0-65535), which directly covers the BMP (Basic Multilingual Plane). Characters outside the BMP are encoded as surrogate pairs (two code units, 4 bytes in total). If you expect such characters, verify the output and amend decodeUnicodeBigEndianToString() as needed. Feel free to reach out to us if you need to handle other text encodings or have any questions.
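One possible amendment for text outside the BMP is sketched below. It pairs each high surrogate (0xD800-0xDBFF) with the following low surrogate (0xDC00-0xDFFF), computes the code point, and builds the string with String.fromCodePoint(). The function name decodeUtf16BeWithSurrogates and the choice to replace unpaired surrogates with U+FFFD are assumptions made for illustration, and String.fromCodePoint() requires an ES2015-capable script engine.

```javascript
// Illustrative variant of the decoder with explicit surrogate-pair handling
function decodeUtf16BeWithSurrogates(bytes, startIndex = 2) {
  let result = "";
  for (let i = startIndex; i + 1 < bytes.length; i += 2) {
    // Merge the current byte pair into a 16-bit code unit
    const unit = ((bytes[i] & 0xff) << 8) | (bytes[i + 1] & 0xff);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 3 < bytes.length) {
      const next = ((bytes[i + 2] & 0xff) << 8) | (bytes[i + 3] & 0xff);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Combine the high and low surrogate into one code point above U+FFFF
        const codePoint = 0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00);
        result += String.fromCodePoint(codePoint);
        i += 2; // the low surrogate has been consumed
        continue;
      }
    }
    // Assumption: an unpaired surrogate becomes U+FFFD; anything else is a BMP character
    result += (isHigh || isLow) ? "\ufffd" : String.fromCharCode(unit);
  }
  return result;
}

// Byte order mark, "A", then U+1F600 encoded as the surrogate pair D83D DE00
const sample = [0xfe, 0xff, 0x00, 0x41, 0xd8, 0x3d, 0xde, 0x00];
const text = decodeUtf16BeWithSurrogates(sample);
console.log(text.codePointAt(1).toString(16)); // "1f600"
```

Handling the pair explicitly also makes it possible to detect malformed input, such as a high surrogate at the end of the file.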