Unicode – How to load a UTF-16 encoded text file in Julia?

I have a text file that I am very sure is encoded in UTF-16, but I don't know how to load it in Julia. Do I have to load it as bytes and then convert it using UTF16String?

The easiest way is to read it as bytes and then convert:

s = open(filename, "r") do f
    utf16(readbytes(f))
end

Note that utf16 also checks for a byte order mark (BOM), so it deals with endianness issues and does not include the BOM in the resulting s.
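For readers on Julia 1.x, where (to my knowledge) readbytes and utf16 are no longer in Base and the UTF16String type lives in the LegacyStrings.jl package, the same BOM handling can be sketched by hand. This is a minimal sketch, not the original utf16 implementation: decode_utf16 is a hypothetical helper name, and the code assumes that data without a BOM is native-endian.

```julia
# Hypothetical helper (not part of Base): decode raw bytes as UTF-16,
# honoring a leading byte order mark (BOM) if one is present.
function decode_utf16(bytes::Vector{UInt8})
    iseven(length(bytes)) ||
        throw(ArgumentError("UTF-16 data must have an even number of bytes"))
    # View the bytes as native-endian 16-bit code units, then copy to a Vector.
    units = collect(reinterpret(UInt16, bytes))
    if !isempty(units)
        if units[1] == 0xfffe
            # The BOM came out byte-swapped, so the file uses the other
            # byte order: swap every code unit into native order.
            units .= bswap.(units)
        end
        if units[1] == 0xfeff
            # BOM in native order: drop it so it does not end up in the string.
            popfirst!(units)
        end
    end
    # Assumption: without a BOM the data is already native-endian.
    return transcode(String, units)
end
```

With a path in filename, s = decode_utf16(read(filename)) reads and decodes the file in one step. The result is an ordinary UTF-8 String rather than a UTF16String.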

If you really want to avoid copying the data, and you know it is native-endian, that is also possible, but you must explicitly write a NUL terminator (because Julia's UTF-16 string data internally ends with a NUL code point, so that it can be passed to C routines that expect NUL-terminated data):

s = open(filename, "r") do f
    b = readbytes(f)
    resize!(b, length(b)+2)
    b[end] = b[end-1] = 0
    UTF16String(reinterpret(UInt16, b))
end

However, a typical UTF-16 text file will start with a BOM, in which case the string s will include the BOM as its first character, which may not be what you want.
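If the BOM does end up in the string, it can be removed afterwards. The following is a minimal sketch, assuming s is an ordinary String; strip_bom is a hypothetical helper name, not a Base function.

```julia
# Hypothetical helper: return `s` without a leading BOM (U+FEFF), if any.
function strip_bom(s::String)
    if !isempty(s) && first(s) == '\ufeff'
        # Skip the first character; nextind handles its multi-byte encoding.
        return s[nextind(s, firstindex(s)):end]
    end
    return s
end
```

Strings that do not start with a BOM are returned unchanged, so the helper is safe to apply unconditionally.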
