Using the Boost codecvt facet for reading UTF-8

C++ has codecvt facets, which make it possible to read UTF-8 or another text encoding into wstring containers. I couldn't find much information on how to implement such a facet, or any sample code, apart from a UTF-8 facet implementation in the Boost libraries.

The minimal sample below reads from a text file named utf8.txt. Suppose utf8.txt contains a single line with the text “日本語”. When it is read with a standard narrow stream, the UTF-8 is treated as a stream of bytes and the reported length is 9. When it is read through a wide stream imbued with the UTF-8 facet, the bytes are decoded into wide characters and the reported length is 3.
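The gap between the two lengths comes from UTF-8 using one to four bytes per character. As a rough illustration that does not depend on Boost, the sketch below counts characters in a byte string by skipping continuation bytes (those of the form 10xxxxxx). The function name count_code_points is made up for this example, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Boost or the standard library.
// Counts UTF-8 code points by counting every byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes of “日本語” (E6 97 A5, E6 9C AC, E8 AA 9E) this returns 3, while std::string::length() on the same bytes reports 9.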

Printing the string, or individual characters from it, requires either converting it back to UTF-8 or using a terminal that supports wide-character output (UTF-16 or UTF-32, depending on the platform).
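One possible way to convert back to UTF-8 for printing is to encode each wide character by hand. This is only a sketch: the name to_utf8 is made up for this example, and it assumes every wchar_t holds one complete code point. That holds on Linux, where wchar_t is 32 bits wide. On Windows, where wchar_t is 16 bits wide, it only works for characters in the Basic Multilingual Plane, because surrogate pairs are not combined.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of Boost or the standard library.
// Encodes each wide character as UTF-8.
// Assumption: each wchar_t is one complete code point (no surrogate pairs),
// and the value is a valid Unicode code point; nothing is validated.
std::string to_utf8(const std::wstring& wide)
{
    std::string out;
    for (wchar_t wc : wide) {
        const std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result can be written with std::cout << to_utf8(wstr), provided the terminal expects UTF-8.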

Tested on Windows and Linux. The Fedora boost package didn’t include the utf8_codecvt_facet.cpp file. The file can be obtained here

#include <iostream>
#include <fstream>
#include <locale>
#include <string>

// Leave the namespace and export macros empty so the facet is declared
// in the global namespace and compiled directly into this file.
#define BOOST_UTF8_BEGIN_NAMESPACE
#define BOOST_UTF8_END_NAMESPACE
#define BOOST_UTF8_DECL

#include <boost/detail/utf8_codecvt_facet.hpp>

// This file wasn't present in the Fedora Boost package,
// but was under Windows.
//#include <libs/detail/utf8_codecvt_facet.cpp>
#include "utf8_codecvt_facet.cpp"

int main(int argc, char** argv)
{
    // Read through a wide stream whose locale decodes UTF-8 into wchar_t.
    std::wifstream wifs("utf8.txt");
    wifs.imbue(std::locale(std::locale(), new utf8_codecvt_facet));
    std::wstring wstr;
    wifs >> wstr;
    wifs.close();
    std::cout << "wstr.length()=" << wstr.length() << std::endl;

    // Read the same file as raw bytes for comparison.
    std::ifstream ifs("utf8.txt");
    std::string str;
    ifs >> str;
    std::cout << "str.length()=" << str.length() << std::endl;
    ifs.close();

    return 0;
}

This entry was posted by Edobashira.