C++ has codecvt facets, which make it possible to read UTF-8 or another text encoding into wstring containers. I couldn't find much information on how to implement such a facet, or any sample code apart from the UTF-8 facet implementation in the Boost libraries.
The minimal sample below reads from a text file named utf8.txt. Suppose the file contains a single line with the text “日本語”. When read through a standard narrow stream, the UTF-8 is treated as a stream of bytes and the reported length is 9. When read through a wide stream imbued with the UTF-8 facet, the reported length is 3, one element per character.
Printing the string or its individual characters requires either converting them back to UTF-8 or using a terminal that supports UTF-16/32 output.
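As a rough sketch of the "convert back to UTF-8" option, the hand-written encoder below turns a wstring into a UTF-8 std::string that can be sent to std::cout. The names append_utf8 and to_utf8 are hypothetical helpers, not part of Boost or the standard library. It assumes each wchar_t holds either a full code point (32-bit wchar_t, as on Linux) or a UTF-16 code unit (16-bit wchar_t, as on Windows), and it does not validate its input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a wide string as UTF-8.
// Assumption: input is well-formed UTF-32 or UTF-16; no validation is done.
std::string to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // With a 16-bit wchar_t, combine a surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

With this in place, something like std::cout << to_utf8(wstr) should display the text on a terminal that expects UTF-8.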
Tested on Windows and Linux. The Fedora boost package didn’t include the utf8_codecvt_facet.cpp file. The file can be obtained here
#include <iostream>
#include <fstream>
#include <locale>
// Defined empty so the facet is declared in the global namespace
// with no export decoration.
#define BOOST_UTF8_BEGIN_NAMESPACE
#define BOOST_UTF8_END_NAMESPACE
#define BOOST_UTF8_DECL
#include <boost/detail/utf8_codecvt_facet.hpp>
// This file was missing from the Fedora Boost package
// but present under Windows.
//#include <libs/detail/utf8_codecvt_facet.cpp>
#include "utf8_codecvt_facet.cpp"
int main(int argc, char** argv)
{
    // Wide stream: the facet decodes the UTF-8 bytes into wide characters.
    std::wifstream wifs("utf8.txt");
    wifs.imbue(std::locale(std::locale(), new utf8_codecvt_facet));
    std::wstring wstr;
    wifs >> wstr;
    wifs.close();
    std::cout << "wstr.length()=" << wstr.length() << std::endl;

    // Narrow stream: the same file is read as raw bytes.
    std::ifstream ifs("utf8.txt");
    std::string str;
    ifs >> str;
    std::cout << "str.length()=" << str.length() << std::endl;
    ifs.close();
    return 0;
}
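The gap between the two lengths can also be checked without Boost. The sketch below counts code points in a UTF-8 byte string by skipping continuation bytes (those of the form 10xxxxxx). The function name utf8_code_points is a hypothetical helper introduced here for illustration, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the code points in a UTF-8 byte string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (10xxxxxx), so counting those bytes is enough.
// Assumption: `bytes` holds valid UTF-8; malformed input is not detected.
std::size_t utf8_code_points(const std::string& bytes)
{
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the narrow string read from utf8.txt, str.length() is 9 while utf8_code_points(str) should come out as 3, matching wstr.length().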