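1. The problem
When I was learning to program, someone once asked me, "Can you write a Notepad program?" I was dismissive at the time, but as I got deeper into MFC I realized that a Notepad program is not trivial after all. The hard part is converting between the four encodings it supports.
Encoding gives beginners headaches, and conversion between encodings is even harder to pin down. To finish an encoding-conversion module for my graduation project, I studied the Chinese encodings and the common character sets, then decided to solve the encoding problem of a "Notepad" program and go one step further: arbitrary conversion among six encodings, namely GB2312, Big5, GBK, Unicode, Unicode big endian, and UTF-8.
2. The solution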
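(1) Encoding basics
a. Encodings and character sets
I won't repeat this material here; see "A Detailed Guide to Character Sets and Encodings" in Ancky's column on CSDN.
Blog: http://blog.csdn.net/ancky/article/details/2034809
b. Single-byte, double-byte, and multi-byte characters
See my earlier translation, "The Complete Guide to C++ Strings, Part I: Win32 Character Encodings".
Blog: http://blog.csdn.net/ziyuanxiazai123/article/details/7482360
c. Locales and code pages
See the blog post at http://hi.baidu.com/tzpwater/blog/item/bd4abb0b60bff1db3ac7636a.html
d. The Chinese encodings GB2312, GBK, and Big5
See "GB2312, GBK, and Big5 Chinese Character Encodings" on lengshine's CSDN blog: http://blog.csdn.net/lengshine/article/details/5470545
e. Character encoding in Windows programs
See "Character Encoding in Windows Programs" at http://blog.sina.com.cn/s/blog_4e3197f20100a6z2.html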
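(2) Encoding summary
a. Characteristics of the six encodings
The characteristics of the six encodings are shown in the figure below:
[Figure: characteristics of the six encodings]
b. Differences in storage
ANSI (which defaults to GB2312 on Simplified Chinese systems), Unicode, Unicode big endian, and UTF-8 files are stored differently.
Taking the two Chinese characters "你好" as an example, their storage formats are shown in the figure below:
[Figure: storage formats of "你好" in the four file types]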
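Since the figure is not available here, the storage difference can be sketched in code. The helper names below (`EncodeUtf16WithBom`, `EncodeUtf8WithBom`, `kNiHao`) are hypothetical and not part of the Coder class; the sketch assumes the files carry the byte order marks that Notepad writes, and it only handles characters in the Basic Multilingual Plane. For comparison, the ANSI (GB2312) file stores the same two characters as C4 E3 BA C3 with no header.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Code points of the two example characters: U+4F60 and U+597D.
static const char16_t kNiHao[] = {0x4F60, 0x597D};

// "Unicode" (little endian) or "Unicode big endian" file content: BOM, then two bytes per unit.
Bytes EncodeUtf16WithBom(const char16_t* text, std::size_t count, bool bigEndian) {
    Bytes out = bigEndian ? Bytes{0xFE, 0xFF} : Bytes{0xFF, 0xFE};
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t hi = static_cast<std::uint8_t>(text[i] >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(text[i] & 0xFF);
        if (bigEndian) { out.push_back(hi); out.push_back(lo); }
        else           { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}

// UTF-8 file content: BOM, then one to three bytes per character (BMP only, no surrogates).
Bytes EncodeUtf8WithBom(const char16_t* text, std::size_t count) {
    Bytes out = {0xEF, 0xBB, 0xBF};
    for (std::size_t i = 0; i < count; ++i) {
        const unsigned c = text[i];
        if (c < 0x80) {
            out.push_back(static_cast<std::uint8_t>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (c >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (c >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Calling `EncodeUtf16WithBom(kNiHao, 2, false)` yields FF FE 60 4F 7D 59, the big endian variant yields FE FF 4F 60 59 7D, and `EncodeUtf8WithBom(kNiHao, 2)` yields EF BB BF E4 BD A0 E5 A5 BD.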
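c. Differences between GB2312, Big5, and GBK
All three represent a Chinese character with two bytes, but the value ranges of those bytes differ, as shown in the figure below:
[Figure: byte ranges of GB2312, Big5, and GBK]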
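The ranges can also be written down as predicates. This is a minimal sketch that assumes the commonly documented byte ranges for the three encodings; the values are not taken from the missing figure, and the function names are hypothetical.

```cpp
#include <cstdint>

// GB2312: both bytes lie in 0xA1..0xFE, with lead bytes up to 0xF7 in use.
bool IsGb2312Pair(std::uint8_t lead, std::uint8_t trail) {
    return lead >= 0xA1 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}

// GBK: lead byte 0x81..0xFE, trail byte 0x40..0xFE except 0x7F.
bool IsGbkPair(std::uint8_t lead, std::uint8_t trail) {
    return lead >= 0x81 && lead <= 0xFE &&
           trail >= 0x40 && trail <= 0xFE && trail != 0x7F;
}

// Big5: lead byte 0x81..0xFE, trail byte 0x40..0x7E or 0xA1..0xFE.
bool IsBig5Pair(std::uint8_t lead, std::uint8_t trail) {
    return lead >= 0x81 && lead <= 0xFE &&
           ((trail >= 0x40 && trail <= 0x7E) || (trail >= 0xA1 && trail <= 0xFE));
}
```

The pair C4 E3 (the GB2312 code of "你") satisfies all three predicates, which is why a byte-range test alone cannot tell the encodings apart with certainty.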
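(3) Conversion approach
With six encodings converting to one another, there are 6 × 5 = 30 conversion directions. The approach I use is as follows.
Conversion between multi-byte files and Unicode files is shown in the figure below:
[Figure: conversion between multi-byte files and Unicode files]
Conversion between multi-byte files is shown in the figure below:
[Figure: conversion between multi-byte files]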
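The count of 30 directions, and the way they fall into four groups, can be checked with a short sketch. The enum below mirrors the order of the `TextCode` values used in the Coder class, but the names `UNICODE_LE`, `UNICODE_BE`, `Route`, `ClassifyRoute`, and `CountRoutes` are hypothetical and chosen only for this illustration.

```cpp
#include <array>

// Multi-byte encodings come first, the two UTF-16 file types last.
enum TextCode { GB2312 = 0, BIG5, GBK, UTF8, UNICODE_LE, UNICODE_BE, kCodeCount };

enum Route { UnicodeToUnicode = 0, MultiByteToUnicode, UnicodeToMultiByte, MultiByteToMultiByte };

// Pick the conversion path for an ordered (from, to) pair.
Route ClassifyRoute(TextCode from, TextCode to) {
    const bool fromWide = from >= UNICODE_LE;
    const bool toWide = to >= UNICODE_LE;
    if (fromWide && toWide) return UnicodeToUnicode;
    if (!fromWide && toWide) return MultiByteToUnicode;
    if (fromWide && !toWide) return UnicodeToMultiByte;
    return MultiByteToMultiByte;
}

// Count how many of the ordered pairs use each path.
std::array<int, 4> CountRoutes() {
    std::array<int, 4> counts = {{0, 0, 0, 0}};
    for (int from = 0; from < kCodeCount; ++from) {
        for (int to = 0; to < kCodeCount; ++to) {
            if (from != to) {
                ++counts[ClassifyRoute(static_cast<TextCode>(from), static_cast<TextCode>(to))];
            }
        }
    }
    return counts;
}
```

`CountRoutes()` returns 2, 8, 8, and 12 for the four groups, which add up to the 30 directions.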
(4) The three functions used for conversion
a. MultiByteToWideChar
This function converts a multi-byte string to a Unicode wide-character string.
Its prototype is:
```cpp
int MultiByteToWideChar(
    UINT   CodePage,        // code page
    DWORD  dwFlags,         // conversion flags
    LPCSTR lpMultiByteStr,  // string to convert
    int    cbMultiByte,     // size of the string to convert, in bytes
    LPWSTR lpWideCharStr,   // buffer that receives the wide-character string
    int    cchWideChar      // size of that buffer, in wide characters
);
```
b. WideCharToMultiByte
This function converts a Unicode wide-character string to a multi-byte string; see MSDN for details on its use.
Between them, these two functions handle most string conversions, and they can be wrapped into a pair of helpers that convert between multi-byte and wide-character strings:
```cpp
// Multi-byte string to wide-character string; the caller releases the result with delete[]
wchar_t* Coder::MByteToWChar(UINT CodePage,LPCSTR lpcszSrcStr)
{
    LPWSTR lpcwsStrDes=NULL;
    // First call: get the required size in wide characters, terminator included
    int len=MultiByteToWideChar(CodePage,0,lpcszSrcStr,-1,NULL,0);
    if(len<=0)
        return NULL;
    lpcwsStrDes=new wchar_t[len+1];
    if(!lpcwsStrDes)
        return NULL;
    memset(lpcwsStrDes,0,sizeof(wchar_t)*(len+1));
    // Second call: do the conversion
    len=MultiByteToWideChar(CodePage,0,lpcszSrcStr,-1,lpcwsStrDes,len);
    if(len)
        return lpcwsStrDes;
    delete[] lpcwsStrDes;
    return NULL;
}

// Wide-character string to multi-byte string; the caller releases the result with delete[]
char* Coder::WCharToMByte(UINT CodePage,LPCWSTR lpcwszSrcStr)
{
    char* lpszDesStr=NULL;
    int len=WideCharToMultiByte(CodePage,0,lpcwszSrcStr,-1,NULL,0,NULL,NULL);
    if(len<=0)
        return NULL;
    lpszDesStr=new char[len+1];
    if(!lpszDesStr)   // check before touching the buffer
        return NULL;
    memset(lpszDesStr,0,sizeof(char)*(len+1));
    len=WideCharToMultiByte(CodePage,0,lpcwszSrcStr,-1,lpszDesStr,len,NULL,NULL);
    if(len)
        return lpszDesStr;
    delete[] lpszDesStr;
    return NULL;
}
```
c. LCMapString
This is a locale-dependent character-mapping function. Conversion between the Chinese encodings in particular depends on the local machine, and using only the functions from a and b produces errors. For example, converting GB2312 directly to Big5 by calling MultiByteToWideChar to go from GB2312 to a Unicode string, and then WideCharToMultiByte to go from that Unicode string to Big5, goes wrong: simplified characters that have no Big5 code cannot be represented in the target code page. The erroneous result is shown in the figure below:
[Figure: erroneous result of converting GB2312 directly to Big5]
The output of the test program is shown in the figure below:
[Figure: test program output]
Chinese encoding conversion therefore needs LCMapString at the right point to come out correctly. For example:
```cpp
// Simplified Chinese (GB2312) to Traditional Chinese (Big5)
char* Coder::GB2312ToBIG5(const char* szGB2312Str)
{
    if(!szGB2312Str)
        return NULL;
    LCID lcid = MAKELCID(MAKELANGID(LANG_CHINESE,SUBLANG_CHINESE_SIMPLIFIED),SORT_CHINESE_PRC);
    // Step 1: map simplified forms to traditional forms, still in code page 936
    int nLength = LCMapString(lcid,LCMAP_TRADITIONAL_CHINESE,szGB2312Str,-1,NULL,0);
    if(nLength<=0)
        return NULL;
    char* pBuffer=new char[nLength+1];
    if(!pBuffer)
        return NULL;
    LCMapString(lcid,LCMAP_TRADITIONAL_CHINESE,szGB2312Str,-1,pBuffer,nLength);
    pBuffer[nLength]=0;
    // Step 2: code page 936 -> Unicode -> Big5
    wchar_t* pUnicodeBuff = MByteToWChar(CP_GB2312,pBuffer);
    char* pBIG5Buff = WCharToMByte(CP_BIG5,pUnicodeBuff);
    delete[] pBuffer;
    delete[] pUnicodeBuff;
    return pBIG5Buff;
}
```
(5) Implementation
The Coder class does the conversion work. Its code listing follows:
```cpp
// Coder.h: interface for the Coder class.
//

#if !defined(AFX_ENCODING_H__2AC955FB_9F8F_4871_9B77_C6C65730507F__INCLUDED_)
#define AFX_ENCODING_H__2AC955FB_9F8F_4871_9B77_C6C65730507F__INCLUDED_

#if _MSC_VER > 1000
#pragma once
#endif // _MSC_VER > 1000
//-----------------------------------------------------------------------------------------------
// Purpose:   convert freely among six encodings: GB2312, Big5, GBK, Unicode,
//            Unicode big endian, and UTF-8
// Author:    王定橋, School of Computer Science and Technology, Hubei Normal University
// Algorithm: convert to the other encodings according to the characteristics of each encoding
// Tested:    passes under Windows 7 with VC 6.0
// Date:      2012-04-24
// License:   the code is public for study and exchange; corrections and improvements are welcome
//-----------------------------------------------------------------------------------------------
// Windows code pages
typedef enum CodeType
{
    CP_GB2312=936,  // Windows code page 936 (its GBK implementation, a superset of GB2312)
    CP_BIG5=950,
    CP_GBK=0        // 0 is CP_ACP, the system default ANSI code page
}CodePages;
// Encodings of text files
typedef enum TextCodeType
{
    GB2312=0,
    BIG5=1,
    GBK=2,
    UTF8=3,
    UNICODE=4,      // note: this name clashes with the UNICODE macro in Unicode builds
    UNICODEBIGENDIAN=5,
    DefaultCodeType=-1
}TextCode;
class Coder
{
public:
    Coder();
    virtual ~Coder();
public:
    // Default number of bytes converted per chunk
    UINT PREDEFINEDSIZE;
    // Set the number of bytes converted per chunk
    void SetDefaultConvertSize(UINT nCount);
    // Encoding type to display string
    CString CodeTypeToString(TextCode tc);
    // Convert a file to another encoding
    BOOL FileToOtherFile(CString filesourcepath, CString filesavepath,TextCode tcTo,TextCode tcCur=DefaultCodeType);
    // Convert between Unicode and Unicode big endian files
    BOOL UnicodeEndianFileConvert(CString filesourcepath, CString filesavepath,TextCode tcTo);
    // Convert between multi-byte files
    BOOL MBFileToMBFile(CString filesourcepath, CString filesavepath,TextCode tcTo,TextCode tcCur=DefaultCodeType);
    // Convert a Unicode or Unicode big endian file to a multi-byte file
    BOOL UnicodeFileToMBFile(CString filesourcepath, CString filesavepath,TextCode tcTo);
    // Convert a multi-byte file to a Unicode or Unicode big endian file
    BOOL MBFileToUnicodeFile(CString filesourcepath,CString filesavepath,TextCode tcTo,TextCode tcCur=DefaultCodeType);
    // Detect the encoding of a file
    TextCode GetCodeType(CString filepath);
    // Traditional Chinese (Big5) to Simplified Chinese (GB2312)
    char* BIG5ToGB2312(const char* szBIG5Str);
    // Simplified Chinese (GB2312) to Traditional Chinese (Big5)
    char* GB2312ToBIG5(const char* szGB2312Str);
    // Simplified/Traditional Chinese (GBK) to Simplified Chinese (GB2312)
    char* GBKToGB2312(const char *szGBkStr);
    // Simplified Chinese (GB2312) to Simplified/Traditional Chinese (GBK)
    char* GB2312ToGBK(const char *szGB2312Str);
    // Simplified/Traditional Chinese (GBK) to Traditional Chinese (Big5)
    char* GBKToBIG5(const char *szGBKStr);
    // Traditional Chinese (Big5) to Simplified/Traditional Chinese (GBK)
    char* BIG5ToGBK(const char *szBIG5Str);
    // Wide-character string to multi-byte string
    char* WCharToMByte(UINT CodePage,LPCWSTR lpcwszSrcStr);
    // Multi-byte string to wide-character string
    wchar_t* MByteToWChar(UINT CodePage,LPCSTR lpcszSrcStr);
protected:
    // Code page that corresponds to an encoding type
    UINT GetCodePage(TextCode tccur);
    // Multi-byte string to multi-byte string
    char* MByteToMByte(UINT CodePageCur,UINT CodePageTo,const char* szSrcStr);
    // Swap a string between Unicode and Unicode big endian
    void UnicodeEndianConvert(LPWSTR lpwszstr);
    // Byte order mark constants
    const static byte UNICODEBOM[2];
    const static byte UNICODEBEBOM[2];
    const static byte UTF8BOM[3];
};

#endif // !defined(AFX_ENCODING_H__2AC955FB_9F8F_4871_9B77_C6C65730507F__INCLUDED_)
```
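3. Results
Conversion among the six encodings was tested under Windows 7 with VC 6.0 and passed for all 30 directions. The implementation file, Coder.cpp, is listed next, followed by the screenshots.
[Figure: the 30 conversion directions]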
```cpp
// Coder.cpp: implementation of the Coder class.
//

#include "stdafx.h"
#include "Coder.h"
#include "Encoding.h"

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif

//
// Construction/Destruction
//
// Byte order marks written at the start of each file type
/*static*/ const byte Coder::UNICODEBOM[2]={0xFF,0xFE};
/*static*/ const byte Coder::UNICODEBEBOM[2]={0xFE,0xFF};
/*static*/ const byte Coder::UTF8BOM[3]={0xEF,0xBB,0xBF};
Coder::Coder()
{
    PREDEFINEDSIZE=2097152; // default chunk size per conversion: 2 MB
}
Coder::~Coder()
{
}
```
```cpp
// Traditional Chinese (Big5) to Simplified Chinese (GB2312)
char* Coder::BIG5ToGB2312(const char* szBIG5Str)
{
    LCID lcid = MAKELCID(MAKELANGID(LANG_CHINESE,SUBLANG_CHINESE_SIMPLIFIED),SORT_CHINESE_PRC);
    // Big5 -> Unicode -> code page 936, then map traditional forms to simplified forms
    wchar_t* szUnicodeBuff = MByteToWChar(CP_BIG5,szBIG5Str);
    char* szGB2312Buff = WCharToMByte(CP_GB2312,szUnicodeBuff);
    delete[] szUnicodeBuff;
    if(!szGB2312Buff)
        return NULL;
    char* pBuffer = NULL;
    int nLength = LCMapString(lcid,LCMAP_SIMPLIFIED_CHINESE,szGB2312Buff,-1,NULL,0);
    if(nLength>0)
        pBuffer = new char[nLength + 1];
    if(pBuffer)
    {
        memset(pBuffer,0,sizeof(char)*(nLength+1));
        // Use the same LCID for the sizing call and the mapping call
        LCMapString(lcid,LCMAP_SIMPLIFIED_CHINESE,szGB2312Buff,-1,pBuffer,nLength);
    }
    delete[] szGB2312Buff; // released on every path
    return pBuffer;
}
```
```cpp
// GB2312 to GBK
char* Coder::GB2312ToGBK(const char *szGB2312Str)
{
    if(!szGB2312Str)
        return NULL;
    int nStrLen = strlen(szGB2312Str);
    if(!nStrLen)
        return NULL;
    LCID wLCID = MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_PRC);
    int nReturn = LCMapString(wLCID, LCMAP_TRADITIONAL_CHINESE, szGB2312Str, nStrLen, NULL, 0);
    if(!nReturn)
        return NULL;
    char *pcBuf = new char[nReturn + 1];
    if(!pcBuf)
        return NULL;
    memset(pcBuf,0,sizeof(char)*(nReturn + 1));
    // The source length is nStrLen; nReturn is only the size of the output buffer
    LCMapString(wLCID, LCMAP_TRADITIONAL_CHINESE, szGB2312Str, nStrLen, pcBuf, nReturn);
    return pcBuf;
}
// GBK to GB2312
char* Coder::GBKToGB2312(const char *szGBKStr)
{
    if(!szGBKStr)
        return NULL;
    int nStrLen = strlen(szGBKStr);
    if(!nStrLen)
        return NULL;
    LCID wLCID = MAKELCID(MAKELANGID(LANG_CHINESE, SUBLANG_CHINESE_SIMPLIFIED), SORT_CHINESE_BIG5);
    int nReturn = LCMapString(wLCID, LCMAP_SIMPLIFIED_CHINESE, szGBKStr, nStrLen, NULL, 0);
    if(!nReturn)
        return NULL;
    char *pcBuf = new char[nReturn + 1];
    if(!pcBuf)
        return NULL;
    memset(pcBuf,0,sizeof(char)*(nReturn + 1));
    // The source length is nStrLen; nReturn is only the size of the output buffer
    LCMapString(wLCID, LCMAP_SIMPLIFIED_CHINESE, szGBKStr, nStrLen, pcBuf, nReturn);
    return pcBuf;
}
```
```cpp
// Simplified/Traditional Chinese (GBK) to Traditional Chinese (Big5), via GB2312
char* Coder::GBKToBIG5(const char *szGBKStr)
{
    char *pTemp=GBKToGB2312(szGBKStr);
    char *pBuffer=GB2312ToBIG5(pTemp);
    delete[] pTemp;
    return pBuffer;
}
// Traditional Chinese (Big5) to Simplified/Traditional Chinese (GBK), via GB2312
char* Coder::BIG5ToGBK(const char *szBIG5Str)
{
    char *pTemp=BIG5ToGB2312(szBIG5Str);
    char *pBuffer=GB2312ToGBK(pTemp);
    delete[] pTemp;
    return pBuffer;
}
// Simplified Chinese (GB2312) to Traditional Chinese (Big5)
char* Coder::GB2312ToBIG5(const char* szGB2312Str)
{
    if(!szGB2312Str)
        return NULL;
    LCID lcid = MAKELCID(MAKELANGID(LANG_CHINESE,SUBLANG_CHINESE_SIMPLIFIED),SORT_CHINESE_PRC);
    // Step 1: map simplified forms to traditional forms, still in code page 936
    int nLength = LCMapString(lcid,LCMAP_TRADITIONAL_CHINESE,szGB2312Str,-1,NULL,0);
    if(nLength<=0)
        return NULL;
    char* pBuffer=new char[nLength+1];
    if(!pBuffer)
        return NULL;
    LCMapString(lcid,LCMAP_TRADITIONAL_CHINESE,szGB2312Str,-1,pBuffer,nLength);
    pBuffer[nLength]=0;
    // Step 2: code page 936 -> Unicode -> Big5
    wchar_t* pUnicodeBuff = MByteToWChar(CP_GB2312,pBuffer);
    char* pBIG5Buff = WCharToMByte(CP_BIG5,pUnicodeBuff);
    delete[] pBuffer;
    delete[] pUnicodeBuff;
    return pBIG5Buff;
}
```
```cpp
// Detect the encoding of a file.
// Unicode and UTF-8 files are recognized by the byte order mark at the start of the file.
// The Chinese encodings are guessed by sampling byte pairs from the file, at most 30 samples.
// Detection of the Chinese encodings is not exact.
TextCode Coder::GetCodeType(CString filepath)
{
    CFile file;
    byte  buf[3]={0,0,0};   // unsigned char; zeroed in case the file is shorter than 3 bytes
    TextCode tctemp=GB2312; // fallback when no sampled pair matches
    if(!file.Open(filepath,CFile::modeRead))
        return GB2312;
    file.Read(buf,3);
    if(buf[0]==UTF8BOM[0] && buf[1]==UTF8BOM[1] && buf[2]==UTF8BOM[2])
        return UTF8;
    if(buf[0]==UNICODEBOM[0] && buf[1]==UNICODEBOM[1])
        return UNICODE;
    if(buf[0]==UNICODEBEBOM[0] && buf[1]==UNICODEBEBOM[1])
        return UNICODEBIGENDIAN;
    // No BOM: go back to the start so that the first bytes are sampled as well
    file.Seek(0,CFile::begin);
    int nSamples=30;
    while(nSamples && file.Read(buf,2)==2)
    {
        if((buf[0]>=176 && buf[0]<=247) && (buf[1]>=160 && buf[1]<=254))
            tctemp=GB2312;
        else if((buf[0]>=129 && buf[0]<=255) && ((buf[1]>=64 && buf[1]<=126) || (buf[1]>=161 && buf[1]<=254)))
            tctemp=BIG5;
        else if((buf[0]>=129 && buf[0]<=254) && (buf[1]>=64 && buf[1]<=254))
            tctemp=GBK;
        nSamples--;
        file.Seek(100,CFile::current); // skip ahead so the samples are spread over the file
    }
    return tctemp; // classification of the last sampled pair that matched
}
```
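The BOM check at the start of `GetCodeType` can be isolated from MFC and from file access. The following is a minimal sketch on an in-memory buffer; `Bom`, `DetectBom`, and `BomLength` are hypothetical names that do not appear in the Coder class.

```cpp
#include <cstddef>
#include <cstdint>

enum class Bom { None, Utf8, Utf16LittleEndian, Utf16BigEndian };

// Inspect the first bytes of a buffer for one of the three byte order marks.
Bom DetectBom(const std::uint8_t* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return Bom::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) return Bom::Utf16LittleEndian;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) return Bom::Utf16BigEndian;
    return Bom::None;
}

// Number of header bytes to skip before the text itself begins.
std::size_t BomLength(Bom bom) {
    switch (bom) {
        case Bom::Utf8:              return 3;
        case Bom::Utf16LittleEndian: return 2;
        case Bom::Utf16BigEndian:    return 2;
        default:                     return 0;
    }
}
```

Checking the buffer size before each comparison means a one-byte or empty file is simply reported as having no BOM, and `BomLength` gives the offset that the file-conversion functions skip with `Seek`.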
```cpp
// Convert a multi-byte file to a Unicode or Unicode big endian file
BOOL Coder::MBFileToUnicodeFile(CString filesourcepath, CString filesavepath,TextCode tcTo,TextCode tcCur)
{
    TextCode curtc;
    CFile filesource,filesave;
    char     *pChSrc=NULL;
    char     *pChTemp=NULL;
    wchar_t  *pwChDes=NULL;
    DWORD  filelength,readlen,len;
    int    bufferlen,strlength;
    UINT   CodePage=CP_ACP;
    BOOL   bRet=TRUE;
    // Detection is not exact, so the caller may specify the source encoding
    if(tcCur!=DefaultCodeType)
        curtc=tcCur;
    else
        curtc=GetCodeType(filesourcepath);
    if(curtc>UTF8 || tcTo<UNICODE || curtc==tcTo)
        return FALSE;
    // Fail if the source cannot be opened or is empty, or if the target cannot be created
    if(!filesource.Open(filesourcepath,CFile::modeRead) || 0==(filelength=filesource.GetLength()))
        return FALSE;
    if(!filesave.Open(filesavepath,CFile::modeCreate|CFile::modeWrite))
        return FALSE;
    // Preallocate the read buffer; fail if allocation fails
    if(filelength<PREDEFINEDSIZE)
        bufferlen=filelength;
    else
        bufferlen=PREDEFINEDSIZE;
    pChSrc=new char[bufferlen+1];
    if(!pChSrc)
        return FALSE;
    // Choose the code page from the source encoding
    switch(curtc)
    {
    case GB2312:
        CodePage=CP_GB2312;
        break;
    case GBK:
        CodePage=CP_GB2312; // special case: GBK is first mapped to GB2312
        break;
    case BIG5:
        CodePage=CP_BIG5;
        break;
    case UTF8:
        CodePage=CP_UTF8;
        break;
    default:
        break;
    }
    // Skip the BOM of a UTF-8 source file
    if(UTF8==curtc)
        filesource.Seek(3*sizeof(byte),CFile::begin);
    // Write the BOM of the target file
    if(UNICODEBIGENDIAN==tcTo)
        filesave.Write(UNICODEBEBOM,2*sizeof(byte));
    else
        filesave.Write(UNICODEBOM,2*sizeof(byte));
    // Read and convert chunk by chunk until the end of the file.
    // Note: a chunk boundary can fall in the middle of a multi-byte character.
    while(filelength>0)
    {
        memset(pChSrc,0,sizeof(char)*(bufferlen+1));
        if(filelength>PREDEFINEDSIZE)
            len=PREDEFINEDSIZE;
        else
            len=filelength;
        readlen=filesource.Read(pChSrc,len);
        if(!readlen)
            break;
        // GBK is mapped to GB2312 first; pChSrc always remains the read buffer
        if(GBK==curtc)
        {
            pChTemp=GBKToGB2312(pChSrc);
            pwChDes=MByteToWChar(CodePage,pChTemp);
            delete[] pChTemp;
            pChTemp=NULL;
        }
        else
            pwChDes=MByteToWChar(CodePage,pChSrc);
        if(!pwChDes)
        {
            bRet=FALSE; // report a failed conversion instead of returning TRUE
            break;
        }
        if(UNICODEBIGENDIAN==tcTo)
            UnicodeEndianConvert(pwChDes);
        strlength=wcslen(pwChDes)*2; // length written to the file is in bytes
        filesave.Write(pwChDes,strlength);
        filesave.Flush();
        delete[] pwChDes; // release the result of each chunk
        pwChDes=NULL;
        filelength-=readlen;
    }
    delete[] pChSrc;
    return bRet;
}
```
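The chunked loop above reads a fixed number of bytes at a time, so a chunk can end between the two bytes of a double-byte character. One possible way to handle this is to convert only the complete characters and carry the leftover byte into the next chunk. The sketch below shows the boundary calculation only. It assumes the chunk starts on a character boundary and that lead bytes lie in 0x81..0xFE (the usual GBK and Big5 range); `CompleteDbcsPrefix` is a hypothetical helper, not part of the Coder class.

```cpp
#include <cstddef>
#include <cstdint>

// Returns how many leading bytes of the chunk form complete characters.
// Bytes below 0x81 are treated as single-byte characters; bytes in 0x81..0xFE
// are treated as the first byte of a two-byte character.
std::size_t CompleteDbcsPrefix(const std::uint8_t* data, std::size_t size) {
    std::size_t i = 0;
    while (i < size) {
        const std::size_t width = (data[i] >= 0x81 && data[i] <= 0xFE) ? 2 : 1;
        if (i + width > size) {
            break; // the last character is cut off by the chunk boundary
        }
        i += width;
    }
    return i;
}
```

A caller would convert the first `CompleteDbcsPrefix(buffer, readlen)` bytes and move the remaining byte, if any, to the front of the buffer before the next read.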
```cpp
// Multi-byte string to wide-character string; the caller releases the result with delete[]
wchar_t* Coder::MByteToWChar(UINT CodePage,LPCSTR lpcszSrcStr)
{
    LPWSTR lpcwsStrDes=NULL;
    // First call: get the required size in wide characters, terminator included
    int len=MultiByteToWideChar(CodePage,0,lpcszSrcStr,-1,NULL,0);
    if(len<=0)
        return NULL;
    lpcwsStrDes=new wchar_t[len+1];
    if(!lpcwsStrDes)
        return NULL;
    memset(lpcwsStrDes,0,sizeof(wchar_t)*(len+1));
    // Second call: do the conversion
    len=MultiByteToWideChar(CodePage,0,lpcszSrcStr,-1,lpcwsStrDes,len);
    if(len)
        return lpcwsStrDes;
    delete[] lpcwsStrDes;
    return NULL;
}

// Wide-character string to multi-byte string; the caller releases the result with delete[]
char* Coder::WCharToMByte(UINT CodePage,LPCWSTR lpcwszSrcStr)
{
    char* lpszDesStr=NULL;
    int len=WideCharToMultiByte(CodePage,0,lpcwszSrcStr,-1,NULL,0,NULL,NULL);
    if(len<=0)
        return NULL;
    lpszDesStr=new char[len+1];
    if(!lpszDesStr)   // check before touching the buffer
        return NULL;
    memset(lpszDesStr,0,sizeof(char)*(len+1));
    len=WideCharToMultiByte(CodePage,0,lpcwszSrcStr,-1,lpszDesStr,len,NULL,NULL);
    if(len)
        return lpszDesStr;
    delete[] lpszDesStr;
    return NULL;
}
```
```cpp
// Swap the byte order of a string (Unicode <-> Unicode big endian)
void Coder::UnicodeEndianConvert(LPWSTR lpwszstr)
{
    if(!lpwszstr)
        return;
    int len=wcslen(lpwszstr);
    // Swap the high and low byte of every character up to the terminator
    for(int index=0; index<len; index++)
    {
        wchar_t ch=lpwszstr[index];
        lpwszstr[index]=(wchar_t)(((ch & 0x00FF)<<8) | ((ch & 0xFF00)>>8));
    }
}
```
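The same byte swap can be written without Windows types. This is a portable sketch that works on an explicit length instead of a terminator, so text containing U+0000 is swapped in full; `SwapBytes16` and `SwapUtf16ByteOrder` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Exchange the high and low byte of one 16-bit code unit.
inline std::uint16_t SwapBytes16(std::uint16_t value) {
    return static_cast<std::uint16_t>((value << 8) | (value >> 8));
}

// Convert a buffer of UTF-16 code units between little endian and big endian in place.
void SwapUtf16ByteOrder(std::vector<std::uint16_t>& units) {
    for (std::uint16_t& unit : units) {
        unit = SwapBytes16(unit);
    }
}
```

Applying the swap twice restores the original buffer, which is why one routine serves both conversion directions.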
```cpp
// Convert a Unicode or Unicode big endian file to a multi-byte file
BOOL Coder::UnicodeFileToMBFile(CString filesourcepath, CString filesavepath,TextCode tcTo)
{
    TextCode curtc;
    CFile filesource,filesave;
    char    *pChDes=NULL;
    char    *pChTemp=NULL;
    wchar_t *pwChSrc=NULL;
    DWORD  filelength,readlen,len;
    int    bufferlen,strlength;
    UINT   CodePage=CP_ACP;
    BOOL   bRet=TRUE;
    curtc=GetCodeType(filesourcepath);
    // Fail on an invalid combination of source and target types
    if(curtc<=UTF8 || tcTo>UTF8 || curtc==tcTo)
        return FALSE;
    // Fail if the source cannot be opened or is empty, or if the target cannot be created
    if(!filesource.Open(filesourcepath,CFile::modeRead) || 0==(filelength=filesource.GetLength()))
        return FALSE;
    if(!filesave.Open(filesavepath,CFile::modeCreate|CFile::modeWrite))
        return FALSE;
    // Preallocate the read buffer; fail if allocation fails
    if(filelength<PREDEFINEDSIZE)
        bufferlen=filelength;
    else
        bufferlen=PREDEFINEDSIZE;
    pwChSrc=new wchar_t[(bufferlen/2)+1];
    if(!pwChSrc)
        return FALSE;
    // Choose the code page in advance
    switch(tcTo)
    {
    case GB2312:
        CodePage=CP_GB2312;
        break;
    case GBK:
        CodePage=CP_GB2312; // special case: converted to GB2312 first
        break;
    case BIG5:
        CodePage=CP_GB2312; // special case: converted to GB2312 first
        break;
    case UTF8:
        CodePage=CP_UTF8;
        break;
    default:
        break;
    }
    // Skip the 2-byte BOM of the source file
    filesource.Seek(sizeof(wchar_t),CFile::begin);
    // A UTF-8 target gets a BOM, because GetCodeType recognizes UTF-8 only by its BOM
    if(UTF8==tcTo)
        filesave.Write(UTF8BOM,3*sizeof(byte));
    while(filelength>0)
    {
        memset(pwChSrc,0,sizeof(wchar_t)*((bufferlen/2)+1));
        if(filelength>PREDEFINEDSIZE)
            len=PREDEFINEDSIZE;
        else
            len=filelength;
        readlen=filesource.Read(pwChSrc,len);
        if(!readlen)
            break;
        if(UNICODEBIGENDIAN==curtc)
            UnicodeEndianConvert(pwChSrc);
        pChDes=WCharToMByte(CodePage,pwChSrc);
        // GBK cannot be produced directly and direct conversion to Big5 gives errors,
        // so both go to GB2312 first and then to the target encoding
        if(GBK==tcTo || BIG5==tcTo)
        {
            pChTemp=pChDes;
            pChDes=(GBK==tcTo) ? GB2312ToGBK(pChTemp) : GB2312ToBIG5(pChTemp);
            delete[] pChTemp;
            pChTemp=NULL;
        }
        if(!pChDes)
        {
            bRet=FALSE; // report a failed conversion instead of returning TRUE
            break;
        }
        strlength=strlen(pChDes);
        filesave.Write(pChDes,strlength);
        filesave.Flush();
        delete[] pChDes; // release the result of each chunk
        pChDes=NULL;
        filelength-=readlen;
    }
    delete[] pwChSrc;
    return bRet;
}
```
```cpp
// Convert a multi-byte file to another multi-byte file.
// In general the text goes to Unicode first and then to the target encoding, two conversions in all.
BOOL Coder::MBFileToMBFile(CString filesourcepath, CString filesavepath,TextCode tcTo,TextCode tcCur)
{
    TextCode curtc;
    CFile filesource,filesave;
    char    *pChDes=NULL;
    char    *pChSrc=NULL;
    DWORD  filelength,readlen,len;
    int    bufferlen,strlength;
    UINT   CodePageCur,CodePageTo;
    BOOL   bRet=TRUE;
    // Detection is not exact, so the caller may specify the source encoding
    if(DefaultCodeType!=tcCur)
        curtc=tcCur;
    else
        curtc=GetCodeType(filesourcepath);
    // Fail on an invalid combination of source and target types
    if(curtc>UTF8 || tcTo>UTF8 || curtc==tcTo)
        return FALSE;
    // Fail if the source cannot be opened or is empty, or if the target cannot be created
    if(!filesource.Open(filesourcepath,CFile::modeRead) || 0==(filelength=filesource.GetLength()))
        return FALSE;
    if(!filesave.Open(filesavepath,CFile::modeCreate|CFile::modeWrite))
        return FALSE;
    // Preallocate the read buffer; fail if allocation fails
    if(filelength<PREDEFINEDSIZE)
        bufferlen=filelength;
    else
        bufferlen=PREDEFINEDSIZE;
    pChSrc=new char[bufferlen+1];
    if(!pChSrc)
        return FALSE;
    // Skip the BOM of a UTF-8 source file
    if(UTF8==curtc)
        filesource.Seek(3*sizeof(byte),CFile::begin);
    // A UTF-8 target gets a BOM, because GetCodeType recognizes UTF-8 only by its BOM
    if(UTF8==tcTo)
        filesave.Write(UTF8BOM,3*sizeof(byte));
    CodePageCur=GetCodePage(curtc);
    CodePageTo=GetCodePage(tcTo);
    while(filelength>0)
    {
        memset(pChSrc,0,sizeof(char)*(bufferlen+1));
        if(filelength>PREDEFINEDSIZE)
            len=PREDEFINEDSIZE;
        else
            len=filelength;
        readlen=filesource.Read(pChSrc,len);
        if(!readlen)
            break;
        pChDes=MByteToMByte(CodePageCur,CodePageTo,pChSrc);
        if(!pChDes)
        {
            bRet=FALSE; // report a failed conversion instead of returning TRUE
            break;
        }
        strlength=strlen(pChDes);
        filesave.Write(pChDes,strlength);
        delete[] pChDes; // release the result of each chunk
        pChDes=NULL;
        filelength-=readlen;
    }
    delete[] pChSrc;
    return bRet;
}
```
```cpp
// Convert between Unicode and Unicode big endian files
BOOL Coder::UnicodeEndianFileConvert(CString filesourcepath, CString filesavepath,TextCode tcTo)
{
    TextCode curtc=GetCodeType(filesourcepath);
    if(curtc!=UNICODE && curtc!=UNICODEBIGENDIAN)
        return FALSE;
    if(curtc==tcTo)
        return FALSE;
    CFile filesource,filesave;
    wchar_t *pwChDes;
    DWORD length;
    if(!filesource.Open(filesourcepath,CFile::modeRead) || !filesave.Open(filesavepath,CFile::modeCreate|CFile::modeWrite))
        return FALSE;
    length=filesource.GetLength();
    if(length<2)
        return FALSE;
    // Skip the source BOM; otherwise it is byte-swapped and written out as a second BOM
    filesource.Seek(2*sizeof(byte),CFile::begin);
    length-=2;
    pwChDes=new wchar_t[(length/2)+1];
    if(!pwChDes)
        return FALSE;
    memset(pwChDes,0,sizeof(wchar_t)*((length/2)+1));
    filesource.Read(pwChDes,length);
    UnicodeEndianConvert(pwChDes);
    length=wcslen(pwChDes)*2;
    // Write the BOM of the target byte order, then the swapped text
    if(UNICODE==tcTo)
        filesave.Write(UNICODEBOM,2*sizeof(byte));
    else
        filesave.Write(UNICODEBEBOM,2*sizeof(byte));
    filesave.Write(pwChDes,length);
    filesave.Flush();
    delete[] pwChDes;
    return TRUE;
}
```
```cpp
// Convert a file to another encoding.
// Six file formats converted pairwise give 30 conversions in total.
BOOL Coder::FileToOtherFile(CString filesourcepath, CString filesavepath, TextCode tcTo,TextCode tcCur)
{
    TextCode curtc;
    BOOL bret=FALSE;
    if(DefaultCodeType!=tcCur)
        curtc=tcCur;
    else
        curtc=GetCodeType(filesourcepath);
    if(curtc==tcTo)
        return FALSE;
    // Between Unicode and Unicode big endian files: 2 conversions
    if(curtc>=UNICODE && tcTo>=UNICODE)
        bret=UnicodeEndianFileConvert(filesourcepath,filesavepath,tcTo);
    // Multi-byte file to Unicode or Unicode big endian file: 8 conversions
    else if(curtc<UNICODE && tcTo>=UNICODE)
        bret=MBFileToUnicodeFile(filesourcepath,filesavepath,tcTo,curtc);
    // Unicode or Unicode big endian file to multi-byte file: 8 conversions
    else if(curtc>=UNICODE && tcTo<UNICODE)
        bret=UnicodeFileToMBFile(filesourcepath,filesavepath,tcTo);
    // Between multi-byte files: 12 conversions
    else
        bret=MBFileToMBFile(filesourcepath,filesavepath,tcTo,curtc);
    return bret;
}
```
```cpp
// Encoding type to display string
CString Coder::CodeTypeToString(TextCode tc)
{
    CString strtype;
    switch(tc)
    {
    case GB2312:
        strtype=_T("GB2312");
        break;
    case BIG5:
        strtype=_T("Big5");
        break;
    case GBK:
        strtype=_T("GBK");
        break;
    case UTF8:
        strtype=_T("UTF-8");
        break;
    case UNICODE:
        strtype=_T("Unicode");
        break;
    case UNICODEBIGENDIAN:
        strtype=_T("Unicode big endian");
        break;
    }
    return strtype;
}
```
```cpp
// Multi-byte string to multi-byte string
char* Coder::MByteToMByte(UINT CodePageCur, UINT CodePageTo, const char* szSrcStr)
{
    char    *pchDes=NULL;
    char    *pchTemp=NULL;
    wchar_t *pwchtemp=NULL;
    // Conversion among the three Chinese encodings
    if(CodePageCur!=CP_UTF8 && CodePageTo!=CP_UTF8)
    {
        switch(CodePageCur)
        {
        case CP_GB2312:
            if(CP_BIG5==CodePageTo)
                pchDes=GB2312ToBIG5(szSrcStr);
            else
                pchDes=GB2312ToGBK(szSrcStr);
            break;
        case CP_BIG5:
            if(CP_GB2312==CodePageTo)
                pchDes=BIG5ToGB2312(szSrcStr);
            else
                pchDes=BIG5ToGBK(szSrcStr);
            break;
        case CP_GBK:
            if(CP_GB2312==CodePageTo)
                pchDes=GBKToGB2312(szSrcStr);
            else
                pchDes=GBKToBIG5(szSrcStr);
            break;
        }
    }
    else
    {
        // From UTF-8 to another multi-byte encoding: convert straight to GB2312,
        // and use GB2312 as the intermediate form for the other targets
        if(CP_UTF8==CodePageCur)
        {
            pwchtemp=MByteToWChar(CodePageCur,szSrcStr);
            if(CP_GB2312==CodePageTo)
            {
                pchDes=WCharToMByte(CP_GB2312,pwchtemp);
            }
            else
            {
                pchTemp=WCharToMByte(CP_GB2312,pwchtemp);
                if(CP_GBK==CodePageTo)
                    pchDes=GB2312ToGBK(pchTemp);
                else
                    pchDes=GB2312ToBIG5(pchTemp);
            }
        }
        // From another multi-byte encoding to UTF-8
        else
        {
            if(CP_GBK==CodePageCur)
            {
                pchTemp=GBKToGB2312(szSrcStr);
                pwchtemp=MByteToWChar(CP_GB2312,pchTemp);
            }
            else
                pwchtemp=MByteToWChar(CodePageCur,szSrcStr);
            pchDes=WCharToMByte(CodePageTo,pwchtemp);
        }
    }
    delete[] pchTemp;
    delete[] pwchtemp;
    return pchDes;
}
```
```cpp
// Code page that corresponds to an encoding type
UINT Coder::GetCodePage(TextCode tccur)
{
    UINT CodePage=CP_ACP; // defined value for the types that have no code page here
    switch(tccur)
    {
    case GB2312:
        CodePage=CP_GB2312;
        break;
    case BIG5:
        CodePage=CP_BIG5;
        break;
    case GBK:
        CodePage=CP_GBK;
        break;
    case UTF8:
        CodePage=CP_UTF8;
        break;
    case UNICODEBIGENDIAN:
    case UNICODE:
        break;
    }
    return CodePage;
}
// Set the number of bytes converted per chunk
void Coder::SetDefaultConvertSize(UINT nCount)
{
    if(nCount!=0)
        PREDEFINEDSIZE=nCount;
}
```
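The result of converting GB2312 to GBK is shown in the figure below:
[Figure: GB2312 to GBK conversion result]
The result of converting UTF-8 to Big5 is shown in the figure below:
[Figure: UTF-8 to Big5 conversion result]
The code for this article and the conversion program can be downloaded from: http://download.csdn.net/user/ziyuanxiazai123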
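4. Open problems
(1) I do not yet fully understand LCMapString. It takes quite a few parameters, and understanding them requires some background knowledge.
(2) Why does Notepad's output contain some garbled characters after conversion, and is that garbling actually correct? My program goes through an intermediate form, so it produces no garbled characters at all.
(3) Whether there is a simpler and clearer way to implement the conversions remains to be investigated.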