Tuesday, May 4, 2010

PHP: How to convert ISO character (HTMLEntities) to UTF-8?

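Recently I have been writing an HTML crawler for the Amazon.com Bookstore,

and I found out that Amazon.com's pages are not all encoded in UTF-8.

As a result, the data I fetched (ISO-encoded) could not be stored in a database whose default encoding is UTF-8.

So I consulted the almighty Google and got the following answer: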

I previously faced a problem converting strings from ISO to UTF-8. Due to a server configuration problem, all UTF-8 characters were converted into ISO (HTML entities) before being inserted into the database, and those ISO characters (HTML entities) broke when shown in an XML document. I have now found a solution for converting ISO characters into UTF-8.

The original UTF-8 Chinese words: "你好"
Converted to ISO (HTML entities): "&#20320;&#22909;"

To convert the ISO string "&#20320;&#22909;" (HTML numeric character references) back into the UTF-8 Chinese words "你好", you need the Multibyte String functions (the mbstring extension). The example below uses the mb_convert_encoding function to convert ISO (HTML entity) characters to UTF-8.

<?php
$str = "&#20320;&#22909;"; // numeric HTML entities for "你好"
echo mb_convert_encoding($str, 'UTF-8', 'HTML-ENTITIES'); // prints 你好
?>
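
As far as I know, the 'HTML-ENTITIES' pseudo-encoding used above has been deprecated in mbstring since PHP 8.2, so here is a minimal sketch of an alternative built on html_entity_decode. It also covers the crawler case from the start of this post, where the fetched bytes themselves are ISO-encoded. The helper names entities_to_utf8 and latin1_to_utf8 are hypothetical (they are not from the quoted answer), and ISO-8859-1 as the page charset is an assumption; check the page's actual charset first.

```php
<?php
// Hypothetical helper (not from the quoted answer): decode HTML entities,
// including numeric ones such as &#20320;, straight into UTF-8.
function entities_to_utf8(string $text): string
{
    // ENT_QUOTES also decodes quote entities; ENT_HTML5 selects the HTML5 entity table.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Hypothetical helper: re-encode raw bytes to UTF-8.
// Assumption: the fetched page was served as ISO-8859-1.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

echo entities_to_utf8('&#20320;&#22909;'), "\n"; // 你好
echo latin1_to_utf8("caf\xE9"), "\n";            // café
```

The two helpers solve different problems: the first undoes entity escaping in text that is already valid, while the second changes the byte encoding of the text itself. A crawler may need both, in which case re-encoding the bytes first and decoding the entities second seems the safer order.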
