PHP, XML, and Character Encodings [repost]

2007-03-02 11:37:44 / Category: LAMP

http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Geekism


Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
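One way to watch the target encoding do its job is to parse a document that contains a numeric entity and look at the bytes that come back. This is only a minimal sketch: the sample document and variable names are made up for illustration, and it assumes a PHP build with the standard xml extension enabled.

```php
<?php
// Hypothetical sample: a numeric entity for "é" (U+00E9), no encoding declaration.
$xml = '<title>caf&#233;</title>';

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

// Collect the parsed elements so the character data can be inspected.
$values = [];
$ok = xml_parse_into_struct($parser, $xml, $values);

// With a UTF-8 target, the entity should arrive as the two bytes C3 A9.
$title = $values[0]['value'];
echo bin2hex($title), "\n";
```

Dumping the value as hex sidesteps whatever charset the terminal or browser happens to be using.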

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCIDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCIDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
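The same check can be wrapped in a small function for reuse. The sketch below is illustrative rather than Magpie's actual code: the function name is hypothetical, the pattern is a slightly stricter variant of the one above, and it assumes PHP 7 or later and a prolog written in an ASCII-compatible encoding.

```php
<?php
// Hypothetical helper: return the encoding named in the XML prolog,
// upper-cased, or "UTF-8" (the XML default) when none is declared.
function sniff_xml_encoding(string $xml): string
{
    $rx = '/<\?xml[^>]*\bencoding\s*=\s*[\'"]([^\'"]+)[\'"][^>]*\?>/';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

echo sniff_xml_encoding('<?xml version="1.0" encoding="big5"?><rss/>'), "\n";
echo sniff_xml_encoding('<rss/>'), "\n";
```

Restricting the match to characters other than the closing angle bracket keeps the pattern from wandering past the prolog into the feed body.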

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence. All the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  $parser = xml_parser_create($encoding);
} else {
  $encoded_source = NULL;  // stays NULL if mbstring is missing or the conversion fails

  if (function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if ($encoded_source != NULL) {
    $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
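For anyone who wants to try the conversion step on its own, here is a rough sketch of it as a standalone function. It is not the code that shipped in FoF or Magpie: the function name is hypothetical, it assumes PHP 7 or later (where an unknown encoding name may throw instead of returning false), and it only copes with ASCII-compatible source encodings.

```php
<?php
// Hypothetical helper: return $source re-encoded as UTF-8 with a matching
// prolog, or unchanged when no conversion is needed or possible.
function feed_source_to_utf8(string $source): string
{
    $rx = '/<\?xml[^>]*\bencoding\s*=\s*[\'"]([^\'"]+)[\'"][^>]*\?>/';
    if (!preg_match($rx, $source, $m)) {
        return $source; // no declaration: XML default is UTF-8
    }
    $encoding = strtoupper($m[1]);
    if (in_array($encoding, ['UTF-8', 'US-ASCII', 'ISO-8859-1'], true)) {
        return $source; // the parser handles these natively
    }
    if (!function_exists('mb_convert_encoding')) {
        return $source; // mbstring not installed: give up
    }
    try {
        $converted = @mb_convert_encoding($source, 'UTF-8', $encoding);
    } catch (\Throwable $e) {
        return $source; // newer PHP throws on an unknown encoding name
    }
    if (!is_string($converted) || $converted === '') {
        return $source; // older PHP returns false on failure
    }
    // Swap the old prolog for one that matches the converted bytes.
    return str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $converted);
}
```

The result can then be handed to a parser created with xml_parser_create("UTF-8"), as in the code above.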



Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
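As a rough check of the auto-detect path, the sketch below feeds the parser a document that declares ISO-8859-1 and inspects what comes out. The sample document and variable names are invented for illustration, and the behavior assumed is that of a current PHP build with the xml extension, not the RC3 build described in this footnote.

```php
<?php
// Hypothetical sample: the byte 0xE9 is "é" in ISO-8859-1.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>caf\xE9</title>";

// Empty string: let the parser work out the source encoding itself.
$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

$values = [];
$ok = xml_parse_into_struct($parser, $xml, $values);

// Print the character data as hex to see the bytes the parser produced.
$title = $values[0]['value'];
echo bin2hex($title), "\n";
```

If the single byte E9 comes back as the pair C3 A9, the parser both detected the declared encoding and honored the UTF-8 target.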

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.


64 Feedbacks on "PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss"

Phil Ringnalda

If you really want to cover all the bases, don’t just look for mb_convert_encoding(): you can also look for (and rarely find) iconv() and recode(). Then, on *nix, try to shell_exec(’iconv…’), though you’ll fail in safe mode and most shared hosts disable shell_exec. But, there’s still one last hope! Most of them don’t realize that they should also disable the strange and terrible proc_open(), so you can actually fork a process to run iconv, and open input and output pipes to feed and read.

Or, sigh, write ten lines of Python to call Mark’s Universal Feed Parser and return the output as a PHP include. Sometimes, PHP really ticks me off.
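The first part of that fallback order might look something like the sketch below. The function name is hypothetical, it assumes PHP 7.1 or later, and it only tries mbstring and iconv; the recode and shell-based fallbacks mentioned above are left out.

```php
<?php
// Hypothetical helper: try each available converter in turn and return the
// UTF-8 result, or null when nothing on this system can do the conversion.
function convert_to_utf8(string $data, string $from): ?string
{
    if (function_exists('mb_convert_encoding')) {
        try {
            $out = @mb_convert_encoding($data, 'UTF-8', $from);
            if (is_string($out)) {
                return $out;
            }
        } catch (\Throwable $e) {
            // Unknown encoding name on newer PHP: fall through to iconv.
        }
    }
    if (function_exists('iconv')) {
        $out = @iconv($from, 'UTF-8', $data);
        if ($out !== false) {
            return $out;
        }
    }
    return null;
}
```

Returning null rather than the original bytes lets the caller decide whether to hand the unconverted feed to the parser anyway.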


steve

OOooooo… I didn’t know about those!

Or I could just create a web service: http://xml-transcoder.com/?to=utf-8&url=http://nasty.feed/in/EBCIDIC/or/some/such. Then you’d just seamlessly subscribe to the transcoded version of the feed.

isis

good job and funny title.

steve

Thanks, isis. I used your feed (since it’s big5) as one of the test cases!

zonble

Steve. May you permit me to translate your post into Chinese and share it with Chinese readers? I think there might be lots of people in Asia who would like to know how to handle international characters while programming in PHP.

steve

Yes! You certainly can! If you can wait until Monday, you can post your translation of this article along with a pointer to Feed on Feeds 0.1.7 which will contain the working code.

Mark Wu

Hi Steve:

I just plan to integrate FoF into pLog, and this tale really helped me learn a lot about PHP’s stupid encoding …

Regards, mark

Mark Wu

Only one thing, my ISP does not support iconv … my god … What can I do?

steve

Uh oh… how about ‘mbstring’? If you don’t have iconv or mbstring, then you will only be able to work with feeds that are in UTF-8, ISO-8859-1, or ASCII.

Thomas Clavier

for me it isn’t a good idea to search for the charset in the xml header. it’s better to search in the http header, because according to the w3c you can use the xml charset only if the charset isn’t specified in the http header.

http://www.w3.org/TR/WD-html40-970708/charset.html

steve

To really get it right you have to check both. In practice, I haven’t yet found any feeds where checking the headers is necessary or even helpful.

Mark Wu

Hi Steve:

Thanks, it works. I already asked my ISP to install iconv …. It works.. now!!

And, I just use the hacked magpie-rss that shipped with FoF. With it, I can let pLog support the encoding, and convert GB & BIG5 to UTF without changing any code … really thanks!!

Please go to my site to see the result, you really helped me a lot!! ^-^.

RSS Feeds by Site ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedulebysite

RSS Feeds by Time ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedule

And, may you allow me to submit my code with your hacked magpie-rss version to pLog ??

Thanks!

Regards, Mark

steve

Mark: Great news! The code is GPL, so you can share it with anybody you like, as long as they follow the terms of the GPL as well. The author of Magpie is looking at these changes now, and refining them, and they may be included in a future “official” version of MagpieRSS.

Anonymous

god. this is a boring blog.
possibly, because i’m stupid and know nothing of php encoding.
but for all our sakes, go to a party and get hammered.

David

Thomas makes a good point, but it’s even worse than that–not only are there character encodings specified by both HTTP and the XML prolog, but either or both of them could be completely wrong. I would imagine there are many feeds served as ISO-8859-1 that are actually windows-1252 (or whatever it is) and they just don’t happen to contain any of the characters that are different between the two formats, yet. When one of them does, your code might handle it, or maybe it’ll start spitting out gibberish again. And if someone gets UTF-8 and UTF-16 mixed up, I think you could wind up with everything shifted off by a byte, and now the feed is completely unreadable again.

Basically, until we can convince everyone to use UTF-8/16/32 for everything, this will be a bloody pain in the arse. It looks like PHP just makes it even more painful. I would second Phil’s recommendation: use Mark’s Universal Feed Parser, and when something breaks, just get him to fix it. :-D

steve

At this point I’m still in the “get it right when the feed is right” stage. FoF still doesn’t even do that, all the time. Once I’ve got that one licked, I may move on to “get it right even when the feed doesn’t”.

So Much Geek, So Little Time » Reject Incorrectly encoded Pingbacks

[…] ject Incorrectly encoded Pingbacks Filed under: Life - unteins @ 12:11 pm This article has some info about […]

Mark Wu

Hi Steve:

I wrote a blog about pLog RSSFetcher Plug-in. Thanks for such good work.

http://blog.markplace.net/index.php?op=ViewArticle&articleId=119&blogId=1

Regards, Mark

Harry Fuecks

Great post. Many thanks. Seems many php developers are largely oblivious to character encoding issues. Looking at some of the feed generation libraries out there, seems it’s a similar story - last time I looked only RssWriter pays any attention to UTF8. The author is Portuguese I believe, which may be why…

Gives me some ideas for further features for HTMLSax - right now it shouldn’t (not that I’ve tested carefully) choke on anything but also doesn’t support the user by taking care of encoding issues.

inertia

hi Steve,

I heard about this good tool from isis, and installed it on shared hosting to test. One thing I can’t figure out: after reading all the docs carefully and confirming with my ISP that mbstring and iconv are both compiled in, I still can’t fetch big5 blogs, e.g. isis’s blog. Where do you think I may have gone wrong?

I know this problem is quite “ambiguous”, say, the ISP didn’t compile things well, or some installation step went wrong, but I also wish to get some ideas from you.

regards
inertia

Peter Van Dijck's Guide to Ease

Steve Minutillo :: messy-78  PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss: a must read if you’re wrestling with PHP, XML and character encodings!…

MeriBlog : Meri Williams' Weblog

Multitasking Mice Fertilitiy
Interesting article over at OK/Cancel all about multitasking PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss - worth reading just for the title ;-) Very comprehensive guidebook to developing with web standards I love the…

Grace and peace to you! » 2004 » June

[…] Grace and peace to you!

» PHP, XML, and Character Encodings: a […]

Pete Prodoehl

I too must suggest that using Mark’s Universal Feed Parser would be a good idea. In fact, since the process of harvesting the feeds, parsing them, and storing them in MySQL can be separated out from the whole UI/reading part of things, this is a great suggestion. I know it sort of makes fof a weird combo PHP/Python app, but it could be an option for those of us who don’t have a problem with that.

LinuxBrit

PHP, XML, and Character Encodings
Man, PHP can be really unbelievably stupid sometimes :(

Elaine

yipes…I started taking a look at the problem when I was mucking about with a personal variant of FoF and could never quite figure out what was going on. I always thought it was a problem with Magpie, but I had no idea it went that deep. thanks for all your work on FoF, and for sharing all the gory details!

Simon Jessey

Multibyte string functions are not part of the PHP default install, so many webhosts do not include them - my own webhost refused to add them. Just thought I’d let you know.

Dominic Mitchell

What about UTF-16? Your regex won’t be able to pick up the XML declaration then…

I love spanners. :-)

-Dom

steve

inertia: I’m lucky enough to be on a host where mbstring and iconv are both included and work perfectly. I’ve actually never even compiled PHP myself, so I don’t really know what can go wrong there, other than the obvious “check phpinfo()”

Simon: I know. There’s nothing I can really do about that. In the next version of FoF the installer will inspect your system and tell you what it finds, so at least this will be less confusing.

Dominic: I knew there’d be a hitch somewhere. Are you sure? Have you tried it with UTF-16? If it doesn’t work, is there any workaround?

David

Thanks, excellent post. Good to know someone’s working so we don’t have to :)

snapping links » lookin’ good

[…] n into the problem with character encoding (darn curly quotes)…so I went back to look at what Steve Minutillo had to say abou […]

Andrew

Thanks for your post. It’s unfortunate that php’s support for i18n is so poor here. For what it’s worth (probably not much), EBCDIC is not spelled with an extra ‘I’, even though people pronounce it as if it did. Man, I hope nobody sends out news feeds in EBCDIC O:-).

车东Blog^2

Lilina: building a personal portal with an RSS aggregator (Write once, publish anywhere)
While collecting RSS parsing tools recently, I found MagpieRSS and Lilina, which is built on top of it. Lilina’s main features:

1. Web-based RSS management: add, delete, OPML export, background RSS caching (to avoid putting too much load on the source servers), ScriptLet…

车东BLOG

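Analysis of parsing UTF-8 and GBK RSS feeds in MagpieRSS: character-oriented programming in PHP explained
My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a careful look at how rss_fetch is used in Lilina: the most important thing is to set the RSS output format to ’MAGPIE_O…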

grace

how to do it even without using XML? I just want my page to be able to key in Chinese characters, submit them to mysql in code form, then display the Chinese characters on the page.

What is the testing script which I can test for this?
Thank you

steve

grace: I don’t have any links to tutorials on that, but what you’re talking about is fairly easy to do. In fact, the free weblog engine that powers this site, Wordpress, is capable of doing just that. You could download Wordpress and examine how international characters are handled.

Valery's Mindlog

Sascha Carlins Linkdump

This page has been linkdumped
Charsets…

Andy

I’m developing a library called mbstring emulator which emulates the mbstring functions. I already published mbstring emulator for Japanese (supports Shift_JIS, EUC-JP, UTF-8), and now I’m working on a western language version (iso-8859-1 and utf-8). I’d like to know what encodings you need.

steve

BIG-5 would be nice.

ryh

Good job, I’m using this on my site.

+CMS

Andy your mbstring emulator is perfect. thx

Charl47

Hi steve
I’m using magpierss 7.0. Where exactly should I put your syntax? I don’t understand all of your explanation; I’m a newbie in RSS and I’m French, so it’s difficult for me to translate this.
Thanks.

steve

If you are using MagpieRSS 0.7, then this code is already built in! This was written before MagpieRSS 0.7 was created.

nwestwood

I’m using magpieRSS to read multiple feeds and create 1 feed with news of interest to our industry and it works, mostly. When I write out the data I retrieve from magpie, I get “not well-formed” XML, for example & symbols in the URLs that Feed Validator doesn’t like. Is there a way to get the data back and have it encoded correctly? or what do you suggest?

-thanks - Neal

steve

Magpie is working the way Feed on Feeds uses it. You could try asking on the magpierss mailing list with some more specifics on your problem.

junesnow17

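I don’t know whether the Chinese characters I type here will display.
My English is poor, so I have to write in Chinese.

For various reasons, while learning PHP I found that the PHP version I had installed was too old, so I spent two days upgrading to the latest version.

As a result, all the member names in my forum turned into gibberish @@

After reading this article the problem still wasn’t solved, and my husband ended up restoring the database to the previous version, but I am very touched that someone is paying attention to this issue.

Because of our writing system, those of us who use Chinese characters get overlooked, and when writing programs we often end up dizzy from the limitations on text. I hope the day comes soon when there is a PHP that all of our people can use.

Please don’t leave us feeling constrained at every turn.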

Jari

Hello Steve
I was looking for this script for a long time. It looks great. But I get a parse error , unexpected ‘*’ , in the line $rx=
I made a copy/paste in Dreamweaver. What’s wrong in the syntax? Thanks for helping me. Best regards

jari

Hello

I tried the magpie rss_parse.inc script that uses your post. Doing the copy with notepad I now have no syntax error. I want to read xml files in ISO or UTF-8. I tried the function php4_create_parser to test the encoding of the input files. I use the regular expression ‘//m’ with preg_match, but if I do an echo of this regular expression I get nothing. What’s wrong? How can I get the encoding of the xml files? May you help me.

Best regards

Javier

Thanks so much man, this has solved a huge problem I had. I’ve been searching for a solution for a while; parsing XML with PHP 4.x can be frustrating, especially if your XML files need to use entities for different languages (in my case, Spanish).

I’ve learned a lot with this, thanks again ^_^

Asha

Hello,
I am using PHP and MYSQL to store Japanese data. I am facing an encoding problem. I am not able to store it in a table with the proper encoding. Can anyone help me?

K

Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16

K

ok, my last comment was “mangled”

Replace all 0-bytes from the string. That’s \0 by the way
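A small sketch of that idea, using a hand-built UTF-16LE sample so it does not depend on any extension. The sample data and variable names are made up, and the trick only helps when the prolog itself contains nothing but ASCII characters.

```php
<?php
// Build a tiny UTF-16LE sample by hand: BOM, then each ASCII byte followed by a 0-byte.
$ascii = '<?xml version="1.0" encoding="utf-16"?><rss/>';
$utf16 = "\xFF\xFE" . implode("\0", str_split($ascii)) . "\0";

// Drop the 0-bytes so the ASCII-only prolog becomes visible to the regex.
$stripped = str_replace("\0", '', $utf16);

$rx = '/<\?xml[^>]*\bencoding\s*=\s*[\'"]([^\'"]+)[\'"][^>]*\?>/';
$encoding = preg_match($rx, $stripped, $m) ? strtoupper($m[1]) : "UTF-8";
echo $encoding, "\n";
```

The stripped copy is only good for sniffing the declaration; the original bytes are what should go on to the converter.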


oink

i suggest to go a little further with the regular expression in a network world full of screwed up source codes:

preg_match( “/]+encoding\s*=\s*[’”]?([\w._]+(-[\w._]+)*)[’”]?[^>]\?>/i”, $xml, $m )

it’s possible i screwed up here myself, double check advised, i didn’t test it yet.

oink

well seems like your comment page doesn’t translate “smaller than”.

steve

Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.

pepe

hola egañádos

Andrew

Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:
www.andrewzahn.com/crissa

can you tell what could cause this?
thanks

Jason Judge

Just a note on this bit:

“If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”

It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.

The reason is that a series of independent bytes is a series of bytes, but UTF-8 has strict rules in which ranges of byte values can follow certain other bytes.
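The asymmetry is easy to check with a couple of lines. This sketch uses a hypothetical helper name and leans on the fact that PCRE validates the subject string when the /u modifier is set; it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
// The /u modifier makes PCRE validate the subject as UTF-8 before matching.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // "é" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // "é" encoded as ISO-8859-1
```

The second string is a perfectly good ISO-8859-1 stream, but its lone 0xE9 byte starts a multi-byte UTF-8 sequence that never finishes.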

The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.

I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.

matt

Steve,

Great stuff here. I’m using Magpie v 7a. The parser looks to have incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).

All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.

The Yahoo feed is UTF-8. Here is the link to the actual XML file:

link

Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.

unclepiak ลุงเปี๊ยก

sound interesting ! Allow me to test the Thai language here.

Rasmus

I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.

Oliver

Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.

Matt

If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…

when you include and define your Magpie params be sure to include this line:

define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');

and in the head of your HTML document be sure (for browser compatibility and user preference consideration) to add/change the following:

(less than sign) meta http-equiv="Content-Type" content="text/html; charset=UTF-8" / (greater than sign)

of course you will replace the signs indicated in parentheses… without the parentheses… hehehe
