Newsgroups : Borland : borland.public.delphi.internet.winsock : 2006 May : Re: url encode to utf-8

www.cryer.info
Managed Newsgroup Archive

Re: url encode to utf-8

Subject:Re: url encode to utf-8
Posted by:"ildg" (il..@163.com)
Date:25 May 2006 22:34:23

"Remy Lebeau (TeamB)" <no.spam@no.spam.com> wrote:
>
>"ildg" <ildg@163.com> wrote in message
>news:44765d4a$1@newsgroups.borland.com...
>
>> For example, I have a string named str that contains Chinese words.
>> When I use HttpEncode(str), I get this result: %BA%BA%D7%D6+%D6%D0%CE%C4.
>> But what I want is:
>> %E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87.
>
>You need to UTF-8 encode the Chinese text first, and then pass the UTF-8
>data, not the original Chinese data, to HttpEncode().  Don't try to
>HttpEncode() the Chinese data directly.
>
>> They are quite different, and it's generated by
>> URLEncoder.encode(str, "utf-8") in Java.
>
>The desired HTTP output that you show with UTF-8 applied is much longer than
>the text that is not UTF-8 encoded.  That tells you that extra encoding is
>taking place between the original Chinese data and the final HTTP data. Here
>is some proof for you.  Starting with your desired output string:
>
>    %E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87
>
>Decode the URL encoding into 8-bit values and you get the following bytes:
>
>    0xE6 0xB1 0x89 0xE5 0xAD 0x97 0x20 0xE4 0xB8 0xAD 0xE6 0x96 0x87
>
>Look at the bits and you get the following pattern:
>
>    11100110 10110001 10001001 11100101 10101101 10010111 00100000
>    11100100 10111000 10101101 11100110 10010110 10000111
>
>That is UTF-8 encoded data!  It breaks down to the following, where the
>hyphens denote the separation between UTF-8 markers and actual character
>bits:
>
>    1110-0110 10-110001 10-001001
>    1110-0101 10-101101 10-010111
>    0-0100000
>    1110-0100 10-111000 10-101101
>    1110-0110 10-010110 10-000111
>
>Remove the UTF-8 markers, and you are left with the following bit pattern:
>
>    0110110001001001
>    0101101101010111
>    00100000
>    0100111000101101
>    0110010110000111
>
>Translate the bits into 16-bit characters and you get the following:
>
>    $6C49$5B57$0020$4E2D$6587
>
>So the original Chinese string '$6C49$5B57$0020$4E2D$6587' encodes to
>'%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87' with UTF-8 applied.
>
>Now, with all of that said, does your actual Chinese string match the
>Unicode characters above?  If not, then something else is going on that
>needs to be looked at.  What does your original Chinese string actually
>look like?
>
>> So I come here for help to find out if there's any delphi equivalent.
>
>You have already been told what to use: UTF8Encode(), for example:
>
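>    var
>        str: WideString;
>        http: AnsiString;
>    begin
>        // #$XXXX is a WideChar literal; HTTPEncode() lives in the HTTPApp unit
>        str := #$6C49#$5B57#$0020#$4E2D#$6587;
>        // UTF-8 encode first, then URL-encode the resulting bytes
>        http := HTTPEncode(UTF8Encode(str));
>        // http is now set to '%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87'
>    end;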
>
>If it is not working for you, then please show your actual code.
>
>
>Gambit
>
>


Thank you, Gambit. It's OK now.
I got much more from your answer than I expected for my original problem.
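
The bit-level walkthrough quoted above can also be checked mechanically. What follows is a minimal sketch, not code from the thread: PercentDecode and CodePointsOf are hypothetical helper names, only SysUtils is used, and well-formed input is assumed (overlong sequences and surrogate ranges are not rejected). It undoes the URL encoding, strips the UTF-8 marker bits, and lists the resulting code points.

```pascal
program Utf8DecodeSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TOctets = array of Byte;  // hypothetical name for a plain byte buffer

// Undo URL encoding: '%XX' becomes one byte, '+' becomes a space.
function PercentDecode(const S: string): TOctets;
var
  I, N: Integer;
begin
  SetLength(Result, Length(S));  // upper bound, trimmed below
  N := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] = '%') and (I + 2 <= Length(S)) then
    begin
      Result[N] := StrToInt('$' + Copy(S, I + 1, 2));
      Inc(I, 3);
    end
    else
    begin
      if S[I] = '+' then
        Result[N] := $20
      else
        Result[N] := Ord(S[I]);
      Inc(I);
    end;
    Inc(N);
  end;
  SetLength(Result, N);
end;

// Strip the UTF-8 marker bits and list the code points as '$XXXX' values.
function CodePointsOf(const Bytes: TOctets): string;
var
  I, Extra, CP: Integer;
begin
  Result := '';
  I := 0;
  while I < Length(Bytes) do
  begin
    if (Bytes[I] and $80) = 0 then           // 0xxxxxxx: 1-byte sequence
    begin CP := Bytes[I]; Extra := 0; end
    else if (Bytes[I] and $E0) = $C0 then    // 110xxxxx: 2-byte sequence
    begin CP := Bytes[I] and $1F; Extra := 1; end
    else if (Bytes[I] and $F0) = $E0 then    // 1110xxxx: 3-byte sequence
    begin CP := Bytes[I] and $0F; Extra := 2; end
    else if (Bytes[I] and $F8) = $F0 then    // 11110xxx: 4-byte sequence
    begin CP := Bytes[I] and $07; Extra := 3; end
    else
      raise Exception.Create('Invalid UTF-8 lead byte');
    Inc(I);
    while Extra > 0 do
    begin
      // Every continuation byte must look like 10xxxxxx.
      if (I >= Length(Bytes)) or ((Bytes[I] and $C0) <> $80) then
        raise Exception.Create('Invalid UTF-8 continuation byte');
      CP := (CP shl 6) or (Bytes[I] and $3F);
      Inc(I);
      Dec(Extra);
    end;
    if Result <> '' then
      Result := Result + ' ';
    Result := Result + '$' + IntToHex(CP, 4);
  end;
end;

begin
  // The desired output string from the original question.
  WriteLn(CodePointsOf(PercentDecode('%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87')));
  // prints: $6C49 $5B57 $0020 $4E2D $6587
end.
```

Running the same two helpers over the non-UTF-8 string from the question (%BA%BA%D7%D6+%D6%D0%CE%C4) raises an exception at the first byte, since $BA is not a valid UTF-8 lead byte. That is a quick way to tell that the string was encoded from the ANSI code page rather than from UTF-8.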
