Newsgroups : Borland : borland.public.delphi.internet.winsock : 2006 May : Re: url encode to utf-8

www.cryer.info
Managed Newsgroup Archive

Re: url encode to utf-8

Subject: Re: url encode to utf-8
Posted by: "Remy Lebeau (TeamB)" (no.spam@no.spam.com)
Date: Thu, 25 May 2006 20:39:40

"ildg" <ildg@163.com> wrote in message
news:44765d4a$1@newsgroups.borland.com...

> For example, I have a string named str that contains Chinese words.
> When I use HttpEncode(str), I get this result: %BA%BA%D7%D6+%D6%D0%CE%C4.
> But what I want is:
> %E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87.

You need to UTF-8 encode the Chinese text first, and then pass the UTF-8
data, not the original Chinese data, to HttpEncode().  Don't try to
HttpEncode() the Chinese data directly.

> They are quite different, and it's generated by
> UrlEncoder.Encode(str,"utf-8") in java.

The desired HTTP output that you show with UTF-8 applied is much longer than
the text that is not UTF-8 encoded.  That tells you that extra encoding is
taking place between the original Chinese data and the final HTTP data. Here
is some proof for you.  Starting with your desired output string:

    %E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87

Decode the HTTP into 8-bit characters and you get the following bytes:

    0xE6 0xB1 0x89 0xE5 0xAD 0x97 0x20 0xE4 0xB8 0xAD 0xE6 0x96 0x87
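
The decoding step above can be checked in code rather than by hand.  Here is
a minimal sketch, assuming a pre-Unicode Delphi version (where string is
AnsiString) and HTTPDecode() from the HTTPApp unit; the program name is made
up for illustration:

    program ShowDecodedBytes;
    {$APPTYPE CONSOLE}
    uses
        SysUtils, HTTPApp;
    var
        raw: string;
        i: Integer;
    begin
        // HTTPDecode() turns '+' into a space and each %XX into one byte
        raw := HTTPDecode('%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87');
        // print every decoded byte as two hex digits
        for i := 1 to Length(raw) do
            Write('0x', IntToHex(Ord(raw[i]), 2), ' ');
        WriteLn;
    end.

The '+' in the input should show up as 0x20 in the printed byte list.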

Look at the bits and you get the following pattern:

    11100110 10110001 10001001 11100101 10101101 10010111 00100000 11100100
    10111000 10101101 11100110 10010110 10000111

That is UTF-8 encoded data!  It breaks down to the following, where the
hyphens denote the separation between UTF-8 markers and actual character
bits:

    1110-0110 10-110001 10-001001
    1110-0101 10-101101 10-010111
    0-0100000
    1110-0100 10-111000 10-101101
    1110-0110 10-010110 10-000111

Remove the UTF-8 markers (padding the single-byte space back out to 8 bits),
and you are left with the following bit pattern:

    0110110001001001
    0101101101010111
    00100000
    0100111000101101
    0110010110000111

Translate the bits into 16-bit characters and you get the following:

    $6C49$5B57$0020$4E2D$6587

So the original Chinese string '$6C49$5B57$0020$4E2D$6587' encodes to
'%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87' with UTF-8 applied.
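
The marker stripping can also be done with a few bit masks.  This is only a
rough sketch: it handles just the 1-byte and 3-byte UTF-8 forms that appear
in this example, it does not validate the continuation bytes, and the
program and constant names are made up.  In real code you would let
UTF8Decode() do this work.

    program StripUtf8Markers;
    {$APPTYPE CONSOLE}
    uses
        SysUtils;
    const
        // the bytes obtained by decoding the HTTP string
        Utf8Bytes: array[0..12] of Byte = (
            $E6, $B1, $89, $E5, $AD, $97, $20,
            $E4, $B8, $AD, $E6, $96, $87);
    var
        i, cp: Integer;
    begin
        i := 0;
        while i <= High(Utf8Bytes) do
        begin
            if (Utf8Bytes[i] and $80) = 0 then
            begin
                // 0xxxxxxx: single byte, 7 character bits
                cp := Utf8Bytes[i];
                Inc(i);
            end
            else if ((Utf8Bytes[i] and $F0) = $E0) and
                    (i + 2 <= High(Utf8Bytes)) then
            begin
                // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 character bits
                cp := ((Utf8Bytes[i] and $0F) shl 12) or
                      ((Utf8Bytes[i + 1] and $3F) shl 6) or
                      (Utf8Bytes[i + 2] and $3F);
                Inc(i, 3);
            end
            else
            begin
                // 2-byte and 4-byte forms are deliberately left out
                WriteLn('unsupported lead byte at index ', i);
                Break;
            end;
            Write('$', IntToHex(cp, 4), ' ');
        end;
        WriteLn;
    end.

Each value it prints is one 16-bit character, written the same way as the
$XXXX values in this post.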

Now, with all of that said, does your actual Chinese string match the
Unicode characters above?  If not, then there is more going on that needs to
be looked at.  What does your original Chinese string actually look like?
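
One way to answer that is to dump the 16-bit values your string really
contains and compare them with the hex values worked out earlier.  A small
sketch, where WideToHex() is a hypothetical helper and the literal is only a
placeholder for your own data:

    program DumpWideString;
    {$APPTYPE CONSOLE}
    uses
        SysUtils;

    // Returns each 16-bit code unit of S as '$XXXX' (hypothetical helper)
    function WideToHex(const S: WideString): string;
    var
        i: Integer;
    begin
        Result := '';
        for i := 1 to Length(S) do
            Result := Result + '$' + IntToHex(Ord(S[i]), 4);
    end;

    var
        str: WideString;
    begin
        // placeholder: assign the string your application actually has
        str := #$4E2D#$6587;
        WriteLn(WideToHex(str));
    end.

If the dump of your own string shows values in the $0080..$00FF range where
you expected Chinese characters, the text was probably never converted to
Unicode in the first place.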

> So I come here for help to find out if there's any delphi equivalent.

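You have already been told what to use - UTF8Encode(), for example:

    // HTTPEncode() is declared in the HTTPApp unit
    var
        str: WideString;
        http: AnsiString;
    begin
        // wide character literals for U+6C49 U+5B57 U+0020 U+4E2D U+6587
        str := #$6C49#$5B57#$0020#$4E2D#$6587;
        // UTF-8 encode first, then percent-encode the UTF-8 bytes
        http := HTTPEncode(UTF8Encode(str));
        // http is now set to '%E6%B1%89%E5%AD%97+%E4%B8%AD%E6%96%87'
    end;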

If it is not working for you, then please show your actual code.
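
If your text is sitting in an AnsiString (in the system ANSI codepage)
rather than in a WideString, one possible variation is the RTL's
AnsiToUtf8(), which goes through the system codepage for you.  This is a
sketch under that assumption, for a pre-Unicode Delphi version; the ASCII
sample text and the program name are placeholders:

    program EncodeAnsiAsUtf8;
    {$APPTYPE CONSOLE}
    uses
        SysUtils, HTTPApp;
    var
        ansi: AnsiString;
    begin
        // assumption: ansi holds text in the system ANSI codepage
        ansi := 'abc def';
        // convert to UTF-8 first, then percent-encode the UTF-8 bytes
        WriteLn(HTTPEncode(AnsiToUtf8(ansi)));
    end.

Whether this matches your situation depends on where your string comes from
and which codepage it is really in, which is why seeing your code matters.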


Gambit
