Newsgroups : Borland : borland.public.delphi.internet.winsock : 2006 May : Re: parse HTML result

www.cryer.info
Managed Newsgroup Archive

Re: parse HTML result

Subject:Re: parse HTML result
Posted by:"Ralf Junker - http://www.yunqa.de/delphi/" (delphi.at.yunqa.dot...@)
Date:Sat, 27 May 2006 18:31:18

Hello Bob,

you can easily build your own custom HTML spider with the help of some Delphi
components:

* An HTTP protocol component, as found in Indy, ICS, or Synapse.
* An HTML parser, such as DIHtmlParser.

After you have downloaded the HTML document, you use DIHtmlParser
(http://www.yunqa.de/delphi/htmlparser/) to extract your custom contents from
the page. DIHtmlParser returns individual HTML tokens and fully supports Unicode
and up to 130 different character encodings.
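
To make the download-then-tokenize idea concrete, here is a minimal sketch. It is written in Python with the standard library only, purely as an illustration of the approach; it is not Delphi code and does not show the DIHtmlParser API. The names LinkCollector, fetch and extract_links are made up for this example, and the UTF-8 default encoding is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Walks the HTML token stream and records (href, link text) pairs."""

    def __init__(self):
        super().__init__()
        self.links = []    # collected (href, text) tuples
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url, encoding="utf-8"):
    """Download a page and decode it (the encoding is an assumption)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode(encoding, errors="replace")


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

For a real page you would call extract_links(fetch(url)) and then repeat the same two steps for each detail page you decide to follow.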

Extracting contents is simple if the information is in a standard format. There
are various plugins extending DIHtmlParser to make your life easier. Have a look
at TDIHtmlGoogleReader (http://www.yunqa.de/delphi/googlereader/), which
extracts search results from Google pages, as an advanced example. If you would
like to outsource the parsing job, please contact me via private e-mail.
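
As for following only the links you want ("if such word in URL then read, otherwise don't"), the rule itself is small. A rough sketch, again in plain Python rather than Delphi, with select_urls and its include/exclude parameters being hypothetical names chosen for this example:

```python
from urllib.parse import urljoin


def select_urls(base_url, hrefs, include=(), exclude=()):
    """Resolve hrefs against base_url and keep only the wanted ones.

    A URL is kept if it contains at least one word from `include`
    (or `include` is empty) and no word from `exclude`.
    Matching is case-insensitive; duplicates are dropped.
    """
    selected = []
    seen = set()
    for href in hrefs:
        url = urljoin(base_url, href)  # turn relative links into absolute ones
        lowered = url.lower()
        if include and not any(word.lower() in lowered for word in include):
            continue
        if any(word.lower() in lowered for word in exclude):
            continue
        if url not in seen:
            seen.add(url)
            selected.append(url)
    return selected
```

The spider then only downloads what select_urls returns for each listing page, which keeps the crawl limited to the article detail pages.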

Regards,

Ralf

"Bob Bedford" <bob@bedford.com> wrote:

>We've a commercial website where we have many sport articles from various
>real shops.
>
>The main purpose of the site is to provide to our clients a way to sell
>their goods without the need to have a complex website like the one we are
>building, and also use the improvements we will do in the future. The site
>is written in PHP.
>
>Some of our clients already have their own (quite basic) websites, hopefully sold
>by the same company with the same structure: a table which contains 10
>articles per page (with many pages if more than 10 articles) and each line
>with a link to the article description. Those clients have asked us to get
>content from their website so that they do not have to enter the values twice
>manually. So we must create a "robot" that goes to a web page (the main page
>of the articles), gets all articles from the page (getting content from the detail
>page linked from the article page) and then reads the values we must use to
>fill in their articles in our database.
>
>I've already worked with Indy and TBrowser (which doesn't always work well),
>but the results aren't very good. I mean sometimes the page isn't even
>loaded and there is no error code.
>
>What I want is a way to get the pages from any URL, giving the names of the
>links I want and those I don't want (if such a word is in the URL then read it,
>otherwise don't). I've found WinHTTrack, which is free and open source (but I
>don't know much about C++). I'd like to use such a tool, but it can't read
>images, for example (the site encodes the URL so that there is no .jpg file;
>it's an ASP page that generates the image), nor can I customize the robot.
>Having our own solution will give us full control.
>
>Is there any source code I can get to start building such a personalized
>program? It's essentially a website copier where I can customize what I want
>and what I don't. I must have source code for it in order to customize the way
>it works. I only want a starting point, not necessarily full-featured working
>code. I'll add my own features.

---
The Delphi Inspiration
http://www.yunqa.de/delphi/

Replies:

none
