Author Topic: "Screen Scraper" question (Read 2276 times)

Tuoni · « **on:** April 22, 2008, 07:26:19 AM »

I need to retrieve all the TÜV certificate numbers, and the models to which those certificates apply, for a certain company. The website I've found is http://tuvamerica.com/tools/clientlists/certs.cfm and I was wondering if the following thought process is a ~~logical~~ possible one:

My software connects to the website
It enters the company name in the search field (e.g. "ABC") and clicks search
For each search result it gets back, it spawns a thread which follows the link to the details page, enters each model number into a hashmap, returns that hashmap and terminates
Parent thread amalgamates the hashmaps (no problems here, all model numbers are unique)
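
The fan-out and merge steps above can be sketched roughly as follows. This is only a minimal sketch in Python, not a finished scraper: `collect_models`, `fetch_models` and `fake_fetch` are hypothetical names, and the real download-and-parse step is replaced by a stand-in so the example runs without touching the website. It assumes each worker returns a dict mapping model number to certificate number.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_models(detail_links, fetch_models, max_workers=8):
    """Fetch every detail page in a worker thread and merge the results.

    fetch_models is a caller-supplied function (hypothetical here) that
    takes one detail link and returns a dict of model number -> certificate.
    """
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch_models concurrently and yields one dict per link.
        for models in pool.map(fetch_models, detail_links):
            # Model numbers are assumed unique, so a plain update is enough.
            merged.update(models)
    return merged


def fake_fetch(link):
    """Stand-in for the real download-and-parse step (assumption)."""
    cert = link.rsplit("/", 1)[-1]
    return {f"{cert}-model-A": cert, f"{cert}-model-B": cert}


if __name__ == "__main__":
    links = ["details/CERT1", "details/CERT2"]
    print(collect_models(links, fake_fetch))
```

A thread pool is used here instead of one thread per result so that a long result list does not open an unbounded number of connections to the server at once.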

I know I can grab the page render programmatically, but is what I'm thinking of possible?

I could build the hashmap by hand, I know (even as an XML/Excel file which can be edited later), but I would like a way to do it programmatically, as once I release this software I want it to be, as much as possible, a black box (though I have taken into account URL changes, website layout changes etc.).

Am I barking up the wrong tree here? Attached are a few screen shots I've taken from the website (note this was just the first company which came up on the list, not the one I'm working for...)

Tuoni · « **Reply #1 on:** April 23, 2008, 02:12:24 AM »

I've come up with a solution which I think should work - I have managed to work out the query that gets sent to the server to render the page, so I will go about it by chucking it a list of certificate numbers and grabbing the server's response for each.
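
A rough sketch of that approach, in Python, might look like the following. The actual query is not given in this thread, so the parameter name `cert` is purely a placeholder, as is the assumption that it is a plain GET request against the URL from the first post; `build_query_url` and `fetch_responses` are hypothetical helper names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# URL taken from the first post; whether the real query targets this
# exact page is an assumption.
BASE_URL = "http://tuvamerica.com/tools/clientlists/certs.cfm"
# Placeholder parameter name: the real one is not stated in the thread.
CERT_PARAM = "cert"


def build_query_url(cert_number, base_url=BASE_URL, param=CERT_PARAM):
    """Build the request URL for one certificate number."""
    return f"{base_url}?{urlencode({param: cert_number})}"


def fetch_responses(cert_numbers, opener=urlopen):
    """Request each certificate in turn and keep the raw response body."""
    responses = {}
    for cert in cert_numbers:
        with opener(build_query_url(cert), timeout=30) as resp:
            responses[cert] = resp.read()
    return responses
```

The raw bodies returned by `fetch_responses` would still need parsing to pull out the model numbers, and that part depends entirely on the page layout.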
