Help with scraping links.

parody

Hey guys,

I'm relatively new to all this and I'm scraping links on a paginated website with some 5600 links.

The page only displays 10 links at a time, so there are some 550 pages, displayed exactly like this: [1] [2] [3] .... [550].

The easy part: each page just has an additional number in the URL, page1.html, page2.html, etc.

What's the easiest way to mass edit the links in a text file?

A PHP script with a +1 to each link? A text editor? lol

I guess it seems pretty newbish, but if anyone has a hint, I'd tea bag you for life :D :conehead:
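
For reference, if the goal is just a text file holding all 550 URLs, a few lines of PHP will generate it. The base URL and output filename below are placeholders, not the real site:

PHP:
<?php
// Placeholder base URL - substitute the real site.
$baseUrl = "http://www.website.com/";

// Build page1.html through page550.html, one URL per line.
$urls = array();
for ($i = 1; $i <= 550; $i++) {
    $urls[] = $baseUrl . "page" . $i . ".html";
}

// Dump the list into a text file for whatever does the scraping.
file_put_contents("links.txt", implode("\n", $urls) . "\n");
?>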
 


Ok so this is going to be pseudo code (fetchURL doesn't exist). But I'm sure you can figure it out.
Code:
$baseUrl = "http://www.website.com/";
$data = "";
// Keep looping while the previous page's HTML links to the page we're about to fetch.
// The OR $i==1 lets the first pass through before $data holds anything.
for($i=1; $i<1000 AND (stristr($data, "page".$i.".html")!==FALSE OR $i==1); $i++)
{
    $data = fetchURL($baseUrl."page".$i.".html"); // fetchURL is the pseudo part
    //magic happens
}

1000 is just a break point. But pretty much it goes through, checks that the last page had a link to the next page (to make sure it exists), then does whatever. The OR $i==1 is because $data won't have any value the first time it hits the loop.
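
fetchURL isn't a real PHP function, so here's a minimal stand-in using cURL that the loop above could call. Treat it as a sketch; the function name is only there to match the pseudocode:

PHP:
<?php
// Hypothetical helper matching the fetchURL() call in the pseudocode above.
function fetchURL($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // hand back the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    return ($html === false) ? "" : $html; // empty string keeps the stristr() check simple
}
?>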
 
This is what I came up with to scrape links from CNET Reviews. I just wanted to see if I could scrape all the laptop links, to see if I could do it / learn...

Needless to say it shits itself! :p


PHP:
<?php 
$con = mysql_connect("localhost","root","");
if (!$con)
  {
  die('Could not connect now eat it: ' . mysql_error());
  }

function storeLink($url,$gathered_from) {
	// Escape the values so quotes in a URL don't break (or inject into) the query.
	$url = mysql_real_escape_string($url);
	$gathered_from = mysql_real_escape_string($gathered_from);
	$query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
	mysql_query($query) or die('Error, insert query failed: ' . mysql_error());
}


$userAgent = 'Boobiebot/2.1 (http://www.boobiebot.com/bot.html)';

$target_url="http://reviews.cnet.com/";
$data="";
for($i=1; $i<5 AND (stristr($data, "4566-3121_7-0-".($i+1).".html")!==FALSE OR $i==1); $i++)
{
$data = $target_url."4566-3121_7-0-".$i.".html";
}


// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$data);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/div/div/div[2]/div[3]/div/form/div/div[4]/a");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$data);
	echo "<br />Link stored: $url";
}


?>
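
Side note on the XPath: that long absolute path will break the moment CNET shuffles a div. If the goal is simply every link on the page, a looser query against the same $xpath object works; this is a general DOMXPath pattern, not anything specific to CNET's markup:

PHP:
// Grab every anchor that has an href, wherever it sits in the document.
$hrefs = $xpath->evaluate("//a[@href]");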
 
Looks like you need to put your curl stuff inside the for loop.

In the part where xmcp123 suggested looking in $data to make sure the next link actually exists (the stristr($data... part), your $data doesn't have the HTML you scraped yet. It just has the URL.
 
That's why it says OR $i==1.
The first time through, the loop will continue regardless of what's contained in $data. Every time afterward, $data will have the HTML of the page before.
 

The way you wrote it, yes. But his loop is just resetting the URL, not actually fetching the data.
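
For what it's worth, here's a rough sketch of the same script with the fetch moved inside the loop, so $data holds real HTML before the stristr() check runs on the next pass. It reuses storeLink() and the CNET URL pattern from the post above; the looser //a[@href] XPath is just to keep the example short, so treat it as a starting point rather than a drop-in fix:

PHP:
<?php
$userAgent  = 'Boobiebot/2.1 (http://www.boobiebot.com/bot.html)';
$target_url = "http://reviews.cnet.com/";
$data = "";

for ($i = 1; $i < 5 AND (stristr($data, "4566-3121_7-0-".$i.".html") !== FALSE OR $i == 1); $i++) {
    $pageUrl = $target_url . "4566-3121_7-0-" . $i . ".html";

    // The fetch has to happen inside the loop so each pass works on a new page.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $data = curl_exec($ch);
    curl_close($ch);
    if (!$data) break;

    // Parse this page and store its links, same idea as before.
    $dom = new DOMDocument();
    @$dom->loadHTML($data);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("//a[@href]");

    for ($j = 0; $j < $hrefs->length; $j++) {
        $url = $hrefs->item($j)->getAttribute('href');
        storeLink($url, $pageUrl);
        echo "<br />Link stored: $url";
    }
}
?>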