Help with scraping links.

parody

Hey guys,

I'm relatively new to all this and I'm scraping links on a paginated website with some 5600 links.

The page only displays 10 links at a time, so there are some 550 pages, displayed exactly like this: [1] [2] [3] .... [550].

The easy part: each page just has an additional number in the URL, page1.html, page2.html, etc.

What's the easiest way to mass edit the links in a text file?

A PHP script with a +1 to each link? A text editor? lol

I guess it seems pretty newbish, but if anyone has a hint, I'd tea bag you for life :D :conehead:
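
For reference, if the goal is just a text file holding all 550 URLs, a few lines of PHP will generate it. The base URL and output filename below are placeholders, not the real site:

PHP:
<?php
// Placeholder base URL - substitute the real site.
$baseUrl = "http://www.website.com/";

// Build page1.html through page550.html, one URL per line.
$urls = array();
for ($i = 1; $i <= 550; $i++) {
    $urls[] = $baseUrl . "page" . $i . ".html";
}

// Dump the list into a text file for whatever does the scraping.
file_put_contents("links.txt", implode("\n", $urls) . "\n");
?>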
 


Ok so this is going to be pseudo code (fetchURL doesn't exist). But I'm sure you can figure it out.
Code:
$baseUrl = "http://www.website.com/";
$data = "";
// Keep looping while the previous page's HTML links to the page we're about to fetch.
// The OR $i==1 lets the first pass through before $data holds anything.
for($i=1; $i<1000 AND (stristr($data, "page".$i.".html")!==FALSE OR $i==1); $i++)
{
    $data = fetchURL($baseUrl."page".$i.".html"); // fetchURL is the pseudo part
    //magic happens
}

1000 is just a break point. But pretty much it goes through, checks that the last page had a link to the next page (to make sure it exists), then does whatever. The OR $i==1 is because $data won't have any value the first time it hits the loop.
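
fetchURL isn't a real PHP function, so here's a minimal stand-in using cURL that the loop above could call. Treat it as a sketch; the function name is only there to match the pseudocode:

PHP:
<?php
// Hypothetical helper matching the fetchURL() call in the pseudocode above.
function fetchURL($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // hand back the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    return ($html === false) ? "" : $html; // empty string keeps the stristr() check simple
}
?>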
 
This is what I came up with to scrape links from CNET Reviews. I just wanted to see if I could scrape all the laptop links, to see if I could do it / learn...

Needless to say it shits itself! :p


PHP:
<?php 
$con = mysql_connect("localhost","root","");
if (!$con)
  {
  die('Could not connect now eat it: ' . mysql_error());
  }

function storeLink($url,$gathered_from) {
	// Escape the values so quotes in a URL don't break (or inject into) the query.
	$url = mysql_real_escape_string($url);
	$gathered_from = mysql_real_escape_string($gathered_from);
	$query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
	mysql_query($query) or die('Error, insert query failed: ' . mysql_error());
}


$userAgent = 'Boobiebot/2.1 (http://www.boobiebot.com/bot.html)';

$target_url="http://reviews.cnet.com/";
$data="";
for($i=1; $i<5 AND (stristr($data, "4566-3121_7-0-".($i+1).".html")!==FALSE OR $i==1); $i++)
{
$data = $target_url."4566-3121_7-0-".$i.".html";
}


// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$data);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
	echo "<br />cURL error number:" .curl_errno($ch);
	echo "<br />cURL error:" . curl_error($ch);
	exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/div/div/div[2]/div[3]/div/form/div/div[4]/a");

for ($i = 0; $i < $hrefs->length; $i++) {
	$href = $hrefs->item($i);
	$url = $href->getAttribute('href');
	storeLink($url,$data);
	echo "<br />Link stored: $url";
}


?>
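
Side note on the XPath: that long absolute path will break the moment CNET shuffles a div. If the goal is simply every link on the page, a looser query against the same $xpath object works; this is a general DOMXPath pattern, not anything specific to CNET's markup:

PHP:
// Grab every anchor that has an href, wherever it sits in the document.
$hrefs = $xpath->evaluate("//a[@href]");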
 
Looks like you need to put your curl stuff inside the for loop.

In the part where xmcp123 suggested looking in $data to make sure the next link actually exists (the stristr($data... part), your $data doesn't have the HTML you scraped yet. It just has the URL.
 
That's why it says OR $i==1.
The first time through, the loop will continue regardless of what's contained in $data. Every time afterward, $data will have the HTML of the page before.
 

The way you wrote it, yes. But his loop is just resetting the URL, not actually fetching the data.
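
For what it's worth, here's a rough sketch of the same script with the fetch moved inside the loop, so $data holds real HTML before the stristr() check runs on the next pass. It reuses storeLink() and the CNET URL pattern from the post above; the looser //a[@href] XPath is just to keep the example short, so treat it as a starting point rather than a drop-in fix:

PHP:
<?php
$userAgent  = 'Boobiebot/2.1 (http://www.boobiebot.com/bot.html)';
$target_url = "http://reviews.cnet.com/";
$data = "";

for ($i = 1; $i < 5 AND (stristr($data, "4566-3121_7-0-".$i.".html") !== FALSE OR $i == 1); $i++) {
    $pageUrl = $target_url . "4566-3121_7-0-" . $i . ".html";

    // The fetch has to happen inside the loop so each pass works on a new page.
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $data = curl_exec($ch);
    curl_close($ch);
    if (!$data) break;

    // Parse this page and store its links, same idea as before.
    $dom = new DOMDocument();
    @$dom->loadHTML($data);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("//a[@href]");

    for ($j = 0; $j < $hrefs->length; $j++) {
        $url = $hrefs->item($j)->getAttribute('href');
        storeLink($url, $pageUrl);
        echo "<br />Link stored: $url";
    }
}
?>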