Scraping existing content


agua

New member
Jun 24, 2006
OK - keep in mind that I don't have good PHP skills :)

I have around 100 pages, each with a list of URLs on it (around 20), which I want to scrape... is this possible, and if so, how?


If you have Ruby installed, install the hpricot gem (gem install hpricot) and try the following.

Code:
require 'rubygems'
require 'hpricot'

# Parse the HTML file into an Hpricot document.
doc = open('yourfile.html') { |f| Hpricot(f) }

# Grab every anchor tag with the given class.
links = doc.search("//a[@class='LinkClass']")

# Print each link's href attribute.
links.each { |link| print "#{link.attributes['href']}\n" }
If your links don't have a specific class, or you want all of the links, change the doc.search line to the following:

Code:
links = doc.search("//a")
Or, for PHP:

Code:
<?php
// Read the whole page into one string.
$page = file_get_contents("00004.html");

// Match the href value of every anchor tag.
$urlpattern = '/<a[^>]+href="([^"]+)"/i';
preg_match_all($urlpattern, $page, $matches);

// $matches[1] holds the captured URLs.
print_r($matches[1]);
?>
I've left the loop over each of the files to you, but if you need a hand with that too, let me know.
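
As a starting point, here's a minimal sketch of that loop. It assumes your 100 pages sit together in one directory as .html files - the "pages/*.html" pattern is a placeholder you'll need to adjust - and runs the same regex over each file, collecting everything into one array.

Code:
<?php
// A sketch of the missing loop: scrape every .html file in a directory.
// "pages/*.html" is a placeholder pattern - point it at your files.
$urlpattern = '/<a[^>]+href="([^"]+)"/i';
$all_urls = array();

foreach (glob("pages/*.html") as $filename)
{
    // Read this page and pull out its hrefs.
    $page = file_get_contents($filename);
    preg_match_all($urlpattern, $page, $matches);

    // Add this page's URLs to the running list.
    $all_urls = array_merge($all_urls, $matches[1]);
}

print_r($all_urls);
?>
From there you could write $all_urls out to a file, or de-duplicate it with array_unique() if the same link appears on several pages.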