Scraping existing content


agua

New member
Jun 24, 2006
OK - keep in mind that I don't have good PHP skills :)

I have around 100 pages, each with a list of URLs on it (around 20), which I want to scrape... is this possible, and if so, how?


If you have Ruby installed, install the hpricot gem (gem install hpricot) and try the following.

Code:
require 'rubygems'
require 'hpricot'

# Parse the HTML file into an Hpricot document.
doc = open('yourfile.html') { |f| Hpricot(f) }

# Grab every anchor tag with the given class.
links = doc.search("//a[@class='LinkClass']")

# Print each link's href attribute.
links.each { |link| print "#{link.attributes['href']}\n" }
If your links don't have a specific class, or you want all of the links, change the doc.search line to the following:

Code:
links = doc.search("//a")
Or, for PHP:

Code:
<?php
// Read the whole page into one string.
$page = file_get_contents("00004.html");

// Match the href value of every anchor tag.
$urlpattern = '/<a[^>]+href="([^"]+)"/i';
preg_match_all($urlpattern, $page, $matches);

// $matches[1] holds the captured URLs.
print_r($matches[1]);
?>
I've left the loop over each of the files to you, but if you need a hand with that too, let me know.
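
As a starting point, here's a minimal sketch of that loop. It assumes your 100 pages sit together in one directory as .html files - the "pages/*.html" pattern is a placeholder you'll need to adjust - and runs the same regex over each file, collecting everything into one array.

Code:
<?php
// A sketch of the missing loop: scrape every .html file in a directory.
// "pages/*.html" is a placeholder pattern - point it at your files.
$urlpattern = '/<a[^>]+href="([^"]+)"/i';
$all_urls = array();

foreach (glob("pages/*.html") as $filename)
{
    // Read this page and pull out its hrefs.
    $page = file_get_contents($filename);
    preg_match_all($urlpattern, $page, $matches);

    // Add this page's URLs to the running list.
    $all_urls = array_merge($all_urls, $matches[1]);
}

print_r($all_urls);
?>
From there you could write $all_urls out to a file, or de-duplicate it with array_unique() if the same link appears on several pages.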