agua New member Jun 24, 2006 Feb 26, 2007 #1 OK - keep in mind that I don't have good PHP skills. I have around 100 pages which each have a list of URLs on them (around 20 per page) that I want to scrape... is this possible, and if so, how?
casidnet New member Nov 23, 2006 Feb 26, 2007 #2 If you have Ruby installed, install the hpricot gem and try the following.

Code:
require 'rubygems'
require 'hpricot'

# Parse the saved page and print the href of every link with class "LinkClass".
doc = open('yourfile.html') { |f| Hpricot(f) }
links = doc.search("//a[@class='LinkClass']")
links.each { |link| print "#{link.attributes['href']}\n" }

If your links don't have a specific class, or you want all of the links, change the doc.search line to the following:

Code:
links = doc.search("//a")

Or, for PHP:

Code:
<?php
// Read the saved page into a single string.
$file = file("00004.html");
$page = "";
for ($i = 0; $i < count($file); $i++) {
    $page .= $file[$i];
}

// Pull the href value out of every anchor tag.
$urlpattern = "/<a[^>]+href=\"([^\"]+)/i";
preg_match_all($urlpattern, $page, $matches);
print_r($matches[1]);
?>

I've left the looping code (going through each of the files) to you, but if you need a hand with that too, let me know.
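For the looping part the answer above leaves out, here is a minimal Ruby sketch. It assumes the ~100 saved pages sit in a `pages/` directory (the glob pattern and helper name are my own, not from the post), and it uses the same href regex as the PHP snippet rather than hpricot, so it needs no extra gems:

Code:
# Extract href values from an HTML string with a regex,
# mirroring the PHP preg_match_all approach above.
def extract_links(html, pattern)
  html.scan(pattern).flatten
end

# Hypothetical batch run over all saved pages; adjust the glob to taste.
url_pattern = /<a[^>]+href="([^"]+)/i
all_links = []
Dir.glob('pages/*.html').each do |filename|
  html = File.read(filename)
  all_links.concat(extract_links(html, url_pattern))
end
puts all_links

Note the regex trick is quick and dirty: it will miss single-quoted or unquoted href attributes, which is why the hpricot version is the more robust choice if you can install the gem.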