How can I zoom in and extract specific info I've scraped?
i.e. if my scraped content looks like this:
Code:
• <a title="cnn.com, news story 1" class="rightmenu" href="/video/streams/news1">Earthquake in Toronto</a><br>
• <a title="apple.com, news story 2" class="rightmenu" href="/video/streams/technews">New MacBook Pro Released</a><br>
• <a title="wickedfire.com, news story 3" class="rightmenu" href="/video/streams/gaywebmaster">WickedFire.com</a><br>
What's the best way to extract the URL and anchor text from each line?
C# + HtmlUnit / Selenium is amazing. And for hosting, one name comes to mind: Windows Azure.
Use XPath. Your expressions will be something like .//a[@class="rightmenu"] for the links and .//a[@class="rightmenu"]/@title for the titles.
You may want to add /text() to the first XPath, depending on what code you're using.
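To make that concrete, here's a minimal Python sketch of those XPath expressions using the standard-library ElementTree. The sample markup from the question is reduced to well-formed XML for the demo (color tags stripped, <br>s self-closed); for real scraped HTML you'd want something more forgiving like lxml.html or BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# The scraped sample from the question, cleaned up so it parses as XML
html = """<div>
<a title="cnn.com, news story 1" class="rightmenu" href="/video/streams/news1">Earthquake in Toronto</a><br/>
<a title="apple.com, news story 2" class="rightmenu" href="/video/streams/technews">New MacBook Pro Released</a><br/>
<a title="wickedfire.com, news story 3" class="rightmenu" href="/video/streams/gaywebmaster">WickedFire.com</a><br/>
</div>"""

root = ET.fromstring(html)

# The XPath from the post: select every <a> whose class is "rightmenu",
# then read its href attribute and its text (the anchor text)
links = [(a.get("href"), a.text) for a in root.findall(".//a[@class='rightmenu']")]

for href, text in links:
    print(href, "->", text)
```

ElementTree only supports a subset of XPath (attribute predicates like the one above work; the /@title form does not, so attributes are read with .get() instead).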
<?php
// Fetch the page source
$str = file_get_contents('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');

// Clean the markup with Tidy before parsing
$tidy = new tidy();
$tidy->parseString($str); // parseString, not parseFile: $str holds markup, not a filename
$tidy->cleanRepair();
if (!empty($tidy->errorBuffer)) {
    echo "The following errors or warnings occurred:\n";
    echo $tidy->errorBuffer;
} else {
    $str = (string) $tidy; // cast the tidy object back to a string
}

// Parse the cleaned source into a DOM tree
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);

$counter = 0;

// Create an image tag for each image source found inside a <p> tag
$mybody = $doc->getElementsByTagName('body');
foreach ($mybody as $bod) {
    $imgpara = $bod->getElementsByTagName('p');
    foreach ($imgpara as $imgp) {
        $tags2 = $imgp->getElementsByTagName('img');
        foreach ($tags2 as $tagged) {
            echo '<img src="' . $tagged->getAttribute('src') . '" style="float:left;clear:right;" /><br />';
            //echo "<br/>" . $tagged->getAttribute('href');
        }
    }
}

// Scrape content in <p> tags
echo '<style type="text/css">div{margin:15px;} img{margin-right:15px;}</style>';
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $pgraph) {
    $info = $pgraph->nodeValue; // the text content of each <p>
    $counter++;
    echo '<div id="p' . $counter . '">' . $info . '</div>';
}
?>
Ok so I have a few questions about hosting bots.
1. Would you host the bot on just a regular hosting account like hostgator?
2. Is it more effective to run the bot on your own computer and have the data uploaded to your site?
3. If you are creating database sites to drive traffic to money sites do you want them hosted separately on different accounts?
4. Also, I'm moving toward the Python way of doing this. I saw some libraries on here and just wondered if there were any more; I'm currently starting to read "Learn Python the Hard Way".
Thanks
Hmm, someone recommended disabling JavaScript, but while that works on some more basic forms, others require it to function properly, and mail.com actually hides the captcha if you don't have JavaScript on. I'm stuck.
1. You'll want something with SSH access; grab a VPS (Linode are good).
2. Bots will be much faster and much more reliable running on a server. Your home connection will be nothing compared to the speed of a server's.
3. I don't have an answer to this; I guess the more IPs the better?
4. For writing bots, check out the requests library, it's pretty fucking awesome. Also check out magicrequests to make things a little easier, but magicrequests has no documentation as I'm pretty sure it's only me and mattseh that use it.
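For anyone who hasn't used requests yet, here's a minimal sketch (the User-Agent string is made up for the example). It builds and prepares a GET request without actually sending it, which is also a handy way to inspect exactly what your bot will transmit before pointing it at a real site:

```python
import requests

# A session carries headers (and cookies) across every request a bot makes
session = requests.Session()
session.headers.update({"User-Agent": "my-bot/0.1"})

# Build a GET with query parameters, then prepare it against the session
# (this merges in the session headers) without sending anything
req = requests.Request("GET", "http://example.com/search", params={"q": "scraping"})
prepared = session.prepare_request(req)

print(prepared.url)                     # the fully-encoded URL
print(prepared.headers["User-Agent"])   # the session's User-Agent

# session.send(prepared) would actually perform the request
```

In a real bot you'd just call session.get(url, params=...) directly; preparing the request separately is mostly useful for debugging.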
Just jumped into Ruby in the past couple of weeks.
As part of a bigger project, I'm currently trying to automate signing up for an email account (mail.com, to be specific) using Watir (will be switching it to celerity when it's all finished).
One thing I'm stuck on, though, is the captchas. I can't work out how to find the ID of the captcha (the Firefox Web Developer toolbar's form details seems to pick it up, but it doesn't show in View Source), and then how to submit it to the DeathByCaptcha API using the Ruby gem (although that part may be easy; I haven't gotten to it yet).
http://www.google.com/recaptcha/api/challenge?k=6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK&ajax=1&cachestop=0.8372623089235276&lang=en
var RecaptchaState = {
site : '6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK',
rtl : false,
challenge : '03AHJ_Vusbr47xGy8OoFiOL0EckiCNl4qbWo3T71G8vPh8_dPMSlu9FNiKCBcJtIcXNPwP6BDfoYZG1oSOgbO8iyHnCS_xOhl6SPhfP1gZXlluzFNLZGMRAniUgJ-FiKmX5vI5Ta_RaEewJkByCwk7QNh6rTSQ5ZrRR92KIAUvnhxLWirnVr0VjAs',
is_incorrect : false,
programming_error : '',
error_message : '',
server : 'http://www.google.com/recaptcha/api/',
lang : 'en',
timeout : 1800
};
Recaptcha.challenge_callback();
http://www.google.com/recaptcha/api/image?c=03AHJ_Vusbr47xGy8OoFiOL0EckiCNl4qbWo3T71G8vPh8_dPMSlu9FNiKCBcJtIcXNPwP6BDfoYZG1oSOgbO8iyHnCS_xOhl6SPhfP1gZXlluzFNLZGMRAniUgJ-FiKmX5vI5Ta_RaEewJkByCwk7QNh6rTSQ5ZrRR92KIAUvnhxL
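The image URL above is just the server value plus image?c= plus the challenge token out of the RecaptchaState blob, so once you've fetched the challenge script you can pull it out with a couple of regexes. A minimal Python sketch (the challenge string here is abbreviated from the real one above):

```python
import re

# The JavaScript returned by the reCAPTCHA challenge endpoint,
# abbreviated from the RecaptchaState object shown above
js = """var RecaptchaState = {
    site : '6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK',
    challenge : '03AHJ_Vusbr47xGy8OoFiOL0Ec',
    server : 'http://www.google.com/recaptcha/api/'
};"""

def recaptcha_image_url(js_text):
    """Pull the challenge token and server out of the RecaptchaState blob
    and build the captcha image URL, the same way the /api/image?c=... link
    above is formed."""
    challenge = re.search(r"challenge\s*:\s*'([^']+)'", js_text).group(1)
    server = re.search(r"server\s*:\s*'([^']+)'", js_text).group(1)
    return server + "image?c=" + challenge

print(recaptcha_image_url(js))
```

That image URL is what you'd download and hand to a solving service like DeathByCaptcha, posting the answer back along with the same challenge token.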
<?php
// Start the output buffering (enable this to save the page at the bottom)
// ob_start();

// Here are some test URLs from Google Trends
//$str = file_get_contents('http://www.nydailynews.com/sports/olympics-2012/japan-kohei-uchimura-wins-olympic-gold-all-around-competition-american-danell-leyva-earns-bronze-article-1.1126690');
//$str = file_get_contents('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');
//$str = file_get_contents('http://www.bbc.co.uk/sport/0/olympics/18905658');
$str = file_get_contents('http://www.eonline.com/news/334647/former-olympic-gymnast-nastia-liukin-scores-beauty-of-an-endorsement-named-face-of-tigi');

$desc = $_GET["description"];
$tagging = $_GET["tags"];

// I disabled the form as it's not completed
/*
$form = '<form action="scrape.php" method="get">
    Description: <input type="text" name="description" />
    tags: <input type="text" name="tags" />
    <input type="submit" />
</form>';
*/

// Navigation
$links = '<div id="nav"><a href="">Home</a><a href="" style="margin-left:25px;">Popular News</a><a href="" style="margin-left:25px;">Next Article</a></div>';

/* These are manual descriptions/tags; comment them out if you use the form above.
 * They are the "headings" from articles, used to create "fair use" context,
 * as if you're blogging the scraped data. */
//$description = 'Well it looks like chik-fil-a is in the news regarding its CEOs stance on gay rights in America. Gay rights supporters are planning a National Same-Sex Kiss Day at Chick-fil-A';
$description = 'The olympics are raging in full force. Former U.S. gymnast has just signed on to become TIGIs newest spokesperson, promoting hair products.';

// Tags for the web page; should make this automated for common terms in the article
//$tagging = 'chick fil a, chick-fila, chick-fil-a, chickfila, equality, gay men, men kissing';
$tagging = 'olympics, publicity, TIGI, ';

/* Clean the page source with Tidy */
$tidy = new tidy();
$tidy->parseString($str); // parseString, not parseFile: $str holds markup, not a filename
$tidy->cleanRepair();
if (!empty($tidy->errorBuffer)) {
    echo "The following errors or warnings occurred:\n";
    echo $tidy->errorBuffer;
} else {
    $str = (string) $tidy; // cast the tidy object back to a string
}

/* Parse the cleaned source into a DOM tree */
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);

/* Here we start outputting the HTML for the page we WANT TO MAKE */
echo '<html><head></head><body><div id="headTop" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;">' . $links . '</div><p style="margin-left:75px;">' . $description . '</p>';
echo '<div id="wrapper" style="background:silver;width:90%;min-height:40%;margin-left:5%;">';

$counter = 0;

// Create an image tag for each image source found inside a <p> tag
$mybody = $doc->getElementsByTagName('body');
foreach ($mybody as $bod) {
    $imgpara = $bod->getElementsByTagName('p');

    // Get the first div within the body tag (alternative approach, unused)
    //$divss = $bod->getElementsByTagName('div')->item(0);
    //$imgnotpara = $divss->getElementsByTagName('img');
    //foreach ($imgnotpara as $imgn) {
    //    echo '<img src="' . $imgn->getAttribute('src') . '" style="float:left;clear:right;" /><br />';
    //}

    // Scrape images
    foreach ($imgpara as $imgp) {
        $tags2 = $imgp->getElementsByTagName('img');
        foreach ($tags2 as $tagged) {
            echo '<img src="' . $tagged->getAttribute('src') . '" style="float:left;clear:right;" /><br />';
        }
    }
}

// Echo only the third image on the page
$imgCount = 0;
$imgnotpara = $doc->getElementsByTagName('img');
foreach ($imgnotpara as $imgn) {
    $imgCount++;
    if ($imgCount > 2 && $imgCount < 4) {
        echo '<img src="' . $imgn->getAttribute('src') . '" style="float:left;clear:right;" /><br />';
    }
}

// Scrape content in <p> tags (paragraphs 2 through 19 only)
echo '<style type="text/css">div{margin:15px;} img{margin-right:15px;}</style>';
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $pgraph) {
    $info = $pgraph->nodeValue; // the text content of each <p>
    $counter++;
    if ($counter > 1 && $counter < 20) {
        echo '<div id="p' . $counter . '">' . $info . '</div>';
    }
}

echo '<p style="margin-left:15px;">tags:' . $tagging . '</p></div>';
echo '<div id="foot" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;">'
   . '<div id="otherNews" style="clear:right;width:30%;min-height:250px;border:solid 1px gray;float:right;"></div>'
   . '<div id="otherNews2" style="width:30%;min-height:250px;border:solid 1px gray;float:right;"></div>'
   . '<div id="otherNews3" style="width:30%;min-height:250px;border:solid 1px gray;"></div></div>';

// Search the doc for image tags, loop through, and echo the source of each
$bodd = $doc->getElementsByTagName('body')->item(0);
$tags = $bodd->getElementsByTagName('img');
foreach ($tags as $tag) {
    echo '<img src="' . $tag->getAttribute('src') . '" />';
    echo $tag->getAttribute('src') . "<br/>";
}

/* This checks the folder our scrape.php is in for any HTML files we've generated.
 * If we've run this script at least once, it populates a link to each one in the
 * #otherNews3 div on our new page. */
$directory = ""; // path to directory to scan
$images = glob($directory . "*.html"); // get all files with a .html extension

// Print each file name into the #otherNews3 div via JavaScript
echo '<style type="text/css"> #otherNews3 a { margin: 30px; } </style>';
echo '<script type="text/javascript">document.getElementById("otherNews3").innerHTML += "<h4>Other news:</h4><br />";</script>';
foreach ($images as $image) {
    echo '<script type="text/javascript">document.getElementById("otherNews3").innerHTML += "<a href=' . $image . '>' . $image . '</a><br />";</script>';
}

echo '</body></html>';

/* This saves the generated page above into an HTML file:
 * get the content that is in the buffer and put it in your file. */
// file_put_contents('NEWWEBPAGE' . $counter . '.html', ob_get_contents());
?>