The Botting Q&A Thread: Ask Away



How can I zoom in and extract specific info I've scraped?

i.e. if my scraped content looks like this:
Code:
<a title="cnn.com, news story 1" class="rightmenu" href="/video/streams/news1">Earthquake in Toronto</a><br>
<a title="apple.com, news story 2" class="rightmenu" href="/video/streams/technews">New MacBook Pro Released</a><br>
<a title="wickedfire.com, news story 3" class="rightmenu" href="/video/streams/gaywebmaster">WickedFire.com</a><br>

What's the best way to extract the URL and anchor text from each line?
 

Use XPath. Your expressions will be something like .//a[@class="rightmenu"] for the links and .//a[@class="rightmenu"]/@title for the titles.
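
If you're working in PHP (as in the later posts here), a minimal sketch of those expressions with DOMXPath might look like this, assuming $html already holds the scraped markup:

PHP:
<?php
// Load the scraped markup and query it with XPath
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Grab every <a class="rightmenu"> and pull out href, anchor text, and title
foreach ($xpath->query('//a[@class="rightmenu"]') as $link) {
    echo $link->getAttribute('href') . ' => ' . $link->nodeValue . "\n";
    echo 'title: ' . $link->getAttribute('title') . "\n";
}
?>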
 
C# + HTMLUnit / Selenium is amazing. And for hosting, one name comes to mind: Windows Azure :)

I assume when you say HTMLUnit, you mean IKVM?

How does it compare to Selenium? I found Selenium works well, but it can get a bit bloated and slow if I close and open the browser too many times.
 
Anyone use autohotkey?

I've used it in the past for botting in video games, but never for the web. But it works with XPath, MySQL, etc., can use regex, and makes it super simple to read/write text files, parse HTML, and so on.

I just got a decent understanding of PHP and MySQL and want to figure out how to make money aggregating shit. Like, are you guys just throwing up tons of subdomain clones of pages you scrape? Or outputting feeds and aggregated "news"-type sites? Or just grabbing information for people to query against that isn't otherwise tracked? I guess the "news"-type aggregated sites of relevant information for various niches is what I'm going to be interested in.
 
Follow up after messing around all night:

So using php I can grab a webpage as a string, and scrape out info based on tags.

Now I'm trying to figure out how to monetize it. I assume you spin the content into your own pages under a subdomain. I was considering linking up with a thesaurus and changing 5-9 words per paragraph or page.
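
For what it's worth, a naive spin like that could be sketched roughly as below. The $synonyms map here is made up for illustration; in practice you'd build it from a thesaurus file or API.

PHP:
<?php
// Naive spinning sketch: swap a handful of words per paragraph using a synonym map.
function spin($paragraph, array $synonyms, $maxSwaps = 7) {
    $swaps = 0;
    $words = explode(' ', $paragraph);
    foreach ($words as $i => $word) {
        $key = strtolower(trim($word, '.,!?"'));
        if ($swaps < $maxSwaps && isset($synonyms[$key])) {
            // Pick a random synonym for this word
            $words[$i] = $synonyms[$key][array_rand($synonyms[$key])];
            $swaps++;
        }
    }
    return implode(' ', $words);
}

// Placeholder synonym map for illustration only
$synonyms = array(
    'released' => array('launched', 'unveiled'),
    'story'    => array('report', 'article'),
);
echo spin('New MacBook Pro Released, full story inside.', $synonyms);
?>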

So now I'm thinking you scrape popular sites, like the front page of twitter or any listing of popular topics, grab the URL's and maybe anchor text from the posts, go to those pages, grab the content, spin it, post it, submit to search engines.

Pieces I have:

  • parse HTML into a string
  • search the string with a combo of regex, preg_match, and the HTML DOM parser library
  • insert into MySQL (still designing what to store and how, but it echoes out nice lists of URLs, img sources, etc.)
A couple of things I'm wondering:

  1. Ideas for dealing with local/relative URLs on sites? I guess more string logic (see the sketch below).
  2. Submitting the new pages to search engines automatically?
^ 2) I believe you can adjust how often Google crawls your site in Webmaster Tools, so maybe you don't have to do anything programmatically if you set Google to crawl that domain daily?
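
For question 1, one rough way to turn local/relative URLs into absolute ones is to resolve them against the page you scraped them from. The resolve_url helper here is just an illustration, not from anyone's actual code, and only handles the common cases:

PHP:
<?php
// Rough sketch: turn a scraped relative URL into an absolute one.
function resolve_url($base, $href) {
    if (preg_match('#^https?://#i', $href)) {
        return $href;                              // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;     // protocol-relative
    }
    if (strpos($href, '/') === 0) {
        return $root . $href;                      // root-relative
    }
    // Plain relative: append to the directory of the base URL
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $root . $dir . '/' . $href;
}

echo resolve_url('http://latimesblogs.latimes.com/lanow/2012/08/post.html', '/video/streams/news1');
// http://latimesblogs.latimes.com/video/streams/news1
?>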


So I guess now I need to figure out what database columns I need to populate a template I make for displaying the content.



Do you do things in multiple phases? Like one bot just scrapes new content, another populates the new pages, then another bot comes through and tags it or tweaks keywords, etc.?


I know you can compare a string against a dictionary. Maybe I need to make a dictionary of arrays/hashtable of Google Keyword Tool-generated keywords that go together. Then I could just append that entire array to content those keywords are relevant to.
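
Something like that might look roughly like this. The keyword groups and the hit threshold are placeholders for whatever the keyword tool spits out:

PHP:
<?php
// Sketch: keyword groups keyed by topic. If enough of a group's words appear
// in the scraped text, tag the page with that topic.
$keywordGroups = array(
    'olympics' => array('olympics', 'gymnast', 'gold', 'medal'),
    'apple'    => array('macbook', 'apple', 'ios'),
);

function matching_tags($content, array $keywordGroups, $minHits = 2) {
    $content = strtolower($content);
    $tags = array();
    foreach ($keywordGroups as $topic => $words) {
        $hits = 0;
        foreach ($words as $word) {
            if (strpos($content, $word) !== false) {
                $hits++;
            }
        }
        if ($hits >= $minHits) {
            $tags[] = $topic;
        }
    }
    return $tags;
}

print_r(matching_tags('Former Olympic gymnast wins gold...', $keywordGroups));
?>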
 
Ok some specific questions after scraping a couple sites.

If I grab all the image tags, I get all the banners and shit. I can specify only images within paragraph tags, which works on some sites but not others where their images are just in divs.

For the content I'm simply grabbing everything within <p> tags, which works great but there are some things at the beginning and end of the strings I want to get rid of, particularly links to other parts of the site I'm scraping.

PHP:
<?php
// Fetch the page we want to scrape
$str = file_get_contents('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');

// Clean the markup with Tidy before parsing it
$tidy = new tidy();
$tidy->parseString($str);
$tidy->cleanRepair();

if (!empty($tidy->errorBuffer)) {
    echo "The following errors or warnings occurred:\n";
    echo $tidy->errorBuffer;
} else {
    $str = (string) $tidy;
}

// Load the cleaned HTML into a DOM, forcing UTF-8
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);

$counter = 0;

// Create an image tag for each image source found inside <p> tags in the body
$mybody = $doc->getElementsByTagName('body');
foreach ($mybody as $bod) {
    $imgpara = $bod->getElementsByTagName('p');
    foreach ($imgpara as $imgp) {
        $tags2 = $imgp->getElementsByTagName('img');
        foreach ($tags2 as $tagged) {
            echo '<img src="' . $tagged->getAttribute('src') . '" style="float:left;clear:right;"/><br />';
            //echo "<br/>" . $tagged->getAttribute('href');
        }
    }
}

// Scrape the text content of every <p> tag
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $pgraph) {
    $info = $pgraph->nodeValue;
    $counter++;
    echo '<style type="text/css">div{margin:15px;} img{margin-right:15px;}</style><div id="' . $counter . '">' . $info . '</div>';
}
?>

But the first paragraph it finds has crap I don't want, and the 34th+ paragraphs I do not want either. Only paragraph tags #2-#34 on that particular site.

But if I hardcode for that site, others will be different. Do you randomly sample sections of content you scrape? Or save the HTML, then further process it to snip out the sections you don't want based on some criteria?
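
One crude criterion that works across sites without hardcoding paragraph numbers is to keep only paragraphs that are reasonably long and not dominated by link text. A rough sketch, reusing the $doc from the script above; the thresholds here are guesses:

PHP:
<?php
// Crude content filter: keep only <p> tags that look like article text
// (reasonably long and with a low share of link text).
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $pgraph) {
    $text = trim($pgraph->nodeValue);
    $linkText = '';
    foreach ($pgraph->getElementsByTagName('a') as $a) {
        $linkText .= $a->nodeValue;
    }
    $linkDensity = strlen($text) ? strlen($linkText) / strlen($text) : 1;
    if (strlen($text) > 80 && $linkDensity < 0.3) {
        echo '<div>' . $text . '</div>';
    }
}
?>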

Even passing it through the tidy html library I get some characters like
"That man — just like you or I — has a right to say,
 
Look into the Readability library for grabbing out a page's content. It started off as a JS lib and has now been ported to every major language.

Regarding using separate workers / scripts to do the different parts, it all depends on the scale you want to do this at. Background jobs generally aren't written in PHP, although there are libraries out there that could do this, I guess. If you aren't going large scale, just keeping it all in one script isn't really a problem. Also, rather than using file_get_contents(), look into using cURL and specifically the multi interface to it. It means you can make multiple requests simultaneously rather than sequentially. But again, it all depends on how much scale you're planning on; it probably isn't worth the effort for just the odd website.
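
A minimal sketch of the curl_multi approach (the URLs here are placeholders):

PHP:
<?php
// Fetch several pages in parallel instead of calling file_get_contents() one URL at a time.
$urls = array(
    'http://example.com/page1',
    'http://example.com/page2',
    'http://example.com/page3',
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Run all requests until every transfer has finished
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

// Collect the responses and clean up
$pages = array();
foreach ($handles as $url => $ch) {
    $pages[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>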

Regarding the weird shit at the bottom in the text, look at your encoding. I don't have much experience with encoding in PHP, but I'm sure there will be a #php channel on Freenode, and they likely know this shit like the back of their hand.
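
If the source pages aren't UTF-8, one common approach before parsing is to detect and convert the encoding with mbstring. A small sketch; the candidate encoding list is a guess:

PHP:
<?php
// Force a scraped page into UTF-8 before handing it to the DOM parser.
$encoding = mb_detect_encoding($str, array('UTF-8', 'ISO-8859-1', 'Windows-1252'), true);
if ($encoding && $encoding !== 'UTF-8') {
    $str = mb_convert_encoding($str, 'UTF-8', $encoding);
}
?>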
 
Ok so I have a few questions about hosting bots.

1. Would you host the bot on just a regular hosting account like hostgator?

2. Is it more effective to run the bot on your own computer and have the data uploaded to your site?

3. If you are creating database sites to drive traffic to money sites do you want them hosted separately on different accounts?

4. Also, I'm moving toward doing this in Python. I saw some libraries mentioned here and wondered if there are any more. I'm currently starting to read "Learn Python the Hard Way".

Thanks
 

1. You'll want something with SSH access, grab a VPS. (Linode are good).

2. Bots will be much faster running on a server, and much more reliable. Your home connection will be nothing compared to the speed of a server's.

3. I don't have an answer to this, I guess the more IPs the better?

4. For writing bots, check out the requests library, it's pretty fucking awesome. Also check out magicrequests to make things a little easier, but magicrequests has no documentation as I'm pretty sure it's only me and mattseh that use it.
 
Man I love this thread so much. So I've sort of been thinking about this for awhile, and had to ask:

For the guys who run their own services, and also obviously know how to code, what percentage of your services did you code yourself or did you outsource the entire thing and just handle the general look/feel?
 
Just jumped into Ruby in the past couple of weeks.

As part of a bigger project, I'm currently trying to automate signing up for an email account (mail.com, to be specific) using Watir (will be switching it to celerity when it's all finished).

One thing I'm stuck on, though, is the captchas. I can't work out, firstly, how to find the ID of the captcha (the Firefox Web Developer toolbar's form details seems to pick it up, but it doesn't show in view source), and secondly how to submit that to the DeathByCaptcha API using the Ruby gem (although that bit may be easy; I haven't gotten to it yet).
 
Hmm, someone recommended disabling JavaScript, but while that works on some more basic forms, others require it to function properly, and the mail.com signup page actually hides the captcha if you don't have JavaScript on. I'm stuck. :(
 
I tried using Linode to scrape EzineArticles. It returns me a captcha, LOL. I guess I'm not the only one scraping it from Linode. So if you're using Linode, you need to purchase proxies.

 

mail.com uses reCAPTCHA. It's not hard to extract the image; you just have to look in the right place.

The page makes a call out to:
Code:
http://www.google.com/recaptcha/api/challenge?k=6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK&ajax=1&cachestop=0.8372623089235276&lang=en

6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK is the site's ID; all reCAPTCHA sites have their own site ID.

If you hit that URL it'll return something like this (example):

Code:
var RecaptchaState = {
    site : '6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK',
    rtl : false,
    challenge : '03AHJ_Vusbr47xGy8OoFiOL0EckiCNl4qbWo3T71G8vPh8_dPMSlu9FNiKCBcJtIcXNPwP6BDfoYZG1oSOgbO8iyHnCS_xOhl6SPhfP1gZXlluzFNLZGMRAniUgJ-FiKmX5vI5Ta_RaEewJkByCwk7QNh6rTSQ5ZrRR92KIAUvnhxLWirnVr0VjAs',
    is_incorrect : false,
    programming_error : '',
    error_message : '',
    server : 'http://www.google.com/recaptcha/api/',
    lang : 'en',
    timeout : 1800
};

Recaptcha.challenge_callback();

The 'challenge' field is what's used to fetch the image, as you'll see from the call out here:

Code:
http://www.google.com/recaptcha/api/image?c=03AHJ_Vusbr47xGy8OoFiOL0EckiCNl4qbWo3T71G8vPh8_dPMSlu9FNiKCBcJtIcXNPwP6BDfoYZG1oSOgbO8iyHnCS_xOhl6SPhfP1gZXlluzFNLZGMRAniUgJ-FiKmX5vI5Ta_RaEewJkByCwk7QNh6rTSQ5ZrRR92KIAUvnhxL

This returns the image in question.

Note: it used to be that if you hit the image URL more than once it changed. I don't know if that's still the case, but to be safe, do it all in one pass.

To submit the captcha successfully, you must also keep track of the cookies you used when requesting the image.
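
Putting that together, here's a rough PHP/cURL sketch of the flow described above, using the site key from the example. The old reCAPTCHA endpoints may no longer behave exactly like this, and the submission to the solving service isn't shown:

PHP:
<?php
// Sketch: fetch the reCAPTCHA challenge and image while keeping cookies.
$siteKey   = '6LdKsrwSAAAAAHjmh-jQNZ7zskPDs1gsY-WNXAKK';
$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

function fetch($url, $cookieJar) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);   // keep cookies across requests
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// 1. Hit the challenge endpoint and pull the 'challenge' token out of the JS blob
$js = fetch('http://www.google.com/recaptcha/api/challenge?k=' . $siteKey . '&ajax=1&lang=en', $cookieJar);
if (preg_match("/challenge\s*:\s*'([^']+)'/", $js, $m)) {
    $challenge = $m[1];

    // 2. Fetch the captcha image for that challenge (one request only, as noted above)
    $image = fetch('http://www.google.com/recaptcha/api/image?c=' . $challenge, $cookieJar);
    file_put_contents('captcha.jpg', $image);

    // 3. From here you would send captcha.jpg to your solving service (e.g. DeathByCaptcha)
    //    and post the answer back using the same cookies.
}
?>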
 
Damn, I haven't even played with PHP for months. Here's what I ended up making for scraping a website and making a new webpage with the info. Some sites it would scrape ads and shit, so you have to adjust where it begins and ends scraping.


PHP:
<?php
// Start the buffering
// ob_start();
?>
<?php
// Here are some test URLs from Google Trends
//$str = file_get_contents('http://www.nydailynews.com/sports/olympics-2012/japan-kohei-uchimura-wins-olympic-gold-all-around-competition-american-danell-leyva-earns-bronze-article-1.1126690');
//$str = file_get_contents('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');
//$str = file_get_contents('http://www.bbc.co.uk/sport/0/olympics/18905658');
$str = file_get_contents('http://www.eonline.com/news/334647/former-olympic-gymnast-nastia-liukin-scores-beauty-of-an-endorsement-named-face-of-tigi');

$desc = $_GET["description"];
$tagging = $_GET["tags"];

// I disabled the form as it's not completed
/*
$form = '<form action="scrape.php" method="get">
Description: <input type="text" name="description" />
tags: <input type="text" name="tags" />
<input type="submit" />
</form>';
*/

// Navigation
$links = '<div id="nav"><a href="">Home</a><a href="" style="margin-left:25px;">Popular News</a><a href="" style="margin-left:25px;">Next Article</a></div>';

/* These are manual description/tags, comment out if you use the form above.
   These are the "headings" or whatever from articles to create "fair use" context like you're blogging the scraped data */
//$description = 'Well it looks like chik-fil-a is in the news regarding its CEOs stance on gay rights in America. Gay rights supporters are planning a National Same-Sex Kiss Day at Chick-fil-A';
$description = 'The olympics are raging in full force. Former U.S. gymnast has just signed on to become TIGIs newest spokesperson, promoting hair products.';

// Tags generated for the web page, should make it automated for common terms in the article
//$tagging = 'chick fil a, chick-fila, chick-fil-a, chickfila, equality, gay men, men kissing';
$tagging = 'olympics, publicity, TIGI, ';

/* Here we grab the webpage source as xml */
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);

/* Clean it */
$tidy = new tidy();
$tidy->parseString($str);
$tidy->cleanRepair();

if (!empty($tidy->errorBuffer)) {
    echo "The following errors or warnings occurred:\n";
    echo $tidy->errorBuffer;
} else {
    $str = (string) $tidy;
}

/* Here we start outputting the HTML for the page we WANT TO MAKE */
echo '<html><head>' . $nav . '</head><body><div id="headTop" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;">' . $links . '</div><p style="margin-left:75px;">' . $description . '</p>';
echo '<div id="wrapper" style="background:silver;width:90%;min-height:40%;margin-left:5%;">';

$counter = 0;

// Create an image for each image source we have
$mybody = $doc->getElementsByTagName('body');
foreach ($mybody as $bod) {
    $imgpara = $bod->getElementsByTagName('p');

    // Get the first div within the body tag
    $divss = $bod->getElementsByTagName('div')->item(0);
    //$imgnotpara = $divss->getElementsByTagName('img');
    //foreach ($imgnotpara as $imgn) {
    //    echo '<img src=' . $imgn->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
    //}

    // Scrape images inside <p> tags
    foreach ($imgpara as $imgp) {
        $tags2 = $imgp->getElementsByTagName('img');
        foreach ($tags2 as $tagged) {
            echo '<img src=' . $tagged->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
            //echo "<br/>" . $tagged->getAttribute('href');
        }
    }
}

$imgCount = 0;
//if (preg_match('[img]', $sql)) {
$imgnotpara = $doc->getElementsByTagName('img');
foreach ($imgnotpara as $imgn) {
    $imgCount++;
    if ($imgCount > 2 && $imgCount < 4) {
        echo '<img src=' . $imgn->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
    }
}

// Scrape content in <p> tags
$paragraphs = $doc->getElementsByTagName('p');
foreach ($paragraphs as $pgraph) {
    // For each p tag get its content
    $info = $pgraph->nodeValue;
    $counter++;
    // echo $pgraph->nodeValue, PHP_EOL;
    if ($counter > 1 && $counter < 20) {
        echo '<style type="text/css">div{margin:15px;} img{margin-right:15px;}</style><div id=' . $counter . '>' . $info . '</div>';
    }
}

echo '<p style="margin-left:15px;">tags:' . $tagging . '</p></div><div id="foot" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;"><div id="otherNews" style="clear:right;width:30%;min-height:250px;border:solid 1px gray;float:right;"></div><div id="otherNews2" style="width:30%;min-height:250px;border:solid 1px gray;float:right;"></div><div id="otherNews3" style="width:30%;min-height:250px;border:solid 1px gray;"></div></div>';
echo '</body></html>';

// Search the doc for image tags, loop through and echo the source of each img
$bodd = $doc->getElementsByTagName('body')->item(0);
$tags = $bodd->getElementsByTagName('img');
//$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
    echo '<img src=' . $tag->getAttribute('src') . '/>';
    echo $tag->getAttribute('src') . "<br/>";
}

/* This checks the folder that our scrape.php is in and looks for any html files we've generated;
 * if we've run this script at least once, it should populate a link to each one in the bottom
 * left div tag on our new page */
// Path to directory to scan
$directory = "";

// Get all files with a .html extension
$images = glob($directory . "*.html");
$otherNews1 = $doc->getElementById('otherNews');

// Print each file name
echo '<script type="text/javascript">var ON3 = document.getElementById("otherNews3").innerHTML+="<h4>Other news:</h4><br />";</script>';
foreach ($images as $image) {
    echo '<script type="text/javascript">var txt = document.getElementById("otherNews3").innerHTML+="<a href=' . $image . '>' . $image . '</a><br />";</script><style type="text/css"> #otherNews3 a {margin:30px;} </style>';
    //$otherNews1.innerHTML += '<a href=' . $image . '>' . $image . '</a><br />';
}
?>
<?php
/* This saves the generated page above into an html file */
// Get the content that is in the buffer and put it in your file
// file_put_contents('NEWWEBPAGE' . $counter . '.html', ob_get_contents());
?>