The Botting Q&A Thread: Ask Away

Fuck me.


It outputs something like this:
mtnj1s.jpg


Code:
<?php
// Start the buffering //
ob_start();
?>

<?php

//various test websites
//$str = file_get_contents('http://www.nydailynews.com/sports/olympics-2012/japan-kohei-uchimura-wins-olympic-gold-all-around-competition-american-danell-leyva-earns-bronze-article-1.1126690');
//$str = file_get_contents('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');

//$str = file_get_contents('http://www.bbc.co.uk/sport/0/olympics/18905658');
$str = file_get_contents('http://www.eonline.com/news/334647/former-olympic-gymnast-nastia-liukin-scores-beauty-of-an-endorsement-named-face-of-tigi');

//$str = file_get_contents('http://forums.darkfallonline.com/forumdisplay.php?f=13');

/*
$desc = $_GET["description"];
$tagging = $_GET["tags"];

$form= '<form action="scrape.php" method="get">
Description: <input type="text" name="description" />
tags: <input type="text" name="tags" />
<input type="submit" />
</form>';
echo $form;
echo $desc;
echo $tagging;
*/

//navigation
$links = '<div id="nav"><a href="">Home</a><a href="" style="margin-left:25px;">Popular News</a><a href="" style="margin-left:25px;">Next Article</a></div>';

//Make it look like you're blogging the story. Should spin recurring keywords instead; this is just a prototype.
//$description = 'Well it looks like chik-fil-a is in the news regarding its CEOs stance on gay rights in America. Gay rights supporters are planning //a National Same-Sex Kiss Day at Chick-fil-A';
$description = 'The Olympics are raging in full force. Former U.S. gymnast Nastia Liukin has just signed on to become TIGI\'s newest spokesperson, promoting hair products.';

//$tagging = 'chick fil a, chick-fila, chick-fil-a, chickfila, equality, gay men, men kissing';
$tagging = 'olympics, publicity, TIGI, ';

$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);

$tidy = new tidy();
$tidy->parseString($str); // parseString(), not parseFile() - $str holds HTML, not a filename
$tidy->cleanRepair();

if(!empty($tidy->errorBuffer)) {
    echo "The following errors or warnings occurred:\n";
    echo $tidy->errorBuffer;
}
else {
    $str = (string) $tidy; // cast back to a string, not the tidy object itself
}


echo  '<html><head></head><body><div id="headTop" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;">'.$links.'</div><p style="margin-left:75px;">'.$description.'</p>';
echo  '<div id="wrapper" style="background:silver;width:90%;min-height:40%;margin-left:5%;">';

$counter = 0;
//create an image for each image source we have
$mybody = $doc->getElementsByTagName('body'); 

foreach ($mybody as $bod){

$imgpara = $bod->getElementsByTagName('p');
//$divss = $bod->getElementsByTagName('div')->item(0);
$divss = $bod->getElementsByTagName('div')->item(0);
//$imgnotpara = $divss->getElementsByTagName('img');
//foreach($imgnotpara as $imgn ){
//echo  '<img src=' . $imgn->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
//}

foreach($imgpara as $imgp){
    $tags2 = $imgp->getElementsByTagName('img');
        foreach ($tags2 as $tagged) {
        echo  '<img src=' . $tagged->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
        //echo  "<br/>" . $tag->getAttribute('href') ; 
        }
    
    }
    
}

$imgCount = 0;
//if(preg_match('[img]',$sql)){
$imgnotpara = $doc->getElementsByTagName('img');
foreach($imgnotpara as $imgn ){
$imgCount++;
if($imgCount >2 && $imgCount < 4){
echo  '<img src=' . $imgn->getAttribute('src') . ' style="float:left;clear:right;"/><br />';
}
}

//Scrape content in <p> tags
$paragraphs = $doc->getElementsByTagName('p');

foreach ($paragraphs as $pgraph) {

$info = $pgraph->nodeValue;
$counter++;
   // echo $pgraph->nodeValue, PHP_EOL; 
  if($counter > 1 && $counter < 20){
   echo '<style type="text/css">div{margin:15px;} img{margin-right:15px;}</style><div id="p' . $counter . '">' . $info . '</div>';
    }
}
echo  '<p style="margin-left:15px;">tags:'.$tagging.'</p></div><div id="foot" style="border:solid 1px silver;width:90%;min-height:10%;margin-left:5%;"><div id="otherNews" style="clear:right;width:30%;min-height:250px;border:solid 1px gray;float:right;"></div><div id="otherNews2" style="width:30%;min-height:250px;border:solid 1px gray;float:right;"></div><div id="otherNews3" style="width:30%;min-height:250px;border:solid 1px gray;"></div></div>';
echo '</body></html>';
/*
//Search the doc for image tags, loop through and echo the source of each img
$bodd = $doc->getElementsByTagName('body')->item(0);
$tags = $bodd->getElementsByTagName('img');
//$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo  '<img src=' . $tag->getAttribute('src') . '/>';
echo  $tag->getAttribute('src') . "<br/>" ; 

}
//if (preg_match('/http/i', $str, $matches)) {
//echo "matched";
//}
*/


//path to directory to scan
$directory = "";
 
//get all previously generated .html pages
$images = glob($directory . "*.html");
$otherNews1 = $doc->getElementById('otherNews');
//print each file name
echo '<script type="text/javascript">var ON3 = document.getElementById("otherNews3").innerHTML+="<h4>Other news:</h4><br />";</script>';
foreach($images as $image)
{

echo '<script type="text/javascript">var txt = document.getElementById("otherNews3").innerHTML+="<a href=' . $image . '>'.$image.'</a><br />";</script><style type="text/css"> #otherNews3 a {margin:30px;  } </style>';
//$otherNews1.innerHTML+='<a href=' . $image . '>'.$image.'</a><br />';
}


//-----------------------
/*
$dom = new DOMDocument;
$dom->loadHTMLFile('http://latimesblogs.latimes.com/lanow/2012/08/chick-fil-a-kissing-protest-gladd.html');

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body//img');

foreach($nodes as $node) {
echo  '<img src=' . $node->getAttribute('src') . '/>';
    printf(
        'Element %s - Image: %s%s',
        $node->nodeName,
        $node->getAttribute('src'),
        PHP_EOL
    );
}
*/
//---------------------------



/*
$paras = $doc->getElementsByTagName('p');
foreach ($paras as $par) {
echo "<br /> $par  ";
//echo  "<br/>" . $tag->getAttribute('href') ; 
}
*/



/*
// The "i" after the pattern delimiter indicates a case-insensitive search
if (preg_match_all('/<img[^>]+>/i', $str, $matches)) {
    echo "A match was found.";
    echo $matches;
    list($link) = split('[<]', $str);
echo "Month: $link; <br />\n";
} else {
    echo "A match was not found.";
}
*/

?>
<?php

// Get the content that is in the buffer and put it in your file //
file_put_contents('chickFila'.$counter.'.html', ob_get_contents());
?>
 


That's some terrible code, even by PHP standards...
 
Fuck me.

It outputs something like this:

What the fuck is this...

Here is a function I used a lot when writing bots in PHP (I removed all the fancy stuff because I'm sure it would just confuse you).

Code:
    /**
     *  Very simple cURL scraper
     */
    function scraper($url)
    {
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($c, CURLOPT_AUTOREFERER, true);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($c);
        curl_close($c); // free the handle when done
        return $html;
    }
Have a look at it, learn from it and go look up the curl documentation.
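For instance, it can stand in for the file_get_contents() call in the script further up (same target URL, nothing else has to change):

```php
<?php
// Hypothetical usage sketch: fetch with scraper() instead of file_get_contents()
$str = scraper('http://www.eonline.com/news/334647/former-olympic-gymnast-nastia-liukin-scores-beauty-of-an-endorsement-named-face-of-tigi');
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $str);
?>
```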
 
lol 2/3 of what I posted is commented out. If you remove the comments there's not much extra....

It grabs a page, finds the body, finds the tags inside the body, grabs the main image and description, and makes a new HTML page...

And when was Chick-fil-A in the news, a year ago? I spent about two weeks with PHP after needing it for a highscore board, got distracted for several days, and went back to better shit.
 
Jesus you're very supportive in this forum!

shindig's code works, that's all that matters, job done. That's why I love PHP.

Every time I have to code scrapers in Java or Objective-C, I pine for the simplicity of PHP.
 
Thx man, everyone's a badass on the internet. My JavaScript and C# are much better than my PHP. The thing I like about PHP is that I can make a new text file on Bluehost, write some code, save, and run; it's very "functional", and there's lots of documentation for just about anything you want to do. And working with MySQL is quick; it gives a lot of bang for the buck.
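To show what I mean by "quick" with MySQL, something like a highscore read is only a few lines (the credentials and the highscores table here are made up for illustration):

```php
<?php
// Minimal sketch of PHP + MySQL: connect, query, loop.
// 'localhost', 'user', 'pass', 'game' and the highscores table are placeholders.
$db = new mysqli('localhost', 'user', 'pass', 'game');
$result = $db->query('SELECT name, score FROM highscores ORDER BY score DESC LIMIT 10');
while ($row = $result->fetch_assoc()) {
    echo $row['name'] . ': ' . $row['score'] . "<br />";
}
$db->close();
?>
```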

Here's some simple but clean js I just whipped up to play some animated sprite sheets/tilesets.
Code:
var elephant = { //Elephant
    id:'elephantLeft',
    x:0, //X pos to draw on screen
    y:550, //Y pos to draw on screen
    width:256, //Size of frame/tile on sprite sheet
    height:256, 
    src:'http://mysite.com/spriteImgs/elephant2.png', 
    numFrames:6, //How many frames on the spritesheet/tileset
    startFrame:1, //Frame to start with
    pSpeed: 100  //Speed to play animation
};

function MakeSprite(obj){  //Constructor
    var Sprite = {              //Sprite Object
        init: function(){      
            setInterval(Sprite.update, obj.pSpeed);            
        },
            name:obj.id,
            posX:obj.x,
            posY:obj.y,
            width:obj.width,
            height:obj.height,
            imgSrc: obj.src,
            numFrames: obj.numFrames,
            startFrame: 1,
            update: function(){
                var topb = 0;
                var leftb = Sprite.width * Sprite.startFrame; // shift one frame-width per tick
                $('#' + Sprite.name).css('backgroundPosition', '-' + leftb + 'px -' + topb + 'px');
                Sprite.startFrame++;
                if(Sprite.startFrame > Sprite.numFrames){
                    Sprite.startFrame = 1; // loop back to the first frame
                }
            }
        };
        var newdiv = document.createElement('div');
        newdiv.setAttribute('id', Sprite.name); 
        newdiv.style.position="absolute";
        newdiv.style.width = Sprite.width+'px';
        newdiv.style.height = Sprite.height+'px';
        $(newdiv).css("z-index","100");            
        newdiv.style.left = Sprite.posX+'px';        
        newdiv.style.top = Sprite.posY+'px';                    
        newdiv.style.backgroundImage = "url('"+Sprite.imgSrc+"')";
        document.body.appendChild(newdiv);
        Sprite.init();        
};
Then you can just call MakeSprite(elephant); from the HTML page.

This is a snippet from a sprite generator app I made last week. It uses the alpha channel to remove the background and isolate objects as they run through animations, then packs the sequence of individual frames into a single atlas.
Code:
IEnumerator ScreenshotEncode(){
    toggleGUI();
    theModel.animation.Play(animToPlay);
    for(int i = 0; i < frameNum; i++){
        yield return new WaitForEndOfFrame();
        Texture2D texture = new Texture2D(Screen.width, Screen.height, TextureFormat.ARGB32, false);
        texture.ReadPixels(new Rect(0, 0, Screen.width, Screen.height), 0, 0);
        texture.Apply();
        TextureScale.Point(texture, mWidth, mHeight);
        texture.Apply();
        RemoveColor(colorToRemove, texture);
        if(texture != null){
            frameArray[i] = texture;
        }
        yield return new WaitForSeconds(delayBetweenShots);
        packedTexture.PackTextures(frameArray, 0, 1024);
        byte[] atlasBytes = packedTexture.EncodeToPNG();
        File.WriteAllBytes(Application.dataPath + "/../textureAtlas-" + count + ".png", atlasBytes);
        count++;
    }
    toggleGUI();
}

Texture2D RemoveColor(Color c, Texture2D imgs){
    Color[] pixels = imgs.GetPixels(0, 0, imgs.width, imgs.height, 0);
    for(int p = 0; p < pixels.Length; p++){
        if(pixels[p] == c){
            pixels[p] = new Color(0, 0, 0, 0); // make the matched pixel transparent
        }
    }
    imgs.SetPixels(0, 0, imgs.width, imgs.height, pixels, 0);
    imgs.Apply();
    return imgs;
}
And it works like this: http://www.youtube.com/watch?v=4kY2Ag-Oj-w (Animated Sprite Sheet Generator I made - YouTube)

Makes images like this (each frame can be as big as your screen size; I just used 256x256px when I generated it):
skeleEastDown-1024x146.png
 
Not sure if I'm asking in the right thread and section. Feel free to move my post if it doesn't fit here...

A few questions:

- First, and mainly, I'm looking for a good explanation (a tutorial or ebook would work too) of how bots, scraping, automation, etc. relate to SEO. I know the technological part pretty well (e.g. I speak several programming languages, can write bots and scrapers, etc.) and understand a bit of SEO, but not enough to connect all the dots.
- Secondly, I consider scraping other people's content, mass-creating accounts, etc. at least a gray area that could potentially lead to legal issues. So my question: what security measures do you take, and how do they play together? Do you use proxy servers for all requests? Do you only use free hosts that don't require payments? Do you disguise your IP/identity when uploading your scripts/files to your server?
 
- First, and mainly, I'm looking for a good explanation (a tutorial or ebook would work too) of how bots, scraping, automation, etc. relate to SEO. I know the technological part pretty well (e.g. I speak several programming languages, can write bots and scrapers, etc.) and understand a bit of SEO, but not enough to connect all the dots.

I haven't really kept up with this kind of stuff since 2008 but I imagine it's the same. Here are some SEO examples:

* Crawling for places to drop backlinks (solving captchas, registering for forums, writing comments, vandalizing Wikipedia).
* Scraping content to rotate on a blog that exists as trash for bottom-tier backlinks.
* Scraping recipes from some website and feeding them into your own database so your recipe site has more content and gains search-engine clout.

Links are still the backbone of search engine algorithms and they're easy to programmatically generate.

Bots are just some code and an http client in a loop.
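In PHP terms, that loop can be as dumb as this (the target URLs are made up, and the "do something" step is whatever your bot is for):

```php
<?php
// The "http client in a loop" idea, stripped to the bone.
// example.com URLs are placeholders for whatever you're crawling.
$targets = array('http://example.com/page1', 'http://example.com/page2');
foreach ($targets as $url) {
    $html = file_get_contents($url); // the http client
    if ($html === false) {
        continue; // skip pages that fail to load
    }
    // ...do something with $html: scrape it, repost it, queue links from it...
    sleep(1); // don't hammer the server
}
?>
```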

- Secondly, I consider scraping other people's content, mass creating accounts, etc. at least a gray area that could potentially lead to legal issues.

I imagine the majority of bot/automated activity is driven by really dull things:

* registering accounts on dead ezboard forums
* scraping livejournal entries that turn up in a search for "corned beef" and then reposting them on your corned beef affiliate link money site
* making spammy twitter accounts

And it's not like the links they're dropping are to high-brow business web presences but instead to faceless urls like "http://viagra.nxu.ru/i282" and "http://corned-beef-thyroid-cure.blogspot.com".

In other words, nobody is busting out digital forensics to track down the corned beef bandit. And instead of trying to hunt down twitter account automators, Twitter just improves its detection algorithms.

A lot of the above is really just based on intuition though. I've never had to do any screen scraping or mechanical traversal in a serious product.
 
Does anyone have any recommendations on buying proxies? Total n00b here

How does one implement a proxy in your scraper? I'm learning cURL in php right now...
 


Ok, so I figured out the curl_setopt() calls for proxy tunneling.

My question about recommended proxies still stands though.
Also, another question: how do you write the code so it mimics a browser in the server logs? Insights are appreciated.
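For anyone else stuck on the same two things, here's a rough sketch of both with PHP cURL; the proxy address and the user-agent string below are placeholders, swap in your own:

```php
<?php
// Minimal sketch: route a request through a proxy and send a browser-like
// User-Agent so the request looks like a normal browser in the server logs.
// proxy.example.com:8080 is a placeholder - use a proxy you actually have.
function scrapeViaProxy($url, $proxy = 'proxy.example.com:8080')
{
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_PROXY, $proxy);
    // If the proxy requires authentication:
    // curl_setopt($c, CURLOPT_PROXYUSERPWD, 'user:pass');
    // Any real browser UA string works here; this one is just an example.
    curl_setopt($c, CURLOPT_USERAGENT,
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0 Safari/537.1');
    $html = curl_exec($c);
    curl_close($c);
    return $html;
}
?>
```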
 
Just figured I would throw out another vote for Python / Splinter. You can go headless, and the Python syntax is easy as shit to understand, with an even simpler API from the Splinter package.