Basic/NOOB tutorial on scripting web posting bots in PHP and cURL

gutterseo

Feb 27, 2009
OK, before I start: this tutorial is for complete newbies to PHP, and for people who may know a little but have no experience writing bots. If you already have programming knowledge you will probably not benefit from reading this, but I have seen people asking a lot of questions about automation recently and thought a primer might be useful for some people. In this post I am not going to give out completed code, as you'd learn nothing from that. Hopefully, though, if you follow along and use Google for anything you don't understand, you should be able to write a working bot.

OK, to start with you're going to need a web server with PHP and cURL installed. I recommend when starting out that you test scripts on your local machine. The way I do it is by running a Linux distro with a LAMP stack (Linux, Apache, MySQL, PHP). Hint: if you don't understand what I just said, Google the following things:

"Installing Apache (Linux distro)"
"Installing LAMP Apache (Linux distro)"
"Installing Curl Apache"

If you followed all those steps it should take you roughly one hour, but congratulations: you can now test all your PHP scripts on your local machine.
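
If you want to check the install actually worked, here's a quick sanity-check script. This is just a sketch; run it from a terminal with "php check.php" or drop it in your web root (the filename is whatever you like):

Code:
<?php
// prints the PHP version and whether the cURL extension is loaded
echo "PHP version: " . PHP_VERSION . "\n";

if (function_exists('curl_init')) {
    $v = curl_version();
    echo "cURL is installed, version " . $v['version'] . "\n";
} else {
    echo "cURL is NOT installed - go back over the install steps above.\n";
}
?>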

I can't be bothered to explain the very basics of PHP here. Go to PHP Tutorial - Introduction and do the first few tutorials. It won't take you more than 1-2 hours to get the basic syntax of PHP down: what strings are, basic loops, functions, how to read files, etc. This may all sound terribly complicated but it's really not. Get yourself a cup of coffee, close your other tabs and open your text editor of choice. Try messing around with the examples they give on that site and get your first scripts to do some stupid things.
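
To give you an idea of the level you need to get to, here's the sort of throwaway script you should be able to read after those tutorials. The test.txt file is just any text file you create yourself:

Code:
<?php
// strings
$name = "bot";
echo "hello " . $name . "\n";

// a basic loop
for ($i = 1; $i <= 3; $i++) {
    echo "loop number $i\n";
}

// a simple function
function shout($text)
{
    return strtoupper($text) . "!";
}
echo shout("it works") . "\n";

// reading a file line by line (create test.txt first)
foreach (file("test.txt") as $line) {
    echo trim($line) . "\n";
}
?>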

OK, now that the boring shit is out of the way, hopefully you have an idea of what web service you would like to automate. I am going to use an article submission bot as my example. The theory behind this can apply to almost any other posting bot, but for the sake of this tutorial that's what I'm gonna do.

Before you even begin to code your new project you need to get an exact idea of what it has to do. How do you get that? By doing it manually, of course. So take a pen and paper and go submit a new article. Write down everything you have to do to post the article. Your list should look something like this:

Go to Login page
Enter Username
Enter Password
Click Login
Click Submit Article
Fill out all the details associated with posting article
Click Submit Article
View the page that says article submitted

Not too many things; this looks good. Our list is a decent start, but we need to be a lot more specific if we're going to automate this with a script. So next what you're going to do is install a plugin for Firefox called Live HTTP Headers. What this does is let you see all of the information your browser is sending to the remote server.

So now open up Live HTTP Headers and do the exact same thing again, logging in and posting an article. You want to do this in two steps. First, when you log in, copy the header data from Live HTTP Headers to a text file. It will show information like this:

Code:
http://www.goarticles.com/cgi-bin/member.cgi 
 
POST /cgi-bin/member.cgi HTTP/1.1 
Host: www.goarticles.com 
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009090216 Ubuntu/9.04 (jaunty) Firefox/3.0.14 
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 
Accept-Language: en-us,en;q=0.5 
Accept-Encoding: gzip,deflate 
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 
Keep-Alive: 300 
Connection: keep-alive 
Referer: http://www.goarticles.com/ulogin.html
Cookie: CGISESSID=aseshidhere 
Content-Type: application/x-www-form-urlencoded 
Content-Length: 72 
email=myemail&password=mypw&SUBMIT=Login
This is just the very topmost section of data in Live HTTP Headers; normally it's all you need. From it you can learn what page the data gets sent to and by what method. In this case it's POST, and beside that is the destination URL. You can also see that a cookie gets set (more on this later). Notice too that email and password aren't the only data being sent: there is also "&SUBMIT=Login". That field is hidden when the form is viewed in a browser, but it must still be sent.
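
Side note: if you'd rather not build that string by hand and worry about URL encoding, PHP's http_build_query() does it for you. A quick sketch using the field names from the capture above (your target site's names may differ):

Code:
<?php
$fields = array(
    'email'    => 'myemail@example.com',
    'password' => 'mypw',
    'SUBMIT'   => 'Login',  // the hidden field - leave it out and the login fails
);
$poststring = http_build_query($fields);
echo $poststring; // email=myemail%40example.com&password=mypw&SUBMIT=Login
?>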

Before I show a sample script for doing this I need to briefly touch on cURL. cURL is great: it takes care of cookies and posting the data. It also lets you set the user agent (the browser/OS you're using) and the referrer (the page the data is being sent from).

In all my time using cURL I have rarely had to modify the cURL functions I use. If you did the tutorials I mentioned above, you should know what a function is and how to call it. Below are the cURL functions I use. This is not my own code; it is copied from Harry, who ran the darkseoprogramming blog (now closed). There are two functions in the code below. The first is used when you need to POST data, and the second just requests a page (useful when you want to check whether something submitted correctly).

Code:
<?php
// Reusable cURL helpers: post() sends form data, scrape_page() just
// fetches a page. Both share a cookie file so your login session persists.
function post($page, $fields)
{
    // cookie path - this directory must exist and be writable
    $file_cookie = "cookies/cookies.tmp";
    $referer = "copy your referer from Live HTTP Headers here";
    $ch = curl_init($page);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);

    $response = curl_exec($ch);
    //echo curl_error($ch); // uncomment to debug - must run before curl_close()
    curl_close($ch);

    return $response;
}

function scrape_page($page)
{
    // cookie path
    $file_cookie = "cookies/cookies.tmp";
    $referer = "copy your referer from Live HTTP Headers here";
    $ch = curl_init($page);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");

    $response = curl_exec($ch);
    //echo curl_error($ch); // uncomment to debug - must run before curl_close()
    curl_close($ch);

    return $response;
}
?>
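
One gotcha before you use these: both functions store cookies in cookies/cookies.tmp, so that directory has to exist and be writable or the login won't stick. A couple of lines like this at the top of your script (or a quick mkdir in the terminal) will sort it:

Code:
<?php
// make sure the cookie directory exists before the cURL functions run
if (!is_dir("cookies")) {
    mkdir("cookies", 0700);
}
?>
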
OK, our sample code for logging in will look like this

Code:
<?php
require_once("curlfunctions.php");

$site = "http://www.articlesite.com";
$email = "your email here";
$password = "your password here";
//If you're feeling adventurous, get your script to load these from a file

//We're now constructing a string the same as the one from Live HTTP Headers
$poststring = "email=" . $email . "&password=" . $password . "&SUBMIT=Login";

//The page the data is sent to is $site . "/cgi-bin/member.cgi"
//We load the info into our cURL function; $result is the page we would see after logging in
$result = post($site . "/cgi-bin/member.cgi", $poststring);

//Now, to check if we're logged in, we need a regex - note the / delimiters
$regex = "/Welcome username/";
//This checks if the regex string is found in the result page; any match is saved in $matches
preg_match($regex, $result, $matches);

//now to check if it was
if (empty($matches)) {
    echo "posting failed";
} else {
    echo "success";
}
?>
Tinker with this till you get it to work. Remember, if it's not working, double-check that you are sending the exact same data from the script as you saw in Live HTTP Headers. If you got this far, you should be able to repeat the same process to post the article. This time your code will be a bit longer and your regex different. That's basically all there is to a simple posting bot. If you have any questions, post them below and I'll do my best to help you.
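
To get you started on that second script, here's a rough sketch. The field names (title, body, category) and the submit URL are invented; take the real ones from your own Live HTTP Headers capture of a manual submission:

Code:
<?php
require_once("curlfunctions.php");
$site = "http://www.articlesite.com";

// hypothetical field names - copy the real ones from Live HTTP Headers
$fields = array(
    'title'    => 'My Article Title',
    'body'     => 'Article text goes here...',
    'category' => '12',
);
// post() reuses the cookie file, so run this after a successful login
$result = post($site . "/cgi-bin/submit.cgi", http_build_query($fields));

// again, a regex against the confirmation page tells you if it worked
if (preg_match("/article submitted/i", $result)) {
    echo "posted";
} else {
    echo "posting failed";
}
?>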
 


I appreciate your time, but you're not explaining the intricacies of scraping. Not that my tutorials are a tell-all, but I think they give a better overall view. I also laugh at your use of regex and preg_match, because they are terribly slow compared to what you can write (the worse-than-lazy route).

Noobies Guide on How to Scrape: Part 1 – Intro & Tools | MadPPC
Noobies Guide on How to Scrape: Part 2 – URLs, URL Variables, and using Live HTTP Headers | MadPPC
Noobies Guide on How to Scrape: Part 3 – Basics of Assessing Your Target | MadPPC
Noobies Guide on How to Scrape: Part 4 – cURL | MadPPC
Noobies Guide on How to Scrape: Part 5 – A Basic Scraper | MadPPC
 
I appreciate your time, but you're not explaining the intricacies of scraping. Not that my tutorials are a tell-all, but I think they give a better overall view. I also laugh at your use of regex and preg_match, because they are terribly slow compared to what you can write (the worse-than-lazy route).
Thanks for the criticism, Rage9. I wasn't trying to give a comprehensive view of PHP programming. It's true I didn't cover the intricacies of scraping, but I was trying to give a quick 'n' dirty intro to posting bots, to show how easy it is to actually code something yourself.

One thing: why do you laugh at me calling a variable $regex??? It's just a variable name. I sure as hell don't mind you dropping links to your own tuts; I'm sure they are more comprehensive than this, I wrote it up in half an hour. I don't see why you need to try and provoke me in the process.
 
I appreciate your time, but you're not explaining the intricacies of scraping. Not that my tutorials are a tell-all, but I think they give a better overall view. I also laugh at your use of regex and preg_match, because they are terribly slow compared to what you can write (the worse-than-lazy route).

Noobies Guide on How to Scrape: Part 1 – Intro & Tools | MadPPC
Noobies Guide on How to Scrape: Part 2 – URLs, URL Variables, and using Live HTTP Headers | MadPPC
Noobies Guide on How to Scrape: Part 3 – Basics of Assessing Your Target | MadPPC
Noobies Guide on How to Scrape: Part 4 – cURL | MadPPC
Noobies Guide on How to Scrape: Part 5 – A Basic Scraper | MadPPC

Sweet.

I've written quite a few scrapers in my time, including a few that scraped several of the article directories, spun the articles and then dumped them back into WP... pretty simple stuff.

There is a problem with the link on the last one, but it was easily found on your site.
 
Thanks for the criticism, Rage9. I wasn't trying to give a comprehensive view of PHP programming. It's true I didn't cover the intricacies of scraping, but I was trying to give a quick 'n' dirty intro to posting bots, to show how easy it is to actually code something yourself.

One thing: why do you laugh at me calling a variable $regex??? It's just a variable name. I sure as hell don't mind you dropping links to your own tuts; I'm sure they are more comprehensive than this, I wrote it up in half an hour. I don't see why you need to try and provoke me in the process.

Because that is the nature of WF; if you don't like it, GTFO. That, and I have a huge programming e-penis, and it often gets the best of me, especially while intoxicated, like when I popped off last night.
 
No prob. I've been to your blog before and think it's good; in fact I think I have it bookmarked (genuine compliment there). I was a bit tweaked this morning too, amphetamines + no sleep.
 
Hey, thanks gutterseo. I was actually doing some noob PHP tutorials last night and then found this. I think I have a pretty good grasp of the code, but there's no way I could code one myself yet without looking up syntax and referencing source code.

Also rage9, bookmarked your blog. Good stuff in there too.
 
Well, wrote my first scraper today with the help of Rage9's blog and gutterseo's post.

My fun did not last very long...

[screenshot: googleban.jpg]
 
Well, wrote my first scraper today with the help of Rage9's blog and gutterseo's post.

My fun did not last very long...

[screenshot: googleban.jpg]

AHAHAHA!!!! Yeah, you're going to want to use multiple proxies and/or throttle the scraper.
 
You wrote way too much code and too long a tutorial for what could be done much more simply with file_get_contents() and proper regular expressions. With PHP, you really want to take advantage of multi-dimensional array regex matching (preg_match_all), as that can handle massive amounts of data quickly and easily. You weren't wrong for posting this, but just remember for the next tutorial you post :P
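
For the noobs following along, here's a minimal sketch of what I mean; the URL and pattern are just examples:

Code:
<?php
// grab a page and pull every match out in one go
$html = file_get_contents("http://www.example.com/");

// capture the href of every link on the page
preg_match_all('/<a[^>]+href="([^"]+)"/i', $html, $matches);

foreach ($matches[1] as $url) {
    echo $url . "\n";
}
?>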
 
I appreciate your time, but you're not explaining the intricacies of scraping. Not that my tutorials are a tell-all, but I think they give a better overall view. I also laugh at your use of regex and preg_match, because they are terribly slow compared to what you can write (the worse-than-lazy route).

Noobies Guide on How to Scrape: Part 1 – Intro & Tools | MadPPC
Noobies Guide on How to Scrape: Part 2 – URLs, URL Variables, and using Live HTTP Headers | MadPPC
Noobies Guide on How to Scrape: Part 3 – Basics of Assessing Your Target | MadPPC
Noobies Guide on How to Scrape: Part 4 – cURL | MadPPC
Noobies Guide on How to Scrape: Part 5 – A Basic Scraper | MadPPC

That is a little bit better, but definitely code-excessive as well. Why do you need so many damn cURL vars? Jesus. file_get_contents works great for anything I have done, and very quickly. (You are right that in *SOME* cases you would have to send a different referer or user agent, but that is rare.)
 
AHAHAHA!!!! Yeah, you're going to want to use multiple proxies and/or throttle the scraper.

..and again, RARELY would you need to do that. For a n00by introduction to scraping, all of those precautions are unnecessary. Maybe that should be an Intermediate Guide to PHP Scraping or something :)
 
That is a little bit better, but definitely code-excessive as well. Why do you need so many damn cURL vars? Jesus. file_get_contents works great for anything I have done, and very quickly. (You are right that in *SOME* cases you would have to send a different referer or user agent, but that is rare.)

#1, because you should know how to do it, and #2, it's much more flexible. Using cURL will give you a solid base for writing automated software in general.

Sure, if you just wanted to scrape (and didn't have to make a POST request), knock yourself out with file_get_contents. If you want to scrape, make POST requests, modify header data, use proxies, etc., you'd use cURL.

Or are you just afraid of the power and flexibility? Either way, doesn't matter; you just keep using file_get_contents.
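
For the record, here's the kind of flexibility I'm talking about, as a sketch with a placeholder proxy address and an arbitrary extra header; this is where file_get_contents on its own falls short:

Code:
<?php
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, "127.0.0.1:8080");  // placeholder proxy
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "X-Requested-With: XMLHttpRequest",             // any custom header you need
));
$response = curl_exec($ch);
curl_close($ch);
?>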
 
That is a little bit better, but definitely code-excessive as well. Why do you need so many damn cURL vars? Jesus. file_get_contents works great for anything I have done, and very quickly. (You are right that in *SOME* cases you would have to send a different referer or user agent, but that is rare.)

This thread sorta went off topic. It wasn't meant to be about scraping; it was about form filling, where a cookie is important. Please fill me in if I'm wrong, but I don't know any easier way to manage cookies than with cURL. Did I leave out a location to store cookies in my code above? I'm not even going to bother to check.

If you're having trouble scraping Google, try something like this at the end of your loop:
Code:
sleep(rand(5, 10));
This will cause the script to wait between 5 and 10 seconds between requests. Yes, there are better ways to do this; using proxies is one.
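
And if you do go the proxy route, here's a bare-bones sketch of a throttled loop that rotates through a proxies.txt (one ip:port per line); the filename and delay range are just one way to do it:

Code:
<?php
$proxies = file("proxies.txt", FILE_IGNORE_NEW_LINES);
$urls = array("http://www.example.com/page1", "http://www.example.com/page2");

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // rotate through the proxy list
    curl_setopt($ch, CURLOPT_PROXY, $proxies[$i % count($proxies)]);
    $page = curl_exec($ch);
    curl_close($ch);

    sleep(rand(5, 10)); // wait between requests
}
?>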