The Botting Q&A Thread: Ask Away

dchuk

Senior Botter
Alright, this subforum is pretty dead most of the time, so let's try to liven things up a bit. There seems to be a wide array of botters around here, from the more heavy-duty guys to the occasional iMacros and uBot guys.

Feel free to ask any questions you might have about scraping, account creators, captcha breaking, etc. I'll try and answer as many things as I can, and I'm going to rely on a few resident senior botters to fill in the gaps.

Ground rules:

1) Nothing illegal (phishing, cookie stuffing, hacking sites, etc...keep it PG folks)
2) No asking for bots. If someone wants to give code up, cool. This isn't a BST thread, though.
3) Newbie questions are ok, BUT DO NOT BE AN IDIOT. If there are things you've always wanted to ask about how to get into this stuff, go ahead and ask, but you better be damn sure someone hasn't already asked it or you will get flamed.

Alright, let the chatting commence. I've been doing a shit ton of stuff lately with JRuby and Celerity, as well as quite a bit on the server side with PHP and CodeIgniter.

eliquid, mattseh, tainted, bofu, rage9, uplinked, shady, eli, stanley, and all other botters...I expect some posts in here :)
 


My question: when automating account signups, how do you save the captcha image to your server, then redisplay it without messing up sessions? I use php and I can't seem to make the initial request to the signup page and grab the image all in one step.

Sorry if that was worded poorly, but I'm sure you get the point.
 
My question: when automating account signups, how do you save the captcha image to your server, then redisplay it without messing up sessions? I use php and I can't seem to make the initial request to the signup page and grab the image all in one step.

Sorry if that was worded poorly, but I'm sure you get the point.

I would assume you're using cURL, so...

When you request a page that has a captcha image on it, that site drops a cookie on your machine. When you then use cURL to download the actual captcha image (in a second request) you're going to (most of the time) get a new captcha image, with a new cookie. That cookie won't match up to your previous request, so you won't ever get success.

When you make the second request for the captcha image, you need to pass the cookies back to the server, so it looks like you're refreshing the page rather than making a whole new request.

Towards the bottom of this page there is an example of how to use the cookiejar to send and receive with cookies attached: An in-depth overview of PHP And Curl

here's another good example of how to do this: http://reversedentropy.net/php/cookies-curl-and-php-without-a-jar
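
A minimal sketch of that cookie jar pattern (the URLs are placeholders, swap in your target's signup and captcha URLs):

Code:
// One cookie jar, reused for both requests, so the captcha request
// belongs to the same session as the signup page request.
$jar = dirname(__FILE__) . '/cookies.txt';

$ch = curl_init('http://example.com/signup.php');   // placeholder signup URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);    // write cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);   // and send them back out
$html = curl_exec($ch);

// Second request on the same handle: the site sees the same session,
// so the image you save matches the form you're about to submit.
curl_setopt($ch, CURLOPT_URL, 'http://example.com/captcha.php'); // placeholder
$image = curl_exec($ch);
curl_close($ch);

file_put_contents('captcha.jpg', $image);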
 
My question: when automating account signups, how do you save the captcha image to your server, then redisplay it without messing up sessions? I use php and I can't seem to make the initial request to the signup page and grab the image all in one step.

Sorry if that was worded poorly, but I'm sure you get the point.

Using curl with PHP you can save the cookie during the first request to fetch all the HTML on the page. You will then parse out what you want from the page and issue another request with curl. When you make the second request make sure to send the cookie along that was received in the first request.

libcurl - PHP Binding examples <= great examples in php
 
When you make the second request for the captcha image, you need to pass the cookies back to the server, so it looks like you're refreshing the page rather than making a whole new request.

Using curl with PHP you can save the cookie during the first request to fetch all the HTML on the page. You will then parse out what you want from the page and issue another request with curl. When you make the second request make sure to send the cookie along that was received in the first request.

libcurl - PHP Binding examples <= great examples in php

Thanks for the replies, guys. I've used the cookie jar to save cookies and reuse them for that second request, but the sites I've tried this with seem to only serve a particular image once per session (even with the cookies). The next time you make any kind of request with the same cookie to whatever is serving the image, it resets: the first load of the signup page shows one particular image, and if it's reloaded in any way you get a new one. Maybe I need to mess around with it some more; I haven't done it in a while, so I'll report back.

Next Question/Topic:
I've gotten pretty good at regex the past year and it has been beautiful for scraping, but usually things are confined to one particular CMS or even a particular site. What do you guys use to extract the main text/post from a particular page so that it can be dynamic across multiple sites, templates, and CMS's?

Lately I've been using the Alchemy semantic api for gathering the bulk of the content from a particular page and it works decently, but it still sucks to rely on an API for something like that. Any advice here?
 
Next Question/Topic:
I've gotten pretty good at regex the past year and it has been beautiful for scraping, but usually things are confined to one particular CMS or even a particular site. What do you guys use to extract the main text/post from a particular page so that it can be dynamic across multiple sites, templates, and CMS's?

Lately I've been using the Alchemy semantic api for gathering the bulk of the content from a particular page and it works decently, but it still sucks to rely on an API for something like that. Any advice here?

Not trivial at all. I've got code that succeeds maybe 60-70% of the time. The problem is a lot of websites are made with terrible HTML; not all content is within proper tags, etc. It comes down to assigning scores to the probability that a certain lump of text is the main content (based on things like length, number of paragraphs, etc.). A nice site to develop that against is the BBC or NYTimes.
 
One solution would be to just limit yourself to one set of content tags.

If you're going after the main body/text (the meat and potatoes), then JUST go after anything in a <p> tag. Even badly formatted HTML from webmasters will generally keep the main body content in <p> tags. Some might have align or style attributes in the <p> tag, but just use regex to dumb it down to the <p> itself.

Also search for things between <br /> tags if the page doesn't have <p> tags. You can also require that the content between <p> (or <br />) tags be over 50 characters to count as a valid paragraph; anything under that will probably be garbage you can throw out.

I've been scraping/botting since, well, cURL was integrated into PHP, which was forever ago. If you limit yourself to what I just posted you will find a lot of success. Granted, there will be failures, but nothing is 100%. I use a ton of PHP as well as macro/bot tools for automation and scraping. Try this out on some of your CMSs and you will see some good results.

Most times I can use something like uBot, WinAutomation, Jitbit, Sikuli, Automation Anywhere, etc. for the frontend and PHP/MySQL for the backend.
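
A rough sketch of that <p>-first, <br />-fallback approach in PHP (my own quick take on what's described above, not eliquid's actual code):

Code:
function extractMainText($html) {
    // Grab everything inside <p> tags, ignoring any attributes on the tag.
    if (preg_match_all('#<p[^>]*>(.*?)</p>#is', $html, $m)) {
        $chunks = $m[1];
    } else {
        // No <p> tags on this page, so fall back to splitting on <br /> variants.
        $chunks = preg_split('#<br\s*/?>#i', $html);
    }

    $paragraphs = array();
    foreach ($chunks as $chunk) {
        $text = trim(strip_tags($chunk));
        // Anything under 50 characters is probably nav/junk, throw it out.
        if (strlen($text) >= 50) {
            $paragraphs[] = $text;
        }
    }
    return implode("\n\n", $paragraphs);
}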
 
mattseh and eliquid both make solid points: the majority of the internet is a cesspool of shitty, shitty code that is neither standardized nor consistent. You have a few options, in my opinion:

1) Defer to existing libraries. I use Nokogiri in Ruby; it parses most pages very well. There is an equivalent in every major language.

2) Use Google's cache. They spend all day, every day crawling the vast shitty internet, so their cache might help you out in terms of some sort of cleaned-up code/page.
 
If you're worried about really shitty HTML (I'm not, though) you can use Tidy with PHP. There are also several other PHP libs for cleaning up HTML and scraping out what you want.

I prefer to keep my code simple, though. Just scan and extract all the shit in <p> tags to get the main content, and if <p> isn't found then use <br /> for things over 50 chars in length. If nothing is still found, fuck that target and move on to the next one, which is also why I extract 100 SERPs at once too.
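
If you do go the Tidy route, it's only a couple of lines with PHP's tidy extension (a minimal sketch, assuming ext-tidy is installed):

Code:
// Clean up broken markup before scraping it. Assumes the tidy extension.
$tidy  = new tidy();
$clean = $tidy->repairString($html, array(
    'output-xhtml'   => true,   // balanced, well-formed tags
    'show-body-only' => true,   // drop the <head> junk
), 'utf8');
// $clean is now much safer to throw regex or a DOM parser at.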
 
Something like this will get you a basic captcha where the URL is always the same, and save it locally:

Code:
function getCaptcha($navurl, $captchaurl) {

    $fileh = fopen('captcha.jpg', 'w');
    $ch = curl_init();

    // First request: load the signup page so the site drops its session cookie.
    curl_setopt($ch, CURLOPT_URL, $navurl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookies.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookies.txt"); // send the saved cookies back out too
    //go to page to log cookie session id
    curl_exec($ch);

    // Second request on the same handle: fetch the captcha image that belongs
    // to that session and write it straight to disk.
    curl_setopt($ch, CURLOPT_URL, $captchaurl);
    curl_setopt($ch, CURLOPT_FILE, $fileh);
    curl_exec($ch);

    fclose($fileh);
    curl_close($ch);
}

reCAPTCHA is a bit trickier, as you'd have to visit the page first and extract the captcha URL to make the call.

As for poor HTML, I haven't run into it a ton, but when I have, you just have to get creative. There's normally some way to extract the data you want.
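
For the "extract the captcha URL first" case, a rough fragment that would slot into a function like getCaptcha() above (the regex is just an example pattern, adjust it to whatever the target's markup actually looks like):

Code:
// $html is the signup page returned by the first curl_exec() call.
// Pull the image URL out of the page, then fetch it on the same handle
// so the cookies/session still line up.
if (preg_match('#<img[^>]+src="([^"]*captcha[^"]*)"#i', $html, $m)) {
    curl_setopt($ch, CURLOPT_URL, $m[1]);
    curl_setopt($ch, CURLOPT_FILE, $fileh);
    curl_exec($ch);
}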
 
Someone had asked what I use for scraping. I typically write my own code to handle it; check out my blog, madppc.com, where I have a 5-part tutorial on the basics of scraping. I don't ever use regex unless I really must, because it's slower and I have no real reason to learn it beyond the basics. There is also, for PHP, a library called PHP Simple HTML DOM, and it is great as you can basically use jQuery-type statements to extract data. The big issue with it is that it takes up A LOT of memory. A simple scraper will take up 30 MB+ of memory, which really made it not ideal IMHO. I've run into issues with putting these types of scrapers on some hosting, as PHP is typically not set up to allocate that much memory. It can be changed, but I've never considered it worth the trade-off, honestly.

My typical approach is to start from the top of the page, break the page up into manageable sections, and work my way down. I make a lot of use of PHP's explode() function. As with any scraper, care must be taken to make sure you can account for any page "state". This is most evident when scraping the SERPs, but there are a lot of sites that will throw odd curve balls at you, and if you're not looking for it, it'll fuck your shit up.

Apologies for any spelling mistakes or whatnot; typed this from my phone.
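
A bare-bones version of that explode() approach (the marker strings are made up; use whatever unique strings sit right before and after the section you want on the real page):

Code:
// Cut the chunk of HTML that sits between two landmark strings.
function cutSection($html, $startMarker, $endMarker) {
    $parts = explode($startMarker, $html, 2);
    if (count($parts) < 2) {
        return null; // the page "state" changed and the marker isn't there
    }
    $rest = explode($endMarker, $parts[1], 2);
    return $rest[0];
}

// Hypothetical usage: grab one result block, then explode it further
// for the title, URL, description, etc.
$listing = cutSection($html, '<div class="result">', '</div>');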
 
Someone had asked what I use for scraping. I typically write my own code to handle it; check out my blog, madppc.com, where I have a 5-part tutorial on the basics of scraping. I don't ever use regex unless I really must, because it's slower and I have no real reason to learn it beyond the basics. There is also, for PHP, a library called PHP Simple HTML DOM, and it is great as you can basically use jQuery-type statements to extract data. The big issue with it is that it takes up A LOT of memory. A simple scraper will take up 30 MB+ of memory, which really made it not ideal IMHO. I've run into issues with putting these types of scrapers on some hosting, as PHP is typically not set up to allocate that much memory. It can be changed, but I've never considered it worth the trade-off, honestly.

My typical approach is to start from the top of the page, break the page up into manageable sections, and work my way down. I make a lot of use of PHP's explode() function. As with any scraper, care must be taken to make sure you can account for any page "state". This is most evident when scraping the SERPs, but there are a lot of sites that will throw odd curve balls at you, and if you're not looking for it, it'll fuck your shit up.

Apologies for any spelling mistakes or whatnot; typed this from my phone.

From what you just said, you really need regex. It really is powerful and worth knowing. Hell, one of the great things about PowerMTA is the ability to use regex for backoffs.
 


Not trivial at all. I've got code that succeeds maybe 60-70% of the time. The problem is a lot of websites are made with terrible HTML; not all content is within proper tags, etc. It comes down to assigning scores to the probability that a certain lump of text is the main content (based on things like length, number of paragraphs, etc.). A nice site to develop that against is the BBC or NYTimes.


I rolled my own and got a very good hit rate... I based it on word density vs. tag density. For every token on the page, assign a positive score to each word, a slightly positive score to tags like <p> <br> <i> <strong> that usually occur within content, and a negative score to other tags (like divs). Then create a heat map by giving each word position the average density of the n previous and n following words. From there you just have to look for sudden, high rates of change in the density. Once you've found the boundaries of a target zone, look for a common parent tag between the start and end...

Rate each candidate zone by the density of the total zone, perhaps modified to take length into account, then pick the best candidate.

The main thing that needs tweaking is the 'n' distance... Too long and you will always find the right boundaries, but you may be a few words/tags off. Too short and you definitely get readable content, but you can end up with lots of short segments rather than finding the entire content section. That can be handled pretty effectively, though, by doing some pre-analysis of the content to determine what 'normal' rates of density change look like and setting your thresholds accordingly.
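
A very rough PHP sketch of that scoring/smoothing idea (the weights and window size are guesses you'd have to tune, and instead of hunting for rate-of-change boundaries it just keeps the best positive run, so it simplifies what's described above):

Code:
// Score each token, smooth with a sliding window of size $n, keep the densest run.
function densestZone($html, $n = 15) {
    // Split into tags and words, keeping the tags as their own tokens.
    $tokens = preg_split('/(<[^>]+>)|\s+/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $scores = array();
    foreach ($tokens as $tok) {
        if ($tok[0] === '<') {
            // Content-ish tags score slightly positive, everything else negative.
            $scores[] = preg_match('#^</?(p|br|i|em|b|strong)\b#i', $tok) ? 0.5 : -2.0;
        } else {
            $scores[] = 1.0; // plain word
        }
    }

    $count = count($scores);
    $best = 0; $bestStart = 0; $bestEnd = 0; $run = 0; $runStart = 0;
    for ($i = 0; $i < $count; $i++) {
        // Smoothed density: average of the n previous and n following scores.
        $from = max(0, $i - $n);
        $to   = min($count - 1, $i + $n);
        $avg  = array_sum(array_slice($scores, $from, $to - $from + 1)) / ($to - $from + 1);

        if ($avg > 0) {              // inside a dense, content-looking zone
            if ($run == 0) $runStart = $i;
            $run += $avg;
            if ($run > $best) { $best = $run; $bestStart = $runStart; $bestEnd = $i; }
        } else {
            $run = 0;                // density dropped off, close the run
        }
    }

    if ($best == 0) return '';
    $zone = array_slice($tokens, $bestStart, $bestEnd - $bestStart + 1);
    return trim(strip_tags(implode(' ', $zone)));
}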
 
Oh yeah, I forgot to mention: before doing all that, I run the HTML through JTidy or some other HTML cleaner. Makes sure you can accurately match start/end tags.
 
From what you just said, you really need regex. It really is powerful and worth knowing. Hell, one of the great things about PowerMTA is the ability to use regex for backoffs.

The only reason you need regex is that you're incapable of writing your own functions to do the same thing. They'll not only be faster, you'll actually be proud of yourself.
 
The only reason you need regex is that you're incapable of writing your own functions to do the same thing. They'll not only be faster, you'll actually be proud of yourself.

From my BST bumper detector:

Code:
threads = re.compile('<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(main)

How would you do this without regex?

Time is money; I can write regexes quickly and concisely to get exactly what I need.
 
The only reason you need regex is that you're incapable of writing your own functions to do the same thing. They'll not only be faster, you'll actually be proud of yourself.


I write lots of PHP and there is NO WAY IN HELL I would consider rewriting regex functions for any reason at all.

Not only do you waste a lot of time reinventing the wheel, but the speed increase is going to be negligible for the most part.

I'm gonna have to say, you're gonna have to post the function you use and let's run a comparison against regex to really see the difference.

If we are talking about less than 1-2 seconds, then I would have to say it's a fail, as the time to rewrite regex functions would be way more than the time you might lose.

I code to make money, not make .0013342 second improvements.
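
For what it's worth, the quickest way to settle that is a crude micro-benchmark, something like this (numbers will obviously vary by machine, pattern, and page):

Code:
// Crude micro-benchmark: pull the <title> out of a page 10,000 times,
// once with regex and once with strpos/substr, and compare wall time.
$html = file_get_contents('page.html'); // any saved page you've scraped
$iterations = 10000;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    preg_match('#<title>(.*?)</title>#is', $html, $m);
}
$regexTime = microtime(true) - $start;

$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $a = strpos($html, '<title>') + 7;
    $b = strpos($html, '</title>', $a);
    $title = substr($html, $a, $b - $a);
}
$strposTime = microtime(true) - $start;

printf("regex: %.4fs  strpos: %.4fs\n", $regexTime, $strposTime);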