The Botting Q&A Thread: Ask Away

You know, that's really interesting. Maybe it has to do with how many cores the system has? My local system is a 6-core AMD at 4 GHz per core with 8 GB of RAM at 1600 MHz, and my code blows it away. Another rig I tested on is a dedi (don't know the exact specs), and my code blows it away there too.

So then I'm like, WTF is going on? How can I not be getting the same result? I just tested it on a VPS that has 768 MB of RAM, and my code got crushed. So who knows, maybe the beefier the system, the better non-regex does? I don't know the answer, but it's a question worth asking. Anyone know of a PHP configuration setting that would affect this?

The processor speed on your VPS is most likely a fraction of what you're getting on your home machine and your dedi.
 


Right, but the data they're processing is the same, and they're processing it the same way. Neither the number of cores (the code is single-threaded) nor the processor speed should really change the outcome.
 
The only reason you need regex is that you're incapable of writing your own functions to do the same thing. They'll not only be faster, you'll actually be proud of yourself.

I'd be more proud of a nicely crafted one-line regex that gets the exact data you want than of a pile of PHP string functions, but whatever you're used to, I suppose.
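
For anyone who wants to try the regex vs. string-function argument on their own box, here is a minimal sketch. It is in Python rather than PHP, and it is not the code anyone in this thread ran: the sample HTML, the function names, and the iteration count are all made up for illustration. It extracts the same href values both ways, checks that the results match, and times each approach. Expect the numbers (and possibly the winner) to vary from machine to machine.

```python
import re
import timeit

# Hypothetical sample page: 1,000 anchor tags, invented for this sketch.
HTML = "".join(
    '<a href="http://example.com/page%d">link %d</a>\n' % (i, i)
    for i in range(1000)
)

# Precompiled pattern that captures whatever sits between href=" and the next quote.
HREF_RE = re.compile(r'href="([^"]*)"')


def links_regex(html):
    """Extract href values with a single regular expression."""
    return HREF_RE.findall(html)


def links_string(html):
    """Extract href values using only plain substring searches."""
    links = []
    marker = 'href="'
    pos = html.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = html.find('"', start)
        if end == -1:
            break  # unterminated attribute, stop scanning
        links.append(html[start:end])
        pos = html.find(marker, end + 1)
    return links


if __name__ == "__main__":
    # Both approaches must agree before the timings mean anything.
    assert links_regex(HTML) == links_string(HTML)
    for fn in (links_regex, links_string):
        seconds = timeit.timeit(lambda: fn(HTML), number=200)
        print("%s: %.4f s" % (fn.__name__, seconds))
```

Running it on a few different machines is probably the quickest way to see whether the ranking really flips between a desktop, a dedi, and a small VPS.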

Question to informed peeps:
Did anyone try to automate Myspace MyAds ad creation back in the day? It was a Flash file over HTTPS. I don't know if they still use that now.

I just couldn't figure out a way to sniff what was sent. My sniffing tool of choice, Charles, stuck a Windows proxy between the PC and the Internet, but the HTTPS traffic couldn't get through it. And I couldn't figure out a way to make Flash use a specified proxy.

I managed to get it working by guessing the POST variables, but I'm still curious whether anyone actually sniffed it. I know kyleirwin said he figured it out.
 
I'd say it has something to do with the PHP version and error reporting. Your code doesn't look all that 5.3 friendly based on all the warnings it throws.


[root@boost ~]# time python poop.py

real 0m0.562s
user 0m0.558s
sys 0m0.002s
[root@boost ~]# php mattseh.php
start: 1295908608.7676<br />end: 1295908614.0252<br />Took:5.2575619220734 seconds
[root@boost ~]# php rage.php
start: 1295908618.4713<br />end: 1295908620.9584<br />Took:2.4871139526367 seconds
[root@boost ~]# cat /proc/cpuinfo | grep CPU
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
model name : Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
[root@boost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          9004       8076        928          0        359       5240
-/+ buffers/cache:       2476       6528
Swap:        11055          0      11055
 
The differences between 5.2 and 5.3 are pretty minor, and 5.3 actually adds a little functionality. The problem was the loop condition: $i <= count($links) should have been $i < count($links), since the <= version reads one index past the end of the array.

Good to see someone else get similar results, makes me feel less crazy.
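
The loop-bound fix mentioned above can be sketched like this. It is a Python analogue, not the original PHP, and the function names and sample list are hypothetical. One difference worth flagging: PHP only emits a notice or warning for the out-of-range index and carries on, while Python raises IndexError.

```python
def collect_inclusive(links):
    """Buggy bound, mirroring $i <= count($links): visits one index too many."""
    seen = []
    i = 0
    while i <= len(links):
        seen.append(links[i])  # raises IndexError when i == len(links)
        i += 1
    return seen


def collect_exclusive(links):
    """Corrected bound, mirroring $i < count($links)."""
    seen = []
    i = 0
    while i < len(links):
        seen.append(links[i])
        i += 1
    return seen


if __name__ == "__main__":
    sample = ["http://example.com/a", "http://example.com/b"]  # made-up data
    print(collect_exclusive(sample))
    try:
        collect_inclusive(sample)
    except IndexError as exc:
        print("inclusive bound failed:", exc)
```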
 
Question about Ruby. I've been thinking about playing with it for Watir. Does it actually power a real browser? I'm currently using a Java solution and would much rather migrate to Ruby.
 
Watir can power Firefox, IE (on Windows), Chrome, and Safari. I recommend doing it with Firefox though.

I use JRuby + Celerity. Why? Celerity is a clone of Watir (it uses the exact same function set) but uses HtmlUnit instead of a real browser to do the work for you. That means you can run your bots on a headless (no X server) Linux VPS if you want, or run multiple ones simultaneously, etc.

HtmlUnit is its own beast, which takes a while to get used to, but Celerity makes most things easy. 99% of the time you can create your bot in Watir so you can watch it execute in the browser, then change the line browser = Watir::Browser.new to browser = Celerity::Browser.new and it will just work, only in HtmlUnit. It's fucking great.

A few good resources for Watir/Celerity:

Installing TestWise Recorder
https://addons.mozilla.org/en-US/firefox/addon/firepath/
https://github.com/jarib/celerity-viewers

Also, Nokogiri (the best HTML document parser for Ruby) doesn't run on JRuby out of the box. Install the prerelease gem to get a pure ruby version:

jruby -S gem install nokogiri --pre
 
Ah okay. The last time I checked, HtmlUnit didn't support JS properly, but now it probably does. I've also set up Xvfb and run Firefox in it :D
 
It supports the vast majority of JavaScript on the most popular botting targets. To put it in another light: you could recreate SEnuke with it.
 
*muttering to himself* ruby, or python? ruby. No, python! Ruby! yeah. wait... python? ruby? Python! No, ruby! Python? Python! wait... *walks off into the distance*
 
I just couldn't figure out a way to sniff what was sent. My sniffing tool of choice, Charles, stuck a Windows proxy between the PC and the Internet, but the HTTPS traffic couldn't get through it. And I couldn't figure out a way to make Flash use a specified proxy.

I was going to say use Wireshark (I would still recommend having it on hand), but then I noticed you said HTTPS, and Wireshark won't MITM.

I had the same problem a while back, not with Myspace though, and it was a PITA to get working. I can't remember why offhand, something about Adobe considering proxies + HTTPS a security violation.

There are a few solutions though. First, you could reverse the code; chances are they won't use obfuscation and you can see exactly what they are doing.

Or you could sniff out the server Flash is talking to, spoof the DNS, and then push the traffic to a proxy, but I believe Flash will scream bloody murder if the cert is invalid (and you can't force it not to care like you can with cURL, etc.).

On a side note, for people who want a quick way to create and test regex, I use RegexBuddy. It's easy and cheap (like $40).
 
Funny, I was thinking the same thing.

Actually I have a question, out of pure curiosity. How many people packet sniff when writing a scraper/bot/whatever?