The Botting Q&A Thread: Ask Away

eliquid · Jan 23, 2011

Fuck, I can't believe this thread might be turning into another debate about language choice, like every other thread us coders get into on this forum. lulz

Rage9 · Jan 23, 2011

mattseh said:
From my BST bumper detector:

Code:

threads = re.compile(r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(main)

How would you do this without regex?

Time is money, I can write regexes quickly and concisely to get exactly what I need.

I'm assuming that what the regex statement does is return all the actual thread URLs on the BST page. I could optimize this further, but this will do:

Code:

<?php
//break up the page
$links = explode('sell-buy-trade/', $pagedata);
//setup a loop
$x = 0; //counter for good trails array
for($i=1; $i<=count($links); $i++){

    $qp = strpos($links[$i], '"'); //find our quote position
    $trail = substr($links[$i], 0, $qp); //extract the rest of the string
    $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
    if($htmlp !== FALSE){
        $trail = substr($trail, 0, $htmlp); //extract the data before .html
        $data = explode('-', $trail); //separate all data elements
        $numdata = count($data); //find out how many items are in our array
        if(is_numeric($data[0]) && !is_numeric($data[$numdata - 1])){ //if the first data chunk is numeric and the last one isn't
            $goodtrail[$x] = $trail; //save our trail
            $x++; //increase our array counter
        }
    }
}
?>

eliquid said:
Fuck, I can't believe this thread might be turning into another debate about language choice, like every other thread us coders get into on this forum. lulz

Believe me I'm not trying to start a holy war or anything like that. I stated my opinion/philosophy on the regex subject and everyone else is going crazy like little teen school girls. If regex is your thing by all means go for it.
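
For anyone following along in Python, here is a minimal sketch of the two approaches being argued over, run against a made-up three-link sample rather than the real B/S/T page. The function names and the sample markup are hypothetical. The splitting version is written in the spirit of the PHP above rather than as an exact port: it iterates over the chunks directly (the PHP loop's `$i <= count($links)` bound reads one element past the end of the array), and `str.isdigit` stands in for `is_numeric`. Dropping matches whose third group is non-empty is an assumption made here so that both functions skip paginated links.

```python
import re

# Made-up sample in the shape the regex expects; not real page data.
SAMPLE = (
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/111-first-thread.html">a</a></div>\n'
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/222-second-thread-2.html">b</a></div>\n'
    '<div> <a href="http://www.wickedfire.com/members/333-someone.html">c</a></div>\n'
)

THREAD_RE = re.compile(
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)


def threads_by_regex(page):
    """Return 'id-slug' for each thread link with no trailing page number."""
    return [
        thread_id + '-' + slug
        for thread_id, slug, page_number in THREAD_RE.findall(page)
        if not page_number  # assumption: skip links to page 2, 3, ...
    ]


def threads_by_splitting(page):
    """Same idea using only split/partition, no regex."""
    trails = []
    # Chunk 0 is everything before the first marker, so it is skipped.
    for chunk in page.split('sell-buy-trade/')[1:]:
        trail, quote, _rest = chunk.partition('"')
        if not quote or not trail.endswith('.html'):
            continue
        trail = trail[:-len('.html')]
        parts = trail.split('-')
        # First piece must be the numeric thread id, last piece must not be a page number.
        if parts[0].isdigit() and not parts[-1].isdigit():
            trails.append(trail)
    return trails


print(threads_by_regex(SAMPLE))
print(threads_by_splitting(SAMPLE))
```

On this sample both functions keep only the first link. That says nothing about which is faster, only that the two styles can be made to agree on what counts as a thread URL.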

crackp0t · Jan 23, 2011

Rage9 I feel sorry for anyone that has you code for them if you really think your functions are faster than good regex.

Rage9 · Jan 23, 2011

crackp0t said:
Rage9 I feel sorry for anyone that has you code for them if you really think your functions are faster than good regex.

Wow, maybe you should go out and test this before you open your mouth. Regex gets exponentially slower the more "rules" you add, and also the lower-level a language you go. Even at the PHP level, regex is oftentimes quite a bit slower due to the load time of the regex engine.

Take the challenge issued earlier in the thread by mattseh: my function (which I wrote in about 10 min) is actually, according to my tests, 2x as fast. That's right, twice as fast. Also, I'd hardly consider that a complex regex. I'm running this off my personal rig and getting times of ~22 seconds for mine and ~43 seconds for the regex. Don't believe me? Run these scripts:

Mine:

Code:

<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start =  microtime_float();

$pagedata = file_get_contents('bst.html');
for ($z = 0; $z < 20000; $z++) {

//break up the page
    $links = explode('sell-buy-trade/', $pagedata);
//setup a loop
    $x = 0; //counter for good trails array
    for ($i = 1; $i <= count($links); $i++) {

        $qp = strpos($links[$i], '"'); //find our quote position
        $trail = substr($links[$i], 0, $qp); //extract the rest of the string
        $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
        if ($htmlp !== FALSE) {
            $trail = substr($trail, 0, $htmlp); //extract the data before .html
            $data = explode('-', $trail); //separate all data elements
            $numdata = count($data); //find out how many items are in our array
            if (is_numeric($data[0]) && !is_numeric($data[$numdata - 1])) { //if the first data chunk is numeric and the last one isn't
                $goodtrail[$x] = $trail; //save our trail
                $x++; //increase our array counter
            }
        }
    }
}
$end =  microtime_float();
$parseTime = $end - $start;
echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

What the equivalent of his in PHP would be:

Code:

<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start = microtime_float();

$pagedata = file_get_contents('bst.html');

for ($z = 0; $z < 20000; $z++) {
    preg_match_all('/<div>[^<]+?<a href="http:\/\/www.wickedfire.com\/sell-buy-trade\/(\d+)-([^#]+?)(\d*?)\.html"/', $pagedata, $matches);
}
$end = microtime_float();
$parseTime = $end - $start;

echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

bst.html is simply a file containing all the data from the B/S/T page.

tainted · Jan 24, 2011

Put 3 computer nerds in a room together, and you get...

crackp0t · Jan 24, 2011

Rage9 said:
Wow, maybe you should go out and test this before you open your mouth. Regex gets exponentially slower the more "rules" you add, and also the lower-level a language you go. Even at the PHP level, regex is oftentimes quite a bit slower due to the load time of the regex engine.

Take the challenge issued earlier in the thread by mattseh: my function (which I wrote in about 10 min) is actually, according to my tests, 2x as fast. That's right, twice as fast. Also, I'd hardly consider that a complex regex. I'm running this off my personal rig and getting times of ~22 seconds for mine and ~43 seconds for the regex. Don't believe me? Run these scripts:

Mine:

Code:

<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start = microtime_float();

$pagedata = file_get_contents('bst.html');
for ($z = 0; $z < 20000; $z++) {

//break up the page
    $links = explode('sell-buy-trade/', $pagedata);
//setup a loop
    $x = 0; //counter for good trails array
    for ($i = 1; $i <= count($links); $i++) {

        $qp = strpos($links[$i], '"'); //find our quote position
        $trail = substr($links[$i], 0, $qp); //extract the rest of the string
        $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
        if ($htmlp !== FALSE) {
            $trail = substr($trail, 0, $htmlp); //extract the data before .html
            $data = explode('-', $trail); //separate all data elements
            $numdata = count($data); //find out how many items are in our array
            if (is_numeric($data[0]) && !is_numeric($data[$numdata - 1])) { //if the first data chunk is numeric and the last one isn't
                $goodtrail[$x] = $trail; //save our trail
                $x++; //increase our array counter
            }
        }
    }
}
$end = microtime_float();
$parseTime = $end - $start;
echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

What the equivalent of his in PHP would be:

Code:

<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start = microtime_float();

$pagedata = file_get_contents('bst.html');

for ($z = 0; $z < 20000; $z++) {
    preg_match_all('/<div>[^<]+?<a href="http:\/\/www.wickedfire.com\/sell-buy-trade\/(\d+)-([^#]+?)(\d*?)\.html"/', $pagedata, $matches);
}
$end = microtime_float();
$parseTime = $end - $start;

echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

bst.html is simply a file containing all the data from the B/S/T page.

I ran the scripts and the regex came out faster. However, I should have been a little nicer.

user@devbox:~/wf# php other.php
start: 1295859636.8 end: 1295859671.85 Took:35.0463261604 seconds

user@devbox:~/wf# php rage.php
start: 1295859681.71 end: 1295859841.74 Took:160.028321981 seconds

EDIT:

My above statements weren't meant to reflect on his product which from all accounts is great.

crackp0t · Jan 24, 2011

tainted said:
Put 3 computer nerds in a room together, and you get...

lol yeah

I was in a shitty mood and he received unwarranted hostility.

dchuk · Jan 24, 2011

I'm a huge fan of xpath instead of regex bullshit. Regex makes absolutely no sense and is a bitch and a half to try and read when revisiting code. Xpath is sick, fast, accurate, and there's the firexpath plugin for firefox that lets you find the exact xpath for any element on a page, super handy.

mattseh · Jan 24, 2011

matthew@matthew-laptop:~/php$ time php mattseh.php
start: 1295867727.176 end: 1295867817.3645 Took:90.188463926315 seconds
real 1m30.243s
user 0m27.550s
sys 0m0.250s

matthew@matthew-laptop:~/php$ time php rage.php
start: 1295867660.5338 end: 1295867807.4684 Took:146.93461108208 seconds
real 2m26.971s
user 0m50.060s
sys 0m1.060s

matthew@matthew-laptop:~/php$ time python mattseh.py

real 0m2.526s
user 0m2.100s
sys 0m0.020s

Code:

import re
data = open('bst.html').read()
for i in range(2000):
    threads = re.compile(r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(data)

Also, I was getting undefined offset notices in your code on PHP 5. Total time to write my code: maybe 5 minutes, including the original time to write the regex.

Running on a single-core netbook, with both PHP scripts running at the same time, so the PHP wall-clock times will be off; the "user" time will be accurate.

Python regex caching ftw
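
A minimal sketch of what the caching remark refers to: Python's `re` module keeps a cache of recently compiled patterns, so calling `re.compile` with the same pattern inside a loop does not redo the compilation each time. The pattern, the sample text and the function names below are made up for illustration. The identity check at the end reflects how CPython normally behaves and is an implementation detail, not something to rely on.

```python
import re

# Hypothetical shortened pattern and sample text, for illustration only.
PATTERN = r'sell-buy-trade/(\d+)-'
TEXT = 'sell-buy-trade/123-foo sell-buy-trade/456-bar'

COMPILED_ONCE = re.compile(PATTERN)


def ids_recompiling(text):
    # Reads like a fresh compile on every call, but re serves it from its cache.
    return re.compile(PATTERN).findall(text)


def ids_precompiled(text):
    # Explicitly reuses the pattern object compiled at import time.
    return COMPILED_ONCE.findall(text)


print(ids_recompiling(TEXT))
print(ids_precompiled(TEXT))

# In CPython the second compile normally hands back the very same cached object.
print(re.compile(PATTERN) is COMPILED_ONCE)
```

Compiling once outside the loop is still the clearer habit. The cache just means the loop-with-compile version in the benchmark above is not paying the full compile cost on every iteration.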

eliquid · Jan 24, 2011

dchuk said:
I'm a huge fan of xpath instead of regex bullshit. Regex makes absolutely no sense and is a bitch and a half to try and read when revisiting code. Xpath is sick, fast, accurate, and there's the firexpath plugin for firefox that lets you find the exact xpath for any element on a page, super handy.

But let's say the code isn't HTML but some random text database. Would this still apply?

mattseh · Jan 24, 2011

eliquid said:
But let's say the code isn't HTML but some random text database. Would this still apply?

XPath is designed for XML / HTML, so no, iirc.

eliquid · Jan 24, 2011

dchuk said:
I'm a huge fan of xpath instead of regex bullshit. Regex makes absolutely no sense and is a bitch and a half to try and read when revisiting code. Xpath is sick, fast, accurate, and there's the firexpath plugin for firefox that lets you find the exact xpath for any element on a page, super handy.

You know, I've used that plugin for some PHP scripts I was writing, but I never could get the XPath it would spit out to work with any of my PHP code.

Has anyone used that plugin with anything besides Ruby and gotten it to work?

I used some other XPath tool and finally got it to work though, so I'm not sure if that plugin is biased or what.

emp · Jan 24, 2011

While regexes are nice (and often quite fast), the headache surrounding even the simplest tasks just makes them not worth it.

I can do regex, but if you have to revisit a regex or give it to another programmer - fucking forget it.

Sorry, but coder time is much, much more valuable (read: expensive) than computer time.

Unless we are talking gigabytes of data analysis here, I am steering around regexes.

::emp::

mattseh · Jan 24, 2011

My mistake, the original Python code looped 2000 times, not 20000. Rerun with 20000:

matthew@matthew-laptop:~/php$ time python mattseh.py

real 0m26.568s
user 0m20.830s
sys 0m0.160s

So close to php. My point still stands though, regex is faster for both the computer and the programmer than writing your own functions.

mattseh · Jan 24, 2011

XPath and XSLT with lxml <-- finally figured out how to do xpath in python easily with these examples.

dchuk · Jan 24, 2011

eliquid said:
You know, I've used that plugin for some PHP scripts I was writing, but I never could get the XPath it would spit out to work with any of my PHP code.

Has anyone used that plugin with anything besides Ruby and gotten it to work?

I used some other XPath tool and finally got it to work though, so I'm not sure if that plugin is biased or what.

With the XPath plugin, you can test out any XPath expression in it to see what it pulls up. I pretty regularly adjust the XPaths to match what I'm looking for correctly. For instance, to find a link on a page with a unique anchor text, you would use .//a[text()="anchor text here"]

also, let's move past the speed comparisons, it doesn't really matter for the point of this thread.
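
A minimal sketch of the anchor-text lookup described above, using only Python's standard library. Two assumptions to flag: the snippet is made up and well-formed (real pages usually need a lenient HTML parser), and `xml.etree.ElementTree` only supports a subset of XPath with no `text()` test, so the predicate is written as `[.='...']`, which compares the element's complete text content. A full XPath engine such as lxml should accept the `text()` form as written in the post.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; not taken from any real page.
SNIPPET = (
    '<div>'
    '<p><a href="/one.html">some other link</a></p>'
    '<p><a href="/two.html">anchor text here</a></p>'
    '</div>'
)

root = ET.fromstring(SNIPPET)

# ElementTree has no text() test; [.='...'] is its closest equivalent.
matches = root.findall(".//a[.='anchor text here']")

# Pull the href out of each matching link element.
hrefs = [link.get('href') for link in matches]
print(hrefs)
```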

crackp0t · Jan 24, 2011

dchuk said:
With the XPath plugin, you can test out any XPath expression in it to see what it pulls up. I pretty regularly adjust the XPaths to match what I'm looking for correctly. For instance, to find a link on a page with a unique anchor text, you would use .//a[text()="anchor text here"]

also, let's move past the speed comparisons, it doesn't really matter for the point of this thread.

No doubt. I use the shit out of xpather.

Rage9 · Jan 24, 2011

crackp0t said:
I ran the scripts and the regex came out faster. However, I should have been a little nicer.

user@devbox:~/wf# php other.php
start: 1295859636.8 end: 1295859671.85 Took:35.0463261604 seconds

user@devbox:~/wf# php rage.php
start: 1295859681.71 end: 1295859841.74 Took:160.028321981 seconds

EDIT:

My above statements weren't meant to reflect on his product which from all accounts is great.

I can't duplicate your results, either on my box or on any of my servers.

Insomniac · Jan 24, 2011

Rage9 said:
I can't duplicate your results, either on my box or on any of my servers.

[root@striker ~]# php mattseh.php
start: 1295896783.8194 end: 1295896792.5596 Took:8.7402460575104 seconds
[root@striker ~]# php rage.php
start: 1295896915.0603 end: 1295896926.8662 Took:11.805890083313 seconds
[root@striker ~]# cat /proc/cpuinfo | grep CPU
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz

My laptop is teh suck too.

Rage9 · Jan 24, 2011

Insomniac said:
[root@striker ~]# php mattseh.php
start: 1295896783.8194 end: 1295896792.5596 Took:8.7402460575104 seconds
[root@striker ~]# php rage.php
start: 1295896915.0603 end: 1295896926.8662 Took:11.805890083313 seconds
[root@striker ~]# cat /proc/cpuinfo | grep CPU
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz

My laptop is teh suck too.

You know, that's really interesting. Maybe it has to do with how many cores the system has? My local system is a 6-core AMD @ 4 GHz a core with 8 gigs of RAM at 1600 MHz. My code blows it away. Another rig I tested it on is a dedi (don't know the exact specs) and my code blows it away there also.

So then I'm like, WTF is going on? How can I not be reaching the same conclusion? I just tested it on a VPS that has 768 MB of RAM, and my code just got crushed. So who knows, the beefier the system the better non-regex does? I don't know the answer to this, but it's a question worth asking. Anyone know of a PHP configuration that would affect this?
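
Since the numbers in this thread swing so much from machine to machine, here is a minimal sketch of a more repeatable way to time the two styles, assuming Python rather than PHP. It does not answer the PHP configuration question and does not reproduce anyone's results. It only shows taking the best of several runs and reporting wall-clock and CPU time separately, which matters when other processes are running at the same time. The page data is synthetic, and `with_regex`, `with_splitting` and `best_of` are hypothetical names.

```python
import re
import time
import timeit

# Synthetic page: 200 copies of one made-up link, standing in for bst.html.
PAGE = (
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '111-first-thread.html">x</a></div>\n'
) * 200

THREAD_RE = re.compile(
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)


def with_regex():
    return THREAD_RE.findall(PAGE)


def with_splitting():
    # Text between each marker and the next double quote.
    return [chunk.partition('"')[0] for chunk in PAGE.split('sell-buy-trade/')[1:]]


def best_of(func, timer, repeats=5, number=20):
    # The fastest of several runs is the one least disturbed by other processes.
    return min(timeit.repeat(func, timer=timer, repeat=repeats, number=number))


for name, func in (('regex', with_regex), ('splitting', with_splitting)):
    wall = best_of(func, time.perf_counter)  # elapsed wall-clock time
    cpu = best_of(func, time.process_time)   # CPU time used by this process
    print(f'{name}: wall {wall:.4f}s, cpu {cpu:.4f}s')
```

If wall-clock and CPU time disagree badly, the machine was busy with something else during the run, which is one plausible reason the posted timings differ so much between boxes.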
