The Botting Q&A Thread: Ask Away

Fuck, I can't believe this thread might be turning into another debate about language choice, like every other thread us coders get into on this forum. lulz
 


From my BST bumper detector:

Code:
threads = re.compile('<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(main)

How would you do this without regex?

Time is money, I can write regexes quickly and concisely to get exactly what I need.
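
To make the capture groups concrete, here's a rough sketch of what that pattern pulls out. The sample markup is made up for illustration (the real B/S/T page source isn't shown in this thread), so treat it as an assumption:

```python
import re

# Same pattern as above, written as raw strings so the backslashes are taken literally.
THREAD_RE = re.compile(
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)

# Hypothetical snippet standing in for one row of the B/S/T listing.
sample = (
    '<div>\n'
    '  <a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-some-thread-title.html" id="thread_title_12345">Some thread</a>\n'
    '</div>'
)

# findall returns one tuple per match: (thread id, slug, trailing digits).
threads = THREAD_RE.findall(sample)
print(threads)  # [('12345', 'some-thread-title', '')]
```

The third group only fills in when the slug ends in digits right before ".html".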

I'm assuming that what the regex does is return all the actual thread URLs on the BST page. I could optimize this further, but this will do:

Code:
<?php
//break up the page
$links = explode('sell-buy-trade/', $pagedata);
//setup a loop
$x = 0; //counter for good trails array
for($i=1; $i<=count($links); $i++){

    $qp = strpos($links[$i], '"'); //find our quote position
    $trail = substr($links[$i], 0, $qp); //extract the rest of the string
    $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
    if($htmlp !== FALSE){
        $trail = substr($trail, 0, $htmlp); //extract the data before .html
        $data = explode('-', $trail); //separate all data elements
        $numdata = count($data); //find out how many items are in our array
        if(is_numeric($data[0]) && !is_numeric($data[$numdata - 1])){ //if the first data chunk is numeric and the last one isn't
            $goodtrail[$x] = $trail; //save our trail
            $x++; //increase our array counter
        }
    }
}
?>
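
For comparison with the Python one-liner, a loose Python port of the same split-and-slice idea might look like the sketch below. It is not a line-for-line translation: `extract_trails` and the sample page are hypothetical, `str.isdigit` is stricter than PHP's `is_numeric`, and the loop stays inside the list, whereas the PHP `for` above runs to `$i <= count($links)` and so reads one index past the end.

```python
def extract_trails(page):
    """Split-based extraction, loosely mirroring the PHP logic above."""
    trails = []
    # Chunk 0 is everything before the first marker, so it is never a link.
    for chunk in page.split('sell-buy-trade/')[1:]:
        quote_pos = chunk.find('"')
        if quote_pos == -1:
            continue  # no closing quote, not an href value
        trail = chunk[:quote_pos]
        if not trail.endswith('.html'):
            continue  # e.g. the bare forum link
        trail = trail[:-len('.html')]
        parts = trail.split('-')
        # Keep "id-slug" links; drop ones whose last piece is numeric (page links).
        if parts[0].isdigit() and not parts[-1].isdigit():
            trails.append(trail)
    return trails


# Hypothetical sample: one thread link, the forum index link, one "page 2" link.
page = (
    '<div>\n<a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-some-thread-title.html" id="t">x</a> '
    '<a href="http://www.wickedfire.com/sell-buy-trade/">forum</a> '
    '<a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-some-thread-title-2.html">2</a>'
)

print(extract_trails(page))  # ['12345-some-thread-title']
```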

Believe me, I'm not trying to start a holy war or anything like that. I stated my opinion/philosophy on the regex subject and everyone else is going crazy like little teen school girls. If regex is your thing, by all means go for it.
 
Rage9 I feel sorry for anyone that has you code for them if you really think your functions are faster than good regex.
 
Wow, maybe you should go out and test this before you open your mouth. Regex gets exponentially slower the more "rules" you add, and also the lower-level the language you use. Even at the PHP level, regex is oftentimes a good deal slower because of the load time of the regex engine.

Take the challenge mattseh issued earlier in the thread: my function (which I wrote in about 10 min) is actually, according to my tests, 2x as fast. That's right, twice as fast. Also, I'd hardly consider that a complex regex. I'm running this off my personal rig and getting times of ~22 seconds for mine and ~43 seconds for the regex. Don't believe me? Run these scripts:

Mine:
Code:
<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start =  microtime_float();

$pagedata = file_get_contents('bst.html');
for ($z = 0; $z < 20000; $z++) {

//break up the page
    $links = explode('sell-buy-trade/', $pagedata);
//setup a loop
    $x = 0; //counter for good trails array
    for ($i = 1; $i <= count($links); $i++) {

        $qp = strpos($links[$i], '"'); //find our quote position
        $trail = substr($links[$i], 0, $qp); //extract the rest of the string
        $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
        if ($htmlp !== FALSE) {
            $trail = substr($trail, 0, $htmlp); //extract the data before .html
            $data = explode('-', $trail); //separate all data elements
            $numdata = count($data); //find out how many items are in our array
            if (is_numeric($data[0]) && !is_numeric($data[$numdata - 1])) { //if the first data chunk is numeric and the last one isn't
                $goodtrail[$x] = $trail; //save our trail
                $x++; //increase our array counter
            }
        }
    }
}
$end =  microtime_float();
$parseTime = $end - $start;
echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

The PHP equivalent of his would be:
Code:
<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}

$start = microtime_float();

$pagedata = file_get_contents('bst.html');

for ($z = 0; $z < 20000; $z++) {
    preg_match_all('/<div>[^<]+?<a href="http:\/\/www.wickedfire.com\/sell-buy-trade\/(\d+)-([^#]+?)(\d*?)\.html"/', $pagedata, $matches);
}
$end = microtime_float();
$parseTime = $end - $start;

echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>

bst.html is simply a file containing all the data from the B/S/T page.
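
For anyone who wants to rerun the comparison without the real page, a rough timing harness in Python could look like the sketch below. Everything in it is an assumption for illustration: the synthetic page stands in for bst.html, `with_split` is a loose port of the split approach rather than the PHP above, and the run count is arbitrary. It records both wall-clock and CPU time, since wall-clock time is sensitive to whatever else the machine is doing.

```python
import re
import time

THREAD_RE = re.compile(
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)


def with_regex(page):
    return THREAD_RE.findall(page)


def with_split(page):
    trails = []
    for chunk in page.split('sell-buy-trade/')[1:]:
        trail = chunk.partition('"')[0]  # text up to the closing quote
        if trail.endswith('.html'):
            parts = trail[:-5].split('-')
            if parts[0].isdigit() and not parts[-1].isdigit():
                trails.append(trail[:-5])
    return trails


def measure(func, page, runs):
    """Return (wall_seconds, cpu_seconds) for `runs` calls of func(page)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for _ in range(runs):
        func(page)
    return (time.perf_counter() - wall_start,
            time.process_time() - cpu_start)


# Synthetic stand-in for bst.html: 50 fake thread rows.
page = ''.join(
    '<div>\n<a href="http://www.wickedfire.com/sell-buy-trade/'
    '%d-fake-thread.html">t</a></div>\n' % n
    for n in range(1000, 1050)
)

for name, func in (('regex', with_regex), ('split', with_split)):
    wall, cpu = measure(func, page, runs=200)
    print('%s: wall %.4fs, cpu %.4fs' % (name, wall, cpu))
```

The numbers will depend heavily on the machine and the input, so treat any single run as a data point, not a verdict.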
 

I ran the scripts and the regex came out faster. However, I should have been a little nicer.


user@devbox:~/wf# php other.php
start: 1295859636.8<br />end: 1295859671.85<br />Took:35.0463261604 seconds

user@devbox:~/wf# php rage.php
start: 1295859681.71<br />end: 1295859841.74<br />Took:160.028321981 seconds

EDIT:

My above statements weren't meant to reflect on his product, which by all accounts is great.
 
I'm a huge fan of XPath instead of regex bullshit. Regex makes absolutely no sense and is a bitch and a half to try and read when revisiting code. XPath is sick, fast, accurate, and there's the firexpath plugin for Firefox that lets you find the exact XPath for any element on a page. Super handy.
 
matthew@matthew-laptop:~/php$ time php mattseh.php
start: 1295867727.176<br />end: 1295867817.3645<br />Took:90.188463926315 seconds
real 1m30.243s
user 0m27.550s
sys 0m0.250s

matthew@matthew-laptop:~/php$ time php rage.php
start: 1295867660.5338<br />end: 1295867807.4684<br />Took:146.93461108208 seconds
real 2m26.971s
user 0m50.060s
sys 0m1.060s

matthew@matthew-laptop:~/php$ time python mattseh.py

real 0m2.526s
user 0m2.100s
sys 0m0.020s


Code:
import re
data = open('bst.html').read()
for i in range(2000):
    threads = re.compile('<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(data)

Also, I was getting undefined offset notices from your code on PHP 5. Total time to write my code: maybe 5 minutes, including the original time to write the regex.

This is on a single-core netbook with the PHP scripts running at the same time, so the PHP wall-clock times will be off; the "user" time will be accurate.

Python regex caching ftw ;)
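
On the caching remark: as far as I know, CPython's `re` module keeps a small cache of recently compiled patterns, so calling `re.compile` with the same string inside the loop is mostly a cache lookup after the first pass. A minimal sketch, with a shortened pattern and made-up sample text, showing that hoisting the compile out of the loop gives the same matches:

```python
import re

# Shortened, hypothetical pattern and text, just to keep the example small.
PATTERN = r'sell-buy-trade/(\d+)-'
text = 'x sell-buy-trade/111-a.html y sell-buy-trade/222-b.html'

# Compile inside the loop, as in the benchmark script above.
inside = None
for _ in range(3):
    inside = re.compile(PATTERN).findall(text)

# Compile once, then reuse the compiled object.
compiled = re.compile(PATTERN)
outside = None
for _ in range(3):
    outside = compiled.findall(text)

print(inside == outside)  # True
```

Compiling once up front doesn't rely on the cache at all, which is the safer habit when a script juggles a lot of different patterns.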
 
But let's say the code isn't HTML, but some random txt database; would this still apply?
 
You know, I've used that plugin for some PHP scripts I was writing, but I never could get the XPath it would spit out to work with any of my PHP code.

Has anyone used that plugin with anything besides Ruby and gotten it to work?

I used some other XPath tool and finally got it to work though, so I'm not sure if that plugin is biased or what.
 
While regexes are nice (and often quite fast), the headache surrounding even the simplest tasks just makes them not worth it.

I can do regex, but if you have to revisit a regex or give it to another programmer - fucking forget it.

Sorry, but coder time is much, much more valuable (read: expensive) than computer time.

Unless we are talking gigabytes of data analysis here, I am steering around regexes.

::emp::
 
My mistake, the original Python code ran 2000 iterations, not 20000:

matthew@matthew-laptop:~/php$ time python mattseh.py

real 0m26.568s
user 0m20.830s
sys 0m0.160s

So, close to PHP. My point still stands though: regex is faster for both the computer and the programmer than writing your own functions.
 
With the XPath plugin, you can test out any XPath expression in it to see what it pulls up. I pretty regularly adjust the XPaths to match exactly what I'm looking for. For instance, to find a link on a page with a unique anchor text, you would use .//a[text()="anchor text here"]
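
As a rough illustration of that anchor-text lookup outside the browser, here is a sketch using Python's standard library. The markup is hypothetical and deliberately well-formed, and note one swap: ElementTree only implements a subset of XPath with no `text()` test, so the sketch uses its `[.='...']` predicate (Python 3.7+) as the nearest equivalent. A full XPath engine would take the original expression as written.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real-world HTML usually needs an
# HTML-aware parser before any XPath query will work on it.
doc = ET.fromstring(
    '<html><body>'
    '<a href="/one">some other link</a>'
    '<a href="/two">anchor text here</a>'
    '</body></html>'
)

# [.='...'] keeps elements whose complete text content equals the string.
links = doc.findall(".//a[.='anchor text here']")
print([a.get('href') for a in links])  # ['/two']
```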

Also, let's move past the speed comparisons; they don't really matter for the point of this thread.
 
No doubt. I use the shit out of xpather.
 

I can't duplicate your results, either on my box or on any of my servers.
 
[root@striker ~]# php mattseh.php
start: 1295896783.8194<br />end: 1295896792.5596<br />Took:8.7402460575104 seconds
[root@striker ~]# php rage.php
start: 1295896915.0603<br />end: 1295896926.8662<br />Took:11.805890083313 seconds
[root@striker ~]# cat /proc/cpuinfo | grep CPU
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz

My laptop is teh suck too.
 
You know, that's really interesting. Maybe it has to do with how many cores the system has? My local system is a 6-core AMD @ 4 GHz per core with 8 gigs of RAM at 1600 MHz. My code blows it away. Another rig I tested it on is a dedi (don't know the exact specs), and my code blows it away there also.

So then I'm like, WTF is going on? How can I not be reaching the same conclusion? I just tested it on a VPS that has 768 MB of RAM, and my code just got crushed. So who knows, maybe the beefier the system, the better non-regex does? I don't know the answer to this, but it's a question worth asking. Anyone know of a PHP configuration that would affect this?