Fuck, I can't believe this thread might be going into another debate about language uses, like every other thread we coders get into on this forum. lulz
From my BST bumper detector:
Code:
threads = re.compile(r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(main)
How would you do this without regex?
Time is money, I can write regexes quickly and concisely to get exactly what I need.
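To make that pattern concrete, here is a minimal sketch of what its three capture groups pull out. The `sample` markup and the name `THREAD_RE` are made up for illustration; the real B/S/T page may be laid out differently.

```python
import re

# Pattern copied from the post above, written as a raw string so that
# \d and \. reach the regex engine unchanged.
THREAD_RE = re.compile(
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)

# Hypothetical markup, invented for this example only.
sample = (
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-cheap-links.html">Cheap links</a></div>\n'
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '678-buying-aged-domains-2.html">Aged domains</a></div>\n'
)

# findall returns one tuple per match: (thread id, slug, trailing digits).
threads = THREAD_RE.findall(sample)
for thread_id, slug, trailing_digits in threads:
    print(repr(thread_id), repr(slug), repr(trailing_digits))
# '12345' 'cheap-links' ''
# '678' 'buying-aged-domains-' '2'
```

On this sample the third group is empty for the first link and holds the trailing `2` for the second, which is the piece a plain `explode`/`split` version has to check for by hand.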
<?php
//break up the page
$links = explode('sell-buy-trade/', $pagedata);
//set up a loop
$x = 0; //counter for good trails array
for($i=1; $i<count($links); $i++){ //stop at count($links) - 1, the last valid index
    $qp = strpos($links[$i], '"'); //find our quote position
    $trail = substr($links[$i], 0, $qp); //extract everything before the quote
    $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
    if($htmlp !== FALSE){
        $trail = substr($trail, 0, $htmlp); //extract the data before .html
        $data = explode('-', $trail); //separate all data elements
        $numdata = count($data); //find out how many items are in our array
        if(is_numeric($data[0]) && !is_numeric($data[$numdata - 1])){ //if the first data chunk is numeric and the last one isn't
            $goodtrail[$x] = $trail; //save our trail
            $x++; //increase our array counter
        }
    }
}
?>
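For anyone following along in Python rather than PHP, the same no-regex idea can be sketched roughly like this. It is a loose port, not a line-for-line translation. The function name `find_trails` and the `sample` markup are hypothetical, `str.isdigit` stands in for PHP's `is_numeric`, and `.html` is looked up inside the already-trimmed trail rather than in the whole chunk.

```python
def find_trails(pagedata):
    """Collect URL trails such as '12345-cheap-links' using only string methods."""
    good_trails = []
    # Chunk 0 is whatever precedes the first 'sell-buy-trade/' marker, so skip it.
    for chunk in pagedata.split('sell-buy-trade/')[1:]:
        quote_pos = chunk.find('"')  # end of the href attribute
        if quote_pos == -1:
            continue
        trail = chunk[:quote_pos]
        html_pos = trail.find('.html')
        if html_pos == -1:
            continue
        trail = trail[:html_pos]  # drop the extension
        parts = trail.split('-')
        # Keep trails that start with a numeric id and do not end in a number.
        if parts[0].isdigit() and not parts[-1].isdigit():
            good_trails.append(trail)
    return good_trails


# Hypothetical markup, invented for this example only.
sample = (
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-cheap-links.html">a</a></div>\n'
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '678-buying-aged-domains-2.html">b</a></div>\n'
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/">index</a></div>\n'
)

print(find_trails(sample))  # ['12345-cheap-links']
```

The second link is dropped because its last dash-separated piece is numeric, and the third because it has no `.html` before the closing quote.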
Rage9, I feel sorry for anyone who has you code for them if you really think your functions are faster than a good regex.
Wow, maybe you should go out and test this before you open your mouth. Regex becomes exponentially slower the more "rules" you add, and also the lower-level a language you go. Even at the PHP level, regex is often quite a bit slower due to the load time of the regex engine.
Take the challenge issued earlier in the thread by mattseh: my function (which I wrote in about 10 minutes) is, according to my tests, 2x as fast. That's right, twice as fast. Also, I'd hardly consider that a complex regex. I'm running this off my personal rig and getting times of ~22 seconds for mine and ~43 seconds for the regex. Don't believe me? Run these scripts:
Mine:
Code:
<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}
$start = microtime_float();
$pagedata = file_get_contents('bst.html');
for ($z = 0; $z < 20000; $z++) {
    //break up the page
    $links = explode('sell-buy-trade/', $pagedata);
    //set up a loop
    $x = 0; //counter for good trails array
    for ($i = 1; $i < count($links); $i++) { //stop at count($links) - 1, the last valid index
        $qp = strpos($links[$i], '"'); //find our quote position
        $trail = substr($links[$i], 0, $qp); //extract everything before the quote
        $htmlp = strpos($links[$i], '.html'); //find where .html lies in the string
        if ($htmlp !== FALSE) {
            $trail = substr($trail, 0, $htmlp); //extract the data before .html
            $data = explode('-', $trail); //separate all data elements
            $numdata = count($data); //find out how many items are in our array
            if (is_numeric($data[0]) && !is_numeric($data[$numdata - 1])) { //if the first data chunk is numeric and the last one isn't
                $goodtrail[$x] = $trail; //save our trail
                $x++; //increase our array counter
            }
        }
    }
}
$end = microtime_float();
$parseTime = $end - $start;
echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>
What the equivalent of his in PHP would be:
Code:
<?php
function microtime_float() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}
$start = microtime_float();
$pagedata = file_get_contents('bst.html');
for ($z = 0; $z < 20000; $z++) {
    preg_match_all('/<div>[^<]+?<a href="http:\/\/www.wickedfire.com\/sell-buy-trade\/(\d+)-([^#]+?)(\d*?)\.html"/', $pagedata, $matches);
}
$end = microtime_float();
$parseTime = $end - $start;
echo 'start: ' . $start . '<br />';
echo 'end: ' . $end . '<br />';
echo 'Took:' . $parseTime . ' seconds';
?>
bst.html is simply a file containing all the data from the B/S/T page.
Put 3 computer nerds in a room together, and you get...
import re
data = open('bst.html').read()
for i in range(2000):
    threads = re.compile(r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/(\d+)-([^#]+?)(\d*?)\.html"').findall(data)
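If you want to time the Python version the way the PHP scripts above do, something along these lines should work. It is only a sketch: `page` is a synthetic stand-in for bst.html (hypothetical markup), the run count is arbitrary, and the pattern is compiled once outside the timed loop, so the numbers won't be comparable to anyone else's box.

```python
import re
import timeit

PATTERN = (
    r'<div>[^<]+?<a href="http://www.wickedfire.com/sell-buy-trade/'
    r'(\d+)-([^#]+?)(\d*?)\.html"'
)
THREAD_RE = re.compile(PATTERN)  # compiled once, outside the timed loop

# Synthetic stand-in for bst.html: the same hypothetical link repeated 200 times.
page = (
    '<div> <a href="http://www.wickedfire.com/sell-buy-trade/'
    '12345-cheap-links.html">x</a></div>\n'
) * 200


def run_regex():
    """One pass of the regex over the whole page."""
    return THREAD_RE.findall(page)


runs = 100  # arbitrary; raise it for steadier numbers
elapsed = timeit.timeit(run_regex, number=runs)
print(f'{runs} runs took {elapsed:.4f} seconds, '
      f'{len(run_regex())} matches per run')
```

`timeit.timeit` calls the function `number` times and returns the total wall-clock time in seconds, which saves hand-rolling a `microtime_float` equivalent.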
I'm a huge fan of XPath instead of regex bullshit. Regex makes absolutely no sense and is a bitch and a half to try and read when revisiting code. XPath is sick, fast, accurate, and there's the FireXPath plugin for Firefox that lets you find the exact XPath for any element on a page, super handy.
But let's say the data isn't HTML, but some random text database. Would this still apply?
You know, I've used that plugin for some PHP scripts I was writing, but I never could get the XPath it would spit out to work with any of my PHP code.
Has anyone used that plugin with anything besides Ruby and gotten it to work?
I used some other XPath tool and finally got it to work though, so I'm not sure if that plugin is biased or what.
With the XPath plugin, you can test out any XPath expression in it to see what it pulls up. I pretty regularly adjust the XPaths to match what I'm looking for correctly. For instance, to find a link on a page with a unique anchor text, you would use .//a[text()="anchor text here"]
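For a rough idea of how an anchor-text lookup like that plays out in code, here is a small sketch using Python's standard library. Two assumptions to flag: the `snippet` is made-up, well-formed markup (real-world HTML usually needs an HTML-aware parser first), and `xml.etree.ElementTree` only supports a subset of XPath with no `text()` function, so the predicate is written as `[.='...']` instead (Python 3.7+).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
snippet = """
<div>
  <a href="/first.html">some other link</a>
  <p><a href="/target.html">anchor text here</a></p>
</div>
"""

root = ET.fromstring(snippet)

# [.='...'] matches elements whose complete text content equals the string;
# it is ElementTree's closest equivalent to text()="..." in full XPath.
links = root.findall(".//a[.='anchor text here']")
hrefs = [link.get('href') for link in links]
print(hrefs)  # ['/target.html']
```

A full XPath engine would accept the `text()` form as written in the post; the stdlib version is used here only to keep the example dependency-free.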
Also, let's move past the speed comparisons; they don't really matter for the point of this thread.
I ran the scripts and the regex came out faster. However, I should have been a little nicer.
user@devbox:~/wf# php other.php
start: 1295859636.8<br />end: 1295859671.85<br />Took:35.0463261604 seconds
user@devbox:~/wf# php rage.php
start: 1295859681.71<br />end: 1295859841.74<br />Took:160.028321981 seconds
EDIT:
My above statements weren't meant to reflect on his product which from all accounts is great.
I can't duplicate your results, neither on my box nor on any of my servers.
[root@striker ~]# php mattseh.php
start: 1295896783.8194<br />end: 1295896792.5596<br />Took:8.7402460575104 seconds
[root@striker ~]# php rage.php
start: 1295896915.0603<br />end: 1295896926.8662<br />Took:11.805890083313 seconds
[root@striker ~]# cat /proc/cpuinfo | grep CPU
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz
model name : Intel(R) Core(TM)2 Duo CPU T6400 @ 2.00GHz
My laptop is teh suck too.