The WF PHP Functions War Chest

chatmasta · Aug 16, 2009

UPDATE: I made it better...now finds multiple words.

Code:

<?php
/*
    * by chatmasta
    * word_match($s1, $s2, $min_length = 4, $collapse_whitespace = true)
    * compares two $s1 and $s2 and returns a shared substring of length greater
    * than or equal to $min_length. if no shared substring is found, returns false.
    * if $find_all is true, will find ALL words. otherwise will just find the first (longest) one.
    * if $collapse_whitespace is true, deletes whitespace before all operations.
*/
function word_match($s1, $s2, $min_length = 4, $find_all = true, $collapse_whitespace = true) {
    if($collapse_whitespace) {
        $s1 = str_replace(' ', '', preg_replace('/\s\s+/', '', $s1));
        $s2 = str_replace(' ', '', preg_replace('/\s\s+/', '', $s2));
    }
    $return = array();
    $longer = (strcmp($s1, $s2) < 0) ? 's1' : 's2';
    $shorter = (strcmp($s1, $s2) > 0) ? 's1' : 's2';
    $st = similar_text($s1, $s2);
    $remaining = strlen(${$longer}) - $st;
    for($i = 0; $i < $remaining; $i++) {
        $word_length = $st - $i;
        $stop = ($find_all) ? strlen(${$shorter}) : strlen(${$shorter}) - $st + 1;
        for($start = 0; $start < $stop; $start++) {
            $check = substr(${$shorter}, $start, $word_length);
            if(@strstr(${$longer}, $check) && strlen($check) >= $min_length) {
                if($find_all) {    $return[] = $check; }
                else { return $check; }
            }
        }
    }
    return (count($return) > 0) ? array_unique($return) : false;
}
$s1 = 'blow up like the world trade';
$s2 = 'like trade';
if($match = word_match($s1, $s2)) { print_r($match); } else { echo 'fail'; }
?>

TylerL · Oct 1, 2009

Remove duplicate entries using an array of matches. Feel free to fix it up, I'm not exactly a good coder:

[high='php']<?php

function isMatch($var) {

$matches = array ('Carrot', 'Onion');

$x = true;

foreach ($matches as $value) {
if ($var == $value) {
$x = false;
}
}

return $x;

}

function removeMatches ($array) {

$newArray = array_filter($array, "isMatch");
return $newArray;

}

$items = array ('Banana', 'Carrot', 'Orange', 'Onion', 'Apple', 'Onion', 'Strawberry', 'Banana');

$items = array_unique($items);
$final = removeMatches($items);

sort($final);
print_r($final);

?>[/high]

hehejo · Oct 12, 2009

The script doesnt work with the proxy script, can someone help? =)

It works without, but I don't want to get my IP banned ;-)

You can upload a csv from the google keywords tool and get different search results numbers back, useful to find low competition keywords for articles etc... I will repost with proper credits for the used code as soon as it works

Code:

<?PHP
    set_time_limit(180);
    $GooglePrefix       = "http://www.google.com/search?q=";
    $GooglePrefix2      = "http://www.google.com/search?q=allintitle%3A%22";
    $GooglePrefix3      = "http://www.google.com/search?q=allinanchor%3A%22";
    $GoogleSuffix       = "%22";
    $query              = urlencode($_GET['s']);
    $proxyip                        = "147.133.3.136:8080"
?>
<h1>Upload Google Adwords csv</h1>
<form enctype="multipart/form-data" method="post" action="<?=$_SERVER['PHP_SELF']?>">
<input type="file" name="fileToUpload" /><br />
<input type="submit" value="Upload File" />
</form>

<?php

  
    // Import csv
    $csv = array();
    $lines = file($_FILES["fileToUpload"]["tmp_name"], FILE_IGNORE_NEW_LINES);

    // Create Table
        echo '<table style="border: 1px solid black;" width="800"><tr><td><b>Keywords</b></td><td><b>Average Searches</b></td><td><b>Google Results</b></td><td><b>Allintitle Results</b></td><td><b>Allinanchor Results</b></td></tr>';
  

        foreach ($lines as $key => $value)
        {
    
        $query = urldecode($value);
    if(preg_match("/\[(.*)\]/s",$query,$text)){
        $urltxt = $text[1];
        $query2 = $text[1];
        $query2 = urlencode($query2);
        $CompleteUrl    = $GooglePrefix.$query2;
        $CompleteUrl2   = $GooglePrefix2.$query2.$GoogleSuffix;
        $CompleteUrl3   = $GooglePrefix3.$query2.$GoogleSuffix;
    $txt = $value;
    
        $out=preg_split('/\,+/',trim($txt)); 
        $average = $out[count($out)-1];  
   
    
    // Fetch URL and get Results # with preg_match
    $urlproxy = curl_proxy($CompleteUrl,$proxyip) ;
        $comp = webFetcher($urlproxy);
        if(preg_match("/ of about \<b\>([0-9,]*)/",$comp,$text)){
            $resultsnr = $text[1];
        }     
        
    $urlproxy = curl_proxy($CompleteUrl2,$proxyip) ;
        $comp = webFetcher($urlproxy);
        if(preg_match("/ of about \<b\>([0-9,]*)/",$comp2,$text2)){
            $allintitlenr = $text2[1];
        } 
        
    $urlproxy = curl_proxy($CompleteUrl3,$proxyip) ;
        $comp = webFetcher($urlproxy);
        if(preg_match("/ of about \<b\>([0-9,]*)/",$comp3,$text3)){
            $allinanchornr = $text3[1];
        } 
        

        echo "<tr><td><a href=\"$CompleteUrl\">".$urltxt."</a></td><td>".$average."</td><td>".$resultsnr."</td><td>".$allintitlenr."</td><td>".$allinanchornr."</td></tr>";
      }
        }
        echo "</table>";
        
        
        // Functions
        function do_reg($text, $regex) {
                preg_match_all($regex, $text, $regxresult, PREG_PATTERN_ORDER);
                return $regresult = $regxresult[1];
            }
            
        function webFetcher($url) {
            // Fetch and return source from $url

            $crawl = curl_init();                           // The curl library is initiated, the following lines set the curl variables
            curl_setopt ($crawl, CURLOPT_URL, $url);        // The URL is set
            curl_setopt($crawl, CURLOPT_RETURNTRANSFER, 1); // Tells it to return the results in a variable
            $resulting = $resulting.curl_exec($crawl);      // curl is executed and the results stored in $resulting
            curl_close($crawl);                             // closes the curl procedure.
            return $result = $resulting;                    // Returns source
        }
        
        function curl_proxy($url,$proxy) {
          //$useragent= random_useragent();
          $cUrl = curl_init();
          curl_setopt($cUrl, CURLOPT_URL, $url);
          curl_setopt($cUrl, CURLOPT_RETURNTRANSFER, 5);
          curl_setopt($cUrl, CURLOPT_CONNECTTIMEOUT, 5);
          curl_setopt($cUrl, CURLOPT_TIMEOUT, 30);
          //curl_setopt($cUrl, CURLOPT_PROXY, $proxy);
          curl_setopt($cUrl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
          //curl_setopt($cUrl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
          //curl_setopt($cUrl, CURLOPT_USERAGENT, $useragent);
          curl_setopt($cUrl, CURLOPT_FOLLOWLOCATION, TRUE);               
          $PageContent = curl_exec($cUrl);
          return $PageContent ;
        }
  ?>

webDOMinator · Oct 22, 2009

Lazy Insert and Update queries from forms

*Digs through magic hat*

oh, here's a nice little one-liner trick for those lazy like myself. It is two functions that allow you to make an SQL insert or update query based off of $_POST form data. I use it all the time. The arguments are just which table you want to insert to, and then if you have a different array besides POST that you want to override into the functions.

Note on this: Form field names must be the same names as the fields in the table you're putting the data in. If you don't want something to be included in the queries, put a "_" before it... for instance if there was a password confirmation field that was extra for creating a user account, I would name it "_passconfirm" and then it wouldn't get included.

Code:

function make_forminsert($tablename,$inputarr=''){
    global $reqSymbol,$logged;
    if (!$inputarr) $inputarr=$_POST;
    $_SESSION[$logged][reportdata][$tablename] = false;
    $fieldnames = 'INSERT INTO `'.$tablename.'` (';
    $fieldvalues = 'VALUES (';
    $fieldnamearray = array_keys($inputarr);
    foreach ($fieldnamearray as $fieldname){
        //print $fieldname.' '.strlen($inputarr[$fieldname]).'<br>';
        if ($inputarr[$fieldname]=='YYYY-MM-DD') $inputarr[$fieldname]='0000-00-00';
        if ($fieldname != 'shipsame' && substr($fieldname, 0, 1) != '_' && strlen($fieldname) > 1 && $fieldname != 'MAX_FILE_SIZE' && substr($fieldname,0,8) != 'userfile' && $fieldname != 'image_x' && $fieldname != 'image_y' && $fieldname != 'submit_x' && $fieldname != 'submit_y'){
            if (substr($fieldname, 0, strlen($reqSymbol)) == $reqSymbol) $fieldname1 = substr($fieldname, strlen($reqSymbol), strlen($fieldname) - strlen($reqSymbol)); else $fieldname1 = $fieldname;
            $fieldnames .= '`'.$fieldname1.'`,';
            if (is_numeric($inputarr[$fieldname])){
                $fieldvalues .= $inputarr[$fieldname].',';
            }elseif (is_array($inputarr[$fieldname])){
                $fieldval = '';
                foreach ($inputarr[$fieldname] as $val){
                    $fieldval .= $val;
                }
                if ($inputarr[$fieldname][2]=='am' || $inputarr[$fieldname][2]=='pm') $fieldval = date('H:i:s',strtotime($fieldval));
                $fieldvalues .= '"'.addslashes($fieldval).'",';
            }else{
                $fieldvalues .= '"'.addslashes($inputarr[$fieldname]).'",';
            }
        }
    }
    $fieldnames = substr($fieldnames,0,strlen($fieldnames)-1).') ';
    $fieldvalues = substr($fieldvalues,0,strlen($fieldvalues)-1).');';
    //print $fieldnames.$fieldvalues;
    return $fieldnames.$fieldvalues;
}

function make_formupdate($tablename, $where, $inputarr=''){
    global $reqSymbol,$logged, $vid;
    $_SESSION[$logged][reportdata][$tablename] = false;
    $fieldnames = 'UPDATE `'.$tablename.'` SET ';
    if (!$inputarr) $inputarr = $_POST;
    $fieldnamearray = array_keys($inputarr);
    foreach ($fieldnamearray as $fieldname){
        //print $fieldname.' '.strlen($_POST[$fieldname]).'<br>';
        if ($inputarr[$fieldname]=='YYYY-MM-DD') $inputarr[$fieldname]='0000-00-00';
        if (substr($fieldname, 0, strlen($reqSymbol)) == $reqSymbol) $fieldname = substr($fieldname, strlen($reqSymbol), strlen($fieldname) - strlen($reqSymbol));
        if ($fieldname != 'PHPSESSID' && substr($fieldname, 0, 1) != '_' && strlen($fieldname) > 1 && $fieldname != 'MAX_FILE_SIZE' && substr($fieldname,0,8) != 'userfile' && $fieldname != 'prodid' && $fieldname != 'image_x' && $fieldname != 'image_y' && $fieldname != 'submit_x' && $fieldname != 'submit_y'){
            $fieldnames .= '`'.$fieldname.'`=';
            if (is_numeric($inputarr[$fieldname])){
                $fieldnames .= $inputarr[$fieldname].',';
            }elseif (is_array($inputarr[$fieldname])){
                $fieldval = '';
                foreach ($inputarr[$fieldname] as $val){
                    $fieldval .= $val;
                }
                if ($inputarr[$fieldname][2]=='am' || $inputarr[$fieldname][2]=='pm') $fieldval = date('H:i:s',strtotime($fieldval));
                $fieldnames .= '"'.addslashes($fieldval).'",';
            }else{
                $fieldnames .= '"'.$inputarr[$fieldname].'",';
            }
        }
    }
    $fieldnames = substr($fieldnames,0,strlen($fieldnames)-1).' WHERE '.$where;
    //print $fieldnames.$fieldvalues;
    return $fieldnames;
}

It leaves out submit fields, image fields, and MAX_FILE_SIZE fields for multipart forms with file uploads. There are some other little goodies in there... nuances and what not... I'll leave that up to you to figure out.

Merezza · Dec 5, 2009

Jizzlobber said:
hxxp://rapidshare.com/files/182076915/Webbots.Spiders.and.Screen.Scrapers.rar

Hell yeah JL! I've been looking for this book everywhere. Thanks a lot

Marketcake · Jan 14, 2010

scraper.php?christianity+is

Code:

christianity is bullshit - 1,080,000 Results
christianity is not a religion - 24,500,000 Results
christianity is a lie - 6,860,000 Results
christianity is false - 8,390,000 Results
christianity is a cult - 3,780,000 Results
christianity is wrong - 14,300,000 Results

scraper.php?buddhism+is

Code:

buddhism is not a religion - 6,350,000 Results
buddhism is wrong - 2,350,000 Results
buddhism is not what you think - 842,000 Results
buddhism is bullshit - 304,000 Results
buddhism is polytheistic - 760,000 Results

scraper.php?hinduism+is

Code:

hinduism is false - 814,000 Results
hinduism is polytheistic - 153,000 Results
hinduism is not a religion - 3,400,000 Results
hinduism is fake - 510,000 Results

scraper.php?islam+is

Code:

guerilla · Jan 14, 2010

Marketcake said:
scraper.php?christianity+is
scraper.php?buddhism+is
scraper.php?hinduism+is
scraper.php?islam+is

GTFO. This is not shooting the shit.

plasticbag · Jan 18, 2010

Since I grabbed a few of these, I'll offer up the following. Basically, stops spam from ever reaching the WordPress moderation queue, or Akismet. Akismet went from flagging 50+ comments a day to maybe 2 or 3 a week, now. Just add your filters to $bad_comment_content. The search is case insensitive and haven't noticed any false positives over the last few weeks. Also, script assumes "/var/log/apache/wp_post-logs/dropped-comments.txt" is writable and dumps the dropped comment (along with other info) into that text file for later review...

open up /comments-posts.php. Find:

Code:

$comment_content      = ( isset($_POST['comment']) ) ? trim($_POST['comment']) : null;

And right after, insert the following...

Code:

$low_comment_content = strtolower($comment_content);
$bad_comment_content = array(
                        'viagra',
                        'hydrocodone',
                        '[url=http',
                        '[link=http',
                        'russian girls',
                        'russian brides',
                        'amoxicillin'
                        );
function in_array_like($string, $array) { foreach($array as $ref) { if(strstr($string, $ref)) { return true; } } return false; }
if (in_array_like($low_comment_content, $bad_comment_content)) {
        $comment_box_text = wordwrap(trim($comment_content), 80, "\n  ", true);
        $txtdrop = fopen('/var/log/apache/wp_post-logs/dropped-comments.txt', 'a');
        fwrite($txtdrop, "  --------------\n  [COMMENT] = " . $comment_content . "\n  --------------\n");
        fwrite($txtdrop, "  [SOURCE_IP] = " . $_SERVER['REMOTE_ADDR'] . " @ " . date("F j, Y, g:i a") . "\n");
        fwrite($txtdrop, "  [USERAGENT] = " . $_SERVER['HTTP_USER_AGENT'] . "\n");
        fwrite($txtdrop, "  [FILE_NAME] = " . $_SERVER['SCRIPT_NAME'] . " - [REQ_URI] = " . $_SERVER['REQUEST_URI'] . "\n");
        fwrite($txtdrop, '--------------**********------------------'."\n");
        header("HTTP/1.1 406 Not Acceptable");
        header("Status: 406 Not Acceptable");
        header("Connection: Close");
        wp_die( __('bang bang.') );
        // die('bang bang.');
}

Think I ripped "function in_array_like" from php.net comments..

ashbeats · Jan 18, 2010

Heya,

I just wanted to share this class with everyone. The original was not written by me, but it has been very useful in reducing the time taken to scrape multiple urls. I simply extended it to work with my scrappers. Now instead of taking ( e.g 10 sec X 30 Urls = 300s ), it only takes as long as the slowest request, in my test it was 8 secs! Overhead was quite low as well.

The class is very simple to extend on.. so have fun.

Simulate Multi-Threaded HTTP requests.

[high=php]

<?php
// LICENSE: PUBLIC DOMAIN
// The author disclaims copyright to this source code.
// AUTHOR: Shailesh N. Humbad
// SOURCE: 404 Not Found
// DATE: 6/4/2008

// index.php
// Run the parallel get and print the total time

// Class to run parallel GET requests and return the transfer
class ParallelGet
{

public $html_assoc;

function __construct($urls)
{
// Create get requests for each URL
$mh = curl_multi_init();
foreach($urls as $i => $url)
{
$ch[$i] = curl_init($url);
curl_setopt($ch[$i], CURLOPT_RETURNTRANSFER, 1);
curl_multi_add_handle($mh, $ch[$i]);
}

// Start performing the request
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
// Loop and continue processing the request
while ($runningHandles && $execReturnValue == CURLM_OK) {
// Wait forever for network
$numberReady = curl_multi_select($mh);
if ($numberReady != -1) {
// Pull in any new data, or at least handle timeouts
do {
$execReturnValue = curl_multi_exec($mh, $runningHandles);
} while ($execReturnValue == CURLM_CALL_MULTI_PERFORM);
}
}

// Check for any errors
if ($execReturnValue != CURLM_OK) {
trigger_error("Curl multi read error $execReturnValue\n", E_USER_WARNING);
}

// Extract the content
foreach($urls as $i => $url)
{
// Check for errors
$curlError = curl_error($ch[$i]);
if($curlError == "") {
$res[$i] = curl_multi_getcontent($ch[$i]);
} else {
print "Curl error on handle $i: $curlError\n";
}
// Remove and close the handle
curl_multi_remove_handle($mh, $ch[$i]);
curl_close($ch[$i]);
}
// Clean up the curl_multi handle
curl_multi_close($mh);

// Print the response data
//print_r($res);
$this->html_assoc = $res;
return($res);
}

}
?>
[/high]

Usage:

[high=php]
<?

#EXAMPLE USAGE
$s = microtime(true);
// Define the URLs
$urls = array(
"http://yahoo.com/",
"http://google.com",
"http://trackitdown.net"
);
$pg = new ParallelGet($urls);
print "<br />total time: ".round(microtime(true) - $s, 4)." seconds to process " . count($pg) . " urls";

$html_of_all_urls_assoc = $pg->html_assoc;

file_put_contents( "debug_dump_of_all_html.html",print_r(html_of_all_urls_assoc ,true) );

#END EXAMPLE USAGE
?>
[/high]

ashbeats · Jan 19, 2010

SuperDave2U said:
Hey guys glad you're liking it.

Anyone have a workaround for timeout issues with php? I want to open up the scope of the spider but running it with a depth of a (x pages * y keywords) would really hang this thing.

Suggestions?

Hi Dave,

I just modded your script to use parallel curl like my previous code, and shaved a whole load of time off.

This is the result:

Original: 73.2088 seconds to process 49 urls (keyword=myspace)

Modded: 7.0003 seconds to process 49 urls (keyword=myspace)!!! That's 10x faster.

Modded:

[high=php]
<?
// Setting the variables

if(! defined(ABSPATH)) define('ABSPATH', dirname(__FILE__).'/');
include_once ABSPATH . "Parallel_curl.php";

$s = microtime(true); #timer

set_time_limit(180);

// -------------------GooglePrefix------¦-----query--¦suffix¦-counter--¦
$GooglePrefix = "http://www.google.com/search?q=";
$GoogleCountSuffix ="&start=";
$dontcrawlarray = array('wikipedia.org','google.com','amazon.com');

$query = urlencode($_GET['s']);
$drilldepth = $_GET['drilldepth'];
?>
<h1>Adsense Spider</h1>
<form action="<?=$_SERVER['PHP_SELF']?>">
<input type="text" name="s" value="<?=urldecode($query)?>">
<select name="drilldepth">
<?PHP
for ($i = 5; $i >= 1; $i--)
echo "<option value='$i' " . ($i == $drilldepth ? 'selected' : '') . "> $i </option>";
?>
</select>
<input type="submit" value="Crawl">
</form>

<?PHP // Loop SERPs for given keyword
if ($query == ''){
die("No search term defined");
}

echo "<hr><h2>" . urldecode($query) . "</h2>Looping through SERPs";
ob_flush(); flush();

for ($loop = 0, $stop = ($drilldepth != NULL ? $drilldepth * 10-10 : 30); $loop <= $stop; $loop = $loop + 10) {

$CompleteUrl = $GooglePrefix.$query.$GoogleCountSuffix.$loop;

$res = andstatus( $res. webFetcher($CompleteUrl) );

}

echo "<br>";
ob_flush(); flush();

// Take search results and parse out URLs
echo "Parsing URIs from SERPs";
ob_flush(); flush();

$resultURLs = do_reg($res, "/h3.class=r.*(http.*)\"/U");
echo " [" . count($resultURLs) . "]<br>";
ob_flush(); flush();

foreach ($dontcrawlarray as $url)
$resultURLs = array_ereg_search($url,$resultURLs);

// Take URL results and loop through each to find Adsense code
echo "Digging through URIs for Adsense enabled pages";
ob_flush(); flush();

$s = microtime(true);
$pg = new ParallelGet($resultURLs);
$html_of_all_urls = $pg->html_assoc;

$x = 0;
foreach ($html_of_all_urls as $html_source) {

( preg_match("/google_ad/", $html_source) ? ( $matches[] = andstatus( $resultURLs[$x] ) ) : '') ;

$x++;
}

$textdump = "";
echo "<hr><h3>Adsense Enabled Placements:</h3><ol>";

// Loop through matches and format for review
foreach ($matches as $match => $pURI) {
echo "<LI><a href='$pURI' target='_BLANK'>$pURI</a></LI>";
$textdump .= "$pURI\n";
}

// Dump links into a box for easy C&P
echo "</ol><textarea cols=80 rows=5>$textdump</textarea>";

print "<br />total time: ".round(microtime(true) - $s, 4)." seconds to process " . count($resultURLs) . " urls";

function do_reg($text, $regex) {
preg_match_all($regex, $text, $regxresult, PREG_PATTERN_ORDER);
return $regresult = $regxresult[1];
}

function webFetcher($url) {
// Fetch and return source from $url
$crawl = curl_init(); // The curl library is initiated, the following lines set the curl variables
curl_setopt ($crawl, CURLOPT_URL, $url); // The URL is set
curl_setopt($crawl, CURLOPT_RETURNTRANSFER, 1); // Tells it to return the results in a variable
$resulting = $resulting.curl_exec($crawl); // curl is executed and the results stored in $resulting
curl_close($crawl); // closes the curl procedure.
return $result = $resulting; // Returns source
}

function andstatus($data) {
// Function allows a psudo progress status '....' to be displayed
echo "."; return $data; }

function array_ereg_search($val, $array) {
/* This removes $val from $array if found - used to remove the dontcrawlarray URLs */
$return = array();
foreach($array as $v) {
if(!eregi($val, $v)) $return[] = $v;
}
return $return;
}
?>
[/high]

Don't forget to include the class.

Cheers,
John

LazyHippy · Jan 20, 2010

Great stuff ashbeats, cheers for sharing. Have some rep!

shadowpwner · Jan 22, 2010

This.
Is.
The.
Greatest
Thread
Ever.
I spent a couple hours, skipping classes to read all of this. I'll post some of my (noobish) stuff here soon.

OsamaBinBBQ · Jan 29, 2010

So, there's been a bit of an evolution to the Google Suggest scrapper that's been floating around. I tried it out, and it was slow as hell. So I fixed it with curl multi-threading and added a few things. The different methods of deviation (keyword following and alphabet incrementing) are now settable options and can be followed to an arbitrary depth. Also, it's been redone in OOP to make it easier to user as part of a larger automation process.

The code was too big to post as part of the thread, so it's attached. Adding a randomized proxy list or using Tor could be a good possible improvement, but I don't know if that's necessary for Google Suggest.

Usage Examples

Code:

gsscrape.php?keyword=cactus&opt-keyword-follow=1
gsscrape.php?keyword=cactus&opt-alpha-increment=1
gsscrape.php?keyword=cactus&opt-keyword-follow=1&depth=2&format=xml

The script provides basic HTML output (default) as well as Excel XML.

Oh ya, this is my first post by the way. Enjoy.

emp · Jan 29, 2010

+ rep for sure.

::emp::

phrench · Jan 29, 2010

OsamaBinBBQ said:
Oh ya, this is my first post by the way.

Way to introduce yourself

Welcome and +rep!

OsamaBinBBQ · Jan 29, 2010

Thanks fellas.

I figured I should put my money where my mouth is and figure out just how much faster my version is compared to the previous ones.

The last version posted by Emp did, what I call keyword following

Code:

keywords-original-keyword-following.php?cactus
    Total: 275.88896894455s, Avg: 27.427700012922s

gsscrape.php?keyword=cactus&opt-keyword-follow=1
    Total: 9.5647752285004s, Avg: 0.95215427875519s

Erect's mod did what I call alpha-incrementing

Code:

keywords-original-alpha-increment.php?cactus
    Total: 663.909467s, Avg: 66.47412959s

gsscrape.php?keyword=cactus&opt-alpha-increment=1
    Total:  24.833792448044s, Avg: 2.2230766713619s

All averages are from 10 iterations with the lowest and highest values dropped.

OsamaBinBBQ · Jan 29, 2010

TylerL said:
Remove duplicate entries using an array of matches. Feel free to fix it up, I'm not exactly a good coder:

Your functions can be simplified with one PHP library function, `array_diff`.

Code:

$final = array_diff(array_unique($items), $matches);

Also, you can use `array_intersect` to do the exact opposite should the urge strike.

yellowarrior · Feb 5, 2010

Agreed, way to introduce yourself!

sunseeker · Feb 7, 2010

This looks to be an awesome script

I was just wondering if it would be possible to combine it with a script that scrapes the google related searches results up to say 4 levels deep, removed duplicates from the suggest results and then allowed you to export to csv.

I found this site, that has a script that scrapes the related searches, but it would be even better if it combined the 2.

I'm not a coder myself, but if you could come up with something that would be awesome.

sunseeker · Feb 8, 2010

OsamaBinBBQ said:
So, there's been a bit of an evolution to the Google Suggest scrapper that's been floating around. I tried it out, and it was slow as hell. So I fixed it with curl multi-threading and added a few things. The different methods of deviation (keyword following and alphabet incrementing) are now settable options and can be followed to an arbitrary depth. Also, it's been redone in OOP to make it easier to user as part of a larger automation process.

The code was too big to post as part of the thread, so it's attached. Adding a randomized proxy list or using Tor could be a good possible improvement, but I don't know if that's necessary for Google Suggest.

Usage Examples

Code:

gsscrape.php?keyword=cactus&opt-keyword-follow=1 gsscrape.php?keyword=cactus&opt-alpha-increment=1 gsscrape.php?keyword=cactus&opt-keyword-follow=1&depth=2&format=xml

The script provides basic HTML output (default) as well as Excel XML.

Oh ya, this is my first post by the way. Enjoy.

I tried using your script but I couldn't get it to work. I'm not a programmer, but shouln't it end with ?>

Once I get it to work it will be awesome. Thanks for sharing

The WF PHP Functions War Chest

Well-known member

read less, get paid

Developer

Lord of the Robot Army

New member

God of Leisure

All we do is win

New member

Member

Member

New member

New member

New member

Attachments

New member

Not yet banned

New member

New member

New member

New member

New member