The WF PHP Functions War Chest



Updated some code posted here last year by emp - http://www.wickedfire.com/design-de...rgeted-adword-ads-something-i-whipped-up.html

It spiders the Google results for a keyword and scrapes each result page looking for Adsense code. That tells you which sites have legitimate traffic and are probably less likely to have fraudulent clicks.

Your posted edits weren't working with Google SERPs when I tried them, so I experimented a bit. This is my effort to update it and make it more portable.

Great tool, thanks for turning me on to it.



PHP:
<?PHP // Setting the variables 

    set_time_limit(180);

    // -------------------GooglePrefix------¦-----query--¦suffix¦-counter--¦
    $GooglePrefix       = "http://www.google.com/search?q=";
    $GoogleCountSuffix  ="&start=";
    $dontcrawlarray     = array('wikipedia.org','google.com','amazon.com');
    
    $query              = urlencode(isset($_GET['s']) ? $_GET['s'] : '');
    $drilldepth         = isset($_GET['drilldepth']) ? (int) $_GET['drilldepth'] : 0; // 0 = use the default depth
?>
    <h1>Adsense Spider</h1>
    <form action="<?=htmlspecialchars($_SERVER['PHP_SELF'])?>">
        <input type="text" name="s" value="<?=htmlspecialchars(urldecode($query))?>">
        <select name="drilldepth">
            <?PHP
            for ($i = 5; $i >= 1; $i--)
                echo "<option value='$i' " . ($i == $drilldepth ? 'selected' : '') . "> $i </option>";
            ?>
        </select>
        <input type="submit" value="Crawl">
    </form>

<?PHP // Loop SERPs for given keyword
    if ($query == '')
        die("No search term defined");
    echo "<hr><h2>" . htmlspecialchars(urldecode($query)) . "</h2>Looping through SERPs"; 
    
    $res = ''; // collects the HTML of every SERP page
    for ($loop = 0, $stop = ($drilldepth != NULL ? $drilldepth * 10 - 10 : 30); $loop <= $stop; $loop = $loop + 10) {
        $CompleteUrl    = $GooglePrefix.$query.$GoogleCountSuffix.$loop;
        $res           .= andstatus(webFetcher($CompleteUrl)); 
    }
    
    echo "<br>";
    
    // Take search results and parse out URLs
    echo "Parsing URIs from SERPs";
    $resultURLs = do_reg($res, "/h3.class=r.*(http.*)\"/U");                                                          
    echo " [" . count($resultURLs) . "]<br>";
    
    foreach ($dontcrawlarray as $url)
        $resultURLs = array_ereg_search($url,$resultURLs);
       
    // Take URL results and loop through each to find Adsense code
    echo "Digging through URIs for Adsense enabled pages";
    $matches = array(); // stays empty if no page has Adsense, so the foreach below still works
    for ($i = 0, $stop = count($resultURLs); $i < $stop ; $i++) {
        $thisURI = $resultURLs[$i]; 
        $comp = webFetcher($thisURI); 
        // Pages whose source contains "google_ad" carry Adsense code; collect their URLs in $matches
        if (preg_match("/google_ad/", $comp)) {
            $matches[] = andstatus($thisURI);
        }
    }
    
    $textdump = "";
    echo "<hr><h3>Adsense Enabled Placements:</h3><ol>";

    // Loop through matches and format for review
    foreach ($matches as $match => $pURI) {
        echo "<LI><a href='$pURI' target='_BLANK'>$pURI</a></LI>";
        $textdump .= "$pURI\n";
    }
    
    // Dump links into a box for easy C&P
    echo "</ol><textarea cols=80 rows=5>$textdump</textarea>";
    
    function do_reg($text, $regex) {
        // Return every match of the first capture group of $regex found in $text
        preg_match_all($regex, $text, $regxresult, PREG_PATTERN_ORDER);
        return $regxresult[1];
    }
    
    function webFetcher($url) {
        // Fetch and return source from $url
        $crawl = curl_init();                            // Initiate curl; the following lines set its options
        curl_setopt($crawl, CURLOPT_URL, $url);          // The URL is set
        curl_setopt($crawl, CURLOPT_RETURNTRANSFER, 1);  // Return the result as a string instead of printing it
        $resulting = curl_exec($crawl);                  // Run the request; curl_exec returns false on failure
        curl_close($crawl);                              // Close the curl handle
        return ($resulting === false) ? '' : $resulting; // Return source, or an empty string if the fetch failed
    }
    
    function andstatus($data) { 
        // Prints a pseudo progress status '....' and passes $data through unchanged
        echo ".";
        return $data;
    }
        
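    function array_ereg_search($val, $array) {
        /* Returns $array without the entries that contain $val - used to remove the dontcrawlarray URLs */
        $return = array();
        foreach ($array as $v) {
            // eregi() was removed in PHP 7; stripos() is a literal, case-insensitive substring check
            if (stripos($v, $val) === false) $return[] = $v;
        }
        return $return;
    }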
?>
 
Hey guys glad you're liking it.

Anyone have a workaround for timeout issues with php? I want to open up the scope of the spider but running it with a depth of a (x pages * y keywords) would really hang this thing.

Suggestions?
 
With my yellow pages scraper I had it either output records as they were found or print a line such as "page 1 of xx", so that there was always some output to keep it from timing out.
 
You can set the timeout to no limit directly in the code

// Set it to no-limit
set_time_limit(0);

err... Dew... mind getting into more detail on your method?

::emp::
 
Import any db content into WP

Ok, this is something I wrote a while back.. it's messy as fuck, poorly commented, with shitty naming conventions and if I remember correctly the time-delayed posting shit didn't work but if you want to create xml files to import into wp from a database then this should do the trick.

3 files, config, functions, create_file:

config.php:

Code:
<?php
set_time_limit(3000);
ini_set("memory_limit","256M");

if (!defined('DEBUG')) define("DEBUG", true); //config.php gets required from several functions, so only define once

$db_host = 'localhost'; //localhost if db on same server
$db_username = 'root'; //database username
$db_password = 'xxxxx'; //database password
$db_database = 'restaurants'; //database name
$db_table_name = 'restaurants'; //the name of the table in your database

$db_fields = array('RestaurantID', 'Name', 'Address1', 'Address2', 'City', 'County', 'State_full', 'Zip', 'Country', 'PhoneMain', 'URL');

//define id field
$db_id_field = 'RestaurantID';

//!something! is a tag to use when creating the posts... like I said, shitty naming!
$replacement_fields = array('!name!' => $db_fields[1], '!address1!' => $db_fields[2], '!address2!' => $db_fields[3], '!city!' => $db_fields[4], '!county!' => $db_fields[5], '!state!' => $db_fields[6], '!zip!' => $db_fields[7], '!country!' => $db_fields[8], '!phone!' => $db_fields[9], '!url!' => $db_fields[10],);

// can use content and title as subject - add title, title_content, and indv fields - use for adding links
$custom_replace_1 = array('pattern' => "/<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>/", 'replacement' => "<br />", 'subject' => 'content');

$site_title = 'Worldwide Hotel Directory';
$site_url = 'http://www.worldwidehoteldirectory.co.uk'; //no trailing slash
$site_description = 'Worldwide Hotel Directory';

$post_title_format = '!name! - !city!, !state!, USA';
$post_title_format = htmlentities($post_title_format, ENT_QUOTES);

$post_content_format = '
            Restaurant Name: <b>!name!</b><br />
            City: <b>!city!, !state!</b><br />
            Country: <b>United States</b><br />
            <!--more-->
            <h2>!name!</h2>
            <h3>Find !name!</h3>
            !name!<br />
            !address1!, !address2!<br />
            !city!<br />
            !county! County<br />
            !state!<br />
            !zip!<br />
            United States<br />
            <h3>Contact !name!</h3>
            Phone: <b>!phone!</b><br />
            Website: <b><a href=!url! rel=nofollow>!url!</a></b><br />
            <h3>Reviews and opinions of the !name!</h3>
            Have you dined at the !name! in !city!, !state!? Please leave a comment and let others know what you think!<br />
            <br />
            If this is your business and would like to add or update your information, please contact us: admin@worldwidehoteldirectory.co.uk';
$post_content_format = htmlentities($post_content_format, ENT_QUOTES);

//if all uses num records
//$num_posts = 'all';
$num_posts = 'all';
//posts per file (400 seems good, but use low with loads of cats)
$posts_per_file = 800;
//days to spread posts over 0 to not schedule - scheduling not working yet...?
$num_days = 0;

//include categories in first xml file?
$build_cats = false;

//keys: lvl_top - lvl_2 - lvl_3 - lvl_4 - lvl_5. free_1, free_2...free_5. 
$category_fields = array('lvl_top' => $db_fields[8], 'lvl_2' => $db_fields[6], 'lvl_3' => $db_fields[4]);
//$category_fields = array('lvl_top' => '!state!', 'lvl_2' => '!city!');
//define which cat posts will go under
$post_field = $category_fields['lvl_3'];

//define which fields as tags if more just add to array
$tag_fields = array($db_fields[8], $db_fields[6], $db_fields[4]);

//no need to change these unless blog url diff from base or not in english
$site_pubDate = date(DATE_RFC822); //used by blog_info() for <pubDate>
$site_generator = 'DB-XML-Script'; //make sure thats ok, otherwise http://wordpress.org/?v=2.5
$site_language = 'en';
$site_wxr_version = '1.0';
$site_base_site_url = $site_url;
$site_base_blog_url = $site_url;

$post_link = 'http://test.com/link'; // use post name? DOESN'T MATTER - use anything

$post_creator = 'db-xml-script';

$post_comment_status = 'open'; //leave as open
$post_ping_status = 'open'; //leave as open

$post_status = 'publish'; //use 'publish' to publish or 'future' for future posts - no need, seems to schedule ones with future date
$post_post_parent = '0'; //leave?
$post_menu_order = '0'; //leave?
$post_post_type = 'post'; //leave? 'page' for pages, 'post' for post
 
functions.php:

Code:
<?php
function get_data($db_nice_name, $field, $match_field, $value){
    
    //content_master db - index of databases
    $master_db_host = 'localhost'; //localhost if db on same server
    $master_db_username = 'root'; //database username
    $master_db_password = 'root66'; //database password
    $master_db_database = 'content_master'; //database name
    $master_db_table = 'databases'; //the name of the table in your database
    
//    echo $db_nice_name;
    
    //mysql_* was removed in PHP 7, so this uses mysqli; values are escaped before they go into a query
    $db = mysqli_connect($master_db_host, $master_db_username, $master_db_password, $master_db_database) or die("Could not connect to database");
        
    $query = mysqli_query($db, "SELECT * FROM `$master_db_table` WHERE nice_name = '" . mysqli_real_escape_string($db, $db_nice_name) . "'") or die(mysqli_error($db));

    $result = mysqli_fetch_array( $query );
    $db_host = $result['db_host']; //localhost if db on same server
    $db_username = $result['db_user']; //database username
    $db_password = $result['db_pass']; //database password
    $db_database = $result['db_name']; //database name
    $db_table_name = $result['db_table']; //the name of the table in your database
    
    if($match_field == 'id'){
        $match_field = $result['db_id_field'];
    }
        
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    
    $query = mysqli_query($db, "SELECT $field FROM $db_table_name WHERE $match_field = '" . mysqli_real_escape_string($db, $value) . "' LIMIT 1") or die("Could not complete database query2");    
    
    $result = mysqli_fetch_array( $query );
    
    return $result[$field];
    
    
}


function title_content($id){
    require './config.php';

    //mysql_* was removed in PHP 7, so this uses mysqli
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    $query = mysqli_query($db, "SELECT * FROM $db_table_name WHERE $db_id_field='" . mysqli_real_escape_string($db, $id) . "'") or die("Could not complete database query");
    
    while($result = mysqli_fetch_array( $query )){
        foreach ($replacement_fields as $patt => $repl){
            $patterns[] = "/$patt/";
            $replacements[] = $result[$repl];
        }    
        
//        foreach ($replacement_fields as $patt => $repl){
//            $patterns[] = "/$patt/";
//            $replacements[] = $repl;
//        }    
        
        
        //ensures they match up!
        ksort($patterns);
        ksort($replacements);
        
        if(DEBUG == true){
            echo '<br />----------------from title_content function----------------<br />';
            echo '<br />-------------------------replacement fields----------------------<br />';        
            var_dump($replacement_fields);
            echo '<br />-------------------------patterns----------------------<br />';        
            var_dump($patterns);
            echo '<br />-------------------------replacements----------------------<br />';    
            var_dump($replacements);
            echo '<br /><br />';
        }    
            
        $title_content[0] = preg_replace($patterns, $replacements, $post_title_format);
        $title_content[1] = preg_replace($patterns, $replacements, $post_content_format);
        
        $custom_patterns = array();
        $custom_replacements = array();
        
        foreach ($custom_replace_1 as $patt => $value){
            if($custom_replace_1['subject'] == 'content'){
                $custom_patterns['custom_1'] = $custom_replace_1['pattern'];
                $custom_replacements['custom_1'] = $custom_replace_1['replacement'];
                //ensures they match up! (not really needed at mo)
                ksort($custom_patterns);
                ksort($custom_replacements);
                $title_content[1] = preg_replace($custom_patterns, $custom_replacements, $title_content[1]);
            }
        }
        echo $title_content[1];
        
        return $title_content;
    }
}    
function blog_info(){
    require './config.php';
    
    $_xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n";
    $_xml .="\r\n";
    $_xml .="<!-- comment -->\r\n";
    $_xml .="\r\n";
    $_xml .="<rss version=\"2.0\"\r\n";
    $_xml .="    xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\r\n";
    $_xml .="    xmlns:wfw=\"http://wellformedweb.org/CommentAPI/\"\r\n";
    $_xml .="    xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\r\n";
    $_xml .="    xmlns:wp=\"http://wordpress.org/export/1.0/\"\r\n";
    $_xml .=">\r\n";
    $_xml .="\r\n";
    
    $_xml .="<channel>\r\n";
    $_xml .="    <title>" . $site_title . "</title>\r\n";
    $_xml .="    <link>" . $site_url . "</link>\r\n";
    $_xml .="    <description>" . $site_description . "</description>\r\n";
    $_xml .="    <pubDate>" . $site_pubDate . "</pubDate>\r\n";
    $_xml .="    <generator>" . $site_generator . "</generator>\r\n";
    $_xml .="    <language>" . $site_language . "</language>\r\n";
    $_xml .="    <wp:wxr_version>" . $site_wxr_version . "</wp:wxr_version>\r\n";
    $_xml .="    <wp:base_site_url>" . $site_base_site_url . "</wp:base_site_url>\r\n";
    $_xml .="    <wp:base_blog_url>" . $site_base_blog_url . "</wp:base_blog_url>\r\n";
    
    return $_xml;
}

function add_category($cat_name, $cat_parent){
    require './config.php';
    $cat_name = str_replace('&', 'and', $cat_name);
    $cat_name = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_name);
    $cat_parent = str_replace('&', 'and', $cat_parent);
    $cat_parent = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_parent);
    //$cat_name_nice = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_name);
    $cat_name_nice = strtolower(str_replace(' ', '-', $cat_name));
    
    $cat_xml ="    <wp:category><wp:category_nicename>" . $cat_name_nice . "</wp:category_nicename><wp:category_parent>" . $cat_parent . "</wp:category_parent><wp:cat_name><![CDATA[" . $cat_name . "]]></wp:cat_name></wp:category>\r\n";
    
    return $cat_xml;
}

function add_tag($tag){
    require './config.php';
    
    $tag = str_replace('&', 'and', $tag);
    $tag = preg_replace("/[^a-zA-Z0-9\s]/", "", $tag);
    $tag_nice = strtolower(str_replace(' ', '-', $tag));
    
    $tag_xml ="        <category domain=\"tag\"><![CDATA[" . $tag . "]]></category>\r\n";
    $tag_xml .="\r\n";
    $tag_xml .="        <category domain=\"tag\" nicename=\"" . $tag_nice . "\"><![CDATA[" . $tag . "]]></category>\r\n";
    $tag_xml .="\r\n";
    
    return $tag_xml;
}

function build_categories(){
    require './config.php';
    //mysql_* was removed in PHP 7, so this uses mysqli
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    
    $query = mysqli_query($db, "SELECT * FROM $db_table_name") or die("Could not complete database query");
    
    $categories = array();
    $cat_xml = ''; //start empty so the .= further down has something to append to
    
    while($result = mysqli_fetch_array( $query )){
        
        $id = $result[$db_id_field]; //id
        
        echo '<br />--- ---- ------ ------  ------- <br />';
        echo 'id: ' . $id;
        echo '<br />--- ---------- ----------- -- ---- <br />';
            
        foreach ($category_fields as $key => $field_name){
        //foreach ($category_fields as $key => $repl){
            
            //echo 'repl: ' . $repl;
            //echo '<br />';
            $category = $result[$field_name];
            //require './config.php';
            
            //$replacement_fields = array(
//                        '!city!' => ucwords(strtolower(get_data("US Restaurants", "City", "id", "$id"))), 
//                        '!state!' => ucwords(strtolower(get_data("US Zipcodes", "STATE_FULL", "STATE", get_data("US Restaurants", "State", "id", "$id")))),
//                        );
            
            //$category = $replacement_fields["$repl"];
                        
            //echo 'category: ' . $category;
            //echo '<br />--- -- -- ------<br/>';
            
            if(!array_key_exists($category,$categories)){
                if ($key == 'lvl_top' or $key == 'free_1' or $key == 'free_2' or $key == 'free_3' or $key == 'free_4' or $key == 'free_5'){
                    $parent_category = "";
                }elseif($key == 'lvl_2'){
                    $parent_category = $result[$category_fields['lvl_top']];
                }elseif($key == 'lvl_3'){
                    $parent_category = $result[$category_fields['lvl_2']];
                }elseif($key == 'lvl_4'){
                    $parent_category = $result[$category_fields['lvl_3']];
                }elseif($key == 'lvl_5'){
                    $parent_category = $result[$category_fields['lvl_4']];
                }
            
            $categories[$category] = $parent_category;    
            $cat_xml .= add_category($category, $parent_category);
            
            echo 'new cat added';
            echo '<br />-----------------/oo\----------------------<br/>';
            }
        }        
    }        
            
print_r($categories);
//    foreach ($categories as $category => $parent_category){
//        $cat_xml .= add_category($category, $parent_category);            
//    }
    
    return $cat_xml;
}
 
functions.php (continued)

Code:
function post_xml($id){
    require './config.php';
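    //mysql_* was removed in PHP 7, so this uses mysqli
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    
    $query = mysqli_query($db, "SELECT * FROM $db_table_name WHERE $db_id_field = '" . mysqli_real_escape_string($db, $id) . "'") or die("Could not complete database query");
    
    $post_xml = ''; //start empty so the .= below has something to append to
    $post_id = ''; //not passed in at the moment, so <wp:post_id> stays empty
    
    while($result = mysqli_fetch_array( $query )){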
        
        $title_content = title_content("$id");
        $post_title = html_entity_decode($title_content[0]);
        $post_post_name = strtolower(str_replace(' ', '-', $post_title)); //need to remove non alphanum??
        $post_content = html_entity_decode($title_content[1]); //encode??
        
        //the category the post is in 
        $post_category = $result[$post_field]; 
        //echo $post_category;
        $post_category = str_replace('&', 'and', $post_category);
        $post_category = preg_replace("/[^a-zA-Z0-9\s]/", "", $post_category);
        $post_category_nice = strtolower(str_replace(' ', '-', $post_category)); //clean
        
        $post_date = random_timestamp("$num_days");
        $post_date_gmt = gmdate('Y-m-d H:i:s',strtotime($post_date));      
        $post_pubDate = date(DATE_RFC822,strtotime($post_date)); //also + x for future
        
        $post_xml .="<item>\r\n";
        $post_xml .="<title>" . $post_title . "</title>\r\n";
        $post_xml .="<link>" . $post_link . "</link>\r\n";
        $post_xml .="<pubDate>" . $post_pubDate . "</pubDate>\r\n";
        $post_xml .="<dc:creator><![CDATA[" . $post_creator . "]]></dc:creator>\r\n";
        $post_xml .="\r\n";
        $post_xml .="        <category><![CDATA[" . $post_category . "]]></category>\r\n";
        $post_xml .="\r\n";
        $post_xml .="        <category domain=\"category\" nicename=\"" . $post_category_nice . "\"><![CDATA[" . $post_category . "]]></category>\r\n";
        $post_xml .="\r\n"; 
        //add tags
        foreach ($tag_fields as $key => $field_name){
            
//            $replacement_fields = array(
//            '!city!' => ucwords(strtolower(get_data("US Restaurants", "City", "id", "$id"))), 
//            '!state!' => ucwords(strtolower(get_data("US Zipcodes", "STATE_FULL", "STATE", get_data("US Restaurants", "State", "id", "$id")))),
//            );
//            
//            $tag = $replacement_fields["$repl"];
                                    
            $tag = $result[$field_name];
            $post_xml .= add_tag("$tag");
        }
    
        $post_xml .="<guid isPermaLink=\"false\"></guid>\r\n";
        $post_xml .="<description></description>\r\n";
        $post_xml .="<content:encoded><![CDATA[" . $post_content . "]]></content:encoded>\r\n";
        $post_xml .="<wp:post_id>" . $post_id . "</wp:post_id>\r\n";
           $post_xml .="<wp:post_date>" . $post_date . "</wp:post_date>\r\n";
        $post_xml .="<wp:post_date_gmt>" . $post_date_gmt . "</wp:post_date_gmt>\r\n";
        $post_xml .="<wp:comment_status>" . $post_comment_status . "</wp:comment_status>\r\n";
        $post_xml .="<wp:ping_status>" . $post_ping_status . "</wp:ping_status>\r\n";
        $post_xml .="<wp:post_name>" . $post_post_name . "</wp:post_name>\r\n";
        $post_xml .="<wp:status>" . $post_status . "</wp:status>\r\n";
        $post_xml .="<wp:post_parent>" . $post_post_parent . "</wp:post_parent>\r\n";
        $post_xml .="<wp:menu_order>" . $post_menu_order . "</wp:menu_order>\r\n";
        $post_xml .="<wp:post_type>" . $post_post_type . "</wp:post_type>\r\n";
        //add password & meta? - don't seem to be needed
        $post_xml .="    </item>\r\n";
        
    }
    return $post_xml;
}

function num_records(){
    require './config.php';
    //mysql_* was removed in PHP 7, so this uses mysqli
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    
    $query = mysqli_query($db, "SELECT COUNT(*) FROM $db_table_name") or die("Could not complete database query - count");
    $result = mysqli_fetch_array( $query );
    $count = $result[0];
    return $count;
}

//generate random timestamp over x no days
function random_timestamp($no_days = 0){
    
    $post_date = date('Y-m-d H:i:s'); //use to future date? + x - see http://uk2.php.net/manual/en/function.date.php
    //echo $post_date;
    //echo '<br /><br/>';    
    
    //convert to seconds
    $no_seconds = $no_days * 24 * 60 * 60;
    //echo 'over how many days (no_days): ' . $no_days . '<br />';
    //echo 'in seconds: ' . $no_seconds . '<br /><br />';
        
    $x = rand(0,$no_seconds);
    //echo '+ x (' . $x . ')<br />';
    
    $random_time = date('Y-m-d H:i:s',strtotime($post_date) + $x);
    //echo 'date: ' . $random_time;
    //echo '<br /><br/>';
    
    return $random_time;
}
 
great stuff. thanks.

edit: should be able to use my bit of code for the date thing. basically, just run a counter on each loop of the while when it's put together. multiply that by 4 or 12 or whatever (hours...) so, first time it's 4 hours from now, then 2 * 4 = 8 hours from now, etc etc.
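
A rough sketch of that counter idea, in case it helps. The function name `staggered_timestamp` and its parameters are made up for illustration and are not part of the script above; it only shows the "counter times N hours from now" arithmetic.

```php
<?php
// Hypothetical helper (not part of the script above): returns a post date
// that is $counter * $hours_apart hours after $base_time.
function staggered_timestamp($counter, $hours_apart = 4, $base_time = null) {
    if ($base_time === null) {
        $base_time = time(); // default: count from "now"
    }
    return date('Y-m-d H:i:s', $base_time + $counter * $hours_apart * 3600);
}

// Example: three posts, 4 hours apart, the first one 4 hours from now.
for ($counter = 1; $counter <= 3; $counter++) {
    echo staggered_timestamp($counter) . "\n";
}
```

In post_xml() something like this could stand in for random_timestamp(), provided the loop counter gets passed in.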
 
create_file.php

Code:
<?php
require_once './functions.php';
require './config.php';

if($num_posts == 'all'){
    $num_posts = num_records();
}

echo $num_posts;
echo '<br /><br />';
echo $posts_per_file;
echo '<br /><br />';


$no_of_files = (int) ceil($num_posts / $posts_per_file); //ceil avoids an empty extra file when the division is exact

echo $no_of_files;
echo '<br /><br />';


$total_posts = 0;
$file_no = 1;

while($file_no <= $no_of_files){
    
    if($file_no == $no_of_files){
        $posts_per_file = $num_posts - $total_posts; //whatever is left over goes in the last file
    }

    echo 'file no: ' . $file_no . '<br />';
    echo '-----------------------<br /><br />';
    
    //define filename and open to write to
    $file_time = date('Y-m-d_Hi');
    $filename = $file_time . '_' . strtolower(str_replace(' ', '-', $site_title)) . '_file' . $file_no . '.xml';
    $file = fopen($filename , "w"); //fopen -- Opens file or URL
        
    echo 'include blog info (top part of xml)<br /><br />';
    $_xml = blog_info();
    
    if($file_no == 1 && $build_cats == true){
        echo 'if file 1 & build_cats=true include categories<br /><br />';
        $_xml .= build_categories();
    }
    
    $post_id = 0; //increment - doesn't seem to matter
    //mysql_* was removed in PHP 7, so this uses mysqli
    $db = mysqli_connect($db_host, $db_username, $db_password, $db_database) or die("Could not connect to database");
    
    $query = mysqli_query($db, "SELECT * FROM $db_table_name ORDER BY $db_id_field LIMIT $total_posts, $posts_per_file") or die("Could not complete database query");
    
    while($result = mysqli_fetch_array( $query )){
        $id = $result[$db_id_field];
        echo $id . ', ';
        
        $_xml .= post_xml("$id");
    
        $post_id++; //add 1 to id     needed? not passed to function at moment
    }
    
    $total_posts += $posts_per_file;

    
    echo 'total posts: ' . $total_posts . '<br />';

    echo '<br />include bottom part of xml<br /><br />';
    echo '-----------------------<br /><br />';
    
    $_xml .="</channel>\r\n";
    $_xml .="</rss>\r\n";
    
    fwrite($file, $_xml); //fwrite -- Binary-safe file write
    fclose($file); //fclose -- Closes an open file pointer
    
    //echo $_xml;
    echo '<a href=' . $filename . '>' . $filename . '</a> created. (right click, save as) <br />';

    $file_no++;
}

echo 'Then <a href=' . $site_url . '/wp-admin/admin.php?import=wordpress>Import into WordPress</a>';
 
great stuff. thanks.

edit: should be able to use my bit of code for the date thing. basically, just run a counter on each loop of the while when it's put together. multiply that by 4 or 12 or whatever (hours...) so, first time it's 4 hours from now, then 2 * 4 = 8 hours from now, etc etc.

If I remember, the problem wasn't with creating the random dates/times but with wp accepting them when the files were imported. It was a while back now but I think it would schedule them but not post them when the time came around.
 
^^^^ hope that shit makes sense to anyone else. Just looking at bits of it now and I'm struggling to remember what I did! The category stuff gives you 4 lvls of hierarchy, defined in config along with which fields to add as tags.. I think it used another db to get the state name from the abbreviation but would work well with the code posted earlier.
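
For anyone trying to follow the category part, here is a stripped-down sketch of how each level ends up with the level above it as parent. The helper `category_parents` and the sample row are invented for illustration, and it assumes $category_fields is listed top level first, like in the config above.

```php
<?php
// Sample values only; in the real script these come from config.php and a DB row.
$category_fields = array('lvl_top' => 'Country', 'lvl_2' => 'State_full', 'lvl_3' => 'City');
$row = array('Country' => 'United States', 'State_full' => 'Texas', 'City' => 'Austin');

// Hypothetical helper: map each category name to its parent category name.
// Assumes $category_fields is ordered top level first; free_N keys get no parent.
function category_parents($row, $category_fields) {
    $parents = array();
    $previous = '';
    foreach ($category_fields as $key => $field_name) {
        $category = $row[$field_name];
        if (strpos($key, 'free_') === 0) {
            $parents[$category] = '';    // free categories sit at the top level
            continue;
        }
        $parents[$category] = $previous; // parent is the level above
        $previous = $category;
    }
    return $parents;
}

print_r(category_parents($row, $category_fields));
```

Each category => parent pair is what add_category() takes as its two arguments.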

Sorry, a wee bit more than a function but hopefully useful to someone.
 
You can set the timeout to no limit directly in the code

// Set it to no-limit
set_time_limit(0);

err... Dew... mind getting into more detail on your method?

::emp::

The problem I had was timing out due to no interactions between the script and the browser.

The script has two options, one to output to a file and one to output to the screen. Outputting to the screen isn't a problem as the script is always communicating with the browser. When the script outputs to a file there can be quite a long time between when the request is made and the file is generated and a download link is sent to the browser.

The script works by reading one page from yp.com, processing it, writing it to the file and then going on to the next page. To prevent the timeout, it outputs something like "processing page 1 of xx" as it reads each page. That keeps the browser connection active and also lets the user know how far along the script is.
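
Roughly, the per-page output looks like the sketch below. The name `report_progress` and the page count are placeholders, not the actual scraper code, and whether the browser sees each line right away also depends on server-side buffering.

```php
<?php
// Hypothetical helper: print one progress line and push it out to the browser.
function report_progress($page, $total_pages) {
    $line = "Processing page $page of $total_pages<br />\n";
    echo $line;
    // Flush PHP's own output buffer first (if one is active), then the server's.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
    return $line;
}

set_time_limit(0);  // also lift PHP's own execution limit
$total_pages = 3;   // placeholder; the real scraper would know its page count
for ($page = 1; $page <= $total_pages; $page++) {
    // ... fetch one page, process it, write it to the file ...
    report_progress($page, $total_pages);
}
```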

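From what I have read, you can use set_error_handler(error_function) to replace the normal error handler; if your function returns to the script, it will continue where it left off. If you use this you are taking over ALL error handling. One catch: it can't handle fatal errors, and hitting the max execution time is a fatal error, so it won't rescue a timed-out script by itself.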
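
A minimal sketch of that for non-fatal errors such as warnings. The handler name `log_and_continue` and the `$logged_errors` array are made up for the example.

```php
<?php
$logged_errors = array();

// Hypothetical handler: record the error and let the script carry on.
function log_and_continue($errno, $errstr, $errfile, $errline) {
    global $logged_errors;
    $logged_errors[] = "[$errno] $errstr in $errfile on line $errline";
    return true; // true = handled, so PHP skips its built-in handler
}

set_error_handler('log_and_continue');

// Simulate a problem partway through a crawl.
trigger_error('Could not fetch page 3', E_USER_WARNING);

echo "Still running after " . count($logged_errors) . " handled error(s)\n";
```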
 
Great feedback guys. This really was what I needed to tie it all together. Now I'm working out the kinks with tying in the DB and having each module pull various bits out of it.

I have each module working independently, drawing "Target Keywords" from a bucket that both fills from a "multiplier" module as well as seeds that I submit. Once it does its duty with the KWs it'll drop the results (urls/ ad copy/ placements) into other buckets that will be picked up by the modules designed to use each.
 
One of the few things I code that I don't mind giving away...

[high="php"]<?php
/*
* by chatmasta
* word_match($s1, $s2, $min_length = 4, $collapse_whitespace = true)
* compares $s1 and $s2 and returns the longest shared substring of length
* greater than or equal to $min_length. if no shared substring is found, returns false.
* if $collapse_whitespace is true, deletes whitespace before all operations.
*/
function word_match($s1, $s2, $min_length = 4, $collapse_whitespace = true) {
    if ($collapse_whitespace) {
        $s1 = preg_replace('/\s+/', '', $s1);
        $s2 = preg_replace('/\s+/', '', $s2);
    }
    // pick by length (strcmp only compares alphabetically)
    if (strlen($s1) >= strlen($s2)) {
        $longer = $s1;
        $shorter = $s2;
    } else {
        $longer = $s2;
        $shorter = $s1;
    }
    // similar_text() is never smaller than the longest shared substring, so start there
    $max_length = min(similar_text($s1, $s2), strlen($shorter));
    for ($word_length = $max_length; $word_length >= max(1, $min_length); $word_length--) {
        // slide a window of $word_length over every position in the shorter string
        for ($start = 0; $start + $word_length <= strlen($shorter); $start++) {
            $check = substr($shorter, $start, $word_length);
            if (strpos($longer, $check) !== false) {
                return $check;
            }
        }
    }
    return false;
}
$s1 = 'jen aniston';
$s2 = 'aniston center';
echo ($match = word_match($s1, $s2)) ? $match : 'fail';
?>[/high]

Function description in the comments. This is really helpful for keyword research and weeding out shitty keywords.