Updated some code posted here last year by emp - http://www.wickedfire.com/design-de...rgeted-adword-ads-something-i-whipped-up.html
It spiders the Google results for a keyword and scrapes each result page for Adsense code. You then know which sites get legitimate search traffic and are probably less likely to have fraudulent clicks.
<?PHP // Setting the variables
set_time_limit(180);
// -------------------GooglePrefix------¦-----query--¦suffix¦-counter--¦
$GooglePrefix = "http://www.google.com/search?q=";
$GoogleCountSuffix ="&start=";
$dontcrawlarray = array('wikipedia.org','google.com','amazon.com');
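$query = urlencode(isset($_GET['s']) ? $_GET['s'] : ''); // empty until the form is submitted
$drilldepth = isset($_GET['drilldepth']) ? (int) $_GET['drilldepth'] : NULL;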
?>
<h1>Adsense Spider</h1>
<form action="<?=htmlspecialchars($_SERVER['PHP_SELF'])?>">
<input type="text" name="s" value="<?=htmlspecialchars(urldecode($query))?>">
<select name="drilldepth">
<?PHP
for ($i = 5; $i >= 1; $i--)
echo "<option value='$i' " . ($i == $drilldepth ? 'selected' : '') . "> $i </option>";
?>
</select>
<input type="submit" value="Crawl">
</form>
<?PHP // Loop SERPs for given keyword
if ($query == '')
die("No search term defined");
echo "<hr><h2>" . htmlspecialchars(urldecode($query)) . "</h2>Looping through SERPs";
$res = ''; // accumulated source of all fetched SERPs
for ($loop = 0, $stop = ($drilldepth != NULL ? $drilldepth * 10-10 : 30); $loop <= $stop; $loop = $loop + 10) {
$CompleteUrl = $GooglePrefix.$query.$GoogleCountSuffix.$loop;
$res = andstatus($res.webFetcher($CompleteUrl));
}
echo "<br>";
// Take search results and parse out URLs
echo "Parsing URIs from SERPs";
$resultURLs = do_reg($res, "/h3.class=r.*(http.*)\"/U");
echo " [" . count($resultURLs) . "]<br>";
foreach ($dontcrawlarray as $url)
$resultURLs = array_ereg_search($url,$resultURLs);
// Take URL results and loop through each to find Adsense code
echo "Digging through URIs for Adsense enabled pages";
$matches = array(); // URLs whose source contains Adsense code
for ($i = 0, $stop = count($resultURLs); $i < $stop ; $i++) {
$thisURI = $resultURLs[$i];
$comp = webFetcher($thisURI);
// Look for the Adsense marker in the page source and collect the URL in the $matches array
if (preg_match("/google_ad/", $comp))
$matches[] = andstatus($thisURI);
}
$textdump = "";
echo "<hr><h3>Adsense Enabled Placements:</h3><ol>";
// Loop through matches and format for review
foreach ($matches as $match => $pURI) {
echo "<LI><a href='$pURI' target='_BLANK'>$pURI</a></LI>";
$textdump .= "$pURI\n";
}
// Dump links into a box for easy C&P
echo "</ol><textarea cols=80 rows=5>$textdump</textarea>";
function do_reg($text, $regex) {
preg_match_all($regex, $text, $regxresult, PREG_PATTERN_ORDER);
return $regxresult[1]; // first capture group of every match
}
function webFetcher($url) {
// Fetch and return source from $url
$crawl = curl_init(); // The curl library is initiated, the following lines set the curl variables
curl_setopt ($crawl, CURLOPT_URL, $url); // The URL is set
curl_setopt($crawl, CURLOPT_RETURNTRANSFER, 1); // Tells it to return the results in a variable
$resulting = curl_exec($crawl); // curl is executed and the results stored in $resulting
curl_close($crawl); // closes the curl procedure.
return $resulting; // Returns source
}
function andstatus($data) {
// Function allows a pseudo progress status '....' to be displayed
echo ".";
return $data;
}
function array_ereg_search($val, $array) {
/* This removes $val from $array if found - used to remove the dontcrawlarray URLs */
$return = array();
foreach($array as $v) {
if(stripos($v, $val) === false) $return[] = $v; // eregi() was removed in PHP 7; stripos() is a case-insensitive substring check
}
return $return;
}
?>
Hey guys glad you're liking it.
Anyone have a workaround for timeout issues with PHP? I want to open up the scope of the spider, but running it at a depth of (x pages * y keywords) would really hang this thing.
Suggestions?
set_time_limit(3000);
ini_set("memory_limit","256M");
define("DEBUG", true);
$db_host = 'localhost'; //localhost if db on same server
$db_username = 'root'; //database username
$db_password = 'xxxxx'; //database password
$db_database = 'restaurants'; //database name
$db_table_name = 'restaurants'; //the name of the table in your database
$db_fields = array('RestaurantID', 'Name', 'Address1', 'Address2', 'City', 'County', 'State_full', 'Zip', 'Country', 'PhoneMain', 'URL');
//define id field
$db_id_field = 'RestaurantID';
//!something! is a tag to use when creating the posts... like I said, shitty naming!
$replacement_fields = array('!name!' => $db_fields[1], '!address1!' => $db_fields[2], '!address2!' => $db_fields[3], '!city!' => $db_fields[4], '!county!' => $db_fields[5], '!state!' => $db_fields[6], '!zip!' => $db_fields[7], '!country!' => $db_fields[8], '!phone!' => $db_fields[9], '!url!' => $db_fields[10],);
// can use content and title as subject - add title, title_content, and indv fields - use for adding links
$custom_replace_1 = array('pattern' => "/<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>\s+<br\s+\/>/", 'replacement' => "<br />", 'subject' => 'content');
$site_title = 'Worldwide Hotel Directory';
$site_url = 'http://www.worldwidehoteldirectory.co.uk'; //no trailing slash
$site_description = 'Worldwide Hotel Directory';
$post_title_format = '!name! - !city!, !state!, USA';
$post_title_format = htmlentities($post_title_format, ENT_QUOTES);
$post_content_format = '
Restaurant Name: <b>!name!</b><br />
City: <b>!city!, !state!</b><br />
Country: <b>United States</b><br />
<!--more-->
<h2>!name!</h2>
<h3>Find !name!</h3>
!name!<br />
!address1!, !address2!<br />
!city!<br />
!county! County<br />
!state!<br />
!zip!<br />
United States<br />
<h3>Contact !name!</h3>
Phone: <b>!phone!</b><br />
Website: <b><a href=!url! rel=nofollow>!url!</a></b><br />
<h3>Reviews and opinions of the !name!</h3>
Have you dined at the !name! in !city!, !state!? Please leave a comment and let others know what you think!<br />
<br />
If this is your business and you would like to add or update your information, please contact us: admin@worldwidehoteldirectory.co.uk';
$post_content_format = htmlentities($post_content_format, ENT_QUOTES);
//if all uses num records
//$num_posts = 'all';
$num_posts = 'all';
//posts per file (400 seems good, but use low with loads of cats)
$posts_per_file = 800;
//days to spread posts over 0 to not schedule - scheduling not working yet...?
$num_days = 0;
//include categories in first xml file?
$build_cats = false;
//keys: lvl_top - lvl_2 - lvl_3 - lvl_4 - lvl_5. free_1, free_2...free_5.
$category_fields = array('lvl_top' => $db_fields[8], 'lvl_2' => $db_fields[6], 'lvl_3' => $db_fields[4]);
//$category_fields = array('lvl_top' => '!state!', 'lvl_2' => '!city!');
//define which cat posts will go under
$post_field = $category_fields['lvl_3'];
//define which fields as tags if more just add to array
$tag_fields = array($db_fields[8], $db_fields[6], $db_fields[4]);
//no need to change these unless blog url diff from base or not in english
//$site_pubDate = date(DATE_RFC822);
$site_generator = 'DB-XML-Script'; //make sure thats ok, otherwise http://wordpress.org/?v=2.5
$site_language = 'en';
$site_wxr_version = '1.0';
$site_base_site_url = $site_url;
$site_base_blog_url = $site_url;
$post_link = 'http://test.com/link'; // use post name? DOESN'T MATTER - use anything
$post_creator = 'db-xml-script';
$post_comment_status = 'open'; //leave as open
$post_ping_status = 'open'; //leave as open
$post_status = 'publish'; //use 'publish' to publish or 'future' for future posts - no need, seems to schedule ones with future date
$post_post_parent = '0'; //leave?
$post_menu_order = '0'; //leave?
$post_post_type = 'post'; //leave? 'page' for pages, 'post' for post
function get_data($db_nice_name, $field, $match_field, $value){
//content_master db - index of databases
$master_db_host = 'localhost'; //localhost if db on same server
$master_db_username = 'root'; //database username
$master_db_password = 'xxxxx'; //database password
$master_db_database = 'content_master'; //database name
$master_db_table = 'databases'; //the name of the table in your database
// echo $db_nice_name;
$db = mysql_connect($master_db_host, $master_db_username, $master_db_password) or die("Could not connect to database");
mysql_select_db($master_db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT * FROM `$master_db_table` WHERE nice_name = '$db_nice_name'") or die(mysql_error());
$result = mysql_fetch_array( $query );
$db_host = $result['db_host']; //localhost if db on same server
$db_username = $result['db_user']; //database username
$db_password = $result['db_pass']; //database password
$db_database = $result['db_name']; //database name
$db_table_name = $result['db_table']; //the name of the table in your database
if($match_field == 'id'){
$match_field = $result['db_id_field'];
}
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT $field FROM $db_table_name WHERE $match_field = '$value' LIMIT 1") or die("Could not complete database query2");
$result = mysql_fetch_array( $query );
return $result[$field];
}
function title_content($id){
require './config.php';
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT * FROM $db_table_name WHERE $db_id_field='$id'") or die("Could not complete database query");
while($result = mysql_fetch_array( $query )){
foreach ($replacement_fields as $patt => $repl){
$patterns[] = "/$patt/";
$replacements[] = $result[$repl];
}
// foreach ($replacement_fields as $patt => $repl){
// $patterns[] = "/$patt/";
// $replacements[] = $repl;
// }
//ensures they match up!
ksort($patterns);
ksort($replacements);
if(DEBUG == true){
echo '<br />----------------from title_content function----------------<br />';
echo '<br />-------------------------replacement fields----------------------<br />';
var_dump($replacement_fields);
echo '<br />-------------------------patterns----------------------<br />';
var_dump($patterns);
echo '<br />-------------------------replacements----------------------<br />';
var_dump($replacements);
echo '<br /><br />';
}
$title_content[0] = preg_replace($patterns, $replacements, $post_title_format);
$title_content[1] = preg_replace($patterns, $replacements, $post_content_format);
$custom_patterns = array();
$custom_replacements = array();
foreach ($custom_replace_1 as $patt => $value){
if($custom_replace_1['subject'] == 'content'){
$custom_patterns['custom_1'] = $custom_replace_1['pattern'];
$custom_replacements['custom_1'] = $custom_replace_1['replacement'];
//ensures they match up! (not really needed at mo)
ksort($custom_patterns);
ksort($custom_replacements);
$title_content[1] = preg_replace($custom_patterns, $custom_replacements, $title_content[1]);
}
}
echo $title_content[1];
return $title_content;
}
}
function blog_info(){
require './config.php';
$_xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n";
$_xml .="\r\n";
$_xml .="<!-- comment -->\r\n";
$_xml .="\r\n";
$_xml .="<rss version=\"2.0\"\r\n";
$_xml .=" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\r\n";
$_xml .=" xmlns:wfw=\"http://wellformedweb.org/CommentAPI/\"\r\n";
$_xml .=" xmlns:dc=\"http://purl.org/dc/elements/1.1/\"\r\n";
$_xml .=" xmlns:wp=\"http://wordpress.org/export/1.0/\"\r\n";
$_xml .=">\r\n";
$_xml .="\r\n";
$_xml .="<channel>\r\n";
$_xml .=" <title>" . $site_title . "</title>\r\n";
$_xml .=" <link>" . $site_url . "</link>\r\n";
$_xml .=" <description>" . $site_description . "</description>\r\n";
$_xml .=" <pubDate>" . $site_pubDate . "</pubDate>\r\n";
$_xml .=" <generator>" . $site_generator . "</generator>\r\n";
$_xml .=" <language>" . $site_language . "</language>\r\n";
$_xml .=" <wp:wxr_version>" . $site_wxr_version . "</wp:wxr_version>\r\n";
$_xml .=" <wp:base_site_url>" . $site_base_site_url . "</wp:base_site_url>\r\n";
$_xml .=" <wp:base_blog_url>" . $site_base_blog_url . "</wp:base_blog_url>\r\n";
return $_xml;
}
function add_category($cat_name, $cat_parent){
require './config.php';
$cat_name = str_replace('&', 'and', $cat_name);
$cat_name = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_name);
$cat_parent = str_replace('&', 'and', $cat_parent);
$cat_parent = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_parent);
//$cat_name_nice = preg_replace("/[^a-zA-Z0-9\s]/", "", $cat_name);
$cat_name_nice = strtolower(str_replace(' ', '-', $cat_name));
$cat_xml =" <wp:category><wp:category_nicename>" . $cat_name_nice . "</wp:category_nicename><wp:category_parent>" . $cat_parent . "</wp:category_parent><wp:cat_name><![CDATA[" . $cat_name . "]]></wp:cat_name></wp:category>\r\n";
return $cat_xml;
}
function add_tag($tag){
require './config.php';
$tag = str_replace('&', 'and', $tag);
$tag = preg_replace("/[^a-zA-Z0-9\s]/", "", $tag);
$tag_nice = strtolower(str_replace(' ', '-', $tag));
$tag_xml =" <category domain=\"tag\"><![CDATA[" . $tag . "]]></category>\r\n";
$tag_xml .="\r\n";
$tag_xml .=" <category domain=\"tag\" nicename=\"" . $tag_nice . "\"><![CDATA[" . $tag . "]]></category>\r\n";
$tag_xml .="\r\n";
return $tag_xml;
}
function build_categories(){
require './config.php';
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT * FROM $db_table_name") or die("Could not complete database query");
$categories = array();
while($result = mysql_fetch_array( $query )){
$id = $result[$db_id_field]; //id
echo '<br />--- ---- ------ ------ ------- <br />';
echo 'id: ' . $id;
echo '<br />--- ---------- ----------- -- ---- <br />';
foreach ($category_fields as $key => $field_name){
//foreach ($category_fields as $key => $repl){
//echo 'repl: ' . $repl;
//echo '<br />';
$category = $result[$field_name];
//require './config.php';
//$replacement_fields = array(
// '!city!' => ucwords(strtolower(get_data("US Restaurants", "City", "id", "$id"))),
// '!state!' => ucwords(strtolower(get_data("US Zipcodes", "STATE_FULL", "STATE", get_data("US Restaurants", "State", "id", "$id")))),
// );
//$category = $replacement_fields["$repl"];
//echo 'category: ' . $category;
//echo '<br />--- -- -- ------<br/>';
if(!array_key_exists($category,$categories)){
if ($key == 'lvl_top' or $key == 'free_1' or $key == 'free_2' or $key == 'free_3' or $key == 'free_4' or $key == 'free_5'){
$parent_category = "";
}elseif($key == 'lvl_2'){
$parent_category = $result[$category_fields['lvl_top']];
}elseif($key == 'lvl_3'){
$parent_category = $result[$category_fields['lvl_2']];
}elseif($key == 'lvl_4'){
$parent_category = $result[$category_fields['lvl_3']];
}elseif($key == 'lvl_5'){
$parent_category = $result[$category_fields['lvl_4']];
}
$categories[$category] = $parent_category;
$cat_xml .= add_category($category, $parent_category);
echo 'new cat added';
echo '<br />-----------------/oo\----------------------<br/>';
}
}
}
print_r($categories);
// foreach ($categories as $category => $parent_category){
// $cat_xml .= add_category($category, $parent_category);
// }
return $cat_xml;
}
function post_xml($id){
require './config.php';
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT * FROM $db_table_name WHERE $db_id_field = '$id'") or die("Could not complete database query");
while($result = mysql_fetch_array( $query )){
$title_content = title_content("$id");
$post_title = html_entity_decode($title_content[0]);
$post_post_name = strtolower(str_replace(' ', '-', $post_title)); //need to remove non alphanum??
$post_content = html_entity_decode($title_content[1]); //encode??
//the category the post is in
$post_category = $result[$post_field];
//echo $post_category;
$post_category = str_replace('&', 'and', $post_category);
$post_category = preg_replace("/[^a-zA-Z0-9\s]/", "", $post_category);
$post_category_nice = strtolower(str_replace(' ', '-', $post_category)); //clean
$post_date = random_timestamp("$num_days");
$post_date_gmt = gmdate('Y-m-d H:i:s',strtotime($post_date));
$post_pubDate = date(DATE_RFC822,strtotime($post_date)); //also + x for future
$post_xml .="<item>\r\n";
$post_xml .="<title>" . $post_title . "</title>\r\n";
$post_xml .="<link>" . $post_link . "</link>\r\n";
$post_xml .="<pubDate>" . $post_pubDate . "</pubDate>\r\n";
$post_xml .="<dc:creator><![CDATA[" . $post_creator . "]]></dc:creator>\r\n";
$post_xml .="\r\n";
$post_xml .=" <category><![CDATA[" . $post_category . "]]></category>\r\n";
$post_xml .="\r\n";
$post_xml .=" <category domain=\"category\" nicename=\"" . $post_category_nice . "\"><![CDATA[" . $post_category . "]]></category>\r\n";
$post_xml .="\r\n";
//add tags
foreach ($tag_fields as $key => $field_name){
// $replacement_fields = array(
// '!city!' => ucwords(strtolower(get_data("US Restaurants", "City", "id", "$id"))),
// '!state!' => ucwords(strtolower(get_data("US Zipcodes", "STATE_FULL", "STATE", get_data("US Restaurants", "State", "id", "$id")))),
// );
//
// $tag = $replacement_fields["$repl"];
$tag = $result[$field_name];
$post_xml .= add_tag("$tag");
}
$post_xml .="<guid isPermaLink=\"false\"></guid>\r\n";
$post_xml .="<description></description>\r\n";
$post_xml .="<content:encoded><![CDATA[" . $post_content . "]]></content:encoded>\r\n";
$post_xml .="<wp:post_id>" . $post_id . "</wp:post_id>\r\n";
$post_xml .="<wp:post_date>" . $post_date . "</wp:post_date>\r\n";
$post_xml .="<wp:post_date_gmt>" . $post_date_gmt . "</wp:post_date_gmt>\r\n";
$post_xml .="<wp:comment_status>" . $post_comment_status . "</wp:comment_status>\r\n";
$post_xml .="<wp:ping_status>" . $post_ping_status . "</wp:ping_status>\r\n";
$post_xml .="<wp:post_name>" . $post_post_name . "</wp:post_name>\r\n";
$post_xml .="<wp:status>" . $post_status . "</wp:status>\r\n";
$post_xml .="<wp:post_parent>" . $post_post_parent . "</wp:post_parent>\r\n";
$post_xml .="<wp:menu_order>" . $post_menu_order . "</wp:menu_order>\r\n";
$post_xml .="<wp:post_type>" . $post_post_type . "</wp:post_type>\r\n";
//add password & meta? - don't seem to be needed
$post_xml .=" </item>\r\n";
}
return $post_xml;
}
function num_records(){
require './config.php';
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT COUNT(*) FROM $db_table_name") or die("Could not complete database query - count");
$result = mysql_fetch_array( $query );
$count = $result[0];
return $count;
}
//generate random timestamp over x no days
function random_timestamp($no_days = 0){
$post_date = date('Y-m-d H:i:s'); //use to future date? + x http://uk2.php.net/manual/en/function.date.php
//echo $post_date;
//echo '<br /><br/>';
//convert to seconds
$no_seconds = $no_days * 24 * 60 * 60;
//echo 'over how many days (no_days): ' . $no_days . '<br />';
//echo 'in seconds: ' . $no_seconds . '<br /><br />';
$x = rand(0,$no_seconds);
//echo '+ x (' . $x . ')<br />';
$random_time = date('Y-m-d H:i:s',strtotime($post_date) + $x);
//echo 'date: ' . $random_time;
//echo '<br /><br/>';
return $random_time;
}
require_once './functions.php';
require './config.php';
if($num_posts == 'all'){
$num_posts = num_records();
}
echo $num_posts;
echo '<br /><br />';
echo $posts_per_file;
echo '<br /><br />';
$no_of_files = (int) ceil($num_posts / $posts_per_file); //ceil avoids an empty extra file when the count divides evenly
echo $no_of_files;
echo '<br /><br />';
$total_posts = 0;
$file_no = 1;
while($file_no <= $no_of_files){
if($file_no == $no_of_files){
$posts_per_file = $num_posts - $total_posts; //whatever is left over - no of posts for last file
}
echo 'file no: ' . $file_no . '<br />';
echo '-----------------------<br /><br />';
//define filename and open to write to
$file_time = date('Y-m-d_Hi');
$filename = $file_time . '_' . strtolower(str_replace(' ', '-', $site_title)) . '_file' . $file_no . '.xml';
$file = fopen($filename , "w"); //fopen -- Opens file or URL
echo 'include blog info (top part of xml)<br /><br />';
$_xml = blog_info();
if($file_no == 1 && $build_cats == true){
echo 'if file 1 & build_cats=true include categories<br /><br />';
$_xml .= build_categories();
}
$post_id = 0; //increment - doesn't seem to matter
$db = mysql_connect($db_host, $db_username, $db_password) or die("Could not connect to database");
mysql_select_db($db_database, $db) or die(mysql_error());
$query = mysql_query("SELECT * FROM $db_table_name ORDER BY $db_id_field LIMIT $total_posts, $posts_per_file") or die("Could not complete database query");
while($result = mysql_fetch_array( $query )){
$id = $result[$db_id_field];
echo $id . ', ';
$_xml .= post_xml("$id");
$post_id++; //add 1 to id needed? not passed to function at moment
}
$total_posts += $posts_per_file;
echo 'total posts: ' . $total_posts . '<br />';
echo '<br />include bottom part of xml<br /><br />';
echo '-----------------------<br /><br />';
$_xml .="</channel>\r\n";
$_xml .="</rss>\r\n";
fwrite($file, $_xml); //fwrite -- Binary-safe file write
fclose($file); //fclose -- Closes an open file pointer
//echo $_xml;
echo '<a href=' . $filename . '>' . $filename . '</a> created. (right click, save as) <br />';
$file_no++;
}
echo 'Then <a href=' . $site_url . '/wp-admin/admin.php?import=wordpress>Import into WordPress</a>';
great stuff. thanks.
edit: you should be able to use my bit of code for the date thing. Basically, just run a counter on each pass of the while loop when it's put together and multiply that by 4 or 12 or whatever (hours), so the first post is 4 hours from now, the second is 2 * 4 = 8 hours from now, etc.
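Here's roughly what that counter idea could look like. Treat it as a sketch: staggered_timestamp() is a made-up helper, not something from the script above, and the 4-hour spacing and the fixed start date are just example values.

```php
<?php
// Hypothetical helper (not part of the posted script): returns the publish
// date for post number $counter, spaced $hours_apart hours apart.
function staggered_timestamp($counter, $hours_apart = 4, $start = null) {
    // $start is a Unix timestamp; fall back to "now" when none is given
    $start = ($start === null) ? time() : $start;
    return date('Y-m-d H:i:s', $start + $counter * $hours_apart * 3600);
}

// Usage: bump the counter on each pass of the loop that builds the posts
$start = strtotime('2009-01-01 00:00:00'); // fixed start, only for the demo
for ($counter = 1; $counter <= 3; $counter++) {
    echo staggered_timestamp($counter, 4, $start) . "\n";
}
```

In the export script this would presumably stand in for the random_timestamp() call, with the counter passed in from the main while loop.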
You can set the timeout to no limit directly in the code
// Set it to no-limit
set_time_limit(0);
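If running with no limit at all feels risky, another option is to split the (x pages * y keywords) job into small batches and restart the timer for each batch, since every call to set_time_limit() resets the timeout counter. Rough sketch only: build_crawl_queue(), the sample keywords and the batch size of 2 are made up for illustration, and the actual webFetcher() call is left as a comment.

```php
<?php
// Hypothetical helper: builds one SERP URL per (keyword, page) pair
function build_crawl_queue(array $keywords, $pages) {
    $queue = array();
    foreach ($keywords as $keyword) {
        for ($page = 0; $page < $pages; $page++) {
            $queue[] = 'http://www.google.com/search?q=' . urlencode($keyword)
                     . '&start=' . ($page * 10);
        }
    }
    return $queue;
}

$queue = build_crawl_queue(array('blue widgets', 'red widgets'), 3);

// Work through the queue two URLs at a time
foreach (array_chunk($queue, 2) as $batch) {
    set_time_limit(60); // restarts the timeout counter for this batch
    foreach ($batch as $url) {
        // the spider's webFetcher($url) call would go here
        echo $url . "\n";
    }
}
```

That way one slow batch can still time out, but the whole run isn't capped by a single limit.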
err... Dew... mind getting into more detail on your method?
::emp::