Dogpile Search Spy - Watch Real Time Searches Being Made



Wondering if this is censored / doctored.

Most "live search" features in search engines are.

- stripping out swearwords
- stripping out "offensive" searches

I fondly remember having access to a real live-search backend. From 6pm on it got really interesting.

The frontend was doctored like you would not believe.

 
uncheck the box that says "omit adult terms"

lol can't believe how many porn searches I saw in like 30 seconds. Just think of all the men sitting there with their dicks in their hands, searching for things like "free black anal" which just scrolled by.
 
Wondering if this is censored / doctored.

you can turn adult terms on/off. some of my favs that just came through:

  • ebony anal
  • illegal young girls
  • pronhub
  • drunk and abused
  • freeanalvideo

conclusion: people are great
 
The real question being:

How can we scrape this?
Anyone?

 
Had someone code a cron-driven Perl script a couple of years ago to scrape Dogpile. At the end of each day it would zip up the log and name it by the day's date. This is the guts of it; it might be of use to someone to get working. If I remember where I archived the rest of the files I'll post them. They were just the archive & start/stop functions, so not vital to the actual scraping.
Filename - xmlscrape.cgi
Code:
#!/usr/bin/perl -X
## perl /home/xxxxxxx/public_html/kwscrape/pack.cgi
# queue a notification e-mail via sendmail
sub sendemail
{
    my ($t,$s,$b) = @_;
    
    if( open SENDMAIL, "|/usr/sbin/sendmail -oi -t -odq" ) {
        $data{from}='xxxxxxxx@gmail.com';
        $data{to}=$t;
        $data{subject}=$s;
        $data{text}=$b;
        
        print SENDMAIL 'From: ' . $data{from} . "\n";
        print SENDMAIL 'To: ' . $data{to} . "\n";
        print SENDMAIL "Content-type: text/plain;charset=windows-1251\n";
        print SENDMAIL 'Subject: ' . $data{subject} . "\n";
        print SENDMAIL "\n";
        print SENDMAIL $data{text};
        close (SENDMAIL);
    };
};


# count consecutive fetch failures; after 10 in a row, send a warning mail
sub got_error{
    $ke=0;
    if (open(E,"<scrape_error.log")){
        $ke=<E>;
        close(E);
    };
    if ($ke==10){
        sendemail('xxxxxxxxx@gmail.com','Fetching Error',"More than 10 fetching errors!");
        $ke=0;
    };
    $ke++;
    open(E,">scrape_error.log");
    print E $ke;
    close(E);
};

use LWP::UserAgent;
use HTTP::Request::Common;

$fwrite="log.txt";

if (open(C,"<stop.flg")){
    close(C);
    unlink("./stop.flg");
    die;
};


if (open(C,"<pack.flg")){
    close(C);
    unlink("./pack.flg");
    system("mv -f $fwrite pack.txt");
    system("./arch.cgi &");
}else{
    sleep(3);
};
#die;
#print "Content-type: text/html; charset=windows-1251\n\n";
#print "Started!";

chdir("/home/xxxxxxx/public_html/kwscrape/");

$ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.0');

$url='http://www.dogpile.com/info.dogpl/searchspy/inc/data.xml?filter=0';

($absurl) = $url =~ m!(http://.+?)/!si;
($otnurl) = $url =~ m!(http://.+)/!si;

# pull the current SearchSpy XML feed
$re = $ua->request(GET "$url");
if (!$re->is_success) {
    print "Error at getting $url!\n";
    got_error();
};
$mainresponse = $re->as_string;

# strip the XML down to a '#'-separated list of the search terms it contains
($cont) = $mainresponse =~ m!(<.*>)!;

$cont =~ s!<.*?>!#!sig;
$cont =~ s!#+!#!sig;
$cont =~ s!&.*?;!!sig;
($cont) = $cont =~ m!#(.*)#!si;

@words = split("#",$cont);

open(F,">>$fwrite");
for($i=0;$i<scalar(@words);$i++){
    print F "$words[$i]\n";
};
close(F);

system("./xmlscrape.cgi &");
 
My favorites so far ....

"i shot myself"

"when your kid wont friend you on facebook"

"big bra fanclub"

"assholes from canada"

"how to hunt and kill terrorists"

"goodluck bros"

"is little a word"
 
Here's a couple I spotted:

Can a woman get pregnant by a horse?

Yea, I've often wondered that...

3d Incest.
and yea, you don't want any of that 2 dimensional incest, that sucks, 3D's the way to go.

The human race - WTF
 
White Sluts Desire to Breed Black
I have small tits
bear ass scratcher
you jizz
easy way of anal
banana boobs

These came up in about a 2-minute span... too funny.
 
infotiger has this too, but the results are not as interesting. Here's a script to scrape it, though:
Code:
#!/usr/bin/env python3
import datetime
import time
import os
import random

log_path_base = '/home/user/infotiger'
outer_left = 'voyeur ***************** -->'
outer_right = '<!-- ************ /voyeur'
cont_left = 'blank">'
cont_right = '</a'
min_wait = 2
wait_factr = 2
file_searches = 0
directory_files = 0
approx_searches_per_file = 30000

def log_path():
    return log_path_base + '/' + str(datetime.date.today()) + '.' + str(directory_files)

def should_continue():
    return True

def random_wait():
    time.sleep(min_wait + (wait_factr * random.random()))

        
while should_continue():

    # time for a new file
    if file_searches > approx_searches_per_file:
        directory_files += 1
        file_searches = 0

    random_wait()
    f = os.popen("wget -qO - --no-cache --no-cookies header='Host: [URL="http://www.infotiger.com/"]Infotiger search engine homepage[/URL]' --header='User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042513 Ubuntu/8.04 (hardy) Firefox/3.0.10' --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' --header='Accept-Language: en-us,en;q=0.5' --header='Accept-Encoding: gzip,deflate' --header='Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' --header='Keep-Alive: 300' --header='Connection: keep-alive' --header='Content-Type: text/html; charset=utf-8' --header='Referer: [URL="http://en.wikipedia.org/wiki/Search_engine"]Web search engine - Wikipedia, the free encyclopedia[/URL]' --header='Pragma: no-cache' --header='Cache-Control: no-cache' http://www.infotiger.com/voyeur.html?filter=no")
    raw = f.read()
    f.close()

    # isolate voyeur section
    raw = raw.partition(outer_left)[2].partition(outer_right)[0]
    # create strings with the wanted content at the beginning
    left_chopped = [raw.partition(cont_left)[2]]
    n = 1
    while left_chopped[n-1] != "":
        left_chopped.append(left_chopped[n-1].partition(cont_left)[2])
        n += 1

    # chop off the extraneous text from the end
    cont = [left_chopped[0].partition(cont_right)[0]]
    n = 1
    lgth = len(left_chopped)
    while n < lgth:
        cont.append(left_chopped[n].partition(cont_right)[0])
        n += 1
    
    file_searches += len(cont) - 1
    output = '\n'.join(cont)
    f = open(log_path(), 'a')
    f.write(output)
    f.close()