The WF Python Functions War Chest

Jake232

Sep 4, 2010
So, after seeing quite a few people asking questions recently, and useful Python code snippets being posted (usually by the awesome mattseh), it seems we need somewhere to group them all together.

Rules:
  • All code must be Python
  • Must state whether the code is for Python 2 or Python 3
  • If the code is huge, include some comments or at least an explanation
  • Post the actual code within Code: tags
  • Can include a download / link if the code is huge

That's it, I guess. Post what you have :)
 


Python 2.

Scrapes Google Suggest. Use suggest() on its own, or alphabet_suggest() to get a load more results.

Code:
import web  # mattseh's scraping helper library (not in the stdlib)
import urllib
import json
import string
import re

import gevent
from gevent import monkey
from gevent import pool
monkey.patch_all(thread=False)

def suggest(term,proxy=None):
    # Hit the suggest endpoint and strip the <b> highlighting tags from each phrase
    data = web.grab('http://www.google.com/s?hl=en&xhr=t&q=%s&cp=50&pf=p&sclient=psy&site=&source=hp' % urllib.quote_plus(term),proxy=proxy)
    json_data = json.loads(str(data))
    phrases = [re.sub('<.+?>','',phrase[0]) for phrase in json_data[1]]
    return phrases

def alphabet_suggest(term,proxies=True):
    # Queries '<term> aa' through '<term> zz' in parallel, one greenlet per proxy record
    manager = web.ProxyManager(proxies,delay=3)
    results = []
    work_pool = pool.Pool(len(manager.records))
    jobs = []
    for i in string.lowercase:
        for j in string.lowercase:
            print term+' '+i+j
            jobs.append(work_pool.spawn(suggest,term+' '+i+j,manager))
    work_pool.join()
    for job in jobs:
        if job.value:
            results += job.value
    return results
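
A note for anyone trying to run this: web here is mattseh's personal scraping helper, not a stdlib or PyPI module. A rough stand-in for web.grab, assuming it just fetches a URL (optionally through an 'http://host:port' proxy) and returns the body, might look like this in plain Python 2 (it ignores the ProxyManager rotation/delay niceties):

Code:
import urllib2

def grab(url, proxy=None):
    # Rough stand-in for web.grab: fetch a URL, optionally via a proxy,
    # and return the response body as a string.
    handlers = []
    if proxy:
        handlers.append(urllib2.ProxyHandler({'http': proxy}))
    opener = urllib2.build_opener(*handlers)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener.open(url, timeout=30).read()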
 
^^ As someone that has done this in PHP, I suggest this:

1. Have it pull the normal suggest results (you did)
2. Have it pull kw + a, kw + b, kw + c (all the way to z)
3. Have it pull the normal suggest results for the kw, grab each one, then run each of those suggestions through suggest individually as well (see the sketch below)

Hope that makes sense.
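
Step 3 is basically one level of recursion over suggest(). A minimal sketch (Python 2, reusing the suggest() function from the post above):

Code:
def expand_suggest(term):
    # Pull suggestions for the term, then pull suggestions for each of
    # those suggestions as well (step 3 above), one level deep.
    first_pass = suggest(term)
    results = set(first_pass)
    for phrase in first_pass:
        results |= set(suggest(phrase))
    return results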
 
Python 2.

This is a Google Wonder Wheel scraper. Simply add the starting keyword as shown and select the number of levels deep you want to go.

Each level will be a list within the items list.

And before I get bitched at for sucky coding: I know. I'm a newb at Python, and if you don't like it, don't use it. Although if anybody would like to suggest something I could have done better, go ahead :)

Code:
import urllib
import urllib2
from lxml import objectify

def grab_keywords(items,level = 2):
    if level <= 0:
        return items
    #Append an empty list to the end
    items.append([])
    #Loop through the previous level's items (not the empty list just appended) - this avoids re-processing keywords
    for i in xrange(len(items[-2])):
        raw_kw = items[-2][i]
        try:
            kw = urllib.quote_plus(raw_kw)
            source = urllib2.urlopen('http://google.com/complete/search?output=toolbar&q=%s' % kw).read()
            root = objectify.fromstring(source)
        except Exception as excp:
            #If this occurs, Google has banned you, so just return what we have
            print excp
            return items

        try:
            for e in root.CompleteSuggestion:
                #Add each suggestion to the new list, skipping the keyword itself
                #(compare against the raw keyword, not the URL-quoted one)
                if raw_kw != e.suggestion.attrib['data']:
                    items[-1].append(e.suggestion.attrib['data'])
        except AttributeError:
            #If this happens, the current keyword has no suggestions
            pass

    return grab_keywords(items, level - 1)


items = []


#Add your keyword here
items.append(['cheese'])

#Number of levels deep to go can be passed; defaults to 2
grab_keywords(items,level=3)

print items

PS: If you need proxy support, use mattseh's Proxy Manager, but to be honest, you can use this a shitload without being banned.
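
If you'd rather not pull in ProxyManager, urllib2 can route requests through a proxy on its own. A minimal sketch (Python 2; the proxy address is a placeholder):

Code:
import urllib2

# Route all urllib2 requests through one HTTP proxy
# ('1.2.3.4:8080' is a placeholder - substitute your own)
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://1.2.3.4:8080'}))
urllib2.install_opener(opener)
# From here on, the urllib2.urlopen() call inside grab_keywords() uses the proxy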
 
Very good work, here's my implementation:

Code:
import urllib
import web

def wonderwheel(term,levels=3,min_searches=0,proxy=None,delay=3):
    if proxy:
        proxy = web.ProxyManager(proxy,delay=delay)
    results = set()
    last_level_results = set([term])
    for i in range(1,levels+1):
        level_results = set()
        # Keep every <suggestion> whose sibling <num_queries> count exceeds min_searches
        for page in web.multi_grab(['http://google.com/complete/search?output=toolbar&q=%s' % urllib.quote_plus(term) for term in last_level_results],proxy=proxy):
            level_results |= set(page.xpath('//suggestion[../num_queries/@int > %s]/@data' % min_searches))
        if not level_results:
            break
        last_level_results = level_results
        results |= level_results
    return results

print wonderwheel('hello')
 
That's what alphabet_suggest does, brah: it already covers kw + aa through kw + zz :)
 
Code:
import urllib
import web
def wonderwheel(term,levels=3,min_searches=0,proxy=None,delay=3):
    if proxy:
        proxy = web.ProxyManager(proxy,delay=delay)
    results = set()
    last_level_results = set([term])
    for i in range(1,levels+1):
        level_results = set()
        for page in web.multi_grab(['http://google.com/complete/search?output=toolbar&q=%s' % urllib.quote_plus(term) for term in last_level_results],proxy=proxy):
            level_results |= set(page.xpath('//suggestion[../num_queries/@int > %s]/@data' % min_searches))
        if not level_results:
            break
        last_level_results = level_results
        for result in level_results:
            if result not in results:
                yield result
                results.add(result)

for result in wonderwheel('hello'):
    print result

Made it a generator, so you get your results faster!
 
Python 2

A very fast email link activator.

For when you sign up somewhere and they send you a link to verify your email. This will scan your Hotmail inbox, grab all links, and then visit them all. It uses a regex to avoid wasting time visiting images and other shit that definitely isn't an activation link.

Uses gevent for threading. This could easily be changed to work with Gmail or any other provider that has POP3 access.

Code:
import poplib
import re

import gevent
from gevent import monkey

monkey.patch_all()

import urllib2

def clean(link):
    '''Hotmail has all kinds of shit in here'''
    if ':' in link:
        link = link.replace('X-Verification:','')
        link = link.replace('X-Message-Delivery:','')
        link = link.replace('Content-Type:','')

    return link

def visit_url(url):
    try:
        print url
        urllib2.urlopen(url)
    except Exception as excep:
        pass
    
# Connect to hotmail pop3 server
M = poplib.POP3_SSL('pop3.live.com', 995)

# Account Details
user = "someemail@hotmail.com"
password = "xxxxxx"

# Regex to look for urls with a few tweaks to avoid images etc
links_regex = re.compile(r'(?!\S+?\.(?:png|jpe?g|gif))((?:https?:\/\/|www\.)(?:[\w-]+\.)+(?:[a-zA-Z]{2,6})+(?:\/[\w@$?%.+\/\\\:#=_-]+)?)')
some_string = ""
try:
    M.user(user)
    M.pass_(password)
except:
    print "Invalid credentials"
else:
    print "Successful login"
    numMessages = len(M.list()[1])
    for i in range(numMessages):
        for j in M.retr(i+1)[1]:
            try:
                some_string = some_string + j
            except Exception as excep:
                pass
# Find all links
links = re.findall(links_regex,some_string)

# Clean out the shit
uniq = set(clean(i) for i in links if len(i) >=14)


# Super epic fast shit
jobs = [gevent.spawn(visit_url, url) for url in uniq]

gevent.joinall(jobs)

print 'Finished Activating'
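
As noted above, pointing this at another POP3 provider is mostly a matter of swapping the connection details. For Gmail, for example, it would be something like this (POP access has to be enabled in the Gmail settings first):

Code:
# Gmail's POP3 server; everything below the login stays the same
M = poplib.POP3_SSL('pop.gmail.com', 995)
M.user('someemail@gmail.com')
M.pass_('xxxxxx')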

Oh, and thanks for the sticky, erect.
 
The problem with POP3 is that you can only get messages that are in the inbox, lots of activation emails end up in bulk/trash/junk folders so it's better to use IMAP.

Here's something that I use to get activation links (not a finished script, I don't have time to polish it up right now). I know it works with Yahoo accounts, and it should work with others too, but IMAP implementations can be a bit wonky (as you'll see from the comments). It uses IMAPClient 0.7 from the Python Package Index.

Code:
from imapclient import IMAPClient
import re
from heapq import merge

def get_activation_links(domain, email, password, host, use_ssl=True, port=993):
    """
    Logs in to an IMAP email account and checks emails for account activation links which it returns.
    """
    c = IMAPClient(host, port=port, ssl=use_ssl)
    c.login(email, password)
    folders = ('Inbox', 'Bulk Mail')

    pattern = re.compile(r'(http://.*?%s[^"\s><]+)' % domain)
    all_links = []
    for folder in folders:
        c.select_folder(folder)
        uids = c.search()
        messages = c.fetch(uids, ['BODY.PEEK[TEXT]']) # can't get ['RFC822'] working on yahoo imap server
        for msg in messages.itervalues():
            print '------------------------'
            print msg['BODY[TEXT]']
            print '------------------------'
            matches = pattern.findall(msg['BODY[TEXT]'])
            all_links = list(merge(all_links, matches))

    # TODO: think about deleting emails 
    return [link for link in all_links if ('activate' in link) or ('validation' in link)]
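
Since the script itself has no usage example, here's a hypothetical call for a Yahoo account (imap.mail.yahoo.com is Yahoo's usual IMAP host, but double-check for your account; the credentials and domain are placeholders):

Code:
links = get_activation_links('somesite.com', 'someemail@yahoo.com', 'xxxxxx',
                             'imap.mail.yahoo.com')
for link in links:
    print link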
 
IMAP was the original plan, but it seems Hotmail doesn't support IMAP (shitty, I know). So the POP3 approach above is the best way to connect to Hotmail, I believe.
 
I've always used catchall emails with my own domains and hosting for link spam bots... it makes shit way easier to deal with when you can just change your settings to not spambox anything, and you can do [randomname]@domain.com so the services treat you as a whole new person. Easy peasy.
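
Generating those throwaway addresses is trivial. A minimal sketch (Python 2; domain.com stands in for your own catchall domain):

Code:
import random
import string

def random_address(domain='domain.com', length=10):
    # Build a [randomname]@domain.com address for a catchall domain
    name = ''.join(random.choice(string.ascii_lowercase) for _ in xrange(length))
    return '%s@%s' % (name, domain)

print random_address()  # e.g. qhzkvmwpat@domain.com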
 
Simple script to return the number of eBay search results for a specified keyword, in Python 2.

Code:
#!/usr/bin/python

import urllib2
import re

searchkey = "your keyword here"
stripkey = searchkey.replace(" ","+")
request = urllib2.Request('http://www.ebay.com/sch/i.html?_nkw=' + stripkey)
request.add_header('User-Agent', 'Mozilla Firefox')
response = urllib2.urlopen(request)
for line in response.read().split('\n'):
    # The result count sits inside an element whose class name ends in 'untClass'
    match = re.search(r"untClass'>([0-9]+(?:,[0-9]+)*)", line, re.I)
    if match:
        print match.group(1)

I'm new to coding, so please feel free to improve/fix errors. :bowdown:

P.S. My goal is to integrate this with Sikuli to build an awesome automated Jython-powered scraping/analyzing setup. :rasta: