The WF Python Functions War Chest

Simple script to return # of eBay search results for a specified keyword in Python 2.

Code:
#!/usr/bin/python

import urllib2
import re

searchkey = "your keyword here"
stripkey = searchkey.replace(" ", "+")
request = urllib2.Request('http://www.ebay.com/sch/i.html?_nkw=' + stripkey)
request.add_header('User-Agent', 'Mozilla Firefox')  # header name needs the dash
response = urllib2.urlopen(request)
for line in response.read().split('\n'):
    # capture the comma-separated result count that follows the "...untClass'>" marker
    match = re.search(r"untClass'>([0-9]+(?:,[0-9]+)*)", line, re.I)
    if match:
        print match.group(1)
I'm new to coding so please feel free to improve/fix errors. :bowdown:

P.S. My goal is to integrate this with Sikuli to build an awesome automated jython-powered scraping/analyzing setup. :rasta:


Good first forum post :) When you're inserting a variable into a URL, it's best to run it through urllib.quote_plus first. As for the regex, I'm not sure why you're looping through each line; one search over the whole response is simpler (re.S helps if your pattern ever needs '.' to span newlines).
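
Something like this (just a sketch, reusing your "untClass" marker):

Code:
import urllib
import urllib2
import re

searchkey = "your keyword here"
url = 'http://www.ebay.com/sch/i.html?_nkw=' + urllib.quote_plus(searchkey)
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla Firefox')
html = urllib2.urlopen(request).read()

# one search over the whole document instead of looping line by line
match = re.search(r"untClass'>([0-9]+(?:,[0-9]+)*)", html, re.I)
if match:
    print match.group(1)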

XPaths are great for this sort of thing, more resilient to the page markup changing slightly, as well as quicker to write once you're proficient.

Here's my one liner:

Code:
import web
import urllib

keyword = 'potato peeler'
print web.grab('http://www.ebay.com/sch/i.html?_nkw=%s' % urllib.quote_plus(keyword)).single_xpath('//span[@class="countClass"]/text()').replace(',','')

That still uses urllib2 under the hood. To run XPaths over any random HTML (such as something grabbed via Sikuli):

Code:
from lxml import etree
doc = etree.HTML(htmldata)
print doc.xpath('//somexpath')
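
For example, plugging in the eBay page from the start of the thread (using the "countClass" name from my one-liner above, which may or may not match eBay's current markup):

Code:
import urllib2
from lxml import etree

htmldata = urllib2.urlopen('http://www.ebay.com/sch/i.html?_nkw=potato+peeler').read()
doc = etree.HTML(htmldata)
# same count as before, this time via lxml directly
print doc.xpath('//span[@class="countClass"]/text()')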
 


^^

Want to scrape a load of those keywords at once?

Code:
import web
import urllib

keywords = ['potato peeler','fire hydrant','xbox 360','acai berry']
for page in web.multi_grab(['http://www.ebay.com/sch/i.html?_nkw=%s' % urllib.quote_plus(keyword) for keyword in keywords]):
    print page.single_xpath('//h1[@class="keywordClass"]/text()') # keyword heading
    print page.single_xpath('//span[@class="countClass"]/text()').replace(',','') #results count

The eBay counts help me a lot with something I'm working on, thanks for the idea :)
 
mattseh, I was wondering if you've had any experience with HTML parsers other than lxml.

The reason I ask is that lxml is C under the hood with a Python wrapper. This means that when you use it asynchronously (such as with gevent), it's going to block Python until the lxml call finishes, because gevent can't cooperatively switch inside C code.

Now, I know lxml is the fastest library as standard, but I was just wondering if a pure-Python parser under gevent would end up faster overall, due to the blocking issue?
 
I really don't think it matters, as lxml takes so little time to process a document compared to network latency. When I run my scrapers full blast, I look at network utilisation and it's always a good percentage of the max, so I'm getting good throughput.
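
If you want to sanity-check that on your own pages, something rough like this shows the gap (just wall-clock timing, nothing scientific):

Code:
import time
import urllib2
from lxml import etree

start = time.time()
html = urllib2.urlopen('http://www.ebay.com/sch/i.html?_nkw=potato+peeler').read()
fetch_time = time.time() - start

start = time.time()
etree.HTML(html)  # parse only, no xpath
parse_time = time.time() - start

print 'fetch: %.3fs, parse: %.3fs' % (fetch_time, parse_time)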
 
I didn't think it'd matter all that much, seeing as most time is spent waiting for the network operations as you say. I'll stick with lxml then, it's a pretty neat little library once you get the hang of it :)
 
I also don't think it matters. I used Python for my lookbeyondresumes.com project, and lxml takes very little time to process a document. I also used NLTK there and it calculated values quickly enough for what I needed.
 
Wrapper for PyCurl. Doesn't do much, but I find it very useful: it returns the header, the body, and a BeautifulSoup object; it can fetch a binary file (image, etc.) and return the path to the retrieved file; and it makes setting cURL options less verbose.

Code:
import pycurl, urllib, StringIO
from BeautifulSoup import BeautifulSoup

class CurlWrapper(object):
    '''
    Convenience wrapper for pycurl.

    Set cURL handler options from a list of (option, value) tuples.
    get() and post() return a tuple of (header, body, BeautifulSoup).
    binary() returns (header, filename).
    '''
    def __init__(self):
        useragent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:5.0) Gecko/20100101 Firefox/5.0'
        self.header = StringIO.StringIO()
        self.body = StringIO.StringIO()
        self.c = pycurl.Curl()
        default_options = [(pycurl.FOLLOWLOCATION, 1),
                           (pycurl.TIMEOUT, 10),
                           (pycurl.USERAGENT, useragent),
                           (pycurl.REFERER, 'http://google.com'),
                           (pycurl.HEADERFUNCTION, self.header.write),
                           (pycurl.WRITEFUNCTION, self.body.write)]
        self.options(default_options)

    def clear_buffers(self):
        self.header.truncate(0)
        self.body.truncate(0)

    def options(self, option_list):
        '''
        Requires a list of tuples -> (pycurl.OPTION, value) and calls handle.setopt for each.
        '''
        for option in option_list:
            self.c.setopt(option[0], option[1])
    
    def get(self, url):
        '''
        Requires string of the URL and returns tuple (str(header), str(html), BeautifulSoup(html))
        '''
        self.clear_buffers()
        get_options = [(pycurl.URL, url), (pycurl.HTTPGET, 1)]
        self.options(get_options)
        self.c.perform()
        return (self.header.getvalue(), self.body.getvalue(), BeautifulSoup(self.body.getvalue()))

    def post(self, url, data):
        '''
        Requires string of the URL and dict of postdata and returns tuple (str(header), str(html), BeautifulSoup(html))
        '''
        self.clear_buffers()
        postdata = urllib.urlencode(data)
        post_options = [(pycurl.URL, url), (pycurl.POST, 1), (pycurl.POSTFIELDS, postdata)]
        self.options(post_options)
        self.c.perform()
        return (self.header.getvalue(), self.body.getvalue(), BeautifulSoup(self.body.getvalue()))

    def binary(self, url, path, filename):
        '''
        Requires string of the URL, string of the path, string of filename and returns tuple (str(header), str(path+filename)).
        '''
        self.clear_buffers()
        binary_options = [(pycurl.URL, url), (pycurl.HTTPGET, 1)]
        self.options(binary_options)
        self.c.perform()
        f = open(path + filename, 'wb')
        f.write(self.body.getvalue())
        f.close()
        return (self.header.getvalue(), path + filename)
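
Quick usage sketch, assuming the class above is in scope (the URLs are just placeholders):

Code:
curl = CurlWrapper()

# plain GET: header text, raw html and a BeautifulSoup object come back together
header, html, soup = curl.get('http://example.com/')
print soup.title

# POST with a dict of form fields
header, html, soup = curl.post('http://example.com/login', {'user': 'me', 'pass': 'secret'})

# binary download: writes the body to disk and returns (header, path)
header, path = curl.binary('http://example.com/logo.png', '/tmp/', 'logo.png')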
 
just found this thread, have hard-on
i'd love to contribute but don't have any snippets ready off-hand. Any requests?

Request: I want some type of script that, when it runs, asks for an integer, and then that amount of money gets put in my bank account. I know you can do it, don't let the community down. Also, if you can make it multithreaded, that would be great.

Here's some code to get you started:

Code:
print 'Enter the amount of money requested: '
monies = raw_input()

# missing code for lwbco to write

print "%s has been deposited to your account, have a nice day" % monies
 
I do stuff like putting a key on a remote machine using fabric. Here's a script I use to set up a new Ubuntu server (it handles multiple versions). It takes about 10 minutes to run and installs everything I want as a base on a new server.

I have a directory named sources.list in the same place as the script, containing files 'hardy', 'lucid' and 'maverick' to be used as sources.list. To use it: `fab configure_new_server:distrofilename`


Code:
from fabric.api import *


env.hosts = ['root@11.22.11.22']
env.key_filename = ['/path/to/.ssh/id_rsa.personal']


def configure_new_server(distro):
    setup_key()
    setup_sources_list(distro)
    apt_get_update()
    install_default_packages(distro)
    install_fish()
    copy_config_files()


def setup_key():
    """
    Creates .ssh/authorized_keys and puts my personal key in it
    """
    with cd('~'):
        put('~/.ssh/id_rsa.personal.pub', '')
        run('rm -rf .ssh')
        run('mkdir .ssh')
        run('cat id_rsa.personal.pub > .ssh/authorized_keys')
        run('rm id_rsa.personal.pub')
        run('chmod go-w $HOME $HOME/.ssh')
        run('chmod 600 $HOME/.ssh/authorized_keys')
        run('chown $USER $HOME/.ssh/authorized_keys')


def copy_config_files():
    with cd('~'):
        put('~/.vimrc_server', '')
        run('mv .vimrc_server .vimrc')


def install_fish():
    """
    Installs fish (better than bash)
    Username must be root
    """
    if env.user != 'root':
        abort("Aborting: username must be root")
    run('apt-get -y install fish')
#    run('chsh -s /usr/bin/fish')
    run('rm -rf ~/.config/fish/functions')
    run('mkdir -p ~/.config/fish/functions')
    with cd('/root'):
        put('~/.config/fish/functions/*.fish', '.config/fish/functions')
        put('~/.config/fish/functions/....fish', '.config/fish/functions')
        put('~/.config/fish/functions/.....fish', '.config/fish/functions')
        put('~/.config/fish/functions/......fish', '.config/fish/functions')


def setup_sources_list(distro):
    """
    Backs up existing sources.list then sends new version to remote server
    """
    if env.user != 'root':
        abort("Aborting: username must be root")

    with cd('/etc/apt'):
        run('cp sources.list sources.list.orig')
        put('sources.list/%s' % distro, 'sources.list')
    run('apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 7CC17CD2')
    apt_get_update()
    if distro in ('maverick', 'lucid'):
        run('wget http://www.dotdeb.org/dotdeb.gpg')
        run('cat dotdeb.gpg | sudo apt-key add -')


def apt_get_update():
    run('apt-get update')


def apt_cache_search(name):
    run('apt-cache search %s' % name)


def user_add(username):
    if env.user != 'root':
        abort("Aborting: username must be root")
    run('useradd -m %s' % username)
    run('passwd %s' % username)
    run('rm -rf /home/%s/.config/fish/functions' % username)
    run('mkdir -p /home/%s/.config/fish/functions' % username)
    put('~/.config/fish/functions/*.fish', '/home/%s/.config/fish/functions' % username)
    put('~/.config/fish/functions/....fish', '/home/%s/.config/fish/functions' % username)
    put('~/.config/fish/functions/.....fish', '/home/%s/.config/fish/functions' % username)
    put('~/.config/fish/functions/......fish', '/home/%s/.config/fish/functions' % username)
    run('echo "set -g WORKON_HOME ~/virtualenvs" > /home/%s/.config/fish/config.fish' % username)
    run('chown -R %s /home/%s/.config' % (username, username))
    run('chgrp -R %s /home/%s/.config' % (username, username))
    put('~/.ssh/id_rsa.personal.pub', '/home/%s' % username)
    with cd('/home/%s' % username):
        run('rm -rf .ssh')
        run('mkdir .ssh')
        run('cat id_rsa.personal.pub > .ssh/authorized_keys')
        run('rm id_rsa.personal.pub')
        run('chmod go-w /home/%s /home/%s/.ssh' % (username, username))
        run('chmod 600 /home/%s/.ssh/authorized_keys' % username)
        run('chown %s /home/%s/.ssh/authorized_keys /home/%s /home/%s/.ssh' % (username, username, username, username))
        run('chgrp %s /home/%s/.ssh/authorized_keys /home/%s /home/%s/.ssh' % (username, username, username, username))


def shell():
    open_shell()


def install_default_packages(distro='maverick'):
    if env.user != 'root':
        abort("Aborting: username must be root")
    run('apt-get -y install language-pack-en wget python cron vim man-db less build-essential zlib1g-dev libreadline5-dev libncurses5-dev python-dev libapache2-mod-wsgi apache2-mpm-worker libpq-dev')
    run('mandb --create')
    if distro in ('maverick', 'lucid'):
        run('wget http://pypi.python.org/packages/2.6/s/setuptools/setuptools-0.6c11-py2.6.egg; sh setuptools-0.6c11-py2.6.egg')
        run('apt-get -y install nginx')
    elif distro == 'hardy':
        run('wget http://pypi.python.org/packages/2.5/s/setuptools/setuptools-0.6c11-py2.5.egg; sh setuptools-0.6c11-py2.5.egg')
        with cd('/tmp'):
            run('wget http://www.python.org/ftp/python/2.6.6/Python-2.6.6.tgz')
            run('tar xzf Python-2.6.6.tgz')
    run('rm setuptools-*')
    run('easy_install pip')
    run('pip install virtualenv')
    run('pip install mercurial')
 
jamescraigmtts is a bot trying to boost post count.

SSH key - out of habit of typing ssh servername (set in ~/.ssh/config)
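
If you haven't set that up, the entry looks something like this (the host alias, IP and key path are placeholders):

Code:
# ~/.ssh/config
Host servername
    HostName 11.22.11.22
    User root
    IdentityFile ~/.ssh/id_rsa.personal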
 
Spinner from Mattseh's blog but with a minor correction(?) which prevents it from locking up/looping indefinitely on my computer

Code:
import random

def spin(text, brackets='{}'):
    """use "{choice1|choice2}" """
    while text.find(brackets[0]) != -1:
        counter = 0
        open_marker = 0
        for char in text[:]:
            if char == brackets[0]:
                open_marker = counter
            if char == brackets[1]:
                part = text[open_marker+1:counter]
                words = part.split('|')
                word = random.choice(words)
                text = text.replace(brackets[0]+part+brackets[1], word, 1)
                break  # text just changed, so restart the scan from the top
            counter += 1
    return text
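
For example (output differs each run):

Code:
print spin('{Hello|Hi|Hey} WF, this handles {nested {spins|spinning}|flat spins} too.')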

Py 2; maybe 3 compatible.
 
google scraper using mattseh's awesome web.py

Code:
# -*- coding: utf-8 -*-
from web import *
import operator

query = 'python'
urls = ['http://www.google.com/search?q=%s&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=%d' % (query, i*100) for i in range(0, 9)]
links = [grab(url).xpath("//h3/a/@href") for url in urls]
result = [ link for link in reduce(operator.add, links) if not link.startswith('http://www.google.com')]

print result
print len(result)
edit: who can make this a one-liner?
 
I'm not a Python guy, but I believe it would be:

print [link for link in reduce(operator.add, [grab(url).xpath("//h3/a/@href") for url in ['http://www.google.com/search?q=%s&ie=utf-8&as_qdr=all&aq=t&rls=org:mozilla:us:official&client=firefox&num=%d' % ('python', i*100) for i in range(0, 9)]]) if not link.startswith('http://www.google.com')]
 
Use multi_grab, get your parallelism game proper ;)

Code:
links = []
for page in web.multi_grab(urls):
    links += page.xpath('//h3/a/@href')

You are missing some of the obfuscation google uses to fuck with you btw ;)
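
A sketch of the kind of clean-up he means, assuming the /url?q= wrappers Google was using at the time (clean_google_link is just a name I made up):

Code:
import urlparse

def clean_google_link(link):
    # google wraps organic results as /url?q=<real url>&sa=...
    if link.startswith('/url?'):
        qs = urlparse.parse_qs(urlparse.urlparse(link).query)
        return qs.get('q', [link])[0]
    return link

links = [clean_google_link(link) for link in links]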