The WF Python Functions War Chest

Need to bulk-create .htaccess redirects? This pulls in a CSV file of your old and new paths (don't include the domain) and formats the redirects correctly based on whether the old path contains a dynamic URL.

Code:
import csv
import re

def main():

    # Gather the input/output paths and read the CSV (skipping the header row)
    input_path = raw_input("Input CSV File Path: ")
    output_path = open(raw_input("Output Text File Path: "), "wt")
    url_list = list(csv.reader(open(input_path, 'r')))
    del url_list[0]

    # Walk through the list and write to the file
    for line in url_list:
        removed_leading_slash = line[0].lstrip('/')
        separated_dynamic_variable_part = re.search('(?<=\?)([^\s]+)', removed_leading_slash)
        separated_dynamic_variable_before = re.search('([^\s]+)(?=\?)', removed_leading_slash)

        # Determine if a URL is dynamic and apply appropriate formatting
        if separated_dynamic_variable_part and separated_dynamic_variable_before:
            string = 'RewriteCond %{{QUERY_STRING}} ^{0}$ [NC]\nRewriteRule ^{1}$ {2}? [R=301,L]\n'
            output_path.write(string.format(separated_dynamic_variable_part.group(), separated_dynamic_variable_before.group(), line[1]))
        else:
            output_path.write("RewriteRule ^{0}$ {1} [R=301,L]\n".format(removed_leading_slash, line[1]))
    
    # Close the file
    output_path.close()

if __name__ == '__main__':
    main()
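For a quick sanity check of the two output shapes, here's a condensed, self-contained Python 3 version of the same formatting logic (the function name and sample paths are mine, not part of the script above):

```python
import re

def redirect_line(old_path, new_url):
    # mirror the script's logic: dynamic URLs get a QUERY_STRING
    # condition, plain paths get a straight RewriteRule
    old = old_path.lstrip('/')
    m = re.match(r'([^\s?]+)\?(\S+)$', old)
    if m:
        path, query = m.groups()
        return ('RewriteCond %%{QUERY_STRING} ^%s$ [NC]\n'
                'RewriteRule ^%s$ %s? [R=301,L]\n') % (query, path, new_url)
    return 'RewriteRule ^%s$ %s [R=301,L]\n' % (old, new_url)

print(redirect_line('/old-page.php?id=7', '/new-page/'))
print(redirect_line('/about.html', '/about/'))
```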
 
Python 2

A very fast email link activator.

When you sign up somewhere and they send you a link to verify your email, this will scan your Hotmail inbox, grab all the links, and then visit them all. It uses regex to avoid wasting time visiting images and other stuff that definitely isn't an activation link.

Uses gevent for threading. This could easily be changed to work with Gmail or any other provider that has POP3 access.
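The code didn't make it into this post, so here's a minimal Python 3 sketch of the approach described above. The POP3 host and credentials are placeholders, and plain urllib (in the commented usage) stands in for the gevent-driven visiting step:

```python
import poplib
import re

LINK_RE = re.compile(r'https?://\S+')
SKIP_RE = re.compile(r'\.(?:png|jpe?g|gif|css|js)(?:\?|$)', re.I)

def candidate_links(raw_email):
    # pull every link out of a raw message, dropping images and other
    # static files that definitely aren't activation links
    return [u for u in LINK_RE.findall(raw_email) if not SKIP_RE.search(u)]

def inbox_messages(host, user, password):
    # fetch every message body from the inbox over POP3-over-SSL
    conn = poplib.POP3_SSL(host)
    conn.user(user)
    conn.pass_(password)
    count = len(conn.list()[1])
    bodies = [b'\n'.join(conn.retr(i + 1)[1]).decode('utf-8', 'replace')
              for i in range(count)]
    conn.quit()
    return bodies

# Usage (placeholder credentials; the original visits the links
# concurrently with a gevent pool instead of one at a time):
#   import urllib.request
#   for body in inbox_messages('pop3.example.com', 'you@example.com', 'secret'):
#       for link in candidate_links(body):
#           urllib.request.urlopen(link)
```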
 
I've never used gevent before. Do you find it to be much better than native python multiprocess, when using it for networking tasks? I had some blockage issues with the multiprocess module last time I used it for a lot of simultaneous networking.

Much quicker. gevent is great for a quick and dirty solution to speed things up. If speed is your main concern, use Twisted, but be prepared to tear your hair out for hours at a time if you're not used to callback-based programming.
 
Spinner from Mattseh's blog, but with a minor correction which prevents it from locking up / looping indefinitely on my computer:

Code:
import random
def spin(text, brackets='{}'):
    """use "{choice1|choice2}" """
    while text.find(brackets[0]) != -1:
        open_marker = 0
        counter = 0
        for char in text:
            if char == brackets[0]:
                open_marker = counter
            if char == brackets[1]:
                part = text[open_marker + 1:counter]
                words = part.split('|')
                word = random.choice(words)
                text = text.replace(brackets[0] + part + brackets[1], word, 1)
                break
            counter += 1
    return text
Py 2; maybe 3 compatible.

this is python, you don't have to write C

this is how i'd do it
Code:
import random
import re
def spin(text):
    for _ in range(text.count('{')):
        field = re.findall( '{([^{}]*)}', text)[0]
        text = text.replace('{%s}' % field, random.choice(field.split('|')), 1)
    return text
print spin('\n'.join(['Hi Im {Matt|Fred, your {friend|brother{!!|!}}}'] * 10))
 
performs much faster, thanks ;)
 
Extract links from a Mailinator account. Requires python-web; I'd suggest ProxyManager from python-web as well.

Code:
import web
import time

def mailinator(account,server='mailinator.com',proxy=False):
    links = set()
    http = web.http(proxy)
    inbox_page = web.grab('http://www.'+server+'/maildir.jsp?email='+account)
    emails = inbox_page.xpath('//a[contains(@href,"displayemail")]/@href')

    for email_page in web.multi_grab(emails):
        links |= set(email_page.xpath('//div[@id="message"]/p/a/@href'))
    return links

if __name__ == '__main__':
    print mailinator('bob')
 
np, it might be smarter to just generate all permutations and then return N spins that are most distinct from each other, this is an interesting problem, I might take a stab at it again when I have more time.
 
The number of permutations quickly becomes silly. I've had customers give 50k chars of spintax.
 
mattseh, I was wondering if you've had any experience using any other HTML Parsers than lxml.

The reason I ask is that lxml is C under the hood, with a Python wrapper. This means that when using it asynchronously (such as with gevent), it's going to block Python until any lxml calls have finished, because gevent can't cooperatively switch inside C code.

Now, I know lxml is the fastest library as standard, but I was just wondering if a pure-Python library would end up faster when used with gevent, due to the blocking issue?

if, and i repeat *if* this turns out to be an issue (i can't tell you how many times i've seen people try to code around an issue that never became as problematic as they expected, from the python gil to any number of concurrency-related starvation conditions) you can just run lxml in a process pool and synchronize on, say, shm.

-p
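A minimal sketch of that process-pool idea, stdlib only; a plain-regex link extractor stands in for lxml here so the example runs anywhere, but the shape is the same if you swap real parsing in:

```python
import re
from multiprocessing import Pool

HREF_RE = re.compile(r'href="([^"]+)"')

def extract_links(html):
    # the CPU-bound parsing step we want out of the event loop
    # (stand-in for lxml; same idea, pure Python)
    return HREF_RE.findall(html)

def parse_in_pool(pages, workers=4):
    # each page is parsed in a separate process, so C-level (or just
    # heavy) parsing work can't block the cooperative scheduler
    with Pool(workers) as pool:
        return pool.map(extract_links, pages)

# Usage:
#   pages = ['<a href="http://example.com/%d">x</a>' % i for i in range(8)]
#   print(parse_in_pool(pages, workers=2))
```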
 
Much quicker. gevent is great for a quick and dirty solution to speed things up. If speed is your main concern, use Twisted, but be prepared to tear your hair out for hours at a time if you're not used to callback-based programming.

i have to defend twisted here. the bit you need to grok is that twisted is cooperative multithreading with event loop based dispatch (or async io via event loop, depending on your pov). reading the bits of twisted core that do all of this takes less than an hour and leaves hair intact for the most part while showing you the light (as in enlightenment).

there are a few truly over-engineered things in twisted (eg ridiculous authentication framework in web iirc), but you're unlikely to run into them.

-p
 
I see where you're coming from, but as I originally stated, if you just need something thrown up quickly, I'd go with gevent.

Also, Twisted isn't something you can easily retrofit into an existing app; you kind of have to design for it from the ground up, due to its callback nature.

Event-loop based programming just doesn't come naturally to a lot of people, but I agree with you; I feel it is the superior option if speed is your number 1 priority.
 
I see where you're coming from, but as I originally stated, if you just need something thrown up quickly, I'd go with gevent.

didn't disagree with that, although i started learning twisted (and python, for that matter) because i needed to throw up a monitoring system for a large cluster and i'd read the relevant python docs, read and grokked twisted core and had a prototype up in less than a day, full beta v1 in under three. incidentally, reading twisted is a great way to learn python idioms.

Also, Twisted isn't something you can easily retrofit into an existing app; you kind of have to design for it from the ground up, due to its callback nature.

it's a framework, so yeah ;)

Event-loop based programming just doesn't come naturally to a lot of people, but I agree with you; I feel it is the superior option if speed is your number 1 priority.

i agree, primarily for the same reason functional doesn't come as easily as imperative. the trouble is, as has been proven time and time again, no matter which magic bullet type of language/framework/library emerges with the purpose of allowing people who don't understand concurrency to write concurrent code, it all ends in tears ;) if you don't understand at least cooperative multithreading (it's easy since you don't have the same synchronization issues you do with non-coop), which is isomorphic to event loops, you have no business mucking about with concurrency. it's like being an amateur bomb technician - you can't be one for long as you either learn or turn into a pollock painting impersonator forever ;)

with that said, gevent is great for taking straight-line code and converting it quickly due to the monkey-patching stuff. it's so sexy i almost wish i had something i could use it for ;)

-p
 
Probably not many people will be able to use this, but it's something. Took me a while to work out the right algorithm.

So I'm about to launch a private proxy provider. Part of that requires allocating from the available proxies when a user signs up. Given a list of unfilled proxies, I want to give the user the most diverse set possible. That is, I want to avoid handing out sequential IPs.

This is a utility function that, given a list ("sequence") (which in my case is a sorted list of IPs) and a number to allocate ("num_to_allocate"), returns a new list of size "num_to_allocate" containing the most diverse items from "sequence".

It's only a couple lines, but the algorithm was the pain in the ass part.

Code:
# util.py

import math

def allocateFromSequence(sequence, num_to_allocate):
    '''Picks num_to_allocate items from a sequence such that they are as
    widely spaced as possible. Returns the list of allocated items.

    The intended purpose is IP diversity, but I found it easier to restate
    the problem like this:

    Problem:
        Given a line of M boxes and a pile of N stones, fill the boxes with
        stones so that the stones are spaced as far apart as possible.

        Analogy: The stones are the allocated IPs, the boxes are the
        possible IP addresses.

    Algorithm:
        1) Compute step = CEIL(M/N)
        2) First pass:
            a) Place a stone in box 0
            b) Place a stone in prev box + step. Now, N = N - 1
            c) Go to (b)
        3) Second pass, with remaining N stones:
            a) For each stone:
                i) Place stone in the highest numbered empty box
    '''

    # Make things easier. M = number of boxes, L = the boxes, N = the stones
    M = len(sequence)
    L = sequence
    N = num_to_allocate

    # Calculate the step and convert to int
    step = int(math.ceil(float(M) / float(N)))

    # LL collects the items where we place the stones
    LL = []

    # First pass: walk forward in increments of step
    pos = 0
    for i in range(N):
        try:
            LL.append(L[pos])
        except IndexError:
            break
        pos = pos + step
        N = N - 1

    # Second pass: place the remaining N stones from the end backwards
    pos = -1
    while N > 0:
        if L[pos] not in LL:
            LL.append(L[pos])
            N = N - 1
        pos = pos - 1

    return LL
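For comparison, a hedged sketch: picking evenly spaced indices gives a similar spread in a couple of lines. Its output isn't always identical to the two-pass version (there's no backwards fill pass), so treat it as an alternative, not a drop-in replacement:

```python
def allocate_even(sequence, n):
    # pick n items at evenly spaced indices across the sequence;
    # same goal as the two-pass stone placement: maximise spacing
    m = len(sequence)
    return [sequence[(i * m) // n] for i in range(n)]

ips = ['10.0.0.%d' % i for i in range(1, 17)]
print(allocate_even(ips, 4))  # ['10.0.0.1', '10.0.0.5', '10.0.0.9', '10.0.0.13']
```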
 
I see where you're coming from, but as I originally stated, if you just need something thrown up quickly, I'd go with gevent.

Also, Twisted isn't something you can easily retrofit into an existing app; you kind of have to design for it from the ground up, due to its callback nature.

Event-loop based programming just doesn't come naturally to a lot of people, but I agree with you; I feel it is the superior option if speed is your number 1 priority.

just (partially) changed my mind. mechanize+gevent = win

-p
 
New stuff in python-web, allows you to easily run gevent pools in multiple processes. This is useful if you are hitting 100% cpu before running out of bandwidth, but also simplifies your code, as most stuff is just managed for you.

Define your worker function:

Code:
def grab(out_q, url):
    page = web.grab(url)
    if page:
        out_q.put((url, len(page.external_links())))
This just grabs a page, then returns how many external links there are on it. You could put an if statement in there if you only wanted < 50 external links, etc. out_q always needs to be the first argument; it's where the results go.

You'll need a list of URLs in list.txt for the piece of code which actually calls your function:

Code:
for url, link_count in web.pooler(grab, open('list.txt'), debug=True, pool_size=40):
    print url, link_count
pool_size is total, so on a quad core, each process would get 10 "threads" (greenlets) in this example. Instead of just printing the results, you could easily save to a file.

Instead of an open file, you can pass in any iterable, list, set, tuple, etc.

I have something similar to this code running 4 processes, each 75 threads, doing usually 80-90mbit on a 1GB Linode.

Anything that's CPU intensive should be done in the worker function, not in the main loop, otherwise you are limited to one core again.

Next step might be to simplify a little by allowing the worker function to return instead of write to the out_q.
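For anyone without python-web, the same worker/pool shape can be sketched with the stdlib. ThreadPoolExecutor stands in for web.pooler here, and the worker returns its result instead of writing to an out_q (the names and the fake page fetch below are illustrative, not python-web's API):

```python
from concurrent.futures import ThreadPoolExecutor

def count_links(url):
    # stand-in worker: "fetch" a page and count its links
    # (replace the fake page with a real web.grab-style fetch)
    page = '<a href="%s/1"></a><a href="%s/2"></a>' % (url, url)
    return url, page.count('href')

urls = ['http://a.example', 'http://b.example']
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, n in pool.map(count_links, urls):
        print(url, n)
```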
 
I'm just trying to get modules like lxml to install with easy_install; I can't get it to work on Windows or Linux. Fuck.