What a mess. Ones and zeros all over the floor.

Yo 1k post.

With nothing to lose, a non-programmer would like to talk about programming.

Background and setup to a tirade of laughable attempts at future coding.

<coolstorybro>
A few weeks ago I got 2013 and a chunk of 2014 taken care of and decided to dive into programming. Other than overseeing a few things, the majority of the day-to-day is taken care of, leaving me with significant free time. I do not have any commercial intent for the new skillz; I just enjoy it. Call it my crossword puzzle. I’m not a natural. I don’t get a lot of it the first time around, and some of it probably never. I ride the coding short bus but can count to potato in binary and hex, so let’s just call it a hobby. I never plan on earning a living from programming; I just find it fun for some reason.

The only other coding experience worth mentioning was a couple of C flavors a long time ago, with a drop into assembly on occasion. I never did anything with it and definitely never got proficient to any degree. I remember almost nothing, so it doesn’t count. The last few years have all been PHP for the IM projects. I could usually take existing code and hack it together to do what I wanted, but again, I never considered myself particularly good at it.

Deciding on a language this time around was not very efficient. Initially it was between Ruby and Python. Then mat or someone else would start a Go kind of thread and off I would go chasing that possibility for a day or so. In the end it kept coming back to Ruby or Python. After too much time I finally realized what some of the people who didn’t have a religious-type bias toward a particular language were saying: the language itself isn’t all-important, it’s how you think about the challenge and how you solve it. Syntax or libraries may not matter; it is you that matters. A better way to think about the problem is what I’m after. You can spend days reading articles on Ruby vs. Python; I never saw a clear winner. If anyone else is undecided, look carefully at the dates on some of those comparisons. If they are claiming Ruby is slow, that appears to not be the case in later versions, if it ever was an issue. idk

Finally I decided to go to one of the well-written tutorial sites that offered courses in both languages. I put Ruby on one screen and Python on the other. I would flip between languages on the same subject to see how both handled it at the beginner level. After a couple of hours Python seemed to talk to me a little closer than Ruby. I have full intentions of revisiting Ruby at some point; until then, Python it is.

One of the many mistakes I made with PHP came from having almost instant success with it. I never really took the time to figure out why my Frankenstein code worked. It did what I wanted and I kept going. I was using frameworks before I fully understood classes. I used classes before I properly understood procedural code. I don’t think I ever typed more than five lines without a syntax error, but I never got totally stumped. It just worked for what I wanted atm.

This time with Python I was determined not to hack from day one but to get the fundamentals down. I looked at many different books, tutorials, methods and teaching styles. I finally decided to start with Learn Python the Hard Way. I’ve never been in the military, but this is what I picture programming boot camp would be like. I followed the rules and typed every single character as instructed. Not one copy and paste. For me the result was getting up to speed much faster. The extra time spent in the beginning paid off many times over in not having to constantly look up the basics in the reference. Getting the syntax right alone was worth the bruises. In comparison, the hipster style of O'Reilly's Head First Python wouldn’t let me get past the second chapter. To each his own.

I went through many tutorials and books as I started my first project. Again, no commercial intent, just an interest that needed a goal. I started with learning how to scrape and didn’t get very far. Not to trivialize it, but I quickly lost interest in the actual retrieval of the data. What got me was interacting with the website. I liked learning how to click on buttons that didn’t originally exist until a piece of JS/Ajax dynamically created them. Multiple menus hidden until a mouse hover displayed the link, sometimes 2 or 3 tiers deep. Then I found asynchronousity. (I don’t think that is a word but I like saying it.) I got off on multiple workers carrying out my instructions from a single console.

Since I was on this interacting-like-a-human kick I started doing it with headless browsers and watched the server resources drop dramatically. Then I started creating accounts on different websites and doing the opposite of scraping. Instead of hit-and-runs with rotating proxies, the objective was to create long-term accounts where each had its own profile: login ID, password, IP, user agent. Interacting with these websites like a human, not loading 20 pages a second. Each worker traverses hundreds of pages a day, where the options to go next may be dynamically created on the fly depending on what choices you made on the current page.
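
Roughly what one of these worker profiles looks like in practice. This is a stripped-down sketch only: Selenium with headless Chrome is assumed purely for illustration, and the proxy, UA string, and login selectors are made-up placeholders.
Python:
# A stripped-down sketch of one worker with its own profile.
# Selenium + headless Chrome are assumptions for illustration; the proxy,
# user agent, and form selectors are placeholders, not real values.
from selenium import webdriver
from selenium.webdriver.common.by import By

profile = {
    "login_id":   "worker_017",
    "password":   "not-a-real-password",
    "proxy":      "203.0.113.17:3128",   # each worker keeps its own IP
    "user_agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
}

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--proxy-server=http://%s" % profile["proxy"])
options.add_argument("--user-agent=%s" % profile["user_agent"])

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/login")
driver.find_element(By.NAME, "username").send_keys(profile["login_id"])
driver.find_element(By.NAME, "password").send_keys(profile["password"])
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
driver.quit()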

Some of the websites I attempted to create long-term accounts with specifically prohibit a user having multiple accounts in their TOS. Which is the reason I chose them. This is what interests me atm. The game is they try to detect me and I try to see how long I can go perceived as a human. In the beginning all workers got banned. I got better at blending in as a human with better control of the actions of each worker. Better use of different IPs and UAs so as not to link the accounts. With each improvement came greater worker longevity. This is where I’m at now. There is a lot more to learn.
</coolstorybro>

In future posts in this thread I would like to talk about where to go from here. Even if I end up talking to myself, typing this out is helpful. Any advice would be appreciated and I will be glad to answer anything that I can on a novice level. A few topics off the top of my head, in no particular order:

1) What resources or methods would you recommend to become a better programmer? I eventually get the job done and am aware there may be more than one way to do something right. Any suggested links or guidelines that would help me become more proper? When you read someone else’s code, what irritates you that I can avoid?
2) Refactoring code for more server efficiency. Currently, the way I use servers, I red-line the CPU and then a distant second is RAM. Storage and bandwidth never come into play. I know this is too vague, but at some point I would like to have a conversation on threading (aware of the GIL) / event-based best practices for better use of server resources. (A rough threading sketch follows this list.)
3) Beyond the login credentials and proxy/UA there are many other ways a website can gather information about you. Dealing with this is high on my list.
4) Proxies. Jesus Christ, what is wrong with these people? Most that I’ve dealt with are in serious need of a realignment. How hard can creating a proxy server be? I want to try this. Never for resale, internal use only. I’m currently buying a couple hundred private proxies each month.
5) EC2 and other cloud based solutions. I’ve got a few Linodes and couldn’t be happier but would like to experiment. Get a baseline on ease of scaling vs. cost.
6) OCR and pre-filtering before it gets to the OCR engine.
7) Automated unit / suite testing, continuous integration, Git kind of system flow.
8) Transitioning from 2.7.x to 3.x
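
On item 2, here is a bare-bones sketch of the thread-pool approach for I/O-bound work. Python 3 syntax, and the URL list is a made-up placeholder; on 2.7 the futures backport and urllib2 would be the equivalents.
Python:
# Threads still help for I/O-bound work despite the GIL: the GIL is released
# while a thread waits on the network, so other workers keep running.
# CPU-bound work is a different story and usually needs multiprocessing.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://example.com/page/%d" % i for i in range(50)]  # placeholder URLs

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)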

Obviously I’m at the shallow end of the coding pool, and if I ever appear to know much about the subject, it was unintentional. Any replies, flames, or ignoring is all good.

Cheers,

Up
 


1) What resources or methods would you recommend to become a better programmer? I eventually get the job done and am aware there may be more than one way to do something right. Any suggested links or guidelines that would help me become more proper? When you read someone else’s code, what irritates you that I can avoid?

Actively contribute to an open source project or get a day job as a programmer (on a good team at a good company). Once you interact daily with people who are better than you, you will learn fast.

2) Refactoring code for more server efficiency. Currently, the way I use servers, I red-line the CPU and then a distant second is RAM. Storage and bandwidth never come into play. I know this is too vague, but at some point I would like to have a conversation on threading (aware of the GIL) / event-based best practices for better use of server resources.
What in the world are you doing? In 99% of cases for IM-related programming, I/O should be your bottleneck :p

3) Beyond the login credentials and proxy/UA there are many other ways a website can gather information about you. Dealing with this is high on my list.
Those are usually the only ways sites try to detect you. Other methods include various types of Flash cookies and all the mechanisms listed at evercookie (virtually irrevocable persistent cookies). You can also calculate hashes based on a person's browser footprint, using info such as enabled browser plugins, system fonts, etc.:
https://panopticlick.eff.org/
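
To make the footprint-hash idea concrete, a rough sketch. The attribute values below are invented; real fingerprinting services use many more signals and weight them by how much identifying entropy each one adds.
Python:
# Rough sketch: hash a handful of browser attributes into one fingerprint.
# Two workers exposing identical attributes collide on the same hash.
import hashlib

attributes = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
    "plugins":    "Shockwave Flash 11.9;Java Deployment Toolkit 7",
    "timezone":   "-300",
    "screen":     "1366x768x24",
    "fonts":      "Arial,Calibri,Comic Sans MS,Tahoma,Verdana",
    "cookies":    "enabled",
}

fingerprint = hashlib.sha1(
    "|".join("%s=%s" % (k, attributes[k]) for k in sorted(attributes)).encode()
).hexdigest()

print(fingerprint)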

4) Proxies. Jesus Christ, what is wrong with these people? Most that I’ve dealt with are in serious need of a realignment. How hard can creating a proxy server be? I want to try this. Never for resale, internal use only. I’m currently buying a couple hundred private proxies each month.
It's hard. I also have a lot of proxies set up. The best I've managed is $5/month/machine (each machine only has 1 IP) with 1 Gbps up/down and unlimited bandwidth. I could get $1-$2/IP/month, but I need extreme amounts of bandwidth on each proxy. You also have to deal with spam complaints and other headaches, so I don't blame proxy providers for overcharging and overselling the way they do.

5) EC2 and other cloud based solutions. I’ve got a few Linodes and couldn’t be happier but would like to experiment. Get a baseline on ease of scaling vs. cost.
I'm a Windows guy because I code everything in C#, and I'm more comfortable with the environment. That being said, stick to Linode if you're comfortable with *nix. You should only use EC2 if you have a variable workload and need to fire up and tear down lots of instances all the time. Also, try to avoid using the regular instances, and opt for spot instances instead. You can save a lot of money that way.

6) OCR and pre-filtering before it gets to the OCR engine.
For simpler captchas you can detect patterns and filter them out with simple algorithms that go over the image pixel by pixel. Also, you get better accuracy if you feed the captcha character by character into the OCR, so you want to have a blob detection algorithm that identifies each letter as a blob, extracts it, and feeds it into the OCR.

In 99% of captchas you want to operate on them in greyscale. However, some captchas make different letters different colors, which makes it easier to segment them and extract the letters. Use common sense.
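
A minimal sketch of that preprocessing plus the character-by-character feeding mentioned above, assuming Pillow and pytesseract. The threshold and the crop boxes are placeholders; the boxes would come from your own segmentation step.
Python:
# Greyscale + crude threshold, then feed each extracted character to tesseract
# one at a time instead of handing it the whole captcha.
from PIL import Image
import pytesseract

img = Image.open("captcha.png").convert("L")            # greyscale
img = img.point(lambda px: 255 if px > 140 else 0)      # crude binarization

# boxes would come from your blob/segmentation step; hard-coded placeholders here
boxes = [(0, 0, 20, 40), (20, 0, 40, 40), (40, 0, 60, 40)]

text = ""
for box in boxes:
    char_img = img.crop(box)
    # --psm 10 asks tesseract to treat the image as a single character
    text += pytesseract.image_to_string(char_img, config="--psm 10").strip()

print(text)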

Some captchas will draw lines in between the characters so that blob detection can't split them into individual characters. At this point you will have to use a simple segmentation algorithm. I've had the most success with the histogram method for vertical segmentation on these captchas. This PowerPoint perfectly summarizes what I've been talking about:
http://www.cs.ucf.edu/~czou/CAP6135...cost Attack on a Microsoft CAPTCHA AP V3.pptx
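
The core of the histogram method fits in a few lines. A sketch with numpy and Pillow; the darkness threshold and cut-off are made up and need tuning per captcha.
Python:
# Vertical projection segmentation: count dark pixels per column, then split
# the image at columns where the count falls to (near) zero.
import numpy as np
from PIL import Image

img = np.array(Image.open("captcha.png").convert("L"))
dark = (img < 128).sum(axis=0)          # dark-pixel count for each column

segments, start = [], None
for x, count in enumerate(dark):
    if count > 1 and start is None:          # entering a character
        start = x
    elif count <= 1 and start is not None:   # leaving a character
        segments.append((start, x))
        start = None
if start is not None:
    segments.append((start, len(dark)))

print(segments)   # list of (left, right) column ranges, one per character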

Eventually you will want to crack captchas where the characters blend into each other and are distorted (such as recaptcha). All I can say is you will have to develop your own segmentation algorithm.

Also, for a classifier, in the beginning you usually want to use Tesseract. Consider normalizing the images (i.e. resizing them to be larger/smaller, removing obvious protrusions; you have to experiment with Tesseract). As you become more advanced you will want to build your own classifier using either SVMs (support vector machines) or neural networks.

SVMs are much easier to train and have a high success rate, but convolutional neural networks, when trained properly, have a higher success rate than even humans (MNIST Demos on Yann LeCun's website). All successful neural networks and SVMs rely on feature extraction instead of inputting the image pixel by pixel. Also you want to normalize the images before you feed them into the SVM or NN.
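
A toy sketch of the SVM route with scikit-learn. The "features" here are just mean darkness over a 4x4 grid of cells, and the training data is random placeholder noise, so treat it as the shape of the code only, not a working recognizer.
Python:
# Toy SVM classifier sketch with scikit-learn. Features are the mean darkness
# of each cell in a 4x4 grid -- a stand-in for real feature extraction.
import numpy as np
from sklearn import svm

def extract_features(char_img, grid=4):
    """char_img: 2D numpy array of one segmented, normalized character."""
    h, w = char_img.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = char_img[gy * h // grid:(gy + 1) * h // grid,
                            gx * w // grid:(gx + 1) * w // grid]
            feats.append(cell.mean())
    return feats

# train_images / train_labels would come from captchas you solved by hand
train_images = [np.random.randint(0, 255, (32, 32)) for _ in range(100)]  # placeholder
train_labels = ["a"] * 50 + ["b"] * 50                                     # placeholder

X = [extract_features(img) for img in train_images]
clf = svm.SVC(kernel="rbf", gamma="scale")
clf.fit(X, train_labels)

print(clf.predict([extract_features(train_images[0])]))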

Here are two excellent articles with code on convolutional neural nets to get people started. They also briefly cover feature extraction:
Neural Network for Recognition of Handwritten Digits - CodeProject
Convolutional Neural Network Workbench - CodeProject

Note how the convolutional neural net can recognize objects in images. Pretty cool :)

Here's the article that got me started on SVMs:
Handwriting Recognition Revisited: Kernel Support Vector Machines - CodeProject

Here's a comparison of all available image recognition technologies:
MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges

7) Automated unit / suite testing, continuous integration, Git kind of system flow.
For my own projects, I don't test my code. Just run it till it crashes. I realize it's a horrible practice, but I code for myself, and if it runs and performs the task I gave it, I couldn't care less about how "proper" it is.

At work (since no one likes messy code :p) we perform tests on each check-in to SVN using Jenkins:
Jenkins (software) - Wikipedia, the free encyclopedia
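
If Jenkins is overkill at first, the tests themselves can be tiny. A minimal sketch with the standard library's unittest; parse_price() is an invented stand-in for whatever your own code does.
Python:
# Minimal unittest sketch -- the kind of check a CI server would run
# automatically on every check-in.
import unittest

def parse_price(text):
    """Turn a scraped string like '$1,299.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))

class ParsePriceTest(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(parse_price("$19.99"), 19.99)

    def test_thousands_separator(self):
        self.assertEqual(parse_price("$1,299.00"), 1299.0)

    def test_garbage_input_raises(self):
        with self.assertRaises(ValueError):
            parse_price("not a price")

if __name__ == "__main__":
    unittest.main()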

8) Transitioning from 2.7.x to 3.x
I leave this to mattseh.

Cheers,

-t
 
If you want to become a better coder... code.

Read about how to code and read other people's code. (That is painful.)

Learn how to comment your own code.

Read this:

The Pragmatic Programmer: From Journeyman to Master
Amazon.com: The Pragmatic Programmer: From Journeyman to Master eBook: Andrew Hunt, David Thomas: Kindle Store

(That book is language agnostic)

::emp::
 
Item 3) in the OP.

Many sites that do not allow multiple accounts I was still able to have long-term interaction with, without using any account-link prevention techniques. That is, I would create an account with a different login ID and password and any other info they required for each worker, and it never raised a flag. For the majority that did catch it and link the accounts, a separate IP was all that was necessary. I went ahead and created a profile for each worker, so on sign-up and each login there was a variety of user agents as well. My goal was to eliminate a connection or pattern between each worker. So instead of 50 workers from one server all showing Firefox 19, Windows 7, etc., there would be a nice variety of Chrome, FF, Opera, IE and different versions of each within the browser group, and different OSes. In addition to each worker having its own private anonymous IP.

Then I got interested in what else besides IP and UA the website could interrogate my workers for. What additional info they could extract that might create unwanted patterns. I remembered an old thread on WF that had a link to this site https://panopticlick.eff.org/ showing how a unique “fingerprint” could be created. Page 5 of this paper https://panopticlick.eff.org/browser-uniqueness.pdf talks briefly about their methods. I found this interesting.

So, being the cocky little coder I think I am, I pointed a Python script with about 70 workers at the testing part of this site https://panopticlick.eff.org/, each worker on a separate IP and using a variety of user agents. This was a humbling experience. I was under the delusion that when I went through a proxy server all ties, traces, and information disappeared upstream of the proxy server. Fuuuuuuuuuu. Every single one of my 70 workers I thought was protected showed the same . . .

-HTTP Headers
-Browser Plugin details
-Time Zone
-Screen Size
-System Fonts
-Cookies Status
-Super Cookies Status

The IP worked and my user agent string worked. On the above attributes, if the webdriver was actually FF or Chrome (regardless of my fake UA) it would still show the same for the above list. However, if I went headless it looked even more unnatural. The biggies making it stand out were plugin details and system fonts. The attributes for the headless driver are for the most part blank. Considering that the broadest slice of website visitors will be on Windows using IE or one of the top 3 most popular browsers, my headless setup looked totally unnatural. It would not take much to pull my workers out of the crowd.
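
A quick way to see exactly what a headless worker is leaking is just to ask the driver. A sketch with Selenium's execute_script, querying the same attributes listed above; the driver choice and URL are placeholders.
Python:
# Ask the driver what the page's JavaScript would see. If these come back
# blank or zero, that is the "unnatural" footprint described above.
from selenium import webdriver

driver = webdriver.Chrome()          # or whatever headless driver you run
driver.get("https://example.com")    # placeholder page

leaked = driver.execute_script("""
    return {
        userAgent: navigator.userAgent,
        plugins:   navigator.plugins.length,
        fonts:     (document.fonts ? 'font API present' : 'no font API'),
        timezone:  new Date().getTimezoneOffset(),
        screen:    screen.width + 'x' + screen.height + 'x' + screen.colorDepth
    };
""")
print(leaked)
driver.quit()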

So then my research turned to what are all the possible attributes a website could possibly interrogate my workers for. This is where it really got intimidating.

Just the HTTP headers: List of HTTP header fields - Wikipedia, the free encyclopedia
Now add up all the clues these services could provide: GA, GWT, Piwik and 50 more.

Now how about a few of the routes they can take to extract this info from the workers.
HTTP Headers
JS / Ajax
Flash applet
Java
Cookie monsters
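
One cheap sanity check on the HTTP-header route is to hit an echo endpoint and look at what the client actually sent. A sketch with the requests library and httpbin.org, which simply reflects your request headers back; the header values here are invented.
Python:
# httpbin.org/headers echoes back whatever headers your client sent,
# which makes it easy to spot anything that doesn't match your fake UA.
import requests

headers = {
    "User-Agent":      "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Accept-Language": "en-US,en;q=0.5",   # a blank or odd value here stands out
}

resp = requests.get("https://httpbin.org/headers", headers=headers, timeout=30)
print(resp.json())   # compare against what a real Firefox on Windows sends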

Ignorance on this stuff will keep me from playing my games with the big boys. So back to school we go. I obviously had to know more about what was going on at a lower level than I had been used to. With a little bit of digging it looked like I needed to get intimate with this tool: Fiddler - The Free Web Debugging Proxy by Telerik. This is a start to having a chance to see what is really happening between my workers and the website, but I am unsure of the best way to go about it.

1) If I were able to determine exactly what a website was interrogating my workers for, I eventually might have the ability to give them the information I want them to have. It should be mainstream. Not some browser-extension / system-font identity that only a 1-in-3.5-million webdev guy would have on his machine. In addition to being a very common footprint, there should be variations between workers, because that would look more natural and help reduce a common link between workers.
2) Create a really complete profile for every worker to cover any possible query via any route from the websites.
3) ?????

Both have obvious flaws.

#1 Even if I were able to determine what they are looking for now, they could ask for something different tomorrow, or even access it by a different route, and it fails at doing a proper job of isolating the workers from each other.

#2 This may be a more future-proof solution, but what kind of time and effort might be unnecessarily involved for something that may never be asked for?

If there were a way to just deny the information to a website, that in itself could be a huge red flag that this is not a natural person.

This is where I’m at now. I need to know this for my own satisfaction, but I’ve certainly got to be making this harder than it should be.
 
This is how you set it with cURL:

curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

No idea on Python.

::emp::
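
For the Python side, a rough equivalent with the requests library (the URL is a placeholder):
Python:
# Rough Python equivalent of the cURL snippet above, using requests.
import requests

ua = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13"
resp = requests.get("https://example.com", headers={"User-Agent": ua}, timeout=30)
print(resp.status_code)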
 
tainted thanks man. +rep and my very first like I have ever given out.

2) Server Resources
What in the world are you doing? In 99% of cases for IM-related programming, I/O should be your bottleneck :p


Not IM. It could be, but all projects and tasks have been chosen to get me as dirty as possible in what I wanted to accomplish atm. When I first started I thought scraping would be a good first project. I immediately realized that if I couldn't navigate to the page that had the target data, I could not scrape. I made a list of the most intensive JS, Flash, god-awful-navigation, non-compliant, W3C-go-to-hell validation sites I could find to practice on.

As far as I/O, in my current game if my workers move much faster than a human I get detected and lose.

3) Cookies
Checking out the evercookie link.

4) Proxies
Ouch on the $5/month. Cost is not my biggest complaint, it is dependability. Because each of the proxies I use is going at human speed, I don't think I have the complaint potential. None yet anyway. My issue is when someone else gets a whole subnet taken out (back to I/O). Then any of mine that belong to that group may become useless.

Here is a fun-filled fact I was surprised with in the beginning, even though I don't do Google. If I'm interacting with a website that uses any of Google's APIs (GA, fonts, JS libraries, etc.) and they do as suggested and put the JS code after the </head> tag, it can stop the page from loading if I'm using a proxy from a group someone else got blacklisted. Okay, it happens, I'm not blaming anybody. My issue is I have workers down, or I have to carry excess proxy inventory to keep going, because it may be some time before the proxy provider will replace them.

6) OCR
I use a service for captchas, $1.39 per k I think. My interest is converting scanned documents > text > overlaid on a PDF for storage. Just something interesting for another day. Excellent links, and very interested in the neural network stuff, thanks.

7) Testing
Lolz on the run-it-till-it-crashes method. Which is what I'm using now. However, I'm spending too much time manually running things after an update and chasing issues. I've got to get a handle on this at some point. It is going to exceed the dev time.

Thanks for the details.
 
If you want to become a better coder... code.

Read about how to code and read other people's code. (That is painful.)

Learn how to comment your own code.

Read this:

The Pragmatic Programmer: From Journeyman to Master
Amazon.com: The Pragmatic Programmer: From Journeyman to Master eBook: Andrew Hunt, David Thomas: Kindle Store

(That book is language agnostic)

::emp::

This is how you set it with cURL:

curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

No idea on Python.

::emp::

Coding evar day, and I think I've worked my way through most of the Python war chest.

Thanks for the book link, checking it out now.

Got the UA covered; it's those 300 other things the headers are whispering about me.

Thanks man!
 
Ahhh... scraping

That is where your RAM issues are coming from.

Think about where you keep your data while working. Right now, you are leaving everything in RAM.

Running regexes is nice, but it also adds to RAM usage.

Serialize your objects if necessary (writing objects to disk).

Think about splitting up tasks.

Retrieval / Analysis / Response

might be separated into different workflows.

::emp::
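
A bare-bones sketch of that split: the retrieval pass dumps raw pages to disk and the analysis pass reads them back one at a time, so nothing big sits in RAM between stages. The file name and the "analysis" step are placeholders.
Python:
# Split retrieval from analysis: fetch pages, write them straight to disk,
# then let a separate pass parse them later instead of holding it all in RAM.
import json
import urllib.request

def retrieve(urls, out_path="pages.jsonl"):
    with open(out_path, "w") as out:
        for url in urls:
            with urllib.request.urlopen(url, timeout=30) as resp:
                html = resp.read().decode("utf-8", errors="replace")
            out.write(json.dumps({"url": url, "html": html}) + "\n")

def analyze(in_path="pages.jsonl"):
    with open(in_path) as f:
        for line in f:                               # one page in memory at a time
            page = json.loads(line)
            yield page["url"], len(page["html"])     # placeholder "analysis"

retrieve(["https://example.com/"])
for url, size in analyze():
    print(url, size)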
 
One of the many mistakes I made with PHP came from having almost instant success with it. I never really took the time to figure out why my Frankenstein code worked. It did what I wanted and I kept going. I was using frameworks before I fully understood classes. I used classes before I properly understood procedural code. I don’t think I ever typed more than five lines without a syntax error, but I never got totally stumped. It just worked for what I wanted atm.

I really can't contribute but I lol'd while reading this... I started with PHP also back before I chose marketing and people over code. I was building silly stuff in CodeIgniter using abstract classes and namespaces before I even knew why the hell I was using OOP in the first place.

On the other hand PHP is still, by far, the best language to start any webapp with. Not because it's a superior language (it's certainly not), but because of the existing support and developer base that exists for it. I guess the best analogy is Wordpress vs EE/Drupal... Wordpress is a laughable pile of shit, but good luck getting off the ground as fast with the other two.
 
Ahhh... scraping

That is where your RAM issues are coming from.

Think about where you keep your data while working. Right now, you are leaving everything in RAM.

Running regexes is nice, but it also adds to RAM usage.

Serialize your objects if necessary (writing objects to disk).

Think about splitting up tasks.

Retrieval / Analysis / Response

might be separated into different workflows.

::emp::

Good morning emp & WF.

I sidelined scraping early and will revisit after I get a handle on the HTTP Header issue.

I snatched a copy of The Pragmatic Programmer and read a few chapters last night. Wish I had read the section on orthogonality a few days earlier. It would have saved a lot of time when I had hard-coded 50-plus XPath elements and then the site changed its structure. Seems obvious now, and lesson learned. Oh well, I'm still really new to all of this and would bet heavy that there are many other better choices I could make.

Thanks again for the heads up on the book.
 
. . . good luck getting off the ground as fast with the other two.

I'm so green I can't really say anything good or bad about any of the language options.

Getting off the ground fast is the last thing I'm after. This time around it is all about understanding what is really happening and learning the mindset. Identifying all of the possible methods and logically picking the better of the possible solutions.

The language at this point is irrelevant to what I'm after. :cool2:
 
Today 10.17.13 Tasks:

1) Learn the nuances of Fiddler (The Free Web Debugging Proxy by Telerik) better.
Specifically, how to get more efficient at filtering down to just the requests / responses that I'm after. Remove the noise so I can see what info the websites are gathering about my workers.

2) Identify all info gathering calls.

3) Identify all routes used to gather that info.

4) Research other tools that will aid in identifying the information being exchanged between workers and websites.
 
this is how you set it with curl

curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

Here's a whole class for cURL that should save you a lot of typing. Feel free to use/share.
PHP:
<?php
/*
|============================================================|
|----------------------- cURL Wrapper -----------------------|
|---------------------- By Tammam Nima ----------------------|
|---------------------- tammamnima.com ----------------------|
|============================================================|
|------------------------------------------------------------|
|-------- Only ugly people remove this comment block --------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|------------------------------------------------------------|
|------ If you can see this comment, rest assured that ------|
|---------- whoever posted this script is not ugly ----------|
|------------------------------------------------------------|
|--------------------- Have a nice day. ---------------------|
|------------------------------------------------------------|
|============================================================|
*/
    class cURLWrapper {
        private $curlHandle;
        private $cookieFile;
        
        const DEBUGGING = false; //true;
        
        public function __construct() {
            $this->debug("Creating cURL Session.");
            $this->curlHandle = curl_init();
            curl_setopt($this->curlHandle, CURLOPT_RETURNTRANSFER, true);
            $this->setTimeout(60);        //set default timeout of 1 min.
        }
        
        public function __destruct() {
            $this->debug("Closing cURL Session.");
            curl_close($this->curlHandle);
        }
        
        public function sendRequest($executeURL, &$output){
            $this->debug("Sending cURL Request.");
            curl_setopt($this->curlHandle, CURLOPT_URL, $executeURL);
            $output = curl_exec($this->curlHandle);
            
            if($output === false){
                $this->debug("Error: " . curl_error($this->curlHandle));
                $output = "CURL ERROR:\n" . curl_error($this->curlHandle);
                die($output);
            }
        }
        
        public function setCookieFile($fileName){
            $this->debug("Setting Cookie File To: \"" . $fileName . "\".");
            $this->cookieFile = $fileName;
            curl_setopt($this->curlHandle, CURLOPT_COOKIEJAR, $this->cookieFile);
            curl_setopt($this->curlHandle, CURLOPT_COOKIEFILE, $this->cookieFile);
        }
        
        public function clearCookies(){
            $this->debug("Clearing Cookies.");
            
            if($this->cookieFile != ""){
                $fileHandle = fopen($this->cookieFile, 'w');
                fwrite($fileHandle, "");
                fclose($fileHandle);
            } else {
                $this->debug("Cookie file not set. Unable to clear.");
                trigger_error("Cookie file not set. Unable to clear.");
            }
        }
        
        public function clearCookies2(){
            curl_setopt($this->curlHandle, CURLOPT_COOKIESESSION, true);
        }
        
        public function disallowVerifyingCertificates(){
            $this->debug("Disallowing Verifying Peer SSL Certificate.");
            curl_setopt($this->curlHandle, CURLOPT_SSL_VERIFYPEER, false);
        }
        
        public function setTimeout($timeout){
            if(is_numeric($timeout)){
                $this->debug("Setting Timeout To " . $timeout . " Seconds.");
                curl_setopt($this->curlHandle, CURLOPT_TIMEOUT, $timeout);
                curl_setopt($this->curlHandle, CURLOPT_CONNECTTIMEOUT, $timeout);
            } else {
                $this->debug("Timeout Provided Is Not Numeric");
                trigger_error("Timeout Provided Is Not Numeric");
            }
        }
        
        public function setHTTPProxy($proxyIP, $proxyPort){
            $this->debug("Setting HTTP Proxy To \"" . $proxyIP . ":" . $proxyPort . "\".");
            curl_setopt($this->curlHandle, CURLOPT_PROXYTYPE, CURLPROXY_HTTP); //use the constant, not the string 'HTTP'
            curl_setopt($this->curlHandle, CURLOPT_PROXY, $proxyIP);
            curl_setopt($this->curlHandle, CURLOPT_PROXYPORT, $proxyPort);
        }
        
        public function setCustomHeaders($headersArray){
            curl_setopt($this->curlHandle, CURLOPT_HTTPHEADER, $headersArray);
        }
        
        public function setReferer($referer){
            curl_setopt($this->curlHandle, CURLOPT_REFERER, $referer);
        }
        
        public function setVerbose($verbose){
            curl_setopt($this->curlHandle, CURLOPT_VERBOSE, $verbose);
        }
        
        public function setUserAgent($userAgent){
            $this->debug("Setting User Agent To \"" . $userAgent . "\".");
            curl_setopt($this->curlHandle, CURLOPT_USERAGENT, $userAgent);
        }
        
        public function enableRedirect(){
            $this->debug("Allowing cURL To Handle Redirects.");
            curl_setopt($this->curlHandle, CURLOPT_FOLLOWLOCATION, true);
        }
        
        public function setPOSTFields($postFields){
            $this->debug("Setting POST Fields To \"" . $postFields . "\"");
            curl_setopt($this->curlHandle, CURLOPT_POST, true);
            curl_setopt($this->curlHandle, CURLOPT_POSTFIELDS, $postFields);
        }
        
        public function setGET(){
            $this->debug("Setting Method To GET");
            curl_setopt($this->curlHandle, CURLOPT_POST, false);
        }
        
        
        private function debug($output, $newLine = true){
            //print, output to file, whatever.
            if(self::DEBUGGING){
                echo $output . ($newLine ? "\n" : "");
            }
        }
    }
?>
What in the world are you doing? In 99% of cases for IM-related programming, I/O should be your bottleneck :p
Depending on what you're doing, File Mapping may be the solution for you. I/O shouldn't be a big issue. I'm not sure if I can post links (considering my low post count), just Google "File Mapping", the Microsoft site has a decent explanation. However, as always, they don't have any decent example code.

If you're coding something that builds up a lot of data you want to write to file, such as a scraper, make sure you're not writing to file every time you scrape a url/email/number/whatever (inb4 run-on-sentence). This will definitely slow you down. What you want to do is have a good size buffer. Something big enough so you're not writing to file a hundred times a second, yet small enough that you're not destroying your RAM usage.

If you've written a threaded application, I found it best to have the child threads do all the work, and the main (parent) thread do all the I/O. This way you don't waste time waiting for mutexes to unlock.
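
In Python terms, a sketch of that pattern: worker threads push results onto a Queue and only the main thread writes to disk, flushing in batches. The batch size, file name, and the "work" itself are placeholders.
Python:
# Child threads do the work and push results to a queue; only the main
# thread writes to disk, in batches, so workers never block on file I/O.
import queue
import threading

results = queue.Queue()
BATCH_SIZE = 100

def worker(items):
    for item in items:
        results.put("processed %s" % item)     # placeholder "work"
    results.put(None)                          # sentinel: this worker is done

items = list(range(1000))
threads = [threading.Thread(target=worker, args=(items[i::4],)) for i in range(4)]
for t in threads:
    t.start()

done, batch = 0, []
with open("output.txt", "w") as out:
    while done < len(threads):
        item = results.get()
        if item is None:
            done += 1
            continue
        batch.append(item)
        if len(batch) >= BATCH_SIZE:           # flush in chunks, not per item
            out.write("\n".join(batch) + "\n")
            batch = []
    if batch:
        out.write("\n".join(batch) + "\n")

for t in threads:
    t.join()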


If I'm interacting with a website that uses any of Google's APIs (GA, fonts, JS libraries, etc.) and they do as suggested and put the JS code after the </head> tag, it can stop the page from loading if I'm using a proxy from a group someone else got blacklisted.

You must be using a library that automatically evaluates HTML and executes JS (like the Internet Explorer ActiveX control I used religiously when I was younger). It's best to manually get the page and parse it yourself, then write the HTTP requests yourself. This way, you avoid wasting resources on useless things like stylesheets, JS libraries, and images.

If I'm writing a bot that interacts with a website, I like to get a good understanding of the website's communication first. Using Wireshark or the Live HTTP Headers plugin in Firefox are great for that.

If I'm posting to a page where I don't require a response, I'll skip that portion; I'll close the connection after HttpSendRequest (without calling HttpQueryInfo).

Alternatively, if I want to check that I got a "200 OK" HTTP response without getting the rest of the page, I'll close the connection after HttpQueryInfo (without calling InternetReadFile).

cURL gives you the entire response, headers concatenated with the rest of the data (if you enable CURLOPT_HEADER). These are always separated by two consecutive carriage-return/line-feed pairs ("\r\n\r\n"). If you just want the header and you're reading the response n bytes at a time, you can stop after finding "\r\n\r\n". The PHP class above doesn't really allow for this without some tweaking first.




I hope this post has been useful. I do most of my coding in C++. My HTTP examples were all for WinINet, but the ideas behind them should be translatable to Python or any other language.
 
Things I learned so far today.

1) mattseh’s byline “import this” prints the Zen of Python’s 19 guidelines.

2) Fiddler2 is a beautiful thing. I believe I know the extent of the filtering capabilities.
3) Handling the HTTP Header dialog is going to be a bigger bite than originally thought. I simply do not have my head around it properly yet. Need to go backward to get a more fundamental understanding before I can go forward.
4) Discovered that I am utterly incapable of leaving working code alone. When I re-visit what I wrote just a few days ago and ask myself wtf did I do it that way for? I have to refactor with what I’ve learned since.

Tonight, when I have gone cross-eyed with the worker vs. website interrogation issue, I think I’m going to switch gears for a while and implement Pycco documentation on all self-authored existing code worth keeping. emp mentioned commenting earlier and I’ve done a poor job of it. May as well get a system in place now. Hopefully this will tie in nicely with the future auto-testing, CI, Git flow action.
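
Roughly the comment style I'm aiming for, assuming Pycco's plain-comments-plus-docstrings convention. The function and field names are just an invented example.
Python:
# === Worker profiles ===
# Every worker gets its own login, IP, and user agent so accounts never share
# an obvious link. Plain `#` comments like these are what Pycco renders as
# prose next to the code; docstrings document each function.

def build_profile(login_id, password, proxy, user_agent):
    """Bundle one worker's identity into a single dict."""
    # Field names here are an invented example, not a required schema.
    return {
        "login_id": login_id,
        "password": password,
        "proxy": proxy,
        "user_agent": user_agent,
    }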
 
|-------- Only ugly people remove this comment block --------|
|------------------------------------------------------------|

lol

You must be using a library that automatically evaluates HTML and executes JS

I have to execute the JS. Otherwise I cannot interact with the page. Current objective is to mimic a human and not be detected as a bot.

. . . then write the HTTP requests yourself. . . .

This hits home. What I'm trying to do is intercept anything asking for information that might be used to link my workers, and give the website the info I want it to have.

. . . Using Wireshark or the Live HTTP Headers plugin in Firefox are great for that. . . .

Checking out Wireshark now thanks.
 
Checking out Wireshark now thanks.

Just a heads up:
Wireshark captures all traffic, across all ports (and therefore protocols), so you may be overwhelmed with information.



So here are some filters to get you started with Wireshark:
Code:
ip.dst == [Server IP] || ip.src == [Server IP]
Replace "[Server IP]" with the IP address of the machine you're communicating with. It will filter all packets where the destination IP address or source IP address is what you entered.



Another nice one is (for http):
Code:
http contains "this phrase"
Which will filter all HTTP packets containing "this phrase".