cURL problems!!! Aequitas?


krazyjosh5

May 28, 2007
Well, I'm trying to program a scraper to pull pretty large chunks of data down from the tubes.

When I run it, though, it seems to stop pulling data around 1 minute after execution and at about 20k of parsed data. The pages are pretty damn big in terms of filesize, but I don't imagine they're anything it couldn't handle... It's going through multiple pages of a directory of online listings and pulling specific pieces of information. Lots of RegExp lovin' goin' on, but that's all scrapers.

I have already tried the following:
  • Increased Max Execution times in PHP.ini
  • Increased Max Input times in PHP.ini
  • Max uploaded/downloaded filesize in PHP.ini = 8M
Still, the page randomly stops loading most of the way through (it stops on the S's of an A-Z directory, with a redraw of the page at the F's).


Anyone have any ideas on this? Aequitas? smaxor?
Suggestions?!
 


Do you have CURLOPT_TIMEOUT set? Is the page returning a "script has reached max execution time" error, or is the curl request just not returning anything?
As georgiecas1y said, post your curl function, minus the URL and postfields.
 
The problem with the PHP cURL library is that you can't see the exact HTTP request being sent as of PHP 5, which is essential for debugging apps and shit. Going to a lower level with fsockopen lets you see that, since you have to create the headers yourself.
 
There are still a couple of ways around it, though. If you make a PHP page with var_dump($_SERVER); in it, then change the URL in the curl request to that page, you can see all the HTTP headers that you're sending. Then use the CURLOPT_HTTPHEADER setting to override any funky headers (e.g. "Expect: 100-Continue"), or to set any required headers that curl doesn't set by default.
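Something like this, roughly (a minimal sketch; the debug.php filename and its URL are just placeholders):

Code:
<?php
// debug.php - dump every header/server variable the incoming request carried
var_dump($_SERVER);

Code:
<?php
// In the scraper: point the request at the debug page to see what cURL sends,
// then use CURLOPT_HTTPHEADER to override the funky ones.
$ch = curl_init('http://www.yoursite.com/debug.php'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Expect:',                // a blank value kills the "Expect: 100-Continue" header
    'User-Agent: Mozilla/5.0' // or any header the target site requires
));
echo curl_exec($ch);
curl_close($ch);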
 
Yeah, I used to use that, but then I wanted to check which HTTP version curl was actually using, and you don't see the "GET / HTTP/1.1" line, so I said fuck it and moved over to fsockopen. Also, if you use cookiejar and cookiefiles, the test page won't be sent the cookies for the actual site you're scraping. Lots of little problems like that.
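Roughly what the fsockopen approach looks like, for anyone curious (a bare-bones sketch with no error handling; the host and path are placeholders):

Code:
<?php
// Build the raw HTTP request by hand so every line of it is visible.
$host = 'www.example.com';   // placeholder target
$path = '/directory/a.html'; // placeholder page

$fp = fsockopen($host, 80, $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: $errstr ($errno)");
}

$request  = "GET $path HTTP/1.1\r\n";  // the request line curl never shows you
$request .= "Host: $host\r\n";
$request .= "User-Agent: Mozilla/5.0\r\n";
$request .= "Connection: Close\r\n\r\n";

fwrite($fp, $request);

$response = '';
while (!feof($fp)) {
    $response .= fgets($fp, 4096);
}
fclose($fp);

echo $request; // log or echo exactly what was sent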
 
Yeah, I agree... there are a lot of bugs in the PHP/cURL binding. Just offering up some workarounds.
 
Lot of good pointers and workarounds here but let me ask you the most important question.

Are you running this on shared hosting or dedicated hosting?

The reason why I ask this question is because some hosting accounts will automatically kill a script running on your site if that script runs past a certain amount of time.

On shared hosting plans you're shit out of luck for changing this, but on dedicated hosting plans you might be able to get access to your php.ini file and change some of the timeout settings. I can't remember the exact settings that need to be changed off the top of my head, but that is just another thing to check.
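These are the usual suspects in php.ini, if you can get at it (the values here are just examples, not recommendations):

Code:
; php.ini - raise the limits a long-running scraper tends to hit
max_execution_time = 300   ; seconds a script may run
max_input_time = 300       ; seconds allowed for parsing request data
memory_limit = 128M        ; per-script memory ceiling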

Also, you might not even be having a timeout issue with the script or your hosting. You said it stops working after 1 minute; that can also be caused by your browser timing out. The browser is meant to time out after about 30 seconds if no new information is sent to it, but you can use various PHP functions to get around that. Look into output buffering and the sleep function. Something like the following can stop a browser timeout, but I wouldn't use it in a commercial application; if it's a personal project, you can test it out.

Code:
set_time_limit(900);

// Start output buffering
ob_start();

$message = "First test message";
$oldmessage = "bla";

// Keep on repeating this to prevent the browser from timing out
while (true)
{
    $timeoutcounter = 0;
    while ($message == $oldmessage)
    {
        // If 10 seconds elapsed, send a dot (or any other character)
        // to keep the connection alive
        if ($timeoutcounter == 10)
        {
            echo ".";
            ob_flush();
            flush();
            $timeoutcounter = 0;
        }
        // Wait a second before checking again
        sleep(1);
        // Check for a new message
        $message = file_get_contents("chatdata.txt");
        $timeoutcounter++;
    }

    // Keep the old message in mind
    $oldmessage = $message;

    // And send the message to the user
    echo "<script>window.alert(\"" . $message . "\");</script>";

    // Now, flush the output buffer so the browser receives it
    ob_flush();
    flush();
}

Also, as others have stated, the PHP function set_time_limit() and curl_setopt($ch, CURLOPT_TIMEOUT, 15) can be used to avoid this situation. By the way, the 15 on the end of CURLOPT_TIMEOUT is in seconds, whereas some options are set in milliseconds, so watch out for that.
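For example (just a quick sketch; the 0 / 10 / 15 values and $url are placeholders):

Code:
set_time_limit(0); // let PHP run as long as the scrape needs

$url = 'http://www.example.com/'; // placeholder
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // seconds to wait for the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 15);        // seconds allowed for the whole transfer
$html = curl_exec($ch);
curl_close($ch);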

Sometimes tracking down a timeout problem can be a real bitch, but just know the problem can be a browser timeout (if you don't send new information every 30 seconds to a minute) or your hosting account killing the script for running too long.

I hope this helps out some more.
 
Also, add ignore_user_abort(true); at the top of your script (without the true argument it only reads the current setting, it doesn't change it).

This will let the script keep running in the background, which means your browser doesn't have to stay open the entire time. Good if you save to a db/txt file instead of just outputting.
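Roughly like this (a sketch; results.txt and the placeholder data are just examples):

Code:
<?php
ignore_user_abort(true); // keep running even if the browser disconnects
set_time_limit(0);       // remove PHP's execution time limit

// ...do the scraping here, then save instead of echoing...
$data = "example scraped row\n"; // placeholder for real scraped data
file_put_contents('results.txt', $data, FILE_APPEND);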
 
- setopt(CURLOPT_VERBOSE) to see what curl is actually doing
- IP protection on the target (use proxies)
- are you calling curl_init()/curl_close() properly?
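If it helps, CURLOPT_VERBOSE can be pointed at a log file, something like this (debug.log and $url are just example names):

Code:
$url = 'http://www.example.com/'; // placeholder
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);                      // log the request/response chatter
curl_setopt($ch, CURLOPT_STDERR, fopen('debug.log', 'w')); // write it to a file instead of stderr
// curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8080');      // optional: route through a proxy
$html = curl_exec($ch);
curl_close($ch);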
 
1. Make sure you have set_time_limit(0) so it can run as long as it needs.
2. Make sure you have ini_set('display_errors', TRUE); error_reporting(E_ALL); so you can see if an error somewhere is actually causing it to stop.
3. If you are storing a shit ton of information in a variable or something, it may be using up too much memory. Do something like this: ini_set('memory_limit', '100M');
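All three in one place, for reference (a sketch; 100M is just an example value):

Code:
set_time_limit(0);               // 1. no execution time limit
ini_set('display_errors', TRUE); // 2. show errors in the output
error_reporting(E_ALL);
ini_set('memory_limit', '100M'); // 3. raise the memory ceiling if you buffer a lot of data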
 
I'd almost guarantee you're on shared hosting. There are some shared hosting companies out there that won't let a script run over X amount of time, period. I've never had to set the timeout options to 0 to make things work right, but maybe my curl setup is different. I do a lot of large file downloading and it doesn't need anything except simple stuff like

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);  // makes it a POST request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of echoing it
$file = curl_exec($ch);
curl_close($ch);

That should work. If it doesn't, I'd email your hosting provider. Sure, you can set the timeout to 0, which makes it unlimited, but 99 times out of 100 that's the default anyway. CURLOPT_TIMEOUT is usually used to limit things in case the site you're pulling from sucks and drops you or something like that.

Not knowing the details of what you are trying to scrape, I can only give you a suggestion that I have used before. Since the script is called over and over, time limits are (hopefully) not reached, and memory use is kept low:

1) Generate a list of pages to scrape (another scrape?). One URL per line, saved to a text file (or a database if you want to get complicated).

2) Now, create another script that opens the file, pulls the first line off, and rewrites the file without it. THEN do the cURL thing on that URL, saving the results however you would like.

3) When it is done, have it call itself until the file is empty.

Add to step 2: use a proxy, and go randomly or serially through a list of proxies you have. That way, if you are scraping a single site, the requests are varied... (see the sketch at the end of this post)

Also, here is the code (minus cookie/agent stuff) that I use often:

Code:
$url = 'http://whatever....';

$ch = curl_init();
curl_setopt($ch, CURLOPT_POST, 1); // use if POSTing the request
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string); // concatenated like GET
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'proxyname_or_ip:proxy_port_number');
$result = curl_exec($ch);
curl_close($ch);
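And here is a rough sketch of steps 1-3 above, assuming the queue lives in urls.txt, results get appended to results.txt, and the script is reachable as scrape.php (all three names are just placeholders):

Code:
<?php
// scrape.php - pop one URL off the queue, scrape it, then call itself again.
$queue_file  = 'urls.txt';    // placeholder: one URL per line
$output_file = 'results.txt'; // placeholder: where scraped data gets appended

$urls = file($queue_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if (empty($urls)) {
    die('Queue is empty - all done.');
}

// Pull the first line off and rewrite the file without it
$url = array_shift($urls);
file_put_contents($queue_file, implode("\n", $urls));

// Do the cURL thing on that URL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// curl_setopt($ch, CURLOPT_PROXY, 'proxyname_or_ip:proxy_port_number'); // rotate proxies here
$html = curl_exec($ch);
curl_close($ch);

// Save the results however you like (appending the raw HTML here for simplicity)
file_put_contents($output_file, $html . "\n", FILE_APPEND);

// When it is done, have it call itself until the file is empty
header('Location: scrape.php'); // placeholder URL for this script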
 