Scraoing n00b. Help a brotha' out! Scrape wp-ecommerce product data.

greyhat

English Gent
Nov 30, 2009
853
8
0
UK
Edit: *Scraping - too soon to be getting back in to work!

Festive cheer to you all WF!

I have a wordpress e-commerce site and i need to get all of the product info in to a CSV file.

How much of a pain in the ass is this going to be and can anybody give me a few pointers please? I would prefer to attempt this in PHP with CURL as this is the language I am trying to learn at the moment. there are approximately 250 products and I just need a .csv with all the info, prices, image URL's etc..


Thanks!!

- G
 


You don't specify whether scraping is the primary purpose your learning PHP / cURL, but if that's the case, I'd quit while your ahead. PHP can scrape, but there are far better alternatives.

And providing your product pages follow a specific layout, it shouldn't be much of a problem, look up phpDOM and using xPath.

Although, you state that this your is site? The plugin probably stores all the data in a mysql database anyway?
 
Yes its wordpress e-commerce. My problem is that the product info is stored in one row in 'wp_posts' and then each individual product image is stored in it's own 'child' row in 'wp_posts'. Which means i cant just dump the table and go through and change it in to a .csv easily without having to manually match up loads of rows with their poarent rows etc...

I was thinking maybe an SQL query might do it but i am a beginner on that front too.. Any ideas?

Yes the product pages are all the same. I am open to suggestions. Thanks!
 
It's hard to tell how hard it would be to do down the SQL route, without seeing the actual data.

If the product pages are all the same, then scraping them shouldn't take long. Look into using phpDOM and xPath as I stated up there, or if you need it quick, just pay somebody else to do it (probably the easiest option).
 
I am looking in to xpath. We will see i have found a plugin for firefox the generates the xpath queries. We will see. Would you be interested in doing it if needs be? I dont have a great budget but, If you are not busy?
 
NO, JUST NO.

Do not scrape shit when you have access to the database directly. Look into using PHP PDO to access the database (makes life so much easier normalizing data coming out of the database no matter the DB type).

Here's a general idea of how you'd take the data and put into a tab-delimited CSV.

PHP:
<?php
$formattedCSVString = '';
$headerRow = array(
                    "Unique Merchant SKU",
                    "Product Name"
                    // Etc.
                    );
$productRows = array();

// Import DB and PDO object with $dbh from external file

ini_set('memory_limit', '-1');

// Product Query
// Instantiation of the PDO class into $dbh is in another file, but you get the idea
try {
    $query = "QUERY GOES HERE";

    $stmt = $dbh->prepare($query);
    $stmt->execute();

    while($row = $stmt->fetch(PDO::FETCH_ASSOC)){
        // Assign Products into the Product Row Array
        $productRows[] = array(
                            "Unique Merchant SKU" => $row["id"],
                            "Product Name" => $row["color_name"] . " " . $row["style_name"] . " " . $row["collection_name"] . " " . $row["category_name"]
                            // Etc.
                            );
    }
}
catch(PDOException $e){
    error_log('Database Error when getting Products Information: ' . $e->getMessage(), 0);
}

// Output header row
if ($headerRow){
    $formattedCSVString .= returnCSVPart($headerRow);
}

// Output the product rows
foreach ($productRows as $product){
    $formattedCSVString .= returnCSVPart($product);
}

// Write the formatted CSV string to the file
// Use this to get a UTF-8 Unicode CSV file that opens properly in Excel:
$tmp = chr(255).chr(254).mb_convert_encoding($formattedCSVString, 'UTF-16LE', 'UTF-8');
$write = fwrite($filePath, $tmp);
fclose($filePath);

// returnCSVPart function
// Return the input array as csv data, maintaining consistency with most CSV implementations
// * uses double-quotes as enclosure when necessary
// * uses double double-quotes to escape double-quotes 
// * uses CRLF as a line separator
function returnCSVPart($fields){
    $separator = '';
    $tempString = '';

    foreach ($fields as $field){
        if (preg_match( '/\\r|\\n|,|"/', $field)){
            $field = '"' . str_replace('"', '""', $field) . '"';
        }
        $tempString .= $separator . $field;
        $separator = "\t";
    }
    return $tempString . "\r\n";
}
?>
 
Thanks i get the idea of that but my problem is as i said each product is not just in one row it is spread out among rows linked by a primary key so the product info is in one row, then the additional images are in their own child rows and so on.. its not just a case of formatting each row in to a csv i need a way to group the data from the parent / child rows. Do you know what i mean?
 
Yeah, I get what you mean. I can't really say for certain without seeing your DB. The concept is pretty simple though once you break it down. You're basically linking two tables together using the primary key from the parent (or the same table to itself if that's where the images are) with that of the foreign key in the images table.
 
So my query would need to do something along the lines of:

For each 'post_id' select 'product_info', 'product_price' 'product_image' etc

AND for each row where 'parent_id' == (the current) 'post_id' select 'additional_image_url'

Does that make any sense!?

Thanks.
 
Sorry just seen your last reply. Will have a look in to it. Does that query above sound feasible?

Then once i have that data in an array just assign it to the right fields in the csv and loop through the whole DB, right?
 
Sorry just seen your last reply. Will have a look in to it. Does that query above sound feasible?

Then once i have that data in an array just assign it to the right fields in the csv and loop through the whole DB, right?

It's close, but here's the thing, you're going to want each additional image in its own column in the CSV right?

So what you're going to do is have to account for that with the query, either by making a subquery create a new field that's being selected and call it "additional_image_n" when there's an additional image that hasn't been added yet for the id or by just selecting all the main data and then call a new query while you're in the loop assigning DB data to the products information array to get each image assigned to that id.

Lots of ways to tackle it. You just have to get creative.
 
Rexibit: Makes sense, thanks. I'll be back! I'll get the queries doing what i want first then worry about the PHP

Lemsip: Thanks for that but it doesnt slove my problem of 'merging post data', unless i missed something?
 
Just gonna get my thoughts down, might make it easier. Anybody feel free to chime in!

So.. I mentioned it's wordpress e-commerce and the structure and such..

All of the information i need is in 'wp_posts' so i just created a DB with just that table in.

There is 'post_status' which is either 'inherit' or 'publish'. I have worked out that 'published' posts are the actual product pages and 'inherit' ones are the ones that just hold the URL of the additional image. So there is something to go on.

So my query can start out as. SELECT (all the fields i want) FROM 'wp_posts' WHERE 'post_status' = 'publish'

That gives me the main bulk of my data. Then to append the extra data from the 'inherit' rows. In the inherit rows they are paired 'post_parent' field macthing the 'ID' field of the parent post. Parent posts do not have a 'post_parent' (it always = 0).

:/

There is just one field i want from the 'inherit' (child) rows, it is 'guid'. That is where the image url is stored, each additional image has its own row.

Hmmm.. off for a smoke. Why did this guy have to use WP ecommerce!!?? :( ha.
 
Ok, i thought the images where in post_meta, in which case that plugin would have pulled it all together onto one line in a csv.

Im no php fan, but it looks like your on the right track, i guess you just need to loop through all the posts returned in your query, and in that loop run a new query where postparent = current post id and append.
 
Might try:

1. figure out how to do a 'select into' ... e.g. create a new table of only the posts marked 'inherit' to use temporarily. the new table would contain id, post_parent and guid for the image file you need.

2. figure out a sql join query between wp_posts and wp_posts_temporary where wp_posts.id = wp_posts_temporary.post_parent

... something like that.