Web scraping framework in PHP using jQuery selectors

danny

New member
Mar 10, 2007
New Jersey
https://github.com/dabbamont/domquery

This isn't fully tested and isn't well documented *yet*, but I'm using it as a sample for a few companies that are considering me for engineering positions. If you've done any work with scraping or data mining in PHP you'll see how powerful it is.

I've used this for a lot of scraping and it makes it EXTREMELY easy. To keep this short and sweet, here's an example of a scraping class I wrote really quickly with it (it's also included in the source).

Using the GoogleSearch class:
PHP:
$search = new GoogleSearch("xml");

$results = $search->getResults();

print_r($results);

The actual GoogleSearch class (built using DOMQuery) is posted below, after the output.

Returns:
Code:
Array
(
    [1] => Array
        (
            [keyword] => xml
            [position] => 1
            [title] => XML - Wikipedia, the free encyclopedia
            [url] => http://en.wikipedia.org/wiki/XML
            [description] => Extensible Markup Language (XML) is a markup language that defines a set of   rules for encoding documents in a format that is both human-readable and ...
            [domain] => en.wikipedia.org
        )

    [2] => Array
        (
            [keyword] => xml
            [position] => 2
            [title] => XML Tutorial
            [url] => http://www.w3schools.com/xml/
            [description] => A well organized and easy to understand free tutorial with lots of examples and   source code.
            [domain] => www.w3schools.com
        )

    [3] => Array
        (
            [keyword] => xml
            [position] => 3
            [title] => XML Introduction - What is XML?
            [url] => http://www.w3schools.com/xml/xml_whatis.asp
            [description] => XML was designed to transport and store data. ... XML stands for EXtensible   Markup Language; XML is a markup language much like HTML; XML was   designed ...
            [domain] => www.w3schools.com
        )

    [4] => Array
        (
            [keyword] => xml
            [position] => 4
            [title] => Extensible Markup Language (XML)
            [url] => http://www.w3.org/XML/
            [description] => Main page for World Wide Web Consortium (W3C) XML activity and information.
            [domain] => www.w3.org
        )

    [5] => Array
        (
            [keyword] => xml
            [position] => 5
            [title] => XML From the Inside Out -- XML development, XML resources, XML ...
            [url] => http://www.xml.com/
            [description] => XML.com, where the XML community shares XML development resources and   solutions, features timely news, opinions, features, and tutorials; the Annotated ...
            [domain] => www.xml.com
        )

    [6] => Array
        (
            [keyword] => xml
            [position] => 6
            [title] => IBM developerWorks : XML tutorials, code, and forums
            [url] => http://www.ibm.com/developerworks/xml/
            [description] => Dec 11, 2012 ... The XML section on the developerWorks Web site is your resource for XML-  related tools, samples, standards information, education, news and ...
            [domain] => www.ibm.com
        )

    [7] => Array
        (
            [keyword] => xml
            [position] => 7
            [title] => Microsoft XML Downloads - MSDN - Microsoft
            [url] => http://msdn.microsoft.com/en-us/data/bb190600.aspx
            [description] => Extensible Markup Language (XML): Library, learning resources, downloads,   support, and community. Evaluate and find out how to install, deploy, and   maintain ...
            [domain] => msdn.microsoft.com
        )

    [8] => Array
        (
            [keyword] => xml
            [position] => 8
            [title] => XML at The Apache Foundation
            [url] => http://xml.apache.org/
            [description] => Provides commercial-quality standards-based XML solutions for Java, C++ and   Perl that are developed in an open and cooperative fashion. Includes XML and ...
            [domain] => xml.apache.org
        )

    [9] => Array
        (
            [keyword] => xml
            [position] => 9
            [title] => XML Tutorial - Introduction
            [url] => http://www.tizag.com/xmlTutorial/
            [description] => Learn the basics of XML with Tizag.com's XML beginner tutorial.
            [domain] => www.tizag.com
        )

    [10] => Array
        (
            [keyword] => xml
            [position] => 10
            [title] => The XML FAQ
            [url] => http://xml.silmaril.ie/
            [description] => FAQs maintained by Peter Flynn, part of the W3C's XML special interest group.
            [domain] => xml.silmaril.ie
        )

)


PHP:
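class GoogleSearch {

    // The search page is assumed to return 10 results per page.
    const PER_PAGE = 10;

    public $keyword = "";

    // Scraped results, keyed by 1-based position.
    private $results = [];

    // Number of the last result page fetched so far.
    private $lastPage = 0;

    public function __construct($keyword) {
        $this->keyword = $keyword;
    }

    public function getResults($maxResults = 10) {

        // Only fetch pages that haven't been fetched yet, so repeat
        // calls reuse what is already cached in $this->results.
        while (count($this->results) < $maxResults) {
            $before = count($this->results);
            $this->getPage(++$this->lastPage);

            // Stop when a page adds nothing; otherwise this never ends.
            if (count($this->results) === $before) break;
        }

        // Trim to the requested size while keeping the position keys.
        return array_slice($this->results, 0, $maxResults, true);
    }

    // Fetch one page of results and append them to $this->results.
    private function getPage($num) {
        $start = ($num - 1) * self::PER_PAGE;
        $doc = new DOMQuery\Doc("http://www.optimum.net/Search?q=" . urlencode($this->keyword) . "&p=$num");

        $results = $doc->find("#websearch > div");
        $current = $start;

        foreach ($results as $result) {

            $current++;
            $link = $result->find("a");
            $url = $link->attr("href");

            $this->results[$current] = [
                "keyword" => $this->keyword,
                "position" => $current,
                "title" => $link->text(),
                "url" => $url,
                "description" => $result->find("div")->text(),
                "domain" => parse_url($url, PHP_URL_HOST)
            ];
        }
    }
}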
 


Oh yeah, I used that a few years ago. This is nothing like that. It compiles the selectors to XPath, so it's about as fast as pure XPath, but it wraps the DOM objects so you can use it just like jQuery.
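To give a rough idea of what "compiling selectors to XPath" means, here is a minimal sketch. It is not DOMQuery's actual compiler (that isn't shown in this thread); `css_to_xpath` is a made-up name, and it only handles tag names, `#id`, `.class`, and the descendant and child combinators. The result is run through PHP's built-in DOMXPath.

```php
<?php
// Hypothetical helper, NOT part of DOMQuery: translate a tiny subset of
// CSS selectors (tag, #id, .class, " " and ">" combinators) to XPath.
function css_to_xpath(string $selector): string
{
    // Split on combinators, keeping ">" as its own token.
    $tokens = preg_split('/\s*(>)\s*|\s+/', trim($selector), -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $xpath = '';
    $axis = '//';                 // descendant axis by default
    foreach ($tokens as $token) {
        if ($token === '>') {
            $axis = '/';          // child combinator applies to the next step
            continue;
        }
        if (!preg_match('/^([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)$/', $token, $m)) {
            throw new InvalidArgumentException("Unsupported selector part: $token");
        }
        $step = $m[1] !== '' ? $m[1] : '*';
        preg_match_all('/([#.])([\w-]+)/', $m[2], $parts, PREG_SET_ORDER);
        foreach ($parts as $part) {
            $step .= $part[1] === '#'
                ? "[@id='{$part[2]}']"
                : "[contains(concat(' ', normalize-space(@class), ' '), ' {$part[2]} ')]";
        }
        $xpath .= $axis . $step;
        $axis = '//';             // reset for the next step
    }
    return $xpath;
}

// Usage on an inline HTML snippet (made-up markup).
$html = '<div id="websearch">'
      . '<div><a href="http://example.com/">Example</a></div>'
      . '<div><a href="http://example.org/">Other</a></div>'
      . '</div>';

libxml_use_internal_errors(true);  // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

$query = css_to_xpath('#websearch > div');
$nodes = (new DOMXPath($dom))->query($query);

echo $query, "\n";          // //*[@id='websearch']/div
echo $nodes->length, "\n";  // 2
```

Once the selector is an XPath string, the lookup itself is just a normal DOMXPath query, which is where the speed comes from.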
 
Why is GoogleSearch a class? AFAICT, you're not using any code in multiple places, so it's not there to avoid repetition (DRY).

It's just one method calling another; this could be done in a lot less code, and more clearly, as a simple function IMO.
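Something along these lines, as a rough sketch only. It isn't the posted code: it uses PHP's built-in DOMDocument/DOMXPath instead of DOMQuery, takes the page HTML as a string so it runs without a network call, and assumes the same `#websearch > div` markup as the class above. `parse_search_results` is a made-up name.

```php
<?php
// Hypothetical function-only version of the result parsing step.
// Uses the built-in DOM extension rather than DOMQuery.
function parse_search_results(string $html, string $keyword, int $start = 0): array
{
    libxml_use_internal_errors(true);   // tolerate sloppy real-world HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $results = [];
    $position = $start;

    // XPath equivalent of the "#websearch > div" selector.
    foreach ($xpath->query("//*[@id='websearch']/div") as $row) {
        $link = $xpath->query('.//a', $row)->item(0);
        if ($link === null) {
            continue;                   // skip rows without a link
        }
        $desc = $xpath->query('.//div', $row)->item(0);
        $url = $link->getAttribute('href');

        $position++;
        $results[$position] = [
            'keyword'     => $keyword,
            'position'    => $position,
            'title'       => trim($link->textContent),
            'url'         => $url,
            'description' => $desc ? trim($desc->textContent) : '',
            'domain'      => parse_url($url, PHP_URL_HOST),
        ];
    }
    return $results;
}

// Usage with a made-up page containing a single result.
$html = '<div id="websearch">'
      . '<div><a href="http://www.w3.org/XML/">Extensible Markup Language (XML)</a>'
      . '<div>Main page for W3C XML activity.</div></div>'
      . '</div>';

$found = parse_search_results($html, 'xml');
print_r($found);
```

Paging would then be a plain loop that fetches each page and merges the returned arrays.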
 
GoogleSearch is just an example, but with the way I do things locally I have my reasons. There'd be nothing wrong with doing it that way; maybe it should be simpler just for the sake of the example.
 
I don't see any non-blocking multithreading... thanks for the post though.

Do you think working on this would be worth it? With the way PHP works for me, it's always about short-running scripts with multiple instances. Even if I did asynchronous HTTP requests with closures passed as callbacks, I don't think it would ever be beneficial as far as performance goes.

To release this I dumbed down the HTTP requests to a plain file_get_contents because I have to refactor some stuff, but locally I have something like jQuery's ajaxSetup: it sets headers, keeps persistent cookies in named instances, returns different types for JSON, HTML and XML responses, can POST, and can fill in and submit selected form elements and return the new page as a document object. You see where I'm going.
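A stripped-down sketch of two of those pieces, request headers and type-dependent return values, could look like this. The function names are made up for illustration and are not DOMQuery's API; the only real calls are the standard file_get_contents, stream_context_create, json_decode, simplexml_load_string and DOMDocument.

```php
<?php
// Hypothetical helpers; none of these names come from DOMQuery.

// Fetch a URL with custom request headers through file_get_contents.
function fetch_with_headers(string $url, array $headers = [])
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = "$name: $value";
    }
    $context = stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => implode("\r\n", $lines),
        ],
    ]);
    // Returns the body as a string, or false on failure.
    return file_get_contents($url, false, $context);
}

// Turn a response body into a value that suits its type.
function parse_response(string $body, string $type)
{
    switch ($type) {
        case 'json':
            return json_decode($body, true);        // associative array
        case 'xml':
            return simplexml_load_string($body);    // SimpleXMLElement
        case 'html':
            libxml_use_internal_errors(true);       // tolerate sloppy markup
            $dom = new DOMDocument();
            $dom->loadHTML($body);
            return $dom;                            // DOMDocument
        default:
            return $body;                           // raw string
    }
}

// Usage on inline bodies, so nothing here touches the network.
$json = parse_response('{"keyword": "xml", "position": 1}', 'json');
$xml  = parse_response('<result><title>The XML FAQ</title></result>', 'xml');
$html = parse_response('<p>hello</p>', 'html');

echo $json['keyword'], "\n";   // xml
echo $xml->title, "\n";        // The XML FAQ

// A real request would look like this (URL is a placeholder):
// $body = fetch_with_headers('http://example.com/', ['User-Agent' => 'my-scraper']);
```

Cookies, POST and form submission would need more state than this, which is probably where a class per named instance starts to pay off.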
 
Been there, done that.

https://github.com/mattseh/magicrequests :)

Also does async blah blah blah