[Python] My Spider Isn't Crawling

Hale.Pane

Oct 13, 2011
I'm trying to learn Python, and I'm trying to scrape the JavaScript from a test site. What am I missing?

I'm using the urllib and re modules.

HTML:
http://pastebin.com/BWJFU3TD
 


#1 Don't use urllib. Use Requests ("Requests: HTTP for Humans") instead; it's much more powerful, and you will have far fewer issues.
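For comparison, a minimal fetch with Requests looks like this (the URL here is just a placeholder):

Code:
import requests

r = requests.get('http://example.com/')  # one call handles the whole request
print r.status_code                      # e.g. 200
print r.text[:100]                       # first 100 characters of the body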

#2 If you are going to iterate through a list, don't use a while loop and an incrementing index variable. Instead, use:

Code:
for url in urls:
    print url
#3 Regular expressions are a terrible way of parsing HTML. Your regex will work in only the simplest of cases:

Code:
In [4]: re.compile('<script>(.*)</script>').findall('<script>hello</script>')
Out[4]: ['hello']
But that assumes, among other things, that the entire script is on one line and that the opening script tag has no attributes, unlike this one:

Code:
<script type="text/javascript">
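To illustrate (my example, continuing the session above), the same pattern silently matches nothing once the tag carries an attribute:

Code:
In [5]: re.compile('<script>(.*)</script>').findall('<script type="text/javascript">hello</script>')
Out[5]: []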
If you used XPath instead, you could write the query

Code:
//script/text()

This avoids the previously mentioned regex issues.
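A quick check with lxml (the same library used in the full example below; the inline HTML is my own) shows the XPath query copes with both attributes and multi-line scripts:

Code:
In [6]: from lxml import etree
In [7]: dom = etree.HTML('<script type="text/javascript">\nhello\n</script>')
In [8]: dom.xpath('//script/text()')
Out[8]: ['\nhello\n']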

Since you are not being lazy, and have shown an example of your efforts, I'll show you how I'd do it:


Code:
import requests
from lxml import etree

urls = ['http://www.bbc.co.uk/news/']

for url in urls:
    r = requests.get(url)               # fetch the page
    dom = etree.HTML(r.text)            # parse the HTML into a DOM tree
    print dom.xpath('//script/text()')  # text of every <script> element
You will need to install lxml and requests for this to work. It also has no error checking, so don't rely on it to download thousands of URLs.
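If you do want to run it over a long list, a sketch of the same loop with basic error handling might look like this (the timeout value and the messages are my own choices):

Code:
import requests
from lxml import etree

urls = ['http://www.bbc.co.uk/news/']

for url in urls:
    try:
        r = requests.get(url, timeout=10)  # don't hang forever on a dead host
        r.raise_for_status()               # raise on 4xx/5xx responses
    except requests.RequestException as e:
        print 'failed to fetch %s: %s' % (url, e)
        continue
    dom = etree.HTML(r.text)
    print dom.xpath('//script/text()')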


Hope that helps.
 
I used to write tutorials aimed at beginners.

Here's one that uses an easy Ruby crawler (Anemone) to scrape all the posts of a blog and save the contents to a database (MongoDB):

Scraping a blog with Anemone (Ruby web crawler) and MongoDB

It's actually a Ruby port of a Python tutorial that uses the Scrapy web crawler to do the exact same thing:

Crawl a website with scrapy - *.isBullsh.it

And here's some example code that uses Python and BeautifulSoup (an HTML parser) to scrape Google results:

http://stackoverflow.com/questions/4371655/extract-google-search-results/4372167#4372167

I was just learning how to code back then, but it works.
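The general shape of the BeautifulSoup approach, for anyone who doesn't want to click through (this is my sketch using bs4, not the code from that answer):

Code:
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = requests.get('http://example.com/').text
soup = BeautifulSoup(html, 'html.parser')  # parse into a navigable tree
for a in soup.find_all('a'):               # every <a> tag on the page
    print a.get('href')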
 
I've been enlightened again. It's a pity Python doesn't have an equivalent of the OWL API used in Java; I'm just starting in Python, like the OP, and I managed to create a simple OWL/RDF document crawler.
 
I read about OWL for a few minutes and don't get the point. Could you enlighten me please?
 
OWL is used to store ontologies. My final project at my university has to do with this format, which is based on XML; I had to make a crawler and a search engine (like swoogle.com) using the OWL API, which I was provided with and which allows you to, among other things, get specific properties from an ontology.

I created the crawler and the search engine (a very very basic one by the way) in Java. I'm not a fan of Java. In fact, I hate that language.

Later I started learning Python and fell in love with it. I developed the crawler in less than 50 lines, while in Java it took about 130 lines. I'm a Python noob, so my crawler is probably inefficient and badly designed, but I'm learning a lot.
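For anyone curious, rdflib is one library that can parse RDF/XML in Python, though only at the triple level, not as a full OWL API equivalent. A minimal sketch (the filename is a placeholder):

Code:
import rdflib

g = rdflib.Graph()
g.parse('ontology.owl', format='xml')  # load an RDF/XML (OWL) document
for subj, pred, obj in g:              # iterate over the parsed triples
    print subj, pred, obj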
Thanks for sharing your code, Matt; when I was doing my research I found your crawler on GitHub, and that's how I managed to understand how a crawler is implemented.

I was looking for an alternative to the OWL API and found out you can use Jython and all that, but I'm not experienced enough, so I'll pass on Python this time.