#1 Don't use urllib. Use Requests (HTTP for Humans) instead; it's much more powerful, and you will have far fewer issues.
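To give a feel for it, fetching a page and getting properly decoded text back takes two lines (a minimal sketch, using the same BBC URL as the full example further down):
Code:
import requests

# One call fetches the page and decodes the body for you
r = requests.get('http://www.bbc.co.uk/news/')
print(r.status_code)   # e.g. 200
print(r.text[:100])    # first 100 characters of the decoded body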
#2 If you are going to iterate through a list, don't use a while loop with an incrementing index variable. Instead, use:
Code:
for url in urls:
    print(url)
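If you do also need a counter, enumerate gives you one without the manual bookkeeping (a minimal sketch):
Code:
for i, url in enumerate(urls):
    print('%d: %s' % (i, url))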
#3 Regular expressions are a terrible way of parsing HTML. Your regex will work in only the simplest of cases:
Code:
In [3]: import re

In [4]: re.compile('<script>(.*)</script>').findall('<script>hello</script>')
Out[4]: ['hello']
But that assumes, among other things, that the entire script is on one line and that the opening script tag has no attributes, unlike this one:
Code:
<script type="text/javascript">
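Run the same pattern against that kind of tag and it silently finds nothing:
Code:
In [5]: re.compile('<script>(.*)</script>').findall('<script type="text/javascript">hello</script>')
Out[5]: []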
If you used XPath, you could use the query //script/text() to grab the text of every script element, whatever its attributes and however many lines it spans. This avoids the previously mentioned regex issues.
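For instance, the same tag that broke the regex parses cleanly (a minimal sketch using lxml, which the full example below relies on as well):
Code:
from lxml import etree

dom = etree.HTML('<script type="text/javascript">hello</script>')
print(dom.xpath('//script/text()'))   # ['hello']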
Since you are not being lazy and have shown an example of your efforts, I'll show you how I'd do it:
Code:
import requests
from lxml import etree

urls = ['http://www.bbc.co.uk/news/']

for url in urls:
    r = requests.get(url)        # fetch the page
    dom = etree.HTML(r.text)     # parse the decoded HTML
    print(dom.xpath('//script/text()'))  # text of every script tag
You will need to install lxml and requests for this to work. There is no error checking here, so don't rely on it to download thousands of URLs.
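If you did want to run it over a long list, here is a minimal sketch of the kind of error handling I mean, assuming a 10-second timeout is acceptable for your use:
Code:
import requests
from lxml import etree

urls = ['http://www.bbc.co.uk/news/']

for url in urls:
    try:
        r = requests.get(url, timeout=10)  # don't hang forever on a dead server
        r.raise_for_status()               # raise on 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        print('failed to fetch %s: %s' % (url, e))
        continue
    dom = etree.HTML(r.text)
    print(dom.xpath('//script/text()'))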
Hope that helps.