Free Online URL Scraper (Need Feedback!)



It looks pretty. I noticed a couple of alignment issues with the tabs as you click them. Some show varying degrees of blue at the edges when the black overlay sits on top of them.

I'm not really sure how it is scraping URLs. Is it based on PR, SE rank, etc.? Even then, I am noticing different results. Is it based on country? I have several results that are from the Philippines. That wouldn't be targeted toward the results I would need.

I am guessing it is mainly to help people find sites to pull keywords from? It would be nice to be able to choose which country you want to target for the search engine.
 
Thanks for checking it out.

It's scraping the URLs using Google.com, and it's not currently technically possible to make it country-specific with the method I'm using, although you can always add "site:.nl" or whatever you want to the query to target it.

It's good for grabbing lists of pligg sites or whatever to feed other tools - I use it myself quite a lot now.
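
For example, the site: filter just goes straight into the query text - roughly like this (a sketch only; the keyword and TLD are made up and this isn't gScrape's actual code):

```ts
// Rough sketch only - not gScrape's code. The keyword and TLD are examples.
// The "site:" filter simply becomes part of the query text itself.
function buildQuery(keyword: string, tld?: string): string {
  const q = tld ? `${keyword} site:.${tld}` : keyword;
  return `https://www.google.com/search?q=${encodeURIComponent(q)}`;
}

console.log(buildQuery("powered by pligg", "nl"));
// https://www.google.com/search?q=powered%20by%20pligg%20site%3A.nl
```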
 
Heh, requiring FF 3.5 is really strict ;) Sadly I have 3.0.
If you want an audience on your site you really should try to make it compatible with at least 3.0. Many people have even older releases!

Wondering how you solved the Google ban hammer?
In my experience Google bans pretty fast if you scrape them (I used 150 proxies and the problem was gone, but that's not affordable for most).

Dunno if that works for you, but you can target specific languages by adding "&lr=" as a parameter to the search string.
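
Something like this, roughly (a sketch - lang_nl is just an example value for the lr parameter):

```ts
// Sketch: add Google's "lr" (language restrict) parameter to an existing
// search URL. lang_nl is just an example value (Dutch).
function withLanguage(searchUrl: string, lang: string): string {
  const url = new URL(searchUrl);
  url.searchParams.set("lr", `lang_${lang}`);
  return url.toString();
}

console.log(withLanguage("https://www.google.com/search?q=apple+pie", "nl"));
// https://www.google.com/search?q=apple+pie&lr=lang_nl
```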
 
Very nice tool - so far the results have been very good. I am signing up, and it is definitely not limited to finding a bunch of pligg sites ;)
 
Heh, requiring FF 3.5 is really strict ;) Sadly I have 3.0.
If you want an audience on your site you really should try to make it compatible with at least 3.0. Many people have even older releases!

Wondering how you solved the Google ban hammer?
In my experience Google bans pretty fast if you scrape them (I used 150 proxies and the problem was gone, but that's not affordable for most).

Dunno if that works for you, but you can target specific languages by adding "&lr=" as a parameter to the search string.

It has to be FF 3.5 for the implementation I've used, which answers your second question: it doesn't get banned :)

Also, I can't add URL variables to target by country because of the specific methods involved.


You should make people sign up to use it rather than having registration as an option - that way you build your list.

Yeah, I thought about that, but decided to just let anyone scrape 50 URLs immediately so they can see if the tool is any good - if they like it they're probably gonna subscribe anyway.


Very nice tool - so far the results have been very good. I am signing up, and it is definitely not limited to finding a bunch of pligg sites ;)

Awesome - glad you like it ;)


<edit>
@monkeyman: Yeah, I would normally use multi cURL or pipelining or something to suck the goodness out of Google, but I thought it would be less of a headache for a free public tool to take a different approach ;)
</edit>
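
For anyone curious, the multi-request idea looks roughly like this on the server side (sketched here with Node's fetch rather than multi cURL - not gScrape's actual approach, and Google throttles unproxied bulk requests like this pretty fast):

```ts
// Sketch of the "many requests in parallel" idea (what multi cURL does in
// PHP), written with Node 18+ fetch. Not gScrape's actual code - and Google
// will start blocking unproxied bulk requests like this fairly quickly.
async function fetchResultPages(query: string, pages = 3): Promise<string[]> {
  const base = `https://www.google.com/search?q=${encodeURIComponent(query)}`;
  const urls = Array.from({ length: pages }, (_, i) => `${base}&start=${i * 10}`);
  // Fire every request at once and collect the raw HTML of each result page.
  return Promise.all(urls.map(u => fetch(u).then(r => r.text())));
}
```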
 
I just noticed that if you have the NoScript plugin in FF the site will not function correctly.

To use the site you must add an exception for gscrape.

Just thought I should let you know.
 
@turbo: weird. What version of FF are you using?

@phrench: I'm collecting them in a weird way, but also I'm appending random keywords with each grab to extract more URLs overall from G, so they will be different from a normal search.

Is this what you would ideally want?

If you just search for, say, "apple pie", Google will only return 1k results max, but if you search for "apple pie website", then "apple pie dog", then "apple pie england", you can get pretty much all the URLs after a bunch of searches.
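
Roughly, in code (a sketch only - the modifier words and the crude regex link extraction are just placeholders, not how the tool actually does it):

```ts
// Sketch of the "append random words" trick: each modifier changes the
// result set, so the combined, de-duplicated list grows well past the
// ~1k URLs a single query will ever return.
const modifiers = ["website", "dog", "england"];

// Very rough link extraction - a real parser would be better.
function extractUrls(html: string): string[] {
  return [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)].map(m => m[1]);
}

async function harvest(baseQuery: string): Promise<string[]> {
  const seen = new Set<string>();
  for (const word of modifiers) {
    const q = encodeURIComponent(`${baseQuery} ${word}`);
    const html = await fetch(`https://www.google.com/search?q=${q}`).then(r => r.text());
    for (const u of extractUrls(html)) seen.add(u);
  }
  return [...seen];
}
```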

Lemme know how you want it to work.
 
Looks like a good way to see what your competitors are researching :D

The aim's to build a list, not spy on people :)

Whip out Live HTTP Headers in FF and check what's going on - after the initial page loads there's no talking with my server.

All the scraping is done client side so it's totally private.
 
The aim's to build a list, not spy on people :)

Whip out Live HTTP Headers in FF and check what's going on - after the initial page loads there's no talking with my server.

All the scraping is done client side so it's totally private.


Ok, so I open the initial gScrape page ("Harvest Unlimited SE URLs online - FREE"). Funny how it thought the latest version of Safari on Snow Leopard was an "old browser" - it should probably just say 'incompatible browser'.

Why is it that when I click 'go', it contacts gscrape.com? But you said that after the initial page loads there's no talking with your server... (running in VMware behind Little Snitch, Little Snitch flags a connection to gscrape right after clicking 'go').
 
@kblessinggr:

You need to learn how to use the tools you've got :)

Try Live HTTP Headers in FF - it's probably easier to see there.
It doesn't contact the server with any of your data.

It does (of course) fetch the page images from the server, which is probably what you're seeing.

Read my earlier posts and you'll see my explanation of why I append random words...

I've made a note to change the error message to 'incompatible browser'. There's also some more in-depth browser version checking I need to do, as I think it's blocking some OK ones - and I know that with some work IE8 can run it.
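
Something like this for the version check, roughly (a sketch - the user-agent strings are just examples):

```ts
// Sketch of a slightly smarter check than "old browser": pull the Firefox
// version out of the user-agent string and compare it against the 3.5
// minimum, falling back to "incompatible browser" for everything else.
function isSupportedFirefox(ua: string): boolean {
  const m = ua.match(/Firefox\/(\d+)\.(\d+)/);
  if (!m) return false; // not Firefox at all
  const major = Number(m[1]);
  const minor = Number(m[2]);
  return major > 3 || (major === 3 && minor >= 5);
}

console.log(isSupportedFirefox("Mozilla/5.0 ... Firefox/3.5")); // true
console.log(isSupportedFirefox("Mozilla/5.0 ... Firefox/3.0")); // false
```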