The Botting Q&A Thread: Ask Away



Code:
finished = 0 # define it before the block so the outer scope can see it

@hydra.on_complete do
  finished = 1
end

@hydra.run
sleep(1) until finished == 1

Finally found a way to do it.

Now I can make the script wait for a bunch of URLs to finish before continuing. Ideally I'd be able to increase the queue on the fly, but I couldn't find a way to make that work.
 

I was going to say - isn't Hydra made for async requests? That's why it's not really waiting for your shit to complete, heh.
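For what it's worth, assuming this is Typhoeus's Hydra, the blocking already happens in hydra.run itself: it returns once everything queued has finished, and you can queue more requests from inside an on_complete callback, which is one way to grow the queue on the fly. A rough sketch only (the URLs are made up, and option names may differ slightly between Typhoeus versions):

Code:
require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)

# made-up example URLs
urls = %w[http://example.com/a http://example.com/b]

urls.each do |url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    puts "#{url} -> #{response.code}"
    # Queueing another Typhoeus::Request here (hydra.queue ...) gets it
    # picked up by the same run, so the queue can grow on the fly.
  end
  hydra.queue(request)
end

hydra.run # blocks until every queued request has completed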
 
What is the best language to write a bot in? I'm talking about efficiency in writing the code and producing a final program. The only experience I have is with C#, but I'll gladly move to a language that lets me make bots more easily. These are bots that will fill out online giveaways and the like.

C# + HtmlUnit / Selenium is amazing. And there's one name for hosting that comes to mind: Windows Azure :)
 
This might be more of a Ruby question than a botting question, but I already promised to bump this thread two weeks ago. Mechanize is next :)

Alright, so I'm trying to count the total outbound links any given page has, starting with a Wikipedia page.

Here's what I've got so far:
Code:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
require 'restclient'

# Collect info
puts "What is your URL?"
url = gets.chomp
puts "Your URL is #{url}"

# Check keyword count
page = Nokogiri::HTML(RestClient.get(url))
link_total = page.css('a')
link_count = link_total.count

# set domain footprint, arrays
outbound_links = []
domain_footprint = "/wiki/"

link_total.each do |links|
 if links.include? domain_footprint == false
	outbound_links << links
	end
end

obl_count = outbound_links.count
puts "Your site has a total of #{link_count} links, with #{obl_count} outbound links."

It's counting the links, but it's telling me there are 0 outbound links. I think this has something to do with my loop and how I'm trying to add the URLs not containing the /wiki/ footprint to an array. Any tips?
 

Dude, this is so bad ass. Good job!

I've been going through LRTHW the past week and a half or so nice and slow. I'm on exercise 26, the first test!

I was just thinking as I read your code, there has to be some way to pull the domain footprint and save it generically so you can generalize this code for any website.
 

You would use this gem, https://github.com/pauldix/domainatrix, to parse the URL down to the root domain, and then you can check whether each URL is an onsite or offsite link. We do that at serpIQ and also in my Arachnid gem (link to relevant code: https://github.com/dchuk/Arachnid/blob/master/lib/arachnid.rb#L93)
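Something along these lines is the idea, building on the approach above. This is only a rough sketch: it assumes Domainatrix's parse(...).domain call mentioned here, uses a made-up starting URL, and resolves relative hrefs against the page URL with URI.join before parsing them:

Code:
require 'uri'
require 'nokogiri'
require 'restclient'
require 'domainatrix'

url = 'http://en.wikipedia.org/wiki/Ruby_(programming_language)' # example page
root_domain = Domainatrix.parse(url).domain

page = Nokogiri::HTML(RestClient.get(url))

internal = []
outbound = []

page.css('a').each do |link|
  href = link['href'].to_s
  next if href.empty? || href.start_with?('#', 'mailto:', 'javascript:')

  begin
    absolute    = URI.join(url, href).to_s # resolve relative hrefs like /wiki/Foo
    link_domain = Domainatrix.parse(absolute).domain
  rescue StandardError
    next # skip anything malformed rather than crash
  end

  if link_domain == root_domain
    internal << absolute
  else
    outbound << absolute
  end
end

puts "#{internal.count} internal links, #{outbound.count} outbound links"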
 
It's probably this line: if links.include? domain_footprint == false
Try this instead: if (links.include?(domain_footprint) == false)

Note: I like using ( and ) in my Ruby code, even though it's not idiomatic. It's easier for me to parse mentally.
I switched it out. Instead of the "page has 240 links, 0 outbound", it's giving "page has 240 links, 240 outbound".

I've tried changing the footprint, adding an else statement that adds the non-footprint links to an array, printing link_total to make sure it was scraping ok... no luck.

Anyways, I'm done toying with it for today. Tomorrow I'll give it another look before checking out Mechanize for the first time. Appreciate the help man.

You'll be writing cooler things soon :) I've been trying to come up with small projects like this which I can get some use out of. LRTHW was the tutorial I feel I got the most out of.
 

It's possible that the links in your array aren't strings but XML objects or something. Try doing something like "links.to_s.include?" and see if that works.
 
I don't know Ruby so I can't help debug the code, but one of the best debugging methods is to print the results of everything, then see what doesn't match your expected results.
 
Bofu asked it nicely and it fixed itself:

Code:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
require 'restclient'

# Collect info
puts "What is your URL?"
url = gets.chomp
puts "Your URL is #{url}"
puts "What is your domain footprint?"
domain_footprint = gets.chomp

# Fetch the page and grab every link
page = Nokogiri::HTML(RestClient.get(url))
link_total = page.css('a')
link_count = link_total.count

# Sort the links: hrefs containing the footprint are internal, the rest are outbound
outbound_links = []
internal_links = []

link_total.each do |link|
  href = link['href'].to_s
  puts "Checking #{href} against #{domain_footprint}"
  if href.include? domain_footprint
    puts "It does have it in here."
    internal_links << link
  else
    puts "Doesn't have it in here."
    outbound_links << link
  end
end

obl_count = outbound_links.count
itl_count = internal_links.count

puts "This page has a total of #{link_count} links, with #{obl_count} outbound links and #{itl_count} internal links."

Noted, ty

Thanks bros. On to the next one.
 
Check out Anemone (Ruby web crawler). Awesome DSL. http://anemone.rubyforge.org/information-and-examples.html

I wrote this quickly after seeing some attempts in this thread. It doesn't handle any exceptions from things like RSS feeds or malformed URLs (like http// without a colon), but it gives you the idea.

Code:
require 'anemone'
require 'domainatrix'

ROOT = 'http://isbullsh.it'
ROOT_DOMAIN = Domainatrix.parse(ROOT).domain
# separate arrays (a chained assignment would make them all the same object)
outbound_links = []
internal_links = []
hash_links     = []
PATTERNS_TO_SKIP = Regexp.union /#/, /can add a bunch of regexp expressions of urls to not crawl/, /this will merge them into one pattern/

Anemone.crawl(ROOT) do |anemone|
  
  anemone.skip_links_like PATTERNS_TO_SKIP

  anemone.on_every_page do |page|

    # for every href on the page (compact drops anchors with no href)
    page.doc.css("a").collect { |anchor| anchor['href'] }.compact.each do |href|
      absolute = page.to_absolute(href)

      if href[0] === '#'
        hash_links << href
        next
      end

      # or instead do something like `absolute = URI.parse(ROOT).merge(URI.parse(href))`
      # and then `if Domainatrix.parse(absolute).domain == ROOT_DOMAIN`
      if page.in_domain? absolute
        internal_links << href
      else
        outbound_links << href
      end
    end

    puts %Q{

      PAGE:           #{page.url}
      hash_links:     #{hash_links.size}
      internal_links: #{internal_links.size}
      outbound_links: #{outbound_links.size}
    }
  end
end

For an actual working example, here's a crawler that only crawls blog posts (/YYYY/MM/here-is-a-title/) and pages (/page/4), then inserts each blog post's title and tag into MongoDB:

Code:
require 'anemone'
require 'mongo'

# Patterns
POST_WITHOUT_SLASH  = %r[\d{4}\/\d{2}\/[^\/]+$]   # http://isbullsh.it/2012/66/here-is-a-title  (301 redirects to slash)
POST_WITH_SLASH     = %r[\d{4}\/\d{2}\/[\w-]+\/$] # http://isbullsh.it/2012/66/here-is-a-title/
ANY_POST            = Regexp.union POST_WITHOUT_SLASH, POST_WITH_SLASH 
ANY_PAGE            = %r[page\/\d+]               # http://isbullsh.it/page/4 
ANY_PATTERN         = Regexp.union ANY_PAGE, ANY_POST

# MongoDB
db = Mongo::Connection.new.db("scraped")
posts_collection = db["posts"]

Anemone.crawl("http://isbullsh.it") do |anemone|
  
  anemone.focus_crawl do |page| 
    page.links.keep_if { |link| link.to_s.match(ANY_PATTERN) } # crawl only links that are pages or blog posts
  end

  anemone.on_pages_like(POST_WITH_SLASH) do |page|
    title = page.doc.at_xpath("//div[@role='main']/header/h1").text rescue nil
    tag = page.doc.at_xpath("//header/div[@class='post-data']/p/a").text rescue nil

    if title and tag
      post = {title: title, tag: tag}
      puts "Inserting #{post.inspect}"
      posts_collection.insert post
    end
  end
end
 
Is it normal to get a few of these errors with https URLs?

Code:
C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `connect': SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed (OpenSSL::SSL::SSLError)
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `block in connect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/timeout.rb:44:in `timeout'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/timeout.rb:89:in `timeout'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/net/http.rb:637:in `do_start'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/net/http.rb:632:in `start'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent.rb:511:in `connection_for'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent.rb:806:in `request'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:251:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:918:in `response_redirect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:286:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:918:in `response_redirect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:286:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize.rb:407:in `get'
        from visit.rb:7:in `block in <main>'
        from visit.rb:5:in `each_line'
        from visit.rb:5:in `<main>'
 

This is like when Firefox makes you manually verify a cert to view a page. Depending on your use case, you may want to disable cert checking. You may also want to look into catching exceptions, especially ones thrown by 404s and the like (I don't know Ruby; I'm extrapolating from Python experience).
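In Mechanize that would look something like the sketch below. The verify_mode setting is what I'd reach for (I believe it's an agent option in Mechanize 2.x), and the rescue clauses show the exception-catching idea; the URL is just a placeholder. Turning verification off obviously means you stop caring whether the cert is genuine:

Code:
require 'mechanize'
require 'openssl'

agent = Mechanize.new do |a|
  a.verify_mode = OpenSSL::SSL::VERIFY_NONE # skip certificate verification
end

begin
  page = agent.get('https://example.com/') # placeholder URL
  puts page.title
rescue OpenSSL::SSL::SSLError => e
  puts "SSL error: #{e.message}"
rescue Mechanize::ResponseCodeError => e
  puts "HTTP error: #{e.response_code}" # e.g. 404s
end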
 

Try here: ruby on rails - SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed - Stack Overflow. It has something to do with Ruby 1.9.
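The other route, rather than disabling verification, is to give Ruby a CA bundle to verify against, since the Windows 1.9 installs don't ship with one. A sketch, assuming Mechanize's ca_file option and a cacert.pem you've downloaded yourself (the path below is made up):

Code:
require 'mechanize'

agent = Mechanize.new do |a|
  # Hypothetical path: download a CA bundle (e.g. curl's cacert.pem) and point at it
  a.ca_file = 'C:/certs/cacert.pem'
end

page = agent.get('https://example.com/') # placeholder URL
puts page.title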