The Botting Q&A Thread: Ask Away



Code:
finished = 0 # define it before the block so the outer scope can see it

@hydra.on_complete do
  finished = 1
end

@hydra.run
sleep(1) until finished == 1

Finally found a way to do it.

Now I can make the script wait for a bunch of URLs to finish before continuing. Ideally I'd be able to increase the queue on the fly, but I couldn't find a way to make that work.
 

I was going to say - isn't Hydra made for async requests? That's why it's not really waiting for your shit to complete, heh.
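For what it's worth, assuming this is Typhoeus's Hydra, the blocking already happens in hydra.run itself: it returns once everything queued has finished, and you can queue more requests from inside an on_complete callback, which is one way to grow the queue on the fly. A rough sketch only (the URLs are made up, and option names may differ slightly between Typhoeus versions):

Code:
require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)

# made-up example URLs
urls = %w[http://example.com/a http://example.com/b]

urls.each do |url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    puts "#{url} -> #{response.code}"
    # Queueing another Typhoeus::Request here (hydra.queue ...) gets it
    # picked up by the same run, so the queue can grow on the fly.
  end
  hydra.queue(request)
end

hydra.run # blocks until every queued request has completed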
 
What is the best language to write a bot in? I'm talking about efficiency in writing the code and producing a final program. The only experience I have is with C#, but I'll gladly move to a language that lets me make bots more easily. These are bots that will fill out online giveaways and the like.

C# + HtmlUnit / Selenium is amazing. And there's one name for hosting that comes to mind: Windows Azure :)
 
This might be more of a Ruby question than a botting question, but I already promised to bump this thread two weeks ago. Mechanize is next :)

Alright, so I'm trying to count the total outbound links any given page has, starting with a Wikipedia page.

Here's what I've got so far:
Code:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
require 'restclient'

# Collect info
puts "What is your URL?"
url = gets.chomp
puts "Your URL is #{url}"

# Check keyword count
page = Nokogiri::HTML(RestClient.get(url))
link_total = page.css('a')
link_count = link_total.count

# set domain footprint, arrays
outbound_links = []
domain_footprint = "/wiki/"

link_total.each do |links|
 if links.include? domain_footprint == false
	outbound_links << links
	end
end

obl_count = outbound_links.count
puts "Your site has a total of #{link_count} links, with #{obl_count} outbound links."

It's counting the links, but it's telling me there are 0 outbound links. I think this has something to do with my loop and how I'm trying to add the URLs not containing the /wiki/ footprint to an array. Any tips?
 

Dude, this is so bad ass. Good job!

I've been going through LRTHW the past week and a half or so nice and slow. I'm on exercise 26, the first test!

I was just thinking as I read your code, there has to be some way to pull the domain footprint and save it generically so you can generalize this code for any website.
 

You would use this gem, https://github.com/pauldix/domainatrix, to parse the URL down to the root domain, and then you can check whether each URL is an onsite or offsite link. We do that at serpIQ and also in my Arachnid gem (link to relevant code: https://github.com/dchuk/Arachnid/blob/master/lib/arachnid.rb#L93)
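Something along these lines is the idea, building on the approach above. This is only a rough sketch: it assumes Domainatrix's parse(...).domain call mentioned here, uses a made-up starting URL, and resolves relative hrefs against the page URL with URI.join before parsing them:

Code:
require 'uri'
require 'nokogiri'
require 'restclient'
require 'domainatrix'

url = 'http://en.wikipedia.org/wiki/Ruby_(programming_language)' # example page
root_domain = Domainatrix.parse(url).domain

page = Nokogiri::HTML(RestClient.get(url))

internal = []
outbound = []

page.css('a').each do |link|
  href = link['href'].to_s
  next if href.empty? || href.start_with?('#', 'mailto:', 'javascript:')

  begin
    absolute    = URI.join(url, href).to_s # resolve relative hrefs like /wiki/Foo
    link_domain = Domainatrix.parse(absolute).domain
  rescue StandardError
    next # skip anything malformed rather than crash
  end

  if link_domain == root_domain
    internal << absolute
  else
    outbound << absolute
  end
end

puts "#{internal.count} internal links, #{outbound.count} outbound links"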
 
It's probably this line: if links.include? domain_footprint == false
Try this instead: if (links.include?(domain_footprint) == false)

Note: I like using ( and ) in my Ruby code, even though it's not idiomatic. It's easier for me to parse mentally.
I switched it out. Instead of the "page has 240 links, 0 outbound", it's giving "page has 240 links, 240 outbound".

I've tried changing the footprint, adding an else statement that adds the non-footprint links to an array, printing link_total to make sure it was scraping ok... no luck.

Anyways, I'm done toying with it for today. Tomorrow I'll give it another look before checking out Mechanize for the first time. Appreciate the help man.

You'll be writing cooler things soon :) I've been trying to come up with small projects like this which I can get some use out of. LRTHW was the tutorial I feel I got the most out of.
 

It's possible that the links in your array aren't strings but XML objects or something. Try doing something like "links.to_s.include?" and see if that works.
 
I don't know Ruby so I can't help debug the code, but one of the best debugging methods is to print the results of everything, then see what doesn't match your expected results.
 
Bofu asked it nicely and it fixed itself:

Code:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
require 'restclient'

# Collect info
puts "What is your URL?"
url = gets.chomp
puts "Your URL is #{url}"
puts "What is your domain footprint?"
domain_footprint = gets.chomp

# Fetch the page and grab every link
page = Nokogiri::HTML(RestClient.get(url))
link_total = page.css('a')
link_count = link_total.count

# Sort the links: hrefs containing the footprint are internal, the rest are outbound
outbound_links = []
internal_links = []

link_total.each do |link|
  href = link['href'].to_s
  puts "Checking #{href} against #{domain_footprint}"
  if href.include? domain_footprint
    puts "It does have it in here."
    internal_links << link
  else
    puts "Doesn't have it in here."
    outbound_links << link
  end
end

obl_count = outbound_links.count
itl_count = internal_links.count

puts "This page has a total of #{link_count} links, with #{obl_count} outbound links and #{itl_count} internal links."

Noted, ty

Thanks bros. On to the next one.
 
Check out Anemone (Ruby web crawler). Awesome DSL. http://anemone.rubyforge.org/information-and-examples.html

I wrote this quickly after seeing some attempts in this thread. It doesn't handle any exceptions from things like RSS feeds or malformed URLs (like http// without a colon), but it gives you the idea.

Code:
require 'anemone'
require 'domainatrix'

ROOT = 'http://isbullsh.it'
ROOT_DOMAIN = Domainatrix.parse(ROOT).domain
# separate arrays (a chained assignment would make them all the same object)
outbound_links = []
internal_links = []
hash_links     = []
PATTERNS_TO_SKIP = Regexp.union /#/, /can add a bunch of regexp expressions of urls to not crawl/, /this will merge them into one pattern/

Anemone.crawl(ROOT) do |anemone|
  
  anemone.skip_links_like PATTERNS_TO_SKIP

  anemone.on_every_page do |page|

    # for every href on the page (compact drops anchors with no href)
    page.doc.css("a").collect { |anchor| anchor['href'] }.compact.each do |href|
      absolute = page.to_absolute(href)

      if href[0] === '#'
        hash_links << href
        next
      end

      # or instead do something like `absolute = URI.parse(ROOT).merge(URI.parse(href))`
      # and then `if Domainatrix.parse(absolute).domain == ROOT_DOMAIN`
      if page.in_domain? absolute
        internal_links << href
      else
        outbound_links << href
      end
    end

    puts %Q{

      PAGE:           #{page.url}
      hash_links:     #{hash_links.size}
      internal_links: #{internal_links.size}
      outbound_links: #{outbound_links.size}
    }
  end
end

For an actual working example, here's a crawler that only crawls blog posts (/YYYY/MM/here-is-a-title/) and pages (/page/4), then inserts each blog post's title and tag into MongoDB:

Code:
require 'anemone'
require 'mongo'

# Patterns
POST_WITHOUT_SLASH  = %r[\d{4}\/\d{2}\/[^\/]+$]   # http://isbullsh.it/2012/66/here-is-a-title  (301 redirects to slash)
POST_WITH_SLASH     = %r[\d{4}\/\d{2}\/[\w-]+\/$] # http://isbullsh.it/2012/66/here-is-a-title/
ANY_POST            = Regexp.union POST_WITHOUT_SLASH, POST_WITH_SLASH 
ANY_PAGE            = %r[page\/\d+]               # http://isbullsh.it/page/4 
ANY_PATTERN         = Regexp.union ANY_PAGE, ANY_POST

# MongoDB
db = Mongo::Connection.new.db("scraped")
posts_collection = db["posts"]

Anemone.crawl("http://isbullsh.it") do |anemone|
  
  anemone.focus_crawl do |page| 
    page.links.keep_if { |link| link.to_s.match(ANY_PATTERN) } # crawl only links that are pages or blog posts
  end

  anemone.on_pages_like(POST_WITH_SLASH) do |page|
    title = page.doc.at_xpath("//div[@role='main']/header/h1").text rescue nil
    tag = page.doc.at_xpath("//header/div[@class='post-data']/p/a").text rescue nil

    if title and tag
      post = {title: title, tag: tag}
      puts "Inserting #{post.inspect}"
      posts_collection.insert post
    end
  end
end
 
Is it normal to get a few of these errors with https URLs?

Code:
C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `connect': SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed (OpenSSL::SSL::SSLError)
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `block in connect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/timeout.rb:44:in `timeout'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/timeout.rb:89:in `timeout'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent/ssl_reuse.rb:70:in `connect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/net/http.rb:637:in `do_start'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/1.9.1/net/http.rb:632:in `start'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent.rb:511:in `connection_for'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.6/lib/net/http/persistent.rb:806:in `request'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:251:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:918:in `response_redirect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:286:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:918:in `response_redirect'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize/http/agent.rb:286:in `fetch'
        from C:/RailsInstaller/Ruby1.9.2/lib/ruby/gems/1.9.1/gems/mechanize-2.4/lib/mechanize.rb:407:in `get'
        from visit.rb:7:in `block in <main>'
        from visit.rb:5:in `each_line'
        from visit.rb:5:in `<main>'
 

This is like when Firefox makes you manually verify a cert to view a page. Depending on your use case, you may want to disable cert checking. You may also want to look into catching exceptions, especially ones thrown by 404s and the like (I don't know Ruby; I'm extrapolating from Python experience).
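In Mechanize that would look something like the sketch below. The verify_mode setting is what I'd reach for (I believe it's an agent option in Mechanize 2.x), and the rescue clauses show the exception-catching idea; the URL is just a placeholder. Turning verification off obviously means you stop caring whether the cert is genuine:

Code:
require 'mechanize'
require 'openssl'

agent = Mechanize.new do |a|
  a.verify_mode = OpenSSL::SSL::VERIFY_NONE # skip certificate verification
end

begin
  page = agent.get('https://example.com/') # placeholder URL
  puts page.title
rescue OpenSSL::SSL::SSLError => e
  puts "SSL error: #{e.message}"
rescue Mechanize::ResponseCodeError => e
  puts "HTTP error: #{e.response_code}" # e.g. 404s
end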
 

Try here: ruby on rails - SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed - Stack Overflow. It has something to do with Ruby 1.9.
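The other route, rather than disabling verification, is to give Ruby a CA bundle to verify against, since the Windows 1.9 installs don't ship with one. A sketch, assuming Mechanize's ca_file option and a cacert.pem you've downloaded yourself (the path below is made up):

Code:
require 'mechanize'

agent = Mechanize.new do |a|
  # Hypothetical path: download a CA bundle (e.g. curl's cacert.pem) and point at it
  a.ca_file = 'C:/certs/cacert.pem'
end

page = agent.get('https://example.com/') # placeholder URL
puts page.title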