
Scraping Google Search Results With Hpricot

06.12.2007
# snagged from http://g-module.rubyforge.org/

require 'rubygems'
require 'cgi'
require 'open-uri'
require 'hpricot'

# URL-escape each word of the query and join with "+"
q = %w{my little search query}.map { |w| CGI.escape(w) }.join("+")
url = "http://www.google.com/search?q=#{q}"

# Fetch the results page and parse it with Hpricot
doc = Hpricot(open(url).read)

# The first result link sits inside a div with class "g"
lucky_url = (doc/"div[@class='g'] a").first["href"]

# Open the "I'm feeling lucky" result in the default browser (OS X `open`).
# Note: double quotes are needed here so #{lucky_url} is interpolated.
system "open #{lucky_url}"
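The query-building step above can be exercised on its own, without hitting Google at all. The sketch below (a hypothetical helper name, `google_query`, not part of the original snippet) shows how `CGI.escape` percent-encodes each word before the words are joined with `+`:

```ruby
require 'cgi'

# Build a Google-style query string: escape each word, then join with "+".
# Assumes the same %w{...} word-list input style as the snippet above.
def google_query(words)
  words.map { |w| CGI.escape(w) }.join("+")
end

# Spaces inside a single word become "+", special characters are
# percent-encoded, and the words are joined with literal "+" signs.
puts google_query(%w{ruby web scraping})   # => "ruby+web+scraping"
puts google_query(["c++", "faq"])          # => "c%2B%2B+faq"
```

This is why the snippet escapes each word individually before joining: escaping the already-joined string would also encode the `+` separators.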
    

Comments

Peter Szinek replied on Wed, 2007/06/13 - 1:10am

Here is a detailed tutorial on Google scraping: http://scrubyt.org/scrapin-google-in-no-sec/

Peter Szinek replied on Wed, 2007/06/13 - 1:10am

The same code with scRUBYt! (http://scrubyt.org) - this one also crawls to the next page, yielding 20 results.

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/ncr'
  fill_textfield 'q', 'ruby'
  submit
  link "Ruby Programming Language/@href"
  next_page "Next", :limit => 2
end

puts google_data.to_xml

Result:

http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html
http://en.wikipedia.org/wiki/Ruby_programming_language
http://en.wikipedia.org/wiki/Ruby
http://www.rubyonrails.org/
http://www.rubycentral.com/
http://www.rubycentral.com/book/
http://www.w3.org/TR/ruby/
http://www.zenspider.com/Languages/Ruby/QuickRef.html
http://poignantguide.net/
http://www.rubynz.com/
http://www.ruby-doc.org/
http://tryruby.hobix.com/
http://www.rubycentral.org/
http://www.gemstone.org/ruby.html
http://whytheluckystiff.net/ruby/pickaxe/
http://intertwingly.net/blog/
http://lotusmedia.org/
http://rubyforge.org/frs/?group_id=167
http://www.oreillynet.com/ruby/

For those who think this is not robust (it isn't, indeed: if you change the search query, it breaks), scRUBYt! is able to export a production extractor:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch("http://www.google.com/ncr")
  fill_textfield("q", "anything else")
  submit
  link "/html/body/div/div/div/a"
  next_page "Next", :limit => 2
end

puts google_data.to_xml