Ruby Web Crawler
NB: Again, I found this script in my old source code folder, so it may not be fully working.
This Ruby script reads a list of links from links.dat, picks out the ones it can easily spider (plain http:// URLs), and collects the URLs found on each of those pages. Every new URL it finds is added to newlinks.dat for later spidering by another bot running alongside this one.
require 'socket'

# Read the seed URLs, one per line, from links.dat.
File.foreach("links.dat") do |line|
  # Only plain http:// URLs are spidered; anything else is skipped.
  next unless %r{\Ahttp://([^/\s]+)(/\S*)?}i =~ line.strip
  domain, path = $1, $2 || "/"

  begin
    t = TCPSocket.new(domain, 80)
    # HTTP/1.0 request; lines end in CRLF and Host is sent for virtual hosts.
    t.print "GET #{path} HTTP/1.0\r\nHost: #{domain}\r\n\r\n"
    answer = t.read
    t.close
  rescue StandardError => e
    puts "error: #{e}"
    next
  end

  # Collect every absolute link on the page, not just the first match.
  found = answer.scan(%r{<a\s+href="(\w+)://([^"]+)"}i)
                .map { |proto, url| "#{proto}://#{url}" }.uniq
  found.each { |link| puts link }

  # Merge the new links into newlinks.dat, one per line, without duplicates.
  have_old = File.exist?("newlinks.dat")
  known = have_old ? File.readlines("newlinks.dat", chomp: true) : []
  File.open("links.dat.tmp", "w") do |out|
    (known + found).uniq.each { |link| out.puts link }
  end
  # Keep the previous list as a backup, then swap the new one into place.
  File.rename("newlinks.dat", "links.dat.orig") if have_old
  File.rename("links.dat.tmp", "newlinks.dat")
end
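
The two core steps of the script, pulling links out of a page and merging them into the list of known links, can be tried without any network access. The following is only a minimal sketch: the helper names extract_links and merge_links and the sample HTML are made up for illustration and are not part of the original script, and a regular expression is a rough stand-in for a real HTML parser.

```ruby
require 'uri'

# Hypothetical helper: return every absolute http(s) link found in an HTML string.
def extract_links(html)
  html.scan(/<a\s[^>]*href="(https?:\/\/[^"]+)"/i).flatten.uniq
end

# Hypothetical helper: append newly found links to the known list, skipping
# duplicates and anything that does not parse as an HTTP(S) URL.
def merge_links(known, found)
  valid = found.select do |link|
    begin
      URI.parse(link).is_a?(URI::HTTP)
    rescue URI::InvalidURIError
      false
    end
  end
  (known + valid).uniq
end

# Made-up sample page: two absolute links, one relative link, one duplicate.
page = '<p><a href="http://example.com/a">A</a> ' \
       '<a class="x" href="https://example.org/">B</a> ' \
       '<a href="/local">C</a> ' \
       '<a href="http://example.com/a">A again</a></p>'

links  = extract_links(page)
merged = merge_links(["http://example.com/a"], links)
puts merged
```

Relative links such as /local are ignored here, as in the script above; resolving them against the page URL would be a separate step.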




