I've recently been messing around with Jens Kraemer's fantastic web crawler module, RDig. One thing I noticed about it, however, is that it isn't written to crawl a website optimally when you include or exclude certain URL patterns.
First off, the start_url has to get past the include and exclude url pattern filters or else the site won't get crawled.
Secondly, because documents go through the pattern filter before they get added to the queue, pages that can be accessed only from pages that don't get past the pattern filter won't be seen at all.
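To make both problems concrete, here is a minimal sketch of a crawler that filters before queueing. It is not RDig's code: the toy site, the include pattern, and the method name are all made up for illustration.

```ruby
require 'set'

# Hypothetical toy site: each URL maps to the links found on that page.
TOY_SITE = {
  '/'                => ['/archive/', '/about'],
  '/archive/'        => ['/docs/guide.html'],
  '/about'           => [],
  '/docs/guide.html' => []
}.freeze

# Hypothetical include pattern: only URLs under /docs/ should be indexed.
DOCS_ONLY = %r{\A/docs/}

# Breadth-first crawl that applies the include pattern BEFORE a URL is
# queued, which is the behaviour described above.
def crawl_filtering_before_queue(start_url)
  queue   = []
  seen    = Set.new
  indexed = []
  enqueue = lambda do |url|
    next unless url =~ DOCS_ONLY   # rejected URLs are never fetched
    next if seen.include?(url)
    seen << url
    queue << url
  end

  enqueue.call(start_url)
  until queue.empty?
    url = queue.shift
    indexed << url
    TOY_SITE.fetch(url, []).each { |link| enqueue.call(link) }
  end
  indexed
end

p crawl_filtering_before_queue('/')                 # start URL is rejected
p crawl_filtering_before_queue('/docs/guide.html')  # only works if you start inside /docs/
```

Starting from '/' the crawl indexes nothing, because the start URL itself fails the include pattern, and '/docs/guide.html' is only linked from pages that fail it too.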
The solution to this is fairly simple. I altered the code so that a document goes through every filter except the include and exclude filters before it gets added to the queue, and then goes through the include and exclude filters before it gets added to the index. So here is what I did:
Changed the order of members of the filter_chain array in rdig.rb:
def filter_chain
  @filter_chain ||= {
    # filter chain for http crawling
    :http => [
      :scheme_filter_http,
      :fix_relative_uri,
      :normalize_uri,
      { :hostname_filter => :include_hosts },
      RDig::UrlFilters::VisitedUrlFilter,
      { RDig::UrlFilters::UrlInclusionFilter => :include_documents },
      { RDig::UrlFilters::UrlExclusionFilter => :exclude_documents }
    ],
    # filter chain for file system crawling
    :file => [
      :scheme_filter_file,
      { RDig::UrlFilters::PathInclusionFilter => :include_documents },
      { RDig::UrlFilters::PathExclusionFilter => :exclude_documents }
    ]
  }
end
Replaced the definition of the apply method (line 55) in url_filters.rb with the following two methods:
def apply_first(document) # applies filters 0-4 of the @filters array
  @filters[0..4].each { |filter|
    return nil unless filter.call(document)
  }
  return document
end

def apply_second(document) # applies filters 5-6 of the @filters array
  @filters[5..6].each { |filter|
    return nil unless filter.call(document)
  }
  return document
end
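One thing worth checking: the hard-coded ranges line up with the seven-entry :http chain, but the :file chain has only three entries. Assuming @filters holds one filter per chain entry (an assumption about RDig's internals, not something verified here), the sketch below shows how Ruby slices arrays of those two sizes. The stand-in lambdas and the apply_range helper are hypothetical, not part of RDig.

```ruby
# Stand-in filters: each lambda returns the document (truthy) or nil.
http_filters = Array.new(7) { ->(doc) { doc } }
file_filters = Array.new(3) { ->(doc) { doc } }

p http_filters[0..4].size   # 5: everything up to the visited-URL filter
p http_filters[5..6].size   # 2: the include and exclude filters
p file_filters[0..4].size   # 3: the whole file chain, include/exclude too
p file_filters[5..6]        # nil: the range starts past the end of the array

# Hypothetical defensive variant: treat a missing slice as "no filters",
# so a short chain passes the document through instead of raising
# NoMethodError on nil.
def apply_range(filters, range, document)
  (filters[range] || []).each do |filter|
    return nil unless filter.call(document)
  end
  document
end

p apply_range(file_filters, 5..6, :some_document)
```

If that assumption holds, apply_second would call each on nil for a file crawl, and the rescue in process_document would report it as an error. If you only crawl over http this doesn't come up.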
In the add_url method definition in crawler.rb, I changed the apply method call to the following:
doc = filterchain.apply_first(doc)
Changed the process_document method definition in crawler.rb to the following:
def process_document(doc, filterchain)
  doc.fetch
  # add links from this document to the queue
  doc.content[:links].each { |url|
    add_url(url, filterchain, doc)
  } unless doc.content[:links].nil?
  return unless @etag_filter.apply(doc)
  # run the include and exclude filters only now, just before indexing
  doc = filterchain.apply_second(doc)
  if doc
    @indexer << doc if doc.needs_indexing?
  end
rescue
  puts "error processing document #{doc.uri.to_s}: #{$!}"
  puts "Trace: #{$!.backtrace.join("\n")}" if RDig::config.verbose
end
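Stripped of the RDig specifics, the changes above amount to a two-stage filter: cheap traversal checks before queueing, include/exclude patterns before indexing. Here is a rough, self-contained sketch of that idea on a made-up toy site. The names are hypothetical, and the only traversal check kept here is the visited check.

```ruby
require 'set'

# Hypothetical toy site: each URL maps to the links found on that page.
PAGES = {
  '/'                => ['/archive/', '/about'],
  '/archive/'        => ['/docs/guide.html', '/'],
  '/about'           => [],
  '/docs/guide.html' => []
}.freeze

def two_stage_crawl(start_url, include_pattern)
  queue   = [start_url]
  seen    = Set.new([start_url])
  indexed = []

  until queue.empty?
    url = queue.shift
    # Stage 1 (before queueing): only the visited check, so every
    # reachable page gets followed regardless of the include pattern.
    PAGES.fetch(url, []).each do |link|
      queue << link if seen.add?(link)   # add? returns nil if already seen
    end
    # Stage 2 (before indexing): the include pattern decides what is kept.
    indexed << url if url =~ include_pattern
  end
  indexed
end

p two_stage_crawl('/', %r{\A/docs/})
```

The trade-off is that the crawler now fetches every page that gets past the first stage, including ones that will never be indexed, so a crawl can take noticeably longer on a large site.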
Also, if you're having problems appending to an existing index, make sure that line 103 of config.rb starts with 'cfg' and not 'config'.

Comments
Couldn't figure this out for the life of me. Re-visited the post three times in sheer frustration... then I read the last line.
Guess I should be more thorough in future!