Ruby RDig modification

I've recently been messing around with Jens Kraemer's fantastic web crawler module, RDig. One thing I noticed about it, however, is that it isn't coded to crawl a website optimally when you include or exclude certain URL patterns.

First off, the start_url has to get past the include and exclude URL pattern filters, or else the site won't get crawled at all.

Secondly, because documents go through the pattern filters before they are added to the queue, pages that can be reached only through pages rejected by those filters are never seen.
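To make the second problem concrete, here is a toy simulation. It is not RDig code: the LINKS graph, the URLs, and the crawl_filtering_before_queue method are all made up for illustration, and the only assumption is that a URL must match the include pattern before it can be queued.

```ruby
# Hypothetical link graph: page => pages it links to.
LINKS = {
  "/"            => ["/about", "/blog/"],
  "/about"       => ["/docs/hidden"],
  "/blog/"       => ["/docs/intro"],
  "/docs/intro"  => [],
  "/docs/hidden" => []
}

# Toy crawl in which a URL (the start URL included) must match
# `include` before it is queued. Returns the pages actually visited.
def crawl_filtering_before_queue(start, include)
  queue   = []
  visited = []
  enqueue = lambda do |url|
    queue << url if url =~ include && !visited.include?(url) && !queue.include?(url)
  end

  enqueue.call(start)
  until queue.empty?
    url = queue.shift
    visited << url
    LINKS.fetch(url, []).each { |link| enqueue.call(link) }
  end
  visited
end

# Include the home page and anything under /docs/.
seen = crawl_filtering_before_queue("/", %r{\A/(docs/.*)?\z})
```

With this pattern only "/" is visited. Both /docs/ pages match the pattern, but the pages that link to them (/about and /blog/) are rejected before they are ever fetched, so the crawler never learns that the /docs/ pages exist. If the pattern is narrowed to %r{\A/docs/}, the start URL itself is rejected and nothing is crawled, which is the first problem.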

The solution to this is fairly simple. I altered the code so that documents go through every filter except the include and exclude filters before they are added to the queue, and then go through the include and exclude filters before they are added to the index. So here is what I did:

Changed the order of the members of the :http filter chain in the filter_chain method in rdig.rb, so that the include and exclude filters come last:


    def filter_chain
      @filter_chain ||= {
        # filter chain for http crawling
        :http => [
          :scheme_filter_http,
          :fix_relative_uri,
          :normalize_uri,
          { :hostname_filter => :include_hosts },
          RDig::UrlFilters::VisitedUrlFilter,         
          { RDig::UrlFilters::UrlInclusionFilter => :include_documents },
          { RDig::UrlFilters::UrlExclusionFilter => :exclude_documents } 
        ],
        # filter chain for file system crawling
        :file => [
          :scheme_filter_file,
          { RDig::UrlFilters::PathInclusionFilter => :include_documents },
          { RDig::UrlFilters::PathExclusionFilter => :exclude_documents }
        ]
      }
    end

Replaced the definition of the apply method (line 55) in url_filters.rb with the following two methods:


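      # Applies filters 0-4 of @filters (in the :http chain, everything
      # up to and including the visited-URL filter).
      def apply_first(document)
        @filters[0..4].each do |filter|
          return nil unless filter.call(document)
        end
        document
      end

      # Applies filters 5-6 of @filters (in the :http chain, the include
      # and exclude filters). The :file chain has only three filters, so
      # the slice is nil there; fall back to an empty list.
      def apply_second(document)
        (@filters[5..6] || []).each do |filter|
          return nil unless filter.call(document)
        end
        document
      end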
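The two-stage idea can be tried outside RDig with a toy chain. This is only a sketch: ToyFilterChain, the lambdas, and example.com are hypothetical stand-ins, documents are plain URL strings, and the only thing borrowed from the change above is the split at index 5. The `|| []` guard is an assumption of this sketch, there so that a chain shorter than six filters does not fail in the second stage.

```ruby
# Toy stand-in for a filter chain; every name here is hypothetical.
class ToyFilterChain
  def initialize(filters)
    @filters = filters
  end

  # Stage 1: decides whether a URL is queued and followed.
  def apply_first(document)
    run(@filters[0..4], document)
  end

  # Stage 2: decides whether a fetched document is indexed.
  def apply_second(document)
    run(@filters[5..6], document)
  end

  private

  # Returns the document if every filter accepts it, nil otherwise.
  # A nil slice (chain too short) is treated as "no filters".
  def run(filters, document)
    (filters || []).each do |filter|
      return nil unless filter.call(document)
    end
    document
  end
end

pass = ->(url) { url }
chain = ToyFilterChain.new([
  ->(url) { url.start_with?("http://") },   # scheme check
  pass,                                     # stand-in for fix_relative_uri
  pass,                                     # stand-in for normalize_uri
  ->(url) { url.include?("example.com") },  # host check
  pass,                                     # stand-in for the visited check
  ->(url) { url =~ %r{/docs/} },            # include pattern
  ->(url) { url !~ /\.pdf\z/ }              # exclude pattern
])

about = "http://example.com/about"
followed = chain.apply_first(about)   # passes stage 1, so it is crawled
indexed  = chain.apply_second(about)  # fails the include pattern
```

Here the /about page is followed (so its links are discovered) but is not indexed, which is the behaviour the change is after.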

In the add_url method definition in crawler.rb, I changed the apply method call to the following:


      doc = filterchain.apply_first(doc)

Changed the process_document method definition in crawler.rb to the following:


    def process_document(doc, filterchain)
      doc.fetch
      # add links from this document to the queue
      doc.content[:links].each { |url| 
        add_url(url, filterchain, doc) 
      } unless doc.content[:links].nil?

      return unless @etag_filter.apply(doc)
      doc = filterchain.apply_second(doc)
      if doc
        @indexer << doc if doc.needs_indexing?
      end
    rescue
      puts "error processing document #{doc.uri.to_s}: #{$!}"
      puts "Trace: #{$!.backtrace.join("\n")}" if RDig::config.verbose
    end
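The overall effect can be seen in a toy crawl that follows the same order as process_document: queue the links first, then decide whether to index. Nothing here is RDig code; the SITE graph, the URLs, and crawl_filtering_before_index are made up for illustration.

```ruby
# Hypothetical site: page => pages it links to.
SITE = {
  "/"            => ["/about", "/blog/"],
  "/about"       => ["/docs/hidden"],
  "/blog/"       => ["/docs/intro"],
  "/docs/intro"  => [],
  "/docs/hidden" => []
}

# Toy crawl that follows every link but indexes only pages
# matching `include`. Returns the list of indexed pages.
def crawl_filtering_before_index(start, include)
  queue   = [start]
  visited = []
  indexed = []
  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    queue.concat(SITE.fetch(url, []))  # stage 1: no include/exclude check
    indexed << url if url =~ include   # stage 2: gate the index only
  end
  indexed
end

docs = crawl_filtering_before_index("/", %r{\A/docs/})
```

Both /docs/ pages end up in the index, even though the start URL and the pages linking to them do not match the include pattern.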

Also, if you're having problems appending to an existing index, make sure that line 103 of config.rb starts with 'cfg' and not 'config'.


Comments

Submitted by eve isk (not verified) on August 18, 2008 - 20:16.

Couldn't figure this out for the life of me. Re-visited the post three times in sheer frustration... then I read the last line.

Guess I should be more thorough in future!
