Giving Up on Mechanize | Pujiaxun's Space

Connections keep getting reset

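Crawlers built on Mechanize keep running into the error below. I went through a lot of material without finding a solution, and the Mechanize repo has a pile of issues about this problem as well.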

Net::HTTP::Persistent::Error: too many connection resets (due to Connection reset by peer - Errno::ECONNRESET) after 2 requests on 14759220

Today I came across an article that seems to offer a solution:
Defeating the Infamous Mechanize “Too Many Connection Resets” Bug

The proposed solution

Here is my translation of the article's main points:

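Have you ever run into this thoroughly annoying error while writing a crawler in Ruby with Mechanize?
This problem has plagued Mechanize users for years and has never really been fixed. There are plenty of suggestions that amount to folklore and black magic, but none of them actually works. You can read the whole story in Mechanize Issue #123.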

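I believe the root cause lies in how the underlying Net::HTTP handles a connection reset after a POST request, and the issue mentioned above contains some evidence that supports this theory. Based on that assumption, I put together a workaround that has been running happily in production for the past few months.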

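This is not a real fix for Mechanize or Net::HTTP::Persistent, and there are certainly edge cases where you really do want the error to be raised. In practice, though, I found a very simple way to deal with "too many connection resets": force the connection to be closed and rebuilt, then bluntly retry. This has held up well in a production crawler that handles a large volume of data and hits this problem intermittently.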
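The "close, reconnect, retry" idea can be sketched on its own, independent of Mechanize. The helper below is only an illustration and not the article's code: with_connection_reset_retry, max_retries, and the reset callable are hypothetical names, and it rescues the standard library's Errno::ECONNRESET rather than Net::HTTP::Persistent::Error so that it runs without any gems installed.

```ruby
# Minimal sketch of "tear down the connection and retry".
# All names here are hypothetical; nothing below is part of Mechanize.
def with_connection_reset_retry(max_retries: 3, reset: -> {})
  attempts = 0
  begin
    yield
  rescue Errno::ECONNRESET
    # Give up and re-raise once the retry budget is used up.
    raise if attempts >= max_retries

    attempts += 1
    reset.call # e.g. close the persistent connection so a fresh one is opened
    retry
  end
end

# Simulated flaky request: fails twice, then succeeds on the third attempt.
calls  = 0
resets = 0
result = with_connection_reset_retry(reset: -> { resets += 1 }) do
  calls += 1
  raise Errno::ECONNRESET if calls < 3
  "response body"
end
```

With Mechanize, the reset callable is where the persistent connection would be shut down, and the block is where the request would be made. Capping the number of retries matters: a server that resets every connection would otherwise keep the crawler looping forever.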

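All it takes is wrapping 'Mechanize::HTTP::Agent#fetch', the low-level method that performs GET/PUT/POST/HEAD and other requests. The wrapper catches this painful exception, calls 'shutdown' so that a fresh HTTP connection gets established, and then retries 'fetch'.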

Loading the monkey patch below in your project should make this infuriating error go away in most cases.

require 'mechanize'

class Mechanize::HTTP::Agent
  MAX_RESET_RETRIES = 10

  # We need to replace the core Mechanize HTTP method:
  #
  #   Mechanize::HTTP::Agent#fetch
  #
  # with a wrapper that handles the infamous "too many connection resets"
  # Mechanize bug that is described here:
  #
  #   https://github.com/sparklemotion/mechanize/issues/123
  #
  # The wrapper shuts down the persistent HTTP connection when it fails with
  # this error, and simply tries again. In practice, this only ever needs to
  # be retried once, but I am going to let it retry a few times
  # (MAX_RESET_RETRIES), just in case.
  #
  def fetch_with_retry(
    uri,
    method    = :get,
    headers   = {},
    params    = [],
    referer   = current_page,
    redirects = 0
  )
    action      = "#{method.to_s.upcase} #{uri.to_s}"
    retry_count = 0

    begin
      fetch_without_retry(uri, method, headers, params, referer, redirects)
    rescue Net::HTTP::Persistent::Error => e
      # Pass on any other type of error.
      raise unless e.message =~ /too many connection resets/

      # Pass on the error if we've tried too many times.
      if retry_count >= MAX_RESET_RETRIES
        puts "**** WARN: Mechanize retried connection reset #{MAX_RESET_RETRIES} times and never succeeded: #{action}"
        raise
      end

      # Otherwise, shutdown the persistent HTTP connection and try again.
      puts "**** WARN: Mechanize retrying connection reset error: #{action}"
      retry_count += 1
      self.http.shutdown
      retry
    end
  end

  # Alias so #fetch actually uses our new #fetch_with_retry to wrap the
  # old one aliased as #fetch_without_retry.
  alias_method :fetch_without_retry, :fetch
  alias_method :fetch, :fetch_with_retry
end
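One caveat with the alias_method chain above: if the patch file is ever loaded twice (for example via load or code reloading), fetch_without_retry ends up pointing at the wrapper itself and the call recurses until the stack overflows. A common alternative is Module#prepend, where the wrapper reaches the original method through super. The sketch below shows the mechanism on a made-up FlakyAgent class standing in for Mechanize::HTTP::Agent. FlakyAgent, RetryOnReset, and the fake shutdown method are hypothetical names for illustration, and it rescues Errno::ECONNRESET so that it runs without Mechanize installed.

```ruby
# Hypothetical stand-in for Mechanize::HTTP::Agent: the first two calls to
# #fetch fail with a connection reset, the third succeeds.
class FlakyAgent
  attr_reader :shutdowns

  def initialize
    @failures_left = 2
    @shutdowns     = 0
  end

  def fetch(uri)
    if @failures_left > 0
      @failures_left -= 1
      raise Errno::ECONNRESET
    end
    "fetched #{uri}"
  end

  # Stands in for closing the persistent connection.
  def shutdown
    @shutdowns += 1
  end
end

# Wrapper module: a prepended method runs first and reaches the original
# implementation via super, so no aliases are needed.
module RetryOnReset
  MAX_RESET_RETRIES = 10

  def fetch(*args)
    retry_count = 0
    begin
      super
    rescue Errno::ECONNRESET
      # Pass on the error if we've tried too many times.
      raise if retry_count >= MAX_RESET_RETRIES

      retry_count += 1
      shutdown
      retry
    end
  end
end

FlakyAgent.prepend(RetryOnReset)

agent = FlakyAgent.new
page  = agent.fetch("http://example.com/")
```

Here the call succeeds on the third attempt after two simulated resets. Whether the same approach would behave any better than the alias version against a real server is a separate question; this only changes how the wrapper is installed.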

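Let me "translate" the article once more: the idea is to modify Mechanize's fetch method so that it shuts up when this error occurs and automatically retries the request.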

I tried it. It made no difference whatsoever.

Giving up on Mechanize

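As long as this problem stays unsolved, the crawler's efficiency takes a serious hit. With our school's potato of a server, half of all POST requests run into this error.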

For now I really cannot find a solution, so I am giving up on Mechanize. If I cannot find a suitable gem, I may have to switch to Python.

If I have missed anything, feel free to point it out in the comments or open an issue on Github. Thanks!