Scrapy: Request

Mar 11, 2016


This post collects some notes on Request in Scrapy.

  • Typically you make a network request by constructing a Request directly; the constructor accepts a number of commonly used parameters.

      Excerpt from the Request source:
      class Request(object_ref):
    
          def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                       cookies=None, meta=None, encoding='utf-8', priority=0,
                       dont_filter=False, errback=None):
    
              self._encoding = encoding  # this one has to be set first
              self.method = str(method).upper()
              self._set_url(url)
              self._set_body(body)
              assert isinstance(priority, int), "Request priority not an integer: %r" % priority
              self.priority = priority
    
              assert callback or not errback, "Cannot use errback without a callback"
              self.callback = callback
              self.errback = errback
    
              self.cookies = cookies or {}
              self.headers = Headers(headers or {}, encoding=encoding)
              self.dont_filter = dont_filter
    
              self._meta = dict(meta) if meta else None
    
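  • While studying scrapy_redis, I noticed that the deduplication scheme in its source stores both request and request_fingerprint. After comparing the two, my summary is as follows: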

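      request_fingerprint is the fingerprint of a request: a hash that identifies the resource being requested. My understanding may still be imprecise, so here is the comment from the source:
      The request fingerprint is a hash that uniquely identifies the resource the request points to.
      request_fingerprint filters out requests that point to the same resource. The fingerprints are stored in the dupefilter, which is a Redis set.
      Example request_fingerprint: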
      "18f32856cf15db66a6271f3a414bef6e4620fce3"
        
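      Requests that pass the request_fingerprint filter are stored in a queue that scrapy_redis reimplements on top of the Redis sorted set type. Because a sorted set holds each member only once, this acts as a second layer of filtering.
      Each stored request carries the url, callback, and the other Request parameters listed above.
      Example request: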
      redis 127.0.0.1:6379> zrange smzdm:requests 0 0
      1) "\x80\x02}q\x01(U\x04bodyq\x02U\x00U\t_encodingq\x03U\x05utf-8q\x04U\acookiesq\x05}q\x06U\x04metaq\a}q\bU\x05depthq\tK\x04sU\aheade
      rsq\n}q\x0bU\aRefererq\x0c]q\rU\x1ahttp://faxian.smzdm.com/p4q\x0easU\x03urlq\x0fX\x1f\x00\x00\x00http://www.smzdm.com/p/6046409/U\
      x0bdont_filterq\x10\x89U\bpriorityq\x11K\x00U\bcallbackq\x12U\x0cparse_detailq\x13U\x06methodq\x14U\x03GETq\x15U\aerrbackq\x16Nu."
      redis 127.0.0.1:6379>
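
To make the fingerprint idea more concrete, here is a minimal sketch using only the Python standard library. It is not the actual code from Scrapy or scrapy_redis: `simple_fingerprint` and `SetDupeFilter` are hypothetical names, the URL normalisation is much cruder than the real one, and a plain Python set stands in for the Redis set. It assumes the fingerprint is a SHA-1 digest over the method, the normalised URL, and the body, which is at least consistent with the 40-character hex value shown above.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def simple_fingerprint(method, url, body=b""):
    """Return a SHA-1 hex digest over method, normalised URL, and body.

    Simplified stand-in for illustration, not Scrapy's own function.
    """
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(canonical.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class SetDupeFilter:
    """In-memory stand-in for the Redis set that holds seen fingerprints."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url, body=b""):
        """Return True if an equivalent request was already recorded."""
        fp = simple_fingerprint(method, url, body)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False
```

In this sketch, `GET http://example.com/p?a=1&b=2` and `GET http://example.com/p?b=2&a=1#top` (made-up URLs) produce the same fingerprint, so the second call to `request_seen` returns True and the request would be dropped.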
    
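
The queue side can be sketched in the same spirit. The dump above looks like a pickled dict of the Request parameters, with the callback stored by name. The class below is a hypothetical in-memory stand-in for a Redis sorted set, not the scrapy_redis implementation; it assumes the score is the negated priority so that higher-priority requests come out first.

```python
import pickle


class SortedSetQueue:
    """In-memory stand-in for a Redis sorted set used as a priority queue."""

    def __init__(self):
        # Maps serialized request bytes to a numeric score.
        self.members = {}

    def push(self, request_dict):
        """Serialize the request dict and add it with score = -priority."""
        data = pickle.dumps(request_dict, protocol=2)
        # Re-adding a byte-identical member only updates its score,
        # as ZADD would, so exact duplicates never pile up.
        self.members[data] = -request_dict["priority"]

    def pop(self):
        """Remove and return the request with the lowest score, or None."""
        if not self.members:
            return None
        # Lowest score first, mirroring `zrange key 0 0`.
        data = min(self.members, key=self.members.get)
        del self.members[data]
        return pickle.loads(data)

    def __len__(self):
        return len(self.members)
```

Pushing the same request dict twice leaves a single member in the queue, which is the sense in which the sorted set works as one more layer of filtering.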


Closing:

Keep making a little progress every day…