self.crawl¶
self.crawl(url, **kwargs)¶
self.crawl is the main interface to tell pyspider which url(s) should be crawled.
Parameters:¶
url¶
the url or url list to be crawled.
callback¶
the method to parse the response. default: __call__
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
The following parameters are optional:
age¶
the period of validity of the task. The page would be regarded as not modified during this period. default: -1 (never recrawl)
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
Every page parsed by the callback index_page would be regarded as not changed within 10 days. If you submit the task within 10 days of the last crawl, it will be discarded.
priority¶
the priority of the task to be scheduled; higher is better. default: 0
def index_page(self, response):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
The page 233.html would be crawled before page2.html. This parameter can be used to do a BFS and reduce the number of tasks in the queue (which may cost more memory resources).
exetime¶
the time to execute the task, as a unix timestamp. default: 0 (immediately)
import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
The page would be crawled 30 minutes later.
retries¶
number of times to retry when the fetch fails. default: 3
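For example, a flaky page could be given a few extra attempts (a sketch; the url and callback names are illustrative):
def on_start(self):
    # allow up to 5 retries instead of the default 3
    self.crawl('http://www.example.org/flaky-page', callback=self.callback,
               retries=5)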
itag¶
a marker from the frontier page to reveal potential modification of the task. It is compared with its last value, and the task is recrawled when it changes. default: None
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
In the sample, .update-time is used as the itag. If it hasn't changed, the request will be discarded.
Or you can use itag with Handler.crawl_config to specify the script version if you want to restart all of the tasks.
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
Change the value of itag after you have modified the script and click the run button again. It doesn't matter if it wasn't set before.
auto_recrawl¶
when enabled, the task will be recrawled every age interval. default: False
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
The page would be recrawled every 5 hours (its age).
method¶
HTTP method to use. default: GET
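For instance, a HEAD request could be used to check a resource without downloading its body (a sketch; the url and callback are illustrative):
def on_start(self):
    # only fetch the response headers
    self.crawl('http://www.example.org/', callback=self.callback,
               method='HEAD')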
params¶
dictionary of URL parameters to append to the URL.
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
The two requests are the same.
data¶
the body to attach to the request. If a dictionary is provided, form-encoding will take place.
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
files¶
dictionary of {field: {filename: 'content'}} files for multipart upload.
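A sketch of a multipart upload following that format (the field name, filename and content are illustrative):
def on_start(self):
    # field 'report' -> filename 'report.csv' -> file content
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'report': {'report.csv': 'id,value\n1,foo\n'}})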
user_agent¶
the User-Agent of the request
headers¶
dictionary of headers to send.
cookies¶
dictionary of cookies to attach to this request.
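For example, user_agent, headers and cookies can all be set per request (the values are illustrative):
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               user_agent='pyspider-demo/1.0',
               headers={'Accept-Language': 'en-US'},
               cookies={'session': 'abc123'})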
connect_timeout¶
timeout for initial connection in seconds. default: 20
timeout¶
maximum time in seconds to fetch the page. default: 120
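For example, to fail faster on slow hosts (the values are illustrative):
def on_start(self):
    # 10s to establish the connection, 60s for the whole fetch
    self.crawl('http://www.example.org/', callback=self.callback,
               connect_timeout=10, timeout=60)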
allow_redirects¶
follow 30x redirects. default: True
validate_cert¶
For HTTPS requests, validate the server’s certificate? default: True
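For example, both can be turned off for a single request, e.g. when debugging a host with a self-signed certificate (the url is illustrative):
def on_start(self):
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               allow_redirects=False, validate_cert=False)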
proxy¶
proxy server to use, in the form username:password@hostname:port; only HTTP proxies are currently supported.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Handler.crawl_config can be used with proxy to set a proxy for the whole project.
etag¶
use the HTTP Etag mechanism to skip processing if the content of the page has not changed. default: True
last_modified¶
use the HTTP Last-Modified header mechanism to skip processing if the content of the page has not changed. default: True
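A sketch of disabling both mechanisms for a server that sends unreliable caching headers (the url is illustrative):
def on_start(self):
    # always process the page even if Etag/Last-Modified say it is unchanged
    self.crawl('http://www.example.org/', callback=self.callback,
               etag=False, last_modified=False)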
fetch_type¶
set to js to enable the JavaScript fetcher. default: None
js_script¶
JavaScript to run before or after the page is loaded; it should be wrapped in a function like function() { document.write("binux"); }.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')
The script would scroll the page to the bottom. The value returned by the function can be captured via Response.js_script_result.
js_run_at¶
run the JavaScript specified via js_script at document-start or document-end. default: document-end
js_viewport_width/js_viewport_height¶
set the size of the viewport used by the JavaScript fetcher for the layout process.
load_images¶
load images when the JavaScript fetcher is enabled. default: False
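These options apply to the JavaScript fetcher; a sketch combining them with fetch_type='js' (the values are illustrative):
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_run_at='document-start',
               js_viewport_width=1024, js_viewport_height=768*3,
               load_images=True)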
save¶
an object passed to the callback method; it can be accessed via response.save.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
123 would be returned in callback.
taskid¶
unique id to identify the task. By default it is the MD5 checksum of the URL; it can be overridden by the method def get_taskid(self, task).
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
Only the url is md5-ed as the taskid by default; the code above adds the data of a POST request as part of the taskid.
force_update¶
force update task params even if the task is in ACTIVE status.
cancel¶
cancel a task; it should be used with force_update to cancel an active task. To cancel an auto_recrawl task, you should also set auto_recrawl=False.
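A sketch of cancelling a previously submitted auto_recrawl task (the url and callback are illustrative):
def on_start(self):
    # cancel the active task and stop the automatic recrawl
    self.crawl('http://www.example.org/', callback=self.callback,
               cancel=True, force_update=True, auto_recrawl=False)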
cURL command¶
self.crawl(curl_command)
cURL is a command-line tool for making HTTP requests. A command can easily be obtained from the Chrome Devtools > Network panel: right-click the request and choose "Copy as cURL".
You can use a cURL command as the first argument of self.crawl. It will parse the command and make the HTTP request just like curl does.
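For example, a command copied from the Network panel could be passed directly (the command and callback below are illustrative):
def on_start(self):
    self.crawl("curl 'http://httpbin.org/get' -H 'Accept: text/html' --compressed",
               callback=self.callback)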
@config(**kwargs)¶
default parameters of self.crawl when the decorated method is used as the callback. For example:
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
The age of list-1.html is 15 minutes, while the age of product-233.html is 10 days. Because the callback of product-233.html is detail_page, it shares the config of detail_page.
Handler.crawl_config = {}¶
default parameters of self.crawl for the whole project. The parameters in crawl_config for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are joined when the task is created; the parameters for the fetcher and processor are joined when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }

    ...
crawl_config sets a project-level user-agent.