Working with Results
Downloading and viewing your data from the WebUI is convenient, but it may not be suitable for programmatic processing.
Working with ResultDB
resultdb is designed only for previewing results and is not suitable for large-scale storage. If you still want to pull data from resultdb, the following snippet uses the database API to connect and select the data.
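from pyspider.database import connect_database

# Replace the placeholder with the connection URL of your resultdb.
resultdb = connect_database("<your resultdb connection url>")
for project in resultdb.projects:
    for result in resultdb.select(project):
        # Every row carries the task id, the source URL and the result itself.
        assert result['taskid']
        assert result['url']
        assert result['result']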
The result['result'] is the object submitted by the return statement from your script.
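The loop above can be turned into a small export routine. The following is a minimal sketch, not part of pyspider: export_results and FakeResultDB are hypothetical names, and the fake class only mimics the projects attribute and select() method used in the snippet above so that the example runs without a database. With a real connection you would pass the object returned by connect_database instead.

```python
import io
import json


def export_results(resultdb, out):
    """Write every result of every project to `out`, one JSON object per line."""
    count = 0
    for project in resultdb.projects:
        for row in resultdb.select(project):
            record = {
                "project": project,
                "taskid": row["taskid"],
                "url": row["url"],
                "result": row["result"],
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


class FakeResultDB:
    """Hypothetical in-memory stand-in, used only to make this sketch runnable."""

    def __init__(self, data):
        self._data = data

    @property
    def projects(self):
        return list(self._data)

    def select(self, project):
        return iter(self._data[project])


fake_db = FakeResultDB({
    "demo": [
        {"taskid": "t1", "url": "http://example.com/1", "result": {"title": "A"}},
        {"taskid": "t2", "url": "http://example.com/2", "result": {"title": "B"}},
    ],
})
buffer = io.StringIO()
exported = export_results(fake_db, buffer)
```

Writing one JSON object per line keeps the export streamable, so a large resultdb does not have to be held in memory at once.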
Working with ResultWorker
In a production environment, you may want to connect pyspider to your own system or post-processing pipeline rather than store results in resultdb. Overriding ResultWorker is the highly recommended way to do this.
from pyspider.result import ResultWorker

class MyResultWorker(ResultWorker):
    def on_result(self, task, result):
        assert task['taskid']
        assert task['project']
        assert task['url']
        assert result
        # your processing code goes here
result is the object submitted by the return statement from your script.
You can put this script (e.g., my_result_worker.py) in the folder where you launch pyspider. Add the argument for the result_worker subcommand:
pyspider result_worker --result-cls=my_result_worker.MyResultWorker
Or
{
  ...
  "result_worker": {
    "result_cls": "my_result_worker.MyResultWorker"
  }
  ...
}
if you are using a config file. Please refer to Deployment for details.
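As an illustration of what the processing code inside on_result might do, here is a minimal sketch of a helper that flattens a task and its result into one record for a downstream pipeline. build_record and the payload field are hypothetical names chosen for this example; only the task keys (project, taskid, url) come from the snippet above. The helper is kept free of pyspider imports so it can be tested on its own.

```python
import json


def build_record(task, result):
    """Flatten a (task, result) pair into one dict for a downstream pipeline."""
    return {
        "project": task["project"],
        "taskid": task["taskid"],
        "url": task["url"],
        # Serialize with sorted keys so identical results give identical text.
        "payload": json.dumps(result, sort_keys=True),
    }


# Inside MyResultWorker.on_result(self, task, result) this could be called as:
#     record = build_record(task, result)
#     ... hand `record` to your own queue or storage ...
sample_task = {"project": "demo", "taskid": "t1", "url": "http://example.com/1"}
record = build_record(sample_task, {"title": "A", "price": 10})
```

Keeping the transformation in a plain function means it can be unit-tested without starting a result_worker process.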
Design Your Own Database Schema
The results stored in the database are encoded as JSON for compatibility. It's highly recommended to design your own database schema and override the ResultWorker described above.
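A minimal sketch of what a custom schema could look like, using the standard-library sqlite3 module. Everything here is an assumption made for illustration: the products table, its columns, and the create_schema and save_result helpers are hypothetical, and a real deployment would pick the database and columns that match its own data. Fields without a dedicated column are kept as JSON in an extra column.

```python
import json
import sqlite3


def create_schema(conn):
    """Create a hypothetical table with typed columns instead of one JSON blob."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS products (
            taskid TEXT PRIMARY KEY,
            url    TEXT NOT NULL,
            title  TEXT,
            price  REAL,
            extra  TEXT
        )
        """
    )


def save_result(conn, task, result):
    """Store known fields in their own columns and the rest as JSON."""
    known = {"title", "price"}
    extra = {k: v for k, v in result.items() if k not in known}
    conn.execute(
        "INSERT OR REPLACE INTO products (taskid, url, title, price, extra) "
        "VALUES (?, ?, ?, ?, ?)",
        (task["taskid"], task["url"], result.get("title"),
         result.get("price"), json.dumps(extra)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
save_result(conn, {"taskid": "t1", "url": "http://example.com/1"},
            {"title": "A", "price": 9.5, "color": "red"})
row = conn.execute("SELECT title, price, extra FROM products").fetchone()
```

A function like save_result could be called from the on_result method of an overridden ResultWorker, so that results go straight into typed columns that can be indexed and queried.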
Tips about Results
Want to return more than one result in a callback?
Because resultdb de-duplicates results by taskid (url), the latest result overwrites any previous ones.
One workaround is to use the send_message API to make a fake taskid for each result.
def detail_page(self, response):
    for li in response.doc('li').items():
        self.send_message(self.project_name, {
            ...
        }, url=response.url+"#"+li('a.product-sku').text())

def on_message(self, project, msg):
    return msg
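To see why appending a fragment to the URL yields a separate result per item, the sketch below assumes the taskid is the MD5 hex digest of the URL, which is the documented default unless a handler overrides get_taskid. default_taskid, the page URL and the SKU values are hypothetical and exist only for this illustration.

```python
import hashlib


def default_taskid(url):
    """Assumed default: the taskid is the MD5 hex digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


page_url = "http://example.com/list"
skus = ["sku-1", "sku-2", "sku-3"]

# Each fake URL differs only in its fragment, yet hashes to its own taskid.
fake_urls = [page_url + "#" + sku for sku in skus]
taskids = {default_taskid(u) for u in fake_urls}
```

Note that the workaround relies on the appended text being unique per item: two list entries with the same or an empty SKU would map to the same taskid and overwrite each other.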
See Also: apis/self.send_message