Deployment¶
Since pyspider has various components, you can just run pyspider
to start a standalone and third service free instance. Or using MySQL or MongoDB and RabbitMQ to deploy a distributed crawl cluster.
To deploy pyspider in product environment, running component in each process and store data in database service is more reliable and flexible.
Installation¶
To deploy pyspider components in each single processes, you need at least one database service. pyspider now supports MySQL, MongoDB and PostgreSQL. You can choose one of them.
And you need a message queue service to connect the components together. You can use RabbitMQ, Beanstalk or Redis as message queue.
pip install --allow-all-external pyspider[all]
Even if you had install pyspider using
pip
before. Install withpyspider[all]
is necessary to install the requirements for MySQL/MongoDB/RabbitMQ.
if you are using Ubuntu, try:
apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
to install binary packages.
Deployment¶
This document is based on MySQL + RabbitMQ
config.json¶
Although you can use command-line to specify the parameters. A config file is a better choice.
{
"taskdb": "mysql+taskdb://username:password@host:port/taskdb",
"projectdb": "mysql+projectdb://username:password@host:port/projectdb",
"resultdb": "mysql+resultdb://username:password@host:port/resultdb",
"message_queue": "amqp://username:password@host:port/%2F",
"webui": {
"username": "some_name",
"password": "some_passwd",
"need-auth": true
}
}
you can get complete options by running pyspider --help
and pyspider webui --help
for subcommands. "webui"
in JSON is configs for subcommands. You can add parameters for other components similar to this one.
Database Connection URI¶
"taskdb"
, "projectdb”
, "resultdb"
is using database connection URI with format below:
mysql:
mysql+type://user:passwd@host:port/database
sqlite:
# relative path
sqlite+type:///path/to/database.db
# absolute path
sqlite+type:////path/to/database.db
# memory database
sqlite+type://
mongodb:
mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
more: http://docs.mongodb.org/manual/reference/connection-string/
sqlalchemy:
sqlalchemy+postgresql+type://user:passwd@host:port/database
sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
local+projectdb://filepath,filepath
type:
should be one of `taskdb`, `projectdb`, `resultdb`.
Message Queue URL¶
You can use connection URL to specify the message queue:
rabbitmq:
amqp://username:password@host:5672/%2F
Refer: https://www.rabbitmq.com/uri-spec.html
beanstalk:
beanstalk://host:11300/
redis:
redis://host:6379/db
redis://host1:port1,host2:port2,...,hostn:portn (for redis 3.x in cluster mode)
builtin:
None
Hint for postgresql: you need to create database with encoding utf8 by your own. pyspider will not create database for you.
running¶
You should run components alone with subcommands. You may add &
after command to make it running in background and use screen or nohup to prevent exit after your ssh session ends. It's recommended to manage components with Supervisor.
# start **only one** scheduler instance
pyspider -c config.json scheduler
# phantomjs
pyspider -c config.json phantomjs
# start fetcher / processor / result_worker instances as many as your needs
pyspider -c config.json --phantomjs-proxy="localhost:25555" fetcher
pyspider -c config.json processor
pyspider -c config.json result_worker
# start webui, set `--scheduler-rpc` if scheduler is not running on the same host as webui
pyspider -c config.json webui