Deployment of demo.pyspider.org

demo.pyspider.org is running on three VPSs connected together with private network using tinc.

1vCore 4GB RAM 1vCore 2GB RAM * 2
database
message queue
scheduler
phantomjs * 2
phantomjs-lb * 1
fetcher * 1
fetcher-lb * 1
processor * 2
result-worker * 1
webui * 4
webui-lb * 1
nginx * 1

All components are running inside docker containers.

database / message queue / scheduler

The database is postgresql and the message queue is redis.

Scheduler may have a lot of database operations, it's better to put it close to the database.

docker run --name postgres -v /data/postgres/:/var/lib/postgresql/data -d -p $LOCAL_IP:5432:5432 -e POSTGRES_PASSWORD="" postgres
docker run --name redis -d -p  $LOCAL_IP:6379:6379 redis
docker run --name scheduler -d -p $LOCAL_IP:23333:23333 --restart=always binux/pyspider \
 --taskdb "sqlalchemy+postgresql+taskdb://[email protected]/taskdb" \
 --resultdb "sqlalchemy+postgresql+resultdb://[email protected]/resultdb" \
 --projectdb "sqlalchemy+postgresql+projectdb://[email protected]/projectdb" \
 --message-queue "redis://10.21.0.7:6379/1" \
 scheduler --inqueue-limit 5000 --delete-time 43200

other components

fetcher, processor, result_worker are running on two boxes with same configuration managed with docker-compose.

phantomjs:
  image: 'binux/pyspider:latest'
  command: phantomjs
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,23333,24444'
  expose:
    - '25555'
  mem_limit: 512m
  restart: always
phantomjs-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - phantomjs
  restart: always

fetcher:
  image: 'binux/pyspider:latest'
  command: '--message-queue "redis://10.21.0.7:6379/1" --phantomjs-proxy "phantomjs:80" fetcher --xmlrpc'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,25555,23333'
  links:
    - 'phantomjs-lb:phantomjs'
  mem_limit: 128m
  restart: always
fetcher-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - fetcher
  restart: always

processor:
  image: 'binux/pyspider:latest'
  command: '--projectdb "sqlalchemy+postgresql+projectdb://[email protected]/projectdb" --message-queue "redis://10.21.0.7:6379/1" processor'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

result-worker:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://[email protected]/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://[email protected]/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://[email protected]/resultdb" --message-queue "redis://10.21.0.7:6379/1" result_worker'
  cpu_shares: 512
  mem_limit: 256m
  restart: always

webui:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://[email protected]/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://[email protected]/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://[email protected]/resultdb" --message-queue "redis://10.21.0.7:6379/1" webui --max-rate 0.2 --max-burst 3 --scheduler-rpc "http://o4.i.binux.me:23333/" --fetcher-rpc "http://fetcher/"'

  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=24444,25555,23333'
  links:
    - 'fetcher-lb:fetcher'
  mem_limit: 256m
  restart: always
webui-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - webui
  restart: always

nginx:
  image: 'nginx'
  links:
    - 'webui-lb:HAPROXY'
  ports:
    - '0.0.0.0:80:80'
  volumes:
    - /home/binux/nfs/profile/nginx/nginx.conf:/etc/nginx/nginx.conf
    - /home/binux/nfs/profile/nginx/conf.d/:/etc/nginx/conf.d/
  restart: always

With the config, you can change the scale by docker-compose scale phantomjs=2 processor=2 webui=4 when you need.

load balance

phantomjs-lb, fetcher-lb, webui-lb are automaticlly configed haproxy, allow any number of upstreams.

phantomjs

phantomjs have memory leak issue, memory limit applied, and it's recommended to restart it every hour.

fetcher

fetcher is implemented with aync IO, it supportes 100 concurrent connections. If the upstream queue are not choked, one fetcher should be enough.

processor

processor is CPU bound component, recommended number of instance is number of CPU cores + 1~2 or CPU cores * 10%~15% when you have more then 20 cores.

result-worker

If you didn't override result-worker, it only write results into database, and should be very fast.