Downloading and processing files and images¶
Scrapy provides reusable item pipelines for downloading files attached to a particular item (for example, when you scrape products and also want to download their images locally). These pipelines share a bit of functionality and structure (we refer to them as media pipelines), but typically you'll either use the Files Pipeline or the Images Pipeline.
Both pipelines implement these features:
- Avoid re-downloading media that was downloaded recently
- Specify where to store the media (filesystem directory, Amazon S3 bucket)
The Images Pipeline has a few extra functions for processing images:
- Convert all downloaded images to a common format (JPG) and mode (RGB)
- Thumbnail generation
- Check images width/height to make sure they meet a minimum constraint
The pipelines also keep an internal queue of those media URLs which are currently being scheduled for download, and connect those responses that arrive containing the same media to that queue. This avoids downloading the same media more than once when it's shared by several items.
Using the Files Pipeline¶
The typical workflow, when using the FilesPipeline, goes like this:
- In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field (see the sketch after this list).
- The item is returned from the Spider and goes to the item pipeline.
- When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains "locked" at that particular pipeline stage until the files have finished downloading (or failed for some reason).
- When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won't be present in the files field.
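For example, a minimal spider sketch for the first step; the start URL and link selector are assumptions, while file_urls is the pipeline's default field name:
import scrapy

class ExampleFilesSpider(scrapy.Spider):
    name = 'example_files'  # hypothetical spider name
    start_urls = ['http://www.example.com/downloads']  # assumed listing page

    def parse(self, response):
        # Put absolute URLs of the files to download into the default
        # file_urls field; the enabled FilesPipeline takes it from there.
        yield {
            'file_urls': [response.urljoin(href) for href in
                          response.css('a.download::attr(href)').extract()],
        }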
Using the Images Pipeline¶
Using the ImagesPipeline is a lot like using the FilesPipeline, except the default field names used are different: you use image_urls for the image URLs of an item and it will populate an images field with the information about the downloaded images.
The advantage of using the ImagesPipeline
for image files is that you can configure some extra functions like generating thumbnails and filtering the images based on their size.
The Images Pipeline uses Pillow for thumbnailing and normalizing images to JPEG/RGB format, so you need to install this library in order to use it. Python Imaging Library (PIL) should also work in most cases, but it is known to cause troubles in some setups, so we recommend using Pillow instead of PIL.
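Pillow is typically installed with pip:
pip install Pillow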
Enabling your Media Pipeline¶
To enable your media pipeline you must first add it to your project ITEM_PIPELINES
setting.
For Images Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For Files Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
Note
You can also use both the Files and Images Pipeline at the same time.
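If you enable both pipelines, give each its own order value, for example:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
    'scrapy.pipelines.images.ImagesPipeline': 2,
}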
Then, configure the target storage setting to a valid value that will be used for storing the downloaded media. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.
For the Files Pipeline, set the FILES_STORE
setting:
FILES_STORE = '/path/to/valid/dir'
For the Images Pipeline, set the IMAGES_STORE
setting:
IMAGES_STORE = '/path/to/valid/dir'
Supported Storage¶
File system is currently the only officially supported storage, but there is also support for storing files in Amazon S3.
File system storage¶
The files are stored using a SHA1 hash of their URLs for the file names.
For example, the following image URL:
http://www.example.com/image.jpg
Whose SHA1 hash is:
3afec3b4765f8f0a07b78f98c07b83f013567a0a
will be downloaded and stored in the following file:
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
Where:
- <IMAGES_STORE> is the directory defined in the IMAGES_STORE setting for the Images Pipeline.
- full is a sub-directory to separate full images from thumbnails (if used). For more info see Thumbnail generation for images.
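The stored file name can be reproduced with plain Python; a minimal sketch of the naming scheme described above:
import hashlib

url = 'http://www.example.com/image.jpg'
# the file name is the SHA1 hash of the byte-encoded URL
print(hashlib.sha1(url.encode('utf-8')).hexdigest())
# -> 3afec3b4765f8f0a07b78f98c07b83f013567a0a (per the example above)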
Amazon S3 storage¶
FILES_STORE
and IMAGES_STORE
can represent an Amazon S3 bucket. Scrapy will automatically upload the files to the bucket.
For example, this is a valid IMAGES_STORE
value:
IMAGES_STORE = 's3://bucket/images'
You can modify the Access Control List (ACL) policy used for the stored files, which is defined by the FILES_STORE_S3_ACL
and IMAGES_STORE_S3_ACL
settings. By default, the ACL is set to private
. To make the files publicly available use the public-read
policy:
IMAGES_STORE_S3_ACL = 'public-read'
For more information, see canned ACLs in the Amazon S3 Developer Guide.
Usage example¶
In order to use a media pipeline, first enable it.
Then, if a spider returns a dict with the URLs key (file_urls or image_urls, for the Files or Images Pipeline respectively), the pipeline will put the results under the respective key (files or images).
If you prefer to use Item, then define a custom item with the necessary fields, like in this example for the Images Pipeline:
import scrapy
class MyItem(scrapy.Item):
# ... other item fields ...
image_urls = scrapy.Field()
images = scrapy.Field()
If you want to use another field name for the URLs key or for the results key, it is also possible to override it.
For the Files Pipeline, set the FILES_URLS_FIELD and/or FILES_RESULT_FIELD settings:
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
For the Images Pipeline, set the IMAGES_URLS_FIELD and/or IMAGES_RESULT_FIELD settings:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
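For instance, a sketch tying custom field names to a custom item (photo_urls and photos are assumed names):
import scrapy

class ProductItem(scrapy.Item):
    photo_urls = scrapy.Field()  # custom URLs field
    photos = scrapy.Field()      # custom results field

# in settings.py:
IMAGES_URLS_FIELD = 'photo_urls'
IMAGES_RESULT_FIELD = 'photos'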
If you need something more complex and want to override the custom pipeline behaviour, see Extending the Media Pipelines.
If you have multiple image pipelines inheriting from ImagesPipeline and you want to have different settings in different pipelines, you can set setting keys preceded by the uppercase name of your pipeline class. E.g. if your pipeline is called MyPipeline and you want to have a custom IMAGES_URLS_FIELD, you define the setting MYPIPELINE_IMAGES_URLS_FIELD and your custom setting will be used.
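A settings sketch, assuming a MyPipeline subclass defined in myproject/pipelines.py:
ITEM_PIPELINES = {'myproject.pipelines.MyPipeline': 1}
MYPIPELINE_IMAGES_URLS_FIELD = 'photo_urls'  # applies to MyPipeline only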
Additional features¶
File expiration¶
The Image Pipeline avoids downloading files that were downloaded recently. To adjust this retention delay use the FILES_EXPIRES
setting (or IMAGES_EXPIRES
, in case of Images Pipeline), which specifies the delay in number of days:
# 120 days of delay for files expiration
FILES_EXPIRES = 120
# 30 days of delay for images expiration
IMAGES_EXPIRES = 30
The default value for both settings is 90 days.
If you have a pipeline that subclasses FilesPipeline and you'd like to have different settings for it, you can set setting keys preceded by the uppercase class name. E.g. given a pipeline class called MyPipeline, you can set the setting key:
MYPIPELINE_FILES_EXPIRES = 180
and the pipeline class MyPipeline will have its expiration time set to 180 days.
Thumbnail generation for images¶
The Images Pipeline can automatically create thumbnails of the downloaded images.
In order to use this feature, you must set IMAGES_THUMBS
to a dictionary where the keys are the thumbnail names and the values are their dimensions.
For example:
IMAGES_THUMBS = {
'small': (50, 50),
'big': (270, 270),
}
When you use this feature, the Images Pipeline will create thumbnails of each specified size with this format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:
- <size_name> is the one specified in the IMAGES_THUMBS dictionary keys (small, big, etc)
- <image_id> is the SHA1 hash of the image url
Example of image files stored using small and big thumbnail names:
<IMAGES_STORE>/full/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/small/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
<IMAGES_STORE>/thumbs/big/63bbfea82b8880ed33cdb762aa11fab722a90a24.jpg
The first one is the full image, as downloaded from the site.
Filtering out small images¶
When using the Images Pipeline, you can drop images which are too small, by specifying the minimum allowed size in the IMAGES_MIN_HEIGHT
and IMAGES_MIN_WIDTH
settings.
For example:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
Note
The size constraints don’t affect thumbnail generation at all.
It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum sizes will be saved. For the above example, images of sizes (105 x 105) or (105 x 200) or (200 x 105) will all be dropped because at least one dimension is shorter than the constraint.
By default, there are no size constraints, so all images are processed.
Extending the Media Pipelines¶
See here the methods that you can override in your custom Files Pipeline:
class scrapy.pipelines.files.FilesPipeline¶

get_media_requests(item, info)¶

As seen on the workflow, the pipeline will get the URLs of the images to download from the item. In order to do this, you can override the get_media_requests() method and return a Request for each file URL:

def get_media_requests(self, item, info):
    for file_url in item['file_urls']:
        yield scrapy.Request(file_url)
Those requests will be processed by the pipeline and, when they have finished downloading, the results will be sent to the item_completed() method, as a list of 2-element tuples. Each tuple will contain (success, file_info_or_error) where:
- success is a boolean which is True if the image was downloaded successfully or False if it failed for some reason
- file_info_or_error is a dict containing the following keys (if success is True) or a Twisted Failure if there was a problem:
  - url - the url where the file was downloaded from. This is the url of the request returned from the get_media_requests() method.
  - path - the path (relative to FILES_STORE) where the file was stored
  - checksum - an MD5 hash of the image contents

The list of tuples received by item_completed() is guaranteed to retain the same order of the requests returned from the get_media_requests() method.

Here's a typical value of the results argument:

[(True,
  {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
   'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
   'url': 'http://www.example.com/files/product1.pdf'}),
 (False,
  Failure(...))]
By default the get_media_requests() method returns None, which means there are no files to download for the item.
item_completed(results, item, info)¶

The FilesPipeline.item_completed() method is called when all file requests for a single item have completed (either finished downloading, or failed for some reason).

The item_completed() method must return the output that will be sent to subsequent item pipeline stages, so you must return (or drop) the item, as you would in any pipeline.

Here is an example of the item_completed() method where we store the downloaded file paths (passed in results) in the file_paths item field, and we drop the item if it contains no files:

from scrapy.exceptions import DropItem

def item_completed(self, results, item, info):
    file_paths = [x['path'] for ok, x in results if ok]
    if not file_paths:
        raise DropItem("Item contains no files")
    item['file_paths'] = file_paths
    return item

By default, the item_completed() method returns the item.
See here the methods that you can override in your custom Images Pipeline:
class scrapy.pipelines.images.ImagesPipeline¶

The ImagesPipeline is an extension of the FilesPipeline, customizing the field names and adding custom behavior for images.

get_media_requests(item, info)¶

Works the same way as the FilesPipeline.get_media_requests() method, but using a different field name for image urls. Must return a Request for every image URL.

item_completed(results, item, info)¶

The ImagesPipeline.item_completed() method is called when all image requests for a single item have completed (either finished downloading, or failed for some reason).

Works the same way as the FilesPipeline.item_completed() method, but using different field names for storing image downloading results.

By default, the item_completed() method returns the item.
Custom Images pipeline example¶
Here is a full example of the Images Pipeline whose methods are exemplified above:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
class MyImagesPipeline(ImagesPipeline):
def get_media_requests(self, item, info):
for image_url in item['image_urls']:
yield scrapy.Request(image_url)
def item_completed(self, results, item, info):
image_paths = [x['path'] for ok, x in results if ok]
if not image_paths:
raise DropItem("Item contains no images")
item['image_paths'] = image_paths
return item
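This pipeline would then be enabled like any other media pipeline; a settings sketch, assuming the class lives in myproject/pipelines.py:
ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'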