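To speed things up, the general progression is: single thread -> multithreading -> multiprocessing -> distributed. Combined with a well-tuned scheduling strategy, this lets you make full use of the available network resources. A simple multithreaded reference implementation using tomorrow looks like this: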
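import requests
from tomorrow import threads

@threads(5)  # run up to 5 calls concurrently in a thread pool
def download(url):
    return requests.get(url)

if __name__ == "__main__":
    # Each call returns a placeholder right away; the requests run in the background.
    responses = [download(f'http://www.aaa.com/{fid}.txt') for fid in range(1, 1001)]
    # Accessing .text blocks until the corresponding response is ready.
    html = [response.text for response in responses]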
tomorrow: see tomorrow3 on PyPI
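You can also refer to the crawler optimization section I covered earlier in my web search course: "13. Advanced Crawling: Multithreading, APIs, and Request Header Settings" on bilibili.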
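For comparison, the same multithreaded download pattern can probably be sketched with only the standard library, which may be handy if installing tomorrow is not an option. This is a minimal sketch, not the method from the answer above: the helper names `fetch_text` and `download_all` are hypothetical, and the 10-second timeout and UTF-8 decoding are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download one URL and return its body as text (hypothetical helper)."""
    # Assumption: the body is UTF-8; undecodable bytes are replaced, not raised.
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def download_all(urls, fetch=fetch_text, max_workers=5):
    """Fetch every URL with a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map schedules all calls and yields results in the order of `urls`.
        return list(pool.map(fetch, urls))
```

Calling `download_all([f'http://www.aaa.com/{fid}.txt' for fid in range(1, 1001)])` would mirror the example above with 5 worker threads. Note that `pool.map` re-raises the first exception from a failed download when its result is collected, so a real crawler would likely wrap `fetch` with error handling and retries.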