Python Web Crawler Notes

[toc]

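Basics

Crawler: a program that imitates a client by sending network requests, receiving the responses, and extracting data from them according to a set of rules.

Differences between GET and POST

Where the data travels: GET puts the data in the request URL, where it shows up in the address bar, browser history, and server logs. POST carries it in the request body, out of the user's sight, which is why sign-up and login forms use POST. Neither method encrypts anything; that is the job of HTTPS.
How much data: GET is bounded by URL length limits, so it can only carry a small amount of data. POST can carry far more and can be used to upload large files.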
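
A minimal sketch of where the data ends up in each case, using only the standard library. The httpbin.org URLs and the form fields are placeholders chosen for illustration, and nothing is sent over the network; the requests are only constructed.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"user": "alice", "q": "python"}  # hypothetical form fields
encoded = urlencode(form)                # 'user=alice&q=python'

# GET: the data is appended to the URL as a query string.
get_req = Request("http://httpbin.org/get?" + encoded)

# POST: the same data travels in the request body, which must be bytes.
post_req = Request("http://httpbin.org/post", data=encoded.encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

The POST URL stays clean, while the GET URL exposes every field to anything that records URLs.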

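### Learning path

- Python syntax basics
- Python's built-in libraries for downloading pages, such as urllib and http
- Page-parsing tools: regular expressions (re), BeautifulSoup (bs4), XPath (lxml)
- Crawling a simple site to understand the end-to-end process
- Anti-crawling measures and how to handle them: headers, robots.txt, request intervals, proxy IPs, hidden form fields
- Crawling harder sites: login, cookies, dynamic pages
- Combining the crawler with a database to store the scraped data
- Using multithreading and multiprocessing to speed up crawling
- Crawler frameworks such as Scrapy and PySpider
- Distributed crawlers for very large data volumes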

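Python's built-in urllib library

urllib is part of Python's standard library. It can send requests to a server and retrieve web pages.

Python 3 merged Python 2's urllib and urllib2 into a single urllib package, which is split into the following modules:

urllib.request opens and reads URLs; the most important module
urllib.error defines the exceptions raised by urllib.request
urllib.parse parses URLs
urllib.robotparser parses robots.txt files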
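
The two helper modules can be tried without any network access. The sketch below is illustrative only: the example.com URLs, the crawler name `my-crawler`, and the robots.txt rules are made up, and the rules are passed in as a list of lines rather than downloaded.

```python
from urllib.parse import urlparse, parse_qs, urljoin
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into components and resolve relative links.
parts = urlparse("https://example.com/books/list?page=2&tag=python")
query = parse_qs(parts.query)
next_url = urljoin("https://example.com/books/list", "../authors/42")

print(parts.netloc, parts.path, query)
print(next_url)

# urllib.robotparser: feed robots.txt rules in directly instead of downloading them.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])

blocked = rules.can_fetch("my-crawler", "https://example.com/private/data")
allowed = rules.can_fetch("my-crawler", "https://example.com/books/list")
print(blocked, allowed)
```

In a real crawler, `set_url()` followed by `read()` would fetch the site's actual robots.txt before calling `can_fetch()`.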

Differences in urllib between Python 3 and Python 2

Using request

import urllib.request

response = urllib.request.urlopen('http://baidu.com')
# read() returns bytes; if the text looks garbled, decode it with response.read().decode('utf-8')
result = response.read()
print(result)  # print the HTML that was fetched

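The urlopen function

urlopen is a function in urllib.request that opens a URL. Commonly used parameters:
- url: the URL to open
- data: if given, the request is sent as POST; without it (the default) the request is GET. The value must be bytes.
- timeout: timeout in seconds

urlopen returns a file-like object that supports, among others:
- **read()**: read the response body (the HTML)
- **geturl()**: return the URL actually retrieved, useful for spotting redirects
- **info()**: return the meta-information, such as the HTTP headers
- **getcode()**: return the HTTP status code

The argument to urlopen can also be a Request object.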
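
The response methods can be explored offline. The sketch below assumes a `data:` URL as a stand-in for a real page: urllib handles that scheme itself, so no server is contacted. With an HTTP URL the same calls apply, and getcode() would additionally report the status code.

```python
import urllib.request

# A data: URL embeds the document in the URL itself (percent-encoded HTML here).
url = "data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"

with urllib.request.urlopen(url, timeout=5) as response:
    body = response.read()                           # bytes
    final_url = response.geturl()                    # unchanged: nothing redirected
    content_type = response.info()["Content-Type"]   # header lookup

html = body.decode("utf-8")
print(html)
print(final_url == url, content_type)
```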

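The Request class

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

Request is a class whose constructor takes the parameters that describe a request:

url and data play the same role as in urlopen.
headers holds the HTTP request headers and lets the crawler present itself as a browser. The main one is User-Agent (with a hyphen, not an underscore), which can be copied from a browser.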

Using Request

import urllib.request

# Paste a User-Agent string copied from your browser into the empty value.
headers = {'User-Agent': ''}
request = urllib.request.Request('http://python.org/', headers=headers)
response = urllib.request.urlopen(request)
result = response.read().decode('utf-8')
print(result)

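Using error

The error module contains two important exception classes, URLError and HTTPError.

The URLError class

def __init__(self, reason, filename=None):
    self.args = reason,
    self.reason = reason
    if filename is not None:
        self.filename = filename

URLError is a subclass of OSError. It adds no behavior of its own, but serves as the base class for the other exception types in urllib.error.

Its constructor takes a reason argument, so a URLError instance exposes the cause of the failure as its reason attribute.

The HTTPError class

def __init__(self, url, code, msg, hdrs, fp):
    self.code = code
    self.msg = msg
    self.hdrs = hdrs
    self.fp = fp
    self.filename = url

HTTPError is a subclass of URLError and is raised when the server answers with an HTTP error status.

An HTTPError instance exposes the status code, the headers, and so on.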

An example using both exception classes

import urllib.request
import urllib.error

try:
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; '
                             'Linux x86_64; rv:57.0) Gecko/20100101 '
                             'Firefox/57.0'}
    request = urllib.request.Request('http://python.org/', headers=headers)
    response = urllib.request.urlopen(request)
    result = response.read().decode('utf-8')
# HTTPError is a subclass of URLError, so it must be caught first;
# otherwise the URLError branch would swallow it.
except urllib.error.HTTPError as e:
    print('HTTP status code: ' + str(e.code))
except urllib.error.URLError as e:
    print('Reason for the error: ' + str(e.reason))
else:
    print('Request succeeded.')
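
Because HTTPError is a subclass of URLError, the order of the except clauses decides which branch runs. The sketch below shows this offline; the helper `describe` and the hand-built exceptions are hypothetical and exist only for the demonstration.

```python
import urllib.error

def describe(exc):
    """Classify an exception the way a crawler's error handler might."""
    try:
        raise exc
    except urllib.error.HTTPError as e:   # the more specific class comes first
        return f"HTTP error {e.code}"
    except urllib.error.URLError as e:    # everything else urllib reports
        return f"URL error: {e.reason}"

# Exceptions built by hand purely for demonstration.
not_found = urllib.error.HTTPError("http://example.com/missing", 404, "Not Found", {}, None)
unreachable = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe(not_found))    # HTTP error 404
print(describe(unreachable))  # URL error: connection refused
```

If the two except clauses were swapped, the 404 would be reported through the URLError branch and the status code would never be printed.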

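Setting headers

Some sites inspect incoming requests, so add fields to headers to make the crawler look like a browser. Commonly used fields:

User-Agent: identifies the client; copy it from the Request Headers panel in the browser's developer tools

Referer: where the request came from; it can be set to the site's own URL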

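Proxy settings

Some sites count how many requests an IP makes within a period and block it when the count gets too high. Proxy servers get around this: switch to a different proxy every so often.

# Create an opener from a ProxyHandler object with build_opener, then call
# opener.open() to open a URL; it works like urlopen().
# `request` is a URL string or a Request object, as in the earlier examples.
import urllib.request

proxy = {'http': '115.193.101.21:61234'}
proxy_handler = urllib.request.ProxyHandler(proxy)
opener = urllib.request.build_opener(proxy_handler)
response = opener.open(request)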
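
Rotating proxies could look roughly like the following sketch. The pool contents and the helper `build_proxy_opener` are hypothetical; real addresses would come from a proxy provider, and the opener is only built here, not used to send anything.

```python
import random
import urllib.request

# Hypothetical pool; replace with addresses that actually work.
PROXY_POOL = ["115.193.101.21:61234", "10.0.0.2:8080", "10.0.0.3:3128"]

def build_proxy_opener(pool):
    """Return an opener that sends HTTP traffic through one randomly chosen proxy."""
    address = random.choice(pool)
    handler = urllib.request.ProxyHandler({"http": address})
    return urllib.request.build_opener(handler), address

opener, chosen = build_proxy_opener(PROXY_POOL)
print("using proxy", chosen)
# opener.open(url) would now route the request through `chosen`.
```

Calling `build_proxy_opener` again after a batch of requests, or after a failure, switches to another address from the pool.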

Proxy IPs can be found on the following sites:

http://www.xicidaili.com/ http://www.66ip.cn/

http://www.mimiip.com/gngao/ http://www.kuaidaili.com/

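The requests library

requests is a third-party library for working with URL resources; it simplifies what urllib does.

Official documentation

Sending requests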

import requests

r = requests.get('https://api.github.com/events')
r = requests.post('http://httpbin.org/post', data = {'key':'value'})

GET requests with parameters

Pass a dict through the params keyword argument and requests appends its contents to the URL as a query string. A value may also be a list.

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)

POST requests with parameters

Usually it is enough to pass the data as a dict:

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)

The data parameter also accepts a list of tuples, which is especially useful when several form fields share the same key:

payload = [('key1', 'value1'), ('key1', 'value2')]
r = requests.post('http://httpbin.org/post', data=payload)
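
To see exactly what requests would send, a request can be prepared without being sent. This is a sketch for inspection only; it assumes the requests package is installed and reuses the httpbin.org URLs from above as placeholders.

```python
import requests

# GET: params are encoded into the query string; a list value repeats the key.
get_req = requests.Request(
    "GET", "http://httpbin.org/get",
    params={"key1": "value1", "key2": ["value2", "value3"]},
).prepare()
print(get_req.url)

# POST: a dict or a list of tuples is form-encoded into the request body.
post_req = requests.Request(
    "POST", "http://httpbin.org/post",
    data=[("key1", "value1"), ("key1", "value2")],
).prepare()
print(post_req.body)
print(post_req.headers["Content-Type"])
```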

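Response content

Text response content: text

r = requests.get('https://api.github.com/events')
r.text  # the content of the server's response, as text

Requests decodes the content automatically. If it picks the wrong encoding the text comes out garbled, in which case the encoding can be set explicitly:

r = requests.get('https://api.github.com/events')
r.encoding = 'utf-8'  # set the encoding used to decode the body
r.text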

Binary response content: content

r.content                # the raw response body as bytes, e.g. b'something'
r.content.decode()       # decode the bytes (UTF-8 by default)
r.content.decode('gbk')  # decode with an explicit encoding

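JSON response content

requests has a built-in JSON decoder for JSON content:

r = requests.get('https://api.github.com/events')
r.json()
# [{'repository': {'open_issues': 0, 'url': 'https://github.com/...

If decoding fails, an exception is raised.

Successful decoding does not mean the request itself succeeded; check the response with r.status_code or r.raise_for_status().

A successful response has status code 200, also available as requests.codes.ok.

If the request was a bad one, `Response.raise_for_status()` raises an exception: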

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404

>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error

Custom request headers

Just pass a dict to the headers parameter:

headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)

Timeouts

`timeout` is not a limit on the time taken to download the whole response; rather, an exception is raised if the server has not sent any response within `timeout` seconds.

```python
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
```

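Cookies

To send cookies with a request, put them in a dict and pass it through the cookies parameter:

cs = {'token': '12345', 'status': 'working'}
r = requests.get(url, cookies=cs)

Using a session

A Session automatically keeps the cookies from every request made through it. Once created, it is used just like the requests module:

session = requests.Session()
response = session.get('http:xxx')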
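
What a Session carries from one request to the next can be inspected without sending anything. This is a sketch under assumptions: the cookie value, the User-Agent string, and the httpbin.org URL are placeholders, and the cookie is set by hand instead of being received from a server.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "my-app/0.0.1"})  # sent with every request
session.cookies.set("token", "12345")                   # hypothetical cookie

# prepare_request() merges the session state into the request without sending it.
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

In normal use the cookie jar is filled by the server's Set-Cookie headers, for example after a login request, and later calls on the same session send those cookies back.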

retrying

session: cookies are saved automatically through the session

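Data extraction methods

JSON

json.loads converts a JSON string into a Python object such as a dict (json.load does the same for a file object)

json.dumps converts a dict into a JSON string

Parameters:

ensure_ascii: escape non-ASCII characters as \uXXXX (the default is True; pass False to keep them as-is)
indent: number of spaces used to indent each nesting level, which also puts each item on its own line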
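
A short round trip shows both functions and the effect of the two parameters. The sample record is invented for the example.

```python
import json

text = '{"title": "活着", "rating": 9.4, "tags": ["novel", "classic"]}'

book = json.loads(text)   # JSON string -> dict
print(book["title"], book["tags"][0])

escaped = json.dumps(book)                                 # non-ASCII becomes \uXXXX escapes
readable = json.dumps(book, ensure_ascii=False, indent=2)  # keep characters, pretty-print
print(escaped)
print(readable)
```

Both strings parse back to the same dict; the difference is only in how readable the saved file is.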

Scraping Baidu Translate results

The mobile version of the page is simpler.

The callback field serves no purpose; just delete it.

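XPath

XPath is a language for extracting information from XML documents, and it is very convenient to use.

Extraction:

Combine / and // to select elements by path
Use @attribute to extract an attribute
Use [@attribute='value'] to select elements by attribute value

Functions and operators:

starts-with() (ends-with() belongs to XPath 2.0 and is not available in lxml, which implements XPath 1.0)
contains()
and
text() returns the text content of an element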
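
The expressions above can be tried on a small snippet. This sketch assumes lxml is installed; the HTML and its class names are invented for the example.

```python
from lxml import etree

html = """
<div id="content">
  <a class="nav-link" href="https://example.com/a">First</a>
  <a class="nav-link active" href="/b">Second</a>
  <a class="footer" href="https://example.com/c">Third</a>
</div>
"""
tree = etree.HTML(html)

# Path plus attribute extraction.
all_hrefs = tree.xpath("//div[@id='content']/a/@href")
# Attribute equality selects elements; text() returns their text nodes.
footer_text = tree.xpath("//a[@class='footer']/text()")
# contains() and starts-with() combined with `and`.
absolute_nav = tree.xpath(
    "//a[contains(@class, 'nav-link') and starts-with(@href, 'https')]/text()"
)

print(all_hrefs)
print(footer_text)
print(absolute_nav)
```

Note that `@class='footer'` compares the whole attribute value, while `contains(@class, 'nav-link')` also matches elements carrying several classes.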

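Tips

Two-step extraction: select an element first, then extract from within that element. The second path must start with '.', otherwise the search starts again from the document root.

for page in res_xpath.xpath("//div[@id='content']//ul[@class='list-col list-col5 list-express slide-item']"):
    for books in page.xpath('./li'):
        # the leading dot makes the search start from the element selected above
        ...

With the XPath Helper extension installed you can copy an element's path. Avoid * in copied paths, since it is expensive; replace it with the concrete tag name.

//*[@id='content'] # copied result
//div[@id='content'] # after editing

If the result contains lots of spaces and line breaks, strip them with the string method strip().

XPath supports logical operators:

//button[@value='submit' or @name='tijiao']

When the extracted data is a list, index into it to get a specific element:

book['title'] = books.xpath("./div[@class='cover']/a/@title")[0]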
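
These tips can be put together in one small runnable sketch. It assumes lxml is installed; the markup is invented and much simpler than a real page.

```python
from lxml import etree

html = """
<div id="content">
  <ul class="list">
    <li><div class="cover"><a title=" Book One " href="/1">x</a></div></li>
    <li><div class="cover"><a title="Book Two" href="/2">y</a></div></li>
  </ul>
</div>
"""
tree = etree.HTML(html)

books = []
for item in tree.xpath("//div[@id='content']//ul[@class='list']/li"):
    # './' keeps the search inside the current <li>.
    titles = item.xpath("./div[@class='cover']/a/@title")
    if titles:  # xpath() returns a list, which may be empty
        books.append({"title": titles[0].strip()})

# Without the dot, '//' searches the whole document again, even when called on one <li>.
first_li = tree.xpath("//li")[0]
inside = len(first_li.xpath(".//a"))
everywhere = len(first_li.xpath("//a"))

print(books)
print(inside, everywhere)
```

Checking that the list is non-empty before indexing with [0] avoids an IndexError on pages where an item lacks the expected element.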

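Saving the scraped data

import json

# Method 1: save via json. indent pretty-prints across lines; ensure_ascii=False keeps non-ASCII text readable.
with open("douban.txt", 'w', encoding='utf-8') as f:
    f.write(json.dumps(result_dict, indent=2, ensure_ascii=False))

# Method 2: write each item directly, one per line
with open('douban.txt', 'a', encoding='utf-8') as f:
    for film in films:
        f.write(str(film))
        f.write('\n')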

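Regular expressions

Regular expressions match patterns in strings. There are a lot of rules, but they become familiar with use.

#### Regular expression rules

(.*?) is commonly used to capture text of any length, for example when extracting a URL. The ? makes the match non-greedy.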
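
The difference between the greedy and the non-greedy form shows up as soon as a line contains more than one link. The HTML fragment below is made up for the comparison.

```python
import re

html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

greedy = re.findall(r'href="(.*)"', html)
non_greedy = re.findall(r'href="(.*?)"', html)

print(greedy)      # one over-long match running to the last quote
print(non_greedy)  # ['http://example.com/a', 'http://example.com/b']
```

The greedy `.*` consumes as much as it can and only gives characters back until the final quote fits, so it swallows both links in a single match.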


The re module

Python ships with the re module, which provides regular expression support. The main functions are:

# returns a pattern object
re.compile(string[, flags])
# the matching functions
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

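pattern

A pattern is a compiled regular expression. It is created like this:

pattern = re.compile('str')

flag

flags set the matching mode. Their meanings are:

re.I (full name: IGNORECASE): ignore case (the full spelling is in parentheses, likewise below)
re.M (MULTILINE): multi-line mode; '^' and '$' also match at the start and end of each line
re.S (DOTALL): makes '.' match any character, including a newline
re.L (LOCALE): makes the predefined classes \w \W \b \B \s \S depend on the current locale (in Python 3 it works only with bytes patterns)
re.U (UNICODE): makes the predefined classes \w \W \b \B \s \S \d \D follow the Unicode character properties (already the default for str patterns in Python 3)
re.X (VERBOSE): verbose mode; the pattern may span multiple lines, whitespace is ignored, and comments are allowed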
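
The effect of the most commonly used flags can be checked on a two-line string. The sample text and the date pattern are chosen only for illustration.

```python
import re

text = "Hello\nworld"

first_only = re.findall(r"^\w+", text)             # '^' matches only at the very start
each_line = re.findall(r"^\w+", text, re.M)        # '^' matches at the start of every line
ignore_case = re.search(r"hello", text, re.I)      # case-insensitive
dot_default = re.search(r"Hello.world", text)      # '.' does not match the newline
dot_all = re.search(r"Hello.world", text, re.S)    # now it does

# re.X lets the pattern be laid out over several lines with comments.
date = re.compile(r"""
    (\d{4})  # year
    -
    (\d{2})  # month
""", re.X)
year_month = date.match("2013-05").groups()

print(first_only, each_line)
print(ignore_case.group(), dot_default, dot_all.group() == text)
print(year_month)
```

Several flags can be combined with the | operator, for example re.I | re.S.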

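re.match(pattern, string[, flags])

Matches from the beginning of the string. On success it returns a match object, otherwise None. Matching stops where the pattern ends, so the rest of the string is not examined.

pattern = re.compile(r'hello')

result = re.match(pattern, 'hello')        # -> matches 'hello'
result = re.match(pattern, 'hell')         # -> None
result = re.match(pattern, 'hello world')  # -> matches 'hello'

re.search(pattern, string[, flags])

Scans the string for a match starting at any position. On success it returns a match object, otherwise None.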
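
A side-by-side sketch of the two functions on the same string:

```python
import re

text = "hello world"

at_start = re.match(r"world", text)     # None: 'world' is not at the start
found = re.search(r"world", text)       # found further along the string
print(at_start)
print(found.group(), found.span())      # world (6, 11)

# Because a failed match returns None, test the result before using it.
m = re.match(r"hello", text)
greeting = m.group() if m else None
print(greeting)                         # hello
```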

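re.split(pattern, string[, maxsplit])

Splits string at every substring that matches the pattern and returns the pieces as a list.

maxsplit sets the maximum number of splits; if omitted, the string is split at every match.

pattern = re.compile(r'\d+')  # \d matches a digit
re.split(pattern, 'one1two2three3four4')
# -> ['one', 'two', 'three', 'four', '']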
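
With maxsplit set, the remainder of the string is returned unsplit as the last element:

```python
import re

parts = re.split(r"\d+", "one1two2three3four4", maxsplit=2)
print(parts)  # ['one', 'two', 'three3four4']
```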

re.findall(pattern, string[, flags])

Searches string and returns all matching substrings as a list.

pattern = re.compile(r'\d+')
re.findall(pattern, 'one1two2three3four4')
# -> ['1', '2', '3', '4']

re.finditer(pattern, string[, flags])

Similar to findall(), but returns an iterator of match objects instead of a list.

pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print(m.group(), end=' ')

### Output ###
# 1 2 3 4
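
re.sub and re.subn appear in the function list but are not described. As a brief sketch: sub replaces every match with repl, count limits the number of replacements, and subn additionally reports how many replacements were made.

```python
import re

text = "one1two2three3"

replaced = re.sub(r"\d+", "-", text)
first_only = re.sub(r"\d+", "-", text, count=1)
with_count = re.subn(r"\d+", "-", text)

# repl may also be a function that receives each match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(replaced)     # one-two-three-
print(first_only)   # one-two2three3
print(with_count)   # ('one-two-three-', 3)
print(doubled)      # one2two4three6
```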

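Match object methods

The re functions above return one of two things: a match object or a list of matches.

Commonly used match object methods are group() and groups(), plus the position-related start(), end(), and span().

group([group1, ...])

With no argument it returns the entire match; with a number it returns the text captured by that parenthesized group in the pattern.

pattern = re.compile(r'(\w+)\+(\w+)')

re.match(pattern, '我12345+abcde').group()   # -> '我12345+abcde'
re.match(pattern, '我12345+abcde').group(1)  # -> '我12345'
re.match(pattern, '我12345+abcde').group(2)  # -> 'abcde'

Groups give you not only the whole matched string but also specific parts of it.

groups()

Returns a tuple containing all captured groups, or an empty tuple if the pattern has no groups. (When the match itself fails, the re function returns None, so there is no match object to call groups() on.)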
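
The position methods and groups() can be seen on the same pattern used for group():

```python
import re

m = re.match(r"(\w+)\+(\w+)", "我12345+abcde")

print(m.groups())            # ('我12345', 'abcde')
print(m.start(), m.end())    # 0 12
print(m.span(2))             # (7, 12)

# A pattern without parentheses has no groups.
no_groups = re.match(r"\w+", "abc").groups()
print(no_groups)             # ()
```

start(), end(), and span() take an optional group number, so the position of each captured part can be recovered as well.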

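The HTML in the response may differ from what the browser's Elements panel shows; treat the HTML response as authoritative.

Tips

First check whether the request actually returns the data you want; look at it in the Preview tab.
Use XPath for static pages; for dynamic pages, fetch the JSON.
The mobile version of a site is simpler and easier to crawl.
Parameters of the request that fetches the JSON can be modified.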

Dong