Python Requests库编码问题


Python Requests库判断网页内容编码是下面这些策略,按优先节排列:

  1. 使用响应头里面'content-type'字段里面的'charset'填写的内容
  2. 使用 `chardet` 这个库,通过网页内容来猜测编码


def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
         return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'



def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    return chardet.detect(self.content)['encoding']

在策略1里面使用'ISO-8859-1'这个编码,逻辑比较奇怪,这个在github也有人问过这个问题 issue. 不过这个策略在RFC里面是有规定的

github comment:

RFC 2616 is very clear here: if no encoding is declared in the Content-Type header, the encoding for text/html is assumed to be ISO-8859-1. If you know better, you are encouraged to either decode Response.content yourself or to set Response.encoding to the relevant encoding.

RFC 2616:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

上面这些策略可能出现的问题以这个 网页为例。

In [1]: url = ''

In [2]: import requests

In [3]: r = requests.get(url)

In [4]: r
Out[4]: <Response [200]>

In [5]: r.encoding
Out[5]: 'ISO-8859-1'

In [6]: from bs4 import BeautifulSoup

In [7]: bs = BeautifulSoup(r.text)

In [8]: bs.find('title')

In [9]: bs = BeautifulSoup(r.content)

In [10]: bs.find('title')
Out[10]: <title>新闻酸菜馆</title>

In [11]: r.headers
Out[11]: {'Content-Type': 'text/xml', 'Content-Encoding': 'gzip', 'Last-Modified': 'Wed, 17 Jan 2018 06:51:40 GMT', 'Accept-Ranges': 'bytes', 'ETag': '"0666ca05f8fd31:0"', 'Vary': 'Accept-Encoding', 'Server': 'Microsoft-IIS/8.5', 'X-Powered-By': 'ASP.NET', 'Date': 'Fri, 19 Jan 2018 03:54:29 GMT', 'Content-Length': '16782'}