Python web scraping with requests, plus a scraping example: looking up people on Sogou
1. Getting started with crawlers and requests
About requests:
requests supports HTTP keep-alive and connection pooling, cookie-based session persistence, file uploads, automatic decoding of response bodies, and automatic encoding of internationalized URLs and POST data. When you use a Session, persistent (keep-alive) connections are reused automatically.
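As a quick sketch of the session support mentioned above: a requests.Session reuses the underlying connection (keep-alive) and carries cookies between requests automatically. The URL below is just a placeholder; any site that sets a cookie would show the same behavior.

import requests

# A Session keeps the underlying TCP connection alive and persists
# cookies across all requests made through it.
session = requests.Session()

# First request: the server may set cookies on the session.
first = session.get("https://www.sogou.com/")
print(first.cookies)

# Second request: the stored cookies are sent back automatically,
# and the pooled connection is reused instead of reopening one.
second = session.get("https://www.sogou.com/")
print(second.status_code)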
# Import the module
import requests
# Target URL
url = "https://www.sogou.com/"
# Send the request and receive the response
response = requests.get(url=url)
# Read the response body as text
page_text = response.text
# Print it
print(page_text)
# Save it locally
with open("sogou.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
print("Done")
2. Scraping example: looking up people on Sogou
# Import the module and set the URL
import requests
url = "https://www.sogou.com/web?"
# Spoof a browser User-Agent so the request does not look like a script
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}
# Prompt for the search term
name = input("Enter a person's name: ")
# Build the query parameters to mimic a real request
param = {
    "type": "getpinyin",
    "query": name
}
# Send the request and receive the response
response = requests.get(url, params=param, headers=headers)
# Read the body as text
page_txt = response.text
# Save the page
filename = name + ".html"
with open(filename, "w", encoding="utf-8") as fp:
    fp.write(page_txt)
print("succeed")