Python & Web

## Screen Scraping

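- Screen scraping is the process of having a program download a web page and extract information from it. It is the technique to reach for when your program needs information that is published on a live web page. It is especially useful when the page is dynamic, that is, when its content keeps changing; otherwise you would have to download the page and pull the information out by hand every time it changed. Example: fetch the page's HTML source with urllib, then extract the information with a regular expression. Plain urllib-plus-regex scraping has many problems, because real-world HTML is often malformed and a pattern can match text inside tags or scripts. Two better approaches are to run the page through **Tidy** (usable from Python) to clean it up and then parse the resulting XHTML, or to use the **Beautiful Soup** library, which was designed specifically for screen scraping.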
```
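# Count how many times the word "google" appears on Google's home page.
import re
from urllib import request

pattern = re.compile('google')
# read() fetches the whole page; readline() would return only its first line.
raw = request.urlopen('http://www.google.com').read()
text = raw.decode('utf-8', errors='replace')
# len() is the number of matches; __sizeof__() would report the list's size in bytes.
print(len(pattern.findall(text)))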
```
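To illustrate why a parser-based approach tends to be more reliable than a regular expression over raw HTML, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in for Tidy or Beautiful Soup, not either of those libraries themselves. The `VisibleTextParser` class, the `count_word_in_visible_text` helper, and the `SAMPLE_HTML` string are all hypothetical names made up for this example, and the page is an inline string so nothing is downloaded.

```python
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_word_in_visible_text(html, word):
    """Count case-insensitive occurrences of word in text nodes only."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = ' '.join(parser.chunks)
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))


# Hypothetical sample page: the word also appears in a script and in a URL.
SAMPLE_HTML = """
<html><head><title>Google</title>
<script>var site = "google";</script></head>
<body><a href="http://www.google.com/about">About Google</a></body></html>
"""

# The raw regex also counts the matches inside the script and the href.
raw_count = len(re.findall('google', SAMPLE_HTML, flags=re.IGNORECASE))
visible_count = count_word_in_visible_text(SAMPLE_HTML, 'google')
print(raw_count, visible_count)  # 4 2
```

The raw regular expression counts every occurrence, including the one in the script and the one in the link URL, while the parser-based count only sees the title and the link text. A library such as Beautiful Soup gives the same kind of structured access with much less code, and copes better with badly formed markup.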
