Contents
  1. Topic
  2. Project Overview
  3. Feature Description
  4. Code Listing
    1. 4.1. Crawler
    2. 4.2. Search
Data Structures Course Design: Experiment 6

Topic

Scraping Douban Movie Information

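Project Overview

  1. Data collection: use a crawler to gather at least 10,000 records from multiple websites

  2. Data organization: store the data in a defined format in external files (storing it in a database is not allowed)

  3. Data query: search within the large data set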
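
As a minimal sketch of the file-based storage requirement above, the records can be written to CSV text and read back without any database. The column names in `FIELDS` and the helpers `dump_rows` and `load_rows` are hypothetical and used only for illustration; an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical column names, chosen only for this illustration.
FIELDS = ["rank", "title", "rating"]


def dump_rows(rows):
    """Serialize a list of dicts to CSV text, header row first."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def load_rows(text):
    """Parse CSV text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


# A title containing a comma is quoted automatically by the csv module.
rows = [{"rank": "1", "title": "A, B", "rating": "9.7"}]
print(load_rows(dump_rows(rows)) == rows)  # True
```

Note that every value comes back as a string, so numeric columns such as the rating need an explicit `float()` conversion before sorting or comparing.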

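Feature Description

1. Scrape 15,000 Douban movie records into a .csv file

2. Sort the records by movie rating in descending order

3. The user enters a rating, and binary search returns the information for every movie with that rating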
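
The sort-then-search steps above can be sketched on a small in-memory list. This is an alternative formulation using the standard `bisect` module, not the hand-written loop from the code listing; the function name `find_by_rating` and the sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def find_by_rating(movies, rating):
    """Return all (title, rating) pairs whose rating equals `rating`.

    `movies` must already be sorted by rating in descending order.
    """
    # bisect expects ascending order, so search over the negated ratings.
    negated = [-r for _, r in movies]
    start = bisect_left(negated, -rating)   # first index with this rating
    end = bisect_right(negated, -rating)    # one past the last such index
    return movies[start:end]


# Hypothetical sample data, not taken from the scraped file.
sample = [("A", 9.0), ("B", 9.7), ("C", 9.0), ("D", 8.5)]
ranked = sorted(sample, key=lambda m: m[1], reverse=True)

print(find_by_rating(ranked, 9.0))  # [('A', 9.0), ('C', 9.0)]
print(find_by_rating(ranked, 7.0))  # []
```

Building the negated list takes linear time, so for repeated queries it would be computed once and reused; only the two `bisect` calls are logarithmic.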

Code Listing

Crawler

import csv

import requests
from lxml import etree

# Browser-like User-Agent; the same headers are sent with every request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
# CSV columns: rank, title, director/cast, rating, number of ratings, short comment.
fieldnames = ["电影排名", "电影名称", "演员表", "电影评分", "电影评论数", "电影短评"]

# The file is written as GB18030, so it must be read back with the same encoding.
with open("movie.csv", "w", encoding="GB18030", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()

    # Each list page holds 25 movies: start = 0, 25, ..., 225.
    for x in range(0, 226, 25):
        url = "https://movie.douban.com/top250?start={}&filter=".format(x)
        print("正在爬取--{}--网址的数据".format(url))  # "Scraping data from <url>"
        response = requests.get(url=url, headers=headers)
        html_obj = etree.HTML(response.text)
        div_list = html_obj.xpath('//div[@class="item"]')

        for div in div_list:
            rank = div.xpath('div[1]/em/text()')[0]
            movie_name = div.xpath('div[2]/div[1]/a/span/text()')
            director = div.xpath('div[2]/div[2]/p/text()')
            grade = div.xpath('div[2]/div[2]/div/span[2]/text()')[0]
            comment_num = div.xpath('div[2]/div[2]/div/span[4]/text()')[0].replace("人评价", "")

            # Some entries have no short comment; fall back to a placeholder ("no short comment").
            try:
                short_comment = div.xpath('div[2]/div[2]/p[2]/span/text()')[0]
            except IndexError:
                short_comment = "没有短评"

            # Join the title fragments and strip non-breaking spaces.
            movie_name_string = ''.join(movie.replace("\xa0", "") for movie in movie_name)
            # Join the director/cast text and strip newlines and all spaces.
            movie_director_string = ''.join(
                text.replace('\n', "").replace("\xa0", "").replace(" ", "") for text in director)

            writer.writerow({"电影排名": rank, "电影名称": movie_name_string,
                             "演员表": movie_director_string, "电影评分": grade,
                             "电影评论数": comment_num, "电影短评": short_comment})
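
To make the paging arithmetic in the crawler explicit, here is a small sketch that only builds the page URLs and sends no requests. The helper name `page_urls` and its parameters are hypothetical; the URL pattern is the one used in the listing above.

```python
def page_urls(total=250, page_size=25):
    """Build one URL per list page; each page starts at a multiple of page_size."""
    base = "https://movie.douban.com/top250?start={}&filter="
    return [base.format(start) for start in range(0, total, page_size)]


urls = page_urls()
print(len(urls))   # 10
print(urls[-1])    # https://movie.douban.com/top250?start=225&filter=
```

With the defaults this yields the same ten offsets as `range(0, 226, 25)`: 0, 25, ..., 225.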

Search

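import csv


def search(csv_file):
    """Load the movie rows and sort them by rating, highest first."""
    # Read with the same encoding the crawler used to write the file.
    with open(csv_file, "r", encoding="GB18030", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # Column 3 is the rating; compare numerically, not as strings.
        return sorted(reader, key=lambda row: float(row[3]), reverse=True)


def binary_search(rows, key):
    """Return every row whose rating equals key; rows must be sorted in descending order."""
    key = float(key)
    low, high = 0, len(rows) - 1
    while low <= high:
        mid = (low + high) // 2
        rating = float(rows[mid][3])
        if key < rating:
            low = mid + 1   # descending order: smaller ratings lie to the right
        elif key > rating:
            high = mid - 1  # larger ratings lie to the left
        else:
            # Expand in both directions to collect the whole run of equal ratings,
            # checking the bounds so the scan never runs off either end.
            start = end = mid
            while start > 0 and float(rows[start - 1][3]) == key:
                start -= 1
            while end < len(rows) - 1 and float(rows[end + 1][3]) == key:
                end += 1
            return rows[start:end + 1]
    return []  # no movie has this rating


# Prompt: "Enter a movie rating (integers as-is, decimals with one digit)".
key = input("请输入电影评分(若为整数则直接输入整数,若为小数则保留一位):")
movies = search('movie.csv')
for movie in binary_search(movies, key):
    print(movie)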
Author: Christopher Shen
Link: https://www.pasxsenger.com/2020/04/08/数据结构课程设计实验六/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.