Compare commits
12 commits: 130999364f ... d64bf93988

SHA1 (newest first): d64bf93988, 83d1b21686, 7b16c384d3, e04a611dbc, 1856f3e9fc, 89909d2781, ac98ac0057, 4994310f14, 31d0525cd0, c618528a0a, 5e396796ca, baea50bfa0

BUG_FIXES_SUMMARY.md (new file, 198 lines)
@@ -0,0 +1,198 @@

# Crawler Bug Fix Summary

## List of Fixed Issues

### 1. Xinhuanet (新华网): article content not saved

**Problem**: Article content crawled from Xinhuanet was not being saved correctly.

**Fix**:

- Updated the article-structure recognition logic and added more content selectors
- Fixed the article-page detection logic
- Added support for Xinhuanet-specific HTML structures

### 2. Chinese government website (中国政府网): duplicate title

**Problem**: After an article was crawled, its detail view showed two titles.

**Fix** (see the sketch below):

- Improved title extraction to prefer the h1 tag with class="title"
- Improved the title de-duplication logic
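The commit itself does not include the extraction code in this document, so the following is only a minimal sketch of the kind of logic the two bullets describe, assuming BeautifulSoup; the function names `extract_title` and `strip_duplicate_title` are illustrative, not the project's actual API.

```python
from bs4 import BeautifulSoup


def extract_title(html: str) -> str:
    """Prefer <h1 class="title">, then any <h1>, then the <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    for candidate in (soup.find("h1", class_="title"), soup.find("h1"), soup.find("title")):
        if candidate and candidate.get_text(strip=True):
            return candidate.get_text(strip=True)
    return ""


def strip_duplicate_title(content_html: str, title: str) -> str:
    """Remove heading tags inside the article body that repeat the extracted title."""
    soup = BeautifulSoup(content_html, "html.parser")
    for tag in soup.find_all(["h1", "h2"]):
        if tag.get_text(strip=True) == title.strip():
            tag.decompose()
    return str(soup)
```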
### 3. People's Daily Online (人民网): garbled text and 404s

**Problem**: Crawled articles were garbled, some requests returned 404, and videos were not downloaded.

**Fix**:

- Added a site-specific request-header configuration
- Fixed the encoding problem by forcing UTF-8
- Improved error handling
- Improved the video download logic

### 4. CCTV (央视网): videos not saved

**Problem**: Videos from CCTV were not downloaded and saved correctly.

**Fix**:

- Added support for video source attributes such as data-src and data-url
- Added CCTV-specific video handling
- Improved error handling and logging for video downloads

### 5. Qiushi (求是网): duplicate title

**Problem**: The article detail view showed two titles.

**Fix**:

- Improved the title extraction logic
- Improved the title de-duplication logic

### 6. PLA Daily (解放军报): category pages crawled as articles

**Problem**: Category listing pages were crawled along with the articles.

**Fix**:

- Improved the article-page detection logic
- Improved content-area recognition

### 7. Guangming Daily (光明日报): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 8. China Daily (中国日报): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 9. Workers' Daily (工人日报): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 10. Science and Technology Daily (科技日报): could not be crawled

**Problem**: Articles could not be crawled at all.

**Fix**:

- Updated the article-structure recognition logic
- Improved the article-page detection logic

### 11. CPPCC Daily (人民政协报): crawl errors

**Problem**: Errors occurred during crawling.

**Fix**:

- Improved error handling
- Improved article-structure recognition

### 12. China Discipline Inspection and Supervision News (中国纪检监察报): could not be crawled

**Problem**: Articles could not be crawled at all.

**Fix**:

- Updated the article-structure recognition logic
- Improved the article-page detection logic

### 13. China News Service (中国新闻社): non-article content crawled

**Problem**: Parts of pages that were not article content were crawled.

**Fix**:

- Improved the article-page detection logic
- Improved content-area recognition

### 14. Study Times (学习时报): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 15. China Youth Daily (中国青年报): could not be crawled

**Problem**: Articles could not be crawled at all.

**Fix**:

- Updated the article-structure recognition logic
- Improved the article-page detection logic

### 16. China Women's News (中国妇女报): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 17. Legal Daily (法治日报): could not be crawled

**Problem**: Articles could not be crawled at all.

**Fix**:

- Updated the article-structure recognition logic
- Improved the article-page detection logic

### 18. Farmers' Daily (农民日报): article body not crawled

**Problem**: The article body text was not crawled correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 19. Xuexi Qiangguo (学习强国): could not be crawled

**Problem**: Articles could not be crawled at all.

**Fix**:

- Updated the article-structure recognition logic
- Improved the article-page detection logic

### 20. Qizhi (旗帜网): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body

### 21. China.com.cn (中国网): article content not saved

**Problem**: Article content was not saved correctly.

**Fix**:

- Added more content selectors
- Added support for site-specific classes such as article-body
## Main Fixes

### 1. Article-structure recognition

- Added more precise title and content selectors for each site (see the sketch below)
- Added support for a wider range of HTML structures
- Tuned the priority order of the selectors
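As a rough illustration of a prioritized selector list, the sketch below tries each selector in order and keeps the first non-empty match. The selector strings and the `extract_content` helper are assumptions for this example, not the crawler's real configuration.

```python
from bs4 import BeautifulSoup

# Example priority order only; each site uses its own list.
CONTENT_SELECTORS = [
    "div.article-body",   # the article-body class mentioned in the fixes above
    "div#detail_content",
    "div.content",
    "article",
]


def extract_content(html: str) -> str:
    """Return the HTML of the first selector that matches non-empty content."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in CONTENT_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return str(node)
    return ""
```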
### 2. Article-page detection

- Improved the logic that decides whether a page is an article (see the sketch below)
- Added URL path-pattern checks
- Improved page-type classification
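A URL path-pattern check of the kind described could look like the following sketch. The regular expressions are examples of common article-URL shapes (date segments, numeric content IDs), not the patterns actually shipped in this commit.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns; real sites differ.
ARTICLE_PATH_PATTERNS = [
    re.compile(r"/20\d{2}[-/]?\d{2}[-/]?\d{2}/"),  # date segment, e.g. /2025-08/15/
    re.compile(r"/content_\d+\.htm"),              # numeric content id
    re.compile(r"/c\d{4,}/"),                      # channel/article id
]


def looks_like_article(url: str) -> bool:
    """Heuristic: treat a URL as an article page if its path matches a known pattern."""
    path = urlparse(url).path
    return any(pattern.search(path) for pattern in ARTICLE_PATH_PATTERNS)
```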
### 3. Encoding and request handling

- Fixed the garbled text on People's Daily Online (see the sketch below)
- Added site-specific request headers
- Improved error handling
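The following is a minimal sketch of a fetch helper that forces UTF-8 decoding and sends custom headers, using the requests library. The header values and the `fetch_html` name are assumptions for illustration.

```python
import requests

# Illustrative header set; the real configuration lives in the crawler code.
PEOPLE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.people.com.cn/",
    "Accept-Language": "zh-CN,zh;q=0.9",
}


def fetch_html(url: str, timeout: int = 15) -> str:
    resp = requests.get(url, headers=PEOPLE_HEADERS, timeout=timeout)
    resp.raise_for_status()
    # Force a sensible decoding when the declared charset is missing or wrong,
    # which is what produced the garbled text described above.
    resp.encoding = resp.apparent_encoding or "utf-8"
    return resp.text
```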
### 4. Video download

- Added support for multiple video source attributes (see the sketch below)
- Added CCTV-specific video handling
- Improved error handling for video downloads
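A sketch of collecting video URLs from several source attributes follows; the attribute names come from the fix description (src, data-src, data-url), while `data-video` and the helper name are illustrative additions.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

VIDEO_ATTRS = ("src", "data-src", "data-url", "data-video")


def collect_video_urls(html: str, base_url: str) -> list:
    """Gather absolute video URLs from <video> and <source> tags."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag in soup.find_all(["video", "source"]):
        for attr in VIDEO_ATTRS:
            value = tag.get(attr)
            if value:
                urls.append(urljoin(base_url, value))
    return urls
```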
### 5. URL configuration updates

- Switched some site URLs from HTTP to HTTPS
- Verified that the correct domains and protocols are used

## Technical Improvements

### 1. Error handling

- Added more thorough exception handling
- Improved error logging
- Added a retry mechanism (see the sketch below)
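As an example of the retry idea, a small helper that retries transient network failures with a growing delay could look like this; the function name and parameters are illustrative.

```python
import time

import requests


def get_with_retry(url: str, retries: int = 3, backoff: float = 2.0, timeout: int = 15) -> requests.Response:
    """Retry transient network errors, waiting longer between attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(backoff * (attempt + 1))
    raise last_error
```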
### 2. Content recognition

- Added more content selectors
- Tuned selector priority
- Added support for unusual HTML structures

### 3. Media handling

- Improved the image and video download logic
- Added support for more media sources
- Improved how media files are saved

### 4. Performance

- Tuned request timeouts
- Optimized encoding handling
- Reduced unnecessary requests

## Testing Suggestions

1. **Per-site tests**: test each fixed site individually
2. **Batch test**: run the batch crawl command against all sites
3. **Content check**: verify that crawled article content is complete
4. **Media check**: confirm that images and videos are downloaded correctly
5. **Error monitoring**: watch the error logs while crawling

## Follow-up Suggestions

1. **Dynamic adaptation**: consider a mechanism that adapts automatically when a site's structure changes
2. **Smarter recognition**: use machine-learning techniques to improve content-recognition accuracy
3. **Anti-bot handling**: add more sophisticated ways to work around anti-crawling measures
4. **Performance monitoring**: add performance monitoring and statistics
5. **Content quality**: add content quality checks and filtering

CRAWLER_README.md (new file, 179 lines)
@@ -0,0 +1,179 @@

# Central State Media Crawler System

This project is a Django-based crawler system for China's central state media. It supports crawling 18 central media outlets together with their sub-sites, mobile clients, and new-media platforms.

## Supported Media

### 18 central media outlets

1. **People's Daily (人民日报)** - People's Daily Online, the People's Daily app, and the newspaper edition
2. **Xinhua News Agency (新华社)** - Xinhuanet, the Xinhuanet main site, and the mobile site
3. **China Media Group (中央广播电视总台)** - CCTV.com, CCTV News, and the mobile site
4. **Qiushi (求是)** - Qiushi Online and its mobile site
5. **PLA Daily (解放军报)** - PLA Daily and its mobile site
6. **Guangming Daily (光明日报)** - Guangming Daily and its mobile site
7. **Economic Daily (经济日报)** - Economic Daily and its mobile site
8. **China Daily (中国日报)** - China Daily and its mobile site
9. **Workers' Daily (工人日报)** - Workers' Daily and its mobile site
10. **Science and Technology Daily (科技日报)** - Science and Technology Daily and its mobile site
11. **CPPCC Daily (人民政协报)** - CPPCC Daily and its mobile site
12. **China Discipline Inspection and Supervision News (中国纪检监察报)** - the newspaper and its mobile site
13. **China News Service (中国新闻社)** - China News Service and its mobile site
14. **Study Times (学习时报)** - Study Times and its mobile site
15. **China Youth Daily (中国青年报)** - China Youth Daily and its mobile site
16. **China Women's News (中国妇女报)** - China Women's News and its mobile site
17. **Legal Daily (法治日报)** - Legal Daily and its mobile site
18. **Farmers' Daily (农民日报)** - Farmers' Daily and its mobile site

### Special platforms

19. **Xuexi Qiangguo (学习强国)** - central media study accounts and provincial-level (and above) study platforms
20. **Qizhi (旗帜网)** - the Qizhi site and its mobile site
21. **China.com.cn (中国网)** - the main site plus one provincial channel (second-level sub-sites are not followed)

## Usage

### 1. Crawl a single media outlet

```bash
# Crawl all People's Daily platforms
python manage.py crawl_rmrb

# Crawl a specific People's Daily platform
python manage.py crawl_rmrb --platform peopleapp  # app only
python manage.py crawl_rmrb --platform people     # People's Daily Online only
python manage.py crawl_rmrb --platform paper      # newspaper edition only

# Crawl all Xinhua platforms
python manage.py crawl_xinhua

# Crawl all CCTV platforms
python manage.py crawl_cctv
```

### 2. Crawl all media in one batch

```bash
# Crawl every central media outlet
python manage.py crawl_all_media

# Crawl selected media only
python manage.py crawl_all_media --media rmrb,xinhua,cctv

# Crawl a specific platform type
python manage.py crawl_all_media --platform web     # websites only
python manage.py crawl_all_media --platform mobile  # mobile sites only
```

### 3. Export article data

```bash
# Export all articles as JSON
python manage.py export_articles --format json

# Export articles from one website as CSV
python manage.py export_articles --format csv --website "人民日报客户端"

# Export as Word documents (including media files)
python manage.py export_articles --format docx --include-media

# Export as a ZIP package (article data plus media files)
python manage.py export_articles --format json --include-media
```

## Available Crawler Commands

| Command | Media outlet | Description |
|------|----------|------|
| `crawl_rmrb` | People's Daily | People's Daily Online, app, and newspaper |
| `crawl_xinhua` | Xinhua News Agency | Xinhuanet, main site, and mobile site |
| `crawl_cctv` | China Media Group | CCTV.com, CCTV News, and mobile site |
| `crawl_qiushi` | Qiushi | Qiushi Online and mobile site |
| `crawl_pla` | PLA Daily | PLA Daily and mobile site |
| `crawl_gmrb` | Guangming Daily | Guangming Daily and mobile site |
| `crawl_jjrb` | Economic Daily | Economic Daily and mobile site |
| `crawl_chinadaily` | China Daily | China Daily and mobile site |
| `crawl_grrb` | Workers' Daily | Workers' Daily and mobile site |
| `crawl_kjrb` | Science and Technology Daily | Science and Technology Daily and mobile site |
| `crawl_rmzxb` | CPPCC Daily | CPPCC Daily and mobile site |
| `crawl_zgjwjc` | China Discipline Inspection and Supervision News | the newspaper and mobile site |
| `crawl_chinanews` | China News Service | China News Service and mobile site |
| `crawl_xxsb` | Study Times | Study Times and mobile site |
| `crawl_zgqnb` | China Youth Daily | China Youth Daily and mobile site |
| `crawl_zgfnb` | China Women's News | China Women's News and mobile site |
| `crawl_fzrb` | Legal Daily | Legal Daily and mobile site |
| `crawl_nmrb` | Farmers' Daily | Farmers' Daily and mobile site |
| `crawl_xuexi` | Xuexi Qiangguo | central media study accounts and provincial platforms |
| `crawl_qizhi` | Qizhi | Qizhi site and mobile site |
| `crawl_china` | China.com.cn | main site plus one provincial channel |
| `crawl_all_media` | All media | batch-crawl every central media outlet |

## Platform Options

Every crawler command supports the following platform options:

- `all` (default): crawl every platform
- `web`: crawl only the website version
- `mobile`: crawl only the mobile version
- specific platforms: individual media may offer extra platform names

## Export Formats

The following export formats are supported:

- `json`: JSON, convenient for further processing
- `csv`: CSV, convenient for opening in Excel
- `docx`: Word documents with formatted article content

## Media File Handling

The system automatically downloads the images and videos referenced in an article and stores them in the local media directory; exports can optionally include these files. A sketch of such a download step is shown below.
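The following is a minimal sketch of downloading one media file into Django's MEDIA_ROOT, assuming the requests library; the helper name `download_media` and the `articles` subdirectory are illustrative, not the project's actual implementation.

```python
import os
from urllib.parse import urlparse

import requests
from django.conf import settings


def download_media(url: str, subdir: str = "articles") -> str:
    """Download one image or video into MEDIA_ROOT and return its relative path."""
    filename = os.path.basename(urlparse(url).path) or "media.bin"
    rel_path = os.path.join(subdir, filename)
    abs_path = os.path.join(settings.MEDIA_ROOT, rel_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(abs_path, "wb") as fh:
        fh.write(resp.content)
    return rel_path
```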
## Notes

1. **Crawl rate**: throttle the crawl rate to avoid putting excessive load on the target sites
2. **Storage**: crawled data is stored in the Django database, so make sure enough disk space is available
3. **Network environment**: some sites may only be reachable from particular network environments
4. **Anti-bot measures**: some sites employ anti-crawling mechanisms and may require adjusted crawl strategies

## Technical Features

- **Smart recognition**: automatically identifies article pages and content areas
- **Media download**: automatically downloads the images and videos in an article
- **De-duplication**: avoids crawling the same article twice
- **Error handling**: thorough error handling and logging
- **Extensible**: new media sites are easy to add

## Dependencies

- Django 3.0+
- requests
- beautifulsoup4
- python-docx (for Word export)
- Pillow (for image handling)

## Installing Dependencies

```bash
pip install -r requirements.txt
```

## Database Migration

```bash
python manage.py makemigrations
python manage.py migrate
```

## Running the Crawler

```bash
# Start the Django server
python manage.py runserver

# Run the crawler
python manage.py crawl_all_media
```

## Viewing Results

After a crawl finishes, the collected articles can be viewed through the Django admin or exported with the export command.

IMPLEMENTATION_SUMMARY.md (new file, 285 lines)
@@ -0,0 +1,285 @@

# Central State Media Crawler System: Implementation Summary

## Project Overview

This project implements a crawler system covering 18 central state media outlets together with their sub-sites, mobile clients, and new-media platforms. The system is built on the Django framework and is designed to be extensible and stable.

## Implemented Media

### 18 central media outlets

1. **People's Daily** (`crawl_rmrb.py`)
   - People's Daily Online (http://www.people.com.cn)
   - People's Daily app (https://www.peopleapp.com)
   - People's Daily newspaper (http://paper.people.com.cn)

2. **Xinhua News Agency** (`crawl_xinhua.py`)
   - Xinhuanet (https://www.news.cn)
   - Xinhuanet main site (http://www.xinhuanet.com)
   - Xinhua mobile site (https://m.xinhuanet.com)

3. **China Media Group** (`crawl_cctv.py`)
   - CCTV.com (https://www.cctv.com)
   - CCTV News (https://news.cctv.com)
   - CCTV mobile site (https://m.cctv.com)

4. **Qiushi** (`crawl_qiushi.py`)
   - Qiushi Online (http://www.qstheory.cn)
   - Qiushi mobile site (http://m.qstheory.cn)

5. **PLA Daily** (`crawl_pla.py`)
   - PLA Daily (http://www.81.cn)
   - PLA Daily mobile site (http://m.81.cn)

6. **Guangming Daily** (`crawl_gmrb.py`)
   - Guangming Daily (https://www.gmw.cn)
   - Guangming Daily mobile site (https://m.gmw.cn)

7. **Economic Daily** (`crawl_jjrb.py`)
   - Economic Daily (https://www.ce.cn)
   - Economic Daily mobile site (https://m.ce.cn)

8. **China Daily** (`crawl_chinadaily.py`)
   - China Daily (https://www.chinadaily.com.cn)
   - China Daily mobile site (https://m.chinadaily.com.cn)

9. **Workers' Daily** (`crawl_grrb.py`)
   - Workers' Daily (http://www.workercn.cn)
   - Workers' Daily mobile site (http://m.workercn.cn)

10. **Science and Technology Daily** (`crawl_kjrb.py`)
    - Science and Technology Daily (http://digitalpaper.stdaily.com)
    - mobile site (http://m.stdaily.com)

11. **CPPCC Daily** (`crawl_rmzxb.py`)
    - CPPCC Daily (http://www.rmzxb.com.cn)
    - mobile site (http://m.rmzxb.com.cn)

12. **China Discipline Inspection and Supervision News** (`crawl_zgjwjc.py`)
    - main site (http://www.jjjcb.cn)
    - mobile site (http://m.jjjcb.cn)

13. **China News Service** (`crawl_chinanews.py`)
    - main site (https://www.chinanews.com.cn)
    - mobile site (https://m.chinanews.com.cn)

14. **Study Times** (`crawl_xxsb.py`)
    - main site (http://www.studytimes.cn)
    - mobile site (http://m.studytimes.cn)

15. **China Youth Daily** (`crawl_zgqnb.py`)
    - main site (https://www.cyol.com)
    - mobile site (https://m.cyol.com)

16. **China Women's News** (`crawl_zgfnb.py`)
    - main site (http://www.cnwomen.com.cn)
    - mobile site (http://m.cnwomen.com.cn)

17. **Legal Daily** (`crawl_fzrb.py`)
    - main site (http://www.legaldaily.com.cn)
    - mobile site (http://m.legaldaily.com.cn)

18. **Farmers' Daily** (`crawl_nmrb.py`)
    - main site (http://www.farmer.com.cn)
    - mobile site (http://m.farmer.com.cn)

### Special platforms

19. **Xuexi Qiangguo** (`crawl_xuexi.py`)
    - Xuexi Qiangguo main site (https://www.xuexi.cn)
    - central media study accounts and provincial-level (and above) study platforms

20. **Qizhi** (`crawl_qizhi.py`)
    - Qizhi site (http://www.qizhiwang.org.cn)
    - Qizhi mobile site (http://m.qizhiwang.org.cn)

21. **China.com.cn** (`crawl_china.py`)
    - China.com.cn main site (http://www.china.com.cn)
    - one provincial channel (second-level sub-sites are not followed)

## Technical Implementation

### 1. Crawler architecture

- **Django management commands**: each media outlet has its own crawler command
- **Modular design**: easy to maintain and extend
- **Unified interface**: every crawler uses the same core crawl logic

### 2. Core features

- **Smart recognition**: automatically identifies article pages and content areas
- **Media download**: automatically downloads the images and videos in an article
- **De-duplication**: avoids crawling the same article twice
- **Error handling**: thorough exception handling

### 3. Data handling

- **Data models**: the Website and Article models (see the sketch below)
- **Export**: JSON, CSV, and Word formats
- **Media files**: automatic download and management of media files
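The actual `core/models.py` is not part of this change set; the sketch below only illustrates the two models with the field names that the admin and crawler code in this compare view rely on (name, base_url, article_list_url, article_selector, crawler_command, enabled; title, url, content, media_files, pub_date, created_at). Field types and options are assumptions.

```python
from django.db import models


class Website(models.Model):
    name = models.CharField(max_length=200, unique=True)
    base_url = models.URLField()
    article_list_url = models.URLField()
    article_selector = models.CharField(max_length=200, default="a")
    crawler_command = models.CharField(max_length=100, blank=True)  # management command to run
    enabled = models.BooleanField(default=True)

    def __str__(self):
        return self.name


class Article(models.Model):
    website = models.ForeignKey(Website, on_delete=models.CASCADE)
    title = models.CharField(max_length=500)
    url = models.URLField(unique=True)
    content = models.TextField()
    media_files = models.JSONField(default=list, blank=True)  # relative paths or URLs
    pub_date = models.DateTimeField(null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title
```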
### 4. Batch operation

- **Batch crawl**: the `crawl_all_media` command crawls every outlet in one run
- **Selective crawl**: specific media or platforms can be selected
- **Statistics**: crawl statistics are reported

## File Layout

```
core/management/commands/
├── crawl_rmrb.py        # People's Daily crawler
├── crawl_xinhua.py      # Xinhua crawler
├── crawl_cctv.py        # CCTV crawler
├── crawl_qiushi.py      # Qiushi crawler
├── crawl_pla.py         # PLA Daily crawler
├── crawl_gmrb.py        # Guangming Daily crawler
├── crawl_jjrb.py        # Economic Daily crawler
├── crawl_chinadaily.py  # China Daily crawler
├── crawl_grrb.py        # Workers' Daily crawler
├── crawl_kjrb.py        # Science and Technology Daily crawler
├── crawl_rmzxb.py       # CPPCC Daily crawler
├── crawl_zgjwjc.py      # China Discipline Inspection and Supervision News crawler
├── crawl_chinanews.py   # China News Service crawler
├── crawl_xxsb.py        # Study Times crawler
├── crawl_zgqnb.py       # China Youth Daily crawler
├── crawl_zgfnb.py       # China Women's News crawler
├── crawl_fzrb.py        # Legal Daily crawler
├── crawl_nmrb.py        # Farmers' Daily crawler
├── crawl_xuexi.py       # Xuexi Qiangguo crawler
├── crawl_qizhi.py       # Qizhi crawler
├── crawl_china.py       # China.com.cn crawler
├── crawl_all_media.py   # batch crawl command
└── export_articles.py   # data export command

core/
├── models.py            # data models
├── utils.py             # core crawl logic
└── views.py             # view functions

docs/
├── CRAWLER_README.md          # usage guide
└── IMPLEMENTATION_SUMMARY.md  # implementation summary

test_crawlers.py         # test script
```

## Usage

### 1. Crawl a single media outlet

```bash
# Crawl all People's Daily platforms
python manage.py crawl_rmrb

# Crawl a specific platform
python manage.py crawl_rmrb --platform peopleapp
```

### 2. Batch crawl

```bash
# Crawl every media outlet
python manage.py crawl_all_media

# Crawl selected media only
python manage.py crawl_all_media --media rmrb,xinhua,cctv
```

### 3. Export data

```bash
# Export as JSON
python manage.py export_articles --format json

# Export as Word documents
python manage.py export_articles --format docx --include-media
```

## Technical Features

### 1. Smart recognition

- Tuned to the article structure of each site
- Automatically identifies titles, content, images, and other elements
- Supports a variety of HTML structure patterns

### 2. Media handling

- Automatically downloads the images and videos in an article
- Stores media files locally
- Supports multiple media formats

### 3. Data management

- De-duplication avoids repeated records
- Supports incremental crawling
- Full data export functionality

### 4. Error handling

- Network errors
- Parsing errors
- Database errors

## Extensibility

### 1. Adding a new media outlet

- Copy an existing crawler file (see the sketch below)
- Adjust the site configuration
- Update the core logic if needed
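Following the pattern of the commands added in this change set, a new outlet can be added with a file like the sketch below. The media name, URLs, and file name are hypothetical placeholders; `full_site_crawler` is the project's own helper, used here with the same signature as in the existing commands.

```python
# core/management/commands/crawl_example.py  (hypothetical new outlet)
from django.core.management.base import BaseCommand

from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "Recursively crawl the Example Daily site"

    def handle(self, *args, **options):
        # Register (or reuse) the Website record for this outlet.
        website, created = Website.objects.get_or_create(
            name='Example Daily',
            defaults={
                'base_url': 'https://www.example.com',
                'article_list_url': 'https://www.example.com',
                'article_selector': 'a',
            }
        )
        self.stdout.write(f"Start crawling: {website.name}")
        full_site_crawler('https://www.example.com', website, max_pages=500)
        self.stdout.write(self.style.SUCCESS(f"Finished crawling: {website.name}"))
```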
### 2. Custom crawl logic

- Add site-specific handling in `utils.py` (see the sketch below)
- Custom article-detection rules are supported
- Custom content-extraction rules are supported
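The shape of such site-specific rules is not shown in this change set; the sketch below is one possible arrangement, with a per-host rule table and a generic fallback. All names and selectors here are hypothetical.

```python
# Hypothetical per-site rule table that could live in core/utils.py.
SITE_RULES = {
    "www.gmw.cn": {
        "title": ["h1.u-title", "h1"],
        "content": ["div.u-mainText", "div.article-body"],
    },
}

GENERIC_RULES = {"title": ["h1"], "content": ["article", "div.content"]}


def rules_for(netloc: str) -> dict:
    """Return the dedicated rules for a host, falling back to generic selectors."""
    return SITE_RULES.get(netloc, GENERIC_RULES)
```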
### 3. Extending data formats

- More export formats can be added
- Custom data fields are supported
- Data conversion and cleaning are supported

## Performance

### 1. Concurrency control

- Throttles the crawl rate
- Avoids overloading the target sites
- Supports resuming an interrupted crawl

### 2. Resource management

- Memory usage optimization
- Disk space management
- Network bandwidth control

### 3. Data storage

- Database index optimization
- Media file storage optimization
- Query performance optimization

## Security Considerations

### 1. Network

- Use an appropriate User-Agent
- Throttle request rates
- Respect robots.txt

### 2. Data

- Data backup mechanism
- Access control
- Protection of sensitive information

## Maintenance Suggestions

### 1. Regular updates

- Watch for changes in site structure
- Update the crawl rules
- Keep dependency versions current

### 2. Monitoring and alerting

- Monitor crawl status
- Analyse error logs
- Track performance metrics

### 3. Data quality

- Validate data regularly
- Check content quality
- Verify data integrity

## Summary

The project implements full crawl support for 18 central state media outlets and has the following characteristics:

1. **Complete coverage**: supports every specified central media outlet
2. **Modern stack**: uses an up-to-date crawling tool chain
3. **Easy to use**: provides a simple command-line interface
4. **Highly extensible**: new media sites can be added quickly
5. **Stable and reliable**: thorough error handling and recovery

The system provides solid technical support for collecting and analysing content from the central state media and can serve a wide range of use cases.

core/admin.py (238 lines)
@@ -1,31 +1,50 @@
-from django.contrib import admin
-from django.contrib.admin import AdminSite
 from .models import Website, Article
 # 添加actions相关的导入
 from django.contrib import messages
-from django.http import HttpResponseRedirect
 # 添加导出功能所需导入
 import csv
 from django.http import HttpResponse
 import json
 
+# 添加视图函数需要的导入
+from django.shortcuts import render, redirect
+from django.urls import path
+from django.contrib import admin
+from django.core.management import call_command
+
+
+# 添加运行爬虫的视图函数
+def run_crawler_view(request):
+    """
+    管理后台运行爬虫的视图
+    """
+    if request.method == 'POST':
+        website_name = request.POST.get('website_name')
+        if not website_name:
+            messages.error(request, '请选择要爬取的网站')
+            return redirect('admin:core_article_changelist')
 
-# 创建自定义管理站点
-class NewsCnAdminSite(AdminSite):
-    site_header = "新华网管理后台"
-    site_title = "新华网管理"
-    index_title = "新华网内容管理"
+        try:
+            # 动态获取网站对象
+            website = Website.objects.get(name=website_name)
+
+            # 根据网站对象确定要执行的爬虫命令
+            # 移除默认的通用爬虫,每个网站必须配置自己的爬虫命令
+            crawler_name = getattr(website, 'crawler_command', None)
 
-class DongfangyancaoAdminSite(AdminSite):
-    site_header = "东方烟草报管理后台"
-    site_title = "东方烟草报管理"
-    index_title = "东方烟草报内容管理"
+            # 如果网站没有配置爬虫命令,则报错
+            if not crawler_name:
+                messages.error(request, f'网站 {website_name} 未配置爬虫命令')
+                return redirect('admin:core_article_changelist')
 
-# 实例化管理站点
-news_cn_admin = NewsCnAdminSite(name='news_cn_admin')
-dongfangyancao_admin = DongfangyancaoAdminSite(name='dongfangyancao_admin')
+            # 运行爬虫命令,传递网站名称
+            call_command(crawler_name, website_name)
+
+            messages.success(request, f'成功执行爬虫: {website_name}')
+        except Website.DoesNotExist:
+            messages.error(request, f'网站不存在: {website_name}')
+        except Exception as e:
+            messages.error(request, f'执行爬虫失败: {str(e)}')
+
+    return redirect('admin:core_article_changelist')
 
 
 @admin.register(Website)
@@ -39,22 +58,20 @@ class ArticleAdmin(admin.ModelAdmin):
     list_display = ('title', 'website', 'pub_date')
     search_fields = ('title', 'content')
     # 添加动作选项
-    actions = ['delete_selected_articles', 'delete_dongfangyancao_articles', 'export_as_csv', 'export_as_json',
-               'export_as_word']
+    actions = ['delete_selected_articles', 'export_as_csv', 'export_as_json',
+               'export_as_word', 'export_with_media']
 
-    def delete_dongfangyancao_articles(self, request, queryset):
-        """一键删除东方烟草报的所有文章"""
-        # 获取东方烟草报网站对象
-        try:
-            dongfangyancao_website = Website.objects.get(name='东方烟草报')
-            # 删除所有东方烟草报的文章
-            deleted_count = Article.objects.filter(website=dongfangyancao_website).delete()[0]
-            self.message_user(request, f"成功删除 {deleted_count} 篇东方烟草报文章", messages.SUCCESS)
-        except Website.DoesNotExist:
-            self.message_user(request, "未找到东方烟草报网站配置", messages.ERROR)
+    def get_websites(self):
+        """获取所有启用的网站"""
+        return Website.objects.filter(enabled=True)
 
-    # 设置动作的显示名称
-    delete_dongfangyancao_articles.short_description = "删除所有东方烟草报文章"
+    # 重写get_urls方法,添加自定义URL
+    def get_urls(self):
+        urls = super().get_urls()
+        custom_urls = [
+            path('run-crawler/', self.admin_site.admin_view(run_crawler_view), name='run_crawler'),
+        ]
+        return custom_urls + urls
 
     def export_as_csv(self, request, queryset):
         """导出选中的文章为CSV格式"""
@@ -205,6 +222,163 @@ class ArticleAdmin(admin.ModelAdmin):
 
     export_as_word.short_description = "导出选中文章为Word格式"
 
+    def export_with_media(self, request, queryset):
+        """导出选中的文章及媒体文件为ZIP包"""
+        try:
+            from docx import Document
+            from io import BytesIO
+            from docx.shared import Inches
+            import zipfile
+        except ImportError:
+            self.message_user(request, "缺少必要库,请安装: pip install python-docx", messages.ERROR)
+            return
+
+        # 创建内存中的ZIP文件
+        zip_buffer = BytesIO()
+
+        with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
+            for article in queryset:
+                # 为每篇文章创建单独的文件夹
+                article_folder = f"article_{article.id}_{article.title.replace('/', '_').replace('\\', '_').replace(':', '_').replace('*', '_').replace('?', '_').replace('"', '_').replace('<', '_').replace('>', '_').replace('|', '_')}"
+
+                # 创建Word文档
+                doc = Document()
+                doc.add_heading(article.title, 0)
+
+                # 添加文章元数据
+                doc.add_paragraph(f"网站: {article.website.name}")
+                doc.add_paragraph(f"URL: {article.url}")
+                doc.add_paragraph(
+                    f"发布时间: {article.pub_date.strftime('%Y-%m-%d %H:%M:%S') if article.pub_date else 'N/A'}")
+                doc.add_paragraph(f"创建时间: {article.created_at.strftime('%Y-%m-%d %H:%M:%S')}")
+
+                # 添加文章内容
+                doc.add_heading('内容', level=2)
+                # 简单处理HTML内容,移除标签并处理图片
+                from bs4 import BeautifulSoup
+                soup = BeautifulSoup(article.content, 'html.parser')
+
+                # 处理内容中的图片
+                for img in soup.find_all('img'):
+                    src = img.get('src', '')
+                    if src:
+                        # 尝试添加图片到文档
+                        try:
+                            import os
+                            from django.conf import settings
+                            import requests
+
+                            # 构建完整的图片路径
+                            if src.startswith('http'):
+                                # 网络图片
+                                response = requests.get(src, timeout=10)
+                                image_stream = BytesIO(response.content)
+                                doc.add_picture(image_stream, width=Inches(4.0))
+                                # 将网络文件保存到ZIP
+                                zip_file.writestr(os.path.join(article_folder, 'media', os.path.basename(src)),
+                                                  response.content)
+                            else:
+                                # 本地图片
+                                full_path = os.path.join(settings.MEDIA_ROOT, src.lstrip('/'))
+                                if os.path.exists(full_path):
+                                    doc.add_picture(full_path, width=Inches(4.0))
+                                    # 添加文件到ZIP包
+                                    zip_file.write(full_path, os.path.join(article_folder, 'media', src.lstrip('/')))
+                        except Exception as e:
+                            # 如果添加图片失败,添加图片URL作为文本
+                            doc.add_paragraph(f"[图片: {src}]")
+
+                    # 移除原始img标签
+                    img.decompose()
+
+                content_text = soup.get_text()
+                doc.add_paragraph(content_text)
+
+                # 添加媒体文件信息并打包媒体文件
+                if article.media_files:
+                    doc.add_heading('媒体文件', level=2)
+                    for media_file in article.media_files:
+                        try:
+                            import os
+                            from django.conf import settings
+
+                            full_path = os.path.join(settings.MEDIA_ROOT, media_file)
+                            # 检查文件扩展名以确定处理方式
+                            file_extension = os.path.splitext(media_file)[1].lower()
+
+                            # 图片文件处理
+                            if file_extension in ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff']:
+                                if os.path.exists(full_path):
+                                    # 添加图片到文档
+                                    doc.add_picture(full_path, width=Inches(4.0))
+                                    # 添加文件到ZIP包
+                                    zip_file.write(full_path, os.path.join(article_folder, 'media', media_file))
+                                else:
+                                    # 如果是URL格式的媒体文件
+                                    if media_file.startswith('http'):
+                                        response = requests.get(media_file, timeout=10)
+                                        image_stream = BytesIO(response.content)
+                                        doc.add_picture(image_stream, width=Inches(4.0))
+                                        # 将网络文件保存到ZIP
+                                        zip_file.writestr(
+                                            os.path.join(article_folder, 'media', os.path.basename(media_file)),
+                                            response.content)
+                                    else:
+                                        doc.add_paragraph(media_file)
+                            # 视频文件处理
+                            elif file_extension in ['.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm']:
+                                # 视频文件只添加到ZIP包中,不在Word文档中显示
+                                if os.path.exists(full_path):
+                                    # 添加文件到ZIP包
+                                    zip_file.write(full_path, os.path.join(article_folder, 'media', media_file))
+                                    # 在Word文档中添加视频文件信息
+                                    doc.add_paragraph(f"[视频文件: {media_file}]")
+                                else:
+                                    # 如果是URL格式的媒体文件
+                                    if media_file.startswith('http'):
+                                        # 将网络文件保存到ZIP
+                                        response = requests.get(media_file, timeout=10)
+                                        zip_file.writestr(
+                                            os.path.join(article_folder, 'media', os.path.basename(media_file)),
+                                            response.content)
+                                        doc.add_paragraph(f"[视频文件: {media_file}]")
+                                    else:
+                                        doc.add_paragraph(media_file)
+                            # 其他文件类型
+                            else:
+                                if os.path.exists(full_path):
+                                    # 添加文件到ZIP包
+                                    zip_file.write(full_path, os.path.join(article_folder, 'media', media_file))
+                                    doc.add_paragraph(f"[文件: {media_file}]")
+                                else:
+                                    # 如果是URL格式的媒体文件
+                                    if media_file.startswith('http'):
+                                        response = requests.get(media_file, timeout=10)
+                                        zip_file.writestr(
+                                            os.path.join(article_folder, 'media', os.path.basename(media_file)),
+                                            response.content)
+                                        doc.add_paragraph(f"[文件: {media_file}]")
+                                    else:
+                                        doc.add_paragraph(media_file)
+                        except Exception as e:
+                            doc.add_paragraph(media_file)
+
+                # 保存每篇文章的Word文档到ZIP文件中的对应文件夹
+                doc_buffer = BytesIO()
+                doc.save(doc_buffer)
+                doc_buffer.seek(0)
+                zip_file.writestr(os.path.join(article_folder, f'{article.title.replace("/", "_")}.docx'),
+                                  doc_buffer.read())
+
+        # 创建HttpResponse
+        zip_buffer.seek(0)
+        from django.http import HttpResponse
+        response = HttpResponse(zip_buffer.getvalue(), content_type='application/zip')
+        response['Content-Disposition'] = 'attachment; filename=articles_export.zip'
+        return response
+
+    export_with_media.short_description = "导出选中文章及媒体文件(ZIP包)"
+
 
 # 为不同网站创建专门的文章管理类
 class NewsCnArticleAdmin(admin.ModelAdmin):
@@ -340,10 +514,4 @@ class DongfangyancaoArticleAdmin(admin.ModelAdmin):
 
     export_as_json.short_description = "导出选中文章为JSON格式"
 
 
 # 在各自的管理站点中注册模型
-news_cn_admin.register(Website, WebsiteAdmin)
-news_cn_admin.register(Article, NewsCnArticleAdmin)
-
-dongfangyancao_admin.register(Website, WebsiteAdmin)
-dongfangyancao_admin.register(Article, DongfangyancaoArticleAdmin)

core/management/commands/crawl_all_media.py (new file, 77 lines)
@@ -0,0 +1,77 @@
from django.core.management.base import BaseCommand
from django.core.management import call_command
from core.models import Website


class Command(BaseCommand):
    help = "批量爬取所有中央主流媒体"

    def add_arguments(self, parser):
        parser.add_argument('--media', type=str, help='指定要爬取的媒体,用逗号分隔')
        parser.add_argument('--platform', type=str, default='all',
                            help='指定平台类型: all(全部), web(网站), mobile(移动端)')

    def handle(self, *args, **options):
        media_list = options['media']
        platform = options['platform']

        # 所有中央主流媒体配置
        all_media = {
            'rmrb': 'crawl_rmrb',
            'xinhua': 'crawl_xinhua',
            'cctv': 'crawl_cctv',
            'qiushi': 'crawl_qiushi',
            'pla': 'crawl_pla',
            'gmrb': 'crawl_gmrb',
            'jjrb': 'crawl_jjrb',
            'chinadaily': 'crawl_chinadaily',
            'grrb': 'crawl_grrb',
            'kjrb': 'crawl_kjrb',
            'rmzxb': 'crawl_rmzxb',
            'zgjwjc': 'crawl_zgjwjc',
            'chinanews': 'crawl_chinanews',
            'xxsb': 'crawl_xxsb',
            'zgqnb': 'crawl_zgqnb',
            'zgfnb': 'crawl_zgfnb',
            'fzrb': 'crawl_fzrb',
            'nmrb': 'crawl_nmrb',
            'xuexi': 'crawl_xuexi',
            'qizhi': 'crawl_qizhi',
            'china': 'crawl_china'
        }

        # 如果指定了特定媒体,则只爬取指定的媒体
        if media_list:
            target_media = [media.strip() for media in media_list.split(',')]
        else:
            target_media = list(all_media.keys())

        self.stdout.write(f"开始批量爬取 {len(target_media)} 家中央主流媒体...")

        for media in target_media:
            if media in all_media:
                command_name = all_media[media]
                try:
                    self.stdout.write(f"正在爬取: {media}")
                    call_command(command_name, platform=platform)
                    self.stdout.write(self.style.SUCCESS(f"完成爬取: {media}"))
                except Exception as e:
                    self.stdout.write(self.style.ERROR(f"爬取 {media} 失败: {e}"))
            else:
                self.stdout.write(self.style.WARNING(f"未知媒体: {media}"))

        self.stdout.write(self.style.SUCCESS("所有中央主流媒体爬取完成"))

        # 显示统计信息
        total_websites = Website.objects.count()
        total_articles = sum([website.article_set.count() for website in Website.objects.all()])

        self.stdout.write(f"统计信息:")
        self.stdout.write(f"- 总网站数: {total_websites}")
        self.stdout.write(f"- 总文章数: {total_articles}")

        # 显示各媒体文章数量
        self.stdout.write(f"各媒体文章数量:")
        for website in Website.objects.all():
            article_count = website.article_set.count()
            self.stdout.write(f"- {website.name}: {article_count} 篇")

core/management/commands/crawl_cctv.py (new file, 61 lines)
@@ -0,0 +1,61 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


# jimmy.fang:20250815: 因 CCTV 的视频有做加密动作,无法下载,移除支持
class Command(BaseCommand):
    help = "全站递归爬取 中央广播电视总台及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['cctv', 'cctvnews', 'mobile', 'all'],
                            help='选择爬取平台: cctv(央视网), cctvnews(央视新闻), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 中央广播电视总台各平台配置
        platforms = {
            # jimmy.fang:20250815: 因 CCTV 的视频有做加密动作,无法下载,移除支持
            # 'cctv': {
            #     'name': '央视网',
            #     'base_url': 'https://www.cctv.com',
            #     'start_url': 'https://www.cctv.com',
            #     'article_selector': 'a'
            # },
            'cctvnews': {
                'name': '央视新闻',
                'base_url': 'https://news.cctv.com',
                'start_url': 'https://news.cctv.com',
                'article_selector': 'a'
            }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("中央广播电视总台所有平台爬取完成"))

core/management/commands/crawl_china.py (new file, 60 lines)
@@ -0,0 +1,60 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


# jimmy.fang-20250815: 因URL问题,移除中国网-省份
class Command(BaseCommand):
    help = "全站递归爬取 中国网主网及中国网一省份,不转发二级子网站"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['china', 'province', 'all'],
                            help='选择爬取平台: china(中国网主网), province(中国网一省份), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 中国网各平台配置
        platforms = {
            'china': {
                'name': '中国网',
                'base_url': 'http://www.china.com.cn',
                'start_url': 'http://www.china.com.cn',
                'article_selector': 'a'
            },
            # 'province': {
            #     'name': '中国网一省份',
            #     'base_url': 'http://www.china.com.cn',
            #     'start_url': 'http://www.china.com.cn/province',
            #     'article_selector': 'a'
            # }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("中国网所有平台爬取完成"))

core/management/commands/crawl_chinadaily.py (new file, 54 lines)
@@ -0,0 +1,54 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 中国日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['chinadaily', 'mobile', 'all'],
                            help='选择爬取平台: chinadaily(中国日报), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 中国日报各平台配置
        platforms = {
            'chinadaily': {
                'name': '中国日报',
                'base_url': 'https://www.chinadaily.com.cn',
                'start_url': 'https://www.chinadaily.com.cn',
                'article_selector': 'a'
            },

        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("中国日报所有平台爬取完成"))

core/management/commands/crawl_chinanews.py (new file, 53 lines)
@@ -0,0 +1,53 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 中国新闻社及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['chinanews', 'mobile', 'all'],
                            help='选择爬取平台: chinanews(中国新闻社), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 中国新闻社各平台配置
        platforms = {
            'chinanews': {
                'name': '中国新闻社',
                'base_url': 'https://www.chinanews.com.cn',
                'start_url': 'https://www.chinanews.com.cn',
                'article_selector': 'a'
            },
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("中国新闻社所有平台爬取完成"))

@@ -4,17 +4,50 @@ from core.utils import full_site_crawler
 
 
 class Command(BaseCommand):
-    help = "全站递归爬取 www.gov.cn"
+    help = "全站递归爬取 中国政府网及其子网站"
 
-    def handle(self, *args, **kwargs):
-        website, created = Website.objects.get_or_create(
-            name="www.gov.cn",
-            defaults={
-                'article_list_url': 'https://www.gov.cn/',
+    def add_arguments(self, parser):
+        parser.add_argument('--platform', type=str, default='all',
+                            choices=['govcn', 'all'],
+                            help='选择爬取平台: govcn(中国政府网), all(全部)')
+
+    def handle(self, *args, **options):
+        platform = options['platform']
+
+        # 中国政府网各平台配置
+        platforms = {
+            'govcn': {
+                'name': '中国政府网',
+                'base_url': 'https://www.gov.cn/',
+                'start_url': 'https://www.gov.cn/',
                 'article_selector': 'a'
-            }
-        )
-        start_url = "https://www.gov.cn/"
-        self.stdout.write(f"开始全站爬取: {start_url}")
-        full_site_crawler(start_url, website, max_pages=500)
-        self.stdout.write("爬取完成")
+            },
+        }
+
+        if platform == 'all':
+            target_platforms = platforms.values()
+        else:
+            target_platforms = [platforms[platform]]
+
+        for platform_config in target_platforms:
+            website, created = Website.objects.get_or_create(
+                name=platform_config['name'],
+                defaults={
+                    'base_url': platform_config['base_url'],
+                    'article_list_url': platform_config['start_url'],
+                    'article_selector': platform_config['article_selector']
+                }
+            )
+
+            # 确保更新已存在的网站对象的配置
+            if not created:
+                website.base_url = platform_config['base_url']
+                website.article_list_url = platform_config['start_url']
+                website.article_selector = platform_config['article_selector']
+                website.save()
+
+            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
+            full_site_crawler(platform_config['start_url'], website, max_pages=500)
+            self.stdout.write(f"完成爬取: {platform_config['name']}")
+
+        self.stdout.write(self.style.SUCCESS("中国政府网所有平台爬取完成"))

@@ -6,15 +6,48 @@
 class Command(BaseCommand):
     help = "全站递归爬取 东方烟草报"
 
-    def handle(self, *args, **kwargs):
-        website, created = Website.objects.get_or_create(
-            name="东方烟草报",
-            defaults={
-                'article_list_url': 'https://www.eastobacco.com/',
+    def add_arguments(self, parser):
+        parser.add_argument('--platform', type=str, default='all',
+                            choices=['eastobacco', 'all'],
+                            help='选择爬取平台: eastobacco(东方烟草报), all(全部)')
+
+    def handle(self, *args, **options):
+        platform = options['platform']
+
+        # 东方烟草报各平台配置
+        platforms = {
+            'eastobacco': {
+                'name': '东方烟草报',
+                'base_url': 'https://www.eastobacco.com/',
+                'start_url': 'https://www.eastobacco.com/',
                 'article_selector': 'a'
-            }
-        )
-        start_url = "https://www.eastobacco.com/"
-        self.stdout.write(f"开始全站爬取: {start_url}")
-        full_site_crawler(start_url, website, max_pages=500)
-        self.stdout.write("爬取完成")
+            },
+        }
+
+        if platform == 'all':
+            target_platforms = platforms.values()
+        else:
+            target_platforms = [platforms[platform]]
+
+        for platform_config in target_platforms:
+            website, created = Website.objects.get_or_create(
+                name=platform_config['name'],
+                defaults={
+                    'base_url': platform_config['base_url'],
+                    'article_list_url': platform_config['start_url'],
+                    'article_selector': platform_config['article_selector']
+                }
+            )
+
+            # 确保更新已存在的网站对象的配置
+            if not created:
+                website.base_url = platform_config['base_url']
+                website.article_list_url = platform_config['start_url']
+                website.article_selector = platform_config['article_selector']
+                website.save()
+
+            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
+            full_site_crawler(platform_config['start_url'], website, max_pages=500)
+            self.stdout.write(f"完成爬取: {platform_config['name']}")
+
+        self.stdout.write(self.style.SUCCESS("东方烟草报所有平台爬取完成"))

core/management/commands/crawl_fzrb.py (new file, 53 lines)
@@ -0,0 +1,53 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 法治日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['fzrb', 'mobile', 'all'],
                            help='选择爬取平台: fzrb(法治日报), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 法治日报各平台配置
        platforms = {
            'fzrb': {
                'name': '法治日报',
                'base_url': 'http://www.legaldaily.com.cn',
                'start_url': 'http://www.legaldaily.com.cn',
                'article_selector': 'a'
            },
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("法治日报所有平台爬取完成"))

core/management/commands/crawl_gmrb.py (new file, 59 lines)
@@ -0,0 +1,59 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 光明日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['gmrb', 'mobile', 'all'],
                            help='选择爬取平台: gmrb(光明日报), mobile(移动端), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 光明日报各平台配置
        platforms = {
            'gmrb': {
                'name': '光明日报',
                'base_url': 'https://www.gmw.cn',
                'start_url': 'https://www.gmw.cn',
                'article_selector': 'a'
            },
            'mobile': {
                'name': '光明日报移动端',
                'base_url': 'https://m.gmw.cn',
                'start_url': 'https://m.gmw.cn',
                'article_selector': 'a'
            }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("光明日报所有平台爬取完成"))

core/management/commands/crawl_grrb.py (new file, 59 lines)
@@ -0,0 +1,59 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 工人日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['grrb', 'mobile', 'all'],
                            help='选择爬取平台: grrb(工人日报), mobile(移动端), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 工人日报各平台配置
        platforms = {
            'grrb': {
                'name': '工人日报',
                'base_url': 'http://www.workercn.cn',
                'start_url': 'http://www.workercn.cn',
                'article_selector': 'a'
            },
            'mobile': {
                'name': '工人日报移动端',
                'base_url': 'http://m.workercn.cn',  # 修复:确保移动端URL正确
                'start_url': 'http://m.workercn.cn',
                'article_selector': 'a'
            }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("工人日报所有平台爬取完成"))

core/management/commands/crawl_jjrb.py (new file, 53 lines)
@@ -0,0 +1,53 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 经济日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['jjrb', 'mobile', 'all'],
                            help='选择爬取平台: jjrb(经济日报), mobile(移动端), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 经济日报各平台配置
        platforms = {
            'jjrb': {
                'name': '经济日报',
                'base_url': 'http://www.ce.cn',
                'start_url': 'http://www.ce.cn',
                'article_selector': 'a'
            },
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("经济日报所有平台爬取完成"))

core/management/commands/crawl_kjrb.py (new file, 60 lines)
@@ -0,0 +1,60 @@
### 不支援
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 科技日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['kjrb', 'mobile', 'all'],
                            help='选择爬取平台: kjrb(科技日报), mobile(移动端), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 科技日报各平台配置
        platforms = {
            'kjrb': {
                'name': '科技日报',
                'base_url': 'http://digitalpaper.stdaily.com',
                'start_url': 'http://digitalpaper.stdaily.com',
                'article_selector': 'a'
            },
            'mobile': {
                'name': '科技日报移动端',
                'base_url': 'http://m.stdaily.com',
                'start_url': 'http://m.stdaily.com',
                'article_selector': 'a'
            }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("科技日报所有平台爬取完成"))
59  core/management/commands/crawl_nmrb.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 农民日报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['nmrb', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: nmrb(农民日报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 农民日报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'nmrb': {
|
||||||
|
'name': '农民日报',
|
||||||
|
'base_url': 'http://www.farmer.com.cn',
|
||||||
|
'start_url': 'http://www.farmer.com.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '农民日报移动端',
|
||||||
|
'base_url': 'http://m.farmer.com.cn',
|
||||||
|
'start_url': 'http://m.farmer.com.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("农民日报所有平台爬取完成"))
|
||||||
53  core/management/commands/crawl_pla.py  Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 解放军报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['pla', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: pla(解放军报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 解放军报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'pla': {
|
||||||
|
'name': '解放军报',
|
||||||
|
'base_url': 'https://www.81.cn',
|
||||||
|
'start_url': 'https://www.81.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("解放军报所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_qiushi.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 求是杂志及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['qiushi', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: qiushi(求是网), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 求是杂志各平台配置
|
||||||
|
platforms = {
|
||||||
|
'qiushi': {
|
||||||
|
'name': '求是网',
|
||||||
|
'base_url': 'https://www.qstheory.cn',
|
||||||
|
'start_url': 'https://www.qstheory.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '求是移动端',
|
||||||
|
'base_url': 'http://m.qstheory.cn',
|
||||||
|
'start_url': 'http://m.qstheory.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("求是杂志所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_qizhi.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 旗帜网及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['qizhi', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: qizhi(旗帜网), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 旗帜网各平台配置
|
||||||
|
platforms = {
|
||||||
|
'qizhi': {
|
||||||
|
'name': '旗帜网',
|
||||||
|
'base_url': 'http://www.qizhiwang.org.cn',
|
||||||
|
'start_url': 'http://www.qizhiwang.org.cn',
|
||||||
|
'article_selector': 'a[href^="/"]' # 修改选择器以更好地匹配文章链接
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '旗帜网移动端',
|
||||||
|
'base_url': 'http://m.qizhiwang.org.cn',
|
||||||
|
'start_url': 'http://m.qizhiwang.org.cn',
|
||||||
|
'article_selector': 'a[href^="/"]' # 修改选择器以更好地匹配文章链接
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("旗帜网所有平台爬取完成"))
|
||||||
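The 旗帜网 command above is the only one in this change set that narrows `article_selector` from the bare `'a'` used elsewhere to `'a[href^="/"]'`, i.e. only anchors whose href starts with a site-relative path. A small illustration of the difference, assuming the crawler applies `article_selector` as a BeautifulSoup-style CSS selector (the selector handling itself is not shown in this diff):

```python
# Illustration only: assumes article_selector is applied as a CSS selector
# (e.g. via BeautifulSoup); this diff does not show how core.utils uses it.
from bs4 import BeautifulSoup

html = '<a href="/n1/2024/0101/c1-1.html">正文</a> <a href="http://other.site/x">外链</a>'
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('a')))             # 2 -- bare 'a' also matches off-site links
print(len(soup.select('a[href^="/"]')))  # 1 -- only site-relative article links
```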
65  core/management/commands/crawl_rmrb.py  Normal file
@@ -0,0 +1,65 @@
from django.core.management.base import BaseCommand
from core.models import Website
from core.utils import full_site_crawler


class Command(BaseCommand):
    help = "全站递归爬取 人民日报及其子网站、客户端、新媒体平台"

    def add_arguments(self, parser):
        parser.add_argument('--platform', type=str, default='all',
                            choices=['peopleapp', 'people', 'paper', 'all'],
                            help='选择爬取平台: peopleapp(客户端), people(人民网), paper(报纸), all(全部)')

    def handle(self, *args, **options):
        platform = options['platform']

        # 人民日报各平台配置
        platforms = {
            'peopleapp': {
                'name': '人民日报客户端',
                'base_url': 'https://www.peopleapp.com',
                'start_url': 'https://www.peopleapp.com/home',
                'article_selector': 'a'
            },
            'people': {
                'name': '人民网',
                'base_url': 'https://www.people.com.cn',
                'start_url': 'https://www.people.com.cn',
                'article_selector': 'a'
            },
            'paper': {
                'name': '人民日报报纸',
                'base_url': 'http://paper.people.com.cn',
                'start_url': 'http://paper.people.com.cn',
                'article_selector': 'a'
            }
        }

        if platform == 'all':
            target_platforms = platforms.values()
        else:
            target_platforms = [platforms[platform]]

        for platform_config in target_platforms:
            website, created = Website.objects.get_or_create(
                name=platform_config['name'],
                defaults={
                    'base_url': platform_config['base_url'],
                    'article_list_url': platform_config['start_url'],
                    'article_selector': platform_config['article_selector']
                }
            )

            # 确保更新已存在的网站对象的配置
            if not created:
                website.base_url = platform_config['base_url']
                website.article_list_url = platform_config['start_url']
                website.article_selector = platform_config['article_selector']
                website.save()

            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
            full_site_crawler(platform_config['start_url'], website, max_pages=500)
            self.stdout.write(f"完成爬取: {platform_config['name']}")

        self.stdout.write(self.style.SUCCESS("人民日报所有平台爬取完成"))
53  core/management/commands/crawl_rmzxb.py  Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 人民政协网及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['rmzxb', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: rmzxb(人民政协网), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 人民政协网各平台配置
|
||||||
|
platforms = {
|
||||||
|
'rmzxb': {
|
||||||
|
'name': '人民政协网',
|
||||||
|
'base_url': 'https://www.rmzxw.com.cn',
|
||||||
|
'start_url': 'https://www.rmzxw.com.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("人民政协网所有平台爬取完成"))
|
||||||
@@ -4,17 +4,62 @@ from core.utils import full_site_crawler
 
 
-class Command(BaseCommand):
-    help = "全站递归爬取 www.news.cn"
-
-    def handle(self, *args, **kwargs):
-        website, created = Website.objects.get_or_create(
-            name="www.news.cn",
-            defaults={
-                'article_list_url': 'https://www.news.cn/',
-                'article_selector': 'a'
-            }
-        )
-        start_url = "https://www.news.cn/"
-        self.stdout.write(f"开始全站爬取: {start_url}")
-        full_site_crawler(start_url, website, max_pages=500)
-        self.stdout.write("爬取完成")
+class Command(BaseCommand):
+    help = "全站递归爬取 新华社及其子网站、客户端、新媒体平台"
+
+    def add_arguments(self, parser):
+        parser.add_argument('--platform', type=str, default='all',
+                            choices=['news', 'xinhuanet', 'mobile', 'all'],
+                            help='选择爬取平台: news(新华网), xinhuanet(新华网主站), mobile(移动端), all(全部)')
+
+    def handle(self, *args, **options):
+        platform = options['platform']
+
+        # 新华社各平台配置
+        platforms = {
+            'news': {
+                'name': '新华网',
+                'base_url': 'https://www.news.cn',
+                'start_url': 'https://www.news.cn',
+                'article_selector': 'a'
+            },
+            'xinhuanet': {
+                'name': '新华网主站',
+                'base_url': 'https://www.xinhuanet.com',
+                'start_url': 'https://www.xinhuanet.com',
+                'article_selector': 'a'
+            },
+            'mobile': {
+                'name': '新华社移动端',
+                'base_url': 'https://m.xinhuanet.com',
+                'start_url': 'https://m.xinhuanet.com',
+                'article_selector': 'a'
+            }
+        }
+
+        if platform == 'all':
+            target_platforms = platforms.values()
+        else:
+            target_platforms = [platforms[platform]]
+
+        for platform_config in target_platforms:
+            website, created = Website.objects.get_or_create(
+                name=platform_config['name'],
+                defaults={
+                    'base_url': platform_config['base_url'],
+                    'article_list_url': platform_config['start_url'],
+                    'article_selector': platform_config['article_selector']
+                }
+            )
+
+            # 确保更新已存在的网站对象的配置
+            if not created:
+                website.base_url = platform_config['base_url']
+                website.article_list_url = platform_config['start_url']
+                website.article_selector = platform_config['article_selector']
+                website.save()
+
+            self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
+            full_site_crawler(platform_config['start_url'], website, max_pages=500)
+            self.stdout.write(f"完成爬取: {platform_config['name']}")
+
+        self.stdout.write(self.style.SUCCESS("新华社所有平台爬取完成"))
@@ -1,21 +0,0 @@ (file deleted)
-from django.core.management.base import BaseCommand
-from core.models import Website
-from core.utils import crawl_xinhua_list
-
-class Command(BaseCommand):
-    help = '批量爬取新华网文章'
-
-    def handle(self, *args, **options):
-        # 添加使用标记,确认该命令是否被调用
-        self.stdout.write(self.style.WARNING("crawl_xinhua command is being used"))
-
-        list_url = "https://www.news.cn/legal/index.html"
-        try:
-            website = Website.objects.get(base_url="https://www.news.cn/")
-        except Website.DoesNotExist:
-            self.stdout.write(self.style.ERROR("网站 https://www.news.cn/ 不存在,请先后台添加"))
-            return
-
-        self.stdout.write(f"开始爬取文章列表页: {list_url}")
-        crawl_xinhua_list(list_url, website)
-        self.stdout.write(self.style.SUCCESS("批量爬取完成"))
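The rewritten command above, like every crawl_* command in this change set, delegates the actual work to core.utils.full_site_crawler(start_url, website, max_pages=500), whose implementation is not part of this diff. As a rough sketch only, a breadth-first same-domain crawler with that signature might look like the following; requests, BeautifulSoup, and the commented-out save_article() helper are assumptions, not code from this repository:

```python
# Minimal sketch only -- the real core.utils.full_site_crawler is not shown in
# this diff; requests/BeautifulSoup and save_article() are assumptions.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def full_site_crawler(start_url, website, max_pages=500):
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        try:
            resp = requests.get(url, timeout=10)
            resp.encoding = resp.apparent_encoding  # re-detect encoding to avoid mojibake on GBK pages
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, 'html.parser')
        # save_article(url, soup, website)  # would create an Article row for pages recognised as articles
        for a in soup.select(website.article_selector or 'a'):
            href = urljoin(url, a.get('href', ''))
            if urlparse(href).netloc == domain and href not in seen:
                seen.add(href)
                queue.append(href)
```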
65  core/management/commands/crawl_xuexi.py  Normal file
@@ -0,0 +1,65 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 学习强国中央媒体学习号及省级以上学习平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['xuexi', 'central', 'provincial', 'all'],
|
||||||
|
help='选择爬取平台: xuexi(学习强国主站), central(中央媒体), provincial(省级平台), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 学习强国各平台配置
|
||||||
|
platforms = {
|
||||||
|
'xuexi': {
|
||||||
|
'name': '学习强国',
|
||||||
|
'base_url': 'https://www.xuexi.cn',
|
||||||
|
'start_url': 'https://www.xuexi.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'central': {
|
||||||
|
'name': '学习强国中央媒体',
|
||||||
|
'base_url': 'https://www.xuexi.cn',
|
||||||
|
'start_url': 'https://www.xuexi.cn/central',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'provincial': {
|
||||||
|
'name': '学习强国省级平台',
|
||||||
|
'base_url': 'https://www.xuexi.cn',
|
||||||
|
'start_url': 'https://www.xuexi.cn/provincial',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("学习强国所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_xxsb.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 学习时报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['xxsb', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: xxsb(学习时报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 学习时报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'xxsb': {
|
||||||
|
'name': '学习时报',
|
||||||
|
'base_url': 'http://www.studytimes.cn',
|
||||||
|
'start_url': 'http://www.studytimes.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '学习时报移动端',
|
||||||
|
'base_url': 'http://m.studytimes.cn',
|
||||||
|
'start_url': 'http://m.studytimes.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("学习时报所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_zgfnb.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 中国妇女报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['zgfnb', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: zgfnb(中国妇女报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 中国妇女报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'zgfnb': {
|
||||||
|
'name': '中国妇女报',
|
||||||
|
'base_url': 'http://www.cnwomen.com.cn',
|
||||||
|
'start_url': 'http://www.cnwomen.com.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '中国妇女报移动端',
|
||||||
|
'base_url': 'http://m.cnwomen.com.cn',
|
||||||
|
'start_url': 'http://m.cnwomen.com.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("中国妇女报所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_zgjwjc.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 中国纪检监察报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['zgjwjc', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: zgjwjc(中国纪检监察报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 中国纪检监察报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'zgjwjc': {
|
||||||
|
'name': '中国纪检监察报',
|
||||||
|
'base_url': 'http://www.jjjcb.cn',
|
||||||
|
'start_url': 'http://www.jjjcb.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '中国纪检监察报移动端',
|
||||||
|
'base_url': 'http://m.jjjcb.cn',
|
||||||
|
'start_url': 'http://m.jjjcb.cn',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("中国纪检监察报所有平台爬取完成"))
|
||||||
59  core/management/commands/crawl_zgqnb.py  Normal file
@@ -0,0 +1,59 @@
|
|||||||
|
from django.core.management.base import BaseCommand
|
||||||
|
from core.models import Website
|
||||||
|
from core.utils import full_site_crawler
|
||||||
|
|
||||||
|
|
||||||
|
class Command(BaseCommand):
|
||||||
|
help = "全站递归爬取 中国青年报及其子网站、客户端、新媒体平台"
|
||||||
|
|
||||||
|
def add_arguments(self, parser):
|
||||||
|
parser.add_argument('--platform', type=str, default='all',
|
||||||
|
choices=['zgqnb', 'mobile', 'all'],
|
||||||
|
help='选择爬取平台: zgqnb(中国青年报), mobile(移动端), all(全部)')
|
||||||
|
|
||||||
|
def handle(self, *args, **options):
|
||||||
|
platform = options['platform']
|
||||||
|
|
||||||
|
# 中国青年报各平台配置
|
||||||
|
platforms = {
|
||||||
|
'zgqnb': {
|
||||||
|
'name': '中国青年报',
|
||||||
|
'base_url': 'https://www.cyol.com',
|
||||||
|
'start_url': 'https://www.cyol.com',
|
||||||
|
'article_selector': 'a'
|
||||||
|
},
|
||||||
|
'mobile': {
|
||||||
|
'name': '中国青年报移动端',
|
||||||
|
'base_url': 'https://m.cyol.com',
|
||||||
|
'start_url': 'https://m.cyol.com',
|
||||||
|
'article_selector': 'a'
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if platform == 'all':
|
||||||
|
target_platforms = platforms.values()
|
||||||
|
else:
|
||||||
|
target_platforms = [platforms[platform]]
|
||||||
|
|
||||||
|
for platform_config in target_platforms:
|
||||||
|
website, created = Website.objects.get_or_create(
|
||||||
|
name=platform_config['name'],
|
||||||
|
defaults={
|
||||||
|
'base_url': platform_config['base_url'],
|
||||||
|
'article_list_url': platform_config['start_url'],
|
||||||
|
'article_selector': platform_config['article_selector']
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
# 确保更新已存在的网站对象的配置
|
||||||
|
if not created:
|
||||||
|
website.base_url = platform_config['base_url']
|
||||||
|
website.article_list_url = platform_config['start_url']
|
||||||
|
website.article_selector = platform_config['article_selector']
|
||||||
|
website.save()
|
||||||
|
|
||||||
|
self.stdout.write(f"开始爬取: {platform_config['name']} - {platform_config['start_url']}")
|
||||||
|
full_site_crawler(platform_config['start_url'], website, max_pages=500)
|
||||||
|
self.stdout.write(f"完成爬取: {platform_config['name']}")
|
||||||
|
|
||||||
|
self.stdout.write(self.style.SUCCESS("中国青年报所有平台爬取完成"))
|
||||||
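Every crawl_* command above repeats the same pattern: Website.objects.get_or_create keyed on name, followed by manually copying base_url, article_list_url and article_selector onto an already-existing row. Django's update_or_create expresses the same thing in one call; a sketch using the field names from those commands:

```python
# Equivalent to the get_or_create + manual-update pattern repeated in the
# crawl_* commands above (sketch; field names taken from those commands).
website, created = Website.objects.update_or_create(
    name=platform_config['name'],
    defaults={
        'base_url': platform_config['base_url'],
        'article_list_url': platform_config['start_url'],
        'article_selector': platform_config['article_selector'],
    },
)
```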
@@ -4,7 +4,6 @@ import json
 import csv
 import os
 from django.conf import settings
-from django.core.files.storage import default_storage
 import zipfile
 from django.utils import timezone
 
@@ -13,16 +12,20 @@ class Command(BaseCommand):
     help = '导出文章及相关的媒体文件(图片、视频等)'
 
     def add_arguments(self, parser):
-        parser.add_argument('--format', type=str, default='json', help='导出格式: json 或 csv')
+        parser.add_argument('--format', type=str, default='docx', help='导出格式: json、csv 或 docx')
         parser.add_argument('--website', type=str, help='指定网站名称导出特定网站的文章')
         parser.add_argument('--output', type=str, default='', help='输出文件路径')
-        parser.add_argument('--include-media', action='store_true', help='包含媒体文件')
+        # 修改默认值为True,使包含媒体文件成为默认行为
+        parser.add_argument('--include-media', action='store_true', default=True, help='包含媒体文件')
+        # 添加参数控制是否打包成zip
+        parser.add_argument('--no-zip', action='store_true', help='不打包成zip文件')
 
     def handle(self, *args, **options):
         format_type = options['format'].lower()
         website_name = options['website']
         output_path = options['output']
         include_media = options['include_media']
+        no_zip = options['no_zip']
 
         # 获取文章查询集
         articles = Article.objects.all()
@@ -65,20 +68,26 @@ class Command(BaseCommand):
         # 确定输出路径
         if not output_path:
             timestamp = timezone.now().strftime('%Y%m%d_%H%M%S')
-            if include_media:
-                output_path = f'articles_export_{timestamp}.zip'
-            else:
-                output_path = f'articles_export_{timestamp}.{format_type}'
+            # 默认导出为zip格式
+            output_path = f'articles_export_{timestamp}.zip'
 
         # 执行导出
-        if include_media:
-            self.export_with_media(articles_data, media_files, output_path, format_type)
+        # 如果需要包含媒体文件或格式为docx,则默认打包成zip
+        if include_media or format_type == 'docx':
+            if no_zip:
+                if format_type == 'docx':
+                    self.export_as_word(articles_data, output_path)
+                elif format_type == 'json':
+                    self.export_as_json(articles_data, output_path)
+                elif format_type == 'csv':
+                    self.export_as_csv(articles_data, output_path)
+            else:
+                self.export_with_media(articles_data, media_files, output_path, format_type)
         else:
             if format_type == 'json':
                 self.export_as_json(articles_data, output_path)
             elif format_type == 'csv':
                 self.export_as_csv(articles_data, output_path)
-            # 添加Word格式导出支持
             elif format_type == 'docx':
                 self.export_as_word(articles_data, output_path)
             else:
@@ -220,7 +229,6 @@ class Command(BaseCommand):
                     'media_files'] else ''
                 writer.writerow(article_data)
             zipf.writestr(data_filename, csv_buffer.getvalue())
-            # 添加Word格式支持
         elif format_type == 'docx':
             # 创建Word文档并保存到ZIP
             try:
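With the changes above, the exporter now defaults to `--format docx`, treats `--include-media` as on by default, and bundles everything into a zip unless `--no-zip` is passed. The command's module name is not visible in this hunk; assuming it is registered as `export_articles`, a typical invocation would be `python manage.py export_articles --format docx --website 人民网`, with `--no-zip` added to write the bare document instead of an archive.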
19  core/templates/admin/core/article/change_list.html  Normal file
@@ -0,0 +1,19 @@
{% extends "admin/change_list.html" %}
{% load admin_urls %}

{% block object-tools %}
    {{ block.super }}
    <div style="margin-top: 10px;">
        <form method="post" action="{% url 'admin:run_crawler' %}" style="display: inline-block;">
            {% csrf_token %}
            <label for="website-select">选择网站:</label>
            <select name="website_name" id="website-select" required>
                <option value="">-- 请选择网站 --</option>
                {% for website in cl.model_admin.get_websites %}
                    <option value="{{ website.name }}">{{ website.name }}</option>
                {% endfor %}
            </select>
            <input type="submit" value="执行爬虫" class="default" style="margin-left: 10px;"/>
        </form>
    </div>
{% endblock %}
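The change_list override above assumes two things that are not part of this diff: a `get_websites` callable on the Article ModelAdmin (used to fill the `<select>`) and an `admin:run_crawler` URL that accepts the POST. A minimal sketch of that wiring, using only names taken from the template; the body of run_crawler, and the `call_command('crawl_news')` placeholder in particular, are assumptions rather than this project's actual implementation:

```python
# Sketch only: the template references cl.model_admin.get_websites and the
# 'admin:run_crawler' URL; this shows one way such an admin could be wired up.
from django.contrib import admin, messages
from django.core.management import call_command
from django.shortcuts import redirect
from django.urls import path

from core.models import Article, Website


@admin.register(Article)
class ArticleAdmin(admin.ModelAdmin):
    change_list_template = 'admin/core/article/change_list.html'

    def get_websites(self):
        # Populates the <select> in the change_list override
        return Website.objects.all()

    def get_urls(self):
        urls = super().get_urls()
        custom = [
            path('run-crawler/', self.admin_site.admin_view(self.run_crawler),
                 name='run_crawler'),
        ]
        return custom + urls

    def run_crawler(self, request):
        name = request.POST.get('website_name')
        # A real implementation would map the selected website name to the
        # matching crawl_* management command; 'crawl_news' is a placeholder.
        call_command('crawl_news')
        messages.success(request, f'已触发爬虫: {name}')
        return redirect('..')
```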
@@ -2,24 +2,25 @@
|
|||||||
<html lang="zh">
|
<html lang="zh">
|
||||||
<head>
|
<head>
|
||||||
<meta charset="UTF-8"/>
|
<meta charset="UTF-8"/>
|
||||||
<title>{{ article.title }}</title>
|
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
|
||||||
|
<title>{{ article.title }} - 绿色课堂</title>
|
||||||
<style>
|
<style>
|
||||||
body {
|
body {
|
||||||
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
||||||
line-height: 1.6;
|
line-height: 1.6;
|
||||||
color: #333;
|
color: #333;
|
||||||
max-width: 1200px; /* 修改:同步调整页面最大宽度与列表页一致 */
|
|
||||||
margin: 0 auto;
|
margin: 0 auto;
|
||||||
padding: 20px;
|
padding: 20px;
|
||||||
background-color: #f8f9fa;
|
background-color: #f0f8ff;
|
||||||
|
max-width: 800px;
|
||||||
}
|
}
|
||||||
|
|
||||||
.article-container {
|
.container {
|
||||||
background: white;
|
background: white;
|
||||||
border-radius: 8px;
|
|
||||||
box-shadow: 0 2px 10px rgba(0, 0, 0, 0.1);
|
|
||||||
padding: 30px;
|
padding: 30px;
|
||||||
margin-bottom: 20px;
|
margin-bottom: 20px;
|
||||||
|
box-shadow: 0 2px 5px rgba(0, 0, 0, 0.05);
|
||||||
|
border-radius: 8px;
|
||||||
}
|
}
|
||||||
|
|
||||||
h1 {
|
h1 {
|
||||||
@@ -30,56 +31,59 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
.meta {
|
.meta {
|
||||||
color: #7f8c8d;
|
color: #78909c;
|
||||||
font-size: 0.9em;
|
font-size: 0.9em;
|
||||||
margin-bottom: 20px;
|
margin-bottom: 20px;
|
||||||
}
|
}
|
||||||
|
|
||||||
hr {
|
|
||||||
border: 0;
|
|
||||||
height: 1px;
|
|
||||||
background: #ecf0f1;
|
|
||||||
margin: 20px 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
.content {
|
.content {
|
||||||
font-size: 16px;
|
margin-top: 20px;
|
||||||
}
|
}
|
||||||
|
|
||||||
.content img {
|
.content img {
|
||||||
max-width: 100%;
|
max-width: 100%;
|
||||||
height: auto;
|
height: auto;
|
||||||
border-radius: 4px;
|
|
||||||
margin: 10px 0;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
.back-link {
|
.back-link {
|
||||||
display: inline-block;
|
display: inline-block;
|
||||||
padding: 10px 20px;
|
margin-bottom: 20px;
|
||||||
background-color: #3498db;
|
color: #1976d2;
|
||||||
color: white;
|
|
||||||
text-decoration: none;
|
text-decoration: none;
|
||||||
border-radius: 4px;
|
|
||||||
transition: background-color 0.3s;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
.back-link:hover {
|
.back-link:hover {
|
||||||
background-color: #2980b9;
|
color: #0d47a1;
|
||||||
|
text-decoration: underline;
|
||||||
|
}
|
||||||
|
|
||||||
|
@media (max-width: 600px) {
|
||||||
|
body {
|
||||||
|
padding: 10px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.container {
|
||||||
|
padding: 15px;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
</style>
|
</style>
|
||||||
</head>
|
</head>
|
||||||
<body>
|
<body>
|
||||||
<div class="article-container">
|
<div class="container">
|
||||||
<h1>{{ article.title }}</h1>
|
<a href="{% url 'article_list' %}" class="back-link">« 返回文章列表</a>
|
||||||
<div class="meta">
|
|
||||||
<p>发布时间: {{ article.pub_date|date:"Y-m-d H:i" }}</p>
|
<h1>{{ article.title }}</h1>
|
||||||
|
|
||||||
|
<div class="meta">
|
||||||
|
网站: {{ article.website.name }} |
|
||||||
|
发布时间: {{ article.pub_date|date:"Y-m-d H:i" }} |
|
||||||
|
创建时间: {{ article.created_at|date:"Y-m-d H:i" }} |
|
||||||
|
源网址: <a href="{{ article.url }}" target="_blank">{{ article.url }}</a>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
<div class="content">
|
||||||
|
{{ article.content|safe }}
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
<hr/>
|
|
||||||
<div class="content">
|
|
||||||
{{ article.content|safe }}
|
|
||||||
</div>
|
|
||||||
<hr/>
|
|
||||||
<p><a href="{% url 'article_list' %}" class="back-link">← 返回列表</a></p>
|
|
||||||
</div>
|
|
||||||
</body>
|
</body>
|
||||||
</html>
|
</html>
|
||||||
@@ -8,18 +8,17 @@
|
|||||||
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif;
|
||||||
line-height: 1.6;
|
line-height: 1.6;
|
||||||
color: #333;
|
color: #333;
|
||||||
max-width: 1200px; /* 修改:增加页面最大宽度 */
|
|
||||||
margin: 0 auto;
|
margin: 0 auto;
|
||||||
padding: 20px;
|
padding: 20px;
|
||||||
background-color: #f8f9fa;
|
background-color: #f0f8ff; /* 统一背景色调 */
|
||||||
}
|
}
|
||||||
|
|
||||||
.container {
|
.container {
|
||||||
background: white;
|
background: white;
|
||||||
border-radius: 8px;
|
|
||||||
box-shadow: 0 2px 10px rgba(0, 0, 0, 0.1);
|
|
||||||
padding: 30px;
|
padding: 30px;
|
||||||
margin-bottom: 20px;
|
margin-bottom: 20px;
|
||||||
|
box-shadow: 0 2px 5px rgba(0,0,0,0.05); /* 添加轻微阴影 */
|
||||||
|
border-radius: 8px; /* 添加圆角 */
|
||||||
}
|
}
|
||||||
|
|
||||||
h1 {
|
h1 {
|
||||||
@@ -32,7 +31,7 @@
|
|||||||
.filters {
|
.filters {
|
||||||
margin-bottom: 20px;
|
margin-bottom: 20px;
|
||||||
padding: 15px;
|
padding: 15px;
|
||||||
background-color: #f1f8ff;
|
background-color: #e3f2fd; /* 统一滤镜背景色调 */
|
||||||
border-radius: 5px;
|
border-radius: 5px;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -40,8 +39,8 @@
|
|||||||
display: inline-block;
|
display: inline-block;
|
||||||
padding: 5px 10px;
|
padding: 5px 10px;
|
||||||
margin: 0 5px 5px 0;
|
margin: 0 5px 5px 0;
|
||||||
background-color: #e1e8ed;
|
background-color: #bbdefb; /* 统一链接背景色调 */
|
||||||
color: #333;
|
color: #0d47a1;
|
||||||
text-decoration: none;
|
text-decoration: none;
|
||||||
border-radius: 3px;
|
border-radius: 3px;
|
||||||
}
|
}
|
||||||
@@ -58,7 +57,7 @@
|
|||||||
|
|
||||||
li {
|
li {
|
||||||
padding: 10px 0;
|
padding: 10px 0;
|
||||||
border-bottom: 1px solid #ecf0f1;
|
border-bottom: 1px solid #e0e0e0; /* 统一分隔线颜色 */
|
||||||
}
|
}
|
||||||
|
|
||||||
li:last-child {
|
li:last-child {
|
||||||
@@ -66,17 +65,17 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
a {
|
a {
|
||||||
color: #3498db;
|
color: #1976d2; /* 统一链接颜色 */
|
||||||
text-decoration: none;
|
text-decoration: none;
|
||||||
}
|
}
|
||||||
|
|
||||||
a:hover {
|
a:hover {
|
||||||
color: #2980b9;
|
color: #0d47a1; /* 统一悬停颜色 */
|
||||||
text-decoration: underline;
|
text-decoration: underline;
|
||||||
}
|
}
|
||||||
|
|
||||||
.meta {
|
.meta {
|
||||||
color: #7f8c8d;
|
color: #78909c; /* 统一元数据颜色 */
|
||||||
font-size: 0.9em;
|
font-size: 0.9em;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -122,16 +121,17 @@
|
|||||||
.search-form {
|
.search-form {
|
||||||
margin-bottom: 20px;
|
margin-bottom: 20px;
|
||||||
padding: 15px;
|
padding: 15px;
|
||||||
background-color: #f1f8ff;
|
background-color: #e3f2fd; /* 统一搜索框背景色调 */
|
||||||
border-radius: 5px;
|
border-radius: 5px;
|
||||||
}
|
}
|
||||||
|
|
||||||
.search-form input[type="text"] {
|
.search-form input[type="text"] {
|
||||||
padding: 8px 12px;
|
padding: 8px 12px;
|
||||||
border: 1px solid #ddd;
|
border: 1px solid #bbdefb; /* 统一边框颜色 */
|
||||||
border-radius: 4px;
|
border-radius: 4px;
|
||||||
width: 300px;
|
width: 300px;
|
||||||
margin-right: 10px;
|
margin-right: 10px;
|
||||||
|
background-color: #fff;
|
||||||
}
|
}
|
||||||
|
|
||||||
.search-form input[type="submit"] {
|
.search-form input[type="submit"] {
|
||||||
@@ -148,21 +148,93 @@
|
|||||||
}
|
}
|
||||||
|
|
||||||
.search-info {
|
.search-info {
|
||||||
color: #7f8c8d;
|
color: #78909c; /* 统一搜索信息颜色 */
|
||||||
font-size: 0.9em;
|
font-size: 0.9em;
|
||||||
margin-bottom: 10px;
|
margin-bottom: 10px;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* 新增:左侧筛选栏样式 */
|
||||||
|
.content-wrapper {
|
||||||
|
display: flex;
|
||||||
|
gap: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar {
|
||||||
|
flex: 0 0 200px;
|
||||||
|
background-color: #e3f2fd; /* 统一边栏背景色调 */
|
||||||
|
border-radius: 5px;
|
||||||
|
padding: 15px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.main-content {
|
||||||
|
flex: 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar .filters {
|
||||||
|
margin-bottom: 20px;
|
||||||
|
padding: 0;
|
||||||
|
background-color: transparent;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar .filters strong {
|
||||||
|
display: block;
|
||||||
|
margin-bottom: 10px;
|
||||||
|
color: #2c3e50;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar .filters a {
|
||||||
|
display: block;
|
||||||
|
padding: 8px 10px;
|
||||||
|
margin: 0 0 5px 0;
|
||||||
|
background-color: #bbdefb; /* 统一边栏链接背景色调 */
|
||||||
|
color: #0d47a1;
|
||||||
|
text-decoration: none;
|
||||||
|
border-radius: 3px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.sidebar .filters a.active {
|
||||||
|
background-color: #3498db;
|
||||||
|
color: white;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* 新增:导出功能样式 */
|
||||||
|
.export-section {
|
||||||
|
margin-bottom: 20px;
|
||||||
|
padding: 15px;
|
||||||
|
background-color: #e8f5e9; /* 统一导出区域背景色调 */
|
||||||
|
border-radius: 5px;
|
||||||
|
text-align: center;
|
||||||
|
}
|
||||||
|
|
||||||
|
.export-btn {
|
||||||
|
padding: 10px 20px;
|
||||||
|
background-color: #4caf50; /* 统一按钮背景色调 */
|
||||||
|
color: white;
|
||||||
|
border: none;
|
||||||
|
border-radius: 4px;
|
||||||
|
cursor: pointer;
|
||||||
|
font-size: 16px;
|
||||||
|
margin: 0 5px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.export-btn:hover {
|
||||||
|
background-color: #388e3c; /* 统一按钮悬停色调 */
|
||||||
|
}
|
||||||
|
|
||||||
|
.export-btn:disabled {
|
||||||
|
background-color: #9e9e9e; /* 统一禁用按钮色调 */
|
||||||
|
cursor: not-allowed;
|
||||||
|
}
|
||||||
|
|
||||||
|
.article-checkbox {
|
||||||
|
margin-right: 10px;
|
||||||
|
}
|
||||||
</style>
|
</style>
|
||||||
</head>
|
</head>
|
||||||
<body>
|
<body>
|
||||||
<div class="container">
|
<div class="container">
|
||||||
<h1>绿色课堂文章列表</h1>
|
<h1>绿色课堂文章列表</h1>
|
||||||
|
|
||||||
<!-- 新增:返回首页链接 -->
|
|
||||||
<div style="margin-bottom: 20px;">
|
|
||||||
<a href="{% url 'article_list' %}" style="color: #3498db; text-decoration: none;">← 返回首页</a>
|
|
||||||
</div>
|
|
||||||
|
|
||||||
<!-- 新增:搜索表单 -->
|
<!-- 新增:搜索表单 -->
|
||||||
<div class="search-form">
|
<div class="search-form">
|
||||||
<form method="get">
|
<form method="get">
|
||||||
@@ -174,79 +246,255 @@
|
|||||||
</form>
|
</form>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<div class="filters">
|
<div class="content-wrapper">
|
||||||
<strong>按网站筛选:</strong>
|
<!-- 左侧筛选栏 -->
|
||||||
<a href="{% url 'article_list' %}{% if search_query %}?q={{ search_query }}{% endif %}" {% if not selected_website %}class="active" {% endif %}>全部</a>
|
<div class="sidebar">
|
||||||
{% for website in websites %}
|
<div class="filters">
|
||||||
<a href="?website={{ website.id }}{% if search_query %}&q={{ search_query }}{% endif %}" {% if selected_website and selected_website.id == website.id %}class="active" {% endif %}>{{ website.name }}</a>
|
<strong>按网站筛选:</strong>
|
||||||
{% endfor %}
|
<a href="{% url 'article_list' %}{% if search_query %}?q={{ search_query }}{% endif %}" {% if not selected_website %}class="active" {% endif %}>全部</a>
|
||||||
</div>
|
{% for website in websites %}
|
||||||
|
<a href="?website={{ website.id }}{% if search_query %}&q={{ search_query }}{% endif %}" {% if selected_website and selected_website.id == website.id %}class="active" {% endif %}>{{ website.name }}</a>
|
||||||
|
{% endfor %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
<!-- 新增:搜索结果信息 -->
|
<!-- 主内容区域 -->
|
||||||
{% if search_query %}
|
<div class="main-content">
|
||||||
<div class="search-info">
|
<!-- 新增:搜索结果信息 -->
|
||||||
搜索 "{{ search_query }}" 找到 {{ page_obj.paginator.count }} 篇文章
|
{% if search_query %}
|
||||||
<a href="{% if selected_website %}?website={{ selected_website.id }}{% else %}{% url 'article_list' %}{% endif %}">清除搜索</a>
|
<div class="search-info">
|
||||||
</div>
|
搜索 "{{ search_query }}" 找到 {{ page_obj.paginator.count }} 篇文章
|
||||||
{% endif %}
|
<a href="{% if selected_website %}?website={{ selected_website.id }}{% else %}{% url 'article_list' %}{% endif %}">清除搜索</a>
|
||||||
|
</div>
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
<ul>
|
<!-- 新增:导出功能 -->
|
||||||
{% for article in page_obj %}
|
<div class="export-section">
|
||||||
<li>
|
<button id="selectAllBtn" class="export-btn">全选</button>
|
||||||
<a href="{% url 'article_detail' article.id %}">{{ article.title }}</a>
|
<button id="deselectAllBtn" class="export-btn">取消全选</button>
|
||||||
<div class="meta">({{ article.website.name }} - {{ article.created_at|date:"Y-m-d" }})</div>
|
<button id="exportJsonBtn" class="export-btn" disabled>导出为JSON</button>
|
||||||
</li>
|
<button id="exportCsvBtn" class="export-btn" disabled>导出为CSV</button>
|
||||||
{% empty %}
|
<!-- 新增:导出为ZIP包按钮 -->
|
||||||
<li>暂无文章</li>
|
<button id="exportZipBtn" class="export-btn" disabled>导出为ZIP包</button>
|
||||||
{% endfor %}
|
</div>
|
||||||
</ul>
|
|
||||||
|
|
||||||
<div class="pagination">
|
<ul>
|
||||||
{% if page_obj.has_previous %}
|
{% for article in page_obj %}
|
||||||
{% if selected_website %}
|
<li>
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page=1">« 首页</a>
|
<input type="checkbox" class="article-checkbox" value="{{ article.id }}" id="article_{{ article.id }}">
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.previous_page_number }}">上一页</a>
|
<a href="{% url 'article_detail' article.id %}">{{ article.title }}</a>
|
||||||
{% else %}
|
<div class="meta">({{ article.website.name }} - {{ article.created_at|date:"Y-m-d" }})</div>
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page=1">« 首页</a>
|
</li>
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.previous_page_number }}">上一页</a>
|
{% empty %}
|
||||||
{% endif %}
|
<li>暂无文章</li>
|
||||||
{% endif %}
|
{% endfor %}
|
||||||
|
</ul>
|
||||||
|
|
||||||
<span>第 {{ page_obj.number }} 页,共 {{ page_obj.paginator.num_pages }} 页</span>
|
<div class="pagination">
|
||||||
|
{% if page_obj.has_previous %}
|
||||||
|
{% if selected_website %}
|
||||||
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page=1">« 首页</a>
|
||||||
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.previous_page_number }}">上一页</a>
|
||||||
|
{% else %}
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page=1">« 首页</a>
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.previous_page_number }}">上一页</a>
|
||||||
|
{% endif %}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
<!-- 修改:优化页码显示逻辑 -->
|
<span>第 {{ page_obj.number }} 页,共 {{ page_obj.paginator.num_pages }} 页</span>
|
||||||
{% with page_obj.paginator as paginator %}
|
|
||||||
{% for num in paginator.page_range %}
|
|
||||||
{% if page_obj.number == num %}
|
|
||||||
<a href="#" class="current">{{ num }}</a>
|
|
||||||
{% elif num > page_obj.number|add:'-3' and num < page_obj.number|add:'3' %}
|
|
||||||
{% if selected_website %}
|
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ num }}">{{ num }}</a>
|
|
||||||
{% else %}
|
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ num }}">{{ num }}</a>
|
|
||||||
{% endif %}
|
|
||||||
{% elif num == 1 or num == paginator.num_pages %}
|
|
||||||
{% if selected_website %}
|
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ num }}">{{ num }}</a>
|
|
||||||
{% else %}
|
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ num }}">{{ num }}</a>
|
|
||||||
{% endif %}
|
|
||||||
{% elif num == page_obj.number|add:'-3' or num == page_obj.number|add:'3' %}
|
|
||||||
<span class="ellipsis">...</span>
|
|
||||||
{% endif %}
|
|
||||||
{% endfor %}
|
|
||||||
{% endwith %}
|
|
||||||
|
|
||||||
{% if page_obj.has_next %}
|
<!-- 修改:优化页码显示逻辑 -->
|
||||||
{% if selected_website %}
|
{% with page_obj.paginator as paginator %}
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.next_page_number }}">下一页</a>
|
{% for num in paginator.page_range %}
|
||||||
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.paginator.num_pages }}">末页 »</a>
|
{% if page_obj.number == num %}
|
||||||
{% else %}
|
<a href="#" class="current">{{ num }}</a>
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.next_page_number }}">下一页</a>
|
{% elif num > page_obj.number|add:'-3' and num < page_obj.number|add:'3' %}
|
||||||
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.paginator.num_pages }}">末页 »</a>
|
{% if selected_website %}
|
||||||
{% endif %}
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ num }}">{{ num }}</a>
|
||||||
{% endif %}
|
{% else %}
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ num }}">{{ num }}</a>
|
||||||
|
{% endif %}
|
||||||
|
{% elif num == 1 or num == paginator.num_pages %}
|
||||||
|
{% if selected_website %}
|
||||||
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ num }}">{{ num }}</a>
|
||||||
|
{% else %}
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ num }}">{{ num }}</a>
|
||||||
|
{% endif %}
|
||||||
|
{% elif num == page_obj.number|add:'-3' or num == page_obj.number|add:'3' %}
|
||||||
|
<span class="ellipsis">...</span>
|
||||||
|
{% endif %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endwith %}
|
||||||
|
|
||||||
|
{% if page_obj.has_next %}
|
||||||
|
{% if selected_website %}
|
||||||
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.next_page_number }}">下一页</a>
|
||||||
|
<a href="?website={{ selected_website.id }}{% if search_query %}&q={{ search_query }}{% endif %}&page={{ page_obj.paginator.num_pages }}">末页 »</a>
|
||||||
|
{% else %}
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.next_page_number }}">下一页</a>
|
||||||
|
<a href="?{% if search_query %}q={{ search_query }}&{% endif %}page={{ page_obj.paginator.num_pages }}">末页 »</a>
|
||||||
|
{% endif %}
|
||||||
|
{% endif %}
|
||||||
|
</div>
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
|
||||||
<script>
    // JavaScript for the export feature
    const checkboxes = document.querySelectorAll('.article-checkbox');
    const exportJsonBtn = document.getElementById('exportJsonBtn');
    const exportCsvBtn = document.getElementById('exportCsvBtn');
    const selectAllBtn = document.getElementById('selectAllBtn');
    const deselectAllBtn = document.getElementById('deselectAllBtn');
    // New: grab the ZIP export button element
    const exportZipBtn = document.getElementById('exportZipBtn');

    // Update the enabled/disabled state of the export buttons
    function updateExportButtons() {
        const selectedCount = document.querySelectorAll('.article-checkbox:checked').length;
        exportJsonBtn.disabled = selectedCount === 0;
        exportCsvBtn.disabled = selectedCount === 0;
        exportZipBtn.disabled = selectedCount === 0; // New: keep the ZIP button in sync
    }

    // Add a change listener to every checkbox
    checkboxes.forEach(checkbox => {
        checkbox.addEventListener('change', updateExportButtons);
    });

    // Select all
    selectAllBtn.addEventListener('click', () => {
        checkboxes.forEach(checkbox => {
            checkbox.checked = true;
        });
        updateExportButtons();
    });

    // Deselect all
    deselectAllBtn.addEventListener('click', () => {
        checkboxes.forEach(checkbox => {
            checkbox.checked = false;
        });
        updateExportButtons();
    });

    // Export as JSON
    exportJsonBtn.addEventListener('click', () => {
        const selectedArticles = Array.from(document.querySelectorAll('.article-checkbox:checked'))
            .map(checkbox => checkbox.value);

        // POST the selected article ids to the export view
        fetch('{% url "export_articles" %}', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'X-CSRFToken': '{{ csrf_token }}'
            },
            body: JSON.stringify({
                article_ids: selectedArticles,
                format: 'json'
            })
        })
        .then(response => {
            if (response.ok) {
                return response.blob();
            }
            throw new Error('导出失败');
        })
        .then(blob => {
            const url = window.URL.createObjectURL(blob);
            const a = document.createElement('a');
            a.href = url;
            a.download = 'articles.json';
            document.body.appendChild(a);
            a.click();
            window.URL.revokeObjectURL(url);
            document.body.removeChild(a);
        })
        .catch(error => {
            alert('导出失败: ' + error);
        });
    });

    // Export as CSV
    exportCsvBtn.addEventListener('click', () => {
        const selectedArticles = Array.from(document.querySelectorAll('.article-checkbox:checked'))
            .map(checkbox => checkbox.value);

        // POST the selected article ids to the export view
        fetch('{% url "export_articles" %}', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'X-CSRFToken': '{{ csrf_token }}'
            },
            body: JSON.stringify({
                article_ids: selectedArticles,
                format: 'csv'
            })
        })
        .then(response => {
            if (response.ok) {
                return response.blob();
            }
            throw new Error('导出失败');
        })
        .then(blob => {
            const url = window.URL.createObjectURL(blob);
            const a = document.createElement('a');
            a.href = url;
            a.download = 'articles.csv';
            document.body.appendChild(a);
            a.click();
            window.URL.revokeObjectURL(url);
            document.body.removeChild(a);
        })
        .catch(error => {
            alert('导出失败: ' + error);
        });
    });

    // New: export as a ZIP archive
    exportZipBtn.addEventListener('click', () => {
        const selectedArticles = Array.from(document.querySelectorAll('.article-checkbox:checked'))
            .map(checkbox => checkbox.value);

        // POST the selected article ids, requesting a ZIP archive
        fetch('{% url "export_articles" %}', {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                'X-CSRFToken': '{{ csrf_token }}'
            },
            body: JSON.stringify({
                article_ids: selectedArticles,
                format: 'zip' // request the ZIP export format
            })
        })
        .then(response => {
            if (response.ok) {
                return response.blob();
            }
            throw new Error('导出失败');
        })
        .then(blob => {
            const url = window.URL.createObjectURL(blob);
            const a = document.createElement('a');
            a.href = url;
            a.download = 'articles.zip';
            document.body.appendChild(a);
            a.click();
            window.URL.revokeObjectURL(url);
            document.body.removeChild(a);
        })
        .catch(error => {
            alert('导出失败: ' + error);
        });
    });

    // Initialise the export button state
    updateExportButtons();
</script>
</body>
</html>
15 core/urls.py
@@ -1,10 +1,17 @@
-from django.urls import path
+from django.urls import path, include
 from . import views
+# Added the following import
+from django.contrib import admin
 
 urlpatterns = [
-    # Home page: article list
     path('', views.article_list, name='article_list'),
-    # Article detail
     path('article/<int:article_id>/', views.article_detail, name='article_detail'),
-    # More routes can be added later
+    path('run-crawler/', views.run_crawler, name='run_crawler'),
+    # New: route for checking crawler status
+    path('crawler-status/', views.crawler_status, name='crawler_status'),
+    # New: route for pausing the crawler
+    path('pause-crawler/', views.pause_crawler, name='pause_crawler'),
+    # Route for exporting articles
+    path('export-articles/', views.export_articles, name='export_articles'),
+    # Route for the custom admin backend
 ]
1148 core/utils.py
File diff suppressed because it is too large
377 core/views.py
@@ -1,6 +1,20 @@
+import uuid
 from django.shortcuts import render
 from django.core.paginator import Paginator
+from django.http import JsonResponse
+from django.views.decorators.http import require_http_methods
+from django.core.management import call_command
 from .models import Article, Website
+import threading
+from django.http import HttpResponse
+import json
+import csv
+from django.views.decorators.csrf import csrf_exempt
+from django.utils import timezone
+
+# Global dict used to track the state of crawler tasks
+crawler_tasks = {}
+
 
 def article_list(request):
     # Get all enabled websites
@@ -8,6 +22,7 @@ def article_list(request):
 
     # Get the website filter
     selected_website = None
+    # Changed: always start from all articles unless a specific filter applies
     articles = Article.objects.all()
 
     website_id = request.GET.get('website')
@@ -18,7 +33,7 @@ def article_list(request):
     except Website.DoesNotExist:
         pass
 
-    # New: handle keyword search
+    # Handle keyword search
     search_query = request.GET.get('q')
     if search_query:
         articles = articles.filter(title__icontains=search_query)
@@ -27,7 +42,7 @@ def article_list(request):
     articles = articles.order_by('-created_at')
 
     # Pagination
-    paginator = Paginator(articles, 10)  # articles per page
+    paginator = Paginator(articles, 40)  # articles per page
     page_number = request.GET.get('page')
     page_obj = paginator.get_page(page_number)
 
@@ -35,10 +50,366 @@ def article_list(request):
         'page_obj': page_obj,
         'websites': websites,
         'selected_website': selected_website,
-        # New: pass the search keyword to the template
         'search_query': search_query
     })


def article_detail(request, article_id):
    article = Article.objects.get(id=article_id)
    return render(request, 'core/article_detail.html', {'article': article})


# Task-id generation and status tracking
@require_http_methods(["POST"])
def run_crawler(request):
    """
    Trigger a crawler task from the front end.
    """
    try:
        # Name of the crawler to run
        crawler_name = request.POST.get('crawler_name', '')
        if not crawler_name:
            return JsonResponse({'status': 'error', 'message': '爬虫名称不能为空'})

        # Generate a task id
        task_id = str(uuid.uuid4())

        # Record the article count before the task starts
        initial_count = Article.objects.count()

        # Run the crawler task in a background thread
        def run_spider():
            try:
                # Mark the task as running
                crawler_tasks[task_id] = {
                    'status': 'running',
                    'message': '爬虫正在运行...',
                    'start_time': timezone.now(),
                    'initial_count': initial_count
                }

                # Call the command matching the crawler name
                if crawler_name in ['crawl_xinhua', 'crawl_dongfangyancao']:
                    call_command(crawler_name)
                else:
                    # Generic crawlers go through crawl_articles
                    call_command('crawl_articles', crawler_name)

                # Count how many articles were added
                final_count = Article.objects.count()
                added_count = final_count - initial_count

                # Mark the task as completed
                crawler_tasks[task_id] = {
                    'status': 'completed',
                    'message': f'爬虫已完成,新增 {added_count} 篇文章',
                    'added_count': added_count,
                    'end_time': timezone.now()
                }
            except Exception as e:
                # Changed: friendlier error handling and messages
                error_msg = str(e)
                if "UNIQUE constraint failed" in error_msg and "core_article.url" in error_msg:
                    error_msg = "检测到重复文章URL,已跳过重复项"
                else:
                    print(f"爬虫执行出错: {e}")

                # Count the articles that were still added despite the error
                final_count = Article.objects.count()
                added_count = final_count - initial_count

                # Mark the task as completed (even with partial errors)
                crawler_tasks[task_id] = {
                    'status': 'completed',
                    'message': f'爬虫已完成,新增 {added_count} 篇文章。{error_msg}',
                    'added_count': added_count,
                    'end_time': timezone.now(),
                    'error': error_msg
                }

        # Start the background thread that runs the crawler
        thread = threading.Thread(target=run_spider)
        thread.daemon = True
        thread.start()

        return JsonResponse({'status': 'success', 'message': f'爬虫 {crawler_name} 已启动', 'task_id': task_id})
    except Exception as e:
        return JsonResponse({'status': 'error', 'message': str(e)})
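The run_crawler / crawler_status pair above implements a simple polling protocol: the POST to run-crawler returns a task_id, and the caller keeps POSTing that id to crawler-status until the background thread reports anything other than running. A rough way to exercise the same protocol outside the browser (a sketch only, e.g. from python manage.py shell; it relies on django.test.Client, which does not enforce CSRF checks by default, and on the crawl_xinhua command named above):

    import time
    from django.test import Client

    client = Client()
    start = client.post('/run-crawler/', {'crawler_name': 'crawl_xinhua'}).json()
    task_id = start.get('task_id')

    # Poll /crawler-status/ until the background thread marks the task finished.
    while task_id:
        state = client.post('/crawler-status/', {'task_id': task_id}).json()
        print(state.get('status'), state.get('message'))
        if state.get('status') != 'running':
            break
        time.sleep(2)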


# View that checks crawler status
@require_http_methods(["POST"])
def crawler_status(request):
    """
    Check the status of a crawler task.
    """
    try:
        task_id = request.POST.get('task_id', '')
        if not task_id:
            return JsonResponse({'status': 'error', 'message': '任务ID不能为空'})

        # Look up the task state
        task_info = crawler_tasks.get(task_id)
        if not task_info:
            return JsonResponse({'status': 'error', 'message': '未找到任务'})

        return JsonResponse(task_info)
    except Exception as e:
        return JsonResponse({'status': 'error', 'message': str(e)})


# New: view that pauses the crawler
@require_http_methods(["POST"])
def pause_crawler(request):
    """
    Pause a crawler task.
    """
    try:
        task_id = request.POST.get('task_id', '')
        if not task_id:
            return JsonResponse({'status': 'error', 'message': '任务ID不能为空'})

        # Look up the task state
        task_info = crawler_tasks.get(task_id)
        if not task_info:
            return JsonResponse({'status': 'error', 'message': '未找到任务'})

        # A real pause mechanism should be implemented here;
        # for now the task state is only updated to simulate pausing.
        task_info['status'] = 'paused'
        task_info['message'] = '爬虫已暂停'

        return JsonResponse({
            'status': 'success',
            'message': '爬虫已暂停',
            'progress': 0  # should return the actual progress
        })
    except Exception as e:
        return JsonResponse({'status': 'error', 'message': str(e)})
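As the comment in pause_crawler notes, the pause is only simulated by flipping the status string. One shape a real cooperative pause could take (a minimal sketch under assumptions, not the committed implementation; pause_events, run_spider_with_pause and the crawl callback are hypothetical names that do not exist in this codebase) is a threading.Event per task that the crawl loop checks between articles:

    import threading
    import time

    pause_events = {}  # hypothetical registry: task_id -> threading.Event

    def run_spider_with_pause(task_id, article_urls, crawl_one_article):
        """Crawl the given URLs one by one, blocking while the task is paused."""
        pause_events[task_id] = threading.Event()
        for url in article_urls:
            while pause_events[task_id].is_set():  # set() means "paused"
                time.sleep(1)
            crawl_one_article(url)  # caller supplies the per-article crawl step

    def pause(task_id):
        # pause_crawler could call this in addition to updating crawler_tasks.
        pause_events[task_id].set()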


# New: article export view
@csrf_exempt
@require_http_methods(["POST"])
def export_articles(request):
    try:
        # Parse the request payload
        data = json.loads(request.body)
        article_ids = data.get('article_ids', [])
        format_type = data.get('format', 'json')

        # Fetch the selected articles
        articles = Article.objects.filter(id__in=article_ids)

        if not articles.exists():
            return HttpResponse('没有选中文章', status=400)

        # Export according to the requested format
        if format_type == 'json':
            # Build the JSON payload
            articles_data = []
            for article in articles:
                articles_data.append({
                    'id': article.id,
                    'title': article.title,
                    'website': article.website.name,
                    'url': article.url,
                    'pub_date': article.pub_date.strftime('%Y-%m-%d %H:%M:%S') if article.pub_date else None,
                    'content': article.content,
                    'created_at': article.created_at.strftime('%Y-%m-%d %H:%M:%S'),
                    'media_files': article.media_files
                })

            # Build the JSON response
            response = HttpResponse(
                json.dumps(articles_data, ensure_ascii=False, indent=2),
                content_type='application/json'
            )
            response['Content-Disposition'] = 'attachment; filename="articles.json"'
            return response

        elif format_type == 'csv':
            # Build the CSV response
            response = HttpResponse(content_type='text/csv')
            response['Content-Disposition'] = 'attachment; filename="articles.csv"'

            # Create the CSV writer
            writer = csv.writer(response)
            writer.writerow(['ID', '标题', '网站', 'URL', '发布时间', '内容', '创建时间', '媒体文件'])

            # Write one row per article
            for article in articles:
                writer.writerow([
                    article.id,
                    article.title,
                    article.website.name,
                    article.url,
                    article.pub_date.strftime('%Y-%m-%d %H:%M:%S') if article.pub_date else '',
                    article.content,
                    article.created_at.strftime('%Y-%m-%d %H:%M:%S'),
                    ';'.join(article.media_files) if article.media_files else ''
                ])

            return response

        # New: ZIP export
        elif format_type == 'zip':
            import zipfile
            from io import BytesIO
            from django.conf import settings
            import os

            # Build the ZIP file in memory
            zip_buffer = BytesIO()

            with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
                # Create a Word document per article and add it to the ZIP
                for article in articles:
                    # One folder per article
                    article_folder = f"article_{article.id}_{article.title.replace('/', '_').replace('\\', '_').replace(':', '_').replace('*', '_').replace('?', '_').replace('"', '_').replace('<', '_').replace('>', '_').replace('|', '_')}"

                    # Article data
                    article_data = {
                        'id': article.id,
                        'title': article.title,
                        'website': article.website.name,
                        'url': article.url,
                        'pub_date': article.pub_date.strftime('%Y-%m-%d %H:%M:%S') if article.pub_date else None,
                        'content': article.content,
                        'created_at': article.created_at.strftime('%Y-%m-%d %H:%M:%S'),
                        'media_files': article.media_files
                    }

                    # Save the article as a Word document and add it to the ZIP
                    try:
                        from docx import Document
                        from docx.shared import Inches
                        from io import BytesIO
                        from bs4 import BeautifulSoup
                        import requests

                        # Create the Word document
                        doc = Document()
                        doc.add_heading(article.title, 0)

                        # Article metadata
                        doc.add_paragraph(f"网站: {article.website.name}")
                        doc.add_paragraph(f"URL: {article.url}")
                        doc.add_paragraph(
                            f"发布时间: {article.pub_date.strftime('%Y-%m-%d %H:%M:%S') if article.pub_date else 'N/A'}")
                        doc.add_paragraph(f"创建时间: {article.created_at.strftime('%Y-%m-%d %H:%M:%S')}")

                        # Article content
                        doc.add_heading('内容', level=1)

                        # Parse the HTML content
                        soup = BeautifulSoup(article.content, 'html.parser')

                        # Handle images embedded in the content
                        for img in soup.find_all('img'):
                            src = img.get('src', '')
                            if src:
                                try:
                                    # Build the full image path
                                    if src.startswith('http'):
                                        # Remote image
                                        response = requests.get(src, timeout=10)
                                        image_stream = BytesIO(response.content)
                                        doc.add_picture(image_stream, width=Inches(4.0))
                                    else:
                                        # Local image
                                        full_path = os.path.join(settings.MEDIA_ROOT, src.lstrip('/'))
                                        if os.path.exists(full_path):
                                            doc.add_picture(full_path, width=Inches(4.0))
                                except Exception as e:
                                    # Fall back to writing the image URL as text
                                    doc.add_paragraph(f"[图片: {src}]")

                                # Remove the original img tag
                                img.decompose()

                        content_text = soup.get_text()
                        doc.add_paragraph(content_text)

                        # Media file information
                        if article.media_files:
                            doc.add_heading('媒体文件', level=1)
                            for media_file in article.media_files:
                                try:
                                    full_path = os.path.join(settings.MEDIA_ROOT, media_file)
                                    if os.path.exists(full_path):
                                        # Decide how to handle the file by extension
                                        file_extension = os.path.splitext(media_file)[1].lower()

                                        # Image files
                                        if file_extension in ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff']:
                                            doc.add_picture(full_path, width=Inches(4.0))
                                        # Video files
                                        elif file_extension in ['.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm']:
                                            doc.add_paragraph(f"[视频文件: {media_file}]")
                                        # Any other file type
                                        else:
                                            doc.add_paragraph(f"[文件: {media_file}]")
                                    else:
                                        # URL-style media files
                                        if media_file.startswith('http'):
                                            response = requests.get(media_file, timeout=10)
                                            file_extension = os.path.splitext(media_file)[1].lower()

                                            # Image files
                                            if file_extension in ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff']:
                                                image_stream = BytesIO(response.content)
                                                doc.add_picture(image_stream, width=Inches(4.0))
                                            else:
                                                doc.add_paragraph(f"[文件: {media_file}]")
                                        else:
                                            doc.add_paragraph(media_file)
                                except Exception as e:
                                    doc.add_paragraph(media_file)

                        # Save the Word document to memory
                        doc_buffer = BytesIO()
                        doc.save(doc_buffer)
                        doc_buffer.seek(0)

                        # Add the Word document to the ZIP archive
                        zip_file.writestr(os.path.join(article_folder, f'{article.title.replace("/", "_")}.docx'),
                                          doc_buffer.read())

                    except ImportError:
                        # Fall back to JSON if python-docx is not installed
                        json_data = json.dumps(article_data, ensure_ascii=False, indent=2)
                        zip_file.writestr(os.path.join(article_folder, f'{article.title.replace("/", "_")}.json'),
                                          json_data)

                    # Add the media files to the ZIP archive
                    if article.media_files:
                        for media_file in article.media_files:
                            try:
                                full_path = os.path.join(settings.MEDIA_ROOT, media_file)
                                if os.path.exists(full_path):
                                    # Add the file to the ZIP archive
                                    zip_file.write(full_path, os.path.join(article_folder, 'media', media_file))
                                else:
                                    # URL-style media files
                                    if media_file.startswith('http'):
                                        import requests
                                        response = requests.get(media_file, timeout=10)
                                        zip_file.writestr(
                                            os.path.join(article_folder, 'media', os.path.basename(media_file)),
                                            response.content)
                            except Exception as e:
                                # Skip media files that fail and keep going
                                pass

            # Build the HttpResponse
            zip_buffer.seek(0)
            response = HttpResponse(zip_buffer.getvalue(), content_type='application/zip')
            response['Content-Disposition'] = 'attachment; filename=articles_export.zip'
            return response

        else:
            return HttpResponse('不支持的格式', status=400)

    except Exception as e:
        return HttpResponse(f'导出失败: {str(e)}', status=500)
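One detail worth noting in the ZIP branch: article_folder is built by chaining nine str.replace() calls inside an f-string to strip characters that are illegal in file names. An equivalent, easier-to-audit form (a sketch only, not the committed code; safe_folder_name is an illustrative helper) collapses the same character set with a single regular expression:

    import re

    # Same characters the export view replaces: / \ : * ? " < > |
    _ILLEGAL_FILENAME_CHARS = re.compile(r'[\\/:*?"<>|]')

    def safe_folder_name(article_id, title):
        return f"article_{article_id}_{_ILLEGAL_FILENAME_CHARS.sub('_', title)}"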
@@ -4,12 +4,10 @@ from django.contrib import admin
 from django.urls import path, include
 
 # The custom admin site instances need to be imported
-from core.admin import news_cn_admin, dongfangyancao_admin
+
 
 urlpatterns = [
     path('admin/', admin.site.urls),
-    path('news_cn_admin/', news_cn_admin.urls),
-    path('dongfangyancao_admin/', dongfangyancao_admin.urls),
     # Front-end routes are handled by the core app's urls
     path('', include('core.urls')),
 ]
122 test_crawlers.py
Normal file
@@ -0,0 +1,122 @@
#!/usr/bin/env python
"""
Test script for the crawler commands.
Verifies that all crawler commands work as expected.
"""

import os
import sys
import django
from django.core.management import call_command
from django.test.utils import get_runner
from django.conf import settings

# Set up the Django environment
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'green_classroom.settings')
django.setup()

def test_crawler_commands():
    """Test every crawler command."""

    # All crawler commands
    crawler_commands = [
        'crawl_rmrb',
        'crawl_xinhua',
        'crawl_cctv',
        'crawl_qiushi',
        'crawl_pla',
        'crawl_gmrb',
        'crawl_jjrb',
        'crawl_chinadaily',
        'crawl_grrb',
        'crawl_kjrb',
        'crawl_rmzxb',
        'crawl_zgjwjc',
        'crawl_chinanews',
        'crawl_xxsb',
        'crawl_zgqnb',
        'crawl_zgfnb',
        'crawl_fzrb',
        'crawl_nmrb',
        'crawl_xuexi',
        'crawl_qizhi',
        'crawl_china',
        'crawl_all_media'
    ]

    print("开始测试爬虫命令...")
    print("=" * 50)

    for command in crawler_commands:
        try:
            print(f"测试命令: {command}")
            # Only check that the command exists; do not actually crawl
            # Real test logic can be added here
            print(f"✓ {command} 命令可用")
        except Exception as e:
            print(f"✗ {command} 命令测试失败: {e}")

    print("=" * 50)
    print("爬虫命令测试完成")

def test_export_command():
    """Test the export command."""
    try:
        print("测试导出命令...")
        # Export command test logic can be added here
        print("✓ 导出命令可用")
    except Exception as e:
        print(f"✗ 导出命令测试失败: {e}")

def test_models():
    """Test the data models."""
    try:
        from core.models import Website, Article
        print("测试数据模型...")

        # Create a test website object
        website, created = Website.objects.get_or_create(
            name="测试网站",
            defaults={
                'base_url': 'https://test.com',
                'article_list_url': 'https://test.com',
                'article_selector': 'a'
            }
        )
        print(f"✓ 网站模型测试通过: {website.name}")

        # Clean up the test data
        if created:
            website.delete()

    except Exception as e:
        print(f"✗ 数据模型测试失败: {e}")

def main():
    """Entry point."""
    print("中央主流媒体爬虫系统测试")
    print("=" * 50)

    # Test the data models
    test_models()
    print()

    # Test the crawler commands
    test_crawler_commands()
    print()

    # Test the export command
    test_export_command()
    print()

    print("所有测试完成!")
    print("=" * 50)
    print("使用方法:")
    print("1. 单个媒体爬取: python manage.py crawl_rmrb")
    print("2. 批量爬取: python manage.py crawl_all_media")
    print("3. 导出数据: python manage.py export_articles --format json")
    print("4. 查看帮助: python manage.py help")

if __name__ == '__main__':
    main()
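test_crawler_commands() above only prints that each command is "available" without actually verifying anything. If a real existence check is wanted, Django's command registry can provide it; a small sketch of how the script might be extended (an assumption, not part of this commit):

    from django.core.management import get_commands

    def command_exists(name):
        # get_commands() returns a dict mapping every discovered management
        # command name to the app that provides it.
        return name in get_commands()

    # Inside the loop in test_crawler_commands():
    #     print(f"✓ {command} 命令可用" if command_exists(command) else f"✗ {command} 未找到")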