fix bug

WEBSITES_FIELD_FIX.md (new file, 188 lines)

@@ -0,0 +1,188 @@
# Websites Field Fix Notes

## Problem Description

You hit the "can only join an iterable" error because the `websites` field on the `CrawlTask` model was changed from a `JSONField` to a `ManyToManyField`, but some call sites were not updated to match.
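The failure mode can be reproduced without Django. In this sketch, `FakeRelatedManager` is a hypothetical stand-in for Django's `ManyRelatedManager`: passing the manager object itself to `str.join()` raises the error, while iterating the result of `.all()` works.

```python
# Sketch: a ManyToManyField is accessed through a manager object, which is
# not itself iterable, so the old JSONField-style code breaks.

class FakeRelatedManager:
    """Hypothetical stand-in for Django's ManyRelatedManager."""
    def all(self):
        return ["新华网", "人民日报"]

manager = FakeRelatedManager()

try:
    ", ".join(manager)             # old code path: manager is not iterable
except TypeError as e:
    print(e)                       # "can only join an iterable"

print(", ".join(manager.all()))    # fixed code path: iterate via .all()
```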
## Fixes Applied

### 1. Task executor fix

In `core/task_executor.py`, the following methods have been fixed:

- `_execute_keyword_task()` - runs keyword-search tasks
- `_execute_historical_task()` - runs historical-article tasks
- `_execute_full_site_task()` - runs full-site crawl tasks

**Before:**

```python
websites = task.websites if task.websites else list(WEBSITE_SEARCH_CONFIGS.keys())
```

**After:**

```python
selected_websites = task.websites.all()
if selected_websites:
    websites = [w.name for w in selected_websites]
else:
    websites = list(WEBSITE_SEARCH_CONFIGS.keys())
```
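The selection-plus-fallback logic can be exercised in isolation. This sketch mirrors it with `SimpleNamespace` standing in for `Website` objects and an illustrative `WEBSITE_SEARCH_CONFIGS` mapping (both are assumptions, not the real module):

```python
from types import SimpleNamespace

# Illustrative stand-in for the real search-config mapping
WEBSITE_SEARCH_CONFIGS = {"新华网": {}, "人民日报": {}, "央视网": {}}

def resolve_websites(selected_websites):
    """Return the selected site names, or every configured site if none chosen."""
    if selected_websites:
        return [w.name for w in selected_websites]
    return list(WEBSITE_SEARCH_CONFIGS.keys())

print(resolve_websites([SimpleNamespace(name="新华网")]))  # only the selected site
print(resolve_websites([]))                                # falls back to all configured sites
```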
### 2. Model method fix

In `core/models.py`, the `get_websites_display()` method now handles the `ManyToManyField` correctly:

```python
def get_websites_display(self):
    """Return the display text for the website list."""
    try:
        websites = self.websites.all()
        if not websites:
            return "所有网站"
        # Make sure the names are strings that join() can handle
        website_names = [str(w.name) for w in websites if w.name]
        return ", ".join(website_names) if website_names else "所有网站"
    except Exception:
        # Fall back to the default on any error
        return "所有网站"
```
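The defensive join at the heart of that method can be checked on its own. `display_names` below is a hypothetical helper mirroring the core logic (the default label "所有网站" means "all websites"):

```python
def display_names(names, default="所有网站"):
    """Join non-empty names into display text, falling back to a default label."""
    cleaned = [str(n) for n in names if n]
    return ", ".join(cleaned) if cleaned else default

print(display_names(["新华网", None, "人民日报"]))  # the None entry is skipped
print(display_names([]))                            # falls back to the default label
```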
### 3. Admin interface fix

In `core/admin_extended.py`, the task-creation logic has been fixed:

- Use `Website.objects.filter(name__in=websites)` to fetch the website objects
- Use `task.websites.set(website_objects)` to set the relation

## Test Verification

The fix has been verified with the following tests:

1. **Websites field test** - basic `ManyToManyField` operations
2. **Task executor test** - the website-resolution logic in the task executor
3. **Web UI test** - the full web-interface workflow
4. **System check** - the Django system check reports no errors
## Usage

### 1. Make sure the database migrations are applied

```bash
python manage.py migrate
```

### 2. Create website data

If no website records exist yet, you can create some like this:

```python
from core.models import Website

# Create a few test websites
websites_to_create = ["新华网", "人民日报", "央视网"]
for name in websites_to_create:
    website, created = Website.objects.get_or_create(
        name=name,
        defaults={
            'base_url': f'http://{name}.com',
            'enabled': True
        }
    )
```

### 3. Create tasks through the web interface

1. Start the server: `python manage.py runserver`
2. Open the admin site: `http://localhost:8000/admin/`
3. Click "Quick-create crawl task" on the home page
4. Choose the task type and fill in the details
5. Select the target websites
6. Create and start the task

### 4. Create tasks from the command line

```bash
# Keyword search
python manage.py crawl_by_keyword --keyword "人工智能"

# Historical-article crawl
python manage.py crawl_by_keyword --keyword "新闻" --historical

# One-shot crawl across multiple websites
python manage.py crawl_all_websites --mode both --keyword "人工智能"
```
## Troubleshooting

### If you still see the "can only join an iterable" error

1. **Check the database migrations**:

   ```bash
   python manage.py showmigrations core
   python manage.py migrate
   ```

2. **Check the website data**:

   ```python
   from core.models import Website
   print(Website.objects.filter(enabled=True).count())
   ```

3. **Check the task data**:

   ```python
   from core.models import CrawlTask
   task = CrawlTask.objects.first()
   if task:
       print(f"Task websites: {task.get_websites_display()}")
   ```

4. **Restart the server**:

   ```bash
   # Stop the current server (Ctrl+C)
   python manage.py runserver
   ```

### If you hit other errors

1. **Check the logs**: review the Django log output
2. **Check network connectivity**: make sure the target websites are reachable
3. **Check the website configuration**: make sure the entries in `WEBSITE_SEARCH_CONFIGS` are correct
## Technical Details

### ManyToManyField vs JSONField

**Previous JSONField approach:**

```python
websites = models.JSONField(default=list, verbose_name="目标网站")
# Usage: task.websites = ["新华网", "人民日报"]
```

**Current ManyToManyField approach:**

```python
websites = models.ManyToManyField(Website, blank=True, verbose_name="目标网站")
# Usage: task.websites.set(website_objects)
```

### Advantages

1. **Data integrity**: foreign-key relations keep the data consistent
2. **Query efficiency**: queries can use database indexes
3. **Relation management**: Django handles creating and deleting the many-to-many links automatically
4. **Validation**: related websites are automatically checked to exist
## Summary

With the fix in place you can now:

1. Create and manage crawl tasks through the web interface
2. Crawl a specific selection of websites
3. Monitor task progress in real time
4. View detailed crawl results

If you still run into problems, check that:

- the database migrations have been applied
- the website records exist
- the server has been restarted

The system should now work correctly!
@@ -114,7 +114,12 @@ class TaskExecutor:
         task.save()

         # 准备参数
-        websites = task.websites if task.websites else list(WEBSITE_SEARCH_CONFIGS.keys())
+        selected_websites = task.websites.all()
+        if selected_websites:
+            websites = [w.name for w in selected_websites]
+        else:
+            websites = list(WEBSITE_SEARCH_CONFIGS.keys())

         start_date = task.start_date.strftime('%Y-%m-%d') if task.start_date else None
         end_date = task.end_date.strftime('%Y-%m-%d') if task.end_date else None
@@ -142,7 +147,12 @@ class TaskExecutor:
         task.save()

         # 准备参数
-        websites = task.websites if task.websites else list(WEBSITE_SEARCH_CONFIGS.keys())
+        selected_websites = task.websites.all()
+        if selected_websites:
+            websites = [w.name for w in selected_websites]
+        else:
+            websites = list(WEBSITE_SEARCH_CONFIGS.keys())

         start_date = task.start_date.strftime('%Y-%m-%d') if task.start_date else None
         end_date = task.end_date.strftime('%Y-%m-%d') if task.end_date else None
@@ -168,7 +178,11 @@ class TaskExecutor:
         task.save()

         # 准备参数
-        websites = task.websites if task.websites else list(WEBSITE_SEARCH_CONFIGS.keys())
+        selected_websites = task.websites.all()
+        if selected_websites:
+            websites = [w.name for w in selected_websites]
+        else:
+            websites = list(WEBSITE_SEARCH_CONFIGS.keys())

         total_websites = len(websites)
         completed_websites = 0