How can I automatically extract the body text of travel notes on Qunar (去哪儿网) with Python?

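### Scraping the body text of Qunar travel notes with a Python crawler

Several approaches and technology stacks can achieve this. For efficiency and flexibility, the Scrapy framework is recommended for building the crawler[^1].

#### Building the Scrapy project structure

First, create a new Scrapy project:

```bash
scrapy startproject qunar_travel_notes
cd qunar_travel_notes
```

Next, define the Spider class in a new file under the `spiders` folder, for example `travel_note_spider.py`. The spider imports `TravelNoteItem`, so the project's `items.py` must define that item with `title` and `content` fields:

```python
import scrapy

from ..items import TravelNoteItem


class TravelNoteSpider(scrapy.Spider):
    name = "qunar_travel_notes"
    # NOTE: place.qyer.com is Qyer's domain, not Qunar's. Replace the domain,
    # the start URLs and the selectors below with values checked against the
    # pages you are actually targeting.
    allowed_domains = ["place.qyer.com"]

    def start_requests(self):
        urls = [
            'https://place.qyer.com/shanghai/travel-notes/',
            # Add more city or listing-page URLs here...
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Collect the link to every travel note on the listing page.
        note_links = response.css('a.note-item::attr(href)').getall()
        for link in note_links:
            yield response.follow(link, self.parse_note_content)

    def parse_note_content(self, response):
        item = TravelNoteItem()

        # get(default='') avoids an AttributeError when the title node is missing.
        title = response.xpath('//h1[@id="title"]/text()').get(default='').strip()
        content_blocks = response.css('.topicContent p *::text').getall()
        full_text = ''.join(content_blocks).replace('\n', '').strip()

        item['title'] = title
        item['content'] = full_text

        yield item
```

This snippet shows how Scrapy fetches the list of articles under a given URL, then visits each article page to extract the title and the body text.

Another approach uses the Requests library together with BeautifulSoup to parse the HTML of a travel-note detail page and extract the body text[^2]:

```python
import requests
from bs4 import BeautifulSoup


def fetch_travel_note(note_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    }

    # A timeout keeps the call from hanging; raise_for_status surfaces HTTP errors.
    resp = requests.get(note_url, headers=headers, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, features='html.parser')

    # Keep every non-empty paragraph, one per line.
    paragraphs = soup.select('.topicContent p')
    texts = (para.get_text(strip=True) for para in paragraphs)
    return '\n'.join(text for text in texts if text)
```

Both approaches can extract the body text of travel notes, provided the URLs and selectors match the target pages. Which one to choose depends on personal preference and the project's requirements. During development, follow the site's terms of service and keep the request rate reasonable so the crawler does not overload the server.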
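The advice about request frequency can be made concrete with a small throttling helper. The following is only a sketch using the standard library: the name `fetch_politely`, its parameters, and the 2-second default delay are illustrative assumptions rather than anything the answer or the site prescribes. The `fetch` argument is any function that takes a URL and returns text, such as the `fetch_travel_note` function from the answer.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,  # assumed default; tune it for the target site
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        yield url, fetch(url)


# Hypothetical usage with the Requests-based function from the answer:
# for url, text in fetch_politely(note_urls, fetch_travel_note):
#     print(url, len(text))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. In the Scrapy variant, the same goal is usually reached through project settings such as `DOWNLOAD_DELAY` or the AutoThrottle extension instead of a hand-written loop.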

Authorship statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
