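How do I analyze high-frequency words in English literature with Python: cleaning the text, counting the top 10 words, and visualizing them with a bar chart and a word cloud?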

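### **Counting and Visualizing High-Frequency Words in English Literature with Python**

The core workflow has three steps: **1. data acquisition and preprocessing**, **2. word-frequency counting and sorting**, and **3. visualization**. The detailed steps and code follow.

#### **Step 1: Data Acquisition and Text Preprocessing**

The goal of this step is to obtain clean English text: remove punctuation, digits, and other non-word characters, and convert every word to lowercase so that the counts are accurate.

1. **Import the libraries and read the file**: import the required libraries and read the text file that holds the English literature. The file is assumed to be named `literature.txt`.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the English literature file
file_path = 'literature.txt'
try:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
except FileNotFoundError:
    print(f"File {file_path} not found; please check the path.")
    raise SystemExit(1)
```

2. **Clean and tokenize the text**: use a regular expression to replace every character that is not a letter, whitespace, or a hyphen with a space, then lowercase the text and split it into a word list [ref_4]. Because hyphens are kept, a standalone `-` or `--` would otherwise be counted as a word, so leading and trailing hyphens are stripped and empty tokens are dropped.

```python
# Clean the text: replace non-letter characters with spaces
# (whitespace and hyphens are kept), then lowercase everything
cleaned_text = re.sub(r'[^a-zA-Z\s-]', ' ', text)
tokens = cleaned_text.lower().split()  # split into a list of tokens

# Strip stray leading/trailing hyphens and drop tokens that become empty
words = [token.strip('-') for token in tokens if token.strip('-')]
```

#### **Step 2: Word-Frequency Counting and Sorting**

Python's built-in `collections.Counter` counts word frequencies efficiently and returns the 10 most frequent words [ref_4].

```python
# Count word frequencies with Counter
word_counts = Counter(words)

# Get the 10 most frequent words and their counts
top_10_words = word_counts.most_common(10)

# Print the result
print("Top 10 most frequent words and their counts:")
for word, count in top_10_words:
    print(f"{word}: {count}")
```

#### **Step 3: Visualization**

There are two main ways to visualize the result: a **bar chart** and a **word cloud**. The bar chart shows the exact frequency ranking, while the word cloud conveys each word's weight through its font size [ref_1][ref_2].

1. **Draw the bar chart**

```python
# Prepare the bar-chart data
words_list, counts_list = zip(*top_10_words)  # unzip the list of (word, count) tuples

# Draw the bar chart
plt.figure(figsize=(10, 6))
plt.bar(words_list, counts_list, color='skyblue')
plt.title('Top 10 Most Frequent Words in the Literature')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)  # rotate the x-axis labels 45 degrees to avoid overlap
plt.tight_layout()
plt.show()
```

2. **Draw the word cloud**

A word cloud presents the frequency information more visually: the more frequent a word, the larger it appears [ref_1][ref_3].

```python
# Prepare the word-cloud data: a Counter can be passed to WordCloud directly
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)

# Show the word cloud
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # hide the axes
plt.title('Word Cloud of the Literature', fontsize=16)
plt.show()

# Optional: save the word cloud as an image file
# wordcloud.to_file("wordcloud_output.png")
```

#### **Complete Script**

Putting the steps together gives the complete script below. Save it as a `.py` file and run it.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# ===== Step 1: read and preprocess =====
file_path = 'literature.txt'
try:
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
except FileNotFoundError:
    print(f"File {file_path} not found; please check the path.")
    raise SystemExit(1)

# Clean the text, tokenize, and strip stray hyphens
cleaned_text = re.sub(r'[^a-zA-Z\s-]', ' ', text)
tokens = cleaned_text.lower().split()
words = [token.strip('-') for token in tokens if token.strip('-')]

# ===== Step 2: count and sort word frequencies =====
word_counts = Counter(words)
top_10_words = word_counts.most_common(10)

print("Top 10 most frequent words and their counts:")
for word, count in top_10_words:
    print(f"{word}: {count}")

# ===== Step 3: visualization =====
# 3.1 Bar chart
words_list, counts_list = zip(*top_10_words)
plt.figure(figsize=(10, 6))
plt.bar(words_list, counts_list, color='skyblue')
plt.title('Top 10 Most Frequent Words')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# 3.2 Word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_counts)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of the Literature', fontsize=16)
plt.show()
```

#### **Potential Issues and Improvements**

* **Stop words**: the code above counts every word, including **stop words** such as "the", "a", and "and", which are frequent but carry no analytical meaning. For more insightful results, filter out the stop words after tokenizing [ref_5]. The stop-word list from the `nltk` library can be used for this.

```python
from nltk.corpus import stopwords

# Download the stop-word dataset (required on first run)
# import nltk
# nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]
# Then run the word-frequency count on filtered_words
```

* **Lemmatization**: English words occur in different forms (for example "run", "running", "ran"). For more accurate counts, lemmatization can reduce them to a common base form. This can also be done with the `nltk` library.
* **File encoding**: if reading the file raises an encoding error such as `UnicodeDecodeError`, try replacing `encoding='utf-8'` with `encoding='ISO-8859-1'` or `encoding='cp1252'`, depending on the file's original encoding [ref_1].

With the steps above, you can extract the core vocabulary of an English document and present the analysis through two complementary visualizations.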
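For the file-encoding issue noted above, the manual trial of encodings can be automated. The following is a minimal sketch, not part of the original script: the helper name `read_text_with_fallback` and the order of the candidate encodings are assumptions made for illustration. It reads the raw bytes once and returns the text decoded with the first encoding that succeeds.

```python
from pathlib import Path

# Candidate encodings, tried in order (an assumed list for this sketch).
# latin-1 (ISO-8859-1) maps every byte to a character, so it never raises
# and serves as a last resort, although the decoded text may be wrong.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes the file."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"None of {encodings} could decode {path}")
```

A call such as `text, used_encoding = read_text_with_fallback('literature.txt')` could stand in for the `open(...)` call in step 1. A successful decode does not prove the encoding is the right one, so it is worth printing `used_encoding` and spot-checking a few accented characters in the result.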

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
