Python code for new-word extraction based on mutual information and information entropy

### New-word discovery with mutual information and information entropy: a Python example

The code below implements a new-word discovery method based on mutual information (MI) and information entropy. The method scores each candidate segment by its information entropy and by the mutual information between its component parts.

#### 1. Computing information entropy

Information entropy measures the uncertainty of the character distribution inside a string. For a sequence of consecutive characters $ S = s_1s_2...s_n $, its entropy is defined as:

$$ H(S) = -\sum_{c \in \mathcal{C}(S)}{P(c)\log_2 P(c)} $$

where $ \mathcal{C}(S) $ is the set of distinct characters in $ S $ and $ P(c) $ is the probability of character $ c $ in the text being measured, estimated in the code as its relative frequency in the input string[^1].

```python
import math
from collections import Counter


def calculate_entropy(text):
    """
    Calculate the character-level entropy of a given text.

    Args:
        text (str): Input string to compute entropy for.

    Returns:
        float: Entropy of the input string, in bits.
    """
    counter = Counter(text)
    total_chars = sum(counter.values())
    # Relative frequency of each distinct character in the input string.
    probabilities = {char: count / total_chars for char, count in counter.items()}
    entropy_value = -sum(p * math.log2(p) for p in probabilities.values())
    return entropy_value
```

#### 2. Computing mutual information

Mutual information measures how much more likely two events are to occur together than they would be if they were independent. In text processing, its pointwise form compares the probability that two substrings occur together with the product of their individual probabilities:

$$ I(a;b) = \log_2 \frac{p(a,b)}{p(a)p(b)} $$

Here $ p(a,b) $ is the probability that $a$ and $b$ occur together, and $p(a)$ and $p(b)$ are the probabilities of each occurring on its own[^3].

```python
import math


def mutual_information(word, freq_unigram, freq_bigram):
    """
    Compute the highest pointwise mutual information over all binary splits of a word.

    Args:
        word (str): Candidate word or phrase whose internal structure is analyzed.
        freq_unigram (dict): Maps a text segment to its frequency in the corpus.
        freq_bigram (dict): Maps an adjacent (left, right) segment pair to its frequency.

    Returns:
        float: Highest score (log base 2) among the splits of `word`, or
            float("-inf") if no split has non-zero counts.
    """
    # Normalize by total counts, not by the number of distinct keys.
    total_unigrams = sum(freq_unigram.values())
    total_bigrams = sum(freq_bigram.values())
    mi_scores = []
    # Iterate over every way to split the candidate into a left and a right part.
    for i in range(1, len(word)):
        left_substring = word[:i]
        right_substring = word[i:]
        count_left = freq_unigram.get(left_substring, 0)
        count_right = freq_unigram.get(right_substring, 0)
        count_joint = freq_bigram.get((left_substring, right_substring), 0)
        # Skip splits that were never observed; their score is undefined.
        if not (count_left and count_right and count_joint):
            continue
        prob_left = count_left / total_unigrams
        prob_right = count_right / total_unigrams
        joint_prob = count_joint / total_bigrams
        score = math.log2(joint_prob / (prob_left * prob_right))
        mi_scores.append((score, f"{left_substring}|{right_substring}"))
    if not mi_scores:
        return float("-inf")
    return max(mi_scores, key=lambda x: x[0])[0]
```

#### 3. Putting the pipeline together

The last step combines the functions above into a complete solution. The framework function below takes the raw document as input and returns the candidate new words together with their confidence scores.

```python
def discover_new_words(corpus, min_length=2, max_length=8, top_k=None):
    """
    Discover candidate new words by combining entropy and mutual information.

    Args:
        corpus (str): Raw text in which potential new words occur.
        min_length (int): Minimum candidate length in tokens (default 2).
        max_length (int): Maximum candidate length in tokens (default 8).
        top_k (int | None): Number of highest-ranked candidates to return;
            None means no limit.

    Returns:
        list[tuple[str, float]]: (candidate, score) pairs, highest score first.
    """
    # tokenize_and_clean and build_frequency_distribution are assumed helpers
    # that this article does not define. The first is expected to preprocess the
    # text (removing stopwords and punctuation, case folding, and so on) and
    # return a list of tokens; the second is expected to count segments
    # ('unigram') or adjacent segment pairs ('bigram').
    tokens = tokenize_and_clean(corpus)
    freq_unigram = build_frequency_distribution(tokens, 'unigram')
    freq_bigram = build_frequency_distribution(tokens, 'bigram')

    seen_terms = set()
    candidates = []
    for length in range(min_length, max_length + 1):
        # The + 1 keeps the window that ends at the last token.
        for j in range(len(tokens) - length + 1):
            joined_token = "".join(tokens[j:j + length])
            if joined_token in seen_terms:
                continue
            seen_terms.add(joined_token)
            entropy_score = calculate_entropy(joined_token)
            mi_val = mutual_information(joined_token, freq_unigram, freq_bigram)
            composite_metric = (entropy_score + mi_val) / 2
            candidates.append((joined_token, composite_metric))

    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates if top_k is None else candidates[:top_k]


if __name__ == "__main__":
    sample_text = """Your large body of texts goes here."""
    results = discover_new_words(sample_text)
    # Placeholder value; choose the threshold through empirical analysis.
    some_threshold = 1.0
    filtered_results = [r for r in results if r[1] > some_threshold]
    final_output = {t: s for t, s in filtered_results}
    print(final_output)
```
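The framework function relies on two helpers, `tokenize_and_clean` and `build_frequency_distribution`, that the article assumes but never defines. The following is a minimal sketch of what they might look like, not the author's implementation. The character-level tokenization, the kept character ranges, and the extra `max_length` parameter are all assumptions made for illustration: the `'unigram'` table counts every contiguous segment up to `max_length` tokens, and the `'bigram'` table counts every adjacent (left, right) pair of segments.

```python
import re
from collections import Counter


def tokenize_and_clean(corpus):
    """Hypothetical helper: lowercase the text and return one token per
    CJK character, ASCII letter, or digit; everything else is dropped."""
    return re.findall(r"[\u4e00-\u9fffa-z0-9]", corpus.lower())


def build_frequency_distribution(tokens, kind, max_length=8):
    """Hypothetical helper: count segments or adjacent segment pairs.

    kind == 'unigram': maps each contiguous segment (joined into a string)
        of 1..max_length tokens to its number of occurrences.
    kind == 'bigram': maps each (left, right) pair of adjacent segments whose
        combined length is at most max_length to its number of occurrences.
    """
    if kind not in ("unigram", "bigram"):
        raise ValueError(f"unknown kind: {kind!r}")
    counts = Counter()
    n = len(tokens)
    for start in range(n):
        for length in range(1, max_length + 1):
            end = start + length
            if end > n:
                break
            if kind == "unigram":
                counts["".join(tokens[start:end])] += 1
            else:
                # Every way to cut this segment into a left and a right part.
                for cut in range(start + 1, end):
                    left = "".join(tokens[start:cut])
                    right = "".join(tokens[cut:end])
                    counts[(left, right)] += 1
    return counts


if __name__ == "__main__":
    demo_tokens = tokenize_and_clean("AB, ab!")
    print(demo_tokens)
    print(build_frequency_distribution(demo_tokens, "unigram", max_length=2))
    print(build_frequency_distribution(demo_tokens, "bigram", max_length=2))
```

With tables built this way, `freq_unigram["ab"]` is the number of times the segment `ab` occurs and `freq_bigram[("a", "b")]` is the number of times `a` is immediately followed by `b`. Dropping punctuation before counting lets segments span sentence boundaries; splitting the text on punctuation first would avoid that at the cost of a little more code.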

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
