Python做回归分析要经历哪些关键步骤？从数据准备到模型诊断全流程是怎样的？

# Python中回归分析的逐步实现实验教程本教程将带你从零开始，使用Python完成一个完整的回归分析实验。我们将以“广告投入与销售额关系预测”为案例，涵盖数据准备、模型建立、结果解读和诊断检验的全流程。 ## 一、实验准备与环境配置 ### 1.1 安装必要库首先确保已安装以下Python库。在命令行中执行： ```bash pip install pandas numpy matplotlib seaborn statsmodels scikit-learn ``` ### 1.2 导入所需模块创建Python脚本文件（如`regression_analysis.py`），导入必要的库： ```python # 导入数据分析与可视化库 import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # 导入统计建模库 import statsmodels.api as sm from statsmodels.stats.outliers_influence import variance_inflation_factor from statsmodels.graphics.gofplots import qqplot # 导入机器学习库（用于模型比较） from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error, r2_score # 设置中文字体和图表样式 plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS', 'DejaVu Sans'] plt.rcParams['axes.unicode_minus'] = False sns.set_style("whitegrid") ``` ## 二、数据准备与探索性分析 ### 2.1 创建模拟数据集我们将创建一个包含多个广告渠道投入与销售额关系的模拟数据集： ```python # 设置随机种子确保结果可复现 np.random.seed(42) # 生成模拟数据 n_samples = 200 # 样本数量 # 自变量：三种广告渠道的投入（单位：千元） TV_ads = np.random.normal(150, 50, n_samples) # 电视广告 Radio_ads = np.random.normal(30, 10, n_samples) # 广播广告 Newspaper_ads = np.random.normal(40, 15, n_samples) # 报纸广告 Online_ads = np.random.normal(80, 25, n_samples) # 线上广告 # 因变量：销售额（单位：万元） # 添加线性关系和一些随机噪声 Sales = ( 2.5 * TV_ads + 1.8 * Radio_ads + 0.5 * Newspaper_ads + 1.2 * Online_ads + np.random.normal(0, 20, n_samples) + 50 # 基础销售额50万 ) # 创建DataFrame data = pd.DataFrame({ 'TV': TV_ads, 'Radio': Radio_ads, 'Newspaper': Newspaper_ads, 'Online': Online_ads, 'Sales': Sales }) print("数据集前5行预览：") print(data.head()) print(f"\n数据集形状：{data.shape}") print("\n数据基本统计信息：") print(data.describe()) ``` ### 2.2 数据可视化探索在建模前，先通过可视化了解数据特征： ```python # 创建2x2的子图布局 fig, axes = plt.subplots(2, 2, figsize=(12, 10)) # 1. 各变量与销售额的散点图 axes[0, 0].scatter(data['TV'], data['Sales'], alpha=0.6, color='blue') axes[0, 0].set_xlabel('电视广告投入（千元）') axes[0, 0].set_ylabel('销售额（万元）') axes[0, 0].set_title('电视广告与销售额关系') axes[0, 1].scatter(data['Radio'], data['Sales'], alpha=0.6, color='green') axes[0, 1].set_xlabel('广播广告投入（千元）') axes[0, 1].set_ylabel('销售额（万元）') axes[0, 1].set_title('广播广告与销售额关系') axes[1, 0].scatter(data['Newspaper'], data['Sales'], alpha=0.6, color='red') axes[1, 0].set_xlabel('报纸广告投入（千元）') axes[1, 0].set_ylabel('销售额（万元）') axes[1, 0].set_title('报纸广告与销售额关系') axes[1, 1].scatter(data['Online'], data['Sales'], alpha=0.6, color='purple') axes[1, 1].set_xlabel('线上广告投入（千元）') axes[1, 1].set_ylabel('销售额（万元）') axes[1, 1].set_title('线上广告与销售额关系') plt.tight_layout() plt.savefig('scatter_plots.png', dpi=300, bbox_inches='tight') plt.show() # 2. 相关性热力图 plt.figure(figsize=(10, 8)) correlation_matrix = data.corr() sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8}) plt.title('变量间相关性热力图') plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight') plt.show() ``` ## 三、回归模型建立与分析 ### 3.1 简单线性回归（单一变量）我们先从最简单的单变量模型开始，分析电视广告对销售额的影响： ```python # 使用statsmodels进行简单线性回归 X_simple = data['TV'] # 自变量 y = data['Sales'] # 因变量 # 添加常数项（截距） X_simple_with_const = sm.add_constant(X_simple) # 建立OLS（普通最小二乘法）模型 simple_model = sm.OLS(y, X_simple_with_const).fit() print("="*60) print("简单线性回归分析结果（电视广告 vs 销售额）") print("="*60) print(simple_model.summary()) # 可视化回归线 plt.figure(figsize=(10, 6)) plt.scatter(data['TV'], data['Sales'], alpha=0.6, label='实际数据') plt.plot(data['TV'], simple_model.predict(X_simple_with_const), color='red', linewidth=2, label='回归线') plt.xlabel('电视广告投入（千元）') plt.ylabel('销售额（万元）') plt.title('简单线性回归：电视广告对销售额的影响') plt.legend() plt.grid(True, alpha=0.3) plt.savefig('simple_regression.png', dpi=300, bbox_inches='tight') plt.show() ``` ### 3.2 多元线性回归（所有变量）现在建立包含所有广告渠道的多元回归模型： ```python # 准备多元回归数据 X_multi = data[['TV', 'Radio', 'Newspaper', 'Online']] X_multi_with_const = sm.add_constant(X_multi) # 添加常数项 # 建立多元线性回归模型 multi_model = sm.OLS(y, X_multi_with_const).fit() print("\n" + "="*60) print("多元线性回归分析结果（所有广告渠道）") print("="*60) print(multi_model.summary()) # 提取关键结果 print("\n" + "="*60) print("模型关键指标解读") print("="*60) print(f"1. R-squared（决定系数）: {multi_model.rsquared:.4f}") print(f" 说明模型能解释{multi_model.rsquared*100:.1f}%的销售额变异") print(f"2. 调整R-squared: {multi_model.rsquared_adj:.4f}") print(f"3. F-statistic: {multi_model.fvalue:.2f}") print(f"4. F-statistic p-value: {multi_model.f_pvalue:.4e}") print(f"5. 残差标准差（Residual Std Error）: {np.sqrt(multi_model.mse_resid):.2f}") ``` ### 3.3 回归系数解读与预测 ```python # 创建回归系数表格 coeff_table = pd.DataFrame({ '变量': ['常数项'] + list(X_multi.columns), '系数': multi_model.params.values, '标准误': multi_model.bse.values, 't值': multi_model.tvalues.values, 'P>|t|': multi_model.pvalues.values, '95%置信区间下限': multi_model.conf_int()[0].values, '95%置信区间上限': multi_model.conf_int()[1].values }) print("\n回归系数详细分析：") print(coeff_table.to_string(index=False)) # 使用模型进行预测示例 print("\n" + "="*60) print("预测示例") print("="*60) # 创建新的广告投入方案 new_ads = pd.DataFrame({ 'TV': [200, 180, 220], 'Radio': [40, 35, 45], 'Newspaper': [50, 45, 55], 'Online': [100, 90, 110] }) # 添加常数项并预测 new_ads_with_const = sm.add_constant(new_ads, has_constant='add') predictions = multi_model.predict(new_ads_with_const) for i, pred in enumerate(predictions): print(f"方案{i+1}: TV={new_ads.iloc[i,0]}, Radio={new_ads.iloc[i,1]}, " f"Newspaper={new_ads.iloc[i,2]}, Online={new_ads.iloc[i,3]}") print(f" 预测销售额: {pred:.2f} 万元") ``` ## 四、模型诊断与检验 ### 4.1 残差分析回归模型的假设检验至关重要，我们需要验证模型是否满足线性回归的基本假设： ```python # 获取残差 residuals = multi_model.resid fitted_values = multi_model.fittedvalues # 创建诊断图（2x2布局） fig, axes = plt.subplots(2, 2, figsize=(12, 10)) # 1. 残差 vs 拟合值图（检查同方差性） axes[0, 0].scatter(fitted_values, residuals, alpha=0.6) axes[0, 0].axhline(y=0, color='red', linestyle='--') axes[0, 0].set_xlabel('拟合值') axes[0, 0].set_ylabel('残差') axes[0, 0].set_title('残差 vs 拟合值图（检查同方差性）') # 2. Q-Q图（检查正态性） qqplot(residuals, line='45', ax=axes[0, 1]) axes[0, 1].set_title('Q-Q图（检查残差正态性）') # 3. 残差直方图 axes[1, 0].hist(residuals, bins=20, edgecolor='black', alpha=0.7) axes[1, 0].set_xlabel('残差') axes[1, 0].set_ylabel('频数') axes[1, 0].set_title('残差分布直方图') # 4. 标准化残差 vs 杠杆值图 from statsmodels.graphics.regressionplots import plot_leverage_resid2 plot_leverage_resid2(multi_model, ax=axes[1, 1]) axes[1, 1].set_title('杠杆值 vs 标准化残差平方') plt.tight_layout() plt.savefig('regression_diagnostics.png', dpi=300, bbox_inches='tight') plt.show() ``` ### 4.2 多重共线性检验多重共线性会影响回归系数的稳定性，我们需要检查自变量间的相关性： ```python # 计算方差膨胀因子（VIF） vif_data = pd.DataFrame() vif_data["变量"] = X_multi.columns vif_data["VIF"] = [variance_inflation_factor(X_multi.values, i) for i in range(X_multi.shape[1])] print("\n" + "="*60) print("多重共线性检验（方差膨胀因子VIF）") print("="*60) print(vif_data.to_string(index=False)) print("\nVIF解读指南：") print(" VIF < 5: 多重共线性可接受") print(" 5 ≤ VIF < 10: 存在中等程度多重共线性") print(" VIF ≥ 10: 存在严重多重共线性，需处理") ``` ### 4.3 异常值检测 ```python # 使用Cook距离检测有影响力的观测点 influence = multi_model.get_influence() cooks_d = influence.cooks_distance[0] # 识别异常值（Cook距离 > 4/n） n = len(data) threshold = 4 / n outliers = np.where(cooks_d > threshold)[0] print(f"\n检测到 {len(outliers)} 个有影响力的异常观测点（Cook距离 > {threshold:.4f}）:") if len(outliers) > 0: outlier_data = data.iloc[outliers].copy() outlier_data['Cook_Distance'] = cooks_d[outliers] print(outlier_data[['TV', 'Radio', 'Newspaper', 'Online', 'Sales', 'Cook_Distance']]) ``` ## 五、模型优化与比较 ### 5.1 逐步回归（变量选择）当变量较多时，可以使用逐步回归选择最重要的变量： ```python # 前向逐步回归实现 def forward_selection(X, y, threshold_in=0.05): """前向逐步回归算法""" included = [] while True: changed = False excluded = list(set(X.columns) - set(included)) new_pval = pd.Series(index=excluded) for new_column in excluded: model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit() new_pval[new_column] = model.pvalues[new_column] best_pval = new_pval.min() if best_pval < threshold_in: best_feature = new_pval.idxmin() included.append(best_feature) changed = True if not changed: break return included # 执行前向逐步回归 selected_features = forward_selection(X_multi, y) print(f"\n逐步回归选出的重要变量: {selected_features}") # 使用选出的变量重新建模 X_selected = X_multi[selected_features] X_selected_with_const = sm.add_constant(X_selected) selected_model = sm.OLS(y, X_selected_with_const).fit() print("\n" + "="*60) print("逐步回归优化后的模型") print("="*60) print(selected_model.summary()) ``` ### 5.2 模型性能评估 ```python # 划分训练集和测试集（70%训练，30%测试） X_train, X_test, y_train, y_test = train_test_split( X_multi, y, test_size=0.3, random_state=42 ) # 训练模型 X_train_const = sm.add_constant(X_train) final_model = sm.OLS(y_train, X_train_const).fit() # 在测试集上预测 X_test_const = sm.add_constant(X_test, has_constant='add') y_pred = final_model.predict(X_test_const) # 计算性能指标 mse = mean_squared_error(y_test, y_pred) rmse = np.sqrt(mse) r2 = r2_score(y_test, y_pred) print("\n" + "="*60) print("模型在测试集上的性能评估") print("="*60) print(f"均方误差 (MSE): {mse:.2f}") print(f"均方根误差 (RMSE): {rmse:.2f}") print(f"决定系数 (R²): {r2:.4f}") print(f"调整R²: {final_model.rsquared_adj:.4f}") # 可视化预测 vs 实际值 plt.figure(figsize=(10, 6)) plt.scatter(y_test, y_pred, alpha=0.6, color='blue') plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2, label='完美预测线') plt.xlabel('实际销售额（万元）') plt.ylabel('预测销售额（万元）') plt.title('预测值 vs 实际值对比') plt.legend() plt.grid(True, alpha=0.3) plt.savefig('prediction_vs_actual.png', dpi=300, bbox_inches='tight') plt.show() ``` ## 六、完整实验报告生成 ```python # 生成实验报告摘要 def generate_experiment_report(model, X, y): """生成回归实验报告摘要""" report = f""" {'='*80} 回归分析实验报告摘要 {'='*80} 一、模型基本信息 1. 样本数量: {len(y)} 2. 自变量数量: {X.shape[1]} 3. 因变量: 销售额（万元）二、模型拟合优度 1. 决定系数 (R²): {model.rsquared:.4f} - 解释: 模型能解释{model.rsquared*100:.1f}%的销售额变异 2. 调整R²: {model.rsquared_adj:.4f} 3. F统计量: {model.fvalue:.2f} (p-value: {model.f_pvalue:.4e}) 三、回归系数显著性分析 """ # 添加系数表格 coeff_df = pd.DataFrame({ '变量': ['常数'] + list(X.columns), '系数': model.params.values, '标准误': model.bse.values, 't值': model.tvalues.values, 'P值': model.pvalues.values, '是否显著': ['是' if p < 0.05 else '否' for p in model.pvalues] }) report += coeff_df.to_string(index=False) report += f""" 四、模型诊断结果 1. Durbin-Watson统计量: {sm.stats.stattools.durbin_watson(model.resid):.3f} - 解读:

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考

下一篇 Python里做回归分析要经历哪些关键步骤？从数据准备到模型优化怎么一步步来？

目录

Python做回归分析要经历哪些关键步骤？从数据准备到模型诊断全流程是怎样的？

Python内容推荐

python逻辑回归模型.zip

Python数据分析与机器学习-Python实现逻辑回归与梯度下降策略

Python回归分析实战

医疗数据诊断ANN模型,医疗数据分析,Python

python数据分析 医学数据分析 乳腺癌数据分析及自动诊断

ARIMA模型+测试数据生成+调整和预测+python代码

基于ARIMA模型进行时间序列数据分析与预测的Python实现项目_该项目专注于利用自回归积分滑动平均模型对单变量时间序列数据进行建模参数估计模型诊断与未来趋势预测核心内容包.zip

Python多元线性回归分析[代码]

Python多元线性回归全流程[项目代码]

基于Python和Pandas的数据分析入门示例项目-包含数据清洗-可视化-机器学习基础-统计分析-时间序列分析-特征工程-数据预处理-异常值处理-相关性分析-回归预测-分类模型-.zip

基于 Python 的机器学习算法、数据可视化与数据分析代码

Python线性回归实现[代码]

逻辑回归模型Python实现[代码]

统计分析指标python实现

Logistic回归案例（数据集+python代码）.zip

时间序列分析利器：ARIMA模型详解与Python实战

机器学习基于逻辑回归的分类模型解析：原理推导、Python实现与欺诈检测、疾病诊断、广告点击率预测应用

基于Python机器学习的乳腺癌预测模型.zip

ARIMA模型时间序列数据分析python代码

基于Python实现的时间序列分析与预测入门实践项目集_包含数据预处理平稳性检验自回归移动平均模型ARIMA季节性分解指数平滑法状态空间模型Prophet库应用单变量多变量预测评估.zip

学生成绩管理系统C++课程设计与实践

别再手动拖拽了！用Lumerical脚本批量创建FDTD仿真结构（附完整代码）

Java邮件解析任务中，如何安全高效地提取HTML邮件内容并避免硬编码、资源泄漏和类型转换异常？

RH公司应收账款管理优化策略研究

新手别慌！用BingPi-M2开发板带你5分钟搞懂Tina Linux SDK目录结构

Java线程池运行时状态怎么实时掌握？有哪些靠谱的监控手段？

桌面工具软件项目效益评估及市场预测分析

告别遮挡！UniApp中WebView与原生导航栏的和谐共处方案（附完整可运行代码）

OSPF是怎么在企业网里自动找最优路径并分区域管理的？

UML建模课程设计：图书馆管理系统论文

python数据分析医学数据分析乳腺癌数据分析及自动诊断