Improvement of an Anti-Spam System Based on Bayesian Classification (Graduation Thesis)


Thesis title: Improvement of an Anti-Spam System Based on Bayesian Classification

Master's Thesis, Changchun University of Technology

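Declaration of Originality

I solemnly declare that the thesis submitted is the result of research I carried out independently under the guidance of my supervisor. Except for content specifically marked and cited in the text, this thesis contains no work that has been published or written by any other individual or group. All individuals and groups who made important contributions to the research in this thesis have been clearly identified in the text. I am fully aware that I bear the legal consequences of this declaration.

Author's signature:          Date:      (year/month/day)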

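Authorization for Use of Thesis Copyright

The author of this thesis fully understands the university's regulations on retaining and using theses, and agrees that the university may retain copies and electronic versions of the thesis and submit them to the relevant national departments or institutions, and that the thesis may be consulted and borrowed. I authorize ______ University to include all or part of this thesis in relevant databases for retrieval, and to preserve and compile this thesis by photocopying, reduced-size printing, scanning, or other means of reproduction.

Classified theses are handled according to university regulations.

Author's signature:          Date:      (year/month/day)

Supervisor's signature:      Date:      (year/month/day)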


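Abstract (Chinese)

E-mail has become a fast, economical means of modern communication and has greatly facilitated communication and exchange. The emergence of spam, however, disrupts normal e-mail communication, consumes transmission bandwidth, and poses a serious threat to system security. Research on anti-spam has therefore become a global topic of great practical significance.

At present, the main ways of dealing with spam are anti-spam legislation and mail filtering technology, and a variety of filtering techniques have appeared in succession. Commonly used ones include black/white lists, content-based analysis methods, and rule-based methods. Content-based analysis is gradually being adopted in mail filtering and has become a current research hotspot; the typical content-based method is the spam filtering model based on the Bayesian algorithm.

This thesis analyzes and studies the characteristics of Chinese spam fairly systematically and, drawing on Bayesian (Bayes) theory, constructs a spam filtering model based on Bayesian classification. For feature extraction it uses mutual information values; for classification it introduces a classification method suited to this work and adopts a representation better suited to Bayesian computation. Using a standard data set of a large number of Chinese spam and normal mail samples collected and maintained by the China Education and Research Network (CERNET), the author tested the studied method extensively; the accuracy and misjudgment rate reached 95.8% and 5.3%, respectively. The results show that a spam filtering system based on the Bayesian algorithm is effective at blocking spam.

Keywords: e-mail, spam, mail filtering, Bayesian theory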


Abstract

E-mail has become a quick and economical means of modern communication that greatly facilitates communication and exchange. However, the emergence of spam has disrupted normal e-mail correspondence, consumed transmission bandwidth, and posed a serious threat to system security. The study of anti-spam has therefore become a global topic of great practical significance. At present, the main responses to spam are anti-spam legislation and mail filtering technology. A variety of mail filtering technologies have appeared in succession; commonly used ones include black/white lists, content-based analysis methods, and rule-based methods. Content-based analysis is gradually being adopted in mail filtering and has become a hot spot of current research. The typical content-based filtering method is the spam filtering model based on the Bayesian algorithm.

In this thesis, the characteristics of Chinese spam are studied and analyzed systematically. Drawing on Bayesian (Bayes) theory, the thesis constructs a spam filtering model based on Bayesian classification. For feature extraction, mutual information values are used. For classification, a method suited to this work is introduced, and a representation better suited to Bayesian computation is adopted. Using a standard data set of a large number of Chinese spam and normal mail samples collected and maintained by the China Education and Research Network (CERNET), the author tested the methods studied in this thesis extensively. The accuracy and misjudgment rate reached 95.8% and 5.3%, respectively. The results show that a spam filtering system based on the Bayesian algorithm is effective at blocking spam.

Key Words: e-mail, spam, mail filtering, Bayesian theory


Table of Contents

Chapter 1 Introduction
  1.1 Foreword
  1.2 Definition and Harm of Spam
    1.2.1 Definition of Spam
    1.2.2 Harm Caused by Spam
  1.3 Current State of Anti-Spam Efforts in China and Abroad
  1.4 Research Objectives and Content of the Thesis
Chapter 2 Spam Technology
  2.1 Brief Introduction to How E-mail Works
    2.1.1 Overview of E-mail
    2.1.2 E-mail Format
    2.1.3 Mail Delivery Process
    2.1.4 Related Protocols
  2.2 Non-Technical Anti-Spam Measures
  2.3 Common Anti-Spam Techniques
    2.3.1 Client-Side Anti-Spam Filtering Techniques
    2.3.2 Server-Side Anti-Spam Filtering Techniques
Chapter 3 Spam Classification Vectors and Feature Vectors
  3.1 Overview of Spam Classification Vectors
  3.2 Definitions of Spam Classification Vectors and Feature Vectors
  3.3 Classification Methods
    3.3.1 Text Representation Methods
    3.3.2 Keyword Selection
    3.3.3 Feature Extraction
    3.3.4 Introduction to Classification Methods
  3.4 Design of an Algorithm for Identifying Spam Based on Spam Feature Vectors
    3.4.1 Bayes' Theorem
    3.4.2 How a Bayesian Filter Works
    3.4.3 Description of the Algorithm
Chapter 4 Constructing Spam Classification Vectors from a Standard Mail Corpus
  4.1 Standard Mail Corpus
    4.1.1 Background of the Standard Mail Corpus
    4.1.2 Collection of Standard Mail and Normal Mail
    4.1.3 Overview of the Standard Mail Corpus
  4.2 Spam Classification Vectors Based on the Standard Mail Corpus

Source: https://www.bwwdw.com/article/7odv.html
