NLPIR-ICTCLAS 2014 Word Segmentation System Developer Manual


NLPIR/ICTCLAS 2014 Word Segmentation System Developer Documentation http://ICTCLAS.nlpir.org

NLPIR/ICTCLAS 2014 Word Segmentation System Developer Documentation

http://ICTCLAS.nlpir.org/ @ICTCLAS张华平博士

2013-12

NLPIR Copyright © 2014 Kevin Zhang. All rights reserved.

For the latest information about NLPIR, please visit http://ICTCLAS.nlpir.org/. At http://ictclas.nlpir.org/ (the Natural Language Processing and Information Retrieval sharing platform) you can obtain the latest version of the NLPIR system. You are also welcome to follow Dr. Zhang Huaping on Sina Weibo at @ICTCLAS张华平博士.

Document Information

Document ID | NLPIR-ICTCLAS-2013-WHITEPAPER
Security level | Public
Author | 张华平 (Zhang Huaping)
Publisher | /
Version | V4.0
Status | Creation and first draft for comment
Date | Dec 19, 2013
Approved by |

Version History

Note: The first version is "v0.1". Each subsequent version adds 0.1 to the existing version. The version number should be updated only when there are significant changes, for example changes made to reflect reviews. The first digit of the version number denotes the current review status: 1.x means the review process has passed round 1, and so on. Anyone who creates, reviews, or modifies the document should describe the action taken.

Version | Author/Reviewer | Date | Description
V1.0 | Kevin Zhang | 2011-8-21 | First complete draft for comment. ICTCLAS2010
V2.0 | Kevin Zhang | 2012-8-21 | Complete draft for comment. ICTCLAS2012
V3.0 | Kevin Zhang | 2012-12-19 | Complete draft for comment. ICTCLAS2013
V4.0 | Kevin Zhang | 2013-12-19 | Complete draft for comment. ICTCLAS2014

Contents

NLPIR/ICTCLAS 2014 Word Segmentation System Developer Documentation
Contents
1. Introduction to the NLPIR/ICTCLAS 2014 Word Segmentation System
2. Main Functions of the NLPIR/ICTCLAS 2014 Word Segmentation System
3. Evaluation of the NLPIR/ICTCLAS 2014 Word Segmentation System
    3.1 NLPIR/ICTCLAS Results in the 973 Evaluation
    3.2 Results of the First International Chinese Word Segmentation Bakeoff
    3.3 NLPIR/ICTCLAS Evaluation Results
4. NLPIR/ICTCLAS Milestones
5. C/C++ Interface
    5.1 NLPIR_Init
    5.2 NLPIR_Exit
    5.3 NLPIR_ImportUserDict
    5.4 NLPIR_ParagraphProcess
    5.5 NLPIR_ParagraphProcessA
    5.6 NLPIR_FileProcess
    5.7 NLPIR_GetParagraphProcessAWordCount
    5.8 NLPIR_ParagraphProcessAW
    5.9 NLPIR_AddUserWord
    5.10 NLPIR_SaveTheUsrDic
    5.11 NLPIR_DelUsrWord
    5.12 NLPIR_GetKeyWords
    5.13 NLPIR_GetFileKeyWords
    5.14 NLPIR_GetNewWords
    5.15 NLPIR_GetFileNewWords
    5.16 NLPIR_FingerPrint
    5.17 NLPIR_SetPOSmap
    5.17 Batch Processing for New Word Identification
6. JNA Interface
    6.1 Introduction to Word Segmentation with JNA
    6.2 Example of Word Segmentation with JNA
7. Word Segmentation on the Hadoop Platform
    7.1 Introduction to Word Segmentation with Hadoop
    7.2 Example of Word Segmentation with Hadoop
8. C# Interface
    7.1 Description
    7.2 Interface Example
9 NLPIR2011 Runtime Environment
9 Frequently Asked Questions (FAQ)
    Q1: Linking against the library fails when calling NLPIR on Linux
    Q2: NLPIR initialization keeps failing
    Q3: Does NLPIR support multithreading? There is no explicit interface for creating and destroying a segmenter object (handle, context), so multithreading and multiple instances are not supported
    Q4: No interface was found for choosing coarse or fine granularity
    Q5: Consecutive whitespace characters are output one by one; an option to merge them is desired
    Q6: Support for GB18030 and UTF-8 segmentation at the same time within one application
    Q7: How the NLPIR2010 JNI call is implemented
10 About the Author

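1. Introduction to the NLPIR/ICTCLAS 2014 Word Segmentation System

Lexical analysis is the foundation of, and the key to, natural language processing. Building on years of research, Dr. Zhang Huaping developed the NLPIR word segmentation system. Its main functions are Chinese word segmentation, English word segmentation, part-of-speech (POS) tagging, named entity recognition, new word identification, and keyword extraction, with support for user-defined domain dictionaries and microblog (Weibo) analysis. NLPIR supports multiple encodings (GBK, UTF-8, BIG5), multiple operating systems (Windows, Linux, FreeBSD, and other mainstream systems), and multiple development languages and platforms (including C/C++/C#, Java, Python, and Hadoop).

The predecessor of the NLPIR word segmentation system is the ICTCLAS lexical analysis system released in 2000. Starting in 2009 it was renamed the NLPIR word segmentation system, both to mark a clear break from the earlier work and to promote the NLPIR Natural Language Processing and Information Retrieval sharing platform. Dr. Zhang Huaping has worked on it for more than ten years and upgraded the kernel more than ten times. It won first prize in the 2010 Qian Weichang Chinese Information Processing Science and Technology Award, first place overall in the 2003 international SIGHAN word segmentation bakeoff, and first place overall in the 2002 national 973 evaluation. It has more than 300,000 users worldwide, including companies such as China Mobile, Huawei, 中搜, 3721, NEC, 中华商务网, 硅谷动力, and Yunnan Daily, and institutions such as Tsinghua University, Xinjiang University, South China University of Technology, and the University of Massachusetts. ICTCLAS has also been widely covered by media such as Science Times (《科学时报》), the overseas edition of People's Daily, and Science and Technology Daily. You can search Google to learn more about how ICTCLAS is used.

We provide a range of interfaces for secondary development and especially welcome researchers and engineers to use them. We are committed to a sharing policy under which non-commercial use is free permanently. At http://ictclas.nlpir.org/ (the Natural Language Processing and Information Retrieval sharing platform) you can obtain the latest version of the NLPIR system. You are also welcome to follow Dr. Zhang Huaping on Sina Weibo at @ICTCLAS张华平博士.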

Figure 1: NLPIR/ICTCLAS won first prize in the Qian Weichang Chinese Information Processing Science and Technology Award

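2. Main Functions of the NLPIR/ICTCLAS 2014 Word Segmentation System

1) Mixed Chinese-English word segmentation

Automatically segments and POS-tags Chinese and English text. This covers Chinese word segmentation, English word segmentation, POS tagging, unknown word recognition, and user dictionaries, as shown in the figure.

Figure 2: Mixed Chinese-English word segmentation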

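2) Keyword extraction

Keywords, including both new words and known words, are computed automatically with a cross-entropy (交叉信息熵) algorithm. Below is the keyword extraction result for part of the report of the Third Plenary Session of the 18th CPC Central Committee.

Figure 3: Keyword extraction result for the report of the Third Plenary Session of the 18th CPC Central Committee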

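3) New word identification and adaptive segmentation

From longer text, new characteristic expressions are discovered automatically based on cross-entropy, and the language probability distribution model is adapted to the test corpus, which yields adaptive segmentation.

Figure 4: New words such as "屌丝" are identified automatically and the segmentation result is adjusted accordingly, yielding adaptive segmentation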

4) User-defined domain dictionary

User dictionary entries can be imported one at a time or in batches. For example, one can define "举报信 敏感点", where 举报信 is the user word and 敏感点 is a user-defined POS tag.

Figure 5: The user-defined word "举报信" is recognized and assigned the user-defined POS tag "敏感点"

5) Microblog (Weibo) segmentation

Blogger IDs are tagged as nr, forwarded conversations are automatically split and tagged (as ssession), and URLs and email addresses are automatically tagged.

Figure 6: Microblog segmentation example

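3. Evaluation of the NLPIR/ICTCLAS 2014 Word Segmentation System

3.1 NLPIR/ICTCLAS Results in the 973 Evaluation

On July 6, 2002, NLPIR/ICTCLAS took part in the open evaluation of the second phase of the national 973 English-Chinese machine translation project. The results are as follows:

Domain | Words | SEG | TAG1 | RTAG
Sports | 33,348 | 97.01% | 86.77% | 89.31%
International | 59,683 | 97.51% | 88.55% | 90.78%
Arts | 20,524 | 96.40% | 87.47% | 90.59%
Legal | 14,668 | 98.44% | 85.26% | 86.59%
Theory | 55,225 | 98.12% | 87.29% | 88.91%
Economy | 24,765 | 97.80% | 86.25% | 88.16%
Total | 208,213 | 97.58% | 87.32% | 89.42%

Table 3. ICTCLAS test results in the 973 evaluation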

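Notes:

1. Data source: the summary report of the second-phase evaluation of the national 973 English-Chinese machine translation project.

2. Relative tagging accuracy: RTAG = TAG1/SEG × 100%.

3. Because our POS tag set differs considerably from the tag set used by the 973 expert group, the POS tagging accuracy is not comparable.

4. The expert group's open evaluation shows that the HHMM-based ICTCLAS solves Chinese lexical analysis in practice, and that its segmentation results stand out against comparable systems from peer institutions.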

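3.2 Results of the First International Chinese Word Segmentation Bakeoff

To compare and evaluate the performance of different methods and systems, the ACL Special Interest Group on Chinese Language Processing (SIGHAN; www.sighan.org), under the 41st Annual Meeting of the Association for Computational Linguistics (41st ACL), held the First International Chinese Word Segmentation Bakeoff from April 22 to 25, 2003 [28]. Nineteen research institutions from six countries and regions, including mainland China, Taiwan, and the United States, registered, and twelve teams submitted final results. The bakeoff used large-scale corpus testing with overall scoring. The corpora and segmentation standards came from Peking University (simplified Chinese), the Penn Chinese Treebank (simplified Chinese), City University of Hong Kong (traditional Chinese), and Academia Sinica in Taiwan (traditional Chinese). Each standard had two tasks (tracks): the restricted-training task (closed track) and the unrestricted-training task (open track).

NLPIR/ICTCLAS took part in all four simplified Chinese tasks and in the traditional Chinese closed track. In the Penn Chinese Treebank closed track it achieved an overall score of 0.881 [28], ranking first. In the Peking University closed track it achieved an overall score of 0.951 [28], ranking first. In the Peking University open track it achieved an overall score of 0.953 [28], ranking second. Notably, within just two days we took the kernel code of the simplified Chinese version of ICTCLAS and extended the multi-layer hidden Markov model to traditional Chinese segmentation, again achieving an overall score of 0.938 [28].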

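3.3 NLPIR/ICTCLAS Evaluation Results

We ran open tests on the plain-text news corpus of People's Daily from January 1998. The precision and speed of ICTCLAS 3.0 are shown in the table below:

Item | Open test 1 | Open test 2 | Open test 3
Function | Segmentation | Segmentation + named entity and new word recognition | Segmentation + named entity and new word recognition + POS tagging
Test file size | 4,092,478 bytes | 4,092,478 bytes | 4,092,478 bytes
Time (s) | 4.094000 | 6.467561 | 9.094001
Memory used by core data | 5.5MB | 7.2MB | 8.9MB
Speed | 999.63 KB/s | 632.77 KB/s | 450.02 KB/s
Precision | Segmentation: 96.56% | Segmentation: 98.13% | Segmentation: 98.13%; POS tagging: 94.63%

Notes:

1. Test machine configuration: CPU: Pentium 4 3.0 GHz; memory: 512 MB.

2. Segmentation precision is the number of correctly segmented words as a percentage of the total number of words in the reference result. POS tagging precision is the number of words with both segmentation and POS tag correct as a percentage of the total number of words in the reference result.

3. Open test: the test samples do not belong to the training set; otherwise the test is a closed test. A closed test is like an exam whose questions all come from the textbook already studied, so it has no real significance. Some vendors nevertheless deliberately blur the distinction and pass off closed tests as open tests to advertise an accuracy of 99.5%. In fact, reaching 100% precision on a closed test by rote memorization of a small sample poses no difficulty at all. Users should take particular note of this.

4. NLPIR/ICTCLAS Milestones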

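- May 2000: Zhang Huaping joined the natural language processing group led by Professor Liu Qun at the Institute of Computing Technology, Chinese Academy of Sciences, and began research and development on word segmentation. The first version was completed and released in August 2000, together with the first paper on the segmenter.

- July 2002: In the evaluation by the expert group of the 973 project "Image, Speech, Natural Language Understanding and Knowledge Mining", the system scored highest among all participating systems. (Segmentation accuracy reached 97.58%; participants included Peking University and Tsinghua University.)

- January 7, 2003: Received a software copyright registration certificate from the National Copyright Administration (certificate number 软著登字005178号).

- April 22 to 25, 2003: ICTCLAS took part in the First International Chinese Word Segmentation Bakeoff [10], held by the ACL Special Interest Group on Chinese Language Processing (SIGHAN) under the 41st Annual Meeting of the Association for Computational Linguistics (41st ACL). Of the six tasks it entered, it won two first places and one second place. (The participants were twelve systems from six countries and regions, including Microsoft, SYSTRAN, the University of Pennsylvania, the University of California, Berkeley, and Peking University.)

- As one of the 15 free technology results of the Institute of Computing Technology, the system was downloaded about 30,000 times by users in China and abroad. As free software for the open Chinese natural language processing platform, it received wide attention and was reported by Science Times (《科学时报》), Sina, the overseas edition of People's Daily, China News Service, Xinhuanet, and People's Daily Online [11,12,13,14,15]. The research results we provide in various forms have been widely applied in academia and industry, including by companies such as 3721, NEC Research, 中华商务网, 硅谷动力, and Yunnan Daily, and by research institutions such as Xinjiang University, Tsinghua University, South China University of Technology, and the University of Massachusetts.

- July 2004: ICTCLAS 2.0 released.

- December 2005: ICTCLAS 2.6 released.

- April 2006: ICTCLAS 3.0 released, with speed close to 1 MB/s and precision of 98.13%.

- Early 2008: ICTCLAS2008 released; from then on the year is used as the version number.

- Early 2010: Dr. Zhang Huaping moved to Beijing Institute of Technology, released ICTCLAS2010, and changed the name to NLPIR. In 2010 the system won first prize in the Qian Weichang Chinese Information Processing Science and Technology Award.

- November 2012: NLPIR/ICTCLAS2013 released, adding adaptive segmentation, new word identification, and keyword extraction. For the first time an internal beta was released through social networks. The library files were uniformly renamed libNLPIR.so/NLPIR.dll.

- December 2013: NLPIR/ICTCLAS2014 released; the first offline conference for segmentation users was held.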

5. C/C++ Interface

5.1 NLPIR_Init

Initializes the analyzer and prepares the data NLPIR needs, according to the configuration file.

bool NLPIR_Init(const char *sInitDirPath=0, int encoding=GBK_CODE, const char *sLicenceCode=0);

Routine: NLPIR_Init
Required Header: <NLPIR.h>

Return Value

Returns true if initialization succeeds; otherwise returns false.

Parameters

sInitDirPath: initial directory path, where the file Configure.xml and the Data directory are stored. The default value 0 means the current working directory.

encoding: encoding of the input string. The default is GBK_CODE (GBK encoding); it can also be set to UTF8_CODE (UTF-8 encoding) or BIG5_CODE (BIG5 encoding).

sLicenceCode: license code, used only by some commercial users. Other users can ignore this argument.

Remarks

NLPIR_Init must be invoked before any other NLPIR operation. It needs to be called only once, before NLPIR is started. When the system stops and no further operations will be made, NLPIR_Exit should be invoked to destroy all working buffers. Every operation fails if initialization did not succeed.

NLPIR_Init fails mainly for two reasons: 1) required data is incompatible or missing; 2) the configuration file is missing or contains invalid parameters. The log file NLPIR.log in the default directory gives more detail.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 0); // 0: no POS tags
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}

5.2 NLPIR_Exit

Exits the program, frees all resources, and destroys all working buffers used in NLPIR.

bool NLPIR_Exit();

Routine: NLPIR_Exit
Required Header: <NLPIR.h>

Return Value

Returns true if it succeeds; otherwise returns false.

Parameters

None

Remarks

NLPIR_Exit must be invoked when the system stops and no further operations will be made. Call NLPIR_Init again to restart NLPIR.

Example

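#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 1); // 1: with POS tags
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit(); // release all NLPIR resources before leaving
    return 0;
}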

5.3 NLPIR_ImportUserDict

Imports a user-defined dictionary from a text file.

unsigned int NLPIR_ImportUserDict(const char *sFilename);

Routine: NLPIR_ImportUserDict
Required Header: <NLPIR.h>

Return Value

The number of lexical entries imported successfully.

Parameters

sFilename: text file name of the user dictionary.

Remarks

NLPIR_ImportUserDict works properly only if NLPIR_Init succeeds. For the format of the text dictionary file, see User-defined Lexicon.

You only need to invoke this function when you first use a customized lexicon or when you change it. After one import, and as long as no further change is made, NLPIR loads the lexicon automatically if UserDict is set to "on" in the configuration file. When UserDict is set to "off", the user-defined lexicon is not applied.

Example

#include "NLPIR.h"
#include <stdio.h>

int main(int argc, char* argv[])
{
    // Sample4: User-defined dictionary
    const char *sInput = "1989年春夏之交的政治风波1989年政治风波24小时降雪量24小时降雨量863计划ABC防护训练APEC会议BB机BP机C2系统C3I系统C3系统C4ISR系统C4I系统CCITT建议";
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    sResult = NLPIR_ParagraphProcess(sInput, 1);
    printf("Before Adding User-defined lexicon, the result is:\n%s\n", sResult);

    // "userdict.txt" is a placeholder: the original file name was lost from this example.
    unsigned int nItems = NLPIR_ImportUserDict("userdict.txt"); // import user dictionary
    printf("%u user-defined lexical entries added!\n", nItems);

    sResult = NLPIR_ParagraphProcess(sInput, 1);
    printf("After Adding User-defined lexicon, the result is:\n%s\n", sResult);

    NLPIR_Exit();
    return 0;
}

Output

Before Adding User-defined lexicon, the result is:

1989年/t 春/tg 夏/tg 之/uzhi 交/ng 的/ude1 政治/n 风波/n 1989年/t 政治/n 风波/n 24/m 小时/n 降雪/vn 量/n 24/m 小时/q 降雨量/n 863/m 计划ABC防护训练APEC会议BB机BP机C2系统C3I系统C3系统C4ISR系统C4I/nt 系统/n CCITT/x 建议/n

14321 user-defined lexical entries added!

After Adding User-defined lexicon, the result is:

1989年春夏之交的政治风波/n 1989年政治风波/n 24小时降雪量/n 24小时降雨量/n 863计划/n ABC防护训练/vn APEC会议/nz BB机/n BP机/n C2系统/n C3I系统/n C3系统/n C4ISR系统/n C4I系统/n CCITT建议/t
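The tagged output above is a sequence of whitespace-separated word/tag tokens. The sketch below shows one way such a string could be split into (word, tag) pairs on the caller's side. It is not part of the NLPIR API: the function name parseTagged is hypothetical, and the sketch assumes that the tag follows the last '/' in each token.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an NLPIR function): splits POS-tagged output such as
// "政治/n 风波/n" into (word, tag) pairs.
// Assumption: tokens are separated by whitespace and the tag follows the LAST
// '/' of a token, so a word that itself contains '/' keeps its text intact.
// A token with no usable '/' is returned with an empty tag.
std::vector<std::pair<std::string, std::string>> parseTagged(const std::string& tagged)
{
    std::vector<std::pair<std::string, std::string>> result;
    std::istringstream in(tagged);
    std::string token;
    while (in >> token) {
        const std::string::size_type slash = token.rfind('/');
        if (slash == std::string::npos || slash == 0) {
            result.emplace_back(token, std::string());
        } else {
            result.emplace_back(token.substr(0, slash), token.substr(slash + 1));
        }
    }
    return result;
}
```

Calling parseTagged on the result of NLPIR_ParagraphProcess(sInput, 1) would give one pair per segmented word, which makes it straightforward to filter by tag, for example to keep only the entries tagged n.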

5.4 NLPIR_ParagraphProcess

Processes a paragraph and returns a pointer to the result buffer.

const char * NLPIR_ParagraphProcess(const char *sParagraph, int bPOStagged=1);

Routine: NLPIR_ParagraphProcess
Required Header: <NLPIR.h>

Return Value

Returns the pointer to the result buffer.

Parameters

sParagraph: the source paragraph.

bPOStagged: whether POS tagging is needed; 0 for no tags, 1 for tagging. Default: 1.

Remarks

NLPIR_ParagraphProcess works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 1);
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}

5.5 NLPIR_ParagraphProcessA

result_t * NLPIR_ParagraphProcessA(const char *sParagraph, int *pResultCount, bool bUserDict=true);

Routine: NLPIR_ParagraphProcessA
Required Header: <NLPIR.h>

Return Value

Pointer to the result vector. The vector is managed by the system; the caller must not allocate or free it.

struct result_t{
    int start;            // start position of the word in the input sentence
    int length;           // length of the word
    char sPOS[POS_SIZE];  // word type; POS ID value, allows fast lookup in the POS table
    int iPOS;             // part of speech
    int word_ID;          // set to 0 or -1 if the word is an unknown (out-of-vocabulary) word
    int word_type;        // distinguishes user dictionary words: 1, word from the user dictionary; 0, word not from the user dictionary
    int weight;           // word weight
};

Parameters

sParagraph: the source paragraph.

pResultCount: pointer that receives the size of the result vector.

bUserDict: whether to use the user dictionary.

Remarks

NLPIR_ParagraphProcessA works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const result_t *pVecResult;
    int nCount;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        pVecResult = NLPIR_ParagraphProcessA(sSentence, &nCount, true);
        for (int i = 0; i < nCount; i++)
        {
            printf("Start=%d Length=%d Word_ID=%d POS_ID=%d\n",
                   pVecResult[i].start, pVecResult[i].length,
                   pVecResult[i].word_ID, pVecResult[i].iPOS);
        }
        scanf("%1999s", sSentence); // read the next sentence so the loop can end
    }

    NLPIR_Exit();
    return 0;
}
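Each result_t entry locates a word through its start and length fields. As a minimal sketch, the helper below recovers the word text from the input buffer under the assumption that both fields are byte offsets in the encoding passed to NLPIR_Init (the C# example in section 5.7, which divides both values by 2 for GBK text, suggests this, but the manual does not state it outright). The name extractWord is hypothetical and is not part of NLPIR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not an NLPIR function).
// Assumption: start and length are BYTE offsets into the input buffer.
// Returns the bytes [start, start + length) of text, or an empty string
// when the span is negative or falls outside the buffer.
std::string extractWord(const std::string& text, int start, int length)
{
    if (start < 0 || length < 0) {
        return std::string();
    }
    const std::size_t s = static_cast<std::size_t>(start);
    const std::size_t n = static_cast<std::size_t>(length);
    if (s > text.size() || n > text.size() - s) {
        return std::string();
    }
    return text.substr(s, n);
}
```

With a result vector from NLPIR_ParagraphProcessA, a call such as extractWord(sSentence, pVecResult[i].start, pVecResult[i].length) would yield the text of the i-th word. Working on bytes avoids the fixed two-bytes-per-character assumption, which breaks on text that mixes Chinese and ASCII characters.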

5.6 NLPIR_FileProcess

Processes a text file.

double NLPIR_FileProcess(const char *sSourceFilename, const char *sResultFilename, int bPOStagged=1);

Routine: NLPIR_FileProcess
Required Header: <NLPIR.h>

Return Value

Returns the processing speed if processing succeeds; otherwise returns false.

Parameters

sSourceFilename: name of the source file to be analyzed.

sResultFilename: name of the file in which the results are stored.

bPOStagged: whether POS tagging is needed; 0 for no tags, 1 for tagging. Default: 1.

Remarks

NLPIR_FileProcess works properly only if NLPIR_Init succeeds. The output format is customized in the NLPIR configuration.

Example

#include "NLPIR.h"
#include <stdio.h>

int main(int argc, char* argv[])
{
    // Sample2: File text lexical analysis
    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Processing file...\n");
    // Both file names are placeholders: the original arguments were lost from this example.
    NLPIR_FileProcess("test.txt", "test_result.txt", 1);

    NLPIR_Exit();
    return 0;
}

5.7 NLPIR_GetParagraphProcessAWordCount

Gets the word count of a paragraph. API for C#.

int NLPIR_GetParagraphProcessAWordCount(const char *sParagraph);

Routine: NLPIR_GetParagraphProcessAWordCount
Required Header: <NLPIR.h>

Return Value

Returns the word count of the paragraph.

Parameters

sParagraph: the source paragraph.

Remarks

NLPIR_GetParagraphProcessAWordCount works properly only if NLPIR_Init succeeds.

The output format is customized in the NLPIR configuration.

Example

using System;
using System.IO;
using System.Runtime.InteropServices;

namespace win_csharp
{
    [StructLayout(LayoutKind.Explicit)]
    public struct result_t
    {
        [FieldOffset(0)] public int start;
        [FieldOffset(4)] public int length;
        [FieldOffset(8)] public int sPos;
        [FieldOffset(12)] public int sPosLow;
        [FieldOffset(16)] public int POS_id;
        [FieldOffset(20)] public int word_ID;
        [FieldOffset(24)] public int word_type;
        [FieldOffset(28)] public int weight;
    }

    /// <summary>
    /// Summary description for Class1.
    /// </summary>
    class Class1
    {
        // Placeholder: the original DLL path was lost from this example.
        const string path = @"NLPIR.dll";

        // The EntryPoint values were also lost; the plain exported function names are assumed.
        // NLPIR_Init is declared with all three parameters of the C prototype in section 5.1,
        // because C++ default arguments are not available through P/Invoke.
        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_Init")]
        public static extern bool NLPIR_Init(String sInitDirPath, int encoding, String sLicenceCode);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_ParagraphProcess")]
        public static extern String NLPIR_ParagraphProcess(String sParagraph, int bPOStagged);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_Exit")]
        public static extern bool NLPIR_Exit();

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_ImportUserDict")]
        public static extern int NLPIR_ImportUserDict(String sFilename);

        // The C prototype in section 5.6 returns double (the processing speed).
        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_FileProcess")]
        public static extern double NLPIR_FileProcess(String sSrcFilename, String sDestFilename, int bPOStagged);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_FileProcessEx")]
        public static extern bool NLPIR_FileProcessEx(String sSrcFilename, String sDestFilename);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_GetParagraphProcessAWordCount")]
        static extern int NLPIR_GetParagraphProcessAWordCount(String sParagraph);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_ParagraphProcessAW")]
        static extern void NLPIR_ParagraphProcessAW(int nCount, [Out, MarshalAs(UnmanagedType.LPArray)] result_t[] result);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_AddUserWord")]
        static extern int NLPIR_AddUserWord(String sWord);

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_SaveTheUsrDic")]
        static extern int NLPIR_SaveTheUsrDic();

        [DllImport(path, CharSet = CharSet.Ansi, EntryPoint = "NLPIR_DelUsrWord")]
        static extern int NLPIR_DelUsrWord(String sWord);

        /// <summary>
        /// The main entry point of the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            // 0 is assumed to be the value of GBK_CODE; confirm it in NLPIR.h.
            if (!NLPIR_Init(null, 0, null))
            {
                System.Console.WriteLine("Init failed!");
                return;
            }

            String s = "点击下载超女纪敏佳深受观众喜爱。禽流感爆发在非典之后。";
            int count = NLPIR_GetParagraphProcessAWordCount(s); // first get the number of words in the result
            result_t[] result = new result_t[count];            // allocate the result array on the client side
            NLPIR_ParagraphProcessAW(count, result);            // copy the results into the client's memory

            int i = 1;
            foreach (result_t r in result)
            {
                String sWhichDic = "";
                switch (r.word_type)
                {
                    case 0:
                        sWhichDic = "core dictionary";
                        break;
                    case 1:
                        sWhichDic = "user dictionary";
                        break;
                    case 2:
                        sWhichDic = "professional dictionary";
                        break;
                    default:
                        break;
                }
                // start and length are divided by 2 because every character of this
                // all-Chinese GBK sample occupies two bytes.
                Console.WriteLine("No.{0}:start:{1}, length:{2},POS_ID:{3},Word_ID:{4}, UserDefine:{5}, Word:{6}\n",
                    i++, r.start, r.length, r.POS_id, r.word_ID, sWhichDic,
                    s.Substring(r.start / 2, r.length / 2));
            }

            NLPIR_Exit();
        }
    }
}

5.8 NLPIR_ParagraphProcessAW

Processes a paragraph. API for C#.

void NLPIR_ParagraphProcessAW(int nCount, result_t *result);

Routine: NLPIR_ParagraphProcessAW
Required Header: <NLPIR.h>

Return Value

None

Parameters

nCount: the word count of the paragraph.

result: pointer to the structure array in which the results are stored.

Remarks

NLPIR_ParagraphProcessAW works properly only if NLPIR_Init succeeds. The output format is customized in the NLPIR configuration.

Example

(See the example in section 5.7 above.)

5.9 NLPIR_AddUserWord

Adds a word to the user dictionary.

int NLPIR_AddUserWord(const char *sWord);

Routine: NLPIR_AddUserWord
Required Header: <NLPIR.h>

Return Value

Returns 1 if the word is added successfully; otherwise returns 0.

Parameters

sWord: the word to add.

Remarks

NLPIR_AddUserWord works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    // Add an entry in the form word\tPOS: "爱思客" is the word to add,
    // "n" is its POS tag, and "\t" is the separator.
    NLPIR_AddUserWord("爱思客\tn");

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 0);
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}
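The example above passes NLPIR_AddUserWord a single string in which the word and its POS tag are separated by a tab. As a small sketch, a helper like the following could build that string and reject inputs that would break the format. The name makeUserWordEntry is hypothetical and not part of NLPIR, and requiring both parts to be non-empty is an assumption made for this sketch only.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an NLPIR function): builds "word<TAB>pos", the entry
// format that the example comment in this section describes.
// Assumption of this sketch: both parts must be non-empty and must not contain
// a tab or line break, since those would corrupt the entry.
std::string makeUserWordEntry(const std::string& word, const std::string& pos)
{
    const char* forbidden = "\t\r\n";
    if (word.empty() || pos.empty() ||
        word.find_first_of(forbidden) != std::string::npos ||
        pos.find_first_of(forbidden) != std::string::npos) {
        throw std::invalid_argument("word and pos must be non-empty and free of tabs and line breaks");
    }
    return word + '\t' + pos;
}
```

The result can be handed to the API as NLPIR_AddUserWord(makeUserWordEntry("爱思客", "n").c_str()).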

5.10 NLPIR_SaveTheUsrDic

Saves the user dictionary to disk.

int NLPIR_SaveTheUsrDic();

Routine: NLPIR_SaveTheUsrDic
Required Header: <NLPIR.h>

Return Value

Returns 1 if the dictionary is saved successfully; otherwise returns 0.

Parameters

None

Remarks

NLPIR_SaveTheUsrDic works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    NLPIR_AddUserWord("爱思客\tn"); // entry format: word\tPOS
    NLPIR_SaveTheUsrDic();          // save the user dictionary

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 0);
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}

5.11 NLPIR_DelUsrWord

Deletes a word from the user dictionary.

int NLPIR_DelUsrWord(const char *sWord);

Routine: NLPIR_DelUsrWord
Required Header: <NLPIR.h>

Return Value

Returns -1 if the word does not exist in the user dictionary; otherwise returns the handle of the deleted word.

Parameters

sWord: the word to delete.

Remarks

NLPIR_DelUsrWord works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];
    const char *sResult;

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    NLPIR_AddUserWord("iThinker\tn"); // entry format: word\tPOS
    NLPIR_AddUserWord("爱思客\tn");
    NLPIR_DelUsrWord("iThinker");     // delete iThinker
    NLPIR_SaveTheUsrDic();            // save the user dictionary

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 0);
        printf("%s\nInput sentence now ('q' to quit)!\n", sResult);
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}

5.12 NLPIR_GetKeyWords

Extracts keywords from a paragraph.

NLPIR_API const char * NLPIR_GetKeyWords(const char *sLine, int nMaxKeyLimit=50, bool bWeightOut=false);

Routine: NLPIR_GetKeyWords
Required Header: <NLPIR.h>

Return Value

Returns the keyword list if execution succeeds; otherwise returns NULL.

Format as:

"科学发展观 宏观经济 "

"科学发展观 23.80 宏观经济 12.20"

Parameters

sLine: the input text.

nMaxKeyLimit: the maximum number of keywords.

bWeightOut: whether to output the keyword weights.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        const char *sKeyword = NLPIR_GetKeyWords(sSentence);
        if (sKeyword != NULL)
        {
            printf("%s\n", sKeyword); // print the extracted keywords
        }
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}
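When bWeightOut is true, the return value interleaves keywords and weights, as in the second format string of this section. The sketch below parses such a string into (word, weight) records. It assumes exactly the whitespace-separated layout shown in this manual; the names Keyword and parseWeightedKeywords are hypothetical and not part of NLPIR.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for one extracted keyword.
struct Keyword {
    std::string word;
    double weight;
};

// Hypothetical helper (not an NLPIR function): parses a list such as
// "科学发展观 23.80 宏观经济 12.20" into (word, weight) records.
// Assumption: words and weights alternate and are separated by whitespace,
// as in the format shown in this section. Parsing stops at the first token
// pair that does not fit this pattern.
std::vector<Keyword> parseWeightedKeywords(const std::string& list)
{
    std::vector<Keyword> keywords;
    std::istringstream in(list);
    std::string word;
    double weight = 0.0;
    while (in >> word >> weight) {
        keywords.push_back(Keyword{word, weight});
    }
    return keywords;
}
```

For the unweighted format (bWeightOut set to false), the same stream-based splitting applies with the weight extraction left out.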

5.13 NLPIR_GetFileKeyWords

Extracts keywords from a text file.

NLPIR_API const char * NLPIR_GetFileKeyWords(const char *sTextFile, int nMaxKeyLimit=50, bool bWeightOut=false);

Routine: NLPIR_GetFileKeyWords
Required Header: <NLPIR.h>

Return Value

Returns the keyword list if execution succeeds; otherwise returns NULL.

Format as:

"科学发展观 宏观经济 "

"科学发展观 23.80 宏观经济 12.20"

Parameters

sTextFile: the input text file name.

nMaxKeyLimit: the maximum number of keywords.

bWeightOut: whether to output the keyword weights.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    // Use the file variant of the call, since the argument is a file name.
    const char *sKeyword = NLPIR_GetFileKeyWords("十八大报告.txt");
    if (sKeyword != NULL)
    {
        printf("%s\n", sKeyword);
    }

    NLPIR_Exit();
    return 0;
}

5.14 NLPIR_GetNewWords

Extracts new words from a paragraph.

NLPIR_API const char * NLPIR_GetNewWords(const char *sLine, int nMaxKeyLimit=50, bool bWeightOut=false);

Routine: NLPIR_GetNewWords
Required Header: <NLPIR.h>

Return Value

Returns the new word list if execution succeeds; otherwise returns NULL.

Format as:

"科学发展观 宏观经济 "

"科学发展观 23.80 宏观经济 12.20"

Parameters

sLine: the input text.

nMaxKeyLimit: the maximum number of new words.

bWeightOut: whether to output the word weights.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    char sSentence[2000];

    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    printf("Input sentence now ('q' to quit)!\n");
    scanf("%1999s", sSentence);
    // _stricmp is MSVC-specific; on POSIX systems use strcasecmp from <strings.h>.
    while (_stricmp(sSentence, "q") != 0)
    {
        const char *sNewWords = NLPIR_GetNewWords(sSentence);
        if (sNewWords != NULL)
        {
            printf("%s\n", sNewWords); // print the new words found
        }
        scanf("%1999s", sSentence);
    }

    NLPIR_Exit();
    return 0;
}

5.15 NLPIR_GetFileNewWords

Extracts new words from a text file.

NLPIR_API const char * NLPIR_GetFileNewWords(const char *sTextFile, int nMaxKeyLimit=50, bool bWeightOut=false);

Routine: NLPIR_GetFileNewWords
Required Header: <NLPIR.h>

Return Value

Returns the new word list if execution succeeds; otherwise returns NULL.

Format as:

"科学发展观 宏观经济 "

"科学发展观 23.80 宏观经济 12.20"

Parameters

sTextFile: the input text file name.

nMaxKeyLimit: the maximum number of new words.

bWeightOut: whether to output the word weights.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample1: Sentence or paragraph lexical analysis with only one result
    if (!NLPIR_Init())
    {
        printf("Init fails\n");
        return -1;
    }

    const char *sNewWords = NLPIR_GetFileNewWords("十八大报告.txt");
    if (sNewWords != NULL)
    {
        printf("%s\n", sNewWords);
    }

    NLPIR_Exit();
    return 0;
}

5.16 NLPIR_FingerPrint

Extracts a fingerprint from a paragraph.

NLPIR_API unsigned long NLPIR_FingerPrint(const char *sLine);

Routine: NLPIR_FingerPrint
Required Header: <NLPIR.h>

Return Value

0 on failure; otherwise, the fingerprint of the content.

Parameters

sLine: the input text.

Remarks

None.

Example

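#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample: compute a fingerprint for each input string.
    // The message strings and the "q" sentinel are assumed.
    char sSentence[2000];
    if (!NLPIR_Init())
    {
        printf("NLPIR_Init failed!\n");
        return -1;
    }

    printf("Input a sentence ('q' to quit):\n");
    int nCount = 0;
    while (scanf("%1999s", sSentence) == 1 && _stricmp(sSentence, "q") != 0)
    {
        unsigned long lFinger = NLPIR_FingerPrint(sSentence);
        printf("%d: %lu\n", ++nCount, lFinger);
    }

    NLPIR_Exit();
    return 0;
}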
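One plausible use of a fingerprint is to skip texts that have already been seen. The sketch below assumes that equal texts always yield equal fingerprints and that 0 means failure, as stated under Return Value. DemoFingerPrint is a stand-in (32-bit FNV-1a) so that the sketch compiles without the library; it is not the algorithm NLPIR uses. DropRepeatedFingerprints and FingerPrintFn are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Same shape as NLPIR_FingerPrint: returns 0 on failure.
typedef unsigned long (*FingerPrintFn)(const char *sLine);

// Stand-in fingerprint for illustration only (32-bit FNV-1a).
// This is NOT the algorithm NLPIR uses.
unsigned long DemoFingerPrint(const char *sLine)
{
    if (sLine == nullptr || *sLine == '\0') return 0;
    unsigned long h = 2166136261UL;
    for (const unsigned char *p = reinterpret_cast<const unsigned char *>(sLine); *p; ++p) {
        h ^= *p;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h != 0 ? h : 1;   // keep 0 reserved for failure
}

// Returns the texts whose fingerprint has not been seen before, in input order.
// Texts that cannot be fingerprinted (result 0) are kept, since nothing is known about them.
std::vector<std::string> DropRepeatedFingerprints(const std::vector<std::string> &texts,
                                                  FingerPrintFn fingerPrint)
{
    assert(fingerPrint != nullptr);
    std::unordered_set<unsigned long> seen;
    std::vector<std::string> unique;
    for (const std::string &text : texts) {
        unsigned long fp = fingerPrint(text.c_str());
        if (fp == 0 || seen.insert(fp).second) unique.push_back(text);
    }
    return unique;
}
```

With the library linked, NLPIR_FingerPrint could be passed in place of DemoFingerPrint. Two different texts can share a fingerprint, so a match is best treated as a hint that the texts may be the same rather than as proof.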

5.17 NLPIR_SetPOSmap

Selects which part-of-speech (POS) tag map to use.

int NLPIR_SetPOSmap(int nPOSmap);

Routine: NLPIR_SetPOSmap
Required Header: <NLPIR.h>

Return Value

Returns 1 if execution succeeds; otherwise returns 0.

Parameters

nPOSmap: the POS tag set to use, one of:
ICT_POS_MAP_FIRST: ICT (Institute of Computing Technology) first-level tag set
ICT_POS_MAP_SECOND: ICT second-level tag set
PKU_POS_MAP_SECOND: PKU (Peking University) second-level tag set
PKU_POS_MAP_FIRST: PKU first-level tag set

Remarks

The NLPIR_SetPOSmap function works properly only if NLPIR_Init succeeds.

Example

#include "NLPIR.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    // Sample: select the ICT first-level POS map, then process each input string.
    // The message strings and the "q" sentinel are assumed.
    char sSentence[2000];
    const char *sResult;
    if (!NLPIR_Init())
    {
        printf("NLPIR_Init failed!\n");
        return -1;
    }

    NLPIR_SetPOSmap(ICT_POS_MAP_FIRST);

    printf("Input a sentence ('q' to quit):\n");
    while (scanf("%1999s", sSentence) == 1 && _stricmp(sSentence, "q") != 0)
    {
        sResult = NLPIR_ParagraphProcess(sSentence, 0);
        printf("%s\n", sResult);
    }

    NLPIR_Exit();
    return 0;
}

5.17 Batch New Word Identification

/*********************************************************************
 *
 * The following functions were added in the 2013 version specifically
 * for new word discovery. They are generally best run offline and are
 * not suited to online processing.
 * Once new word identification has finished, import the results into
 * the segmentation system to complete the process.
 * These functions are prefixed with NLPIR_NWI (New Word Identification).
 *
 *********************************************************************/

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_Start
 *
 * Description: Start new word identification.
 *
 * Parameters : None
 * Returns    : bool, true: success, false: fail
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API bool NLPIR_NWI_Start(); // New Word Identification start

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_AddFile
 *
 * Description: Add a text file in which new words are to be identified.
 *              Only effective after NLPIR_NWI_Start() has been called.
 *
 * Parameters : const char *sFilename: the file name
 * Returns    : bool, true: success, false: fail
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API int NLPIR_NWI_AddFile(const char *sFilename);

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_AddMem
 *
 * Description: Add a block of in-memory text in which new words are to
 *              be identified.
 *              Only effective after NLPIR_NWI_Start() has been called.
 *
 * Parameters : const char *sText: the text to add
 * Returns    : bool, true: success, false: fail
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API bool NLPIR_NWI_AddMem(const char *sText);

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_Complete
 *
 * Description: Mark the end of adding content for new word identification.
 *              Only effective after NLPIR_NWI_Start() has been called.
 *
 * Parameters : None
 * Returns    : bool, true: success, false: fail
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API bool NLPIR_NWI_Complete(); // new words

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_GetResult
 *
 * Description: Get the result of new word identification.
 *              Only effective after NLPIR_NWI_Complete() has been called.
 *
 * Parameters : bWeightOut: whether to output the weight of each new word
 *
 * Returns    : The output format is
 *              [new word 1] [weight 1] [new word 2] [weight 2] ...
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API const char * NLPIR_NWI_GetResult(bool bWeightOut=false); // output the new word identification result

/*********************************************************************
 *
 * Func Name  : NLPIR_NWI_Result2UserDict
 *
 * Description: Import the new word identification result into the user
 *              dictionary.
 *              Only effective after NLPIR_NWI_Complete() has been called.
 *              To keep the new words permanently, it is recommended to
 *              call NLPIR_SaveTheUsrDic afterwards.
 * Parameters : None
 * Returns    : unsigned int, the number of new words in the result
 *
 * Author     : Kevin Zhang
 * History    :
 *     1. create 2012/11/23
 *********************************************************************/
NLPIR_API unsigned int NLPIR_NWI_Result2UserDict(); // convert the new word result into the user dictionary; returns the number of new words

Example

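void testNewWord(int nCode)
{
    // "..." marks string literals that are missing from the source text.

    // Initialize the segmentation component.
    // Data is under the current path; segmentation defaults to GBK encoding.
    if (!NLPIR_Init("...", nCode))
    {
        printf("...");
        return;
    }

    char sInputFile[1024] = "...";
    if (nCode == UTF8_CODE)
    {
        strcpy(sInputFile, "...");
    }

    NLPIR_NWI_Start();              // start new word discovery
    NLPIR_NWI_AddFile(sInputFile);  // add a training file; may be called repeatedly
    NLPIR_NWI_Complete();           // finished adding files or training content

    const char *pNewWordlist = NLPIR_NWI_GetResult();  // fetch the new word identification result
}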
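The header comments in this section imply a call order: NLPIR_NWI_Start first, then NLPIR_NWI_AddFile or NLPIR_NWI_AddMem, then NLPIR_NWI_Complete, and only then NLPIR_NWI_GetResult or NLPIR_NWI_Result2UserDict. The following is a minimal sketch of a guard that tracks this order. NwiSession is a hypothetical helper, not part of NLPIR, and it does not call the library. It also assumes that content can no longer be added once Complete has been called, which the source does not state.

```cpp
#include <cassert>

// Hypothetical helper, not part of NLPIR: it only tracks which NWI calls
// the header comments describe as effective at each point.
class NwiSession {
public:
    // Call alongside NLPIR_NWI_Start().
    void Start() { state_ = State::Collecting; }

    // True when NLPIR_NWI_AddFile / NLPIR_NWI_AddMem would be effective.
    // Assumption: adding is not effective after Complete().
    bool CanAdd() const { return state_ == State::Collecting; }

    // Call alongside NLPIR_NWI_Complete().
    // Returns false unless content is currently being collected.
    bool Complete()
    {
        if (state_ != State::Collecting) return false;
        state_ = State::Completed;
        return true;
    }

    // True when NLPIR_NWI_GetResult / NLPIR_NWI_Result2UserDict would be effective.
    bool CanReadResult() const { return state_ == State::Completed; }

private:
    enum class State { Idle, Collecting, Completed };
    State state_ = State::Idle;
};

// Walks through the documented order once; each assert mirrors one rule.
void DemoNwiCallOrder()
{
    NwiSession session;
    assert(!session.CanAdd());         // adding before Start is not effective
    session.Start();
    assert(session.CanAdd());
    assert(!session.CanReadResult());  // results require Complete
    assert(session.Complete());
    assert(session.CanReadResult());
}
```

In real code each check would sit next to the matching library call, for example testing CanAdd() before NLPIR_NWI_AddFile, so that an out-of-order call is caught by the caller rather than silently having no effect.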
