IR Experiments with Lemur
Nancy McCracken
October 21, 2004
Adapted from a presentation in IST 657
Lemur assistance: Shuyuan Mary Ho
IR model slides: Liz Liddy and Anne Diekema
Example experiment: Sijo Cherian

Outline
* Overview of Lemur project goals and capabilities
* Standard IR experiments made possible by TREC
* Steps for IR using Lemur
  – Document preparation and indexing
  – Query preparation
  – Retrieval using several models
  – Other applications
* Evaluation for TREC experiments
* Example experiment

* The Lemur Toolkit is designed to facilitate research in language modeling and information retrieval. The toolkit supports
  – indexing of large-scale text databases,
  – the construction of simple language models for documents, queries, or subcollections, and
  – the implementation of retrieval systems based on language models as well as a variety of other retrieval models.
* A Lemur is a nocturnal, monkey-like African animal that is largely confined to the island of Madagascar. "Lemur" was chosen for the name of the UMass-CMU project in part because of its resemblance to LM/IR. (The fact that the language modeling community has until recently been an island to the IR community is also suggestive.) (Lemur documentation)

Lemur Facts
* Written in C++ for Unix platforms, but also runs on Windows.
* Includes both
  – an API to program your own applications or modules
  – user applications to perform IR experiments without programming
* Note that Lucene has an API but lacks user applications.
* Maintained by UMass and CMU.
* Public forum for discussion; also invites code submissions.

Lemur Features
From the Documentation Overview at the Lemur home page:
* Indexing:
  – English, Chinese and Arabic text
  – word stemming (Porter and Krovetz stemmers)
  – omitting stopwords
  – recognizing acronyms
  – token-level properties, like part of speech and named entities
  – passage indexing
  – incremental indexing

More Lemur Features
* Retrieval:
  – ad hoc retrieval
    * TFIDF (vector model)
    * Okapi (probabilistic model)
    * relevance feedback
  – structured query language
    * InQuery (Boolean queries weighted from other models)
  – language modeling
    * KL-divergence
    * query model updating for relevance feedback
    * two-stage smoothing
    * smoothing with Dirichlet prior or Markov chain
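
To make "smoothing with Dirichlet prior" concrete, here is a minimal Python sketch of the textbook Dirichlet-smoothed estimate of p(term | document). It is an illustration only, not Lemur's code; the function name and the default `mu` value are assumptions chosen for this example.

```python
def dirichlet_smoothed_prob(term_count: int, doc_length: int,
                            collection_prob: float, mu: float = 1000.0) -> float:
    """Dirichlet-prior smoothed estimate of p(term | document).

    Computes (count of term in doc + mu * p(term | collection)) / (doc length + mu).
    `mu` is the prior strength; 1000.0 is only an illustrative default.
    """
    return (term_count + mu * collection_prob) / (doc_length + mu)
```

A term that never occurs in a document still receives a non-zero probability, because the collection model contributes `mu * collection_prob` to the numerator.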

More Lemur Features
* Distributed IR (using multiple indexes):
  – query-based sampling
  – database ranking (CORI)
  – results merging (CORI, single regression and multi-regression merge)
* Summarization and clustering
* Simple text processing
* CGI script and stand-alone GUI (written in Java Swing) for retrieval
  – Provides a user interface to submit single queries with a prepared index
  – Under development

Text REtrieval Conference (TREC)
* NIST provides the infrastructure for IR experiments
  – Large data collections protected by intellectual property rights, available for research purposes
  – Human evaluators for yearly experiments
* Results of experiments presented at yearly workshops, since 1992
  – 93 groups from 22 countries participated in 2003
  – Standard IR experiments were in the Ad Hoc track
* Current tracks include Cross Language, Filtering, Question Answering, Robust Retrieval, Terabyte, Video
* Co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense

TREC retrieval evaluation measures
* Evaluation based on relevance:
  – TREC’s definition: “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant.”
  – TREC relevance judgments are binary: relevant or not (1 or 0)
  – Human relevance judgments
* Evaluation measures based on precision and recall
  – For each query:
    * Recall: number of relevant retrieved docs / total relevant docs
    * Average precision at seen relevant docs (includes ranking)
      – (Standard precision: number of relevant retrieved docs / total retrieved docs)
    * Average precision versus standard recall levels
  – Total result: compute the average precision and recall over all queries
    * Micro-average: total precision/recall divided by the number of queries
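
The per-query measures above can be sketched in a few lines of Python. This is an illustrative sketch, not the code TREC uses: the function names are made up for this example, and `average_precision_at_seen_relevant` divides by the number of relevant documents actually seen in the ranking, which is one reading of the slide (the trec_eval tool instead divides by the total number of relevant documents).

```python
from typing import List, Sequence, Set


def precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Relevant retrieved docs / total retrieved docs."""
    if not ranked:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(ranked)


def recall(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Relevant retrieved docs / total relevant docs."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)


def average_precision_at_seen_relevant(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values taken at each relevant doc in the ranking.

    Assumption: the denominator is the number of relevant docs seen in the
    ranking, so relevant docs that were never retrieved are ignored.
    """
    precisions: List[float] = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_over_queries(values: Sequence[float]) -> float:
    """Sum of per-query values divided by the number of queries."""
    return sum(values) / len(values) if values else 0.0
```

For a ranking `["d1", "d2", "d3", "d4"]` with relevant set `{"d1", "d3", "d9"}`, precision is 2/4, recall is 2/3, and the average precision at seen relevant docs is (1/1 + 2/3) / 2.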

TREC evaluation process
* Problem: humans have to identify the relevant documents for each query. Approximate set sizes:
  – Total docs ~ 500,000
  – Retrieved ~ 1,000
  – Relevant ~ 100
  – Relevant retrieved ~ 50
* Solution is pooling: take all retrieved documents from all groups in the evaluation to form a “pool”. Human evaluators judge those documents.
  – Solution has proved to be effective
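
Pooling as described above amounts to a set union over the submitted rankings. The sketch below is a minimal illustration; `build_pool` and its optional `depth` cutoff are hypothetical names introduced here. With `depth=None` it takes all retrieved documents, as the slide states.

```python
from typing import Dict, List, Optional, Set


def build_pool(runs: Dict[str, List[str]], depth: Optional[int] = None) -> Set[str]:
    """Union of the documents retrieved by every group for one query.

    `runs` maps a group name to its ranked list of document ids.
    `depth` (an assumption, not from the slides) limits each run to its
    top-ranked documents; None means every retrieved document is pooled.
    """
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked if depth is None else ranked[:depth])
    return pool
```

Only the documents in the returned set go to the human evaluators; documents that no group retrieved are never judged.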

TREC data
* Document collection
* For each Ad Hoc track evaluation:
  – Queries, describing information need. Includes
    * Title
    * Description
    * Narrative
  – Relevance judgments. Includes entire pool
    * 0 for non-relevant
    * 1 for relevant
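
As a sketch of how the 0/1 relevance judgments can be loaded for an experiment, the Python below reads lines in the four-column layout commonly used for TREC judgment files (topic, iteration, document id, judgment). That layout is an assumption here, since the slides do not show the file format, and the function name is made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def parse_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each topic id to the set of document ids judged relevant.

    Assumes whitespace-separated lines of the form:
        topic iteration docno judgment
    where judgment is 0 (non-relevant) or 1 (relevant).
    """
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, judgment = parts
        if int(judgment) > 0:
            relevant[topic].add(docno)
    return dict(relevant)
```

Topics whose pooled documents were all judged non-relevant do not appear in the result.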

Goal
* Use Lemur for IR experiments using TREC data

Recall “Steps in the IR Process”: Documents
1. Normalize document stream to predefined format.
2. Break document stream into desired retrievable units.
3. Isolate & meta-tag sub-document pieces.
4. Identify potential indexable elements in documents.
5. Delete stop words.
6. Stem terms.
7. Extract index entries.
8. Compute weights.
9. Create/update inverted file.
* Lemur accepts TREC-formatted documents and provides capabilities for (most of) the other steps in indexing applications.
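
The later document-side steps (stop word deletion, stemming, index entry extraction, inverted file creation) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the stop word list is tiny, `toy_stem` is a crude suffix stripper standing in for Porter or Krovetz (it implements neither), and the stored weight is just the raw term frequency.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Tiny illustrative stop word list; a real run would load a full file.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "are"}


def toy_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def index_terms(text: str) -> List[str]:
    """Tokenize, delete stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each term to {doc_id: term frequency}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)
```

With the two documents "The cats are sleeping" and "A cat sleeps in the sun", both end up under the entries `cat` and `sleep`, and only the second appears under `sun`.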

Document Preparation and Initial Steps
* Lemur takes TREC document files and breaks each one into individual “documents” based on the <DOC> formatting tags in the files.
  – These are newswire documents.
* Lemur provides a standard tokenizer (called a parser) that has options for stemming and stopwords
  – Stemming: either Porter or Krovetz; both are standard strong stemmers
  – Stopwords: user provides a stopwords file
    * We will start with the stopwords file from the SMART retrieval project at Cornell.
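
The <DOC> splitting described above can be sketched with two regular expressions. This is not Lemur's parser; it is a minimal illustration that assumes each document is wrapped in `<DOC>...</DOC>` and carries its identifier in a `<DOCNO>` element, as is common in TREC newswire files. The sample document ids in the usage note are invented.

```python
import re
from typing import List, Optional, Tuple

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)


def split_trec_documents(raw: str) -> List[Tuple[Optional[str], str]]:
    """Return (docno, body) pairs, one per <DOC>...</DOC> block.

    docno is None when a block has no <DOCNO> element.
    """
    docs = []
    for match in DOC_RE.finditer(raw):
        body = match.group(1).strip()
        docno = DOCNO_RE.search(body)
        docs.append((docno.group(1) if docno else None, body))
    return docs
```

Feeding it a file that holds two `<DOC>` blocks with ids `AP-0001` and `AP-0002` yields two pairs, one per block, in file order.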

Indexing Application Programs
* The main differences between the indexers provided by Lemur are
  – Whether the index keeps word position information
  – Performance characteristics

Index Name               File Extension  File Limit  Stores Positions  Loads Fast  Disk Space Usage  Incremental
InvIndex                 .inv            no          no                no          less              no
InvFPIndex               .ifp            no          yes               no          more              yes
KeyfileIncIndex          .key            no          yes               yes         even more         yes
BasicIndex (deprecated)  -               yes         -                 -           -                 -

Setting Up Indexing
* Command line invocation of indexing application, specifying parameter file
    BuildKeyfileIncIndex build_param
* Parameters include:
  – Name of the index
      index = pindex;
  – File with a list of document files to process:
      dataFiles = filelist;
  – Stemming and stop word options:
      stemmer = porter;
      stopwords = stopwords.dat;
* As a result of this command, approximately 10 files are produced, with names starting with the index name.
  – Inverted index is not human-readable
  – Use query processor to view tokenization, stopping and stemming on a document.
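
When an experiment is repeated with different stemmers or stop word lists, it can help to generate the parameter file rather than edit it by hand. The Python sketch below only builds the text of `build_param` and `filelist` using the parameter names shown above; it does not run Lemur. The helper names are made up, and the one-path-per-line layout of `filelist` is an assumption.

```python
from typing import Dict, List


def render_params(params: Dict[str, str]) -> str:
    """Render one `name = value;` line per parameter, in insertion order."""
    return "".join(f"{name} = {value};\n" for name, value in params.items())


def indexing_setup(data_files: List[str], index_name: str = "pindex") -> Dict[str, str]:
    """Return the text of the two small input files, keyed by file name.

    Assumption: `filelist` holds one document file path per line.
    """
    params = {
        "index": index_name,
        "dataFiles": "filelist",
        "stemmer": "porter",
        "stopwords": "stopwords.dat",
    }
    return {
        "filelist": "".join(f"{path}\n" for path in data_files),
        "build_param": render_params(params),
    }
```

After writing both strings to disk, the indexer would be started as on the slide with `BuildKeyfileIncIndex build_param`.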

Recall “Steps in the IR Process”: Queries
1. Recognize query terms vs. special operators.
2. Tokenize query terms.
3. Delete stop words.
4. Stem words.
5. Create query representation.
6. Expand query terms.
7. Compute weights.
---------------------------> Matcher
* Lemur has Parse applications that create the query representation
* Matching is provided with a number of Retrieval applications

Query Representation: Keywords
* Lemur Parse application provides a keyword representation of the query
  – Queries in a file in the same format as documents
    * Must choose which parts of the TREC queries to include and add DOC tags
  – Tokenization, stemming and stop word options are provided by Lemur
* Command line
    ParseToFile query_param queries.sgml
  – Parameter options
      outputFile = ;
      stemmer = porter;
      stopwords = stopwords.dat;
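
Choosing which parts of each TREC query to keep and wrapping them in DOC tags can be scripted. The sketch below is illustrative only: it assumes the topics have already been read into dictionaries with `number`, `title`, `description` and `narrative` keys (hypothetical names), and that the query file uses the same `<DOC>`/`<DOCNO>`/`<TEXT>` layout as a typical TREC document file.

```python
from typing import Dict, Iterable, Sequence


def topic_to_query_doc(topic: Dict[str, str], fields: Sequence[str] = ("title",)) -> str:
    """Wrap the chosen parts of one topic in document-style tags."""
    text = "\n".join(topic[f].strip() for f in fields if topic.get(f))
    return (
        "<DOC>\n"
        f"<DOCNO> {topic['number']} </DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def topics_to_query_file(topics: Iterable[Dict[str, str]],
                         fields: Sequence[str] = ("title",)) -> str:
    """Concatenate all query documents into the text of one query file."""
    return "".join(topic_to_query_doc(t, fields) for t in topics)
```

Passing `fields=("title", "description")` produces longer queries; comparing the field choices is itself a simple experiment.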

Retrieval Applications for queries represented as keywords
* General purpose retrieval for the three models: vector, probabilistic and language
  – Command:
      RetEval retrieval_param
  – Option to choose retrieval model
      retModel = 0; /* 0 = TFIDF; 1 = Okapi; 2 = KL-divergence */
  – Other parameters
      index = pindex.key; /* database index */
      textQuery = ; /* query text stream */
      resultFile = retrieval_file; /* result file */
      resultCount = 1000; /* how many docs to return as the result */
      resultFormat = 1; /* 0 = simple-format; 1 = TREC-format */
  – Options for pseudo-relevance feedback
    * feedbackDocCount: the number of docs to use for pseudo-feedback (0 means no feedback)
    * feedbackTermCount: the number of terms to add to a query when doing feedback.
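
To show what the two feedback parameters control, here is a deliberately simple Python sketch of pseudo-relevance feedback: it treats the top `feedback_doc_count` documents of an initial ranking as relevant and adds their `feedback_term_count` most frequent new terms to the query. This is a stand-in for illustration, not the term selection Lemur uses for any of its retrieval models, and the function name is made up.

```python
from collections import Counter
from typing import List, Sequence


def expand_query(query_terms: Sequence[str],
                 ranked_docs: Sequence[Sequence[str]],
                 feedback_doc_count: int,
                 feedback_term_count: int) -> List[str]:
    """Add the most frequent terms from the top-ranked docs to the query.

    `ranked_docs` holds the term lists of the initially retrieved documents,
    best first. A feedback_doc_count of 0 means no feedback.
    """
    if feedback_doc_count == 0:
        return list(query_terms)
    counts: Counter = Counter()
    for doc_terms in ranked_docs[:feedback_doc_count]:
        counts.update(t for t in doc_terms if t not in query_terms)
    # Highest count first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ordered[:feedback_term_count]]
```

The expanded query is then run again against the index, which is why feedback costs a second retrieval pass.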