Simple Text Processing
更新时间:2023-06-09 05:26:01 阅读量: 实用文档 文档下载
- simple推荐度:
- 相关推荐
data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response
1
Document Management, Department of Linguistics, Uppsala University, VT992
Information Retrieval
Document Life Cycle
Document Management, Department of Linguistics, Uppsala University, VT993
Document Management, Department of Linguistics, Uppsala University, VT994
Data Retrieval versus Information Retrieval
Precision and Recall
matching inference model query language query speci cation items wanted error response
data retrieval exact deduction deterministic arti cial complete matching sensitive
information retrieval partial, best match induction probabilistic natural incomplete relevant insensitive
precision= j Dj\ D j Djr
\ recall= j D D Dj j jr r
deduction: p(a); 8x: p(x) ! q(x) ) q(a) induction: p(a); q(a) ) 8x: p(x) ! q(x)
D: documents returned by the system D: relevant documentsr
data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response
Document Management, Department of Linguistics, Uppsala University, VT995
Document Management, Department of Linguistics, Uppsala University, VT996
Simple Text Processing
Searching StringsSearch Pattern
searching strings string replacement sorting tokenization text segmentation (sentences, phrases, MWU's) frequency counts
simple strings expressionsString Searching Problems
string matching string distance longest common subsequence approximate string matching longest repeated substring
Document Management, Department of Linguistics, Uppsala University, VT997
Document Management, Department of Linguistics, Uppsala University, VT998
String Matching
String Matching - Algorithms (1)
Given pattern string x, with j x j= m, and text string y, with j y j= n, where m; n> 0 and m n, if x occurs as a substring of y then determine the position within y of the rst occurrence of x, i.e. return the least value of i such that y(i; i+ m? 1)= x(1; m).
at successive positions. worst case: O(mn) average case: O(m+ n) Knuth-Morris-Pratt: calculate a next-table for the pattern x which will be used for shifting the pattern when mismatching worst case: O(m+ n) Boyer-Moore: scan the string string from left to right, but match symbols from right to left (disregard portions which cannot possibly match the pattern) worst case: O(m+ n) average case: O(n=m) (large alphabet, small pattern)
Brute Force: compare pattern x with substrings of y
data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response
Document Management, Department of Linguistics, Uppsala University, VT999
Document Management, Department of Linguistics, Uppsala University, VT9910
String Matching - Algorithms (2)
String Distance
Sunday,Hume: mismatch optimizationQuick Search Maximal Shift Optimal Mismatch Tuned Boyer-Moore Least Cost Harrison: test for the occurrence of the pattern but don't nd its location, needs preprocessed static texts, uses hashing functions Karp-Rabin: comparison of signatures in a 'Brute-Forcemanner', uses hashing functions
Given strings x and y, with j x j= m, j y j= n, where m; n> 0, and a string distance metric, d, nd d(x; y). (Determine a minimal sequence of editing operations necessary to transform x to y
.
Document Management, Department of Linguistics, Uppsala University, VT9911
Document Management, Department of Linguistics, Uppsala University, VT9912
Longest Common Subsequence
Approximate String Matching
A subsequence of a string is obtained by eliminating zero or more symbols, not necessarily contiguous, from the string. A common subsequence of two strings is a string which occurs as a subsequence of both strings. A longest common subsequence of two strings is a common subsequence of the strings having maximal length.
Given a pattern string x, with j x j= m, text string y, with j y j= n, where m; n> 0 and m n, an integer k 0 and a distance function d, nd all the substrings, s, of y such that d(x; s) k.
data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response
Document Management, Department of Linguistics, Uppsala University, VT9913
Document Management, Department of Linguistics, Uppsala University, VT9914
Dynamic Programming - String Distancef (a; b)= 0 if a matches b, otherwise f (a; b)= 1 d1 1= f (x1; y1) 8i> 1: d 1= d?1 1+ f (x; y1 ) 8j> 1: d1= d1?1+ f (x1; y ) 8i; j> 1: d= min(d?1+1; d?1+1; d?1?1+ f (x; y )); i; i; i;j;j j i;j i;j i;j i;j i j;
Dynamic Programming - LCSf (a; b)= 1 if a matches b, otherwise f (a; b)= 0 d1 1= f (x1; y1) 8i> 1: d 1= d?1 1+ f (x; y1 ) 8j> 1: d1= d1?1+ f (x1; y ) 8i; j> 1: d= max(d?1; d?1; d?1?1+f (x; y ))i; i; i;j;j j i;j i;j i;j i;j i j
b a l a n s a r m e n s
b 0 1 2 3 4 5 6 7 8 9 10 11
a 1 0 1 2 3 4 5 6 7 8 9 10
l 2 1 0 1 2 3 4 5 6 7 8 9
a 3 2 1 0 1 2 3 4 5 6 7 8
n 4 3 2 1 0 1 2 3 4 5 6 7
c 5 4 3 2 1 1 2 3 4 5 6 7
e 6 5 4 3 2 2 2 3 4 4 5 6
7 6 5 4 3 3 3 3 4 5 5 6
a 8 7 6 5 4 4 3 4 4 5 6 6
r 9 8 7 6 5 5 4 3 4 5 6 7
m 10 9 8 7 6 6 5 4 3 4 5 6
b a l a n s a r m e n s
b 1 1 1 1 1 1 1 1 1 1 1 1
a 1 2 2 2 2 2 2 2 2 2 2 2
l 1 2 3 3 3 3 3 3 3 3 3 3
a 1 2 3 4 4 4 4 4 4 4 4 4
n 1 2 3 4 5 5 5 5 5 5 5 5
c 1 2 3 4 5 5 5 5 5 5 5 5
e 1 2 3 4 5 5 5 5 5 6 6 6
1 2 3 4 5 5 5 5 5 6 6 6
a 1 2 3 4 5 5 6 6 6 6 6 6
r 1 2 3 4 5 5 6 7 7 7 7 7
m 1 2 3 4 5 5 6 7 8 8 8 8
Document Management, Department of Linguistics, Uppsala University, VT9915
Document Management, Department of Linguistics, Uppsala University, VT9916
Heaviest Common Subsequence
Approximate String Matching
Landau-Vishkin k-mismatch: based on Knuth-Morris-
Using weight functions: 1. f (xi; yj )= w(i; j; xi) if xi matches yj, otherwise f (xi; yj )= 0 2. f (xi; yj )= w(i; j; xi; yj )
Landau-Vishkin k-distance: based on dynamic programming
Pratt string matching
data retrieval information retrieval matching exact partial, best match inference deduction induction model deterministic probabilistic query language artificial natural query specification complete incomplete items wanted matching relevant error response
Document Management, Department of Linguistics, Uppsala University, VT9917
Document Management, Department of Linguistics, Uppsala University, VT9918
SortingThree of many sorting algorithms: selection sort: Look for the biggest element and remove it. Place removed elements in an ordered list until no element is in the original list. bubble sort: Swap neighbors if they are in wrong order until the list is sorted. quick sort: Divide the list in two parts. Ru
n through the two parts and put all elements which are bigger than the chosen value in one part and all elements which are smaller in the other part. Apply the same algorithm to each of the two parts until the list is sorted (best average complexity of n2log(n)).
ReferencesKD94] Bharat Kinariwala and Tep Dobry. Chapter 10, Sorting and Searching. In Programming in C, 1994. Available on http://spectra.eng.hawaii.edu/Courses/EE150/Book/chap10/chap10.html Rij79] C. J. van RIJSBERGEN. Information Retrieval, Second Edition, Department of Computer Science, University of Glasgow, 1979. Available on http://www.dcs.gla.ac.uk/Keith/Preface.html St92] Graham A. Stephen String Search, Technical Report, School of Electronic Engineering Science, University College of North Wales, Gwynedd 1992. Available on http://ei.cs.vt.edu/ cs5604/f95/cs5604cnSS/TR92.html
正在阅读:
冬天的温柔文案_关于冬天的文案08-04
动物学论文12-10
催化重整预加氢部分控制系统设计06-01
人教版美术六上《风景写生》教案05-24
廉洁辩论赛10-04
物料供应商管理规程02-03
某热电厂扩建工程(电气部分)设计01-29
2020年小学体卫艺工作计划文档2篇05-06
- 1HYSYS中文入门案例-Gas Processing
- 2Tuning the solubility of polymerized ionic liquids by simple
- 3A Simple Model of monetary policy and currency crisis
- 4Listening Strategies for IELTS text
- 5A summary of the text sandwich generation
- 6A summary of the text sandwich generation
- 7Listening Strategies for IELTS text
- 8Simple软件 - 产品知识库
- 9Text II The Art of Acknowledgement Jean Houston
- 10S1 - JavaPen Text - Finishing
- 教学能力大赛决赛获奖-教学实施报告-(完整图文版)
- 互联网+数据中心行业分析报告
- 2017上海杨浦区高三一模数学试题及答案
- 招商部差旅接待管理制度(4-25)
- 学生游玩安全注意事项
- 学生信息管理系统(文档模板供参考)
- 叉车门架有限元分析及系统设计
- 2014帮助残疾人志愿者服务情况记录
- 叶绿体中色素的提取和分离实验
- 中国食物成分表2020年最新权威完整改进版
- 推动国土资源领域生态文明建设
- 给水管道冲洗和消毒记录
- 计算机软件专业自我评价
- 高中数学必修1-5知识点归纳
- 2018-2022年中国第五代移动通信技术(5G)产业深度分析及发展前景研究报告发展趋势(目录)
- 生产车间巡查制度
- 2018版中国光热发电行业深度研究报告目录
- (通用)2019年中考数学总复习 第一章 第四节 数的开方与二次根式课件
- 2017_2018学年高中语文第二单元第4课说数课件粤教版
- 上市新药Lumateperone(卢美哌隆)合成检索总结报告
- Processing
- Simple
- Text
- 娃哈哈大学生市场调查问卷
- 诗歌鉴赏事物形象
- 政治生活易错易混点
- 实用文档之外墙脚手架专项施工方案
- 第一章、MATLAB及其应用概述
- 2015年司考试卷二
- 电控防抱死制动系统ABS2
- 大型活动应急预案
- 小学六年级上册《品德与社会》第九课《奥林匹克的故乡》
- 4013《冒险契约》菜鸟到大师级的兑变
- 电大高级财务会计形成性考核册答案(作业1-4全套)
- 双抗体夹心ELISA法测定残余汉逊酵母菌体蛋白含量的研究
- 管家部PNP目录(下属店)
- 一年级语文上册期末测试题(无答案)语文S版
- 安全例会会议纪要
- Postmodernism and photoshop 后现代主义与photoshop
- 罗尔、拉格朗日、柯西中值定理、洛必达法则与导数的应用
- 肇州县公路客运站发车时刻表
- 浅谈《舞蹈教育》对青少年美育的作用
- 赖诺普利联合牛黄降压胶囊