Fetching PubMed literature data from NCBI with PHP
Updated: 2023-06-09 09:57:01
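Fetching PubMed literature data from NCBI
The PubMed database holds a large amount of literature-related information. Scraping that data yourself runs into many problems and difficulties, but with PubMed's own tools you can retrieve whatever you need.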
http://www.ncbi.nlm.nih.gov/books/NBK25499/
The page above describes the available tools and their parameters.
Here is its description of EFetch:
Base URL
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
Functions
Returns formatted data records for a list of input UIDs
Returns formatted data records for a set of UIDs stored on the Entrez History server
Required Parameters
db
Database from which to retrieve records. The value must be a valid Entrez database name (default = pubmed). Currently EFetch does not support all Entrez databases. Please see Chapter 2 for a list of available databases.
Required Parameter – Used only when input is from a UID list
id
UID list. Either a single UID or a comma-delimited list of UIDs may be provided. All of the UIDs must be from the database specified by db. There is no set maximum for the number of UIDs that can be passed to EFetch, but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method.
efetch.fcgi?db=protein&id=15718680,157427902,119703751
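The rule above (a comma-delimited id list, with HTTP POST recommended beyond about 200 UIDs) can be sketched as follows. This is only an illustration: efetch_request() and its $postThreshold argument are hypothetical names, not part of E-utilities, and the function only describes the request without sending it. PHP 7 or later is assumed.

```php
<?php
// Sketch only: efetch_request() is a hypothetical helper, not an E-utilities API.
const EFETCH_URL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

/**
 * Describe an EFetch request for a list of UIDs without sending it.
 * Returns ['method' => 'GET' or 'POST', 'url' => string, 'body' => string].
 */
function efetch_request(string $db, array $ids, int $postThreshold = 200): array
{
    // EFetch expects one comma-delimited "id" value.
    $query = 'db=' . rawurlencode($db)
           . '&id=' . implode(',', array_map('intval', $ids));

    if (count($ids) > $postThreshold) {
        // Long UID lists travel in the POST body instead of the URL.
        return ['method' => 'POST', 'url' => EFETCH_URL, 'body' => $query];
    }
    return ['method' => 'GET', 'url' => EFETCH_URL . '?' . $query, 'body' => ''];
}

// The three protein UIDs from the example above fit comfortably in a GET URL.
$req = efetch_request('protein', [15718680, 157427902, 119703751]);
print $req['method'] . ' ' . $req['url'] . "\n";
```

The returned array could then be handed to whichever HTTP client the script uses (wget, cURL, or a stream context).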
Required Parameters – Used only when input is from the Entrez History server
query_key
Query key. This integer specifies which of the UID lists attached to the given Web Environment will be used as input to EFetch. Query keys are obtained from the output of previous ESearch, EPost or ELink calls. The query_key parameter must be used in conjunction with WebEnv.
WebEnv
Web Environment. This parameter specifies the Web Environment that contains the UID list to be provided as input to EFetch. Usually this WebEnv value is obtained from the output of a previous ESearch, EPost or ELink call. The WebEnv parameter must be used in conjunction with query_key.
efetch.fcgi?db=protein&query_key=&WebEnv=
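To make the query_key / WebEnv pairing concrete, here is a minimal sketch that reads both values from an ESearch result and turns them into an EFetch query string. The XML below is a made-up sample (a real WebEnv string is much longer), and history_query() is a hypothetical helper name.

```php
<?php
// Made-up ESearch response for illustration only; real WebEnv values are long opaque strings.
$sample = <<<XML
<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>
XML;

// Hypothetical helper: build the EFetch query string for a History server result.
function history_query(SimpleXMLElement $esearch, string $db): string
{
    return http_build_query([
        'db'        => $db,
        'query_key' => (string) $esearch->QueryKey, // which stored UID list to use
        'WebEnv'    => (string) $esearch->WebEnv,   // which Web Environment holds it
    ]);
}

$xml = simplexml_load_string($sample);
$query = history_query($xml, 'protein');
print "efetch.fcgi?$query\n";
```

Both values always come from the same earlier response, which is why they are passed together.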
Optional Parameters – Retrieval
retmode
Retrieval mode. This parameter specifies the data format of the records returned, such as plain text, HTML or XML. See Table 1 for a full list of allowed values for each database.
Table 1 – Valid values of &retmode and &rettype for EFetch (null = empty string)
rettype
Retrieval type. This parameter specifies the record view returned, such as Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein. Please see Table 1 for a full list of allowed values for each database.
retstart
Sequential index of the first record to be retrieved (default=0, corresponding to the first record of the entire set). This parameter can be used in conjunction with retmax to download an arbitrary subset of records from the input set.
retmax
Total number of records from the input set to be retrieved, up to a maximum of 10,000. Optionally, for a large set the value of retstart can be iterated while holding retmax constant, thereby downloading the entire set in batches of size retmax.
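The batching idea described for retstart and retmax can be sketched like this. batch_offsets() is a hypothetical helper; it only computes the retstart values and does not contact NCBI.

```php
<?php
// Hypothetical helper: list the retstart values needed to cover $total records
// when each request returns at most $retmax records.
function batch_offsets(int $total, int $retmax): array
{
    if ($retmax < 1) {
        throw new InvalidArgumentException('retmax must be at least 1');
    }
    $offsets = [];
    for ($retstart = 0; $retstart < $total; $retstart += $retmax) {
        $offsets[] = $retstart;
    }
    return $offsets;
}

// 25,000 records in batches of 10,000 need three requests.
foreach (batch_offsets(25000, 10000) as $retstart) {
    print http_build_query(['retstart' => $retstart, 'retmax' => 10000]) . "\n";
}
```

Each printed fragment would be appended to the other EFetch parameters (db, query_key, WebEnv and so on) for one request.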
Optional Parameters – Sequence Databases
strand
Strand of DNA to retrieve. Available values are "1" for the plus strand and "2" for the minus strand.
seq_start
First sequence base to retrieve. The value should be the integer coordinate of the first desired base, with "1" representing the first base of the sequence.
seq_stop
Last sequence base to retrieve. The value should be the integer coordinate of the last desired base, with "1" representing the first base of the sequence.
complexity
Data content to return. Many sequence records are part of a larger data structure or "blob", and the complexity parameter determines how much of that blob to return. For example, an mRNA may be stored together with its protein product. The available values are as follows:
Examples
PubMed
Fetch PMIDs 17284678 and 9997 as text abstracts:
Fetch PMIDs in XML:
PubMed Central
Fetch XML for PubMed Central ID 212403:
Nucleotide/Nuccore
Fetch the first 100 bases of the plus strand of GI 21614549 in FASTA format:
Fetch the first 100 bases of the minus strand of GI 21614549 in FASTA format:
Fetch the nuc-prot object for GI 21614549:
Fetch the full ASN.1 record for GI 5:
Fetch FASTA for GI 5:
Fetch the GenBank flat file for GI 5:
Fetch GBSeqXML for GI 5:
Fetch TinySeqXML for GI 5:
Popset
Fetch the GenPept flat file for Popset ID 12829836:
Protein
Fetch the GenPept flat file for GI 8:
Fetch GBSeqXML for GI 8:
Sequences
Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)
Gene
Fetch full XML record for Gene ID 2:
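The URLs that belong to the example captions above are not included in this copy. As a rough sketch, a few of them could be assembled from the parameters described earlier. The helper efetch_url() is hypothetical, and the rettype/retmode values chosen here are assumptions based on the parameter descriptions; check them against Table 1 before relying on them.

```php
<?php
// Hypothetical helper: join EFetch parameters onto the base URL.
function efetch_url(array $params): string
{
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    // Note: http_build_query() percent-encodes the commas of an id list as %2C.
    return $base . '?' . http_build_query($params);
}

// Parameter sets are assumptions inferred from the captions, not copied from NCBI.
$examples = [
    'PMIDs 17284678 and 9997 as text abstracts' => [
        'db' => 'pubmed', 'id' => '17284678,9997',
        'retmode' => 'text', 'rettype' => 'abstract',
    ],
    'First 100 bases of the plus strand of GI 21614549 in FASTA format' => [
        'db' => 'nuccore', 'id' => '21614549',
        'strand' => 1, 'seq_start' => 1, 'seq_stop' => 100,
        'rettype' => 'fasta', 'retmode' => 'text',
    ],
    'Full XML record for Gene ID 2' => [
        'db' => 'gene', 'id' => '2', 'retmode' => 'xml',
    ],
];

foreach ($examples as $label => $params) {
    print "$label:\n  " . efetch_url($params) . "\n";
}
```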
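#!/usr/bin/php
<?php
// Search PubMed with ESearch, then download all matching records as XML with EFetch.
// Returns the number of records found, or 0 on any failure.
// pubmed_errors(), wget_errors() and xml_is_read() are helper functions that
// are not shown in this article; they must be defined elsewhere in the script.
function pubmed_fetch($query){
    $ret = 1;       // wget exit status (0 means success)
    $flag = 0;      // result of wget_errors()
    $flag_xml = 0;  // result of xml_is_read()
    print "Searching for: $query\n";

    // Step 1: ESearch with usehistory=y stores the hits on the Entrez History server.
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',   // the response is parsed as XML below
        'retmax' => 1,        // only Count, QueryKey and WebEnv are needed
        'usehistory' => 'y',
        'term' => $query,
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' . http_build_query($params);
    print "URL: $url\n";
    $xml = simplexml_load_file($url);
    if ($xml === false){
        print "Could not load or parse the ESearch response\n";
        return 0;
    }
    pubmed_errors($xml);
    $count = (int) $xml->Count;
    if ($count === 0){
        print "No items found!\n";
        return 0;
    }
    print "$count items found\n";
    $translated = (string) $xml->QueryTranslation;
    printf("Translated query: %s\n\n", $translated);

    // Step 2: EFetch reads the stored UID list through query_key and WebEnv.
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',
        'query_key' => (string) $xml->QueryKey,
        'WebEnv' => (string) $xml->WebEnv,
        'retmax' => $count,   // note: retmax is documented as capped at 10,000
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' . http_build_query($params);
    print "URL: $url\n";

    // Output file name: <year>-<month>-<day>_<translated query>.xml
    $file = sprintf('%s.xml', preg_replace('/\W/', '_', $translated));
    $file = date('Y-n-j') . "_$file";
    $wgetfilelog = "wget.{$file}.log";

    // Download with wget, sending its stdout and stderr to the log file.
    system(sprintf('wget %s -O %s > %s 2>&1', escapeshellarg($url),
        escapeshellarg($file), escapeshellarg($wgetfilelog)), $ret);
    $flag = wget_errors($wgetfilelog);
    $flag_xml = xml_is_read($file);

    // Succeed only if wget exited cleanly and both checks passed.
    if ($ret == 0 && $flag == 1 && $flag_xml == 1){
        return $count;
    }
    return 0;
}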