Fetching PubMed Literature Data from NCBI with PHP

Updated: 2023-06-09 09:57:01


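Fetching PubMed literature data from NCBI

The PubMed database holds a large amount of literature metadata, but scraping it directly runs into many problems. NCBI's own tools, the Entrez E-utilities, make retrieval far more straightforward.

http://www.ncbi.nlm.nih.gov/books/NBK25499/

That page describes each utility and its parameters.

The EFetch reference is reproduced here: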

Base URL

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

Functions

Returns formatted data records for a list of input UIDs
Returns formatted data records for a set of UIDs stored on the Entrez History server

Required Parameters

db

Database from which to retrieve records. The value must be a valid Entrez database name (default = pubmed). Currently EFetch does not support all Entrez databases. Please see Chapter 2 for a list of available databases.

Required Parameter – Used only when input is from a UID list

id

UID list. Either a single UID or a comma-delimited list of UIDs may be provided. All of the UIDs must be from the database specified by db. There is no set maximum for the number of UIDs that can be passed to EFetch, but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method.

efetch.fcgi?db=protein&id=15718680,157427902,119703751
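The GET-versus-POST guidance above can be sketched in PHP roughly as follows. The helper name `build_efetch_request` and the shape of the returned array are illustrative assumptions, not part of E-utilities; the 200-UID threshold is taken from the text above. The function only describes the request and does not send it.

```php
<?php
// Hypothetical helper: describes the EFetch request for a list of UIDs.
// Returns an array with 'method', 'url' and 'body' keys; nothing is sent.
function build_efetch_request(string $db, array $uids, array $extra = []): array
{
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    // EFetch expects a comma-delimited UID list in the id parameter.
    // http_build_query() percent-encodes the commas as %2C.
    $params = array_merge(['db' => $db, 'id' => implode(',', $uids)], $extra);
    $query = http_build_query($params);

    if (count($uids) > 200) {
        // Large lists: carry the parameters in a POST body instead of the URL.
        return ['method' => 'POST', 'url' => $base, 'body' => $query];
    }
    return ['method' => 'GET', 'url' => $base . '?' . $query, 'body' => null];
}

$request = build_efetch_request('protein', [15718680, 157427902, 119703751]);
print $request['method'] . ' ' . $request['url'] . "\n";
```

For the POST case, the body could then be sent with PHP's HTTP stream context (`stream_context_create` with `method`, `header` and `content` options) or with cURL.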

Required Parameters – Used only when input is from the Entrez History server

query_key

Query key. This integer specifies which of the UID lists attached to the given Web Environment will be used as input to EFetch. Query keys are obtained from the output of previous ESearch, EPost or ELink calls. The query_key parameter must be used in conjunction with WebEnv.

WebEnv

Web Environment. This parameter specifies the Web Environment that contains the UID list to be provided as input to EFetch. Usually this WebEnv value is obtained from the output of a previous ESearch, EPost or ELink call. The WebEnv parameter must be used in conjunction with query_key.

efetch.fcgi?db=protein&query_key=&WebEnv=
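As a minimal sketch of how query_key and WebEnv travel from one call to the next, the snippet below reads both values from an ESearch response and builds the matching EFetch URL. The helper name `efetch_url_from_history` is hypothetical, and the sample XML is made up for illustration; a real WebEnv string is much longer.

```php
<?php
// Hypothetical helper: builds an EFetch URL from an ESearch response that was
// requested with usehistory=y. Returns null if either value is missing.
function efetch_url_from_history(SimpleXMLElement $esearch, string $db): ?string
{
    $queryKey = (string) $esearch->QueryKey;
    $webEnv = (string) $esearch->WebEnv;
    if ($queryKey === '' || $webEnv === '') {
        return null; // query_key and WebEnv must be used together
    }
    $params = ['db' => $db, 'query_key' => $queryKey, 'WebEnv' => $webEnv];
    return 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
        . http_build_query($params);
}

// Made-up sample response, for illustration only.
$sample = simplexml_load_string(
    '<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>'
    . '<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>'
);
print efetch_url_from_history($sample, 'pubmed') . "\n";
```

Returning null when either value is absent keeps a caller from issuing a request that EFetch would reject.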

Optional Parameters – Retrieval

retmode

Retrieval mode. This parameter specifies the data format of the records returned, such as plain text, HTML or XML. See Table 1 for a full list of allowed values for each database.

Table 1 – Valid values of &retmode and &rettype for EFetch (null = empty string)

rettype

Retrieval type. This parameter specifies the record view returned, such as Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein. Please see Table 1 for a full list of allowed values for each database.

retstart

Sequential index of the first record to be retrieved (default=0, corresponding to the first record of the entire set). This parameter can be used in conjunction with retmax to download an arbitrary subset of records from the input set.

retmax

Total number of records from the input set to be retrieved, up to a maximum of 10,000. Optionally, for a large set the value of retstart can be iterated while holding retmax constant, thereby downloading the entire set in batches of size retmax.
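The batching scheme described for retstart and retmax can be sketched as below. The helper name `efetch_batches` and the default batch size of 500 are assumptions made for illustration; only the 10,000 ceiling comes from the text above.

```php
<?php
// Hypothetical helper: lists the retstart/retmax pairs needed to cover
// $total records in batches of $batchSize, holding retmax constant.
function efetch_batches(int $total, int $batchSize = 500): array
{
    if ($batchSize < 1 || $batchSize > 10000) {
        throw new InvalidArgumentException('batch size must be between 1 and 10000');
    }
    $batches = [];
    for ($start = 0; $start < $total; $start += $batchSize) {
        $batches[] = ['retstart' => $start, 'retmax' => $batchSize];
    }
    return $batches;
}

// Each pair can be merged into the other EFetch parameters for one request.
foreach (efetch_batches(1200, 500) as $batch) {
    print http_build_query($batch) . "\n";
}
```

The last batch may ask for more records than remain; EFetch then simply returns what is left of the set.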

Optional Parameters – Sequence Databases

strand

Strand of DNA to retrieve. Available values are "1" for the plus strand and "2" for the minus strand.

seq_start

First sequence base to retrieve. The value should be the integer coordinate of the first desired base, with "1" representing the first base of the sequence.

seq_stop

Last sequence base to retrieve. The value should be the integer coordinate of the last desired base, with "1" representing the first base of the sequence.

complexity

Data content to return. Many sequence records are part of a larger data structure or "blob", and the complexity parameter determines how much of that blob to return. For example, an mRNA may be stored together with its protein product. The available values are as follows:

0 – entire blob
1 – bioseq
2 – minimal bioseq-set
3 – minimal nuc-prot
4 – minimal pub-set

Examples

PubMed

Fetch PMIDs 17284678 and 9997 as text abstracts:

Fetch PMIDs in XML:

PubMed Central

Fetch XML for PubMed Central ID 212403:

Nucleotide/Nuccore

Fetch the first 100 bases of the plus strand of GI 21614549 in FASTA format:

Fetch the first 100 bases of the minus strand of GI 21614549 in FASTA format:

Fetch the nuc-prot object for GI 21614549:

Fetch the full ASN.1 record for GI 5:

Fetch FASTA for GI 5:

Fetch the GenBank flat file for GI 5:

Fetch GBSeqXML for GI 5:

Fetch TinySeqXML for GI 5:

Popset

Fetch the GenPept flat file for Popset ID 12829836:

Protein

Fetch the GenPept flat file for GI 8:

Fetch GBSeqXML for GI 8:

Sequences

Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)

Gene

Fetch full XML record for Gene ID 2:
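As a sketch of how a few of the requests listed above could be assembled from the parameters described earlier, the snippet below builds three EFetch URLs. The helper name `efetch_url` is hypothetical, and the rettype/retmode pairs (`abstract`/`text` for PubMed, `fasta`/`text` for nuccore) are taken from NCBI's table of valid EFetch values rather than from this text, so treat them as assumptions to check against that table.

```php
<?php
// Hypothetical helper: builds an EFetch URL from a parameter array.
function efetch_url(array $params): string
{
    return 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
        . http_build_query($params);
}

$examples = [
    'PMIDs 17284678 and 9997 as text abstracts' => [
        'db' => 'pubmed', 'id' => '17284678,9997',
        'retmode' => 'text', 'rettype' => 'abstract',
    ],
    'First 100 bases of the plus strand of GI 21614549 in FASTA' => [
        'db' => 'nuccore', 'id' => '21614549', 'strand' => 1,
        'seq_start' => 1, 'seq_stop' => 100,
        'rettype' => 'fasta', 'retmode' => 'text',
    ],
    'First 100 bases of the minus strand of GI 21614549 in FASTA' => [
        'db' => 'nuccore', 'id' => '21614549', 'strand' => 2,
        'seq_start' => 1, 'seq_stop' => 100,
        'rettype' => 'fasta', 'retmode' => 'text',
    ],
];

foreach ($examples as $label => $params) {
    print "$label:\n" . efetch_url($params) . "\n\n";
}
```

Only the strand value differs between the two nuccore requests, which matches the parameter description: "1" selects the plus strand and "2" the minus strand.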

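#!/usr/bin/php
<?php
// Search PubMed with ESearch, then download all matching records as XML with EFetch.
// Returns the number of records found, or 0 on failure.
// pubmed_errors(), wget_errors() and xml_is_read() are helpers that this excerpt does not include.
function pubmed_fetch($query){
    $ret = 1;       // wget exit status; 0 means success
    $flag = 0;      // result of wget_errors(); 1 means the log is clean
    $flag_xml = 0;  // result of xml_is_read(); 1 means the XML file is readable

    print "Searching for: $query\n";

    // Step 1: ESearch. usehistory=y keeps the result set on the Entrez History
    // server, so only one UID needs to come back here (retmax=1).
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',
        'retmax' => 1,
        'usehistory' => 'y',
        'term' => $query,
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' . http_build_query($params);
    print "URL: $url\n";

    $xml = simplexml_load_file($url);
    if ($xml === false){
        print "ESearch request failed!\n";
        return 0;
    }
    pubmed_errors($xml);
    print $xml->asXML() . "\n";

    if (!$count = (int) $xml->Count){
        print "No items found!\n";
        return 0;
    }
    print "$count items found\n";

    $translated = (string) $xml->QueryTranslation;
    printf("Translated query: %s\n\n", $translated);

    // Step 2: EFetch from the History server using query_key and WebEnv.
    // EFetch returns at most 10,000 records per request; larger sets need retstart batching.
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',
        'query_key' => (string) $xml->QueryKey,
        'WebEnv' => (string) $xml->WebEnv,
        'retmax' => $count,
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' . http_build_query($params);
    print "URL: $url\n";

    // Output file name: <year>-<month>-<day>_<translated query>.xml
    $file = sprintf('%s.xml', preg_replace('/\W/', '_', $translated));
    $file = date('Y-n-j') . "_$file";
    $wgetfilelog = "wget.{$file}.log";

    // Send both stdout and stderr of wget to the log file (POSIX sh syntax).
    system(sprintf('wget %s -O %s > %s 2>&1', escapeshellarg($url),
        escapeshellarg($file), escapeshellarg($wgetfilelog)), $ret);

    $flag = wget_errors($wgetfilelog);
    $flag_xml = xml_is_read($file);

    if ($ret == 0 && $flag == 1 && $flag_xml == 1){
        return $count;
    }
    return 0;
}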
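The script calls three helpers, `pubmed_errors`, `wget_errors` and `xml_is_read`, whose definitions are not shown. The versions below are hypothetical stand-ins written only from how the script uses their return values (1 for success, 0 for failure). The element names checked in `pubmed_errors` and the log keywords checked in `wget_errors` are assumptions about typical ESearch and wget output, not the author's code.

```php
<?php
// Hypothetical sketch: prints any error or warning elements found in an
// ESearch response and returns how many were printed.
function pubmed_errors(SimpleXMLElement $xml): int
{
    $found = 0;
    foreach (['ERROR', 'ErrorList', 'WarningList'] as $name) {
        foreach ($xml->$name as $node) {
            // Lists hold one child per problem; a bare ERROR element holds text.
            $items = count($node->children()) ? $node->children() : [$node];
            foreach ($items as $item) {
                printf("PubMed %s: %s\n", $item->getName(), trim((string) $item));
                $found++;
            }
        }
    }
    return $found;
}

// Hypothetical sketch: returns 1 when the wget log exists and mentions
// neither "ERROR" nor "failed", otherwise 0.
function wget_errors(string $logFile): int
{
    $log = @file_get_contents($logFile);
    if ($log === false) {
        return 0;
    }
    return preg_match('/ERROR|failed/i', $log) ? 0 : 1;
}

// Hypothetical sketch: returns 1 when the file parses as XML, otherwise 0.
function xml_is_read(string $file): int
{
    if (!is_readable($file)) {
        return 0;
    }
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $xml = simplexml_load_file($file);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? 0 : 1;
}
```

With definitions along these lines in the same file, a call such as `pubmed_fetch('cancer AND 2012[dp]')` would return the record count only when wget exits cleanly, its log shows no error, and the saved file parses as XML.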

Source: https://www.bwwdw.com/article/ncv1.html
