Fetching PubMed Literature Data from NCBI with PHP

Updated: 2023-06-09 09:57:01


Fetching PubMed literature data from NCBI

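The PubMed database holds a large amount of literature metadata. Scraping its web pages directly runs into many problems, but NCBI provides its own tools (the E-utilities) for retrieving this data programmatically.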

http://www.ncbi.nlm.nih.gov/books/NBK25499/

That page describes each tool and its parameters.

The description of EFetch is reproduced below:

Base URL

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

Functions

Returns formatted data records for a list of input UIDs

Returns formatted data records for a set of UIDs stored on the Entrez History server

Required Parameters

db

Database from which to retrieve records. The value must be a valid Entrez database name (default = pubmed). Currently EFetch does not support all Entrez databases. Please see Chapter 2 for a list of available databases.

Required Parameter – Used only when input is from a UID list

id

UID list. Either a single UID or a comma-delimited list of UIDs may be provided. All of the UIDs must be from the database specified by db. There is no set maximum for the number of UIDs that can be passed to EFetch, but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method.

efetch.fcgi?db=protein&id=15718680,157427902,119703751
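The GET-versus-POST guidance for the id parameter can be sketched in PHP as follows. The helper name efetch_request and the shape of the array it returns are assumptions made for this illustration, not part of the E-utilities. The sketch only describes the request and does not contact NCBI.

```php
<?php
// Hypothetical helper: describe an EFetch request for a list of UIDs.
// The base URL is the one quoted in this article; the threshold of 200
// follows the guidance in the id parameter description.
function efetch_request($db, array $uids) {
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    $query = http_build_query(array('db' => $db, 'id' => implode(',', $uids)));
    if (count($uids) > 200) {
        // Long UID lists go in the body of an HTTP POST request.
        return array('method' => 'POST', 'url' => $base, 'body' => $query);
    }
    return array('method' => 'GET', 'url' => $base . '?' . $query, 'body' => null);
}

$req = efetch_request('protein', array(15718680, 157427902, 119703751));
print $req['method'] . ' ' . $req['url'] . "\n";
```

Note that http_build_query percent-encodes the commas in the UID list as %2C. The server decodes them back to commas.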

Required Parameters – Used only when input is from the Entrez History server

query_key

Query key. This integer specifies which of the UID lists attached to the given Web Environment will be used as input to EFetch. Query keys are obtained from the output of previous ESearch, EPost or ELink calls. The query_key parameter must be used in conjunction with WebEnv.

WebEnv

Web Environment. This parameter specifies the Web Environment that contains the UID list to be provided as input to EFetch. Usually this WebEnv value is obtained from the output of a previous ESearch, EPost or ELink call. The WebEnv parameter must be used in conjunction with query_key.

efetch.fcgi?db=protein&query_key=&WebEnv=
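As a rough sketch of how query_key and WebEnv travel from one call to the next, the PHP below reads both values out of an ESearch XML response and turns them into EFetch parameters. The helper name history_params is hypothetical, and the sample response, including its WebEnv value, is made up for illustration. A real WebEnv string is much longer.

```php
<?php
// Made-up ESearch response for illustration only.
$sample = '<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>'
        . '<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>';

// Hypothetical helper: turn an ESearch XML response into EFetch parameters.
function history_params($esearch_xml, $db) {
    $xml = simplexml_load_string($esearch_xml);
    if ($xml === false || !isset($xml->QueryKey, $xml->WebEnv)) {
        return null; // the response carried no History server data
    }
    return array(
        'db' => $db,
        'query_key' => (string) $xml->QueryKey,
        'WebEnv' => (string) $xml->WebEnv,
    );
}

print 'efetch.fcgi?' . http_build_query(history_params($sample, 'protein')) . "\n";
```

Both values are only present in the ESearch output when the search was run with usehistory=y.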

Optional Parameters – Retrieval

retmode

Retrieval mode. This parameter specifies the data format of the records returned, such as plain text, HTML or XML. See Table 1 for a full list of allowed values for each database.

Table 1 – Valid values of &retmode and &rettype for EFetch (null = empty string)

rettype

Retrieval type. This parameter specifies the record view returned, such as Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein. Please see Table 1 for a full list of allowed values for each database.

retstart

Sequential index of the first record to be retrieved (default=0, corresponding to the first record of the entire set). This parameter can be used in conjunction with retmax to download an arbitrary subset of records from the input set.

retmax

Total number of records from the input set to be retrieved, up to a maximum of 10,000. Optionally, for a large set the value of retstart can be iterated while holding retmax constant, thereby downloading the entire set in batches of size retmax.
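The batching scheme described for retstart and retmax might look like this in PHP. The helper name efetch_batches and the default batch size of 500 are assumptions chosen for the example. Only the 10,000 upper limit comes from the description above.

```php
<?php
// Hypothetical helper: list the (retstart, retmax) pairs needed to download
// $total records in batches of $batch records each.
function efetch_batches($total, $batch = 500) {
    // Keep the batch size between 1 and the 10,000 limit on retmax.
    $batch = max(1, min($batch, 10000));
    $pages = array();
    for ($start = 0; $start < $total; $start += $batch) {
        $pages[] = array('retstart' => $start, 'retmax' => $batch);
    }
    return $pages;
}

// Print the paging parameters for a result set of 1200 records.
foreach (efetch_batches(1200, 500) as $page) {
    print http_build_query($page) . "\n";
}
```

Each pair would be merged into the other EFetch parameters (db, query_key, WebEnv and so on) before the request is sent.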

Optional Parameters – Sequence Databases

strand

Strand of DNA to retrieve. Available values are "1" for the plus strand and "2" for the minus strand.

seq_start

First sequence base to retrieve. The value should be the integer coordinate of the first desired base, with "1" representing the first base of the sequence.

seq_stop

Last sequence base to retrieve. The value should be the integer coordinate of the last desired base, with "1" representing the first base of the sequence.
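To show how strand, seq_start and seq_stop combine, here is a minimal PHP sketch that builds a URL for part of a nucleotide record in FASTA format. The helper name efetch_range_url is hypothetical, and the choice of db=nuccore with rettype=fasta and retmode=text is an assumption made for this example.

```php
<?php
// Hypothetical helper: build an EFetch URL for a sub-range of a nucleotide record.
function efetch_range_url($id, $strand, $start, $stop) {
    $params = array(
        'db' => 'nuccore',
        'id' => $id,
        'strand' => $strand,    // 1 = plus strand, 2 = minus strand
        'seq_start' => $start,  // 1-based coordinate of the first base
        'seq_stop' => $stop,    // 1-based coordinate of the last base
        'rettype' => 'fasta',
        'retmode' => 'text',
    );
    return 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
        . http_build_query($params);
}

// First 100 bases of the plus strand of GI 21614549.
print efetch_range_url(21614549, 1, 1, 100) . "\n";
```

Passing 2 as the second argument requests the same range on the minus strand.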

complexity

Data content to return. Many sequence records are part of a larger data structure or "blob", and the complexity parameter determines how much of that blob to return. For example, an mRNA may be stored together with its protein product. The available values are as follows:

Examples

PubMed

Fetch PMIDs 17284678 and 9997 as text abstracts:

Fetch PMIDs in XML:
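Requests like the two PubMed examples above could be assembled in PHP along these lines. The helper name pubmed_efetch_url is hypothetical, and the URLs it prints are a sketch built from the parameters described earlier, not a copy of the reference's own examples.

```php
<?php
// Hypothetical helper: build an EFetch URL for a list of PubMed IDs.
function pubmed_efetch_url(array $pmids, $retmode, $rettype = null) {
    $params = array(
        'db' => 'pubmed',
        'id' => implode(',', $pmids),
        'retmode' => $retmode,
    );
    if ($rettype !== null) {
        $params['rettype'] = $rettype; // only sent when a record view is requested
    }
    return 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
        . http_build_query($params);
}

// Text abstracts for PMIDs 17284678 and 9997, then the same records as XML.
print pubmed_efetch_url(array(17284678, 9997), 'text', 'abstract') . "\n";
print pubmed_efetch_url(array(17284678, 9997), 'xml') . "\n";
```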

PubMed Central

Fetch XML for PubMed Central ID 212403:

Nucleotide/Nuccore

Fetch the first 100 bases of the plus strand of GI 21614549 in FASTA format:

Fetch the first 100 bases of the minus strand of GI 21614549 in FASTA format:

Fetch the nuc-prot object for GI 21614549:

Fetch the full ASN.1 record for GI 5:

Fetch FASTA for GI 5:

Fetch the GenBank flat file for GI 5:

Fetch GBSeqXML for GI 5:

Fetch TinySeqXML for GI 5:

Popset

Fetch the GenPept flat file for Popset ID 12829836:

Protein

Fetch the GenPept flat file for GI 8:

Fetch GBSeqXML for GI 8:

Sequences

Fetch FASTA for a transcript and its protein product (GIs 312836839 and 34577063)

Gene

Fetch full XML record for Gene ID 2:

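#!/usr/bin/php
<?php

// Search PubMed for $query with ESearch, then download every matching record
// as XML with EFetch through the Entrez History server.
// Returns the number of records found, or 0 on failure.
// pubmed_errors(), wget_errors() and xml_is_read() are helper functions that
// are not included in this excerpt.
function pubmed_fetch($query){
    $ret = 1;       // wget exit status; 0 means success
    $flag = 0;      // result of wget_errors(); 1 is treated as success below
    $flag_xml = 0;  // result of xml_is_read(); 1 is treated as success below

    print "Searching for: $query\n";

    // Current date, used as a prefix for the output file name.
    $pubmedtime = getdate();
    $pubmedyear = $pubmedtime['year'];
    $pubmedmonth = $pubmedtime['mon'];
    $pubmedday = $pubmedtime['mday'];

    // Step 1: ESearch. usehistory=y stores the result set on the History
    // server so that EFetch can refer to it by query_key and WebEnv.
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',   // ESearch returns XML or JSON; 'summary' is not a valid retmode
        'retmax' => 1,        // only the count and the history keys are needed here
        'usehistory' => 'y',
        'term' => $query,
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?' . http_build_query($params);
    print "URL: $url\n";

    $xml = simplexml_load_file($url);
    if ($xml === false){
        print "Could not load or parse the ESearch response\n";
        return 0;
    }
    pubmed_errors($xml);
    print $xml->asXML() . "\n";

    if (!$count = (int) $xml->Count){
        print "No items found!\n";
        return 0;
    }
    print "$count items found\n";

    $translated = (string) $xml->QueryTranslation;
    printf("Translated query: %s\n\n", $translated);

    // Step 2: EFetch. EFetch returns at most 10,000 records per request, so
    // larger result sets need to be downloaded in batches with retstart.
    $params = array(
        'db' => 'pubmed',
        'retmode' => 'xml',
        'query_key' => (string) $xml->QueryKey,
        'WebEnv' => (string) $xml->WebEnv,
        'retmax' => $count,
    );
    $url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' . http_build_query($params);
    print "URL: $url\n";

    // Output file: <year>-<month>-<day>_<translated query>.xml
    $file = sprintf('%s.xml', preg_replace('/\W/', '_', $translated));
    $file = $pubmedyear."-".$pubmedmonth."-".$pubmedday."_$file";
    $wgetfilelog = "wget.{$file}.log";

    // Download with wget, sending stdout and stderr to the log file.
    system(sprintf("wget %s -O %s > %s 2>&1", escapeshellarg($url),
        escapeshellarg($file), escapeshellarg($wgetfilelog)), $ret);

    $flag = wget_errors($wgetfilelog);
    $flag_xml = xml_is_read($file);

    if ($ret == 0 && $flag == 1 && $flag_xml == 1){
        return $count;
    }
    else{
        return 0;
    }
}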

Source: https://www.bwwdw.com/article/ncv1.html
