Extract Abstracts from PubMed
I am part of a research group that extracts protein-protein interactions using partial matching. After a while we found that we needed the abstracts of some articles from the National Center for Biotechnology Information (NCBI), looked up by PubMed ID. Here is my program, which fetches each article from NCBI, extracts the abstract section, and saves it to a new file named after the PubMed ID.
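import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup

DIRECTORY = "abstractFiles"

# Create the output directory; carry on if it is already there.
try:
    os.mkdir(DIRECTORY)
except OSError:
    print "the directory %s already exists" % DIRECTORY

# Each line of output-1.txt is assumed to hold an 8-character PubMed ID
# at character positions 13 to 20 (counting from 0).
f = open('output-1.txt', 'r')
tickers = []
for line in f:
    tickers.append(line[13:21])
f.close()

os.chdir(DIRECTORY)
errors = []  # IDs whose page could not be fetched or had no abstract
for t in tickers:
    try:
        rows = urllib2.urlopen(
            'http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=%s'
            % t).read()
        soup = BeautifulSoup(rows)
        # The abstract sits in a <p> tag whose class contains "abstract".
        paragraphs = soup.findAll('p', attrs={'class': re.compile("abstract")})
        ab = str(paragraphs[0])
        ab = ab[20:]  # drop the opening <p ...> tag, assumed to be 20 characters long
        ab = ab.replace('</p>', '')
        out = open(t + '.txt', 'w')  # separate name, so the ID in t is not overwritten
        out.write(ab)
        out.close()
    except (IOError, IndexError):
        # IOError: the request failed; IndexError: no abstract paragraph was found.
        errors.append(t)

# Write all failed IDs once, after the loop, so earlier ones are not lost.
errf = open('bad_trickers.txt', 'w')
errf.write(str(errors))
errf.close()
print errors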
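The fixed offset in `ab[20:]` only works while the opening `<p ...>` tag is exactly 20 characters long, and any inline tags inside the abstract are left in the saved text. As a rough alternative, here is a minimal sketch in Python 3 that pulls out the text with the standard library `html.parser` module instead of BeautifulSoup. The names `AbstractExtractor` and `extract_abstract` are made up for this illustration, and the sample HTML below is invented, not a real NCBI page.

```python
from html.parser import HTMLParser


class AbstractExtractor(HTMLParser):
    """Collect the text of every <p> whose class attribute contains 'abstract'."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while we are inside a matching <p>
        self.chunks = []      # text pieces of the abstract being read
        self.abstracts = []   # finished abstracts, in document order

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "p" and "abstract" in css_class:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.abstracts.append("".join(self.chunks).strip())

    def handle_data(self, data):
        # Text inside inline tags such as <i> also arrives here, without the tags.
        if self.inside:
            self.chunks.append(data)


def extract_abstract(html):
    """Return the first abstract found in html, or None if there is none."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return parser.abstracts[0] if parser.abstracts else None


# Invented sample page, only to show the call.
sample = '<div><p class="abstract_text">Protein <i>A</i> binds B.</p></div>'
print(extract_abstract(sample))
```

Returning `None` when no matching paragraph exists lets the caller record the ID as failed instead of crashing on an empty list.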
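Note: This program is written in Python 2 (it uses urllib2 and the print statement). To run it you will need the external library BeautifulSoup, version 3, which is what the "from BeautifulSoup import BeautifulSoup" line imports.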
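For readers on Python 3, where `urllib2` no longer exists, the fetch-and-save loop could be sketched roughly as follows. This is an illustration under assumptions, not a port of the program: `save_abstracts`, `fetch` and `extract` are hypothetical names, and the download and parsing steps are passed in as functions so the loop can be tried without network access.

```python
import os


def save_abstracts(pmids, fetch, extract, out_dir):
    """Write one <pmid>.txt file per abstract and return the IDs that failed.

    fetch(pmid)   -> page text; may raise OSError (hypothetical callable)
    extract(page) -> abstract text, or None if none was found (hypothetical)
    """
    os.makedirs(out_dir, exist_ok=True)  # no error if the directory exists
    failed = []
    for pmid in pmids:
        try:
            abstract = extract(fetch(pmid))
        except OSError:
            # Network and file errors are subclasses of OSError in Python 3.
            failed.append(pmid)
            continue
        if abstract is None:
            failed.append(pmid)
            continue
        path = os.path.join(out_dir, pmid + ".txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write(abstract)
    return failed
```

In real use, `fetch` could wrap `urllib.request.urlopen`, the Python 3 counterpart of `urllib2.urlopen`, and the returned list could be written to a file such as bad_trickers.txt once the loop has finished.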
