Extract Abstracts from PubMed

I'm in a research group that wants to extract protein-protein interactions using partial matching. After a while, we found that we needed the abstracts of some articles from the National Center for Biotechnology Information (NCBI), looked up by PubMed ID. Here is my program to fetch the articles from NCBI, extract the abstract section, and save each abstract to a new file named after its PubMed ID.

import urllib2
import re
import os
from BeautifulSoup import BeautifulSoup

DIRECTORY = "abstractFiles"
try:
        os.mkdir(DIRECTORY)

except OSError:
        print "the directory %s already exists" % DIRECTORY

# Each input line carries an 8-character PubMed ID in columns 13-20
f = open('output-1.txt', 'r')

tickers = []
for line in f:
        tickers.append(line[13:21])

os.chdir(DIRECTORY)
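for t in tickers:
        try:
                # Fetch the PubMed search result page for this ID
                page = urllib2.urlopen(
                        'http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=%s'
                        % t).read()

                soup = BeautifulSoup(page)
                paragraphs = soup.findAll('p', attrs={'class': re.compile("abstract")})

                # Drop the opening <p class="abstract"> tag (20 characters)
                # and the closing </p>
                ab = str(paragraphs[0])
                ab = ab[20:]
                ab = ab.replace('</p>', '')

                # Save the abstract in a file named after the PubMed ID
                out = open(t + '.txt', 'w')
                out.write(ab)
                out.close()

        except (IOError, IndexError):
                # IOError: the page could not be fetched or the file not written.
                # IndexError: no abstract paragraph was found on the page.
                # Append, so that earlier failures are not overwritten.
                errf = open('bad_tickers.txt', 'a')
                errf.write(t + '\n')
                errf.close()

                print "failed:", t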
f.close()
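
The slice line[13:21] assumes that every line of output-1.txt holds its PubMed ID at the same fixed column. As a possible alternative, a regular expression can pick the ID out wherever it sits on the line. This is only a sketch in Python 3: the sample line format and the name extract_pmid are hypothetical, since the real layout of output-1.txt is not shown here, and it assumes an ID is a standalone run of at most eight digits.

```python
import re

# Assumption: a PubMed ID is a standalone run of 1 to 8 digits.
PMID_PATTERN = re.compile(r"\b(\d{1,8})\b")


def extract_pmid(line):
    """Return the first standalone number on the line as a string, or None."""
    match = PMID_PATTERN.search(line)
    return match.group(1) if match else None


# Hypothetical input lines; the real file layout may differ.
sample_lines = [
    "PubMed ID:   12345678 some protein pair",
    "no identifier on this line",
]

pmids = [extract_pmid(line) for line in sample_lines]
print(pmids)  # ['12345678', None]
```

Lines without an ID come back as None, so they can be skipped before any request is sent to NCBI.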

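Note: This program is written in Python 2 (it uses urllib2 and the print statement). To run it, you will need the external library BeautifulSoup, version 3, which is the version that provides the "from BeautifulSoup import BeautifulSoup" import.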
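
BeautifulSoup does the HTML parsing in the program above. Where installing an external library is not an option, the same extraction step can be sketched with the html.parser module from the Python 3 standard library. This is an illustration, not the program's own method: the names AbstractExtractor and extract_abstract and the sample HTML are made up, and the markup of the real NCBI page may differ.

```python
from html.parser import HTMLParser


class AbstractExtractor(HTMLParser):
    """Collect the text of the first <p> whose class contains 'abstract'."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while inside the target paragraph
        self.found = False    # True once the first abstract has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "p" and "abstract" in css_class and not self.found:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.found = True

    def handle_data(self, data):
        # Text inside inline tags such as <i> is collected as well.
        if self.inside:
            self.parts.append(data)


def extract_abstract(html):
    """Return the abstract text, or None when no abstract paragraph exists."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).strip()
    return text or None


# Made-up page fragment, used only to exercise the parser.
sample_html = (
    '<div><p class="title">A title</p>'
    '<p class="abstract">Protein A <i>binds</i> protein B.</p></div>'
)
print(extract_abstract(sample_html))
```

Unlike the fixed ab[20:] slice, this version does not depend on the exact length of the opening tag, and it returns None instead of raising an error when a page has no abstract.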

