
Want to Pull a Journal Title from an RCSB Page Using Python & BeautifulSoup

I am trying to get specific information about the original citing paper in the Protein Data Bank given only the 4 letter PDBID of the protein. To do this I am using the python libr

Solution 1:

The content you are interested in is generated by JavaScript. This is easy to verify: visit the same URL in a browser with JavaScript disabled and you will not see that specific info. The page also displays a friendly message:

"This browser is either not Javascript enabled or has it turned off. This site will not function correctly without Javascript."
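The same check can be done programmatically on HTML that has already been downloaded. This is only a minimal sketch: the helper name needs_javascript is hypothetical, and it assumes the warning quoted above appears verbatim in the static HTML, which may not hold for every page.

```python
# Marker text taken from the warning quoted above (assumed to appear
# verbatim in the static HTML that a non-JavaScript client receives).
JS_WARNING = "This browser is either not Javascript enabled or has it turned off"


def needs_javascript(html):
    """Return True if the downloaded HTML contains the JavaScript warning."""
    return JS_WARNING.lower() in html.lower()


# Made-up HTML snippets, for illustration only.
static_page = "<noscript>" + JS_WARNING + ".</noscript>"
plain_page = "<p>Journal: Example</p>"
print(needs_javascript(static_page), needs_javascript(plain_page))
```

If the function returns True for the HTML you fetched, the data you want is most likely filled in by scripts after the page loads.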

For JavaScript-driven pages, Python Requests alone is not enough: it downloads the raw HTML and does not execute scripts. There are some alternatives, one being dryscrape.

PS: Do not import libraries/modules within a function. It is not recommended in Python, and PEP 8 says:

Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.

This SO question explains why it is not the recommended way to do it.
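As a small illustration of that layout, here is a sketch with the import at module level and the function below it. The helper is_pdb_id is hypothetical and only checks that the ID is four alphanumeric characters, which is an assumption about what a valid ID looks like.

```python
"""Module docstring comes first, then the imports."""
import re  # module-level import, placed before globals and functions

# Module constant, defined after the imports.
PDB_ID_PATTERN = re.compile(r"[0-9A-Za-z]{4}")


def is_pdb_id(text):
    """Return True if text consists of exactly four alphanumeric characters."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


print(is_pdb_id("4lza"))
```

Because re is imported once at the top, every function in the module can use it and a reader can see all dependencies at a glance.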

Solution 2:

The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:

pip install pypdb

For your application, the function describe_pdb takes a four-character PDB ID as an input and returns a dictionary containing the metadata associated with the entry:

from pypdb import describe_pdb

my_desc = describe_pdb('4lza')

There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw .pdb file using get_pdb_file('4lza', filetype='cif', compression=True).
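Since some entries may lack a field, it can help to read the dictionary defensively. The sketch below assumes the result is a plain dict keyed by the three names mentioned above. The helper summarize_description and the sample values are hypothetical, and the block does not call PyPDB itself.

```python
# Keys named in the answer above; an entry is not guaranteed to have each one.
CITATION_KEYS = ("title", "citation_authors", "structure_authors")


def summarize_description(desc):
    """Pick the citation-related fields out of a describe_pdb-style dict.

    Missing keys map to None instead of raising KeyError.
    """
    return {key: desc.get(key) for key in CITATION_KEYS}


# Hypothetical stand-in for the result of describe_pdb('4lza');
# the values are made up for illustration.
sample = {"title": "Example structure title", "structure_authors": "Doe, J."}
summary = summarize_description(sample)
print(summary)
```

With real data you would pass my_desc in place of sample and then check which values came back as None before relying on them.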
