Want To Pull A Journal Title From An RCSB Page Using Python & BeautifulSoup
Solution 1:
The content you are interested in is rendered by JavaScript. This is easy to verify: visit the same URL in a browser with JavaScript disabled and that specific information will be missing. The page also displays a friendly message:
"This browser is either not Javascript enabled or has it turned off. This site will not function correctly without Javascript."
For JavaScript-driven pages, Python Requests on its own is not enough, because it only downloads the raw HTML and does not execute any scripts. There are some alternatives, one being dryscrape.
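The browser check described above can also be done in code. The following is a minimal sketch, using only the standard library, that tests whether a piece of text is present in the raw HTML before any script has run. The names VisibleText and appears_in_static_html are made up for this illustration, and the sample markup is invented, not copied from the RCSB page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def appears_in_static_html(html, needle):
    """Return True if `needle` occurs in the text of the raw, unrendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return needle.lower() in " ".join(parser.chunks).lower()


# Invented sample: the journal name only exists inside a script.
sample = (
    "<html><body>"
    "<noscript>This site will not function correctly without Javascript.</noscript>"
    "<div id='app'></div>"
    "<script>var t = 'Journal of Examples';</script>"
    "</body></html>"
)
found_warning = appears_in_static_html(sample, "will not function correctly")
found_journal = appears_in_static_html(sample, "Journal of Examples")
```

For a real page, the HTML string could come from requests.get(url).text. If the text you want is not found that way, it is most likely being filled in by JavaScript after the page loads.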
PS: Do not import libraries/modules inside a function. It is not the recommended style in Python, and PEP 8 says:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
This SO question explains why it is not the recommended way to do it.
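As a small illustration of that layout, here is a sketch of a hypothetical module. The constant DEFAULT_FIELD and the function extract_field are invented for this example and are not taken from the question's code.

```python
"""Hypothetical scraper module illustrating import placement."""

import json  # module-level import, placed right after the docstring

DEFAULT_FIELD = "title"  # module globals and constants follow the imports


def extract_field(raw_json, field=DEFAULT_FIELD):
    """Return one field from a JSON document, or None if it is absent."""
    # No import statement in here: json is already available module-wide.
    return json.loads(raw_json).get(field)
```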
Solution 2:
The Python package PyPDB can do this task. The repository can be found here, but it is also available on PyPI:
pip install pypdb
For your application, the function describe_pdb takes a four-character PDB ID as input and returns a dictionary containing the metadata associated with the entry:
from pypdb import describe_pdb
my_desc = describe_pdb('4lza')
There are fields in my_desc for 'citation_authors', 'structure_authors', and 'title', but not all entries appear to have journal titles associated with them. The other options are to use the broader function get_all_info('4lza') or to get (and parse) the entire raw file using get_pdb_file('4lza', filetype='cif', compression=True)
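Since the exact layout of the returned dictionary may vary between entries and PyPDB versions, one way to look for journal information is to search the whole structure for keys that mention it. The sketch below assumes only that the result is made of nested dicts and lists. The helper find_keys is hypothetical, and the sample dictionary is a made-up stand-in, not real PyPDB output.

```python
def find_keys(data, fragment, path=()):
    """Yield (path, value) for every dict key containing `fragment`.

    The match is case-insensitive; `path` is a tuple of the keys and list
    indexes leading to the matching key.
    """
    if isinstance(data, dict):
        for key, value in data.items():
            key_path = path + (key,)
            if fragment.lower() in str(key).lower():
                yield key_path, value
            # Keep descending, since matches may be nested deeper.
            yield from find_keys(value, fragment, key_path)
    elif isinstance(data, (list, tuple)):
        for index, item in enumerate(data):
            yield from find_keys(item, fragment, path + (index,))


# Made-up stand-in for a metadata dictionary; the real layout may differ.
sample = {
    "title": "Example structure",
    "citation": [{"journal_abbrev": "J. Example"}],
}
matches = list(find_keys(sample, "journal"))
```

With PyPDB installed, the same helper could be applied to a real result, for example list(find_keys(describe_pdb('4lza'), 'journal')). An empty list would suggest that the entry has no journal-related field at all.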