How To Get The HTML DOM Of A Webpage And Its Frames

I would like to get the DOM of a website after js execution. I would also like to get all the content of the iframes in the website, similarly to what I have in Google Chrome's In

Solution 1:

This is a very hard problem to solve in general.

The main difficulty is that there is no way to know in advance how many frames a page has. In addition, each child frame may have its own set of frames, the number of which is also unknown. In theory, frames could be nested without limit, in which case the page would never finish loading (which seems no exaggeration for sites that carry a lot of ads).
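Since there may be no definite point at which "everything" has loaded, one common workaround is to stop waiting once no frame has finished loading for a short quiet period, or once an overall time limit is reached. The following is only a minimal sketch of that idea in plain Python; the class name `LoadDeadline`, its parameters, and the default timings are all hypothetical and are not part of Qt.

```python
import time


class LoadDeadline:
    """Hypothetical helper: decides when to stop waiting for more frames."""

    def __init__(self, quiet_seconds=2.0, max_seconds=30.0, clock=time.monotonic):
        self._clock = clock            # injectable so the logic can be tested
        self._quiet = quiet_seconds    # how long "no new frames" must last
        self._max = max_seconds        # hard upper limit on the total wait
        self._start = clock()
        self._last_event = self._start

    def frame_loaded(self):
        # Call this whenever any frame reports that it has finished loading.
        self._last_event = self._clock()

    def should_stop(self):
        # Stop if things have gone quiet, or if the overall limit is reached.
        now = self._clock()
        went_quiet = now - self._last_event >= self._quiet
        timed_out = now - self._start >= self._max
        return went_quiet or timed_out
```

In a Qt script, `frame_loaded()` would be called from each frame's `loadFinished` handler, and `should_stop()` could be polled from a `QtCore.QTimer`. The right timings depend heavily on the site, so treat the defaults as placeholders.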

Anyway, below is a version of your script which gets the top-level QWebFrame object of each frame as it loads, and shows how you can access some of the things you are interested in. As you will see from the output, there are a lot of "junk" frames inserted by ads and the like, which you will somehow need to filter out.

import sys, signal
from PyQt4 import QtGui, QtCore, QtWebKit

class Sp():
    def save(self, ok, frame=None):
        # frame is None when called for the main frame; for a child
        # frame it is the QWebFrame that has just finished loading.
        if frame is None:
            print('main-frame')
            frame = self.webView.page().mainFrame()
        else:
            print('child-frame')
        print('URL: %s' % frame.baseUrl().toString())
        print('METADATA: %s' % frame.metaData())
        print('TAG: %s' % frame.documentElement().tagName())
        print()

    def handleFrameCreated(self, frame):
        # frameCreated is emitted when the page creates a new frame;
        # hook that frame's own loadFinished signal so it gets reported.
        frame.loadFinished.connect(lambda: self.save(True, frame=frame))

    def main(self):
        self.webView = QtWebKit.QWebView()
        self.webView.page().frameCreated.connect(self.handleFrameCreated)
        # Connect to the main frame's signal, not the web-view's.
        self.webView.page().mainFrame().loadFinished.connect(self.save)
        self.webView.load(QtCore.QUrl("http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_iframe_scrolling"))

signal.signal(signal.SIGINT, signal.SIG_DFL)
print('Press Ctrl+C to quit\n')
app = QtGui.QApplication(sys.argv)
s = Sp()
s.main()
sys.exit(app.exec_())
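
One possible way to filter out the junk frames is to keep only those whose URL belongs to a host you care about. This is just a sketch of that idea: the function name `is_wanted_frame` and the `allowed_hosts` list are made up for illustration, and it assumes Python 3 (for `urllib.parse`) and that you pass in the string from `frame.baseUrl().toString()`.

```python
from urllib.parse import urlparse


def is_wanted_frame(url, allowed_hosts):
    """Return True if `url` is on one of `allowed_hosts` or a subdomain of one.

    Hypothetical helper: frames with no host at all (e.g. about:blank)
    are treated as unwanted.
    """
    host = urlparse(url).hostname or ''
    return any(host == allowed or host.endswith('.' + allowed)
               for allowed in allowed_hosts)


# Example: keep frames served from w3schools.com, drop everything else.
print(is_wanted_frame('http://www.w3schools.com/tags/tryit.asp', ['w3schools.com']))
print(is_wanted_frame('about:blank', ['w3schools.com']))
```

A host-based filter is crude: it will also drop legitimate third-party frames (embedded videos, maps and so on), so the list of allowed hosts has to be chosen per site.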

NB: it is important that you connect to the loadFinished signal of the main frame rather than that of the web-view. If you connect to the latter, your slot will be called multiple times if the page contains more than one frame.
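
To get at the actual markup rather than just the tag name, QWebFrame also provides toHtml() and childFrames(), so the whole frame tree can be walked recursively once it has loaded. The sketch below shows the traversal only; `collect_frame_html` is a hypothetical helper, and `FakeFrame` is a made-up stand-in for QWebFrame so that the example runs without Qt. The result can only be as complete as the set of frames that have finished loading at the time you call it.

```python
def collect_frame_html(frame, depth=0, out=None):
    """Return a list of (depth, html) pairs for `frame` and all descendants.

    `frame` only needs toHtml() and childFrames(), which QWebFrame provides.
    """
    if out is None:
        out = []
    out.append((depth, frame.toHtml()))
    for child in frame.childFrames():
        collect_frame_html(child, depth + 1, out)
    return out


class FakeFrame:
    """Hypothetical stand-in for QWebFrame, for demonstration only."""

    def __init__(self, html, children=()):
        self._html = html
        self._children = list(children)

    def toHtml(self):
        return self._html

    def childFrames(self):
        return self._children


# A main frame with two child frames, one of which has a nested frame.
inner = FakeFrame('<html>inner</html>')
page = FakeFrame('<html>main</html>', [
    FakeFrame('<html>ad</html>'),
    FakeFrame('<html>widget</html>', [inner]),
])

result = collect_frame_html(page)
for depth, html in result:
    print('%s%s' % ('  ' * depth, html))
```

With a real page you would pass `self.webView.page().mainFrame()` instead of the fake frame.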
