
Django-dynamic-scraper Unable To Scrape The Data

I am new to using dynamic scraper, and I have used the following sample for learning: open_news. I have everything set up, but it keeps showing me the same error: dynamic_scraper.mod

Solution 1:

This happens because the "REQUEST PAGE TYPES" are missing. Each of the "SCRAPER ELEMS" must have its own "REQUEST PAGE TYPES" entry.

To solve this problem, please follow the steps below:

  1. Log in to the admin page (usually http://localhost:8000/admin/)
  2. Go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article)
  3. Click on "Add another Request page type" under "REQUEST PAGE TYPES"
  4. Create four "REQUEST PAGE TYPES" in total, one each for "(base (Article))", "(title (Article))", "(description (Article))" and "(url (Article))"

"REQUEST PAGE TYPES" Settings

Set every "Content type" to "HTML".

Set every "Request type" to "Request".

Set every "Method" to "Get".

For "Page type", just assign them in sequence, like this:

(base (Article)) | Main Page

(title (Article)) | Detail Page 1

(description (Article)) | Detail Page 2

(url (Article)) | Detail Page 3

After the steps above, the "DoesNotExist: RequestPageType" error should be fixed.
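To picture why the error goes away, here is a minimal sketch in plain Python. It is not the django-dynamic-scraper API: the dictionary, the function name `find_missing_page_types`, and the assumption that each scraper elem uses the page type from the table above are all illustrative. It only models the idea that every page type used by a scraper elem needs a matching request page type.

```python
# Plain-Python model of the admin settings (an illustration only,
# not how django-dynamic-scraper stores or checks them internally).

def find_missing_page_types(scraper_elems, defined_page_types):
    """Return (elem, page_type) pairs whose page type has no request page type."""
    return [
        (elem, page_type)
        for elem, page_type in scraper_elems.items()
        if page_type not in defined_page_types
    ]

# Assumed mapping of scraper elems to page types, following the table above.
scraper_elems = {
    "base (Article)": "Main Page",
    "title (Article)": "Detail Page 1",
    "description (Article)": "Detail Page 2",
    "url (Article)": "Detail Page 3",
}

# Before the fix: no request page types are defined at all.
before = find_missing_page_types(scraper_elems, set())

# After the fix: one request page type exists for every page type in use.
after = find_missing_page_types(scraper_elems, set(scraper_elems.values()))

print("missing before:", before)
print("missing after:", after)
```

With no request page types defined, all four elems are reported as missing; once each page type has an entry, the list is empty.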

However, "ERROR: Mandatory elem title missing!" will then come up.

To solve this, I suggest changing every "REQUEST PAGE TYPE" in "SCRAPER ELEMS" to "Main Page", including the one for "title (Article)".

Then change the XPaths as follows:

(base (Article)) | //td[@class="l_box"]

(title (Article)) | span[@class="l_title"]/a/@title

(description (Article)) | p/span[@class="l_summary"]/text()

(url (Article)) | span[@class="l_title"]/a/@href
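The XPaths above work as a pair: the base XPath selects one node per article, and the other three are evaluated relative to that node. Below is a minimal sketch of that idea using only Python's standard library. The sample markup is hypothetical and merely shaped to match the XPaths; the real page may differ. `xml.etree.ElementTree` supports only a subset of XPath, so the trailing `/@title`, `/@href` and `/text()` steps are replaced here with `.get()` and `.text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped to match the XPaths above (not the real page).
SAMPLE = """
<table>
  <tr>
    <td class="l_box">
      <span class="l_title"><a href="/wiki/First_story" title="First story">First story</a></span>
      <p><span class="l_summary">Summary of the first story.</span></p>
    </td>
    <td class="l_box">
      <span class="l_title"><a href="/wiki/Second_story" title="Second story">Second story</a></span>
      <p><span class="l_summary">Summary of the second story.</span></p>
    </td>
  </tr>
</table>
"""

def extract_articles(markup):
    """Apply the base XPath, then the relative XPaths inside each base node."""
    root = ET.fromstring(markup)
    articles = []
    # base: //td[@class="l_box"]
    for base in root.findall(".//td[@class='l_box']"):
        # title and url: span[@class="l_title"]/a, then the @title / @href attributes
        link = base.find("span[@class='l_title']/a")
        # description: p/span[@class="l_summary"], then its text
        summary = base.find("p/span[@class='l_summary']")
        articles.append({
            "title": link.get("title"),
            "description": summary.text,
            "url": link.get("href"),
        })
    return articles

articles = extract_articles(SAMPLE)
for article in articles:
    print(article)
```

Because title, description and url are all found inside the base node on the same page, every elem can use "Main Page" and no detail page request is needed.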

Finally, run scrapy crawl article_spider -a id=1 -a do_action=yes at the command prompt. You should now be able to crawl the "Article" items. You can check the results under Home › Open_News › Articles.

Enjoy~

Solution 2:

I might be late to the party, but hopefully my solution will be helpful for those who come across this later.

@alan-nala's solution works well. However, it basically skips the detail page scraping.

Here is how you can take full advantage of the detail page scraping.

First, go to Home › Dynamic_Scraper › Scrapers › Wikinews Scraper (Article) and add those in Request page types.

Second, make sure your elements look like this in SCRAPER ELEMS.

Now you can run the manual scraping command according to the docs:

scrapy crawl article_spider -a id=1 -a do_action=yes

Well, you are likely to encounter the error mentioned by @alan-nala:

"ERROR: Mandatory elem title missing!"

Pay attention to the error output: in my case, there is a message indicating that the script is "Calling DP2 URL for...".

Finally, you can go back to SCRAPER ELEMS and change Request page type of the element "title (Article)" to "Detail Page 2" instead of "Detail Page 1".

Save your settings and run the scrapy command again.

Note: Your "Detail Page #" might vary.
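If the log is long, the detail page code can be pulled out with a short script. This is only a sketch: the function name `detail_page_codes` is made up for illustration, and the pattern assumes the log contains the "Calling DP2 URL for..." wording quoted above. The exact log format may differ between versions.

```python
import re

# Matches the detail page code in messages such as "Calling DP2 URL for...".
# The wording is taken from the message quoted above; other log formats
# are not covered by this pattern.
DP_PATTERN = re.compile(r"Calling (DP\d+) URL")

def detail_page_codes(log_text):
    """Return the distinct detail page codes mentioned in the log, in order."""
    seen = []
    for code in DP_PATTERN.findall(log_text):
        if code not in seen:
            seen.append(code)
    return seen

sample_log = "Calling DP2 URL for..."
codes = detail_page_codes(sample_log)
print(codes)
```

The code it reports ("DP2" here) tells you which "Detail Page #" to pick for "title (Article)".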

By the way, I have also prepared a short tutorial hosted on GitHub, in case you need more details on this topic.
