Run Scrapy on a set of a hundred-plus URLs


I need to download the CPU and GPU data for a set of phones from GSMArena. As step one, I downloaded the URLs of the phones by running Scrapy and deleted the unnecessary items.

The code is below.

    # -*- coding: utf-8 -*-
    from scrapy.selector import Selector
    from scrapy import Spider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from gsmarena_data.items import GsmArenaDataItem


    class MobileInfoSpider(Spider):
        name = "mobile_info"
        allowed_domains = ["gsmarena.com"]
        start_urls = (
            # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
            # 'http://www.gsmarena.com/apple-phones-48.php',
            # 'http://www.gsmarena.com/microsoft-phones-64.php',
            # 'http://www.gsmarena.com/nokia-phones-1.php',
            # 'http://www.gsmarena.com/sony-phones-7.php',
            # 'http://www.gsmarena.com/lg-phones-20.php',
            # 'http://www.gsmarena.com/htc-phones-45.php',
            # 'http://www.gsmarena.com/motorola-phones-4.php',
            # 'http://www.gsmarena.com/huawei-phones-58.php',
            # 'http://www.gsmarena.com/lenovo-phones-73.php',
            # 'http://www.gsmarena.com/xiaomi-phones-80.php',
            # 'http://www.gsmarena.com/acer-phones-59.php',
            # 'http://www.gsmarena.com/asus-phones-46.php',
            # 'http://www.gsmarena.com/oppo-phones-82.php',
            # 'http://www.gsmarena.com/blackberry-phones-36.php',
            # 'http://www.gsmarena.com/alcatel-phones-5.php',
            # 'http://www.gsmarena.com/xolo-phones-85.php',
            # 'http://www.gsmarena.com/lava-phones-94.php',
            # 'http://www.gsmarena.com/micromax-phones-66.php',
            # 'http://www.gsmarena.com/spice-phones-68.php',
            'http://www.gsmarena.com/gionee-phones-92.php',
        )

        def parse(self, response):
            hxs = Selector(response)
            phone_listings = hxs.css('.makers')
            for listing in phone_listings:
                phone = GsmArenaDataItem()
                phone['model'] = listing.xpath("ul/li/a/strong/text()").extract()
                phone['link'] = listing.xpath("ul/li/a/@href").extract()
                yield phone

Now, I need to run Scrapy on a set of URLs to get the CPU and GPU data. The info comes from the CSS selector ".ttl".

Kindly guide me on how to loop Scrapy over a set of URLs and output the data to a single CSV or JSON file. I'm aware of creating items and using CSS selectors; what I need is how to loop over a hundred-plus pages.

I have a list of URLs like:

    www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
    www.gsmarena.com/samsung_galaxy_s5-6033.php
    www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
    www.gsmarena.com/samsung_galaxy_core_lte-6099.php
    www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
    www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
    www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
    www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php

These are links to phone description pages on GSMArena. I need to download the CPU and GPU info of the 100 models I have, and I have already extracted the URLs of the 100 models whose data is required. The spider I have written so far is:

    from scrapy.selector import Selector
    from scrapy import Spider
    from gsmarena_data.items import GsmArenaDataItem


    class MobileInfoSpider(Spider):
        name = "cpu_gpu_info"
        allowed_domains = ["gsmarena.com"]
        start_urls = (
            "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
            "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
            "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
            "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
        )

        def parse(self, response):
            hxs = Selector(response)
            cpu_gpu = hxs.css('.ttl')
            for section in cpu_gpu:
                phone = GsmArenaDataItem()
                phone['cpu'] = section.xpath("ul/li/a/strong/text()").extract()
                phone['gpu'] = section.xpath("ul/li/a/@href").extract()
                yield phone

If I can somehow run this spider on all the URLs I want to extract data from, I need the resulting data in a single CSV file.
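For the single output file, Scrapy's built-in feed exports already collect everything the spider yields into one file, so no extra code is needed (spider name taken from the question):

    scrapy crawl cpu_gpu_info -o cpu_gpu.csv     # one CSV with all items
    scrapy crawl cpu_gpu_info -o cpu_gpu.json    # or one JSON file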

I think you need the information for every vendor. You don't have to put hundreds of URLs in start_urls; alternatively, you can use a single link as the start URL and then, in parse(), extract the URLs programmatically and process them as you want.


