Run Scrapy on a set of a hundred-plus URLs
I need to download the CPU and GPU data for a set of phones from GSMArena. As step one, I downloaded the URLs of the phones by running Scrapy and deleted the unnecessary items.
The code is below.
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import GsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "mobile_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        # 'http://www.gsmarena.com/samsung-phones-f-9-10.php',
        # 'http://www.gsmarena.com/apple-phones-48.php',
        # 'http://www.gsmarena.com/microsoft-phones-64.php',
        # 'http://www.gsmarena.com/nokia-phones-1.php',
        # 'http://www.gsmarena.com/sony-phones-7.php',
        # 'http://www.gsmarena.com/lg-phones-20.php',
        # 'http://www.gsmarena.com/htc-phones-45.php',
        # 'http://www.gsmarena.com/motorola-phones-4.php',
        # 'http://www.gsmarena.com/huawei-phones-58.php',
        # 'http://www.gsmarena.com/lenovo-phones-73.php',
        # 'http://www.gsmarena.com/xiaomi-phones-80.php',
        # 'http://www.gsmarena.com/acer-phones-59.php',
        # 'http://www.gsmarena.com/asus-phones-46.php',
        # 'http://www.gsmarena.com/oppo-phones-82.php',
        # 'http://www.gsmarena.com/blackberry-phones-36.php',
        # 'http://www.gsmarena.com/alcatel-phones-5.php',
        # 'http://www.gsmarena.com/xolo-phones-85.php',
        # 'http://www.gsmarena.com/lava-phones-94.php',
        # 'http://www.gsmarena.com/micromax-phones-66.php',
        # 'http://www.gsmarena.com/spice-phones-68.php',
        'http://www.gsmarena.com/gionee-phones-92.php',
    )

    def parse(self, response):
        hxs = Selector(response)
        phone_listings = hxs.css('.makers')
        for phone_listing in phone_listings:
            phone = GsmArenaDataItem()
            phone['model'] = phone_listing.xpath("ul/li/a/strong/text()").extract()
            phone['link'] = phone_listing.xpath("ul/li/a/@href").extract()
            yield phone
Now I need to run Scrapy on a set of URLs to get the CPU and GPU data. The information sits behind the CSS selector ".ttl".
Kindly guide me on how to loop Scrapy over a set of URLs and output the data into a single CSV or JSON file. I am aware of how to create items and use CSS selectors; what I need is how to loop over a hundred-plus pages.
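One common way to feed a hundred-plus pages to one spider is to keep the URLs in a plain text file and yield one request per line from the spider's `start_requests()` hook. Below is a minimal sketch: `load_urls` and the file name `phone_urls.txt` are hypothetical names, not Scrapy API, and the spider wiring is shown as comments so the helper stays self-contained:

```python
def load_urls(path):
    """Read one URL per line, skipping blank lines, and normalize bare
    hostnames like 'www.gsmarena.com/...' to full http:// URLs."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return [u if u.startswith("http") else "http://" + u for u in lines]

# Inside a Scrapy spider, the helper would be used roughly like this
# (start_requests is Scrapy's standard hook for generating initial requests):
#
#     def start_requests(self):
#         for url in load_urls("phone_urls.txt"):
#             yield scrapy.Request(url, callback=self.parse)
```

With this in place, one `scrapy crawl` run visits every URL from the file with the same `parse()` callback, so all hundred-plus pages flow through a single spider and a single output feed.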
I have a list of URLs like:

www.gsmarena.com/samsung_galaxy_s5_cdma-6338.php
www.gsmarena.com/samsung_galaxy_s5-6033.php
www.gsmarena.com/samsung_galaxy_core_lte_g386w-6846.php
www.gsmarena.com/samsung_galaxy_core_lte-6099.php
www.gsmarena.com/acer_iconia_one_8_b1_820-7217.php
www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php
www.gsmarena.com/microsoft_lumia_640_dual_sim-7082.php
www.gsmarena.com/microsoft_lumia_532_dual_sim-6951.php

These are links to phone description pages on GSMArena. I need to download the CPU and GPU info for the 100 models I have, and I have already extracted the URLs of those 100 models. The spider I have written is:

from scrapy.selector import Selector
from scrapy import Spider
from gsmarena_data.items import GsmArenaDataItem


class MobileInfoSpider(Spider):
    name = "cpu_gpu_info"
    allowed_domains = ["gsmarena.com"]
    start_urls = (
        "http://www.gsmarena.com/microsoft_lumia_435_dual_sim-6949.php",
        "http://www.gsmarena.com/microsoft_lumia_435-6942.php",
        "http://www.gsmarena.com/microsoft_lumia_535_dual_sim-6792.php",
        "http://www.gsmarena.com/microsoft_lumia_535-6791.php",
    )

    def parse(self, response):
        hxs = Selector(response)
        cpu_gpu = hxs.css('.ttl')
        for block in cpu_gpu:
            phone = GsmArenaDataItem()
            phone['cpu'] = block.xpath("ul/li/a/strong/text()").extract()
            phone['gpu'] = block.xpath("ul/li/a/@href").extract()
            yield phone
If I can somehow run the spider over all the URLs I want to extract data from, I need the extracted data in a single CSV file.
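For the single-file part: Scrapy's built-in feed export already merges everything every `parse()` call yields into one file, e.g. `scrapy crawl cpu_gpu_info -o cpu_gpu.csv` (or `-o cpu_gpu.json`), so no extra aggregation code is needed. The shape of that one CSV can be sketched with the standard library alone; the item dicts below are generic placeholders, not real GSMArena data, and `cpu_gpu.csv` is just an illustrative file name:

```python
import csv

# Placeholder stand-ins for the item dicts the spider would yield.
items = [
    {"model": "phone_a", "cpu": "cpu_a", "gpu": "gpu_a"},
    {"model": "phone_b", "cpu": "cpu_b", "gpu": "gpu_b"},
]

with open("cpu_gpu.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["model", "cpu", "gpu"])
    writer.writeheader()      # one header row for the whole file
    writer.writerows(items)   # one row per yielded item
```

The feed exporter does the equivalent of this for every item yielded during the crawl, which is why one spider run over all the URLs produces one combined CSV.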
I think you need this information for every vendor, so you don't have to put hundreds of URLs in start_urls.
Alternatively, you can use a single listing link in start_urls,
and then, in parse(),
extract the phone URLs programmatically and process them however you want.
This answer on SO covers the same approach.
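To sketch that programmatic route: the hrefs extracted from a listing page with `.xpath("ul/li/a/@href")` are relative (e.g. `samsung_galaxy_s5-6033.php`, taken from the question's own list), so they must be joined with the page URL before they can be followed. The standard library's `urllib.parse.urljoin` shows the mechanics; the base URL and href values here are illustrative:

```python
from urllib.parse import urljoin

# A listing page URL, as in the spider's start_urls.
base = "http://www.gsmarena.com/gionee-phones-92.php"
# Relative links as a listing-page xpath like "ul/li/a/@href" would return them.
hrefs = ["samsung_galaxy_s5-6033.php", "acer_iconia_tab_a3_a20-7136.php"]

# Join each relative href against the page it was found on.
full_urls = [urljoin(base, href) for href in hrefs]
# → ['http://www.gsmarena.com/samsung_galaxy_s5-6033.php',
#    'http://www.gsmarena.com/acer_iconia_tab_a3_a20-7136.php']
```

Inside a spider you would then yield a `scrapy.Request` (with the CPU/GPU-extracting method as the callback) for each joined URL; `response.urljoin(href)` does the same joining relative to the current response.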