python - BeautifulSoup adds content when I run find_all
I am trying to scrape listings from yp.com. While building the code, I was able to isolate the section with the names I want (div class="search-results organic"), but when I run find_all() on that content, it returns listings from outside that section.
The URL is http://www.yellowpages.com/search?search_terms=septic&geo_location_terms=80521
Here is what I am running:
    from bs4 import BeautifulSoup
    import urllib
    import re
    import xml
    import requests
    from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

    filename = "webspyorganictag.html"
    term = "septic"
    zipcode = "80521"
    url = "http://www.yellowpages.com/search?search_terms=" + term + "&geo_location_terms=" + zipcode

    # Start the output file with a header line
    with open(filename, "w") as myfile:
        myfile.write("information organic<br>")

    r = requests.get(url)
    soup = BeautifulSoup(r.content, "xml")
    organic = soup.find("div", {"class": "search-results organic"})

    # Append the isolated section to the output file
    with open(filename, "a") as myfile:
        myfile.write(str(organic))

This returns only the content in the organic listings section. There are 30 listings.
Then I add:

    listings = organic.find_all("div", {"class": "info"})
    i = 1
    with open(filename, "a") as myfile:
        for listing in listings:
            myfile.write("this listing " + str(i) + "<br>")
            myfile.write(str(listing) + "<br>")
            i += 1

This returns the original 30 listings plus 10 more listings (from aside id="main-aside") that are not included in the variable 'organic'.
Shouldn't calling organic.find_all() limit the scope to the data in the variable 'organic'?
Using "xml", the parser finds 41 class="info" elements inside the result of soup.find("div", {"class": "search-results organic"}), so it is not surprising that 41 are returned by find_all. You can see that the other elements are being included by looking at what organic returns, i.e. href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", href="/longmont-co/mip/rays-backhoe-service-6327932?lid=216924340" and every other listing from the ten featured.
If you look at line 41 of the html you write, it contains:
href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", which is the last of the featured listings.
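One way to confirm this kind of mis-nesting is to check whether the aside has ended up inside organic in the parsed tree. The following is only a minimal sketch: the markup is a made-up stand-in that mimics the mis-parsed structure (it is not the real yp.com page), and the helper name inside_aside is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the mis-parsed tree: the aside is
# nested inside the organic container instead of sitting next to it.
html = """
<div class="search-results organic">
  <div class="info">organic 1</div>
  <div class="info">organic 2</div>
  <aside id="main-aside">
    <div class="info">featured 1</div>
  </aside>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
organic = soup.find("div", {"class": "search-results organic"})

# If this is not None, the tree has the aside inside organic.
nested_aside = organic.find("aside", {"id": "main-aside"})


def inside_aside(tag):
    """Return True if any ancestor of tag is the main-aside element."""
    return any(
        parent.name == "aside" and parent.get("id") == "main-aside"
        for parent in tag.parents
    )


listings = organic.find_all("div", {"class": "info"})
stray = [listing for listing in listings if inside_aside(listing)]

print("aside nested in organic:", nested_aside is not None)
print("listings found:", len(listings), "of which inside the aside:", len(stray))
```

If nested_aside is not None on the real page, find_all() is behaving as documented: it searches all descendants of organic, and the parser has simply built a tree in which the featured listings are descendants too.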
The problem is the parser. If you change the parser to "lxml":

    soup = BeautifulSoup(r.content, "lxml")
    organic = soup.find("div", {"class": "search-results organic"})
    print(len(organic.find_all("h3", {"class": "info"})))
    30

Or use html.parser:

    soup = BeautifulSoup(r.content, "html.parser")
    organic = soup.find("div", {"class": "search-results organic"})
    print(len(organic.find_all("div", {"class": "info"})))
    30

You get the correct result.
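To compare parsers without hitting the live site, the same scoped count can be run over a small sample. This is a rough sketch under stated assumptions: the markup is invented for illustration, count_info is a hypothetical helper, and "lxml" is skipped if it is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invented, well-formed sample: the aside is a sibling of the organic div.
html = """<html><body>
<div class="search-results organic">
  <div class="info">organic 1</div>
  <div class="info">organic 2</div>
</div>
<aside id="main-aside">
  <div class="info">featured 1</div>
</aside>
</body></html>"""


def count_info(markup, parser):
    """Count div.info elements inside the organic container for one parser."""
    soup = BeautifulSoup(markup, parser)
    organic = soup.find("div", {"class": "search-results organic"})
    if organic is None:
        return None
    return len(organic.find_all("div", {"class": "info"}))


counts = {}
for parser in ("html.parser", "lxml"):
    try:
        counts[parser] = count_info(html, parser)
    except FeatureNotFound:
        # The parser library is not installed in this environment.
        counts[parser] = None

# Count over the whole document, for comparison with the scoped counts.
total = len(BeautifulSoup(html, "html.parser").find_all("div", {"class": "info"}))

print(counts, "whole document:", total)
```

With an HTML parser the scoped count leaves out the listing in the aside, while the whole-document count includes it. Running the same comparison against the saved page content is a quick way to see which parser produces the tree you expect.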