python - BeautifulSoup adds content when I run find_all -


I am trying to scrape listings from yp.com. In building the code, I am able to isolate the section by name (div class="search-results organic"), but when I run find_all() on that content, it returns listings from outside that section.

The URL is http://www.yellowpages.com/search?search_terms=septic&geo_location_terms=80521

Here is what I am running:

from bs4 import BeautifulSoup
import urllib
import re
import xml
import requests
from urlparse import urlparse  # Python 2 module; in Python 3 this lives in urllib.parse

filename = "webspyorganictag.html"
term = "septic"
zipcode = "80521"
url = "http://www.yellowpages.com/search?search_terms=" + term + "&geo_location_terms=" + zipcode

# Start the output file with a header line
with open(filename, "w") as myfile:
    myfile.write("information organic<br>")

r = requests.get(url)
soup = BeautifulSoup(r.content, "xml")
organic = soup.find("div", {"class": "search-results organic"})

# Append the isolated section to the output file
with open(filename, "a") as myfile:
    myfile.write(str(organic))

This returns the content in the organic listings section. There are 30 listings.

Then I add:

listings = organic.find_all("div", {"class": "info"})
i = 1
with open(filename, "a") as myfile:
    for listing in listings:
        myfile.write("this listing " + str(i) + "<br>")
        myfile.write(str(listing) + "<br>")
        i += 1

This returns the original 30 listings plus 10 more listings (from aside id="main-aside") that are not included in the variable 'organic'.

Shouldn't calling organic.find_all() limit the scope to the data in the variable 'organic'?

Using "xml", you are finding 41 occurrences of class="info"> in the result of soup.find("div", {"class": "search-results organic"}), so it is not surprising that 41 are returned by find_all. The other elements are already part of organic, which can be seen by looking at what organic returns, i.e. href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573", href="/longmont-co/mip/rays-backhoe-service-6327932?lid=216924340", and every other listing from the ten featured.

If you look at line 41 of the html you write, it contains:

href="/wray-co/mip/ritcheys-redi-mix-precast-inc-10367117?lid=1000575822573"

which is the last of the featured listings.
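A quick way to confirm this without reading the written file is to ask whether the featured aside ended up inside the tag that find returned. The following is only a sketch: the helper name contains_aside and the miniature inline page are made up for illustration, and the id "main-aside" is taken from the question. Nothing is fetched from the real site.

```python
from bs4 import BeautifulSoup


def contains_aside(section, aside_id="main-aside"):
    """Return True if an element with the given id is a descendant of section."""
    return section.find(id=aside_id) is not None


# Hypothetical miniature page: the organic section and the featured aside are siblings.
html = """
<div id="wrap">
  <div class="search-results organic">
    <div class="info">organic 1</div>
    <div class="info">organic 2</div>
  </div>
  <aside id="main-aside">
    <div class="info">featured 1</div>
  </aside>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
organic = soup.find("div", {"class": "search-results organic"})

# False here means the aside is outside the section, so find_all on
# organic cannot reach the featured listings.
aside_inside = contains_aside(organic)
print(aside_inside)
```

Running the same check on the soup built from r.content would show whether the parser in use has pulled the aside into the organic section.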

The problem is the parser. If you change the parser to "lxml":

soup = BeautifulSoup(r.content, "lxml")

organic = soup.find("div", {"class": "search-results organic"})

print(len(organic.find_all("h3", {"class": "info"})))
30

Or if you use html.parser:

soup = BeautifulSoup(r.content, "html.parser")

organic = soup.find("div", {"class": "search-results organic"})

print(len(organic.find_all("div", {"class": "info"})))
30

You get the correct result.
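To see that find_all on a tag really is limited to that tag's descendants once the tree is built correctly, here is a minimal sketch that compares the scoped count with the document-wide count. The function name count_info and the sample markup are assumptions for illustration only; the class names come from the question.

```python
from bs4 import BeautifulSoup


def count_info(markup, parser="html.parser"):
    """Return (matches inside the organic section, matches in the whole document)."""
    soup = BeautifulSoup(markup, parser)
    organic = soup.find("div", {"class": "search-results organic"})
    scoped = organic.find_all("div", {"class": "info"})
    everything = soup.find_all("div", {"class": "info"})
    return len(scoped), len(everything)


# Hypothetical sample: two organic listings plus one featured listing in an aside.
sample = """
<div class="search-results organic">
  <div class="info">organic 1</div>
  <div class="info">organic 2</div>
</div>
<aside id="main-aside">
  <div class="info">featured 1</div>
</aside>
"""

counts = count_info(sample)
print(counts)  # (2, 3)
```

Passing r.content and a different parser name to count_info is one way to compare how each parser builds the tree for the real page; if the two numbers come out equal, the featured listings have been parsed into the organic section.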

