python - Basic HTML with Pattern.web
I am trying to scrape the following information from IMDb:
- budget
- weekend gross (in the US)
- screens (associated with the US weekend gross only)
Desired output:
$220,000,000 (estimated), $207,438,708 (usa), (4,349 screens)
I wrote the following code to get the HTML seen below:
    from pattern import web
    import requests

    url_business = url_movie = "http://www.imdb.com/title/tt0848228/business"
    business_html = requests.get(url_business)
    dom = web.Element(business_html.text)
    for business in dom.by_id('tn15content'):
        print business.source

The output (truncated) looks like this:
    <div id="tn15content">
    <h5>budget</h5> $220,000,000 (estimated)<br/> <br/>
    <h5>opening weekend</h5>
    $207,438,708 (usa) (<a href="/date/05-06/">6 may</a> <a href="/year/2012/">2012</a>) (4,349 screens)<br/>
    £15,778,074 (uk) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (521 screens)<br/>
    $178,400,000 (non-usa) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>)<br/>
    brl 20,387,104 (brazil) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (996 screens)<br/>
    $51,640 (cambodia) (<a href="/date/05-17/">17 may</a> <a href="/year/2012/">2012</a>)<br/>
    inr 110,000,000 (india) (<a href="/date/04-27/">27 april</a> <a href="/year/2012/">2012</a>)<br/>
    €4,752,836 (italy) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (678 screens)<br/>
    php 277,383,923 (philippines) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (479 screens)<br/>
    €468,100 (portugal) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (80 screens)<br/> <br/>
    <h5>gross</h5>

Because the text is not within a tag, I cannot use element.by_tag().content. How do I get this information?
Here's what I have got so far. I think it should be easy to take it from here.
    from pattern import web
    import requests
    import sys

    url = "http://www.imdb.com/title/tt0848228/business"
    r = requests.get(url)
    if not r.ok:
        sys.exit(-1)
    d = web.Element(r.text)
    x = d.getElementById('tn15content')

Split the text of the DOM element x on the <h5> headings:

    strs = x.string.split('<h5>')

Print the first 2 items:

    print strs[0]
    print strs[1]

Here are the rest of the elements; split them on <br />:

    b = strs[2].split(r'<br />')

Get rid of the <a href> strings:

    import re
    r = re.compile(r'(<a.*a>)')
    for i in b:
        print r.sub('', i)

Output:

    opening weekend</h5> $207,438,708 (usa) () (4,349 screens)
    £15,778,074 (uk) () (521 screens)
    $178,400,000 (non-usa) ()
    brl 20,387,104 (brazil) () (996 screens)
    $51,640 (cambodia) ()
    inr 110,000,000 (india) ()
    €4,752,836 (italy) () (678 screens)
    php 277,383,923 (philippines) () (479 screens)
    €468,100 (portugal) () (80 screens)
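The greedy pattern `<a.*a>` gives the right result here only because both links on a line sit inside the same pair of parentheses, and it leaves an empty `()` behind. A minimal sketch of a slightly stricter cleanup follows, assuming every entry has the shape shown in the question's HTML. The helper name `strip_date_links` and the sample string are illustrative and not part of Pattern.

```python
import re

# A parenthesised group that contains nothing but <a>...</a> links and
# whitespace, e.g. (<a href="/date/05-06/">6 may</a> <a href="/year/2012/">2012</a>)
DATE_GROUP = re.compile(r'\s*\((?:\s*<a\b[^>]*>.*?</a>\s*)+\)')

def strip_date_links(entry):
    """Remove the linked date group from one entry and keep the other text."""
    return DATE_GROUP.sub('', entry).strip()

sample = ('$207,438,708 (usa) (<a href="/date/05-06/">6 may</a> '
          '<a href="/year/2012/">2012</a>) (4,349 screens)')
print(strip_date_links(sample))
```

Removing the whole group, parentheses included, avoids the stray `()` in the output, and the non-greedy `.*?` keeps each match confined to one link.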
I think you can take it from here to get the desired output.
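One possible way to finish the job is sketched below, assuming the fragment looks like the truncated output in the question. It works on the HTML string directly, so it does not depend on Pattern. The function name `extract_business` and the shortened sample string are hypothetical, and the matching is case-insensitive because the live page may capitalise the headings and country names differently.

```python
import re

def extract_business(html):
    """Return (budget, us_gross, us_screens) from the tn15content fragment.

    Any value that cannot be found is returned as None.
    """
    # Budget: the text between the budget heading and the next <br/>.
    budget = re.search(r'<h5>budget</h5>\s*(.*?)\s*<br\s*/?>', html, re.I | re.S)
    # Opening weekend section: everything up to the next <h5> or end of input.
    section = re.search(r'<h5>opening weekend</h5>(.*?)(?=<h5>|$)', html, re.I | re.S)
    gross = screens = None
    if section:
        for entry in re.split(r'<br\s*/?>', section.group(1)):
            text = re.sub(r'<[^>]+>', '', entry).strip()  # drop tags, keep text
            m = re.match(r'(\S+ \(usa\))', text, re.I)     # only the US entry
            if m:
                gross = m.group(1)
                s = re.search(r'\([\d,]+ screens?\)', text, re.I)
                screens = s.group(0) if s else None
                break
    return (budget.group(1) if budget else None, gross, screens)

# Shortened copy of the fragment from the question (illustrative only).
sample = (
    '<div id="tn15content"> <h5>budget</h5> $220,000,000 (estimated)<br/> <br/> '
    '<h5>opening weekend</h5> $207,438,708 (usa) (<a href="/date/05-06/">6 may</a> '
    '<a href="/year/2012/">2012</a>) (4,349 screens)<br/>'
    '$178,400,000 (non-usa) (<a href="/date/04-29/">29 april</a> '
    '<a href="/year/2012/">2012</a>)<br/>'
    '$51,640 (cambodia) (<a href="/date/05-17/">17 may</a> '
    '<a href="/year/2012/">2012</a>)<br/> <br/> <h5>gross</h5>'
)

result = extract_business(sample)
print(', '.join(value for value in result if value))
```

The `(non-usa)` entry is skipped because the pattern requires the country group to be exactly `(usa)`, and the screen count is read from the same entry as the US gross, so it is always the one associated with that figure.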