python - Basic HTML scraping with Pattern.web


I am trying to scrape the following information from IMDb:

  • Budget
  • Opening weekend gross (in the US)
  • Screens (associated with that weekend gross only)

Desired output:

$220,000,000 (estimated), $207,438,708 (usa), (4,349 screens)

I wrote the following code to get the HTML seen below:

from pattern import web
import requests

url_business = url_movie = "http://www.imdb.com/title/tt0848228/business"
business_html = requests.get(url_business)
dom = web.Element(business_html.text)

# Iterate over the children of the element with id "tn15content".
for business in dom.by_id('tn15content'):
    print(business.source)

The output (truncated) looks like this:

<div id="tn15content">
<h5>budget</h5>
$220,000,000 (estimated)<br/>
<br/>
<h5>opening weekend</h5>
$207,438,708 (usa) (<a href="/date/05-06/">6 may</a> <a href="/year/2012/">2012</a>) (4,349 screens)<br/>
&#163;15,778,074 (uk) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (521 screens)<br/>
$178,400,000 (non-usa) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>)<br/>
brl 20,387,104 (brazil) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (996 screens)<br/>
$51,640 (cambodia) (<a href="/date/05-17/">17 may</a> <a href="/year/2012/">2012</a>)<br/>
inr 110,000,000 (india) (<a href="/date/04-27/">27 april</a> <a href="/year/2012/">2012</a>)<br/>
&#8364;4,752,836 (italy) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (678 screens)<br/>
php 277,383,923 (philippines) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (479 screens)<br/>
&#8364;468,100 (portugal) (<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) (80 screens)<br/>
<br/>
<h5>gross</h5>

Because the text is not within a tag, I cannot use element.by_tag().content. How can I get this information?

Here's what I have got so far. I think it should be easy to take it from here.

from pattern import web
import requests
import sys

url = "http://www.imdb.com/title/tt0848228/business"

r = requests.get(url)
if not r.ok:
    # Stop if the request failed.
    sys.exit(-1)

d = web.Element(r.text)

x = d.getElementById('tn15content')

Split the text of the DOM element x on the <h5> headings.

strs = x.string.split('<h5>') 

Print the first two items:

print(strs[0])
print(strs[1])

Here are the rest of the elements; split them on <br />:

b = strs[2].split(r'<br />') 
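The HTML printed in the question writes the line break as `<br/>`, while the split above looks for `<br />`; which spelling you actually get depends on how the parser serializes the tag. A minimal sketch, using only the standard `re` module, that splits on either spelling. The helper name `split_on_br` and the shortened sample string are assumptions made for illustration, not part of Pattern:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r'<br\s*/?>', re.IGNORECASE)

def split_on_br(fragment):
    """Split an HTML fragment on line-break tags and drop empty pieces."""
    pieces = BR_TAG.split(fragment)
    return [p.strip() for p in pieces if p.strip()]

# Shortened sample in the shape of the page excerpt (illustrative only).
sample = "$207,438,708 (usa) (4,349 screens)<br/>$178,400,000 (non-usa)<br /> <br>"
rows = split_on_br(sample)
```

With this, the rest of the steps do not need to care which form of the tag the page or the parser happens to produce.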

Get rid of the a href strings.

import re
# Greedy match: removes everything from the first <a to the last a> on the line.
r = re.compile(r'(<a.*a>)')
for i in b:
    print(r.sub('', i))

Output:

opening weekend</h5> $207,438,708 (usa) () (4,349 screens)
&#163;15,778,074 (uk) () (521 screens)
$178,400,000 (non-usa) ()
brl 20,387,104 (brazil) () (996 screens)
$51,640 (cambodia) ()
inr 110,000,000 (india) ()
&#8364;4,752,836 (italy) () (678 screens)
php 277,383,923 (philippines) () (479 screens)
&#8364;468,100 (portugal) () (80 screens)
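The empty `()` on each line is left behind because the pattern removes the date links but not the parentheses around them. A minimal sketch that removes the whole parenthesised group instead, assuming the date always appears as one or more links inside a single pair of parentheses, as in the excerpt. The name `strip_date` is hypothetical:

```python
import re

# A parenthesised group that consists of <a ...>...</a> links, e.g.
# (<a href="/date/05-06/">6 may</a> <a href="/year/2012/">2012</a>)
DATE_GROUP = re.compile(r'\s*\(<a\b.*?</a>(?:\s*<a\b.*?</a>)*\)')

def strip_date(row):
    """Remove the linked date group, parentheses included."""
    return DATE_GROUP.sub('', row).strip()

row = ('$207,438,708 (usa) (<a href="/date/05-06/">6 may</a> '
       '<a href="/year/2012/">2012</a>) (4,349 screens)')
cleaned = strip_date(row)
```

The non-greedy `.*?` keeps each match inside one link, so groups such as `(usa)` and `(4,349 screens)` are left alone.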

I think you can take it from here to get the desired output.
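One possible way to finish, sketched with only the standard library so that it runs without a network call or Pattern installed. It assumes the fragment has the shape shown in the question and that the first opening-weekend row is the US one. `SAMPLE`, `sections`, `rows` and `extract_summary` are names invented for this illustration:

```python
import re

# Abbreviated copy of the fragment shown in the question (illustrative only).
SAMPLE = (
    '<div id="tn15content"> <h5>budget</h5> $220,000,000 (estimated)<br/> <br/> '
    '<h5>opening weekend</h5> $207,438,708 (usa) '
    '(<a href="/date/05-06/">6 may</a> <a href="/year/2012/">2012</a>) '
    '(4,349 screens)<br/>&#163;15,778,074 (uk) '
    '(<a href="/date/04-29/">29 april</a> <a href="/year/2012/">2012</a>) '
    '(521 screens)<br/> <br/> <h5>gross</h5> '
)

def sections(html):
    """Map each lower-cased <h5> heading to the raw markup that follows it."""
    parts = re.split(r'<h5>(.*?)</h5>', html)
    # parts = [before, heading1, body1, heading2, body2, ...]
    return {h.strip().lower(): body for h, body in zip(parts[1::2], parts[2::2])}

def rows(body):
    """Split a section on <br> tags, strip remaining tags, drop empty rows."""
    out = []
    for piece in re.split(r'<br\s*/?>', body):
        text = re.sub(r'<[^>]+>', '', piece)      # remove tags, keep link text
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        if text:
            out.append(text)
    return out

def extract_summary(html):
    sec = sections(html)
    budget = rows(sec['budget'])[0]
    # Assumption: the first opening-weekend row is the US one, as in the sample.
    first = rows(sec['opening weekend'])[0]
    gross = re.match(r'\S+ \([^)]*\)', first).group(0)
    screens = re.search(r'\([\d,]+ screens\)', first)
    parts = [budget, gross] + ([screens.group(0)] if screens else [])
    return ', '.join(parts)

summary = extract_summary(SAMPLE)
```

On the real page you would pass the source of the `tn15content` element instead of `SAMPLE`. The regular expressions are tied to this page layout and will need adjusting if IMDb changes it.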

