python - Remove all style, scripts, and HTML tags from an HTML page


Here is what I have so far:

    from bs4 import BeautifulSoup

    def cleanme(html):
        # create a new bs4 object from the HTML data (parser named explicitly)
        soup = BeautifulSoup(html, "html.parser")
        for script in soup(["script"]):  # remove every <script> element
            script.extract()
        text = soup.get_text()
        return text

    testhtml = "<!doctype html>\n<head>\n<title>this example </title><style>.call {font-family:arial;}</style><script>getit</script><body>i need text captured<h1>and this</h1></body>"

    cleaned = cleanme(testhtml)
    print(cleaned)

This works for removing the script.

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

    def cleanme(html):
        # create a new bs4 object from the HTML data (parser named explicitly)
        soup = BeautifulSoup(html, "html.parser")
        for script in soup(["script", "style"]):  # remove JavaScript and stylesheet code
            script.extract()
        # get the text
        text = soup.get_text()
        # break into lines and remove leading and trailing space on each
        lines = (line.strip() for line in text.splitlines())
        # break multi-headlines into a line each
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text
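For comparison, here is a rough sketch of the same idea using only the standard library's html.parser module, in case installing bs4 is not an option. This is not the approach from the answer above: the names TextExtractor and cleanme_stdlib are made up for illustration, and the sketch only drops blank lines (it does not split on double spaces).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []      # text fragments seen outside skipped elements
        self.skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def cleanme_stdlib(html):
    """Hypothetical stdlib-only counterpart of cleanme()."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # strip each line and drop the blank ones
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling cleanme_stdlib(testhtml) on the test string from the question should return the visible text without the script or style contents. As with get_text(), text from adjacent elements is joined without a separator.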
