.NET - Read only the web content of a given URL and strip the HTML and JavaScript tags out of it


January 15, 2015

I have two classes. One builds a string of the web content of a given URL from the response stream. The other strips the HTML tags using a regular expression, but it is not stripping the content down properly. I want only the content of the web page, ignoring JavaScript code, HTML, and other tags.

Second part: I want to introduce a class that reads the web content of the URL.

    public void ProcessUrl()
    {
        // Used to build the entire input
        StringBuilder sb = new StringBuilder();

        // Used on each read operation
        byte[] buf = new byte[8192];

        HttpWebRequest request = (HttpWebRequest)
            WebRequest.Create("http://www.uwl.ac.uk/why-uwl");

        HttpWebResponse response = (HttpWebResponse)request.GetResponse();

        // Read data via the response stream
        Stream resStream = response.GetResponseStream();
        string tempString = null;
        int count = 0;

        do
        {
            count = resStream.Read(buf, 0, buf.Length);

            if (count != 0)
            {
                tempString = Encoding.ASCII.GetString(buf, 0, count);
                sb.Append(tempString);
            }
        }
        while (count > 0); // Any more data to read?

        Console.WriteLine(".............................");
        Console.WriteLine(StripTagsRegex(sb.ToString()));
    }

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }
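To make the symptom concrete, here is a minimal sketch that runs the same pattern over a small page. The sample HTML and the class name `StripTagsDemo` are made up for illustration and are not taken from the real site.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Same pattern as StripTagsRegex in the question.
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample page (an assumption, not the real site's markup).
        string html = "<html><head><script>var x = 1;</script></head>" +
                      "<body><p>Hello</p></body></html>";

        // The tags are removed, but the text between <script> and </script>
        // is not a tag, so it survives.
        Console.WriteLine(StripTagsRegex(html)); // prints "var x = 1;Hello"
    }
}
```

The pattern only matches the tags themselves, so script and style bodies are left in the output. Also, `.` does not match a newline by default, so a tag that spans several lines is not matched either.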

I would not suggest using regular expressions for parsing HTML. Use an HTML-parsing library instead, e.g. HtmlAgilityPack. With it you can select all text nodes from the given HTML:

    string html; // assign the page's HTML here
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    var textNodes = doc.DocumentNode.SelectNodes("//text()");

Now you can grab the inner text of each node:

    var pageText = String.Join(" ", textNodes.Select(n => n.InnerText.Trim()));
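Since the question also asks to ignore JavaScript, one possible refinement is to remove the `script` and `style` elements before collecting the text nodes. The following is only a sketch under stated assumptions: it needs the HtmlAgilityPack NuGet package, and the class name `PageText` and method name `Extract` are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class PageText
{
    // Hypothetical helper: returns the visible text of an HTML string.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so check first.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list so nodes can be removed while iterating.
            foreach (var node in unwanted.ToList())
                node.Remove();
        }

        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return string.Empty;

        // Decode entities such as &amp; and skip whitespace-only nodes.
        var parts = textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", parts);
    }
}
```

Calling `PageText.Extract(html)` with the downloaded page would then return the text with script and style content left out. Removing the elements up front keeps the XPath simple; filtering the text nodes by ancestor would be another option.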

Downloading the HTML:

    string html;

    using (var responseStream = response.GetResponseStream())
    using (var reader = new StreamReader(responseStream))
        html = reader.ReadToEnd();

Or, more simply:

    var client = new WebClient();
    string html = client.DownloadString("http://www.uwl.ac.uk/why-uwl");
