.net - Read only the web content of a given URL and strip HTML and JavaScript tags out of it - C# regex
I have two classes: one reads the response stream and builds a string of the web content of a given URL, and the other strips HTML tags using a regular expression. The regex is not stripping the content down properly. I only want the content of the web page; I want to ignore JavaScript code, HTML, and any other tags.
Second part: I want to introduce a class that reads the web content of a URL.
    public void ProcessUrl()
    {
        // used to build the entire input
        StringBuilder sb = new StringBuilder();

        // used on each read operation
        byte[] buf = new byte[8192];

        HttpWebRequest request = (HttpWebRequest) WebRequest.Create("http://www.uwl.ac.uk/why-uwl");
        HttpWebResponse response = (HttpWebResponse) request.GetResponse();

        // read data via the response stream
        Stream resStream = response.GetResponseStream();
        string tempString = null;
        int count = 0;

        do
        {
            count = resStream.Read(buf, 0, buf.Length);
            if (count != 0)
            {
                tempString = Encoding.ASCII.GetString(buf, 0, count);
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?

        Console.WriteLine(".............................");
        Console.WriteLine(StripTagsRegex(sb.ToString()));
    }

    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }
I do not suggest using regular expressions for parsing HTML. Use an HTML-parsing library instead, e.g. HtmlAgilityPack. You can select all text nodes from the given HTML:

    string html; // your HTML goes here
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    var textNodes = doc.DocumentNode.SelectNodes("//text()");

Now you can grab the inner text of each node:

    var pageText = string.Join(" ", textNodes.Select(n => n.InnerText.Trim()));
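Since the question also asks to ignore JavaScript, note that `//text()` matches the text inside `<script>` and `<style>` elements too. One possible way around that is to remove those elements before collecting the text. The following is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class and method names (`HtmlTextExtractor`, `ExtractVisibleText`) are made up for illustration and are not part of the library.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, named for illustration only.
public static class HtmlTextExtractor
{
    public static string ExtractVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list first so removal does not disturb the enumeration.
            foreach (var node in unwanted.ToList())
                node.Remove();
        }

        var textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return string.Empty;

        // Decode entities such as &amp; and drop whitespace-only nodes.
        var pieces = textNodes
            .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
            .Where(t => t.Length > 0);

        return string.Join(" ", pieces);
    }
}
```

With this in place, `Console.WriteLine(HtmlTextExtractor.ExtractVisibleText(html));` could stand in for the `StripTagsRegex` call in the question. The null checks matter because `SelectNodes` does not return an empty collection for a page without matches.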
Downloading the HTML:

    string html;
    using (var responseStream = response.GetResponseStream())
    using (var reader = new StreamReader(responseStream))
        html = reader.ReadToEnd();

Or, more simply:

    var client = new WebClient();
    string html = client.DownloadString("http://www.uwl.ac.uk/why-uwl");