regex - Using Regular Expressions to extract specific URLs in Python
I have parsed an HTML document containing JavaScript with BeautifulSoup, and have managed to isolate the JavaScript within it and convert it to a string. The JavaScript looks like this:

<script>
[irrelevant javascript code here]
sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
{file:"http://url.com/folder2/v.html",label:"label2"},
{file:"http://url.com/folder3/v.html",label:"label3"}],
[irrelevant javascript code here]
</script>

I am trying to get an array of the URLs contained in the sources array, like so:

urls = ['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

The domains are unknown IPs, the folders have random names of varying length consisting of lowercase letters and numbers, and there are 1-5 of them in each file (usually 3). All that is constant is that the URLs start with http and end with .html.
I decided to use regular expressions to deal with the problem (which I am quite new at), and my code looks like this:

urls = re.findall(r'http://[^t][^\s"]+', document)

The [^t] is there because there are other URLs in the document whose domain names start with t. My problem is that there is another URL, pointing to a jpg, on the same domain as the URLs I am extracting, and it gets put into the urls array along with the others.
Example:

urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
        'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
        'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
        'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']

How would I go about fetching only the html URLs?
You can use r'"(http.*?)"' to get the URLs within your text:

>>> import re
>>> s="""<script>
... [irrelevant javascript code here]
... sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
... {file:"http://url.com/folder2/v.html",label:"label2"},
... {file:"http://url.com/folder3/v.html",label:"label3"}],
... [irrelevant javascript code here]
... </script>"""
>>> re.findall(r'"(http.*?)"',s,re.MULTILINE|re.DOTALL)
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

And for extracting only the .html entries from a list of URLs you can use str.endswith:

>>> urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
... 'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
... 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
... 'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
>>>
>>> [i for i in urls if i.endswith('html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']

Also, as a more general and flexible way for such tasks, you can use the fnmatch module:

>>> from fnmatch import fnmatch
>>> [i for i in urls if fnmatch(i,'*.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']
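The extraction and the .html filter can probably also be folded into a single pattern, so that the jpg URL never enters the list in the first place. The following is only a sketch: the sample document is modeled on the question, and the image key holding the jpg URL is a hypothetical stand-in for wherever that URL really appears.

```python
import re

# Sample text modeled on the question. The "image" key is hypothetical;
# it only stands in for wherever the .jpg URL lives in the real page.
document = '''<script>
image:"http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg",
sources:[{file:"http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html",label:"label1"},
{file:"http://123.45.67.89/alwefaoewifiasdof224a/v.html",label:"label2"}],
</script>'''

# Match a double-quoted string that starts with http:// and ends with .html.
# [^"\s]+ cannot cross a quote or whitespace, so a match never runs past
# the end of the string it started in.
HTML_URL = re.compile(r'"(http://[^"\s]+\.html)"')

urls = HTML_URL.findall(document)
print(urls)
```

Note that this pattern does not exclude the URLs whose domain starts with t; if any of those end in .html, the [^t] from the question (or a negative lookahead such as (?!t) placed after http://) would still be needed.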