regex - Using Regular Expressions to extract specific URLs in Python

September 15, 2012

I have parsed an HTML document containing JavaScript with BeautifulSoup, and have managed to isolate the JavaScript within it and convert it to a string. The JavaScript looks like this:

<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>

I am trying to get an array of the URLs contained in the sources array, like so:

urls = ['http://url.com/folder1/v.html',
        'http://url.com/folder2/v.html',
        'http://url.com/folder3/v.html']

The domains are unknown IPs, the folders have names of random length consisting of lowercase letters and numbers, and there are 1-5 of them in each file (usually 3). The only constants are that the URLs start with http and end with .html.

I decided to use regular expressions to deal with the problem (which I am quite new at), and my code looks like this:

urls = re.findall(r'http://[^t][^\s"]+', document)

The [^t] is there because there are other URLs in the document whose domain names start with t. The problem is that there is a URL for a jpg from the same domain as the URLs I am extracting, and it gets put into the urls array along with the others.

Example:

urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
        'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
        'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
        'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']

How would I go about fetching only the .html URLs?

You can use r'"(http.*?)"' to get the URLs within the text:

>>> import re
>>> s = """<script>
...     [irrelevant javascript code here]
...     sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
...     {file:"http://url.com/folder2/v.html",label:"label2"},
...     {file:"http://url.com/folder3/v.html",label:"label3"}],
...     [irrelevant javascript code here]
... </script>"""
>>> re.findall(r'"(http.*?)"', s, re.MULTILINE | re.DOTALL)
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']
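If only the .html URLs are wanted, the suffix can also be folded into the pattern itself. The following is a minimal sketch, not the answer's own method: it assumes every wanted URL sits inside double quotes, starts with http:// or https://, and ends with .html. The helper name extract_html_urls and the sample text are made up for illustration.

```python
import re

# Assumption: wanted URLs are double-quoted, start with http(s)://,
# contain no whitespace or quotes, and end with ".html".
HTML_URL = re.compile(r'"(https?://[^"\s]+\.html)"')


def extract_html_urls(document):
    """Return every double-quoted http(s) URL in document that ends with .html."""
    return HTML_URL.findall(document)


# Hypothetical sample: two .html sources and one .jpg on the same host.
sample = '''
sources:[{file:"http://123.45.67.89/abc123/v.html",label:"label1"},
{file:"http://123.45.67.89/def456/v.html",label:"label2"}],
image:"http://123.45.67.89/i/0123/12345/pic.jpg"
'''

print(extract_html_urls(sample))
# ['http://123.45.67.89/abc123/v.html', 'http://123.45.67.89/def456/v.html']
```

Because the character class [^"\s] can never cross a quote or a newline, this variant needs neither re.DOTALL nor re.MULTILINE.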

And for extracting the .html URLs from a list of URLs you can use str.endswith:

>>> urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
...         'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
...         'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
...         'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
>>>
>>> [i for i in urls if i.endswith('html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']
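One thing to keep in mind with a suffix check is that endswith('html') requires no dot, so a name such as v.xhtml would pass as well. A small sketch of a stricter filter follows, assuming the dot should be part of the suffix. The function name keep_html and the sample URLs are hypothetical. It relies on the fact that str.endswith accepts a tuple of suffixes.

```python
def keep_html(urls, suffixes=(".html",)):
    """Keep only the URLs that end with one of the given suffixes."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return [u for u in urls if u.endswith(suffixes)]


# Hypothetical sample data: one .html, one .jpg, one .xhtml.
sample_urls = [
    "http://123.45.67.89/a1/v.html",
    "http://123.45.67.89/i/pic.jpg",
    "http://123.45.67.89/b2/v.xhtml",
]

print(keep_html(sample_urls))
# ['http://123.45.67.89/a1/v.html']
```

Passing suffixes=(".html", ".htm") would accept both spellings in one call.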

Also, as a more general and flexible way to handle such tasks, you can use the fnmatch module:

>>> from fnmatch import fnmatch
>>> [i for i in urls if fnmatch(i, '*.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']
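The fnmatch module also provides two related helpers that may be convenient here: fnmatch.filter, which applies a pattern to a whole list, and fnmatchcase, which always compares case-sensitively (plain fnmatch normalizes case on some platforms, such as Windows). A brief sketch follows; the sample URLs are hypothetical.

```python
from fnmatch import filter as fnmatch_filter, fnmatchcase

# Hypothetical sample data.
urls = [
    "http://123.45.67.89/abc123/v.html",
    "http://123.45.67.89/def456/v.html",
    "http://123.45.67.89/i/0123/12345/pic.jpg",
]

# filter() returns the names in the list that match the pattern.
html_urls = fnmatch_filter(urls, "*.html")

# fnmatchcase() never normalizes case, regardless of the operating system.
strict_html_urls = [u for u in urls if fnmatchcase(u, "*.html")]

print(html_urls)
print(strict_html_urls)
```

For these all-lowercase URLs both approaches keep the same two .html entries.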
