How to write an endless-loop crawler in Python?

February 15, 2011

edited:

I have a crawler.py that crawls some sites every 10 minutes and sends me emails about those sites. The crawler is ready and works locally.

How can I adjust it so that the following two things happen:

It runs in an endless loop on the hosting I'll upload it to?
I am sometimes able to stop it (e.g. for debugging)?

At first, I thought of doing an endless loop, e.g.

crawler.py:

    import time

    while True:
        do_crawling()        # the existing crawl-and-email routine
        time.sleep(10 * 60)  # wait 10 minutes

However, according to the answers I got below, this is impossible, since hosting providers kill processes after a while (just for the question's sake, let's assume processes are killed every 30 minutes). Therefore, my endless-loop process would be killed at some point.

Therefore, I have thought of a different solution: let's assume that my crawler is located at "www.example.com/crawler.py" and that each time it is accessed, it executes the function run():

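    import time
    import urllib.request

    def run():
        do_crawling()
        time.sleep(10 * 60)  # wait 10 minutes
        urllib.request.urlopen("http://www.example.com/crawler.py")  # request this script again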

Thus, there is no endless loop. In fact, every time my crawler runs, it accesses the URL that executes the same crawler again. No single process runs indefinitely, and yet my crawler continues operating forever.
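To make the idea concrete, here is a minimal sketch of one "link" in such a chain. The URL, the `do_crawling` placeholder, and the helper names `trigger_next_run` and `run` are assumptions for illustration only. Note that the request for the next run uses a short timeout, since the next run sleeps before it answers; whether the server lets that next run finish is exactly the open question.

```python
import socket
import time
import urllib.request

# Hypothetical address of the deployed script; replace with your own.
CRAWLER_URL = "http://www.example.com/crawler.py"


def do_crawling():
    """Placeholder for the real crawling and e-mailing logic."""


def trigger_next_run(url, timeout=2.0):
    """Request `url` to start the next run without waiting for it to finish.

    A read timeout is expected (the next run sleeps before replying) and is
    treated as success. Returns False if the request could not be made.
    """
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
    except socket.timeout:
        return True
    except OSError:
        return False
    return True


def run(crawl=do_crawling, sleep=time.sleep, trigger=trigger_next_run,
        url=CRAWLER_URL, delay_seconds=10 * 60):
    """One link of the chain: crawl, wait, then ask the server to run again."""
    crawl()
    sleep(delay_seconds)
    return trigger(url)
```

One drawback is visible right away: if a single call to `trigger_next_run` fails, the chain stops and nothing restarts it.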

Will my idea work? Are there any hidden drawbacks I haven't thought of?

thanks!


As you stated in the comments, you are running on a public shared server such as GoDaddy and so on. Therefore cron is not available there, and long-running scripts are forbidden: your process would be killed even if you were using sleep.

Therefore, the only solution I see is to use an external server over which you have control to connect to your public server and run the script every 10 minutes. One solution could be using cron on your local machine to connect with wget or curl to a specific page on your host. **

Maybe you can find some online services that allow running a script periodically and use those, but I know of none.

** Bonus: you can get the results directly as the response, without having to send yourself an email.
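If you prefer Python over wget or curl for the local trigger, a minimal sketch could look like the one below. The URL, the function name `trigger_crawler`, and the script path in the crontab comment are assumptions; only `urllib.request.urlopen` is a real standard-library call.

```python
import urllib.request

# Hypothetical URL of the page that runs the crawler on the shared host.
CRAWLER_URL = "http://www.example.com/crawler.py"


def trigger_crawler(url=CRAWLER_URL, opener=urllib.request.urlopen, timeout=60):
    """Fetch the crawler page and return its response body as text."""
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# To use this from cron, end the file with:  print(trigger_crawler())
# Example crontab entry on the local machine (every 10 minutes); the
# script path is made up:
# */10 * * * * python3 /home/me/trigger_crawler.py
```

Because the function returns the response body, the crawl results can be read straight from the response, as mentioned in the bonus note.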

Update

So, in your updated question you propose to have the script call itself with an HTTP request. I had thought of this before, but I didn't consider it in my previous answer because I believed it wouldn't work (in general).

My concern is: will the server kill the script if the HTTP connection requesting it is closed before the script terminates?

In other words: if you open yoursite.com/script.py and it takes 60 seconds to run, and you close the connection to the server after 10 seconds, will the script run until its regular end?

I thought the answer was obviously "no, the script will be killed", and that the method was therefore useless, because you would have to guarantee that the script calling itself via an HTTP request stays alive longer than the called script. But I did a little experiment using Flask, and it proved me wrong:

    import time

    from flask import Flask

    app = Flask(__name__)


    @app.route('/')
    def hello_world():
        print('script started...')
        time.sleep(5)
        print('5 seconds passed...')
        time.sleep(5)
        print('script finished')
        return 'script finished'


    if __name__ == '__main__':
        app.run()

If I run the script, make an HTTP request to localhost:5000, and close the connection after 2 seconds, the script continues to run until the end and the messages are still printed.
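For anyone who wants to repeat this experiment against their own server, the client side can be sketched with a raw socket: send the request, wait a couple of seconds, and hang up without reading the reply. The function name `request_and_hang_up` and its defaults are assumptions chosen to match the Flask test; this is not necessarily how the original test was performed.

```python
import socket
import time


def request_and_hang_up(host="localhost", port=5000, path="/", wait_seconds=2.0):
    """Send a GET request, wait briefly, then close without reading the reply."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        time.sleep(wait_seconds)
    # Leaving the `with` block closes the socket, i.e. the client hangs up.
    return len(request)
```

Calling `request_and_hang_up()` while the server is running and then watching the server's console or log shows whether the handler keeps going after the client has disconnected.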

Therefore, with Flask, if you can make an asynchronous request to yourself, you should be able to have an "infinite loop" script.

I don't know the behavior on other servers, though. You should make a test.

Control

Assuming your server allows you to make a request and keeps the script running even if the connection is closed, you have a few things to take care of. For example, your script still has to run fast enough to complete within the maximum server time allowance. And to make your script crawl every 10 minutes with a maximum allowance of 1 minute, you have to count the calls and crawl only once every 10 calls.

In addition, this mechanism has to be controlled, because you cannot interrupt it for debugging on demand. At least, not directly.

Therefore, I suggest you use files:

Use a file to split your crawling into smaller steps, each able to finish in less than 1 minute, and continue again when the script is called again.

Use a file to count how many times the script has been called before actually doing the crawling. This is necessary if, for example, the script is allowed to live for 90 seconds but you want to crawl every 10 hours.

Use a file to control the script: store a boolean flag that you can use to stop the recursion mechanism if you need to.
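The three uses of files can be combined in one small JSON state file. The following is only a sketch under assumptions: the file layout, the names `load_state`, `save_state` and `handle_call`, and the value of `CALLS_PER_CRAWL` are all made up for illustration, and `steps` stands for your crawl split into short functions.

```python
import json
from pathlib import Path

CALLS_PER_CRAWL = 10  # assumption: one call per minute, one crawl per 10 minutes


def load_state(path):
    """Read the state file, or return defaults if it does not exist yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    return {"calls": 0, "next_step": 0, "stop": False}


def save_state(path, state):
    """Write the state back so the next invocation can pick it up."""
    Path(path).write_text(json.dumps(state))


def handle_call(path, steps, calls_per_crawl=CALLS_PER_CRAWL):
    """Process one invocation of the script.

    Returns False when the stop flag is set (the caller should then not
    re-trigger the script), True otherwise.
    """
    state = load_state(path)
    if state["stop"]:
        return False
    crawl_in_progress = state["next_step"] > 0
    crawl_is_due = state["calls"] % calls_per_crawl == 0
    if steps and (crawl_in_progress or crawl_is_due):
        steps[state["next_step"]]()  # run exactly one short step per call
        state["next_step"] = (state["next_step"] + 1) % len(steps)
    state["calls"] += 1
    save_state(path, state)
    return True
```

To stop the mechanism for debugging, you would edit the state file (for example over FTP) and set `"stop"` to `true`; the next invocation then returns without re-triggering itself.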
