@qiuzhihui
Two use cases:
- for our crawling jobs.
- to prevent robots from crawling our site too often. I would assume this may happen later, once there are robots trying to visit the site thousands of times per second; we should have some way to block them.
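For the second use case (throttling abusive clients on the server side), one common approach is a token bucket per client. A minimal sketch, assuming the limits (`rate`, `capacity`) and the per-IP keying are choices of ours, not anything from the original post:

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilling at `rate` tokens/sec."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: reject or throttle this client

buckets = {}  # one bucket per client IP (hypothetical keying scheme)

def check(ip, rate=5, capacity=10):
    bucket = buckets.setdefault(ip, TokenBucket(rate, capacity))
    return bucket.allow()

# a client hammering us: the first `capacity` requests pass, the rest are rejected
decisions = [check("1.2.3.4") for _ in range(15)]
print(decisions.count(True), decisions.count(False))
```

A real deployment would do this in the web server or a reverse proxy rather than application code, but the accounting is the same.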
This just came to mind today while I was writing code for rate limiting.
Instead of using a cron job, we could have a separate long-running process for crawling that uses a rate limiter.
I would assume crawling is something we run constantly, rather than just a few times per day, since we are looking for real-time data.
The issue is that if we crawl too fast with the same ID, we may get blocked. I don't know the rate limits for crawling T and F, but it should be something like:
import time

def getLimit(max_frequency):
    min_interval = 1.0 / max_frequency  # seconds between calls
    last_time = [0.0]
    i = [0]

    def rateLimit():
        # sleep until at least min_interval has passed since the last call
        elapsed = time.monotonic() - last_time[0]
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        last_time[0] = time.monotonic()
        print(i[0])
        i[0] += 1

    while True:
        rateLimit()

getLimit(2)

You can replace the i[0] counter with a call to an actual function, so that we won't crawl too frequently.
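One way to sketch that "replace i[0] with a function" idea is a rate-limiting decorator, so any crawl function can be wrapped without touching its body. The names `rate_limited` and `crawl_page` are mine, for illustration only:

```python
import time

def rate_limited(max_frequency):
    """Decorator: allow the wrapped function at most max_frequency calls per second."""
    min_interval = 1.0 / max_frequency
    last_time = [None]  # mutable cell so the closure can update it

    def decorate(func):
        def wrapper(*args, **kwargs):
            if last_time[0] is not None:
                elapsed = time.monotonic() - last_time[0]
                if elapsed < min_interval:
                    time.sleep(min_interval - elapsed)
            last_time[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorate

@rate_limited(2)  # at most 2 calls per second
def crawl_page(url):
    return "fetched " + url  # placeholder for the real HTTP fetch

start = time.monotonic()
results = [crawl_page("http://example.com/page%d" % n) for n in range(5)]
elapsed = time.monotonic() - start
print(results[0], "in", round(elapsed, 2), "seconds")
```

Sleeping instead of busy-waiting also addresses the resource concern below, at least for CPU.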
The downside is that this process always consumes resources;
the upside is that you can use a separate machine for the job.
But I would say a cron job may not be the right solution for crawling, since it only handles periodic jobs.