design rate limiting for crawler (discussion) #11

@hyperchi

@qiuzhihui there are two use cases:

  1. Rate limiting our own crawling jobs.
  2. Preventing robots from crawling our site too often. I would assume this becomes relevant later, once robots start hitting the site thousands of times per second; we would want some way to block them.

This came to mind today while I was writing rate-limiting code. Instead of using a cron job, could we run crawling in a separate long-running process that uses a rate limiter?
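For use case 2 (blocking abusive clients), one common approach is a token bucket per client. Here is a minimal sketch; the class name and parameters are illustrative, not an existing API:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # refill tokens based on time elapsed since the last check
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# one bucket per client (e.g. keyed by IP); reject when allow() is False
bucket = TokenBucket(rate=2.0, capacity=5)
if not bucket.allow():
    pass  # reject the request, e.g. respond with HTTP 429
```

The burst capacity is what distinguishes this from a fixed-interval check: a well-behaved client can make a few quick requests, while a sustained flood gets throttled.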

I would assume crawling is something we run continuously rather than just a few times per day, since we are looking for real-time data.

But the issue is: if we crawl too fast with the same ID, we may get blocked. I don't know the exact crawling limits for T and F, but it should be something like:

import time

def getLimit(max_frequency):
    """Busy-wait loop that runs a job at most max_frequency times per second."""
    min_interval = 1.0 / max_frequency
    last_time = 0.0
    count = 0
    while True:
        # spin until min_interval has elapsed since the last run
        cur_time = time.monotonic()  # time.clock() was removed in Python 3.8
        while cur_time - last_time < min_interval:
            cur_time = time.monotonic()
        last_time = cur_time
        print(count)  # replace this print with the actual crawl call
        count += 1

getLimit(2)

You can replace the print call with a crawl function, so that we never crawl too frequently.
The downside is that this process busy-waits, so it constantly consumes CPU; the upside is that you can run it on a separate machine.
Either way, I would say a cron job is not the right tool for crawling, since cron only handles periodic jobs.
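The "always takes resources" downside can be avoided by sleeping between jobs instead of spinning. A sketch of that variant (function names are illustrative; `n_jobs` is only there to make the loop finite for testing):

```python
import time

def rate_limited_crawl(max_frequency, crawl_once, n_jobs=None):
    """Run crawl_once() at most max_frequency times per second.

    Sleeps away the remainder of each interval, yielding the CPU
    instead of busy-waiting. n_jobs=None means run forever.
    """
    min_interval = 1.0 / max_frequency
    done = 0
    while n_jobs is None or done < n_jobs:
        start = time.monotonic()
        crawl_once()
        done += 1
        # sleep only for the time left in this interval
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)

# e.g. rate_limited_crawl(2, fetch_next_page) for 2 crawls/second
```

Subtracting the elapsed crawl time before sleeping keeps the overall rate close to max_frequency even when individual crawls are slow.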
