design rate limiting for crawler (discussion) #11

@hyperchi

@qiuzhihui there are two use cases:

  1. Rate limiting our own crawling jobs.
  2. Preventing robots from crawling our site too often. I would assume this becomes relevant later, once robots start hitting the site thousands of times per second; we would want some way to block them.

This came to mind today while I was writing rate-limiting code. Instead of using a cron job, could we run crawling in a separate long-running process that uses a rate limiter?
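For use case 2 (blocking abusive clients), one common approach is a token bucket per client. Here is a minimal sketch; the class name and parameters are illustrative, not an existing API:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # refill tokens based on time elapsed since the last check
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# one bucket per client (e.g. keyed by IP); reject when allow() is False
bucket = TokenBucket(rate=2.0, capacity=5)
if not bucket.allow():
    pass  # reject the request, e.g. respond with HTTP 429
```

The burst capacity is what distinguishes this from a fixed-interval check: a well-behaved client can make a few quick requests, while a sustained flood gets throttled.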

I would assume crawling is something we run continuously rather than just a few times per day, since we are looking for real-time data.

But the issue is: if we crawl too fast with the same ID, we may get blocked. I don't know the exact crawling limits for T and F, but it should be something like:

import time

def getLimit(max_frequency):
    """Busy-wait loop that runs a job at most max_frequency times per second."""
    min_interval = 1.0 / max_frequency
    last_time = 0.0
    count = 0
    while True:
        # spin until min_interval has elapsed since the last run
        cur_time = time.monotonic()  # time.clock() was removed in Python 3.8
        while cur_time - last_time < min_interval:
            cur_time = time.monotonic()
        last_time = cur_time
        print(count)  # replace this print with the actual crawl call
        count += 1

getLimit(2)

You can replace the print call with a crawl function, so that we never crawl too frequently.
The downside is that this process busy-waits, so it constantly consumes CPU; the upside is that you can run it on a separate machine.
Either way, I would say a cron job is not the right tool for crawling, since cron only handles periodic jobs.
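The "always takes resources" downside can be avoided by sleeping between jobs instead of spinning. A sketch of that variant (function names are illustrative; `n_jobs` is only there to make the loop finite for testing):

```python
import time

def rate_limited_crawl(max_frequency, crawl_once, n_jobs=None):
    """Run crawl_once() at most max_frequency times per second.

    Sleeps away the remainder of each interval, yielding the CPU
    instead of busy-waiting. n_jobs=None means run forever.
    """
    min_interval = 1.0 / max_frequency
    done = 0
    while n_jobs is None or done < n_jobs:
        start = time.monotonic()
        crawl_once()
        done += 1
        # sleep only for the time left in this interval
        elapsed = time.monotonic() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)

# e.g. rate_limited_crawl(2, fetch_next_page) for 2 crawls/second
```

Subtracting the elapsed crawl time before sleeping keeps the overall rate close to max_frequency even when individual crawls are slow.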
