Inference is very inefficient

Running the sample code with a 4096 context and 768 steps takes 8 min per question on an H20 GPU and occupies about 50 GB of GPU memory.

- 4096 context + 768 steps: 8 min, 50 GB GPU memory
- 2048 context + 768 steps: 4 min, 31 GB GPU memory
- 768 context + 768 steps: 2 min, 20 GB GPU memory
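
To make the scaling easier to see, here is a rough back-of-the-envelope sketch over the three measurements above. It assumes the wall time is spread evenly over the 768 steps and that memory grows roughly linearly with context length. The names `measurements` and `per_token_growth` are made up for this illustration and are not part of the sample code.

```python
# Measurements reported above: (context tokens, minutes per question, GPU memory in GB)
measurements = [
    (768, 2, 20),
    (2048, 4, 31),
    (4096, 8, 50),
]
STEPS = 768  # same number of steps in all three runs


def per_token_growth(points, column):
    """Increase of one column per extra context token, between consecutive runs."""
    growth = []
    for a, b in zip(points, points[1:]):
        growth.append((b[column] - a[column]) / (b[0] - a[0]))
    return growth


# Assumption: time is divided evenly across steps.
seconds_per_step = [minutes * 60 / STEPS for _, minutes, _ in measurements]

# Memory growth in MB per extra context token (1 GB taken as 1024 MB).
mem_mb_per_token = [g * 1024 for g in per_token_growth(measurements, 2)]

for (ctx, minutes, mem_gb), sps in zip(measurements, seconds_per_step):
    print(f"{ctx:>5} context: {sps:.3f} s/step, {mem_gb} GB")
print("memory growth (MB per context token):",
      [round(m, 1) for m in mem_mb_per_token])
```

On these numbers, the time per step doubles each time the context doubles from 2048 to 4096, and memory grows by roughly 9 MB per additional context token.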