Beautiful adaptation of the fundamental ideas of paging, reference counting and copy-on-write. 👌
@dinoscheidt 1 year ago
Full circle: dynamic memory management and garbage collection. Great talk!
@simonguo1048 11 months ago
Such an elegant idea and amazingly clear explanation!
@sherlockho4613 5 months ago
Very helpful and distinguished presentation!
@TheAIEpiphany 8 months ago
Great talk and amazing work guys!
@keshmesh123 4 months ago
It was great. Thank you!
@RahulJain-wr6kx 1 month ago
Awesome 👍
@harshadkunjir5800 1 year ago
This is so great!
@vaporeon2822 8 months ago
Interesting sharing. Curious about the underlying implementation of the KV-block sharing part: you have a copy-on-write mechanism, but how does it avoid a dirty-read condition where both requests read that the ref count is 2 and both copy the block simultaneously?
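A minimal sketch of how copy-on-write with reference counting can avoid that race, assuming the block bookkeeping is driven by a single scheduler loop rather than by concurrent threads (my assumption, not something stated in the clip); the class and method names here are hypothetical, not vLLM's actual API:

```python
# Hypothetical sketch of copy-on-write over shared KV blocks.
# Assumes a single scheduler thread drives all allocate/append calls,
# so the ref-count check and the copy are never interleaved between requests.

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}          # physical block id -> number of sequences sharing it

    def allocate(self):
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # e.g. when a sequence forks for parallel sampling or beam search
        self.ref_count[block] += 1
        return block

    def append_token(self, block, copy_fn):
        """Called before a sequence writes into `block`.

        If the block is shared (ref count > 1), copy it first (copy-on-write);
        otherwise write in place. Because this runs inside one scheduler loop,
        two requests cannot both observe ref_count == 2 at the same time:
        the first writer decrements it to 1 before the second one runs.
        """
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1       # the writer leaves the shared copy
            new_block = self.allocate()      # ...and gets a private block
            copy_fn(src=block, dst=new_block)
            return new_block
        return block                          # sole owner: write in place


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4)
    b = alloc.allocate()
    alloc.share(b)                            # two sequences now share block b
    copies = []
    private = alloc.append_token(b, copy_fn=lambda src, dst: copies.append((src, dst)))
    print(private, alloc.ref_count, copies)   # writer gets a private block; b keeps ref count 1
```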
@alankhor2000 11 months ago
I think the last question was about the impact on latency.
@erkinsagroglu8519 2 months ago
7:25 How is it possible to compute attention separately, block by block? Softmax (the attention weights) is calculated over all of the previous tokens, and those softmax scores are then multiplied with all of the previous tokens' value vectors to compute the attention output for the new token. So both steps should need all of the previous tokens sitting in other blocks. What am I missing here?
@erkinsagroglu8519 2 months ago
I read the paper. It turns out the illustration is not 100% accurate (probably for the sake of making it intuitive). It indeed uses every previous block (when a sliding window is not used) while computing the attention for the next token.
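To make that concrete, here is a small NumPy illustration (mine, not the paper's kernel) of decoding one new token when the KV cache is split into fixed-size blocks: partial scores are gathered block by block, but a single softmax is taken over all previous tokens, so the result is identical to attention over a contiguous cache.

```python
# Illustrative sketch: attention for one new token over a block-wise KV cache.
# The physical blocks are non-contiguous, but the softmax covers the scores of
# *all* previous tokens, so the output matches contiguous attention.
import numpy as np

def paged_decode_attention(q, key_blocks, value_blocks):
    """q: (d,) query of the new token.
    key_blocks / value_blocks: lists of (block_size_i, d) arrays, one per KV block."""
    d = q.shape[0]
    # 1) Partial scores, computed block by block, then concatenated.
    scores = np.concatenate([kb @ q / np.sqrt(d) for kb in key_blocks])
    # 2) One softmax over all previous tokens (not one per block).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3) Weighted sum of values, again walking the blocks.
    out, start = np.zeros(d), 0
    for vb in value_blocks:
        out += weights[start:start + vb.shape[0]] @ vb
        start += vb.shape[0]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, block = 8, 4
    K, V, q = rng.normal(size=(10, d)), rng.normal(size=(10, d)), rng.normal(size=d)
    kb = [K[i:i + block] for i in range(0, 10, block)]
    vb = [V[i:i + block] for i in range(0, 10, block)]
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    ref = w @ V                                              # contiguous reference
    print(np.allclose(paged_decode_attention(q, kb, vb), ref))  # True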
@LiangyueLi 8 months ago
great work
@mshonle 1 year ago
It seems like there would be a performance increase for beam search as well? (That is, in addition to the memory savings it gets.) Would be great to see some benchmarks for that!
@erkinsagroglu8519 2 months ago
If sequences of different sizes can be processed in parallel (say request 1 is generating its 11th token and request 2 its 3rd), how can those two operations (request 1's query vector, say of dimension 1x50, dot-producted with its previous tokens' 11x50 key matrix, and a 1x50 against a 3x50) be batched together?
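One common answer (a sketch of the general technique, not necessarily vLLM's exact kernels): only the per-token dense work (the QKV and other projections) is batched as one matrix multiply across requests, while attention runs per sequence against that sequence's own cached keys and values, so the 11x50 and 3x50 shapes never need to be padded into one tensor. A rough NumPy sketch with made-up dimensions:

```python
# Rough sketch: iteration-level batching of two requests at different lengths.
# The QKV projection is shared work, batched across the two new tokens;
# attention is then done per request over that request's own cached K/V.
import numpy as np

rng = np.random.default_rng(0)
d = 50
W_qkv = rng.normal(size=(d, 3 * d))

# One new token per request in this decoding step (request 1 has 11 cached
# tokens, request 2 has 3), stacked into a single (2, d) batch.
new_tokens = rng.normal(size=(2, d))
qkv = new_tokens @ W_qkv                       # batched dense work: (2, 3d)
q1, q2 = qkv[0, :d], qkv[1, :d]

# Per-request KV caches of different lengths (stored in separate blocks).
K1, V1 = rng.normal(size=(11, d)), rng.normal(size=(11, d))
K2, V2 = rng.normal(size=(3, d)), rng.normal(size=(3, d))

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Attention is computed sequence by sequence, so no padding to a common length.
out = np.stack([attend(q1, K1, V1), attend(q2, K2, V2)])
print(out.shape)                               # (2, 50)
```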
@Karthikprath 7 months ago
How do we calculate the memory used by the KV cache in PagedAttention? For example, with an input of 500 tokens and an output of 1000 tokens.
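A back-of-the-envelope method (my own example with assumed model dimensions, not numbers from the talk): bytes per token = 2 (K and V) x num_layers x num_heads x head_dim x bytes per element, multiplied by the total tokens held (input + output). With PagedAttention the total is rounded up only to the block size, so actual usage stays close to this figure.

```python
# Back-of-the-envelope KV-cache size; the model dimensions below are assumptions
# (roughly a 13B LLaMA-style model in fp16), not numbers from the talk.
num_layers, num_heads, head_dim = 40, 40, 128
bytes_per_elem = 2                       # fp16
tokens = 500 + 1000                      # prompt + generated output

bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem  # K and V
total_bytes = bytes_per_token * tokens

print(f"{bytes_per_token / 1024:.0f} KiB per token")          # ~800 KiB
print(f"{total_bytes / 1024**3:.2f} GiB for 1500 tokens")     # ~1.14 GiB
```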
@billykotsos4642 1 year ago
sick
@julien3578 11 months ago
brilliant guys
@ameynaik2743 1 year ago
Is the vLLM engine running on the host?
@fxhp1 11 months ago
You run the server on the host that has the GPU installed; the server can then be accessed remotely over an API using OpenAI's client. Follow me for more AI vids.