Deploy Open LLMs with LLAMA-CPP Server

8,551 views

Prompt Engineering

1 day ago

Comments: 12
@engineerprompt 3 months ago
If you want to build robust RAG applications based on your own datasets, this is for you: prompt-s-site.thinkific.com/courses/rag
@unclecode 3 months ago
👏 I'm glad to see you're focusing on DevOps options for AI apps. In my opinion, LlamaCpp will remain the best way to launch a production LLM server. One notable feature is its support for hardware-level concurrency. Using the `-np 4` (or `--parallel 4`) flag allows running 4 slots in parallel, where 4 can be any number of concurrent requests you want. One thing to remember is that the context window gets divided across the slots: for example, if you pass `-c 4096` with 4 slots, each slot gets a context size of 1024. Adding the `--n-gpu-layers` (`-ngl 99`) flag offloads the model layers to your GPU, giving the best performance. So a command like `-c 4096 -np 4 -ngl 99` will offer excellent concurrency on a machine with a 4090 GPU.
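For illustration, a minimal sketch (not from the video) of exercising those parallel slots: start the server with something like `./llama-server -m model.gguf -c 4096 -np 4 -ngl 99 --port 8080`, then send a few concurrent requests to its OpenAI-compatible `/v1/chat/completions` endpoint. The host, port, and model name below are assumptions.

```python
# Sketch only: fire concurrent requests at a llama.cpp server started with
#   ./llama-server -m model.gguf -c 4096 -np 4 -ngl 99 --port 8080
# Host/port and the "model" value are placeholders/assumptions.
import concurrent.futures
import requests

BASE_URL = "http://localhost:8080/v1/chat/completions"  # assumed default port

def ask(question: str) -> str:
    """Send one chat completion; each in-flight request occupies one of the -np slots."""
    resp = requests.post(
        BASE_URL,
        json={
            "model": "local-model",  # placeholder name
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

questions = [
    "What is llama.cpp?",
    "Explain GGUF in one sentence.",
    "What does -ngl 99 do?",
    "Why does -np divide the context window?",
]

# With -np 4, these four requests can be served in parallel,
# each slot getting 4096 / 4 = 1024 tokens of context.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, questions):
        print(answer)
```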
@thecodingchallengeshow 1 month ago
Can we fine-tune it using LoRA? I need it to be about AI, so I have downloaded data about AI and I want to add it to this model.
@johnkost2514 3 months ago
Mozilla's Llamafile format is very flexible for deploying LLMs across operating systems. NIM has the advantage of bundling other types of models, like audio or video.
@Nihilvs 3 months ago
Amazing, thanks!
@andreawijayakusuma6008 3 months ago
Bro, I wanna ask: do I need to use a GPU to run this?
@sadsagftrwre 3 months ago
No, llama.cpp specifically enables LLMs on CPUs. It's just going to be a bit slow, mate.
@andreawijayakusuma6008 2 months ago
@@sadsagftrwre Okay, thanks for the answer. I just wanted to try it but was afraid it wouldn't work without a GPU.
@sadsagftrwre 2 months ago
@@andreawijayakusuma6008 I tried on CPU and it worked.
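For anyone who wants to try the CPU route, a minimal sketch (not from the video) using the llama-cpp-python bindings; the model path is a placeholder, and `n_gpu_layers=0` keeps every layer on the CPU.

```python
# Sketch only: run a GGUF model purely on CPU with llama-cpp-python.
# The model path below is a placeholder for a GGUF file you have downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_ctx=2048,        # context window
    n_gpu_layers=0,    # 0 = no GPU offload, everything stays on the CPU
    n_threads=8,       # tune to your core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```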
@marcaodd 3 months ago
Which server specs did you use?
@engineerprompt 2 months ago
It's running on an A6000 with 48GB of VRAM. Hope that helps.
Local RAG with llama.cpp
8:38
Learn Data with Mark
4.1K views
Deploy AI Models to Production with NVIDIA NIM
12:08
Prompt Engineering
10K views
Demo: Rapid prototyping with Gemma and Llama.cpp
11:37
Google for Developers
67K views
Qwen-Agent: Build Autonomous Agents with The Best Open Weight Model
19:34
Prompt Engineering
8K views
GGUF quantization of LLMs with llama cpp
12:10
AI Bites
3K views
Setting up a production ready VPS is a lot easier than I thought.
29:50
Mistral 7B Function Calling with llama.cpp
5:19
Learn Data with Mark
1.6K views
Go Production:  ⚡️ Super FAST LLM (API) Serving with vLLM !!!
11:53
vLLM: AI Server with 3.5x Higher Throughput
5:58
Mervin Praison
7K views
Cheap mini runs a 70B LLM 🤯
11:22
Alex Ziskind
120K views
Self-Host and Deploy Local LLAMA-3 with NIMs
13:08
Prompt Engineering
7K views