Small update: In one of the latest versions of the OobaBooga WebUI, the "AutoAWQ" loader has been removed from the drop-down menu. Select the "Transformers" loader for AWQ-quantized models instead, and the WebUI will use AutoAWQ automatically. The context length slider is also gone; you can now use the "alpha_value" setting to increase the model's max context length if you have the VRAM to spare. Everything else should work exactly the same. 🙌
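For reference, a minimal sketch of roughly what the Transformers loader does behind the scenes when it detects an AWQ checkpoint: a plain Hugging Face transformers call with the autoawq package installed. The repo name is assumed to be the AWQ build of the model used in the video; adjust it to whatever you actually downloaded.

```python
# Minimal sketch: loading an AWQ-quantized model through plain transformers.
# Requires: pip install transformers autoawq accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/dolphin-2.6-mistral-7B-AWQ"  # assumed AWQ repo of the model from the video

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # AWQ weights are computed in fp16
    device_map="auto",          # place as much as possible on the GPU
)

prompt = "Write a one-sentence greeting."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```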
@MrAlienXYZ1 · 2 months ago
I tried to follow your video, but it's impossible for me to get a response because of a CUDA error. Do you know anything about it?
@lastcron · 8 days ago
Are you going to make a new video about the newest version of Oobabooga, man? A new version or an alternative would be great; AWQ isn't working.
@Heldn100 · 8 months ago
Thanks for this great video! I love it. Can you make a deeper dive into use cases and extensions, and if you can, some model recommendations for specific things, maybe characters, and fine-tuning too...
@xzendor7digitalartcreations · 3 months ago
Thanks for the share. Since I don't have an Nvidia GPU, I had to use CPU mode, but the models would not run until I tracked down info on using GGUF versions of the models, which are quantized versions. After downloading these models, which are available on HuggingFace, I was able to run the AI without issues on an Intel 12700F CPU. The result was very responsive compared to other desktop AIs that I have tried. I tested this with the quantized Llama 3 8B Instruct GGUF (nmerkle/Meta-Llama-3-8B-Instruct-ggml-model-Q4_K_M.gguf) and the quantized Llama 3.1 70B Instruct GGUF (MaziyarPanahi/Meta-Llama-3.1-70B-Instruct-GGUF). The 4-bit quantized 70B parameter model runs on my 64GB system but it is slow, so I recommend just sticking with the 8B parameter model.
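For anyone who wants to reproduce CPU-only GGUF inference outside the WebUI, here is a minimal sketch using the llama-cpp-python bindings (the llama.cpp backend that GGUF loaders are built on). The model path and thread count are placeholders to adjust for your own download and CPU.

```python
# Minimal sketch: running a quantized GGUF model on CPU only with llama-cpp-python.
# Requires: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",  # path to your downloaded GGUF file (assumed)
    n_ctx=4096,       # context window
    n_gpu_layers=0,   # 0 = pure CPU inference
    n_threads=8,      # tune to your physical core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what quantization does in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```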
@snailman7989 · 12 days ago
I'm leaving a comment just to help with the algorithm, because this was so helpful.
@digibrayy22 · 5 months ago
The fastest textgen I've ever tried, faster than Agnaistic.
@DarkSentinel52 · 5 months ago
I get this error: RuntimeError: CUDA error: no kernel image is available for execution on the device. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
@UltraStyle-AI · 6 months ago
Nice video explaining this. Thanks for sharing!
@fiuzzii · 1 month ago
Great video. Can you make a more in-depth tutorial? You can probably come up with something better, but for example make some changes to the AI to make it more human or something, or just explore and explain this software better.
@GraghMSM · 6 months ago
While downloading the dolphin model, one of the bars is stuck at 0%. It's been a few hours and nothing has changed.
@mytechantics · 5 months ago
Check whether your terminal window output is paused; if you click somewhere inside the window and leave it at that, the script execution stops. Press Enter while inside the active terminal window and it should resume. If it's not that, then I'd just try re-downloading the model, or picking a different one.
@paintingvalues · 5 months ago
No matter what I do, I keep getting this message in the console when I send a message in the chat: "ValueError: No model is loaded! Select one in the Model tab." I load the model, but I get this every time.
@mytechantics · 5 months ago
Hmm, that definitely sounds strange, but I honestly have no better idea here than to reinstall the WebUI and make sure all the dependencies are there.
@siddhartharoy1306 · 7 months ago
No Scroll down option for Character gallery
@mytechantics · 7 months ago
Make sure that in the Session tab, in the Available Extensions section, the gallery option is enabled. Then the character gallery should be available at the bottom of the Chat tab after scrolling down.
@DarkSentinel52 · 5 months ago
FEDS CANT SPY ON ME 🗣🗣🗣🔥🔥🔥
@wirek69 · 5 days ago
What would be the recommended model for a GPU with 12GB of VRAM?
@alexishungry5644 · 25 days ago
Man, I can't even download it properly. It just tells me the filename or extension is too long and that I have to recreate it.
@danwestwood9663 · 6 months ago
Hey, thanks a lot for doing this video. I was so stuck with integrating and downloading a model; even after an AI course there are so many technologies. Look, can we run one on GitHub Pages with Oobabooga and serve it somehow through an API?
@mytechantics · 6 months ago
Hi, thanks for the comment. Sadly I'm not familiar enough with GitHub Pages to answer that question
@rfdouglas7424 · 5 months ago
Hi, I tried to load this model and got the following error: ModuleNotFoundError: No module named 'awq'. What could it be? 😅
@mytechantics · 5 months ago
I would try reinstalling the WebUI; make sure you run all the installation files as administrator.
@rfdouglas7424 · 5 months ago
@@mytechantics ty
@investigator7984 · 6 months ago
When my context length (the amount of stuff that's been written in the chat) in the conversation reaches the max_seq_len value I've set, the chatbot starts to reply with empty messages only. I can't increase max_seq_len indefinitely because the answers will take too long to generate. Surely there must be a way to continue chatting beyond the max_seq_len, because a conversation reaches that point pretty quickly?
@mytechantics · 5 months ago
Well, the max context length or context window you can set without any slowdowns is dependent on how much VRAM your GPU has. If you set the max context length too high, and the program starts to use your main system RAM in addition to your VRAM, you will experience slowdowns. This is the main reason I've set the context length to such a low value in this video - I have only 8GB of VRAM on my current testing GPU (RTX 2070 Super).
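To put rough numbers on that, below is a back-of-the-envelope estimate of how much extra VRAM the conversation context (the KV cache) costs on top of the model weights themselves. The architecture numbers are assumptions for a Mistral-7B-style model (32 layers, 8 KV heads, head dimension 128, 16-bit cache); exact figures vary by model and loader.

```python
# Rough sketch: estimating the KV cache size for a given context length.
# Architecture parameters below assume a Mistral-7B-style model.
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    # 2x because both keys and values are cached for every layer and every token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_len

for ctx in (2048, 4096, 8192, 16384):
    print(f"{ctx:>6} tokens -> ~{kv_cache_bytes(ctx) / 1024**3:.2f} GiB of KV cache")
```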
@investigator7984 · 5 months ago
@@mytechantics - I don't feel like you understood what I was asking. For example, I set my max_seq_len to 2048, because that's what my hardware can handle. When the context length in my chat reaches 2048 (you can see this in the cmd window), which takes maybe 15 minutes of chatting, the chatbot becomes incapable of generating any more responses and returns empty responses only. I feel like that's not how it should work? It would render the entire web UI incapable of any longer and entertaining chats. There must be something I'm missing for the bot to keep replying after it reaches the max_seq_len.
@investigator7984 · 5 months ago
I experimented a bit and I think it has something to do with the model loader but I don't understand it fully.
@mytechantics · 5 months ago
Oh, I see what you mean now. Generally, once you exceed the max context length your GPU can handle, you'll get either scrambled, nonsensical output or no output at all. Since after some time you naturally can't fit any more conversation data in your VRAM (and as far as I know, after each new message all of the existing conversation data has to be processed as the conversation context), I presume the only solution would be to automatically cut the oldest part of the current conversation context window in favor of the new messages as you go. This would of course mean that your character wouldn't "remember" what happened outside of that moving context window. I don't know if that's possible in OobaBooga, though.
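The "moving context window" idea described above can be sketched in a few lines. This is a generic illustration rather than OobaBooga's actual code, and it uses a crude word count in place of real tokenization.

```python
# Sketch of a moving context window: always keep the system prompt, drop the
# oldest chat turns until the remaining conversation fits the token budget.
def truncate_history(system_prompt, history, max_tokens=2048, reserve_for_reply=256):
    def count_tokens(text):
        return len(text.split())  # rough stand-in for a real tokenizer

    budget = max_tokens - reserve_for_reply - count_tokens(system_prompt)
    kept, used = [], 0
    for message in reversed(history):       # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break                           # everything older is "forgotten"
        kept.append(message)
        used += cost
    return [{"role": "system", "content": system_prompt}] + list(reversed(kept))

# Example: only the most recent messages that fit the budget survive.
history = [{"role": "user", "content": f"message {i} " * 50} for i in range(100)]
print(len(truncate_history("You are a helpful character.", history)))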
@JosephNewton57 · 5 months ago
I followed you through to the end. I have deleted and reinstalled everything multiple times using the files you suggested, but I'm unable to get the chat to answer anything; it just sits there typing and never finishes. I thought I had figured it out when you adjusted the memory, but still no joy. My system is an AMD Ryzen 5 2600, 32GB of DDR4 memory, and an AMD RX 580. Not great, but it should work, so why not? I have screenshots of everything, an HWINFO file, etc., if you can help me.
@kdzvocalcovers3516 · 4 months ago
This program can't be installed... error after error... I've tried 20 damn times.
@And-lj5gb · 6 months ago
I think I followed everything in the video, but I don't have this "Character gallery" thing.
@mytechantics · 6 months ago
Make sure that in the Session tab, in the Available Extensions section, the gallery option is enabled, and then restart the WebUI. The character gallery should then be available at the bottom of the Chat tab after scrolling down!
@And-lj5gb · 6 months ago
@@mytechantics - it works, thanks!
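For reference, if the gallery checkbox doesn't stick between sessions, the extension can also be enabled at startup. In the one-click installer this is typically done by adding the flag below to the CMD_FLAGS.txt file in the WebUI folder (flag name as used by text-generation-webui; check your version's --help output):

```
--extensions gallery
```

The same flag can also be passed directly when launching server.py.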
@peaskygt5928 · 6 months ago
"Failed to load the model" while it's at "Fusing layers: 62%" - any solution?
@mytechantics · 6 months ago
It might be that you've run out of VRAM when loading the model and that's why the process failed. If that's the problem, you simply need to use a smaller model (watch your VRAM usage in the Task Manager as shown). If that's not the case, try a few different models suitable for your GPU and see if the problem persists.
@peaskygt5928 · 6 months ago
@@mytechantics Oh, it's the GPU for sure... I've got a 1050 Ti with 4GB of VRAM. Can you recommend which model I should use? I tried the one from your video, and then the Wizard-Vicuna-7B-Uncensored model you suggested below my comment for some guy. Same problem, but that Wizard one loaded to 78% or a bit more.
@mytechantics · 6 months ago
@@peaskygt5928 Oh, I see, unfortunately that's going to be a problem. Running any LLMs on a GPU with 4GB of VRAM is going to be very hard, if not impossible. You could try running some smaller GGUF format models, splitting them between your GPU VRAM (4GB) and main system RAM on load (if you have a sufficient amount of RAM for the model to load); however, mind that the inference can get extremely slow in that configuration. See this thread here, the guy also has a 1050 Ti 4GB: www.reddit.com/r/Oobabooga/comments/17pq7he/models_on_low_pc/ Sadly, I can't guarantee it will work. I've only done experiments with 8GB VRAM cards, and that seems to be the absolute minimum for fitting and fully loading smaller quantized models in most cases.
@batteryphil · 7 months ago
I'm getting this error: NotImplementedError: Cannot copy out of meta tensor; no data!
@mytechantics · 7 months ago
Your GPU might not have enough VRAM to fit the model with your chosen load settings, plus there might be some problems with the CPU off-loading process (which you don't want to kick in anyway). Similar problem here: github.com/oobabooga/text-generation-webui/issues/4965#issuecomment-1886094064 and here, better explained in another piece of software: github.com/togethercomputer/OpenChatKit/issues/87#issuecomment-1537234491 You could try and run a smaller AWQ model like this one with a smaller context window setting (starting with like 1000-1500), and see if it works: huggingface.co/TheBloke/Mistral-Pygmalion-7B-AWQ
@CarpeUniversum · 7 months ago
You might try a GPTQ model at 7B instead of AWQ. I have a 3090, and AWQ has never worked correctly for me. I've heard that AWQ models, while faster and newer, have all kinds of random issues. GPTQ models work beautifully for me. Search Huggingface for something like "7b gptq" to find tons of options. Add "dolphin mixtral" if you wanna find this exact one in GPTQ variety.
@mytechantics · 7 months ago
Interesting take, @batteryphil you should definitely try that. I've been mostly trying out AWQ models up until now, so I definitely need to experiment more with different model quantization methods like GPTQ in the future.
@CarpeUniversum · 7 months ago
I'm genuinely starting to think something is borked up with the most recent OOBABOOGA version, too. I've had all this running buttery smooth on my old pc with a worse vid card without issue. In fact, that pc hasn't been changed or powered on in months, and I'm gonna set it up tomorrow or Sunday, and see if things still work properly there with the older setup. ~SOMETHING~ weird is going on somewhere.
@thrushithchowdaryyelamanch9892 · 6 months ago
How can I restart the web UI again after closing it?
@mytechantics · 5 months ago
You just need to run the start_windows.bat file again.
@CarpeUniversum · 7 months ago
Any idea why, even when I drop to 100 max_seq_len with a 7B AWQ or GPTQ model, it consumes literally all of the RAM I have, both video and system? 24 gigs of video RAM and 60 gigs of system RAM for that is absurd. I've had these running in the past with no issues, on a lesser video card, and I simply do not understand what is happening.
@mytechantics · 7 months ago
Yeah, that definitely sounds weird. I honestly have no idea what could be taking up such an absurd amount of memory in this case. If this situation is easily reproducible, I'd open an issue for it on the official GitHub repo, but I myself have no clue what might be causing it. On the other hand, a 100-token context length seems very little for most practical purposes. With a GPU like the 3090 with 24GB of VRAM, you should be able to easily run larger models with context lengths higher than a few thousand tokens.
@CarpeUniversum · 7 months ago
@@mytechantics It went away after a reset. Perhaps it was still hung up on the messed-up AWQ model... But I'm also having trouble getting literally anything to work past about 40 lines of dialog, without really changing any settings. In the past I was able to have long chats and conversations that would go well past 1000 messages... Now, no matter what model I use, stuff seems to just go insane after around 40, spitting out random gibberish or repeating itself... It's like the whole thing is messed up. I see a few others complaining of similar issues since an early March update to OOBABOOGA, but OOBABOOGA insists all is well. As is, unfortunately, this is all almost useless to me.
@mytechantics · 7 months ago
@@CarpeUniversum Hmm, unfortunately I won't be able to help you out here, as with the petty 8GB of VRAM I have, I'm not able to set my context length to more than 2k tokens in my use cases anyway, so my test conversations can almost never exceed ~30 relatively short messages without getting into the out-of-context random gibberish area. Maybe try and revert to an earlier version of the WebUI? In a "normal" situation on your GPU you should be able to load a model with context lengths way exceeding what I can test on my system.
@CarpeUniversum · 7 months ago
@@mytechantics Yea - I was able to load a 10.7B GPTQ model that used about 16 gigs of VRAM with a context length of 48,000... I changed no other settings but the context length, and after about 20 messages I just got random code, gibberish. It's very strange. I'm looking into alternatives now, because this is all just... weird. Even with just 512 context on my old setup, I was able to talk for usually around 250 messages. Something has really changed.
@DANai-hy4vs · 7 months ago
I don't have the character gallery option available. How can I activate it?
@mytechantics · 7 months ago
Make sure that in the Session tab, in the Available Extensions section, the gallery option is enabled. The character gallery should then be available at the bottom of the Chat tab. There it's still kinda hidden; you need to scroll down and expand it.
@DANai-hy4vs · 7 months ago
@@mytechantics thanks it was what you said
@rajapathysivalingam4202 · 5 months ago
Oobabooga
@smilescharleston6196 · 1 month ago
I have a Geforce 820m. I am beyond cooked.
@avatar2233 · 1 month ago
What's the point of the clickbait? Who uses LLMs for NSFW??
@larsdrakblod240 · 7 months ago
Do you have a moment? I'm a patron/donor and I'm new to this. I'm trying to run TextGenWebUI on my GTX 1080; the 13B models won't work, of course, with my 8GB of VRAM, and the 7B models are slow (20-120s per response). Is there any smaller one, like 3B, that works well/faster with TextGenWebUI for, like, *cough* NSFW chatting *cough* that you could recommend? ^.^
@mytechantics · 7 months ago
7B models with 4-bit quantization and low context windows (set to around ~1000-1500 tokens) should work a bit faster even on an 1080. Besides the dolphin-2.6-mistral-7B model I'm using in this example, I can think of the Wizard-Vicuna-7B-Uncensored and the OpenHermes-2-Mistral-7B (you can find them on HuggingFace, I recommend the AWQ versions). All of these work with almost instant generations on my GPU with 8GB of VRAM.
@larsdrakblod240 · 7 months ago
@@mytechantics Wow, thanks for the help! But I get this strange error message: RuntimeError: CUDA error: no kernel image is available for execution on the device. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
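For context: a "no kernel image is available" error usually means the installed CUDA kernels were not compiled for the GPU's compute capability; the GTX 1080 is a Pascal card reporting compute capability 6.1, and prebuilt AWQ kernels generally target newer architectures. A quick, WebUI-independent way to check what PyTorch sees on your machine:

```python
# Quick diagnostic: check which CUDA compute capability your GPU reports.
# Pre-built quantization kernels often require a newer architecture than Pascal.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(torch.cuda.get_device_name(0), f"-> compute capability {major}.{minor}")
else:
    print("No CUDA device visible to PyTorch.")
```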
@mikrodizels · 6 months ago
Hey, I have a GTX 1060 6GB and 16GB of RAM. I strongly suggest using GGUF (quantized) models and splitting the load between your GPU and CPU. TheBloke/Loyal-Macaroni-Maid-7B-GGUF: for a 7B, this one is specifically merged and made for roleplay and is the best I currently have. I can run the biggest quant (Q8_0) on my potato PC; just make sure you download the model through the web UI and get the tokenizer using the llamacpp_HF tab. Then you can play around with the settings before loading the model to see how much you can offload to your GPU. 16K context length on this model BTW, works fine.
@mytechantics · 6 months ago
@@mikrodizels Thanks for the model recommendation and the tips! I was hesitant to use GGUF models at first so I've stuck to the AWQ format like in the tutorial. How are the generation speeds in your configuration with CPU offloading if you don't mind me asking?
@mikrodizels · 6 months ago
@@mytechantics That specific model gives me around 3-4 tokens/sec. It's slightly faster than I'm able to read/type, so more speed is unnecessary for me. I have selected offloading 16 layers to my GPU, but I don't think the offloading is working efficiently or making a big difference. I wish there was an in-depth tutorial about the web UI model loader settings; I've got no clue how most of them work or what they do.
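For anyone curious what the layer-offloading setting corresponds to outside the UI, here is a minimal sketch using the llama-cpp-python bindings (the same llama.cpp backend), with a rough tokens/sec measurement. The model filename and layer count are placeholders to adjust for your own files and VRAM.

```python
# Sketch: partial GPU offload with llama-cpp-python, plus a rough throughput check.
# n_gpu_layers is the same knob as the layer-offload setting in the WebUI's llama.cpp loader.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Loyal-Macaroni-Maid-7B.Q8_0.gguf",  # filename assumed
    n_ctx=16384,      # the 16K context mentioned above
    n_gpu_layers=16,  # layers offloaded to the GPU; raise until VRAM runs out
)

start = time.time()
out = llm("Describe your favorite meal in three sentences.", max_tokens=200)
generated = out["usage"]["completion_tokens"]
print(f"{generated / (time.time() - start):.1f} tokens/sec")
```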
@Arc_Soma2639 · 5 months ago
This is stupid, but I refrain from installing it because I hate how it installs a lot of dependencies, and I think it's just bothersome to have so many things installed. Like, what do I do to have a clean PC? Keep a clean installation of the OS? What a burdensome thing just to chat with an AI.
@mytechantics · 5 months ago
That's just how it goes with running these things locally. That's why I run many of these projects on temporary Windows virtual machines when I test them out, so I don't clutter my PC that much. If you want a good quality AI chat without any installation quirks (albeit not local), I think the only reasonable thing that's left is character.ai. Also, when it comes to locally hosted solutions you might want to check the GPT4ALL project, it's not quite the same from the technical standpoint, but it has a simple one click installer.
@Arc_Soma2639 · 5 months ago
@@mytechantics Thank you very much
@BB-Series · 6 months ago
My computer cannot run start_alltalk.bat; it appears and disappears. Help me.
@mytechantics · 5 months ago
This might be because your antivirus is blocking the program from starting. Add the start_alltalk.bat file, or even better, the whole program directory to your antivirus exceptions list, and it should work then.
@BB-Series · 5 months ago
@@mytechantics I tried turning off the protection and doing as you said but it still didn't work. Should I delete it and run it a second time?
@Cloudbat-m9p · 8 days ago
It's uncomical how many problems I'm having, bruh. Soooo many errors and stuff. I'm losing my mind. What does all this mean, bruh?
"Traceback (most recent call last):
  File "C:\text-generation-webui-main\modules\ui_model_menu.py", line 232, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "C:\text-generation-webui-main\modules\models.py", line 93, in load_model
    output = load_func_map[loader](model_name)
  File "C:\text-generation-webui-main\modules\models.py", line 263, in huggingface_loader
    model = LoaderClass.from_pretrained(path_to_model, **params)
  File "C:\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\models\auto\auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "C:\text-generation-webui-main\installer_files\env\Lib\site-packages\transformers\modeling_utils.py", line 3763, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory models\nvidia_Llama-3.1-Nemotron-70B-Instruct-HF."