subtitlecat.com

All language subtitles for Ultimate Guide Local AI Setup (Qwen3.6 + LlamaC++ + TurboQuant)

Afrikaans

Akan

Albanian

Amharic

Arabic

Armenian

Azerbaijani

Basque

Belarusian

Bemba

Bengali

Bihari

Bosnian

Breton

Bulgarian

Cambodian

Catalan

Cebuano

Cherokee

Chichewa

Chinese (Simplified)

Chinese (Traditional)

Corsican

Croatian

Czech

Danish

Dutch

English

Esperanto

Estonian

Ewe

Faroese

Filipino

Finnish

French

Frisian

Galician

Georgian

German

Greek

Guarani

Gujarati

Haitian Creole

Hausa

Hawaiian

Hebrew

Hindi

Hmong

Hungarian

Icelandic

Igbo

Indonesian

Interlingua

Irish

Italian

Japanese

Javanese

Kannada

Kazakh

Kinyarwanda

Kirundi

Kongo

Korean

Krio (Sierra Leone)

Kurdish

Kurdish (Soranî)

Kyrgyz

Laothian

Latin

Latvian

Lingala

Lithuanian

Lozi

Luganda

Luo

Luxembourgish

Macedonian

Malagasy

Malay

Malayalam

Maltese

Maori

Marathi

Mauritian Creole

Moldavian

Mongolian

Myanmar (Burmese)

Montenegrin

Nepali

Nigerian Pidgin

Northern Sotho

Norwegian

Norwegian (Nynorsk)

Occitan

Oriya

Oromo

Pashto

Persian

Polish

Portuguese (Brazil)

Portuguese (Portugal)

Punjabi

Quechua

Romanian

Romansh

Runyakitara

Russian

Samoan

Scots Gaelic

Serbian

Serbo-Croatian

Sesotho

Setswana

Seychellois Creole

Shona

Sindhi

Sinhalese

Slovak

Slovenian

Somali

Spanish

Spanish (Latin American)

Sundanese

Swahili

Swedish

Tajik

Tamil

Tatar

Telugu

Thai

Tigrinya

Tonga

Tshiluba

Tumbuka

Turkish Download

Turkmen

Twi

Uighur

Ukrainian

Urdu

Uzbek

Vietnamese

Welsh

Wolof

Xhosa

Yiddish

Yoruba

Zulu

Would you like to inspect the original subtitles? These are the user uploaded subtitles that are being translated: 1 00:00:00,000 --> 00:00:04,760 I had so many issues trying to get local 2 00:00:02,320 --> 00:00:07,120 LLM's working with Claude code, with 3 00:00:04,760 --> 00:00:09,440 Kilo code, with Open Claw. I kept 4 00:00:07,120 --> 00:00:12,240 getting this pesky client timeout error 5 00:00:09,440 --> 00:00:14,360 in LM Studio and Ollama would just wipe 6 00:00:12,240 --> 00:00:16,520 over my Open Claw setting. But I wanted 7 00:00:14,360 --> 00:00:18,880 to get to the bottom of what how I can 8 00:00:16,520 --> 00:00:21,000 get the purest setup running local 9 00:00:18,880 --> 00:00:23,440 models on my computer and I found the 10 00:00:21,000 --> 00:00:25,240 answer in Llama.cpp. So today we're 11 00:00:23,440 --> 00:00:26,840 going to get that set up and how you can 12 00:00:25,240 --> 00:00:28,920 run it and follow it carefully I can 13 00:00:26,840 --> 00:00:30,120 guarantee you'll have zero issues. Now 14 00:00:28,920 --> 00:00:31,560 what a lot of people don't realize is 15 00:00:30,120 --> 00:00:34,040 that LM Studio and Ollama are actually 16 00:00:31,560 --> 00:00:36,880 just wrappers of Llama.cpp. They just 17 00:00:34,040 --> 00:00:38,680 provide a new nicely user interface and 18 00:00:36,880 --> 00:00:40,880 we'll get into this but Ollama just 19 00:00:38,680 --> 00:00:43,680 added this but LM Studio have had it for 20 00:00:40,880 --> 00:00:45,840 quite a while is the option to run MLX 21 00:00:43,680 --> 00:00:48,000 models. Ollama only have it for one 22 00:00:45,840 --> 00:00:50,520 model whereas you get all of the models 23 00:00:48,000 --> 00:00:52,720 you can find on Hugging Face 24 00:00:50,520 --> 00:00:54,200 downloadable through LM Studio. And with 25 00:00:52,720 --> 00:00:56,120 this method we're going to leverage what 26 00:00:54,200 --> 00:00:58,640 they call Turbo Quant. Turbo Quant was 27 00:00:56,120 --> 00:01:01,920 something Google announced this year 28 00:00:58,640 --> 00:01:04,960 which inevitably is a new way to quantal 29 00:01:01,920 --> 00:01:06,160 quantisize quantisize quantisization of 30 00:01:04,960 --> 00:01:08,200 is that a 31 00:01:06,160 --> 00:01:11,040 a real word? It's an efficient way to 32 00:01:08,200 --> 00:01:12,960 extreme compression. Again, I'm not an 33 00:01:11,040 --> 00:01:15,440 expert in all this but it's an 34 00:01:12,960 --> 00:01:17,800 interesting way to optimize key value or 35 00:01:15,440 --> 00:01:22,600 KV cache and whilst it hasn't been 36 00:01:17,800 --> 00:01:25,600 incorporated into Llama.cpp yet Tom here 37 00:01:22,600 --> 00:01:27,880 has gone ahead and created a branch that 38 00:01:25,600 --> 00:01:31,040 enables this. Eventually in time I've no 39 00:01:27,880 --> 00:01:33,160 doubt that we'll get it in Llama.cpp 40 00:01:31,040 --> 00:01:35,680 now Tom is the one who's done this so 41 00:01:33,160 --> 00:01:37,880 give him a give him a thanks up give his 42 00:01:35,680 --> 00:01:39,320 give his repo a star there. So whilst 43 00:01:37,880 --> 00:01:41,520 we've got the base model and 44 00:01:39,320 --> 00:01:43,400 quantization happens 45 00:01:41,520 --> 00:01:45,520 on the model itself we can't really do 46 00:01:43,400 --> 00:01:48,320 anything about that we just choose the 47 00:01:45,520 --> 00:01:50,320 model we want to use is that KV storage 48 00:01:48,320 --> 00:01:52,840 that we want to optimize and that's 49 00:01:50,320 --> 00:01:54,520 exactly what Turbo Quant does. Now the 50 00:01:52,840 --> 00:01:57,200 first thing you're going to want to do 51 00:01:54,520 --> 00:01:59,160 is head on over to the Tom and Turbo 52 00:01:57,200 --> 00:02:02,080 Quant plus. I'll leave everything linked 53 00:01:59,160 --> 00:02:04,240 below and there's a lot to take in here 54 00:02:02,080 --> 00:02:06,560 and I'm not going to claim to know all 55 00:02:04,240 --> 00:02:08,560 of this stuff but we'll go through it 56 00:02:06,560 --> 00:02:10,800 step by step and I'll try and point out 57 00:02:08,560 --> 00:02:12,320 the things you should really care about. 58 00:02:10,800 --> 00:02:13,760 This is the section you're going to want 59 00:02:12,320 --> 00:02:16,640 to really care about and the first thing 60 00:02:13,760 --> 00:02:19,120 you want to do is actually clone the Git 61 00:02:16,640 --> 00:02:21,120 repo. And this won't be a lesson on the 62 00:02:19,120 --> 00:02:23,160 terminal but I'll do my best to explain 63 00:02:21,120 --> 00:02:24,960 it here. Do yourself a favor and 64 00:02:23,160 --> 00:02:26,640 download Warp links are down below in 65 00:02:24,960 --> 00:02:28,480 the description. This will make it 66 00:02:26,640 --> 00:02:31,040 things go wrong a hell of a lot easier 67 00:02:28,480 --> 00:02:32,760 because it's an AI powered terminal and 68 00:02:31,040 --> 00:02:34,640 it will just help you figure it out. It 69 00:02:32,760 --> 00:02:36,240 did so on one of my old machines which 70 00:02:34,640 --> 00:02:38,840 obviously had some sort of conflicts or 71 00:02:36,240 --> 00:02:40,880 something like that. So, within terminal 72 00:02:38,840 --> 00:02:42,720 you want to CD which is change direction 73 00:02:40,880 --> 00:02:44,800 to the direct the directory that you 74 00:02:42,720 --> 00:02:46,280 want to store this app in. Now it's not 75 00:02:44,800 --> 00:02:48,400 going to be a traditional app which you 76 00:02:46,280 --> 00:02:49,840 double click and then it opens up. It 77 00:02:48,400 --> 00:02:52,040 you're literally going to be downloading 78 00:02:49,840 --> 00:02:54,800 the raw files that build this 79 00:02:52,040 --> 00:02:56,200 application. So what I'd suggest is CD 80 00:02:54,800 --> 00:02:58,520 and if you know the location of the 81 00:02:56,200 --> 00:03:00,520 directory just type it in here. I know 82 00:02:58,520 --> 00:03:02,480 mine is in sites as an example that's 83 00:03:00,520 --> 00:03:03,920 where I want to download mine or if 84 00:03:02,480 --> 00:03:06,760 you're brand new to this open up your 85 00:03:03,920 --> 00:03:09,000 finder literally just drag the folder in 86 00:03:06,760 --> 00:03:11,200 there and click enter and it will change 87 00:03:09,000 --> 00:03:12,520 that directory. Press enter I've already 88 00:03:11,200 --> 00:03:14,400 got it so it's probably going to fail 89 00:03:12,520 --> 00:03:16,520 there but it will 90 00:03:14,400 --> 00:03:18,640 it will download that code. Next we're 91 00:03:16,520 --> 00:03:19,960 going to CD into that folder but Warp's 92 00:03:18,640 --> 00:03:21,920 probably going to help us out when this 93 00:03:19,960 --> 00:03:23,920 is done. 94 00:03:21,920 --> 00:03:27,560 And sure enough it does. 95 00:03:23,920 --> 00:03:29,160 And then you check out the branch 96 00:03:27,560 --> 00:03:32,160 because this is all stored on GitHub 97 00:03:29,160 --> 00:03:34,080 where as I say all the code lives this 98 00:03:32,160 --> 00:03:36,680 is a special feature that they've made 99 00:03:34,080 --> 00:03:37,960 off of the main branch that makes sense. 100 00:03:36,680 --> 00:03:40,040 Basically you're going to want to check 101 00:03:37,960 --> 00:03:42,120 out that branch here which again I 102 00:03:40,040 --> 00:03:44,800 already have so just copy that line of 103 00:03:42,120 --> 00:03:46,280 code there paste it in and it should 104 00:03:44,800 --> 00:03:47,200 just check that out. I'm not going to do 105 00:03:46,280 --> 00:03:49,400 that. 106 00:03:47,200 --> 00:03:52,480 And obviously we're Apple silicon here 107 00:03:49,400 --> 00:03:53,720 but if you're Nvidia or AMD you're going 108 00:03:52,480 --> 00:03:55,840 to want to follow the relevant 109 00:03:53,720 --> 00:03:58,240 instructions but as I say a metal and 110 00:03:55,840 --> 00:04:00,280 this is where on my other machine Warp 111 00:03:58,240 --> 00:04:01,800 came in clutch cuz it couldn't find 112 00:04:00,280 --> 00:04:04,120 something or there was a conflict 113 00:04:01,800 --> 00:04:06,200 elsewhere doing this command and I was 114 00:04:04,120 --> 00:04:08,160 able to step through that using Warp as 115 00:04:06,200 --> 00:04:10,000 a as as a helpful tool. And then you're 116 00:04:08,160 --> 00:04:12,320 going to want to actually run the build 117 00:04:10,000 --> 00:04:14,480 script which again I already have but 118 00:04:12,320 --> 00:04:16,400 just copy that back into that and this 119 00:04:14,480 --> 00:04:18,200 might take a little bit of time and then 120 00:04:16,400 --> 00:04:21,040 what you'll be left with this is the 121 00:04:18,200 --> 00:04:24,880 folder that got checked out here 122 00:04:21,040 --> 00:04:26,560 and in build bin this is basically what 123 00:04:24,880 --> 00:04:27,680 you just created all of these different 124 00:04:26,560 --> 00:04:29,680 little applications. We're not 125 00:04:27,680 --> 00:04:32,280 interested in most of these we're only 126 00:04:29,680 --> 00:04:34,360 interested in Llama.cpp 127 00:04:32,280 --> 00:04:36,320 and we're we're interested in Llama 128 00:04:34,360 --> 00:04:38,200 server here. It's worth saying here this 129 00:04:36,320 --> 00:04:40,200 will only need to be done once and then 130 00:04:38,200 --> 00:04:41,920 you navigate to this folder with CD to 131 00:04:40,200 --> 00:04:44,240 do everything from this point onwards so 132 00:04:41,920 --> 00:04:45,880 you only do that build step once. So now 133 00:04:44,240 --> 00:04:48,160 with all that done we're going to 134 00:04:45,880 --> 00:04:51,200 actually want to choose a model that we 135 00:04:48,160 --> 00:04:53,360 want to use. So for this we're going to 136 00:04:51,200 --> 00:04:54,680 go to Hugging Face which is basically 137 00:04:53,360 --> 00:04:56,640 where everyone uploads all of their 138 00:04:54,680 --> 00:04:58,600 models and you've got some of the raw 139 00:04:56,640 --> 00:05:01,000 models 140 00:04:58,600 --> 00:05:03,080 such as Qwen 3.6 which is what we're 141 00:05:01,000 --> 00:05:05,520 going to deal with today. This is the 142 00:05:03,080 --> 00:05:08,120 raw files. We go in here and there's a 143 00:05:05,520 --> 00:05:09,960 bunch of just all of these safe tensor 144 00:05:08,120 --> 00:05:12,200 files and stuff like that. We're not 145 00:05:09,960 --> 00:05:14,240 going to be touching all that today cuz 146 00:05:12,200 --> 00:05:17,360 what people have kindly done and if we 147 00:05:14,240 --> 00:05:19,280 search Qwen 3.6 the thing that we're 148 00:05:17,360 --> 00:05:22,920 going to want to make sure we include is 149 00:05:19,280 --> 00:05:22,920 you'll want to download 150 00:05:22,960 --> 00:05:27,920 one of these here. Now a popular one a 151 00:05:25,560 --> 00:05:31,200 trusted one is Unsloth they're another 152 00:05:27,920 --> 00:05:34,680 AI app and they have 153 00:05:31,200 --> 00:05:37,040 built the model as GGUF 154 00:05:34,680 --> 00:05:39,240 which is basically a a model just 155 00:05:37,040 --> 00:05:41,280 condensed down to a single file that we 156 00:05:39,240 --> 00:05:43,720 can then use. And just to break down the 157 00:05:41,280 --> 00:05:46,680 actual naming here the the model is 158 00:05:43,720 --> 00:05:48,160 obviously Qwen 3.6 it's a 35 billion 159 00:05:46,680 --> 00:05:50,720 parameter model and because it's a 160 00:05:48,160 --> 00:05:52,720 mixture of expert model it has 3 billion 161 00:05:50,720 --> 00:05:55,960 active parameters. We don't have any 162 00:05:52,720 --> 00:06:00,000 other model of Qwen 3.6 but if we look 163 00:05:55,960 --> 00:06:01,760 at Qwen 3.5 we've got 27 billion model 164 00:06:00,000 --> 00:06:03,280 here we've got a 9 billion parameter 165 00:06:01,760 --> 00:06:04,720 model here we've got a 27 billion 166 00:06:03,280 --> 00:06:07,080 parameter model here. That's kind of how 167 00:06:04,720 --> 00:06:09,000 to read these model names. And with that 168 00:06:07,080 --> 00:06:10,920 you want to take a look at this section 169 00:06:09,000 --> 00:06:13,520 here. Now this will be entirely 170 00:06:10,920 --> 00:06:15,880 dependent on how much VRAM that you 171 00:06:13,520 --> 00:06:18,040 have. If you've got bucket loads 172 00:06:15,880 --> 00:06:20,000 hundreds of gigs of VRAM then you're 173 00:06:18,040 --> 00:06:22,600 really looking at the 16-bit which is 174 00:06:20,000 --> 00:06:24,360 the unquantized version there's been no 175 00:06:22,600 --> 00:06:26,880 sort of compression added to the model 176 00:06:24,360 --> 00:06:27,880 it's a full fat version you're probably 177 00:06:26,880 --> 00:06:29,600 want to be 178 00:06:27,880 --> 00:06:33,320 you're going to be safe to download this 179 00:06:29,600 --> 00:06:36,040 one. I have 64 gigabytes of RAM so I'll 180 00:06:33,320 --> 00:06:37,720 start to look into all of these versions 181 00:06:36,040 --> 00:06:40,040 of the model. These are quantization 182 00:06:37,720 --> 00:06:42,520 levels how much they've been compressed 183 00:06:40,040 --> 00:06:44,680 in order to an attempt to maintain 184 00:06:42,520 --> 00:06:47,480 quality but reduce the size but 185 00:06:44,680 --> 00:06:49,960 realistically the more quantized you get 186 00:06:47,480 --> 00:06:51,520 the more sacrifice the more detriment to 187 00:06:49,960 --> 00:06:53,640 the model is going to take place. So the 188 00:06:51,520 --> 00:06:55,600 higher up you can get the better which 189 00:06:53,640 --> 00:06:58,320 is why we talk about the more VRAM the 190 00:06:55,600 --> 00:07:00,360 better. However, there's another layer 191 00:06:58,320 --> 00:07:02,800 that we need to think of which is again 192 00:07:00,360 --> 00:07:05,520 that KV storage which is why Turbo Quant 193 00:07:02,800 --> 00:07:07,720 comes into so much handy is because it 194 00:07:05,520 --> 00:07:09,600 compresses that and reduces that in 195 00:07:07,720 --> 00:07:11,800 order to squeeze more out of a model. 196 00:07:09,600 --> 00:07:15,520 Whereas traditionally you might look to 197 00:07:11,800 --> 00:07:17,440 reserve another 15 gig on top of the 198 00:07:15,520 --> 00:07:20,400 actual model size that you need in your 199 00:07:17,440 --> 00:07:23,360 VRAM to support it we can now think of 200 00:07:20,400 --> 00:07:26,640 about adding about 10 gig more. So as an 201 00:07:23,360 --> 00:07:29,480 example my 64 gig can can't even run 202 00:07:26,640 --> 00:07:33,720 this model however the 8-bit version 203 00:07:29,480 --> 00:07:36,560 the 38 gig model plus 10 gig between 48 204 00:07:33,720 --> 00:07:39,160 and 50 gig let's say this will fit on my 205 00:07:36,560 --> 00:07:40,920 machine with the full context is that 206 00:07:39,160 --> 00:07:43,200 context amount that we're dealing with. 207 00:07:40,920 --> 00:07:44,960 However, again depending how much VRAM 208 00:07:43,200 --> 00:07:47,080 you do you might have to go lower and 209 00:07:44,960 --> 00:07:49,000 lower and lower which starts to become 210 00:07:47,080 --> 00:07:51,880 arguable when you're quantizing it at 211 00:07:49,000 --> 00:07:53,960 3-bit 2-bit and 1-bit. So 212 00:07:51,880 --> 00:07:56,640 this will be dependent on you however 213 00:07:53,960 --> 00:07:59,280 much RAM you've got will determine what 214 00:07:56,640 --> 00:08:02,120 quantization level you can go to. Now 215 00:07:59,280 --> 00:08:05,080 the excel so we've got quantized eight 216 00:08:02,120 --> 00:08:07,600 and K is the way that they've quantized 217 00:08:05,080 --> 00:08:10,560 it. You've got some other ones here IQ 218 00:08:07,600 --> 00:08:11,880 you've got MXFP4 219 00:08:10,560 --> 00:08:14,440 you've got all of these different types 220 00:08:11,880 --> 00:08:16,000 of quantization. Honestly I always just 221 00:08:14,440 --> 00:08:19,360 look to the K ones. This is the more 222 00:08:16,000 --> 00:08:21,560 modern way to quantize a model. And then 223 00:08:19,360 --> 00:08:25,880 within that you've got another sort of 224 00:08:21,560 --> 00:08:28,320 extra large or medium or small again 225 00:08:25,880 --> 00:08:30,840 sort of micro adjustments made to the 226 00:08:28,320 --> 00:08:32,479 type of quantization that further brings 227 00:08:30,840 --> 00:08:35,080 down the 228 00:08:32,479 --> 00:08:36,760 the amount of space it takes up. Long 229 00:08:35,080 --> 00:08:39,880 story short you're going to want to go 230 00:08:36,760 --> 00:08:42,880 as high bit rate as you possibly can as 231 00:08:39,880 --> 00:08:45,320 larger sort of size as you possibly can 232 00:08:42,880 --> 00:08:48,240 and basically just looking at the number 233 00:08:45,320 --> 00:08:49,480 of gigabytes in relation to your RAM is 234 00:08:48,240 --> 00:08:51,520 the thing that you're going to want to 235 00:08:49,480 --> 00:08:53,760 care about. So you simply click the 236 00:08:51,520 --> 00:08:55,160 version you've you want to download I've 237 00:08:53,760 --> 00:08:56,400 already again I've already done that 238 00:08:55,160 --> 00:09:00,160 it'll take some time. I've just 239 00:08:56,400 --> 00:09:02,320 downloaded 30 gig worth 38 gigs worth of 240 00:09:00,160 --> 00:09:04,960 model and it should just download as a 241 00:09:02,320 --> 00:09:08,480 single file into your downloads folder. 242 00:09:04,960 --> 00:09:11,120 Now coming back into our app folder here 243 00:09:08,480 --> 00:09:12,960 if we navigate backwards and there 244 00:09:11,120 --> 00:09:14,640 should be a little models folder. This 245 00:09:12,960 --> 00:09:16,640 model can be stored anywhere but I just 246 00:09:14,640 --> 00:09:19,000 find it's a lot easier and a lot simpler 247 00:09:16,640 --> 00:09:22,920 just to store it in this area here. I'm 248 00:09:19,000 --> 00:09:24,920 going to download my GGUF model Qwen 3.6 249 00:09:22,920 --> 00:09:28,560 into that folder there. You can see it's 250 00:09:24,920 --> 00:09:30,360 38.45 gig and this is just a nice place 251 00:09:28,560 --> 00:09:31,600 for us to work. This is where the fun 252 00:09:30,360 --> 00:09:34,920 begins. 253 00:09:31,600 --> 00:09:36,920 So if we come to this build script here 254 00:09:34,920 --> 00:09:39,120 which is 255 00:09:36,920 --> 00:09:40,920 let's just run the CLI first this is 256 00:09:39,120 --> 00:09:42,400 nice and easy. Now I'm going to go 257 00:09:40,920 --> 00:09:45,280 through all of these parameters here 258 00:09:42,400 --> 00:09:47,240 with you. If you paste that in this is 259 00:09:45,280 --> 00:09:50,560 this is what it takes to run it. Now we 260 00:09:47,240 --> 00:09:52,920 are not going to be passing in a query 261 00:09:50,560 --> 00:09:55,600 so we can already delete that. We don't 262 00:09:52,920 --> 00:09:58,280 need ginger and we don't need N100 being 263 00:09:55,600 --> 00:10:00,520 on a Mac. This is how much G G how many 264 00:09:58,280 --> 00:10:02,440 GPU cores you're going to designate to 265 00:10:00,520 --> 00:10:03,916 the model. We don't really worry about 266 00:10:02,440 --> 00:10:03,920 that because we're on a Mac. 267 00:10:03,916 --> 00:10:06,080 >> [snorts] 268 00:10:03,920 --> 00:10:07,960 >> Now, the this is the KV storage that we 269 00:10:06,080 --> 00:10:09,840 talk about. Now, Tom has done the 270 00:10:07,960 --> 00:10:11,720 research here. I'm not going to claim to 271 00:10:09,840 --> 00:10:13,720 be Jesus. Tom should be Jesus, but he's 272 00:10:11,720 --> 00:10:17,440 done a lot of testing here into 273 00:10:13,720 --> 00:10:20,480 different ways to quantize the model. 274 00:10:17,440 --> 00:10:24,200 And long story short, the result of that 275 00:10:20,480 --> 00:10:25,760 is asymmetric quantization gets the 276 00:10:24,200 --> 00:10:28,320 uncompromised 277 00:10:25,760 --> 00:10:30,800 performance out of the model whilst 278 00:10:28,320 --> 00:10:33,360 leveraging Turbo Quant. Symmetrical 279 00:10:30,800 --> 00:10:35,840 Turbo Quant, you would sacrifice the the 280 00:10:33,360 --> 00:10:38,280 model performance, which is interesting. 281 00:10:35,840 --> 00:10:40,840 So, coming [snorts] back into here, the 282 00:10:38,280 --> 00:10:44,760 V storage he recommends leaving it as 283 00:10:40,840 --> 00:10:47,360 Turbo 3, but the K storage actually 284 00:10:44,760 --> 00:10:51,120 quantizing it asymmetrically. So, 285 00:10:47,360 --> 00:10:53,800 quantizing it at 8-bit here or and Turbo 286 00:10:51,120 --> 00:10:56,840 3 or Turbo 4 with the V storage. Now, 287 00:10:53,800 --> 00:10:59,520 this is the default. So, long we can 288 00:10:56,840 --> 00:11:02,080 just delete that there and we can be 289 00:10:59,520 --> 00:11:03,800 quite happy with that. Now, 290 00:11:02,080 --> 00:11:06,800 this allows us to break onto a new line, 291 00:11:03,800 --> 00:11:09,040 so don't worry about that. We want FA on 292 00:11:06,800 --> 00:11:11,800 and we are going to want to actually 293 00:11:09,040 --> 00:11:15,400 delete this one here and we're left with 294 00:11:11,800 --> 00:11:16,880 basically that. Now, C is going to be 295 00:11:15,400 --> 00:11:19,720 very important. This is the amount of 296 00:11:16,880 --> 00:11:21,600 context you're allowing the model to 297 00:11:19,720 --> 00:11:23,240 have. Now, if you're running into 298 00:11:21,600 --> 00:11:25,520 issues, you might need to reduce the 299 00:11:23,240 --> 00:11:27,240 context. So, you might be able to Again, 300 00:11:25,520 --> 00:11:29,160 fit the model on your on your hard 301 00:11:27,240 --> 00:11:31,200 drive, but can you fit the full context 302 00:11:29,160 --> 00:11:33,560 amount? You might be able to squeeze a 303 00:11:31,200 --> 00:11:36,640 little bit more juice by Let's say, for 304 00:11:33,560 --> 00:11:38,400 example, you've got 40 gig of RAM here. 305 00:11:36,640 --> 00:11:40,600 Um you might be able to download this 306 00:11:38,400 --> 00:11:42,640 level, but reduce the context so much 307 00:11:40,600 --> 00:11:44,320 that only takes a couple of gig of 308 00:11:42,640 --> 00:11:46,520 space. So, you're you're managing your 309 00:11:44,320 --> 00:11:50,839 context like that. I'm fortunate enough 310 00:11:46,520 --> 00:11:55,520 to have 64 GB of RAM on this M1 Max. So, 311 00:11:50,839 --> 00:11:58,800 what I recommend doing, if we go to 312 00:11:55,520 --> 00:12:01,280 Quen 3.6 here, is finding out the 313 00:11:58,800 --> 00:12:03,400 context size of the window of the of the 314 00:12:01,280 --> 00:12:05,839 model. So, if I just search context 315 00:12:03,400 --> 00:12:05,839 here, 316 00:12:06,000 --> 00:12:10,320 the context length here is 262,144. 317 00:12:10,440 --> 00:12:14,320 Can extend up to a million tokens. We're 318 00:12:12,880 --> 00:12:16,720 not going to push that. I'm going to 319 00:12:14,320 --> 00:12:20,040 copy that number there and actually give 320 00:12:16,720 --> 00:12:22,360 myself the maximum number of context 321 00:12:20,040 --> 00:12:24,400 size. And honestly, 322 00:12:22,360 --> 00:12:26,440 having done this quite a few times now, 323 00:12:24,400 --> 00:12:28,080 it's always a balancing act depending on 324 00:12:26,440 --> 00:12:29,920 what you're doing, how much context you 325 00:12:28,080 --> 00:12:31,880 want to give it. Obviously, simple chat 326 00:12:29,920 --> 00:12:33,920 applications don't need a lot of 327 00:12:31,880 --> 00:12:36,880 context. However, if you're doing code, 328 00:12:33,920 --> 00:12:39,520 if you're doing like multi-step 329 00:12:36,880 --> 00:12:41,320 like file reading and things like that, 330 00:12:39,520 --> 00:12:42,720 the more context you can get, the 331 00:12:41,320 --> 00:12:44,480 better. 332 00:12:42,720 --> 00:12:46,240 Now, finally, we're just going to pass 333 00:12:44,480 --> 00:12:48,120 in the actual model. Now, we're already 334 00:12:46,240 --> 00:12:52,600 navigating to the models folder, which 335 00:12:48,120 --> 00:12:54,920 we moved our model file in. So, if I go 336 00:12:52,600 --> 00:12:57,880 here, I literally just copy the name of 337 00:12:54,920 --> 00:13:00,080 the file, paste that in instead of that, 338 00:12:57,880 --> 00:13:01,760 and it's going to turn blue there. Now, 339 00:13:00,080 --> 00:13:04,240 hit enter. This is going to actually 340 00:13:01,760 --> 00:13:07,600 enable us to run this model. Maybe I can 341 00:13:04,240 --> 00:13:11,000 bring in my GPU history here 342 00:13:07,600 --> 00:13:13,120 and even my activity monitor. 343 00:13:11,000 --> 00:13:16,240 You can already see that our RAM has 344 00:13:13,120 --> 00:13:17,600 just been filled up totally with our 345 00:13:16,240 --> 00:13:19,040 model. 346 00:13:17,600 --> 00:13:20,680 Uh 347 00:13:19,040 --> 00:13:24,560 All right. 348 00:13:20,680 --> 00:13:26,520 And now I'm using Quen 3.6 35 billion 349 00:13:24,560 --> 00:13:28,320 parameters model. 350 00:13:26,520 --> 00:13:31,200 It's not the fastest and this is what we 351 00:13:28,320 --> 00:13:34,200 need to come to expect of local models, 352 00:13:31,200 --> 00:13:35,720 but at a modest 53 tokens per second, 353 00:13:34,200 --> 00:13:38,240 it's not so bad. 354 00:13:35,720 --> 00:13:39,680 And if you're happy using llama.cpp in 355 00:13:38,240 --> 00:13:42,040 this way and using the model in this 356 00:13:39,680 --> 00:13:44,040 way, more power to you, you're done. But 357 00:13:42,040 --> 00:13:45,320 if you want to use another application, 358 00:13:44,040 --> 00:13:47,680 you want to bring it into another app, 359 00:13:45,320 --> 00:13:49,600 you want to build your own app or 360 00:13:47,680 --> 00:13:51,440 something like that, what we really want 361 00:13:49,600 --> 00:13:53,800 to do, if we cancel that and press up, 362 00:13:51,440 --> 00:13:57,800 which will go to the previous 363 00:13:53,800 --> 00:14:00,600 command that we ran, if we change CLI to 364 00:13:57,800 --> 00:14:02,280 server, which if you remember correctly 365 00:14:00,600 --> 00:14:05,640 inside of 366 00:14:02,280 --> 00:14:09,280 build bin, I said we'd be interested in 367 00:14:05,640 --> 00:14:10,360 the llama server or llama CLI, wherever 368 00:14:09,280 --> 00:14:12,560 it is. 369 00:14:10,360 --> 00:14:14,560 We're actually going to use the server 370 00:14:12,560 --> 00:14:15,839 application, let's call it. So, if we 371 00:14:14,560 --> 00:14:18,000 hit that there, it's going to do 372 00:14:15,839 --> 00:14:21,240 everything it needs to do to enable us 373 00:14:18,000 --> 00:14:23,520 to serve this model across our local 374 00:14:21,240 --> 00:14:25,600 machine or across the network, if we 375 00:14:23,520 --> 00:14:27,040 choose to. So, we're waiting for this. 376 00:14:25,600 --> 00:14:29,839 Main server is listening on 377 00:14:27,040 --> 00:14:31,760 127.0.0.1:8080. 378 00:14:29,839 --> 00:14:34,720 If I click that, open it up in my 379 00:14:31,760 --> 00:14:37,480 browser, and we have a simple user 380 00:14:34,720 --> 00:14:38,480 interface running the Quen 3.6 model 381 00:14:37,480 --> 00:14:40,000 here. 382 00:14:38,480 --> 00:14:41,440 And you can chat with it and do all the 383 00:14:40,000 --> 00:14:44,440 rest of it, but we're really not that 384 00:14:41,440 --> 00:14:46,200 interested in using a UI for our use 385 00:14:44,440 --> 00:14:48,800 case. What I'm going to do now is 386 00:14:46,200 --> 00:14:51,920 actually bring it into VS Code to enable 387 00:14:48,800 --> 00:14:53,800 us to run this on across codebases and 388 00:14:51,920 --> 00:14:56,480 actually code with it. 389 00:14:53,800 --> 00:14:59,480 So, if you download VS Code, it's 390 00:14:56,480 --> 00:15:01,520 completely free, and what I recommend is 391 00:14:59,480 --> 00:15:03,720 downloading the Kilo Code extension, 392 00:15:01,520 --> 00:15:05,079 which again is completely free. They are 393 00:15:03,720 --> 00:15:07,079 a silver sponsor of the channel, but 394 00:15:05,079 --> 00:15:08,320 they are not sponsoring this episode 395 00:15:07,079 --> 00:15:11,480 whatsoever. 396 00:15:08,320 --> 00:15:13,280 Um we If you go into your settings here 397 00:15:11,480 --> 00:15:15,839 and providers, 398 00:15:13,280 --> 00:15:18,839 this is what the different providers 399 00:15:15,839 --> 00:15:21,640 that provide AI. And we're going to want 400 00:15:18,839 --> 00:15:23,880 to add our own custom provider. Now, we 401 00:15:21,640 --> 00:15:25,640 can put llama.cpp, 402 00:15:23,880 --> 00:15:27,400 really anything you want, and then a 403 00:15:25,640 --> 00:15:30,040 human-readable 404 00:15:27,400 --> 00:15:32,520 version of that is in the display name. 405 00:15:30,040 --> 00:15:35,360 If we paste in our URL here, which is 406 00:15:32,520 --> 00:15:37,800 127.0.0.1:8080, 407 00:15:35,360 --> 00:15:39,200 the thing that you saw in the terminal 408 00:15:37,800 --> 00:15:40,839 here, 409 00:15:39,200 --> 00:15:43,079 /v1, 410 00:15:40,839 --> 00:15:45,600 this is automatically going to pick up 411 00:15:43,079 --> 00:15:47,640 the fact that we're running llama.cpp 412 00:15:45,600 --> 00:15:49,360 and it's found the model that we've 413 00:15:47,640 --> 00:15:51,160 already got loaded. So, we're going to 414 00:15:49,360 --> 00:15:54,800 add that there 415 00:15:51,160 --> 00:15:56,800 and submit that. And now, when we click 416 00:15:54,800 --> 00:15:59,880 on the model selector down here, you've 417 00:15:56,800 --> 00:16:03,880 got all the models under llama.cpp, we 418 00:15:59,880 --> 00:16:07,520 have Quen 3.6 35 billion. And I'm just 419 00:16:03,880 --> 00:16:09,560 going to say, "What does this codebase 420 00:16:07,520 --> 00:16:11,959 do?" 421 00:16:09,560 --> 00:16:14,040 If we flip back into the terminal here, 422 00:16:11,959 --> 00:16:16,240 you'll see this progress build up. Now, 423 00:16:14,040 --> 00:16:18,240 this is the prefill. This is it loading 424 00:16:16,240 --> 00:16:20,320 up and getting ready to answer our 425 00:16:18,240 --> 00:16:22,959 question. Now, I have an old machine. 426 00:16:20,320 --> 00:16:25,480 This is an M1 Max. This is where the 427 00:16:22,959 --> 00:16:28,760 actual performance of your chip comes 428 00:16:25,480 --> 00:16:30,520 into play over the actual VRAM size. 429 00:16:28,760 --> 00:16:32,320 This is an old chip. It's going to take 430 00:16:30,520 --> 00:16:34,079 a long time to do this. We're already at 431 00:16:32,320 --> 00:16:35,680 0.27 432 00:16:34,079 --> 00:16:38,040 and this might seem fast in the 433 00:16:35,680 --> 00:16:40,920 beginning, but we'll start to see the 434 00:16:38,040 --> 00:16:44,600 GPU start ramping up. We'll start to see 435 00:16:40,920 --> 00:16:47,160 the RAM just the memory pressure and the 436 00:16:44,600 --> 00:16:48,800 memory use ramp up a little bit there. 437 00:16:47,160 --> 00:16:51,280 And this will go round and round and 438 00:16:48,800 --> 00:16:55,000 round and eventually, we will get a 439 00:16:51,280 --> 00:16:57,000 response inside of Kilo Code. You with a 440 00:16:55,000 --> 00:16:59,000 newer machine, a newer chip, this will 441 00:16:57,000 --> 00:17:00,360 be a lot faster for you, but for me, 442 00:16:59,000 --> 00:17:02,240 this takes a while. So, we're not going 443 00:17:00,360 --> 00:17:03,680 to sit around and do it here, but you 444 00:17:02,240 --> 00:17:06,199 can pat yourself on the back and know 445 00:17:03,680 --> 00:17:09,199 that you've got a local LLM running 446 00:17:06,199 --> 00:17:11,160 inside of Kilo Code ready to start 447 00:17:09,199 --> 00:17:12,920 coding completely free and completely 448 00:17:11,160 --> 00:17:15,400 privately. You will see I've done videos 449 00:17:12,920 --> 00:17:17,760 on the M5 MacBook Pro and the MacBook 450 00:17:15,400 --> 00:17:19,079 Air, so go check those out as well and 451 00:17:17,760 --> 00:17:21,520 you'll start to see the performance on 452 00:17:19,079 --> 00:17:22,959 some of the newer chips, but honestly, I 453 00:17:21,520 --> 00:17:25,199 showed you on here just cuz it's a bit 454 00:17:22,959 --> 00:17:27,280 easier to set up. Now, here in Open 455 00:17:25,199 --> 00:17:29,679 Claw, it's a similar deal. Now, I 456 00:17:27,280 --> 00:17:31,240 wouldn't go and run any auto 457 00:17:29,679 --> 00:17:33,679 configurators or anything like that. I 458 00:17:31,240 --> 00:17:36,600 would literally just go into config and 459 00:17:33,679 --> 00:17:38,400 open the raw config. And [snorts] 460 00:17:36,600 --> 00:17:41,440 looking here, 461 00:17:38,400 --> 00:17:44,160 you're concerned with models, providers, 462 00:17:41,440 --> 00:17:45,640 and you can see I've got LM Studio here. 463 00:17:44,160 --> 00:17:48,760 I've got Ollama, I've got MiniMax 464 00:17:45,640 --> 00:17:50,160 Portal, and I've also got llama.cpp 465 00:17:48,760 --> 00:17:51,960 added. This can be anything you want. 466 00:17:50,160 --> 00:17:53,800 It's just a human It's just a readable 467 00:17:51,960 --> 00:17:55,800 name just to name the set of 468 00:17:53,800 --> 00:17:58,560 configurations here. 469 00:17:55,800 --> 00:18:02,520 And just copy these. So, you've got the 470 00:17:58,560 --> 00:18:05,280 base URL is the exactly the same one as 471 00:18:02,520 --> 00:18:08,000 we put into Kilo Code with the the V1 472 00:18:05,280 --> 00:18:09,760 there. API key can be anything that you 473 00:18:08,000 --> 00:18:12,880 want. It doesn't actually need to be 474 00:18:09,760 --> 00:18:15,000 there. API needs to be OpenAI responses 475 00:18:12,880 --> 00:18:17,600 because that's the the format that 476 00:18:15,000 --> 00:18:21,080 llama.cpp expects. And then you're going 477 00:18:17,600 --> 00:18:23,720 to want to add the these model format. 478 00:18:21,080 --> 00:18:25,160 Now, if you copy and paste this, 479 00:18:23,720 --> 00:18:28,520 couple of things you want to 480 00:18:25,160 --> 00:18:33,679 uh know is the ID. And the way we get an 481 00:18:28,520 --> 00:18:37,360 ID is if I just curl request HTTP colon 482 00:18:33,679 --> 00:18:37,360 um 127.0.0.1:8080, 483 00:18:38,640 --> 00:18:42,640 we can actually just hit 484 00:18:40,400 --> 00:18:44,800 models. And this will be the models that 485 00:18:42,640 --> 00:18:46,640 we have. It's a little bit hard to see, 486 00:18:44,800 --> 00:18:48,919 but if you look carefully, there's an ID 487 00:18:46,640 --> 00:18:50,440 here. You can have multiple 488 00:18:48,919 --> 00:18:52,600 here depending on if you just want to 489 00:18:50,440 --> 00:18:54,840 save that information. Uh so, you want 490 00:18:52,600 --> 00:18:56,760 to put the ID in here and then name it a 491 00:18:54,840 --> 00:18:58,720 human-readable name. And then the next 492 00:18:56,760 --> 00:19:00,440 bit will be unique to the model itself. 493 00:18:58,720 --> 00:19:02,200 And you'll remember the context size cuz 494 00:19:00,440 --> 00:19:04,040 we actually passed it in here. So, if 495 00:19:02,200 --> 00:19:05,280 you just select that and copy it in. And 496 00:19:04,040 --> 00:19:06,960 those are the only values you're going 497 00:19:05,280 --> 00:19:08,159 to want to touch. And with that, you can 498 00:19:06,960 --> 00:19:10,280 save 499 00:19:08,159 --> 00:19:12,000 that file, 500 00:19:10,280 --> 00:19:13,919 go into Open Claw and literally just 501 00:19:12,000 --> 00:19:15,640 refresh. 502 00:19:13,919 --> 00:19:17,960 And when you go into your chat, you 503 00:19:15,640 --> 00:19:19,600 should see down the bottom of this list 504 00:19:17,960 --> 00:19:21,760 here, 505 00:19:19,600 --> 00:19:23,280 your model. On top of that, if we scroll 506 00:19:21,760 --> 00:19:26,480 down here, 507 00:19:23,280 --> 00:19:28,760 if we go to agents default model 508 00:19:26,480 --> 00:19:30,679 primary, this is where you're going to 509 00:19:28,760 --> 00:19:32,720 want to set up the whether you want it 510 00:19:30,679 --> 00:19:34,960 to be your primary model or you want it 511 00:19:32,720 --> 00:19:36,600 to be a fallback. And you basically just 512 00:19:34,960 --> 00:19:39,560 copy the provider name, which which we 513 00:19:36,600 --> 00:19:42,360 named C llama.cpp, and then you're going 514 00:19:39,560 --> 00:19:45,360 to want to copy the ID 515 00:19:42,360 --> 00:19:47,159 and put that in after a slash. 516 00:19:45,360 --> 00:19:49,000 Now, this means every time you start up 517 00:19:47,159 --> 00:19:51,400 a new chat, start to feel the machine 518 00:19:49,000 --> 00:19:53,280 get chuggy. 519 00:19:51,400 --> 00:19:55,600 19% 520 00:19:53,280 --> 00:19:57,960 29%. Now, generally, you're going to 521 00:19:55,600 --> 00:19:59,840 want not a lot of stuff running on your 522 00:19:57,960 --> 00:20:02,520 laptop during this time cuz everything 523 00:19:59,840 --> 00:20:04,000 you use will take up VRAM, will take up 524 00:20:02,520 --> 00:20:07,560 because it's unified, it's going to take 525 00:20:04,000 --> 00:20:08,920 up the RAM as well. So, I tend to think 526 00:20:07,560 --> 00:20:12,000 that 527 00:20:08,920 --> 00:20:14,400 having a separate computer, whether it's 528 00:20:12,000 --> 00:20:16,200 a Mac mini, for Open Claw, we start to 529 00:20:14,400 --> 00:20:18,000 get a little bit more justifiable to get 530 00:20:16,200 --> 00:20:20,160 a separate Mac mini. So, I just have it 531 00:20:18,000 --> 00:20:22,320 running on a separate computer just in 532 00:20:20,160 --> 00:20:25,800 the back there. And for coding, I really 533 00:20:22,320 --> 00:20:27,960 want to set up a remote a local server 534 00:20:25,800 --> 00:20:28,680 so that I can do I can 535 00:20:27,960 --> 00:20:30,400 um 536 00:20:28,680 --> 00:20:31,960 that I can connect to and actually code 537 00:20:30,400 --> 00:20:33,720 on my main machine, but the processing 538 00:20:31,960 --> 00:20:35,160 is happening on a different computer. 539 00:20:33,720 --> 00:20:36,440 Let us know if you want that video, but 540 00:20:35,160 --> 00:20:38,720 in the meantime, this is how to get it 541 00:20:36,440 --> 00:20:41,040 all set up on your main machine. 542 00:20:38,720 --> 00:20:42,560 And there we go. There is uh 543 00:20:41,040 --> 00:20:44,360 I just came online. I'm your AI 544 00:20:42,560 --> 00:20:46,280 assistant. We're good to go with Open 545 00:20:44,360 --> 00:20:48,720 Claw. So, there you go. I hope you 546 00:20:46,280 --> 00:20:51,000 enjoyed that getting local AI set up on 547 00:20:48,720 --> 00:20:52,840 your machine. I've had zero problems 548 00:20:51,000 --> 00:20:55,280 with Open Claw. I've had zero problems 549 00:20:52,840 --> 00:20:57,480 with Kobold Claw running local models 550 00:20:55,280 --> 00:20:59,240 using this method. And on top of that, 551 00:20:57,480 --> 00:21:00,880 using some cutting-edge technology like 552 00:20:59,240 --> 00:21:02,800 Turbo Quant. Let me know what you're 553 00:21:00,880 --> 00:21:05,600 using your local LLMs for, and I'll see 554 00:21:02,800 --> 00:21:05,600 you in the next one. 41261