1
00:00:00,000 --> 00:00:04,760
I had so many issues trying to get local
2
00:00:02,320 --> 00:00:07,120
LLMs working with Claude Code, with
3
00:00:04,760 --> 00:00:09,440
Kilo Code, with OpenClaw. I kept
4
00:00:07,120 --> 00:00:12,240
getting this pesky client timeout error
5
00:00:09,440 --> 00:00:14,360
in LM Studio and Ollama would just wipe
6
00:00:12,240 --> 00:00:16,520
over my OpenClaw settings. But I wanted
7
00:00:14,360 --> 00:00:18,880
to get to the bottom of how I can
8
00:00:16,520 --> 00:00:21,000
get the purest setup running local
9
00:00:18,880 --> 00:00:23,440
models on my computer and I found the
10
00:00:21,000 --> 00:00:25,240
answer in llama.cpp. So today we're
11
00:00:23,440 --> 00:00:26,840
going to get that set up and how you can
12
00:00:25,240 --> 00:00:28,920
run it. Follow it carefully and I can
13
00:00:26,840 --> 00:00:30,120
guarantee you'll have zero issues. Now
14
00:00:28,920 --> 00:00:31,560
what a lot of people don't realize is
15
00:00:30,120 --> 00:00:34,040
that LM Studio and Ollama are actually
16
00:00:31,560 --> 00:00:36,880
just wrappers around llama.cpp. They just
17
00:00:34,040 --> 00:00:38,680
provide a nice new user interface and
18
00:00:36,880 --> 00:00:40,880
we'll get into this but Ollama just
19
00:00:38,680 --> 00:00:43,680
added this but LM Studio have had it for
20
00:00:40,880 --> 00:00:45,840
quite a while: the option to run MLX
21
00:00:43,680 --> 00:00:48,000
models. Ollama only have it for one
22
00:00:45,840 --> 00:00:50,520
model whereas you get all of the models
23
00:00:48,000 --> 00:00:52,720
you can find on Hugging Face
24
00:00:50,520 --> 00:00:54,200
downloadable through LM Studio. And with
25
00:00:52,720 --> 00:00:56,120
this method we're going to leverage what
26
00:00:54,200 --> 00:00:58,640
they call TurboQuant. TurboQuant was
27
00:00:56,120 --> 00:01:01,920
something Google announced this year
28
00:00:58,640 --> 00:01:04,960
which inevitably is a new way to quantal
29
00:01:01,920 --> 00:01:06,160
quantisize quantisize quantisization of
30
00:01:04,960 --> 00:01:08,200
is that a
31
00:01:06,160 --> 00:01:11,040
real word? It's an efficient approach to
32
00:01:08,200 --> 00:01:12,960
extreme compression. Again, I'm not an
33
00:01:11,040 --> 00:01:15,440
expert in all this but it's an
34
00:01:12,960 --> 00:01:17,800
interesting way to optimize key value or
35
00:01:15,440 --> 00:01:22,600
KV cache and whilst it hasn't been
36
00:01:17,800 --> 00:01:25,600
incorporated into llama.cpp yet, Tom here
37
00:01:22,600 --> 00:01:27,880
has gone ahead and created a branch that
38
00:01:25,600 --> 00:01:31,040
enables this. Eventually in time I've no
39
00:01:27,880 --> 00:01:33,160
doubt that we'll get it in llama.cpp.
40
00:01:31,040 --> 00:01:35,680
Now, Tom is the one who's done this, so
41
00:01:33,160 --> 00:01:37,880
give him a thanks and
42
00:01:35,680 --> 00:01:39,320
give his repo a star there. So whilst
43
00:01:37,880 --> 00:01:41,520
we've got the base model and
44
00:01:39,320 --> 00:01:43,400
quantization happens
45
00:01:41,520 --> 00:01:45,520
on the model itself we can't really do
46
00:01:43,400 --> 00:01:48,320
anything about that; we just choose the
47
00:01:45,520 --> 00:01:50,320
model we want to use. It's that KV storage
48
00:01:48,320 --> 00:01:52,840
that we want to optimize and that's
49
00:01:50,320 --> 00:01:54,520
exactly what TurboQuant does. Now the
50
00:01:52,840 --> 00:01:57,200
first thing you're going to want to do
51
00:01:54,520 --> 00:01:59,160
is head on over to Tom's Turbo
52
00:01:57,200 --> 00:02:02,080
Quant plus. I'll leave everything linked
53
00:01:59,160 --> 00:02:04,240
below and there's a lot to take in here
54
00:02:02,080 --> 00:02:06,560
and I'm not going to claim to know all
55
00:02:04,240 --> 00:02:08,560
of this stuff but we'll go through it
56
00:02:06,560 --> 00:02:10,800
step by step and I'll try and point out
57
00:02:08,560 --> 00:02:12,320
the things you should really care about.
58
00:02:10,800 --> 00:02:13,760
This is the section you're going to want
59
00:02:12,320 --> 00:02:16,640
to really care about and the first thing
60
00:02:13,760 --> 00:02:19,120
you want to do is actually clone the Git
61
00:02:16,640 --> 00:02:21,120
repo. And this won't be a lesson on the
62
00:02:19,120 --> 00:02:23,160
terminal but I'll do my best to explain
63
00:02:21,120 --> 00:02:24,960
it here. Do yourself a favor and
64
00:02:23,160 --> 00:02:26,640
download Warp. Links are down below in
65
00:02:24,960 --> 00:02:28,480
the description. This will make it, if
66
00:02:26,640 --> 00:02:31,040
things go wrong a hell of a lot easier
67
00:02:28,480 --> 00:02:32,760
because it's an AI powered terminal and
68
00:02:31,040 --> 00:02:34,640
it will just help you figure it out. It
69
00:02:32,760 --> 00:02:36,240
did so on one of my old machines which
70
00:02:34,640 --> 00:02:38,840
obviously had some sort of conflicts or
71
00:02:36,240 --> 00:02:40,880
something like that. So, within terminal
72
00:02:38,840 --> 00:02:42,720
you want to cd, which is change directory,
73
00:02:40,880 --> 00:02:44,800
to the directory that you
74
00:02:42,720 --> 00:02:46,280
want to store this app in. Now it's not
75
00:02:44,800 --> 00:02:48,400
going to be a traditional app which you
76
00:02:46,280 --> 00:02:49,840
double click and then it opens up;
77
00:02:48,400 --> 00:02:52,040
you're literally going to be downloading
78
00:02:49,840 --> 00:02:54,800
the raw files that build this
79
00:02:52,040 --> 00:02:56,200
application. So what I'd suggest is CD
80
00:02:54,800 --> 00:02:58,520
and if you know the location of the
81
00:02:56,200 --> 00:03:00,520
directory just type it in here. I know
82
00:02:58,520 --> 00:03:02,480
mine is in Sites, as an example; that's
83
00:03:00,520 --> 00:03:03,920
where I want to download mine or if
84
00:03:02,480 --> 00:03:06,760
you're brand new to this, open up your
85
00:03:03,920 --> 00:03:09,000
Finder, literally just drag the folder in
86
00:03:06,760 --> 00:03:11,200
there and press enter and it will change
87
00:03:09,000 --> 00:03:12,520
to that directory. Press enter. I've already
88
00:03:11,200 --> 00:03:14,400
got it so it's probably going to fail
89
00:03:12,520 --> 00:03:16,520
there but it will
90
00:03:14,400 --> 00:03:18,640
download that code. Next we're
91
00:03:16,520 --> 00:03:19,960
going to CD into that folder but Warp's
92
00:03:18,640 --> 00:03:21,920
probably going to help us out when this
93
00:03:19,960 --> 00:03:23,920
is done.
94
00:03:21,920 --> 00:03:27,560
And sure enough it does.
95
00:03:23,920 --> 00:03:29,160
And then you check out the branch
96
00:03:27,560 --> 00:03:32,160
because this is all stored on GitHub
97
00:03:29,160 --> 00:03:34,080
where, as I say, all the code lives. This
98
00:03:32,160 --> 00:03:36,680
is a special feature that they've made
99
00:03:34,080 --> 00:03:37,960
off of the main branch, if that makes sense.
100
00:03:36,680 --> 00:03:40,040
Basically you're going to want to check
101
00:03:37,960 --> 00:03:42,120
out that branch here which again I
102
00:03:40,040 --> 00:03:44,800
already have so just copy that line of
103
00:03:42,120 --> 00:03:46,280
code there paste it in and it should
104
00:03:44,800 --> 00:03:47,200
just check that out. I'm not going to do
105
00:03:46,280 --> 00:03:49,400
that.
106
00:03:47,200 --> 00:03:52,480
And obviously we're on Apple silicon here
107
00:03:49,400 --> 00:03:53,720
but if you're Nvidia or AMD you're going
108
00:03:52,480 --> 00:03:55,840
to want to follow the relevant
109
00:03:53,720 --> 00:03:58,240
instructions, but as I say, we're Metal, and
110
00:03:55,840 --> 00:04:00,280
this is where on my other machine Warp
111
00:03:58,240 --> 00:04:01,800
came in clutch cuz it couldn't find
112
00:04:00,280 --> 00:04:04,120
something or there was a conflict
113
00:04:01,800 --> 00:04:06,200
elsewhere doing this command and I was
114
00:04:04,120 --> 00:04:08,160
able to step through that using Warp as
115
00:04:06,200 --> 00:04:10,000
a helpful tool. And then you're
116
00:04:08,160 --> 00:04:12,320
going to want to actually run the build
117
00:04:10,000 --> 00:04:14,480
script which again I already have but
118
00:04:12,320 --> 00:04:16,400
just copy that back into that and this
119
00:04:14,480 --> 00:04:18,200
might take a little bit of time and then
120
00:04:16,400 --> 00:04:21,040
what you'll be left with this is the
121
00:04:18,200 --> 00:04:24,880
folder that got checked out here
122
00:04:21,040 --> 00:04:26,560
and in build/bin this is basically what
123
00:04:24,880 --> 00:04:27,680
you just created all of these different
124
00:04:26,560 --> 00:04:29,680
little applications. We're not
125
00:04:27,680 --> 00:04:32,280
interested in most of these we're only
126
00:04:29,680 --> 00:04:34,360
interested in llama.cpp
127
00:04:32,280 --> 00:04:36,320
and we're interested in llama
128
00:04:34,360 --> 00:04:38,200
server here. It's worth saying here this
129
00:04:36,320 --> 00:04:40,200
will only need to be done once and then
130
00:04:38,200 --> 00:04:41,920
you navigate to this folder with CD to
131
00:04:40,200 --> 00:04:44,240
do everything from this point onwards so
132
00:04:41,920 --> 00:04:45,880
you only do that build step once. So now
133
00:04:44,240 --> 00:04:48,160
with all that done we're going to
134
00:04:45,880 --> 00:04:51,200
actually want to choose a model that we
135
00:04:48,160 --> 00:04:53,360
want to use. So for this we're going to
136
00:04:51,200 --> 00:04:54,680
go to Hugging Face which is basically
137
00:04:53,360 --> 00:04:56,640
where everyone uploads all of their
138
00:04:54,680 --> 00:04:58,600
models and you've got some of the raw
139
00:04:56,640 --> 00:05:01,000
models
140
00:04:58,600 --> 00:05:03,080
such as Qwen 3.6 which is what we're
141
00:05:01,000 --> 00:05:05,520
going to deal with today. This is the
142
00:05:03,080 --> 00:05:08,120
raw files. We go in here and there's a
143
00:05:05,520 --> 00:05:09,960
bunch of just all of these safetensors
144
00:05:08,120 --> 00:05:12,200
files and stuff like that. We're not
145
00:05:09,960 --> 00:05:14,240
going to be touching all that today cuz
146
00:05:12,200 --> 00:05:17,360
what people have kindly done and if we
147
00:05:14,240 --> 00:05:19,280
search Qwen 3.6 the thing that we're
148
00:05:17,360 --> 00:05:22,920
going to want to make sure we include is
149
00:05:19,280 --> 00:05:22,920
you'll want to download
150
00:05:22,960 --> 00:05:27,920
one of these here. Now a popular one a
151
00:05:25,560 --> 00:05:31,200
trusted one is Unsloth. They're another
152
00:05:27,920 --> 00:05:34,680
AI app and they have
153
00:05:31,200 --> 00:05:37,040
built the model as GGUF
154
00:05:34,680 --> 00:05:39,240
which is basically a model just
155
00:05:37,040 --> 00:05:41,280
condensed down to a single file that we
156
00:05:39,240 --> 00:05:43,720
can then use. And just to break down the
157
00:05:41,280 --> 00:05:46,680
actual naming here, the model is
158
00:05:43,720 --> 00:05:48,160
obviously Qwen 3.6 it's a 35 billion
159
00:05:46,680 --> 00:05:50,720
parameter model and because it's a
160
00:05:48,160 --> 00:05:52,720
mixture-of-experts model, it has 3 billion
161
00:05:50,720 --> 00:05:55,960
active parameters. We don't have any
162
00:05:52,720 --> 00:06:00,000
other model of Qwen 3.6 but if we look
163
00:05:55,960 --> 00:06:01,760
at Qwen 3.5 we've got 27 billion model
164
00:06:00,000 --> 00:06:03,280
here we've got a 9 billion parameter
165
00:06:01,760 --> 00:06:04,720
model here we've got a 27 billion
166
00:06:03,280 --> 00:06:07,080
parameter model here. That's kind of how
167
00:06:04,720 --> 00:06:09,000
to read these model names. And with that
168
00:06:07,080 --> 00:06:10,920
you want to take a look at this section
169
00:06:09,000 --> 00:06:13,520
here. Now this will be entirely
170
00:06:10,920 --> 00:06:15,880
dependent on how much VRAM that you
171
00:06:13,520 --> 00:06:18,040
have. If you've got bucket loads
172
00:06:15,880 --> 00:06:20,000
hundreds of gigs of VRAM then you're
173
00:06:18,040 --> 00:06:22,600
really looking at the 16-bit which is
174
00:06:20,000 --> 00:06:24,360
the unquantized version there's been no
175
00:06:22,600 --> 00:06:26,880
sort of compression added to the model
176
00:06:24,360 --> 00:06:27,880
it's a full fat version you're probably
177
00:06:26,880 --> 00:06:29,600
want to be
178
00:06:27,880 --> 00:06:33,320
you're going to be safe to download this
179
00:06:29,600 --> 00:06:36,040
one. I have 64 gigabytes of RAM so I'll
180
00:06:33,320 --> 00:06:37,720
start to look into all of these versions
181
00:06:36,040 --> 00:06:40,040
of the model. These are quantization
182
00:06:37,720 --> 00:06:42,520
levels how much they've been compressed
183
00:06:40,040 --> 00:06:44,680
in an attempt to maintain
184
00:06:42,520 --> 00:06:47,480
quality but reduce the size but
185
00:06:44,680 --> 00:06:49,960
realistically the more quantized you get
186
00:06:47,480 --> 00:06:51,520
the more sacrifice, the more detriment to
187
00:06:49,960 --> 00:06:53,640
the model is going to take place. So the
188
00:06:51,520 --> 00:06:55,600
higher up you can get the better which
189
00:06:53,640 --> 00:06:58,320
is why we talk about the more VRAM the
190
00:06:55,600 --> 00:07:00,360
better. However, there's another layer
191
00:06:58,320 --> 00:07:02,800
that we need to think of which is again
192
00:07:00,360 --> 00:07:05,520
that KV storage, which is why TurboQuant
193
00:07:02,800 --> 00:07:07,720
comes in so handy, because it
194
00:07:05,520 --> 00:07:09,600
compresses that and reduces that in
195
00:07:07,720 --> 00:07:11,800
order to squeeze more out of a model.
196
00:07:09,600 --> 00:07:15,520
Whereas traditionally you might look to
197
00:07:11,800 --> 00:07:17,440
reserve another 15 gig on top of the
198
00:07:15,520 --> 00:07:20,400
actual model size that you need in your
199
00:07:17,440 --> 00:07:23,360
VRAM to support it, we can now think
200
00:07:20,400 --> 00:07:26,640
about adding about 10 gig more. So as an
201
00:07:23,360 --> 00:07:29,480
example, my 64 gig can't even run
202
00:07:26,640 --> 00:07:33,720
this model however the 8-bit version
203
00:07:29,480 --> 00:07:36,560
the 38 gig model plus 10 gig, between 48
204
00:07:33,720 --> 00:07:39,160
and 50 gig let's say this will fit on my
205
00:07:36,560 --> 00:07:40,920
machine with the full context. It's that
206
00:07:39,160 --> 00:07:43,200
context amount that we're dealing with.
207
00:07:40,920 --> 00:07:44,960
However, again depending how much VRAM
208
00:07:43,200 --> 00:07:47,080
you have, you might have to go lower and
209
00:07:44,960 --> 00:07:49,000
lower and lower which starts to become
210
00:07:47,080 --> 00:07:51,880
arguable when you're quantizing it at
211
00:07:49,000 --> 00:07:53,960
3-bit 2-bit and 1-bit. So
212
00:07:51,880 --> 00:07:56,640
this will be dependent on you however
213
00:07:53,960 --> 00:07:59,280
much RAM you've got will determine what
214
00:07:56,640 --> 00:08:02,120
quantization level you can go to. Now
215
00:07:59,280 --> 00:08:05,080
the XL. So we've got quantized eight,
216
00:08:02,120 --> 00:08:07,600
and K is the way that they've quantized
217
00:08:05,080 --> 00:08:10,560
it. You've got some other ones here IQ
218
00:08:07,600 --> 00:08:11,880
you've got MXFP4
219
00:08:10,560 --> 00:08:14,440
you've got all of these different types
220
00:08:11,880 --> 00:08:16,000
of quantization. Honestly I always just
221
00:08:14,440 --> 00:08:19,360
look to the K ones. This is the more
222
00:08:16,000 --> 00:08:21,560
modern way to quantize a model. And then
223
00:08:19,360 --> 00:08:25,880
within that you've got another sort of
224
00:08:21,560 --> 00:08:28,320
extra large or medium or small again
225
00:08:25,880 --> 00:08:30,840
sort of micro adjustments made to the
226
00:08:28,320 --> 00:08:32,479
type of quantization that further brings
227
00:08:30,840 --> 00:08:35,080
down the
228
00:08:32,479 --> 00:08:36,760
the amount of space it takes up. Long
229
00:08:35,080 --> 00:08:39,880
story short you're going to want to go
230
00:08:36,760 --> 00:08:42,880
as high bit rate as you possibly can as
231
00:08:39,880 --> 00:08:45,320
large a sort of size as you possibly can
232
00:08:42,880 --> 00:08:48,240
and basically just looking at the number
233
00:08:45,320 --> 00:08:49,480
of gigabytes in relation to your RAM is
234
00:08:48,240 --> 00:08:51,520
the thing that you're going to want to
235
00:08:49,480 --> 00:08:53,760
care about. So you simply click the
236
00:08:51,520 --> 00:08:55,160
version you want to download. I've
237
00:08:53,760 --> 00:08:56,400
already, again, I've already done that;
238
00:08:55,160 --> 00:09:00,160
it'll take some time. I've just
239
00:08:56,400 --> 00:09:02,320
downloaded 38 gigs worth of
240
00:09:00,160 --> 00:09:04,960
model and it should just download as a
241
00:09:02,320 --> 00:09:08,480
single file into your downloads folder.
242
00:09:04,960 --> 00:09:11,120
Now coming back into our app folder here
243
00:09:08,480 --> 00:09:12,960
if we navigate backwards and there
244
00:09:11,120 --> 00:09:14,640
should be a little models folder. This
245
00:09:12,960 --> 00:09:16,640
model can be stored anywhere but I just
246
00:09:14,640 --> 00:09:19,000
find it's a lot easier and a lot simpler
247
00:09:16,640 --> 00:09:22,920
just to store it in this area here. I'm
248
00:09:19,000 --> 00:09:24,920
going to download my GGUF model Qwen 3.6
249
00:09:22,920 --> 00:09:28,560
into that folder there. You can see it's
250
00:09:24,920 --> 00:09:30,360
38.45 gig and this is just a nice place
251
00:09:28,560 --> 00:09:31,600
for us to work. This is where the fun
252
00:09:30,360 --> 00:09:34,920
begins.
253
00:09:31,600 --> 00:09:36,920
So if we come to this build script here
254
00:09:34,920 --> 00:09:39,120
which is
255
00:09:36,920 --> 00:09:40,920
let's just run the CLI first this is
256
00:09:39,120 --> 00:09:42,400
nice and easy. Now I'm going to go
257
00:09:40,920 --> 00:09:45,280
through all of these parameters here
258
00:09:42,400 --> 00:09:47,240
with you. If you paste that in this is
259
00:09:45,280 --> 00:09:50,560
this is what it takes to run it. Now we
260
00:09:47,240 --> 00:09:52,920
are not going to be passing in a query
261
00:09:50,560 --> 00:09:55,600
so we can already delete that. We don't
262
00:09:52,920 --> 00:09:58,280
need --jinja and we don't need N100 being
263
00:09:55,600 --> 00:10:00,520
on a Mac. This is how many
264
00:09:58,280 --> 00:10:02,440
GPU cores you're going to designate to
265
00:10:00,520 --> 00:10:03,916
the model. We don't really worry about
266
00:10:02,440 --> 00:10:03,920
that because we're on a Mac.
267
00:10:03,916 --> 00:10:06,080
>> [snorts]
268
00:10:03,920 --> 00:10:07,960
>> Now, this is the KV storage that we
269
00:10:06,080 --> 00:10:09,840
talk about. Now, Tom has done the
270
00:10:07,960 --> 00:10:11,720
research here. I'm not going to claim to
271
00:10:09,840 --> 00:10:13,720
be a genius. Tom should be a genius, but he's
272
00:10:11,720 --> 00:10:17,440
done a lot of testing here into
273
00:10:13,720 --> 00:10:20,480
different ways to quantize the model.
274
00:10:17,440 --> 00:10:24,200
And long story short, the result of that
275
00:10:20,480 --> 00:10:25,760
is asymmetric quantization gets the
276
00:10:24,200 --> 00:10:28,320
uncompromised
277
00:10:25,760 --> 00:10:30,800
performance out of the model whilst
278
00:10:28,320 --> 00:10:33,360
leveraging TurboQuant. With symmetrical
279
00:10:30,800 --> 00:10:35,840
TurboQuant, you would sacrifice the
280
00:10:33,360 --> 00:10:38,280
model performance, which is interesting.
281
00:10:35,840 --> 00:10:40,840
So, coming [snorts] back into here, the
282
00:10:38,280 --> 00:10:44,760
V storage he recommends leaving it as
283
00:10:40,840 --> 00:10:47,360
Turbo 3, but the K storage actually
284
00:10:44,760 --> 00:10:51,120
quantizing it asymmetrically. So,
285
00:10:47,360 --> 00:10:53,800
quantizing it at 8-bit here and Turbo
286
00:10:51,120 --> 00:10:56,840
3 or Turbo 4 with the V storage. Now,
287
00:10:53,800 --> 00:10:59,520
this is the default. So, we can
288
00:10:56,840 --> 00:11:02,080
just delete that there and we can be
289
00:10:59,520 --> 00:11:03,800
quite happy with that. Now,
290
00:11:02,080 --> 00:11:06,800
this allows us to break onto a new line,
291
00:11:03,800 --> 00:11:09,040
so don't worry about that. We want FA on
292
00:11:06,800 --> 00:11:11,800
and we are going to want to actually
293
00:11:09,040 --> 00:11:15,400
delete this one here and we're left with
294
00:11:11,800 --> 00:11:16,880
basically that. Now, C is going to be
295
00:11:15,400 --> 00:11:19,720
very important. This is the amount of
296
00:11:16,880 --> 00:11:21,600
context you're allowing the model to
297
00:11:19,720 --> 00:11:23,240
have. Now, if you're running into
298
00:11:21,600 --> 00:11:25,520
issues, you might need to reduce the
299
00:11:23,240 --> 00:11:27,240
context. So, again, you might be able to
300
00:11:25,520 --> 00:11:29,160
fit the model on your hard
301
00:11:27,240 --> 00:11:31,200
drive, but can you fit the full context
302
00:11:29,160 --> 00:11:33,560
amount? You might be able to squeeze a
303
00:11:31,200 --> 00:11:36,640
little bit more juice. Let's say, for
304
00:11:33,560 --> 00:11:38,400
example, you've got 40 gig of RAM here.
305
00:11:36,640 --> 00:11:40,600
You might be able to download this
306
00:11:38,400 --> 00:11:42,640
level, but reduce the context so much
307
00:11:40,600 --> 00:11:44,320
that it only takes a couple of gig of
308
00:11:42,640 --> 00:11:46,520
space. So, you're managing your
309
00:11:44,320 --> 00:11:50,839
context like that. I'm fortunate enough
310
00:11:46,520 --> 00:11:55,520
to have 64 GB of RAM on this M1 Max. So,
311
00:11:50,839 --> 00:11:58,800
what I recommend doing, if we go to
312
00:11:55,520 --> 00:12:01,280
Qwen 3.6 here, is finding out the
313
00:11:58,800 --> 00:12:03,400
context window size of the
314
00:12:01,280 --> 00:12:05,839
model. So, if I just search context
315
00:12:03,400 --> 00:12:05,839
here,
316
00:12:06,000 --> 00:12:10,320
the context length here is 262,144.
317
00:12:10,440 --> 00:12:14,320
It can extend up to a million tokens. We're
318
00:12:12,880 --> 00:12:16,720
not going to push that. I'm going to
319
00:12:14,320 --> 00:12:20,040
copy that number there and actually give
320
00:12:16,720 --> 00:12:22,360
myself the maximum number of context
321
00:12:20,040 --> 00:12:24,400
size. And honestly,
322
00:12:22,360 --> 00:12:26,440
having done this quite a few times now,
323
00:12:24,400 --> 00:12:28,080
it's always a balancing act depending on
324
00:12:26,440 --> 00:12:29,920
what you're doing, how much context you
325
00:12:28,080 --> 00:12:31,880
want to give it. Obviously, simple chat
326
00:12:29,920 --> 00:12:33,920
applications don't need a lot of
327
00:12:31,880 --> 00:12:36,880
context. However, if you're doing code,
328
00:12:33,920 --> 00:12:39,520
if you're doing like multi-step
329
00:12:36,880 --> 00:12:41,320
like file reading and things like that,
330
00:12:39,520 --> 00:12:42,720
the more context you can get, the
331
00:12:41,320 --> 00:12:44,480
better.
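The trade-off between context size and memory can be estimated with some rough arithmetic. This is a minimal sketch, not the method used in the video: it assumes a standard attention layout where every layer stores one key and one value vector per KV head for each token. The layer count, head count, and head size below are hypothetical placeholders rather than the real Qwen 3.6 values, and the bit widths ignore the small per-block overhead that real quantized cache formats carry.

```python
GIB = 2**30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, k_bits, v_bits):
    """Approximate KV cache size in bytes.

    Each token stores one key and one value vector per layer and per KV head,
    so the element count is the same for K and V; only the bit width differs.
    """
    elements_per_side = n_layers * n_kv_heads * head_dim * context_tokens
    return elements_per_side * (k_bits + v_bits) / 8


def fits_in_memory(model_gib, kv_gib, total_gib, headroom_gib=4.0):
    """True if model weights plus KV cache plus some headroom fit in memory."""
    return model_gib + kv_gib + headroom_gib <= total_gib


# Hypothetical model shape (assumption, not taken from the model card).
shape = dict(n_layers=40, n_kv_heads=8, head_dim=128, context_tokens=262_144)

f16_gib = kv_cache_bytes(**shape, k_bits=16, v_bits=16) / GIB
asym_gib = kv_cache_bytes(**shape, k_bits=8, v_bits=3) / GIB  # 8-bit K, ~3-bit V

print(f"f16 cache:        {f16_gib:.2f} GiB")
print(f"8-bit K, 3-bit V: {asym_gib:.2f} GiB")
print("fits in 64 GiB:", fits_in_memory(38.45, asym_gib, 64))
```

Halving the context value roughly halves the cache estimate, which is why lowering the context size is the first thing to try when the model file fits but the full context does not.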
332
00:12:42,720 --> 00:12:46,240
Now, finally, we're just going to pass
333
00:12:44,480 --> 00:12:48,120
in the actual model. Now, we're already
334
00:12:46,240 --> 00:12:52,600
navigating to the models folder, which
335
00:12:48,120 --> 00:12:54,920
we moved our model file into. So, if I go
336
00:12:52,600 --> 00:12:57,880
here, I literally just copy the name of
337
00:12:54,920 --> 00:13:00,080
the file, paste that in instead of that,
338
00:12:57,880 --> 00:13:01,760
and it's going to turn blue there. Now,
339
00:13:00,080 --> 00:13:04,240
hit enter. This is going to actually
340
00:13:01,760 --> 00:13:07,600
enable us to run this model. Maybe I can
341
00:13:04,240 --> 00:13:11,000
bring in my GPU history here
342
00:13:07,600 --> 00:13:13,120
and even my activity monitor.
343
00:13:11,000 --> 00:13:16,240
You can already see that our RAM has
344
00:13:13,120 --> 00:13:17,600
just been filled up totally with our
345
00:13:16,240 --> 00:13:19,040
model.
346
00:13:17,600 --> 00:13:20,680
Uh
347
00:13:19,040 --> 00:13:24,560
All right.
348
00:13:20,680 --> 00:13:26,520
And now I'm using Qwen 3.6 35 billion
349
00:13:24,560 --> 00:13:28,320
parameter model.
350
00:13:26,520 --> 00:13:31,200
It's not the fastest and this is what we
351
00:13:28,320 --> 00:13:34,200
need to come to expect of local models,
352
00:13:31,200 --> 00:13:35,720
but at a modest 53 tokens per second,
353
00:13:34,200 --> 00:13:38,240
it's not so bad.
354
00:13:35,720 --> 00:13:39,680
And if you're happy using llama.cpp in
355
00:13:38,240 --> 00:13:42,040
this way and using the model in this
356
00:13:39,680 --> 00:13:44,040
way, more power to you, you're done. But
357
00:13:42,040 --> 00:13:45,320
if you want to use another application,
358
00:13:44,040 --> 00:13:47,680
you want to bring it into another app,
359
00:13:45,320 --> 00:13:49,600
you want to build your own app or
360
00:13:47,680 --> 00:13:51,440
something like that, what we really want
361
00:13:49,600 --> 00:13:53,800
to do, if we cancel that and press up,
362
00:13:51,440 --> 00:13:57,800
which will go to the previous
363
00:13:53,800 --> 00:14:00,600
command that we ran, if we change CLI to
364
00:13:57,800 --> 00:14:02,280
server, which if you remember correctly
365
00:14:00,600 --> 00:14:05,640
inside of
366
00:14:02,280 --> 00:14:09,280
build/bin, I said we'd be interested in
367
00:14:05,640 --> 00:14:10,360
the llama-server or llama-cli, wherever
368
00:14:09,280 --> 00:14:12,560
it is.
369
00:14:10,360 --> 00:14:14,560
We're actually going to use the server
370
00:14:12,560 --> 00:14:15,839
application, let's call it. So, if we
371
00:14:14,560 --> 00:14:18,000
hit that there, it's going to do
372
00:14:15,839 --> 00:14:21,240
everything it needs to do to enable us
373
00:14:18,000 --> 00:14:23,520
to serve this model across our local
374
00:14:21,240 --> 00:14:25,600
machine or across the network, if we
375
00:14:23,520 --> 00:14:27,040
choose to. So, we're waiting for this.
376
00:14:25,600 --> 00:14:29,839
Main server is listening on
377
00:14:27,040 --> 00:14:31,760
127.0.0.1:8080.
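Once the server reports that it is listening, it can also be checked from a script instead of the browser. The sketch below assumes the server is running at the address shown in the terminal and that it answers on the OpenAI-style `/v1/models` route with a JSON object holding a `data` list of entries that each have an `id`. The helper names are made up for illustration.

```python
import json
import urllib.request


def models_url(base_url):
    """Build the model-listing URL from the server's base address."""
    return base_url.rstrip("/") + "/v1/models"


def extract_model_ids(payload):
    """Pull the id of every model out of an OpenAI-style list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://127.0.0.1:8080", timeout=5.0):
    """Ask a running server which models it has loaded."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# With the server running, calling list_models() returns the loaded model ids.
```

The id returned here is the value that client tools ask for when you add a model entry by hand.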
378
00:14:29,839 --> 00:14:34,720
If I click that, open it up in my
379
00:14:31,760 --> 00:14:37,480
browser, and we have a simple user
380
00:14:34,720 --> 00:14:38,480
interface running the Qwen 3.6 model
381
00:14:37,480 --> 00:14:40,000
here.
382
00:14:38,480 --> 00:14:41,440
And you can chat with it and do all the
383
00:14:40,000 --> 00:14:44,440
rest of it, but we're really not that
384
00:14:41,440 --> 00:14:46,200
interested in using a UI for our use
385
00:14:44,440 --> 00:14:48,800
case. What I'm going to do now is
386
00:14:46,200 --> 00:14:51,920
actually bring it into VS Code to enable
387
00:14:48,800 --> 00:14:53,800
us to run this across codebases and
388
00:14:51,920 --> 00:14:56,480
actually code with it.
389
00:14:53,800 --> 00:14:59,480
So, if you download VS Code, it's
390
00:14:56,480 --> 00:15:01,520
completely free, and what I recommend is
391
00:14:59,480 --> 00:15:03,720
downloading the Kilo Code extension,
392
00:15:01,520 --> 00:15:05,079
which again is completely free. They are
393
00:15:03,720 --> 00:15:07,079
a silver sponsor of the channel, but
394
00:15:05,079 --> 00:15:08,320
they are not sponsoring this episode
395
00:15:07,079 --> 00:15:11,480
whatsoever.
396
00:15:08,320 --> 00:15:13,280
If you go into your settings here
397
00:15:11,480 --> 00:15:15,839
and providers,
398
00:15:13,280 --> 00:15:18,839
this is where you find the different providers
399
00:15:15,839 --> 00:15:21,640
that provide AI. And we're going to want
400
00:15:18,839 --> 00:15:23,880
to add our own custom provider. Now, we
401
00:15:21,640 --> 00:15:25,640
can put llama.cpp,
402
00:15:23,880 --> 00:15:27,400
really anything you want, and then a
403
00:15:25,640 --> 00:15:30,040
human-readable
404
00:15:27,400 --> 00:15:32,520
version of that is in the display name.
405
00:15:30,040 --> 00:15:35,360
If we paste in our URL here, which is
406
00:15:32,520 --> 00:15:37,800
127.0.0.1:8080,
407
00:15:35,360 --> 00:15:39,200
the thing that you saw in the terminal
408
00:15:37,800 --> 00:15:40,839
here,
409
00:15:39,200 --> 00:15:43,079
/v1,
410
00:15:40,839 --> 00:15:45,600
this is automatically going to pick up
411
00:15:43,079 --> 00:15:47,640
the fact that we're running llama.cpp
412
00:15:45,600 --> 00:15:49,360
and it's found the model that we've
413
00:15:47,640 --> 00:15:51,160
already got loaded. So, we're going to
414
00:15:49,360 --> 00:15:54,800
add that there
415
00:15:51,160 --> 00:15:56,800
and submit that. And now, when we click
416
00:15:54,800 --> 00:15:59,880
on the model selector down here, you've
417
00:15:56,800 --> 00:16:03,880
got all the models under llama.cpp, we
418
00:15:59,880 --> 00:16:07,520
have Qwen 3.6 35 billion. And I'm just
419
00:16:03,880 --> 00:16:09,560
going to say, "What does this codebase
420
00:16:07,520 --> 00:16:11,959
do?"
421
00:16:09,560 --> 00:16:14,040
If we flip back into the terminal here,
422
00:16:11,959 --> 00:16:16,240
you'll see this progress build up. Now,
423
00:16:14,040 --> 00:16:18,240
this is the prefill. This is it loading
424
00:16:16,240 --> 00:16:20,320
up and getting ready to answer our
425
00:16:18,240 --> 00:16:22,959
question. Now, I have an old machine.
426
00:16:20,320 --> 00:16:25,480
This is an M1 Max. This is where the
427
00:16:22,959 --> 00:16:28,760
actual performance of your chip comes
428
00:16:25,480 --> 00:16:30,520
into play over the actual VRAM size.
429
00:16:28,760 --> 00:16:32,320
This is an old chip. It's going to take
430
00:16:30,520 --> 00:16:34,079
a long time to do this. We're already at
431
00:16:32,320 --> 00:16:35,680
0.27
432
00:16:34,079 --> 00:16:38,040
and this might seem fast in the
433
00:16:35,680 --> 00:16:40,920
beginning, but we'll start to see the
434
00:16:38,040 --> 00:16:44,600
GPU start ramping up. We'll start to see
435
00:16:40,920 --> 00:16:47,160
the RAM just the memory pressure and the
436
00:16:44,600 --> 00:16:48,800
memory use ramp up a little bit there.
437
00:16:47,160 --> 00:16:51,280
And this will go round and round and
438
00:16:48,800 --> 00:16:55,000
round and eventually, we will get a
439
00:16:51,280 --> 00:16:57,000
response inside of Kilo Code. With a
440
00:16:55,000 --> 00:16:59,000
newer machine, a newer chip, this will
441
00:16:57,000 --> 00:17:00,360
be a lot faster for you, but for me,
442
00:16:59,000 --> 00:17:02,240
this takes a while. So, we're not going
443
00:17:00,360 --> 00:17:03,680
to sit around and do it here, but you
444
00:17:02,240 --> 00:17:06,199
can pat yourself on the back and know
445
00:17:03,680 --> 00:17:09,199
that you've got a local LLM running
446
00:17:06,199 --> 00:17:11,160
inside of Kilo Code ready to start
447
00:17:09,199 --> 00:17:12,920
coding completely free and completely
448
00:17:11,160 --> 00:17:15,400
privately. You will see I've done videos
449
00:17:12,920 --> 00:17:17,760
on the M5 MacBook Pro and the MacBook
450
00:17:15,400 --> 00:17:19,079
Air, so go check those out as well and
451
00:17:17,760 --> 00:17:21,520
you'll start to see the performance on
452
00:17:19,079 --> 00:17:22,959
some of the newer chips, but honestly, I
453
00:17:21,520 --> 00:17:25,199
showed you on here just cuz it's a bit
454
00:17:22,959 --> 00:17:27,280
easier to set up. Now, here in
455
00:17:25,199 --> 00:17:29,679
OpenClaw, it's a similar deal. Now, I
456
00:17:27,280 --> 00:17:31,240
wouldn't go and run any auto
457
00:17:29,679 --> 00:17:33,679
configurators or anything like that. I
458
00:17:31,240 --> 00:17:36,600
would literally just go into config and
459
00:17:33,679 --> 00:17:38,400
open the raw config. And
460
00:17:36,600 --> 00:17:41,440
looking here,
461
00:17:38,400 --> 00:17:44,160
you're concerned with models, providers,
462
00:17:41,440 --> 00:17:45,640
and you can see I've got LM Studio here.
463
00:17:44,160 --> 00:17:48,760
I've got Ollama, I've got MiniMax
464
00:17:45,640 --> 00:17:50,160
Portal, and I've also got llama.cpp
465
00:17:48,760 --> 00:17:51,960
added. This can be anything you want.
466
00:17:50,160 --> 00:17:53,800
It's just a human-readable
467
00:17:51,960 --> 00:17:55,800
name just to name the set of
468
00:17:53,800 --> 00:17:58,560
configurations here.
469
00:17:55,800 --> 00:18:02,520
And just copy these. So, you've got the
470
00:17:58,560 --> 00:18:05,280
base URL is exactly the same one as
471
00:18:02,520 --> 00:18:08,000
we put into Kilo Code, with the /v1
472
00:18:05,280 --> 00:18:09,760
there. API key can be anything that you
473
00:18:08,000 --> 00:18:12,880
want. It doesn't actually need to be
474
00:18:09,760 --> 00:18:15,000
there. API needs to be OpenAI responses
475
00:18:12,880 --> 00:18:17,600
because that's the format that
476
00:18:15,000 --> 00:18:21,080
llama.cpp expects. And then you're going
477
00:18:17,600 --> 00:18:23,720
to want to add this model format.
478
00:18:21,080 --> 00:18:25,160
Now, if you copy and paste this,
479
00:18:23,720 --> 00:18:28,520
a couple of things you want to
480
00:18:25,160 --> 00:18:33,679
know is the ID. And the way we get an
481
00:18:28,520 --> 00:18:37,360
ID is if I just make a curl request to
482
00:18:33,679 --> 00:18:37,360
http://127.0.0.1:8080,
483
00:18:38,640 --> 00:18:42,640
we can actually just hit
484
00:18:40,400 --> 00:18:44,800
models. And this will be the models that
485
00:18:42,640 --> 00:18:46,640
we have. It's a little bit hard to see,
486
00:18:44,800 --> 00:18:48,919
but if you look carefully, there's an ID
487
00:18:46,640 --> 00:18:50,440
here. You can have multiple
488
00:18:48,919 --> 00:18:52,600
here depending on if you just want to
489
00:18:50,440 --> 00:18:54,840
save that information. So, you want
490
00:18:52,600 --> 00:18:56,760
to put the ID in here and then name it a
491
00:18:54,840 --> 00:18:58,720
human-readable name. And then the next
492
00:18:56,760 --> 00:19:00,440
bit will be unique to the model itself.
493
00:18:58,720 --> 00:19:02,200
And you'll remember the context size cuz
494
00:19:00,440 --> 00:19:04,040
we actually passed it in here. So, if
495
00:19:02,200 --> 00:19:05,280
you just select that and copy it in. And
496
00:19:04,040 --> 00:19:06,960
those are the only values you're going
497
00:19:05,280 --> 00:19:08,159
to want to touch. And with that, you can
498
00:19:06,960 --> 00:19:10,280
save
499
00:19:08,159 --> 00:19:12,000
that file,
500
00:19:10,280 --> 00:19:13,919
go into Open Claw and literally just
501
00:19:12,000 --> 00:19:15,640
refresh.
502
00:19:13,919 --> 00:19:17,960
And when you go into your chat, you
503
00:19:15,640 --> 00:19:19,600
should see down the bottom of this list
504
00:19:17,960 --> 00:19:21,760
here,
505
00:19:19,600 --> 00:19:23,280
your model. On top of that, if we scroll
506
00:19:21,760 --> 00:19:26,480
down here,
507
00:19:23,280 --> 00:19:28,760
if we go to agents default model
508
00:19:26,480 --> 00:19:30,679
primary, this is where you're going to
509
00:19:28,760 --> 00:19:32,720
want to set up whether you want it
510
00:19:30,679 --> 00:19:34,960
to be your primary model or you want it
511
00:19:32,720 --> 00:19:36,600
to be a fallback. And you basically just
512
00:19:34,960 --> 00:19:39,560
copy the provider name, which we
513
00:19:36,600 --> 00:19:42,360
named llama.cpp, and then you're going
514
00:19:39,560 --> 00:19:45,360
to want to copy the ID
515
00:19:42,360 --> 00:19:47,159
and put that in after a slash.
516
00:19:45,360 --> 00:19:49,000
Now, this means every time you start up
517
00:19:47,159 --> 00:19:51,400
a new chat, you start to feel the machine
518
00:19:49,000 --> 00:19:53,280
get chuggy.
519
00:19:51,400 --> 00:19:55,600
19%
520
00:19:53,280 --> 00:19:57,960
29%. Now, generally, you're going to
521
00:19:55,600 --> 00:19:59,840
want not a lot of stuff running on your
522
00:19:57,960 --> 00:20:02,520
laptop during this time cuz everything
523
00:19:59,840 --> 00:20:04,000
you use will take up VRAM and,
524
00:20:02,520 --> 00:20:07,560
because it's unified, it's going to take
525
00:20:04,000 --> 00:20:08,920
up the RAM as well. So, I tend to think
526
00:20:07,560 --> 00:20:12,000
that
527
00:20:08,920 --> 00:20:14,400
having a separate computer, whether it's
528
00:20:12,000 --> 00:20:16,200
a Mac mini, for Open Claw, it starts to
529
00:20:14,400 --> 00:20:18,000
get a little bit more justifiable to get
530
00:20:16,200 --> 00:20:20,160
a separate Mac mini. So, I just have it
531
00:20:18,000 --> 00:20:22,320
running on a separate computer just in
532
00:20:20,160 --> 00:20:25,800
the back there. And for coding, I really
533
00:20:22,320 --> 00:20:27,960
want to set up a local server
534
00:20:25,800 --> 00:20:28,680
that I can
535
00:20:27,960 --> 00:20:30,400
connect to
536
00:20:28,680 --> 00:20:31,960
and actually code
537
00:20:30,400 --> 00:20:33,720
on my main machine, but the processing
538
00:20:31,960 --> 00:20:35,160
is happening on a different computer.
539
00:20:33,720 --> 00:20:36,440
Let us know if you want that video, but
540
00:20:35,160 --> 00:20:38,720
in the meantime, this is how to get it
541
00:20:36,440 --> 00:20:41,040
all set up on your main machine.
542
00:20:38,720 --> 00:20:42,560
And there we go. There it is.
543
00:20:41,040 --> 00:20:44,360
I just came online. I'm your AI
544
00:20:42,560 --> 00:20:46,280
assistant. We're good to go with Open
545
00:20:44,360 --> 00:20:48,720
Claw. So, there you go. I hope you
546
00:20:46,280 --> 00:20:51,000
enjoyed getting local AI set up on
547
00:20:48,720 --> 00:20:52,840
your machine. I've had zero problems
548
00:20:51,000 --> 00:20:55,280
with Open Claw. I've had zero problems
549
00:20:52,840 --> 00:20:57,480
with Kilo Code running local models
550
00:20:55,280 --> 00:20:59,240
using this method. And on top of that,
551
00:20:57,480 --> 00:21:00,880
using some cutting-edge technology like
552
00:20:59,240 --> 00:21:02,800
Turbo Quant. Let me know what you're
553
00:21:00,880 --> 00:21:05,600
using your local LLMs for, and I'll see
554
00:21:02,800 --> 00:21:05,600
you in the next one.