Decent models are huge: an average one needs 8 GB of memory (better models need something like 40 to 70 GB), and most currently available engines are extremely slow on a CPU and require dedicated hardware (even a relatively powerful GPU needs a few seconds of “thinking” time). It is unlikely that these requirements can easily be squeezed into current computers; more likely, dedicated hardware will be required.
I don’t think any inference engines have actually been optimised to run on CPUs. You’re stuck with 32-bit floats, but OTOH that just means you can do gigantic Winograd transformations with the excess precision, needing far fewer fused multiply-adds in total, and CPUs are better at dealing with the memory access patterns that come with transforming the convolution. Most people have at least 1 TFLOP or so of compute in their CPU (e.g. a Ryzen 3600 has that much) that’s never seeing the light of day. That’s about a fifth of what an RX 570 has: a difference, but not an order of magnitude, and you can run SDXL on that class of card (maybe not the 570, dunno about software support, but a 5500 works, despite AMD’s best efforts to cripple ROCm).
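To make the Winograd point concrete, here is a minimal sketch of the smallest case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The function names are mine, and this is a toy 1D illustration, not what any particular inference engine does. Larger tiles save more multiplications but lose more precision, which is where the spare float32 headroom would come in.

```python
def direct_conv(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, using 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3), using 4 multiplications."""
    # Filter transform: depends only on g, so it can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform (additions only) followed by the 4 multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform (additions only).
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, -1.0, 2.0]
    print(direct_conv(d, g), winograd_f23(d, g))
```

The filter transform is paid once per filter, so in a real convolution only the input and output transforms repeat per tile.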
Also, from what I gather they’re more or less doing a summary bot for your browsing history. That’s not a ChatGPT- or Llama-style giant model you can talk with.
Also, to all those people complaining: there’s already AI in Firefox. The translation models are about 17 MB per language pair, gzipped.
ONNX Runtime is actually decently well optimized to run on CPUs, even with large models. However, the simple truth is that there’s no escaping that billion-plus-parameter models need to be quantized, and even pruned heavily, to fit in memory and not thrash the CPU cache, so that inferences/generations don’t take forever. That’s a reduction in accuracy, so the quality of the generations isn’t great.
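For anyone who hasn’t looked at what quantization actually costs, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is not ONNX Runtime’s implementation, and the function names are made up for illustration. It only shows that rounding weights to fewer bits introduces an error that grows as the bit width shrinks.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute round-trip error for a given bit width."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.8]
    for bits in (8, 5, 4):
        print(bits, "bits -> max error", max_error(weights, bits))
```

Real schemes quantize per channel or per small block of weights rather than per tensor, which keeps the error much lower than this toy version suggests.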
There is a lot of really interesting research and development being done right now on smart quantization and pruning. Model serving technologies are improving rapidly too. Paged attention is a really cool technique (for transformer-based models) that manages the KV cache in fixed-size blocks so GPU memory is used far more efficiently. I don’t think that’s supported on CPU yet, but it’s probably not that far off.
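As I understand paged attention, the core idea is bookkeeping: the key/value cache is split into fixed-size blocks, and each sequence keeps a table mapping its logical positions to whatever physical blocks happen to be free. Below is a toy sketch of just that bookkeeping. The class and method names are hypothetical, and no attention math or real tensor storage is involved.

```python
class PagedKVCache:
    """Toy block allocator: tracks which physical block holds each token slot."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # No block yet, or the last one is full: grab a free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for _ in range(3):
        print(cache.append_token("a"))
    cache.free("a")
    print(len(cache.free_blocks))
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of a whole preallocated maximum-length buffer.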
It’s a really active field, and there’s just as much interest in running huge models on huge hardware as there is in running big models on small hardware. I recently heard of layerwise inference for CPUs: load each layer of the network into the CPU cache on demand. That’s typically a bottleneck operation on GPUs, but CPU memory is so bloody fast that it might actually work fine. I haven’t played with it myself, or read the paper all that deeply, so I can’t really comment beyond saying it’s an interesting idea.
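Since I haven’t read the paper closely, take this as a sketch of the general shape only: instead of holding the whole network in memory, you fetch one layer, apply it, drop it, and move on. The `load_layer` function below is a hypothetical stand-in that fabricates a tiny toy layer; a real loader would read that layer’s weights from disk or a memory-mapped file.

```python
def load_layer(index):
    """Hypothetical loader: returns toy parameters for layer `index`."""
    return {"w": 1.0 + index, "b": 0.5 * index}


def apply_layer(layer, xs):
    """Toy layer: elementwise affine transform followed by ReLU."""
    return [max(0.0, layer["w"] * x + layer["b"]) for x in xs]


def layerwise_forward(xs, num_layers, loader=load_layer):
    """Run a forward pass while keeping only one layer resident at a time."""
    for i in range(num_layers):
        layer = loader(i)           # fetch just this layer's weights
        xs = apply_layer(layer, xs)
        del layer                   # drop the reference before loading the next
    return xs


if __name__ == "__main__":
    print(layerwise_forward([1.0, -1.0], num_layers=2))
```

The trade-off is that every forward pass pays the load cost for every layer again, so whether this is viable depends entirely on how fast the weights can be streamed in.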
Sorry, but has anyone in this thread actually tried running local LLMs on CPU? You can easily run a 7B model at varying levels of quantization (e.g. 5-bit quantization) and get a general-purpose, promptable LLM. Yeah, of course it’s going to take ~4 GB of RAM (which is memory-mapped and paged in on demand), but you can easily fine-tune smaller, more specific models (like the translation one mentioned above) and get surprising intelligence at a fraction of the resources.
Take, for example, phi-2, which performs as well as 13B-param models but with 2.7B params. Yeah, that’s still going to take 1.5 GB of RAM, which Firefox wouldn’t reasonably ship, but many lighter-weight specialized tasks could easily use something like a fine-tuned 0.3B model with quantization.
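The RAM figures above follow from simple arithmetic: parameters times bits per weight, divided by 8. Here is a quick sketch for sanity-checking them. It counts weights only, so the KV cache, activations and runtime overhead come on top, and the bit widths in the example are my assumptions, not something the model authors specify.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, params, bits in [
        ("7B at 16-bit", 7, 16),
        ("7B at 5-bit", 7, 5),
        ("2.7B at 4-bit", 2.7, 4),
        ("0.3B at 4-bit", 0.3, 4),
    ]:
        print(f"{name}: {weight_memory_gb(params, bits):.2f} GB")
```

A 7B model at 5 bits comes out at a bit over 4 GB, and a 2.7B model at 4 to 5 bits lands around 1.4 to 1.7 GB, which is in line with the numbers quoted above.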
Yes, I did. And yes, it is possible. It’s terribly slow in comparison, which makes it less useful. It very quickly devolves into random mumbling or gets stuck in weird loops. It also hogs resources that are needed by the other tasks you may be doing.
I mainly test dev AI solutions, and moving from 1B to 7B models made them vastly more pertinent. Moving from a CPU implementation (Ryzen 7 3700X) to a GPU (RTX 3080 Ti) made them fast enough to be used for quick completion and immediate suggestions without breaking workflow, and it also freed resources for the IDE, the build tools and the actual software being run. On CPU, there was a multi-second delay, which made the tool completely useless for this use case.