This is a follow-up to my prior post about whisper.cpp. Georgi Gerganov has adapted his GGML framework to run the recently circulating LLaMA weights. The PPC64 optimizations I made for whisper.cpp carry over directly; after updating the PyTorch installation on my Talos II (the weight-conversion script needs it), I was able to get llama.cpp generating text from a prompt, completely offline, using the LLaMA 7B model.
$ ./main -m ./models/7B/ggml-model-q4_0.bin -t 32 -n 128 -p "Hello world in Common Lisp"
main: seed = 1678578687
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291

main: prompt: 'Hello world in Common Lisp'
main: number of tokens in prompt = 7
     1 -> ''
 10994 -> 'Hello'
  3186 -> ' world'
   297 -> ' in'
 13103 -> ' Common'
 15285 -> ' Lis'
 29886 -> 'p'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000

Hello world in Common Lisp! We are going to learn the very basics of Common Lisp, an open source lisp implementation, which is a descendant of Lisp1. Common Lisp is the de facto standard lisp implementation of Mozilla Labs, who are using it to create modern and productive lisps for Firefox. We are going to start by having a look at its implementation of S-Expressions, which are at the core of how Common Lisp implements its lisp features. Then, we will explore its other features such as I/O, Common Lisp has a really nice and modern I

main:  mem per token = 14828340 bytes
main:      load time =  1009.64 ms
main:    sample time =   334.95 ms
main:   predict time = 86867.07 ms / 648.26 ms per token
main:     total time = 90653.54 ms
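The q4_0 in the model filename refers to GGML's 4-bit block quantization, which is how a 13 GB set of fp16 weights shrinks to the roughly 4 GB loaded above. Here is a minimal sketch of the idea in Python: weights are split into small blocks, and each block stores one floating-point scale plus a 4-bit integer per value. The function names and block size here are illustrative, not GGML's actual API (real q4_0 also packs two 4-bit values per byte, which this sketch skips).

```python
import numpy as np

def quantize_block(block):
    """Quantize a block of 32 floats to 4-bit signed integers plus one scale.

    Illustrative sketch of GGML-style q4_0 quantization; names are mine,
    not GGML's actual API.
    """
    # Scale so the largest-magnitude value maps into the 4-bit range [-8, 7].
    amax = np.abs(block).max()
    scale = amax / 7.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return scale, q

def dequantize_block(scale, q):
    """Reconstruct approximate float weights from scale + 4-bit integers."""
    return scale * q.astype(np.float32)

weights = np.random.randn(32).astype(np.float32)
scale, q = quantize_block(weights)
approx = dequantize_block(scale, q)
# Each 32-value block now costs 32 * 4 bits plus one scale,
# instead of 32 * 16 bits for fp16.
print(np.abs(weights - approx).max())
```

The rounding error per weight is bounded by half the scale, which in practice is small enough that the 7B model still generates coherent text, as the transcript above shows.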
The above example was simply the first thing I tried, with no tuning or prompt engineering; as Georgi notes in his README, don't judge the model by this output, since it was just a quick test. Each token is printed as soon as it is predicted, at a rate of about one word per second, which makes the generation interesting to watch.
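The sampling parameters reported in the log (temp = 0.8, top_k = 40, top_p = 0.95) control how each of those tokens is chosen from the model's output distribution. A rough sketch of temperature plus top-k plus top-p (nucleus) sampling, in the spirit of what llama.cpp's main example does, looks like this; the code is illustrative, not llama.cpp's actual implementation:

```python
import numpy as np

def sample_token(logits, temp=0.8, top_k=40, top_p=0.95, rng=None):
    """Pick the next token id from raw logits.

    Sketch of temperature / top-k / top-p sampling; parameters match
    the log above, but this is not llama.cpp's actual code.
    """
    rng = rng or np.random.default_rng()
    # Temperature: values above 1 flatten the distribution, below 1 sharpen it.
    logits = logits / temp
    # Top-k: keep only the k highest-scoring tokens.
    top = np.argsort(logits)[-top_k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest high-probability set whose
    # cumulative mass stays within top_p.
    order = np.argsort(probs)[::-1]
    keep = order[np.cumsum(probs[order]) <= top_p]
    if keep.size == 0:
        keep = order[:1]  # always keep at least the most likely token
    p = probs[keep] / probs[keep].sum()
    return int(top[rng.choice(keep, p=p)])

logits = np.random.randn(32000)  # vocabulary size from the log above
print(sample_token(logits))
```

With temp = 0.8 the distribution is sharpened slightly, so the model favors likely continuations while the top-k and top-p filters cut off the long tail of implausible tokens.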