This is a follow-up to my prior post about whisper.cpp. Georgi Gerganov has adapted his GGML framework to run the recently-circulating LLaMA weights. The PPC64 optimizations I made for whisper.cpp seem to carry over directly; after updating my Talos II’s PyTorch installation, I was able to get llama.cpp generating text from a prompt — completely offline — using the LLaMA 7B model.
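The PyTorch update was only needed for the weight conversion step; llama.cpp itself is plain C/C++ at inference time, but the script that turns the original .pth checkpoints into GGML files loads them with torch. The rough sequence, going from the llama.cpp README at the time (treat the exact arguments as a sketch, since they may have changed since):

# build llama.cpp (the same GGML core the whisper.cpp changes target)
$ make

# convert the original LLaMA 7B checkpoint to a GGML f16 model
# (this is the step that wants a working PyTorch)
$ python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the f16 model down to 4-bit q4_0
$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2

With the quantized model in place, the run itself looked like this: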
$ ./main -m ./models/7B/ggml-model-q4_0.bin -t 32 -n 128 -p "Hello world in Common Lisp"
main: seed = 1678578687
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
main: prompt: 'Hello world in Common Lisp'
main: number of tokens in prompt = 7
1 -> ''
10994 -> 'Hello'
3186 -> ' world'
297 -> ' in'
13103 -> ' Common'
15285 -> ' Lis'
29886 -> 'p'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000
Hello world in Common Lisp!
We are going to learn the very basics of Common Lisp, an open source lisp implementation, which is a descendant of Lisp1.
Common Lisp is the de facto standard lisp implementation of Mozilla Labs, who are using it to create modern and productive lisps for Firefox.
We are going to start by having a look at its implementation of S-Expressions, which are at the core of how Common Lisp implements its lisp features.
Then, we will explore its other features such as I/O, Common Lisp has a really nice and modern I
main: mem per token = 14828340 bytes
main: load time = 1009.64 ms
main: sample time = 334.95 ms
main: predict time = 86867.07 ms / 648.26 ms per token
main: total time = 90653.54 ms
The above example was just the first thing I tried, with no tuning or prompt engineering, so as Georgi notes in his README, don't judge the model by this output. Each token is printed as soon as it is predicted, and at roughly 650 ms per token (about one word per second, since tokens are subword pieces) the generation is interesting to watch.
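As an aside, the reported model size checks out, if I understand GGML's q4_0 format correctly: each block of 32 weights is stored as a 4-byte float scale plus 16 bytes of 4-bit values, i.e. 20 bytes per 32 weights, or 5 bits per weight. From the hyperparameters printed above, LLaMA 7B works out to roughly 6.74 billion weights, so a quick sanity check looks like:

$ python3 -c 'print(6.74e9 * 20 / 32 / 2**20, "MiB")'
# prints roughly 4017 MiB, in line with the model size reported in the log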