Using llama.cpp for LLM format conversion, quantization, inference, and deployment


  • LLM format conversion, quantization, inference, and deployment
    • Overview
    • Cloning and building
    • Environment preparation
    • Model format conversion
      • GGUF format
      • bin format
    • Model quantization
    • Model loading and inference
    • Model API service
    • Model API service (third-party)
    • GPU inference

LLM format conversion, quantization, inference, and deployment

Overview

The main goal of llama.cpp is to enable LLM inference on a wide range of hardware with minimal setup and state-of-the-art performance. It provides 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization to speed up inference and reduce memory usage.

GitHub: https://github.com/ggerganov/llama.cpp

Cloning and building

Clone the latest llama.cpp repository:

git clone https://github.com/ggerganov/llama.cpp

Build the llama.cpp project; a series of executables are generated in the project directory, including:

main: run inference with a model
quantize: quantize a model
server: expose a model through an API service

1. Build for CPU execution. This is the simplest setup and suits machines without a GPU.

cd llama.cpp
make

2. Build for GPU execution. Make sure the CUDA toolkit is installed; this suits machines with a GPU.

If CUDA is set up correctly, running nvidia-smi and nvcc --version should complete without errors.

make clean && make LLAMA_CUDA=1

3. If the build fails or you need to rebuild, clean first and then rebuild until it succeeds:

make clean

Environment preparation

1. Download a supported model

To use llama.cpp, you first need to prepare a model that it supports. The official documentation explains which models are supported.

2. Install dependencies

The llama.cpp project ships with a requirements.txt file; simply install the dependencies from it.

pip install -r requirements.txt

Model format conversion

Depending on the model architecture, either convert.py or convert-hf-to-gguf.py can be used.

The conversion script reads the model configuration, tokenizer, and tensor names + data, and converts them into GGUF metadata and tensors.

GGUF format

Compared with its two predecessors, Llama-3 significantly expands the vocabulary, from 32K to 128K tokens, and switches to a BPE vocabulary. The tokenization algorithm therefore has to be specified with the --vocab-type option: the default is spm, so bpe must be specified explicitly.

Note:

The official documentation says convert.py does not support LLaMA 3 and recommends convert-hf-to-gguf.py, but that script does not accept --vocab-type and fails with: error: unrecognized arguments: --vocab-type bpe. convert.py was therefore used here and worked without issues.

Use the convert.py script from the llama.cpp project to convert the model to GGUF format:

root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.gguf
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.gguf, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   1
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   2
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   2
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   2
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   2
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   2
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   2
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   3
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   3
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   3

Converted to FP16 GGUF, the model is roughly 15 GB.

root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf

bin format

The same convert.py command can also write the output with a .bin extension; as the log below shows, the file content is still GGUF, only the file name differs.

root@master:~/work/llama.cpp# python3 ./convert.py  /root/work/models/Llama3-Chinese-8B-Instruct/ --outtype f16 --vocab-type bpe --outfile ./models/Llama3-FP16.bin
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00001-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00002-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00003-of-00004.safetensors
INFO:convert:Loading model file /root/work/models/Llama3-Chinese-8B-Instruct/model-00004-of-00004.safetensors
INFO:convert:model parameters count : 8030261248 (8B)
INFO:convert:params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=8192, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct'))
INFO:convert:Loaded vocab file PosixPath('/root/work/models/Llama3-Chinese-8B-Instruct/tokenizer.json'), type 'bpe'
INFO:convert:Vocab info: <BpeVocab with 128000 base tokens and 256 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 280147 merges, special tokens {'bos': 128000, 'eos': 128001}, add special tokens unset>
INFO:convert:Writing models/Llama3-FP16.bin, format 1
WARNING:convert:Ignoring added_tokens.json since model matches vocab size without it.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:gguf.vocab:Setting chat_template to {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>' }}
INFO:convert:[  1/291] Writing tensor token_embd.weight                      | size 128256 x   4096  | type F16  | T+   4
INFO:convert:[  2/291] Writing tensor blk.0.attn_norm.weight                 | size   4096           | type F32  | T+   4
INFO:convert:[  3/291] Writing tensor blk.0.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   4
INFO:convert:[  4/291] Writing tensor blk.0.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  5/291] Writing tensor blk.0.ffn_up.weight                    | size  14336 x   4096  | type F16  | T+   5
INFO:convert:[  6/291] Writing tensor blk.0.ffn_norm.weight                  | size   4096           | type F32  | T+   5
INFO:convert:[  7/291] Writing tensor blk.0.attn_k.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[  8/291] Writing tensor blk.0.attn_output.weight               | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[  9/291] Writing tensor blk.0.attn_q.weight                    | size   4096 x   4096  | type F16  | T+   5
INFO:convert:[ 10/291] Writing tensor blk.0.attn_v.weight                    | size   1024 x   4096  | type F16  | T+   5
INFO:convert:[ 11/291] Writing tensor blk.1.attn_norm.weight                 | size   4096           | type F32  | T+   5
INFO:convert:[ 12/291] Writing tensor blk.1.ffn_down.weight                  | size   4096 x  14336  | type F16  | T+   5
INFO:convert:[ 13/291] Writing tensor blk.1.ffn_gate.weight                  | size  14336 x   4096  | type F16  | T+   5
root@master:~/work/llama.cpp# ll models -h
-rw-r--r--  1 root root  15G May 17 07:47 Llama3-FP16.gguf
-rw-r--r--  1 root root  15G May 17 08:02 Llama3-FP16.bin

Model quantization

Model quantization uses the quantize command. Its available options and the allowed quantization types are shown below:

root@master:~/work/llama.cpp# ./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

Use quantize to quantize the model; it supports a range of bit widths: Q2, Q3, Q4, Q5, Q6, Q8, and F16.

Quantized model names follow the pattern Q + bit width + variant (for example, Q4_K_M is a 4-bit K-quant, medium variant). The fewer the bits, the lower the hardware requirements, but also the lower the model's accuracy.

After quantization, the model size drops from about 15 GB to 8 GB, while the weights go from 16-bit floating point to 8-bit integers.

root@master:~/work/llama.cpp# ./quantize ./models/Llama3-FP16.gguf  ./models/Llama3-q8.gguf q8_0
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/root/work/models/Llama3-FP16.gguf' to '/root/work/models/Llama3-q8.gguf' as Q8_0
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/work/models/Llama3-FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 1
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,128256]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["? ?", "? ???", "?? ??", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q8_0 .. size =  1002.00 MiB ->   532.31 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q8_0 .. size =   112.00 MiB ->    59.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q8_0 .. size =    32.00 MiB ->    17.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q8_0 .. size =     8.00 MiB ->     4.25 MiB
root@master:~/work/llama.cpp# ll -h models/
-rw-r--r--  1 root root 8.0G May 17 07:54 Llama3-q8.gguf

Model loading and inference

Model loading and inference use the main command, which supports the following options:

root@master:~/work/llama.cpp# ./main -h
usage: ./main [options]

options:
  -h, --help            show this help message and exit
  --version             show version and build info
  -i, --interactive     run in interactive mode
  --interactive-specials allow special tokens in user text, in interactive mode
  --interactive-first   run in interactive mode and wait for input right away
  -cnv, --conversation  run in conversation mode (does not print special tokens and suffix/prefix)
  -ins, --instruct      run in instruction mode (use with Alpaca models)
  -cml, --chatml        run in chatml mode (use with ChatML-compatible models)
  --multiline-input     allows you to write or paste multiple lines without ending each in '\'
  -r PROMPT, --reverse-prompt PROMPT
                        halt generation at PROMPT, return control in interactive mode
                        (can be specified more than once for multiple prompts).
  --color               colorise output to distinguish prompt and user input from generations
  -s SEED, --seed SEED  RNG seed (default: -1, use random seed for < 0)
  -t N, --threads N     number of threads to use during generation (default: 30)
  -tb N, --threads-batch N
                        number of threads to use during batch and prompt processing (default: same as --threads)
  -td N, --threads-draft N
                        number of threads to use during generation (default: same as --threads)
  -tbd N, --threads-batch-draft N
                        number of threads to use during batch and prompt processing (default: same as --threads-draft)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: empty)

You can load either the original pre-trained model or a quantized one; here the quantized model is loaded for inference.

From the root of the llama.cpp project, run the following command to load the model and start inference.

root@master:~/work/llama.cpp# ./main -m models/Llama3-q8.gguf --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.1
Log start
main: build = 2908 (359cbe3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1715935175
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|begin_of_text|>Below is an instruction that describes a task. Write a response that appropriately completes the request.
> hi
Hello! How can I help you today?<|eot_id|>
>

Type your prompt after the > prompt. Use Ctrl+C to interrupt output; for multi-line input, end each line with \. Run ./main -h to see the full help and option descriptions. Some commonly used options are listed below:

Option              Description
-m                  Path to the LLaMA model file
-mu                 Remote HTTP URL from which to download the model file
-i                  Run in interactive mode, providing input directly and receiving real-time responses
-ins                Run in instruction mode, particularly useful for Alpaca-style models
-f                  Prompt template file; for Alpaca models, load prompts/alpaca.txt
-n                  Maximum length of the generated reply (default: 128)
-c                  Prompt context size; larger values let the model use a longer conversation history (default: 512)
-b                  Batch size (default: 8); can be increased as appropriate
-t                  Number of threads (default: 4); can be increased as appropriate
--repeat_penalty    Penalty applied to repeated text in the generated reply
--temp              Temperature; lower values make replies less random, higher values more random
--top_p, --top_k    Sampling parameters for decoding
--color             Colorize output to distinguish user input from generated text

Model API service

llama.cpp provides an API that is fully compatible with the OpenAI API. Use the server executable produced by the build to start the API service.

root@master:~/work/llama.cpp# ./server -m models/Llama3-q8.gguf --host 0.0.0.0 --port 8000
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2942,"msg":"build info","build":2908,"commit":"359cbe3f"}
{"tid":"140018656950080","timestamp":1715936504,"level":"INFO","function":"main","line":2947,"msg":"system info","n_threads":30,"n_threads_batch":-1,"total_threads":30,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "}
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-q8.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama3-Chinese-8B-Instruct
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336

Once the API service is running, you can test it with curl:

curl --request POST \
     --url http://localhost:8000/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "Hi"}'
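
The same request can be made from Python. Below is a minimal sketch using the requests library, assuming the server above is listening on localhost:8000; the request fields ("prompt", "n_predict", "temperature") and the "content" field in the response follow the llama.cpp server's /completion API.

# Minimal sketch: call the llama.cpp server's /completion endpoint from Python.
# Assumes the server started above is reachable at http://localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/completion",
    json={
        "prompt": "Hi",       # text to complete
        "n_predict": 128,     # maximum number of tokens to generate
        "temperature": 0.2,   # lower values give less random output
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["content"])  # the generated text is returned in the "content" field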

Model API service (third-party)

The llama.cpp project lists third-party bindings in various languages that can also be used to provide an API service. Taking Python as an example, llama-cpp-python is used here to serve the API.

Install the dependency (the second command below uses the Aliyun PyPI mirror):

pip install llama-cpp-python
pip install llama-cpp-python -i https://mirrors.aliyun.com/pypi/simple/

Note: the following dependencies may also be missing; install them as indicated by any exceptions raised at startup.

pip install sse_starlette starlette_context pydantic_settings

Start the API service; by default it runs at http://localhost:8000

python -m llama_cpp.server --model models/Llama3-q8.gguf

Install the openai dependency:

pip install openai

Call the API service with the openai client (an OPENAI_API_KEY placeholder is set because the client requires one, even though the local server ignores it):

import os
from openai import OpenAI  # import the OpenAI client library

# Point the client at the local server
os.environ["OPENAI_BASE_URL"] = "http://localhost:8000/v1"
# Placeholder key: the local server does not validate it, but the client requires one
os.environ.setdefault("OPENAI_API_KEY", "sk-no-key-required")

client = OpenAI()  # create the OpenAI client object

# Call the model
completion = client.chat.completions.create(
    model="llama3",  # any value works here
    messages=[
        {"role": "system", "content": "你是一個樂于助人的助手。"},
        {"role": "user", "content": "你好!"},
    ],
)

# Print the model's reply
print(completion.choices[0].message)
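
llama-cpp-python can also load the GGUF model directly in-process, without going through the HTTP server. Below is a minimal sketch, assuming llama-cpp-python is installed and the quantized file produced earlier exists at models/Llama3-q8.gguf.

# Minimal sketch: run the quantized model in-process with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama3-q8.gguf",  # path to the quantized GGUF model
    n_ctx=2048,                          # prompt context size
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "你是一個樂于助人的助手。"},
        {"role": "user", "content": "你好!"},
    ],
    max_tokens=256,
)

print(out["choices"][0]["message"]["content"])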


GPU inference

If the GPU build was used, the -ngl N / --n-gpu-layers N option specifies how many layers to offload so that inference runs on the GPU.

For example, -ngl 40 offloads 40 layers of model parameters to the GPU.

Without -ngl N / --n-gpu-layers N, the program runs on the CPU by default:

root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000

The key startup log lines below show that the model is not running on the GPU:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512

With -ngl N / --n-gpu-layers N, the program runs on the GPU:

root@master:~/work/llama.cpp# ./server -m models/Llama3-FP16.gguf  --host 0.0.0.0 --port 8000   --n-gpu-layers 1000

The key startup log lines below show that the model is running on the GPU:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:Device 0: Tesla V100S-PCIE-32GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1002.00 MiB
llm_load_tensors:      CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0

Running nvidia-smi further confirms that the model is running on the GPU.
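
If you serve the model through llama-cpp-python instead (see the third-party API section above), the equivalent knob is n_gpu_layers. A minimal sketch, assuming llama-cpp-python was installed with CUDA support:

# Minimal sketch: GPU offload with llama-cpp-python (requires a CUDA-enabled build).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Llama3-q8.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; use a smaller number for partial offload
    n_ctx=2048,
)

The llama_cpp.server module used earlier accepts the same setting via its --n_gpu_layers flag.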
