ggllm.cpp is a ggml-backed tool to run quantized Falcon 7B and 40B models on CPU and GPU.

|
3 | | -For detailed (growing) examples and help check the new Wiki: |
| 3 | +For growing examples and help check the new Wiki: |
4 | 4 | https://github.com/cmp-nct/ggllm.cpp/wiki |
5 | 5 |
**Features that differentiate from llama.cpp for now:**
- Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
- Fully automated CUDA GPU offloading based on available and total VRAM
- Run any Falcon model at up to 16k context without losing sanity
- Current Falcon inference speed on consumer GPUs: up to 54+ tokens/sec for 7B and 18-25 tokens/sec for 40B at 3-6 bit quantization, roughly 38/sec and 16/sec at 1000 tokens generated
- Supports running Falcon 40B on a single 4090 or 3090 (24 and 15 tokens/sec respectively), and even on a 3080 with a bit of quality sacrifice
- Finetune auto-detection and integrated syntax support (just load an OpenAssistant 7B/40B finetune and add `-ins` for a chat, or `-enc -p "Question"` with an optional `-sys "System prompt"`); see the usage sketch after this list
- Higher VRAM efficiency when using batched processing (more layers can be offloaded)
- 16-bit cuBLAS support (takes half the VRAM for those operations)
- Improved loading screen and visualization
- New tokenizer with regex emulation and BPE merge support
- Optimized RAM and VRAM calculation with batch processing support
- More selective command line options (such as disabling GPUs, setting a system prompt, or stopwords via `-S`)

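A minimal usage sketch for the flags mentioned above; the binary name `falcon_main` and the model filename are placeholders/assumptions, so check the Wiki for the full option list:

```bash
# Interactive chat with an OpenAssistant finetune (prompt syntax is auto-detected)
./falcon_main -m falcon-40b-sft-mix-1226.ggccv10.bin -ins

# One-shot question using the finetune's prompt encoding and an optional system prompt
./falcon_main -m falcon-40b-sft-mix-1226.ggccv10.bin -enc -p "Question" -sys "System prompt"
```
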
**What is missing/being worked on:**
- Performance (current priority)
- Web frontend example
- Full GPU offloading of Falcon
- Optimized quantization versions for Falcon
- A new instruct mode
- Large context support (4k-64k in the works)

**Old model support**
If you use GGML-type models (file versions 1-4), you need to place tokenizer.json into the model directory (example: https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226/blob/main/tokenizer.json).
Updated model binaries are file version 10+ and called "GGCC"; those do not need that json file to be loaded and converted.

https://huggingface.co/tiiuae/falcon-7b-instruct
https://huggingface.co/OpenAssistant
https://huggingface.co/OpenAssistant/falcon-7b-sft-mix-2000
https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226
_The sft-mix variants appear more capable than the sft-top variants._
_Download the 7B or 40B Falcon version, use falcon_convert.py (latest version) in 32 bit mode, then falcon_quantize to convert it to ggcc-v10._

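A rough sketch of that two-step pipeline follows; the argument order, paths, and quantization type name are assumptions, so check the Wiki and each tool's help output for the exact syntax:

```bash
# 1) Convert the HF checkpoint to a 32-bit GGML file (illustrative arguments only)
python falcon_convert.py ~/models/falcon-40b-sft-mix-1226 ~/models/falcon-40b-sft-mix-1226

# 2) Quantize the converted file to a GGCC v10 binary (type name is a placeholder)
./falcon_quantize ~/models/falcon-40b-sft-mix-1226/ggml-model-f32.bin ~/models/falcon-40b-q4_0.bin q4_0
```
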
**Prompting finetuned models right:**
https://github.com/cmp-nct/ggllm.cpp/discussions/36

**Conversion of HF models and quantization:**
1) use falcon_convert.py to produce a GGML v1 binary from HF - not recommended to be used directly