
Commit 27da5e7

Author: xiaying (committed)
MNN:Sync: sync internal 3.0.3
1 parent c23d82c commit 27da5e7


74 files changed, +18338 -23808 lines changed


docs/faq.md

Lines changed: 11 additions & 0 deletions
@@ -258,6 +258,17 @@ OpenCL / Vulkan register their backends with the MNN core library via static-variable self-registration.
 ### Note on Register-related memory leaks
 Checking with valgrind reports memory leaks related to MNN Register; this is one-time initialization memory that does not grow afterwards and can be treated as a false positive.

+### Note on Metal-related memory growth
+
+The Metal backend uses Objective-C objects, which must be released through Objective-C's autorelease mechanism. Wrap the relevant MNN API calls in an @autoreleasepool block in your code so the memory is reclaimed automatically.
+
+```
+@autoreleasepool {
+    /* MNN-related calls */
+}
+
+```
+
 ## Performance
 ### When using the GPU, calling copyToHostTensor / readMap is very slow

docs/tools/compress.md

Lines changed: 26 additions & 3 deletions
@@ -28,7 +28,7 @@ The MNN model compression tool provides model compression and acceleration methods including low-rank decomposition, pruning, and quantization.
 | Training quantization | Converts float convolutions to int8 convolution compute. Requires training; improves quantized-model accuracy, cuts storage to 1/4 of the original model, lowers memory use, and accelerates compute (some models may run slower than the float model, because float and int8 use different optimizations) | LSQ, OAQ, WAQ |
 | Direct weight quantization | Quantizes only the model weights and restores them to float at compute time, so it only reduces model storage; compute speed matches float. Can be done in one step during model conversion. With 8-bit quantization, accuracy is essentially unchanged and model size shrinks to 1/4 of the original | Symmetric quantization, asymmetric quantization |
 | Training weight quantization | Same characteristics as direct weight quantization, but implemented through the mnncompress compression-algorithm plugin, so it supports lower-bit weight quantization for greater storage savings and better accuracy after weight quantization; for example, with 4-bit quantization, model size shrinks to 1/8 of the original | Symmetric quantization |
-| FP16 | Converts FP32 compute to FP16 compute. Can be done in one step during model conversion; model size shrinks to 1/2 of the original with essentially no accuracy loss | - |
+| FP16 | Converts FP32 weights to the FP16 type. Can be done in one step during model conversion; model size shrinks to 1/2 of the original with essentially no accuracy loss | - |

 ### How do I use it?
 1. The compression features in the model conversion tool need no extra data; just add the corresponding options when converting the model. With dynamic quantization enabled, compute-heavy operators such as convolutions can also run with quantized acceleration.
@@ -64,19 +64,42 @@ The MNN model compression tool provides model compression and acceleration methods including low-rank decomposition, pruning, and quantization.
 --weightQuantBits 8 [--weightQuantAsymmetric](optional) [--weightQuantBlock 128](optional)
 ```
 The `--weightQuantAsymmetric` option selects asymmetric quantization, which is somewhat more accurate than the default symmetric quantization.
-`--weightQuantBlock 128` quantizes in blocks of 128; if unset, the input channel count is used as the block size. If you want to trade some storage for better quantization accuracy, add this setting; in theory smaller values give higher accuracy, but values below 32 are not recommended.
+`--weightQuantBlock 128` quantizes in blocks of 128; if unset, the input channel count is used as the block size. If you are willing to trade some storage for better quantization accuracy, add this setting. In theory smaller values give higher accuracy, but the value must not be below 32.
+
 - Dynamic quantization
 Dynamic quantization support in the MNN runtime can be enabled as follows, so that core operators such as convolutions in a weight-quantized model use quantized computation, reducing memory and improving performance.
+
 1. Build MNN with the MNN_LOW_MEMORY compile macro enabled (adds dynamic quantization support)
+```
+cmake .. -DMNN_LOW_MEMORY=ON
+```
+
 2. When using the mnn model, set the memory mode to low

+```
+MNN::ScheduleConfig config;
+BackendConfig backendConfig;
+backendConfig.memory = BackendConfig::Memory_Low;
+config.backendConfig = &backendConfig;
+```
+
 ### FP16 compression
-- Converts the model's FP32 weights to FP16 storage and enables FP16 inference on supported devices, giving an inference speedup; the model shrinks to 1/2 of its original size. Can be done in one step during model conversion and is easy to use
+- Converts the model's FP32 weights to FP16 storage; size shrinks to 1/2 of the original
 - Convert with `MNNConvert` (C++) or `mnnconvert` (bundled with the Python package) and add the following option to the conversion command line:
 ```bash
 --fp16
 ```

+Note: FP16 compression is unrelated to FP16 acceleration. As long as precision = low is set, MNN enables FP16 acceleration on supported devices for both FP32 and FP16 models.
+
+```
+MNN::ScheduleConfig config;
+BackendConfig backendConfig;
+backendConfig.precision = BackendConfig::Precision_Low;
+config.backendConfig = &backendConfig;
+```
+
 ## Offline quantization tool
 ### Installing the offline quantization tool
 - C++ tool installation
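Taken together, the snippets added to docs/tools/compress.md above amount to: convert the model with weight quantization (or `--fp16`), then request low memory and/or low precision at runtime. Below is an editor's sketch, not part of the commit, of how the two BackendConfig snippets might be combined when creating a session; the model file name is a placeholder and error handling is omitted.

```cpp
// Editor's sketch, not part of the commit: combining the Memory_Low and
// Precision_Low settings from the docs above when creating an MNN session.
#include <memory>
#include <MNN/Interpreter.hpp>

int main() {
    // "model_quant.mnn" is a hypothetical weight-quantized model produced by MNNConvert.
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model_quant.mnn"),
        MNN::Interpreter::destroy);

    MNN::ScheduleConfig config;
    MNN::BackendConfig backendConfig;
    backendConfig.memory    = MNN::BackendConfig::Memory_Low;     // dynamic quantization path (needs an MNN_LOW_MEMORY build)
    backendConfig.precision = MNN::BackendConfig::Precision_Low;  // FP16 acceleration on supported devices
    config.backendConfig    = &backendConfig;

    auto session = net->createSession(config);
    // ... fill inputs via net->getSessionInput(session, nullptr), then:
    net->runSession(session);
    return 0;
}
```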

express/Executor.cpp

Lines changed: 7 additions & 0 deletions
@@ -233,6 +233,13 @@ void Executor::RuntimeManager::setHint(Interpreter::HintMode mode, int value) {
 void Executor::RuntimeManager::setExternalPath(std::string path, int type) {
     mInside->modes.setExternalPath(path, type);
 }
+void Executor::RuntimeManager::setHintPtr(Interpreter::HintMode mode, void* value) {
+    auto current = ExecutorScope::Current();
+    auto rt = current->getRuntime();
+    for (auto& iter : rt.first) {
+        iter.second->pMeta = value;
+    }
+}

 bool Executor::RuntimeManager::getInfo(Interpreter::SessionInfoCode code, void* ptr) {
     // Only support get memory
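The new `setHintPtr` simply stores the given pointer on every runtime's `pMeta` field. The following is a speculative editor's sketch, not from the commit: the use of `KVCACHE_INFO` as the mode and the stand-in struct are assumptions for illustration only; the real payload type is internal to MNN.

```cpp
// Editor's sketch, not part of the commit. Assumes KVCACHE_INFO is the mode
// intended for pointer-valued hints and uses a stand-in struct for the payload.
#include <memory>
#include <MNN/Interpreter.hpp>
#include <MNN/expr/Executor.hpp>

struct FakeKvMeta {        // hypothetical stand-in for MNN's internal KV metadata
    size_t previous = 0;
    size_t remove   = 0;
    size_t add      = 0;
};

void passKvMeta(std::shared_ptr<MNN::Express::Executor::RuntimeManager> rtmgr,
                FakeKvMeta* meta) {
    // Integer hints keep using setHint; pointer-valued hints go through setHintPtr.
    rtmgr->setHintPtr(MNN::Interpreter::KVCACHE_INFO, meta);
}
```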

include/MNN/Interpreter.hpp

Lines changed: 6 additions & 0 deletions
@@ -236,6 +236,12 @@ class MNN_PUBLIC Interpreter {
         KVCACHE_SIZE_LIMIT = 8,
         // Op encoder number for commit
         OP_ENCODER_NUMBER_FOR_COMMIT = 9,
+
+        // KVCache Info
+        KVCACHE_INFO = 10,
+        // mmap allocate file size, KB
+        MMAP_FILE_SIZE = 11,
+        USE_CACHED_MMAP = 12
     };

     enum ExternalPathType {
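A small editor's sketch, not from the commit, of exercising the new integer hints through the `RuntimeManager::setHint` API declared in the Executor.hpp change below; the values are illustrative only, and `MMAP_FILE_SIZE` is in KB per the new comment above.

```cpp
// Editor's sketch, not part of the commit: setting the new hint values.
#include <memory>
#include <MNN/Interpreter.hpp>
#include <MNN/expr/Executor.hpp>

void applyMmapHints(std::shared_ptr<MNN::Express::Executor::RuntimeManager> rtmgr) {
    rtmgr->setHint(MNN::Interpreter::MMAP_FILE_SIZE, 10 * 1024); // illustrative: 10 MB, value is in KB
    rtmgr->setHint(MNN::Interpreter::USE_CACHED_MMAP, 1);        // illustrative: enable cached mmap
}
```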

include/MNN/MNNDefine.h

Lines changed: 1 addition & 1 deletion
@@ -75,6 +75,6 @@ MNN_ERROR("Check failed: %s ==> %s\n", #success, #log); \
 #define STR(x) STR_IMP(x)
 #define MNN_VERSION_MAJOR 3
 #define MNN_VERSION_MINOR 0
-#define MNN_VERSION_PATCH 2
+#define MNN_VERSION_PATCH 3
 #define MNN_VERSION STR(MNN_VERSION_MAJOR) "." STR(MNN_VERSION_MINOR) "." STR(MNN_VERSION_PATCH)
 #endif /* MNNDefine_h */
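The patch bump propagates through the STR stringification macros into MNN_VERSION. A minimal check, as an editor's sketch assuming the MNN headers are on the include path:

```cpp
// Editor's sketch, not part of the commit: MNN_VERSION now expands to "3.0.3".
#include <cstdio>
#include <MNN/MNNDefine.h>

int main() {
    printf("MNN %d.%d.%d (MNN_VERSION = %s)\n",
           MNN_VERSION_MAJOR, MNN_VERSION_MINOR, MNN_VERSION_PATCH, MNN_VERSION);
    return 0;
}
```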

include/MNN/expr/Executor.hpp

Lines changed: 1 addition & 0 deletions
@@ -125,6 +125,7 @@ class MNN_PUBLIC Executor {
         friend class Executor;
         void setMode(Interpreter::SessionMode mode);
         void setHint(Interpreter::HintMode mode, int value);
+        void setHintPtr(Interpreter::HintMode mode, void* value);
         bool getInfo(Interpreter::SessionInfoCode code, void* ptr);
         BackendConfig* getBnConfig();
         const RuntimeAttr* getInside() const {

project/android/build_32.sh

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ cmake ../../../ \
 -DMNN_USE_LOGCAT=false \
 -DMNN_USE_SSE=OFF \
 -DMNN_BUILD_TEST=ON \
+-DMNN_ARM82=OFF \
 -DMNN_BUILD_FOR_ANDROID_COMMAND=true \
 -DNATIVE_LIBRARY_OUTPUT=. -DNATIVE_INCLUDE_OUTPUT=. $1 $2 $3 $4 $5 $6 $7

pymnn/src/llm.h

Lines changed: 23 additions & 2 deletions
@@ -52,6 +52,25 @@ static PyObject* PyMNNLLM_generate(LLM *self, PyObject *args) {
     return toPyObj<int, toPyObj>(output_ids);
 }

+static PyObject* PyMNNLLM_eraseHistory(LLM *self, PyObject *args) {
+    if (self->is_embedding) {
+        Py_RETURN_NONE;
+    }
+    size_t history = 0;
+    size_t end = 0;
+    if (!PyArg_ParseTuple(args, "LL", &history, &end)) {
+        Py_RETURN_NONE;
+    }
+    self->llm->eraseHistory(history, end);
+    Py_RETURN_NONE;
+}
+static PyObject* PyMNNLLM_getCurrentHistory(LLM *self, PyObject *args) {
+    if (self->is_embedding) {
+        Py_RETURN_NONE;
+    }
+    auto history = self->llm->getCurrentHistory();
+    return PyLong_FromLong(history);
+}
 static PyObject* PyMNNLLM_response(LLM *self, PyObject *args) {
     if (self->is_embedding) {
         Py_RETURN_NONE;
@@ -62,8 +81,8 @@ static PyObject* PyMNNLLM_response(LLM *self, PyObject *args) {
         Py_RETURN_NONE;
     }
     std::ostringstream null_os;
-    auto res = self->llm->response(query, stream ? &std::cout : &null_os);
-    return string2Object(res);
+    self->llm->response(query, stream ? &std::cout : &null_os);
+    return string2Object(null_os.str());
 }

 static PyObject* PyMNNLLM_tokenizer_encode(LLM *self, PyObject *args) {
@@ -109,6 +128,8 @@ static PyMethodDef PyMNNLLM_methods[] = {
     {"forward", (PyCFunction)PyMNNLLM_forward, METH_VARARGS, "forward `logits` by `input_ids`."},
     {"generate", (PyCFunction)PyMNNLLM_generate, METH_VARARGS, "generate `output_ids` by `input_ids`."},
    {"response", (PyCFunction)PyMNNLLM_response, METH_VARARGS, "response `query` without hsitory."},
+    {"get_current_history", (PyCFunction)PyMNNLLM_getCurrentHistory, METH_VARARGS, "Get Current History."},
+    {"erase_history", (PyCFunction)PyMNNLLM_eraseHistory, METH_VARARGS, "Erase History."},
     {"tokenizer_encode", (PyCFunction)PyMNNLLM_tokenizer_encode, METH_VARARGS, "tokenizer encode."},
     {"tokenizer_decode", (PyCFunction)PyMNNLLM_tokenizer_decode, METH_VARARGS, "tokenizer decode."},
     {"txt_embedding", (PyCFunction)PyMNNLLM_txt_embedding, METH_VARARGS, "txt embedding."},

source/backend/cpu/CPUAttention.cpp

Lines changed: 8 additions & 13 deletions
@@ -177,7 +177,7 @@ ErrorCode CPUAttention::onResize(const std::vector<Tensor*>& inputs, const std::
         backend()->onAcquireBuffer(mPackQ.get(), Backend::DYNAMIC);
         backend()->onAcquireBuffer(mPackQKV.get(), Backend::DYNAMIC);
         backend()->onReleaseBuffer(mPackQ.get(), Backend::DYNAMIC);
-        backend()->onReleaseBuffer(mPackQKV.get(), Backend::DYNAMIC);
+        backend()->onReleaseBuffer(mPackQKV.get(), Backend::DYNAMIC);
     }
     return NO_ERROR;
 }
@@ -193,9 +193,6 @@ ErrorCode CPUAttention::onExecute(const std::vector<Tensor*>& inputs, const std:
     int mask_kvlen = mask->length(3);
     int seq_len = query->length(1);
     MNN_ASSERT(seq_len == mask_seqlen);
-    mIsPrefill = (seq_len > 1);
-    // isPrefill and mask is Square Matrix, is FirstPrefill
-    mIsFirstPrefill = mIsPrefill && (mask_kvlen == mask_seqlen);
     int tileCount = UP_DIV(mNumHead, mThreadNum);
     int group_size = mNumHead / mKvNumHead;
     // reduce the value of 'query' to avoid fp16 overflow
@@ -215,15 +212,12 @@ ErrorCode CPUAttention::onExecute(const std::vector<Tensor*>& inputs, const std:
         mScale /= q_scale;
     }

-    if (mIsPrefill) {
-        if (mIsFirstPrefill) {
-            mKVCacheManager->onClear();
-            mKVCacheManager->onAlloc(seq_len);
-        } else {
-            mKVCacheManager->onRealloc(mKVCacheManager->kvLength() + seq_len);
-        }
-    } else { // Decode
-        mKVCacheManager->onRealloc(mKVCacheManager->kvLength() + 1);
+    if (mMeta->previous == mMeta->remove) {
+        mKVCacheManager->onClear();
+        mKVCacheManager->onAlloc(mMeta->add);
+    } else {
+        MNN_ASSERT(mMeta->previous == mKVCacheManager->kvLength());
+        mKVCacheManager->onRealloc(mMeta);
     }
     // Add the new kv to the kvcache
     mKVCacheManager->onPushBack(key, value);
@@ -383,6 +377,7 @@ bool CPUAttention::onClone(Backend* bn, const Op* op, Execution** dst) {
 }

 CPUAttention::CPUAttention(Backend *backend, bool kv_cache) : Execution(backend), mKVCache(kv_cache) {
+    mMeta = (KVMeta*)(backend->getRuntime()->pMeta);
     if (mKVCache) {
         mPackQ.reset(Tensor::createDevice<float>({1, 1, 1, 1}));
         mPackQKV.reset(Tensor::createDevice<float>({1, 1, 1, 1}));
source/backend/cpu/CPUAttention.hpp

Lines changed: 2 additions & 2 deletions
@@ -13,6 +13,7 @@

 #include <functional>
 #include "core/Execution.hpp"
+#include "core/OpCommonUtils.hpp"
 #include "MNN/ErrorCode.hpp"
 #include "KVCacheManager.hpp"

@@ -26,8 +27,6 @@ class CPUAttention : public Execution {
     virtual ErrorCode onExecute(const std::vector<Tensor *> &inputs, const std::vector<Tensor *> &outputs) override;
     virtual bool onClone(Backend* bn, const Op* op, Execution** dst) override;
 private:
-    bool mIsPrefill = true;
-    bool mIsFirstPrefill = true;
     bool mKVCache = true;
     bool mUseGemmInt8 = false;
     int bytes = 4;
@@ -40,6 +39,7 @@ class CPUAttention : public Execution {
     std::vector<float> mMinQ, mMaxQ, mQueryScale, mQueryZeroPoint;
     template <typename T> void pack_query(Tensor* query, char* pack_q, char* sum_q, int seq_len, int h, float q_scale);
     template <typename T> void unpack_QK(float * unpack_qk_dst, char * pack_qk_src, int seq_len, int kv_seq_len);
+    KVMeta* mMeta;
 };

 } // namespace MNN
