
[Document] Update Mac deployment instructions #899

Merged
merged 9 commits into THUDM:dev on May 3, 2023

Conversation

yfyang86
Contributor

@yfyang86 yfyang86 commented May 3, 2023

Update Mac deployment instructions

  • Type: Document
  • FILES: README.md; README_en.md
  • Keywords: OPENMP; MPS

Details of the update

Using the chatglm-6b-int4 quantized model as an example, the following is configured:

  • steps to install libomp;
  • gcc compiler options for the quantized model, enabling OMP-accelerated inference;
  • an explanation of why enabling MPS for the quantized model fails.

Enabling OMP on macOS involves modifications to quantization.py in https://huggingface.co/THUDM/chatglm-6b-int4; since some dependencies have to be installed manually, those changes are not committed separately but are described directly in the instructions.
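
A sketch of the OMP-enabled compile step for the quantized kernels (the flags mirror the gcc command that appears in the log further below; the source and output paths are placeholders, and libomp installed via Homebrew is assumed):

# Illustration only: compile the parallel quantization kernel with OpenMP on macOS.
# Flags follow the compile command shown later in this thread; paths are placeholders.
import subprocess

src = "quantization_kernels_parallel.c"   # placeholder path
out = "quantization_kernels_parallel.so"  # placeholder path
subprocess.run(
    ["gcc", "-O3", "-fPIC", "-Xclang", "-fopenmp", "-pthread", "-lomp",
     "-std=c99", src, "-shared", "-o", out],
    check=True,
)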

Verified environments:

Mac M1 Ultra 128GB
Mac OS: 13.3.1
GCC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
conda 23.3.1
torch (two versions, with MPS)

  • '2.0.0';
  • '2.1.0.dev20230502'

hiyouga and others added 9 commits April 29, 2023 22:50

[Document] Update Mac deployment
- FILES: README.md; README_en.md
- ADD: OPENMP; MPS

# Details

Using the [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) quantized model as an example, configure the following:

- steps to install libomp;
- gcc compiler options for the quantized model;
- an explanation of enabling MPS for the quantized model;
- shortened text.
@duzx16
Member

duzx16 commented May 3, 2023

My system is also macOS 13.3.1, and MPS computation in half precision works without issues for me. What error do you get when you compute in half precision?
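
For reference, a minimal sketch of the half-precision MPS path being discussed (assuming the non-quantized THUDM/chatglm-6b checkpoint; the loading pattern mirrors web_demo.py):

# Sketch: load the non-quantized model in half precision on the MPS backend.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().to("mps")
model = model.eval()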

@yfyang86
Contributor Author

yfyang86 commented May 3, 2023

  1. In some cases half() needs to be changed to float(); this has already been covered in several issues;
  2. Loading the quantized model and calling to("mps") does not work
# eg: web_demo.py
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
model = model.eval()

You also replied in the non-quantized issue-462 (the 6B model is fine; only chatglm-6b-int4 has the problem). The cause is that quantization_code (a bz2-compressed ELF/.so file) is NVIDIA-specific, so MPS currently does not work. Enabling it would likely require substantial changes to the quantization code.

error log

--- Logging error ---
Traceback (most recent call last):
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 19, in
from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/__init__.py", line 1, in
from . import library
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/__init__.py", line 1, in
from . import nvrtc
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/nvrtc.py", line 5, in
nvrtc = Lib("nvrtc")
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/cpm_kernels/library/base.py", line 59, in __init__
raise RuntimeError("Unknown platform: %s" % sys.platform)
RuntimeError: Unknown platform: darwin

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/yifanyang/miniconda3/lib/python3.10/logging/init.py", line 1100, in emit
msg = self.format(record)
File "/Users/yifanyang/miniconda3/lib/python3.10/logging/init.py", line 943, in format
return fmt.format(record)
File "/Users/yifanyang/miniconda3/lib/python3.10/logging/init.py", line 678, in format
record.message = record.getMessage()
File "/Users/yifanyang/miniconda3/lib/python3.10/logging/init.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 6, in
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float().to('mps')
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/models/auto/auto_factory.py", line 463, in from_pretrained
return model_class.from_pretrained(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/transformers-4.29.0.dev0-py3.10.egg/transformers/modeling_utils.py", line 2637, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1061, in init
self.quantize(self.config.quantization_bit, self.config.quantization_embeddings, use_quantization_cache=True, empty_init=True)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1424, in quantize
from .quantization import quantize, QuantizedEmbedding, QuantizedLinear, load_cpu_kernel
File "", line 1027, in _find_and_load
File "", line 1006, in _find_and_load_unlocked
File "", line 688, in _load_unlocked
File "", line 883, in exec_module
File "", line 241, in _call_with_frames_removed
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 46, in
logger.warning("Failed to load cpm_kernels:", exception)
Message: 'Failed to load cpm_kernels:'
Arguments: (RuntimeError('Unknown platform: darwin'),)
No compiled kernel found.
Compiling kernels : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c
Compiling gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.c -shared -o /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Load kernel : /Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization_kernels_parallel.so
Setting CPU quantization kernel threads to 10
Using quantization cache
Applying quantization to glm layers
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Traceback (most recent call last):
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict
output = await app.get_blocks().process_api(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api
result = await self.call_function(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
return next(iterator)
File "/Users/yifanyang/Git/ChatGLM-6B/web_demo.py", line 61, in predict
for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1311, in stream_chat
for outputs in self.stream_generate(**inputs, **gen_kwargs):
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1388, in stream_generate
outputs = self(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 1190, in forward
transformer_outputs = self.transformer(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 996, in forward
layer_ret = layer(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 627, in forward
attention_outputs = self.attention(
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/modeling_chatglm.py", line 445, in forward
mixed_raw_layer = self.query_key_value(hidden_states)
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 391, in forward
output = W8A16Linear.apply(input, self.weight, self.weight_scale, self.weight_bit_width)
File "/Users/yifanyang/miniconda3/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 56, in forward
weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
File "/Users/yifanyang/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4/quantization.py", line 274, in extract_weight_to_half
func = kernels.int4WeightExtractionHalf
AttributeError: 'NoneType' object has no attribute 'int4WeightExtractionHalf'

@duzx16 duzx16 changed the base branch from main to dev May 3, 2023 11:25
@duzx16
Member

duzx16 commented May 3, 2023


The need to change .half() to .float() used to be caused by a buggy baddbmm implementation in PyTorch's MPS backend; it should be fixed by now. Can you still reproduce the problem?

@duzx16 duzx16 merged commit ba8daf4 into THUDM:dev May 3, 2023
@yfyang86
Contributor Author

yfyang86 commented May 3, 2023


With the latest (2023/05) pytorch-nightly this problem does not occur. Detailed version numbers and test results are below:

| torch version | status |
| --- | --- |
| 2.1.0.dev20230502 | ✓ half(), ✓ float() |
| 2.0.0 | ✗ half(), ✓ float() |

With torch==2.0.0 (the anaconda mirrors widely used in mainland China do not sync pytorch-nightly, so a careless install can land on this version and trigger the problem), there is an MPS bug.

Python 3.10.10 (main, Mar 21 2023, 13:41:05) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch 
>>> torch.__version__
'2.0.0'
>>> torch.backends.mps.is_available()
True

---------------- error logs ----------------

loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/97f6331a-ba75-11ed-a4bc-863efbbaf80d/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":228:0)): error: input types 'tensor<1x5x1xf16>' and 'tensor<1xf32>' are not broadcast compatible

@zhaozhiming

It runs but is very slow. How can I fix this? (Mac M1)

@yfyang86
Contributor Author


Only the CPU is being used

As noted above in this thread, the earlier comments explain why calling the quantized model on MPS is problematic.

Not enough memory

You may want to check memory (unified/GPU memory). My M1 machines have 64GB (MacBook Pro M1 Max) and 128GB (Mac Studio); memory usage is observed to be fairly high, but as long as the (context) token count is not too large it is not a big problem.

Monitor memory usage while it runs, for example:

 while :; do clear; top -l 1 | grep "python" | awk '{print "MEM="$9 "\tRPRVT="$10}'; sleep 2; done

Replace python in the command with a keyword from whatever command you launch in bash; press Ctrl+C to stop. This shows how much memory is actually being used.
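
If top is inconvenient, a rough Python alternative (assumes psutil is installed; adjust the name filter to match your launch command):

# Sketch: print the resident memory of python processes every 2 seconds (Ctrl+C to stop).
import time
import psutil

while True:
    for p in psutil.process_iter(["name", "memory_info"]):
        if "python" in (p.info["name"] or "").lower():
            rss_gb = p.info["memory_info"].rss / 1024**3
            print(f"pid={p.pid} RSS={rss_gb:.2f} GB")
    time.sleep(2)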

As for the details, more logs would be needed. Running inference on a Mac is only a feasible option, nothing more.

In summary, possible causes:

  • you can experiment with int8/int4 quantized models, but on a Mac these quantized models can currently only run on the CPU, which is naturally slow (see the sketch after this list);
  • insufficient memory, leading to frequent swapping.
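
One quick way to confirm the first point is to check where the model's weights actually end up. A minimal sketch, using the same int4 checkpoint as above (illustration only):

# Sketch: verify that the quantized model's weights live on the CPU.
from transformers import AutoModel

model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float()
print({p.device.type for p in model.parameters()})  # {'cpu'} means inference runs on the CPU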
