[WIP] 8-bit quantization for inference #771
mjdenkowski merged 41 commits into awslabs:sockeye_2_heafield_quantize from
Conversation
Works with this quantization program (TODO integrate):
```python
import mxnet as mx

model = mx.nd.load("/home/ubuntu/idid-enus/model.amt.sf-concat/params.best")
# Dense layer names: strip the ".weight" suffix; skip source embeddings and
# positional embeddings, which are not quantized.
dense = [k[:-len(".weight")] for k in model.keys()
         if k.endswith(".weight") and not k.startswith("embedding_source.")]
dense.remove("encoder.pos_embedding")
dense.remove("decoder.pos_embedding")
for param in dense:
    name = param + ".weight"
    b = model[name]
    b_max = mx.nd.contrib.intgemm_maxabsolute(b)
    # The disk format just quantizes.
    b_prepared = mx.nd.contrib.intgemm_prepare_data(b, b_max)
    model[name] = b_prepared
    model[param + ".scaling"] = b_max / 127.0
mx.nd.save("/home/ubuntu/idid-enus/model.amt.sf-concat.quant/params.best", model)
```
But it doesn't check that all of the listed parameters are actually present in the provided model.
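The symmetric scaling the script relies on (store int8 weights plus a per-tensor `.scaling` factor of max|b| / 127) can be sketched in plain Python without MXNet; `quantize_int8` and `dequantize` are hypothetical helper names, not part of the PR:

```python
def quantize_int8(values):
    """Symmetric int8 quantization: scale so that max|v| maps to 127."""
    b_max = max(abs(v) for v in values)
    scale = b_max / 127.0          # saved alongside weights as ".scaling"
    return [round(v / scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate float weights; per-element error <= scale/2."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.0]
q, s = quantize_int8(weights)
print(q)  # [50, -127, 1, 100]
```

The largest-magnitude weight always maps to ±127, which is why the scaling factor alone is enough to dequantize.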
Updated:
You'll also need to
fhieber
left a comment
Really looking forward to the corresponding mxnet change to get this merged!
Left a few comments, mostly minor style comments.
I think it would be nice to test int8 quantization in the system tests. This would entail quantizing the model in the test suite and adding another decoding pass, which lets you assert on output similarity and/or BLEU. It would also clarify the workflow with int8 quantization.
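One way such a system-test assertion could look (hypothetical helper, not Sockeye's actual test harness; it scores token-level overlap between the float32 and int8 decodes, whereas a real test might compare BLEU instead):

```python
def output_similarity(ref_lines, hyp_lines):
    """Average per-line fraction of token types shared by two decodes."""
    scores = []
    for ref, hyp in zip(ref_lines, hyp_lines):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        overlap = len(set(ref_tok) & set(hyp_tok))
        scores.append(overlap / max(len(ref_tok), len(hyp_tok), 1))
    return sum(scores) / len(scores)

float32_out = ["das ist ein test", "hallo welt"]
int8_out = ["das ist ein test", "hallo schoene welt"]
assert output_similarity(float32_out, int8_out) > 0.8
```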
```python
class QuantizableDense(mx.gluon.HybridBlock):
```
Couldn't you inherit from mx.gluon.nn.basic_layers.Dense directly and only overwrite cast() and hybrid_forward?
I agree. I tried to do this; it will need a consultation with a Gluon expert.
I guess we need to carefully set the prefix for the inheriting class to make sure the parameter names match.
```python
model.cast(model_config.dtype)

if quantizing:
    logger.info("Model dtype: quantizing from float32 to int8")
```
We could potentially quantize from FP16, right? Or is everything on disk FP32?
There isn't a kernel to quantize from FP16 to INT8. CPUs aren't so great at FP16 anyway; they only have instructions to convert to/from FP32 then do all the math in FP32.
So this means that being able to quantize to int8 for inference requires having trained an FP32 model?
Do you have stable training in FP16? I guess I could add a code path to convert FP16 -> FP32 -> int8 which, sadly, is how the CPU would do it anyway.
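A minimal sketch of that FP16 -> FP32 -> int8 path in plain Python (the `'e'` format in the stdlib `struct` module decodes IEEE half-precision; the helper name and byte layout are illustrative, not MXNet's):

```python
import struct

def fp16_bytes_to_floats(raw):
    """Widen IEEE half-precision bytes to Python floats (the FP32 step)."""
    return list(struct.unpack('<%de' % (len(raw) // 2), raw))

# Hypothetical fp16 weights as they might sit on disk.
raw = struct.pack('<3e', 0.5, -1.5, 0.25)
floats = fp16_bytes_to_floats(raw)
# ... then quantize `floats` to int8 exactly as in the existing fp32 path.
```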
Co-Authored-By: Felix Hieber <fhieber@users.noreply.github.com>
Now supports three disk formats:

Adding scaling factors (transition 1 -> 2):

```python
import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='float32', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
# Warning: do not use the loaded model for inference. Load from disk.
```

Adding scaling factors and quantizing (transition 1 -> 3):

```python
import sockeye.model
model = sockeye.model.load_model("model", for_disk_saving='int8', dtype='int8')
model[0].save_parameters("model.annotated/params.best")
model[0].save_config("model.annotated")
# Warning: do not use the loaded model for inference. Load from disk.
```

In both cases you'll need the
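Reading off the transitions, the three formats appear to be (1) float32 without scaling factors, (2) float32 with scaling factors, and (3) int8 with scaling factors. A hypothetical dispatcher over a parameter dict (names and numbering are my guess at the convention, not Sockeye's API):

```python
def classify_disk_format(params, dtype):
    """Guess which of the three disk formats a checkpoint uses, based on
    whether any parameter has a ".scaling" companion and the config dtype."""
    has_scaling = any(name.endswith(".scaling") for name in params)
    if not has_scaling:
        return 1                      # plain float32, no scaling factors
    return 3 if dtype == "int8" else 2

assert classify_disk_format({"decoder.ff.weight": None}, "float32") == 1
```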
mjdenkowski
left a comment
Approved for merge into an intermediate branch for final cleanup.
This pull request adds wrappers to the intgemm matrix multiplication library: https://github.com/kpu/intgemm . A performance comparison with DNNL (aka MKL-DNN) is at kpu/intgemm#59 . The library targets the thin matrix sizes seen in neural machine translation inference and was part of the top submission to the 2018 Workshop on Neural Generation and Translation efficiency task: https://neural.mt/papers/edinburgh/wnmt_marian_paper.pdf . The purpose of this issue is to add similar functionality to Sockeye: awslabs/sockeye#771 . Quantized Sockeye performance is 2.95x as fast.

One problem with the current MXQuantizeSymbol approach is that Sockeye does not have a static graph for everything. intgemm uses a custom memory layout for the weight matrix to make more memory accesses consecutive, so there are operators to convert weights to that format. The idea is that weights are typically loaded once for inference.

On architectures without VNNI, intgemm uses saturating 16-bit accumulation. This avoids an expensive madd_epi16 instruction on every multiply by exploiting the fact that most neural network parameters are near 0.

Because x86 only offers an unsigned * signed instruction and most people want signed * signed, there are two strategies one can take:

1. Add 128 to the data so it's unsigned. But that biases the output. DNNL calculates this bias on the fly by summing weights, then subtracts it out during GEMM. intgemm calculates this bias in advance, which can then be subtracted from the bias term with no overhead at runtime. A problem with this strategy is that it makes the accumulator bigger, requiring more upcasting with an expensive madd_epi16 instruction.
2. Emulate signed * signed by normalizing the sign bit into the second argument. This requires extra instructions in the hot loop but keeps the accumulator small, so it's less necessary to accumulate into 32-bit integers and madd_epi16 can be avoided.

Both intgemm and DNNL implement strategy 1; intgemm also implements strategy 2.
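Strategy 1's bias trick can be checked in plain Python (illustrative scalar code; the real kernels do this with SIMD registers and saturating intrinsics):

```python
def signed_dot_via_unsigned(a_row, b_cols):
    """Compute signed*signed dot products using only unsigned*signed
    multiplies: shift activations by +128, then subtract the precomputed
    correction 128 * sum(weights) per output column."""
    shifted = [x + 128 for x in a_row]               # int8 -> [0, 255]
    raw = [sum(x * w for x, w in zip(shifted, col)) for col in b_cols]
    bias = [128 * sum(col) for col in b_cols]        # computable offline
    return [r - b for r, b in zip(raw, bias)]

a = [3, -2]                        # signed int8 activations
b = [[1, 2], [-4, 5]]              # weight columns
direct = [sum(x * w for x, w in zip(a, col)) for col in b]
assert signed_dot_via_unsigned(a, b) == direct     # [-1, -22]
```

Because the correction depends only on the weights, it can be folded into the layer's existing bias term once at load time, which is the "no overhead at runtime" claim above.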
Similar to DNNL, intgemm has runtime CPUID selection among backends for SSSE3, AVX2, AVX512BW, and AVX512VNNI.
Add support for 8-bit quantized matrix multiplication in inference.
This code depends on the intgemm branch in my fork of MXNet: https://github.com/kpuatamazon/incubator-mxnet/tree/intgemm . This will turn into a pull request against MXNet.
Quantized inference on one thread runs 2.95x as fast as the baseline on one thread, and 1.28x as fast as the baseline on four threads. Results measured on an AWS c5.12xlarge.
BLEU: 42.6 quantized, 42.5 baseline float32. No significant change.
Note that the on-disk format of the int8 file is dependent on the CPU architecture. A fix for this is pending a change to intgemm to separate the quantization and rearrangement steps.
The model is converted to 8-bit offline using a program; the script above converts a model from fp32 to int8. You should also change the config file's dtype to int8. I'm soliciting suggestions on how to do this cleanly, probably via another command-line program.

Pull Request Checklist

- [ ] Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]' until you can check this box)
- [ ] Unit tests pass (`pytest`)
- [ ] System tests pass (`pytest test/system`)
- [ ] Passed code style checking (`./style-check.sh`)
- [ ] Updated major/minor version in `sockeye/__init__.py`. Major version bump if this is a backwards incompatible change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.