Conversation
Wow, the speed and memory improvements are insane. Especially saw what you did with GPT-J!
@regstuff For example, I observe that GPT-2 (a smaller model) suffers more. Overall, the intuition is that the larger the model is, the more resilient to quantisation it will be. I think..
Sorry in advance because I'm pretty much out of my depth here, but I'm trying things, so feel free to dismiss me as a noob :) I played a little bit with the wasm version. Audio length: 196.9 sec. The quality of the transcription is way lower. Is the choice to use 4-bit quantization instead of 8-bit driven by something specific? Is a higher resolution in the quantization related in any way to the performance of the algorithm?
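To give the 4-bit vs. 8-bit question above some intuition: fewer bits per weight means a coarser quantization grid, so the round-trip error grows as the bit width shrinks. Below is a minimal, hypothetical sketch of symmetric per-block quantization in the spirit of ggml's Q4_0/Q8_0 formats (this is not the actual ggml kernel code; block size, clipping, and scale handling are simplified assumptions):

```python
import numpy as np

def block_quantize(weights, bits, block=32):
    """Quantize and immediately dequantize `weights` with a per-block scale.

    Simplified sketch: each block of 32 values shares one float scale,
    and values are rounded to signed `bits`-bit integers.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    out = np.empty_like(weights)
    for i in range(0, len(weights), block):
        blk = weights[i:i + block]
        amax = np.abs(blk).max()
        scale = amax / qmax if amax > 0 else 1.0
        q = np.round(blk / scale).clip(-qmax - 1, qmax)  # signed integer grid
        out[i:i + block] = q * scale                      # dequantize
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (4, 8):
    err = np.abs(w - block_quantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

The 4-bit error comes out noticeably larger than the 8-bit error, which matches the observation that transcription quality drops more with 4-bit models; the trade-off is that 4-bit halves the memory again relative to 8-bit.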
As a small note to this PR: my tests on this branch on Neoverse V1 CPUs (with the correct flags set at compilation) have shown a dramatic drop in performance for the medium model. With medium.en-q4_0:
Hi @ggerganov, this whisper/4-bit branch doesn't work with quantized models from ggml/master.
* whisper : add integer quantization support
* examples : add common-ggml + prepare to add "quantize" tool
* whisper : quantization tool ready
* whisper : fix F32 support
* whisper : try to fix shared lib linkage
* wasm : update quantized models to Q5
* bench.wasm : remove "medium" button
* bench.wasm : fix custom model button
* ggml : add Q5_0 and Q5_1 WASM SIMD
* wasm : add quantized models to all WASM examples
* wasm : bump DB version number to 2
* talk-llama : update example to latest llama.cpp
* node : increase test timeout to 10s
* readme : add information for model quantization
* wasm : add links to other examples
Supported quantization types: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0. Adds a quantize tool for model quantization and updates talk-llama with the latest llama.cpp. Usage:
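The usage block itself is not preserved above, but based on the quantize tool introduced in this PR, a typical invocation looks roughly like the following (the model filenames are illustrative; the last argument selects the quantization type):

```shell
# build the quantize tool from the repo root
make quantize

# produce a q5_0-quantized copy of a ggml Whisper model
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
```

The quantized output file can then be passed to the whisper.cpp examples in place of the original F16 model.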