Mistral 7B
Mistral 7B is a 7.3B parameter model that:
- Outperforms Llama 2 13B on all benchmarks
- Outperforms Llama 1 34B on many benchmarks
- Approaches CodeLlama 7B performance on code, while remaining good at English tasks
- Uses Grouped-query attention (GQA) for faster inference (see the sketch below)
- Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost
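For intuition, here is a minimal, self-contained PyTorch sketch of grouped-query attention, in which several query heads share a single key/value head so the key/value projections (and the KV cache) shrink. The dimensions, weights, and head counts are illustrative placeholders, not Mistral 7B's actual configuration.

```python
# Grouped-query attention (GQA) sketch: n_heads query heads share
# n_kv_heads key/value heads (n_kv_heads < n_heads), reducing KV-cache
# size and speeding up inference. Toy dimensions, not Mistral 7B's.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    batch, seq_len, d_model = x.shape
    head_dim = d_model // n_heads

    q = (x @ wq).view(batch, seq_len, n_heads, head_dim)
    k = (x @ wk).view(batch, seq_len, n_kv_heads, head_dim)
    v = (x @ wv).view(batch, seq_len, n_kv_heads, head_dim)

    # Each group of n_heads // n_kv_heads query heads reuses the same K/V head.
    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)

    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (batch, heads, seq, head_dim)
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    return out.transpose(1, 2).reshape(batch, seq_len, d_model)

# Example: 8 query heads sharing 2 key/value heads.
x = torch.randn(1, 16, 64)
wq, wk, wv = torch.randn(64, 64), torch.randn(64, 16), torch.randn(64, 16)
print(grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2).shape)
# torch.Size([1, 16, 64])
```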
Performance in details
Mistral 7B significantly outperforms Llama 2 13B on all metrics and is on par with Llama 1 34B. It is also vastly superior on code and reasoning benchmarks.
Flash and Furious: Attention drift
Mistral 7B uses a sliding window attention (SWA) mechanism, in which each token attends to a fixed window of past tokens rather than the full sequence. In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k. Because the attention span is fixed, the inference cache can be limited to a rolling buffer of window size, which saves half of the cache memory at a sequence length of 8192 without impacting model quality.
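As a rough illustration of the mechanism (not Mistral's actual implementation), the sketch below builds a sliding-window causal mask in PyTorch: token i may attend only to itself and the window - 1 tokens before it, so per-token compute and cache size stay bounded by the window while stacked layers still propagate information further back. The sequence and window lengths are toy values.

```python
# Sliding-window causal mask: position i attends to positions j with
# i - window < j <= i. Toy sizes for readability.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]      # i - j for every (i, j) pair
    return (rel >= 0) & (rel < window)     # causal AND within the window

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.int())
# Each row has at most 4 ones, so a rolling KV cache of `window` slots
# (slot = position % window) is enough: older entries can be overwritten.
```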
Fine-tuning Mistral 7B for chat
Mistral 7B can be fine-tuned on instruction datasets publicly available on HuggingFace. The resulting model, Mistral 7B Instruct, outperforms all 7B models on MT-Bench, and is comparable to 13B chat models.
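Below is a minimal sketch of that kind of instruction fine-tuning with the Hugging Face stack. The dataset (databricks/databricks-dolly-15k), the exact prompt formatting, and all hyperparameters are illustrative assumptions, not the recipe used to produce Mistral 7B Instruct.

```python
# Instruction fine-tuning sketch (assumed dataset and hyperparameters).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any public instruction dataset works; Dolly 15k is used purely as an example.
raw = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    # Wrap the user turn in [INST] ... [/INST] and close with EOS.
    text = f"[INST] {example['instruction']} [/INST] {example['response']}</s>"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = raw.map(tokenize, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mistral-7b-sft",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1,
                           bf16=True,               # assumes bf16-capable GPUs
                           logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice one would often add a parameter-efficient method such as LoRA to fit a 7B model on modest hardware, but that is orthogonal to the sketch above.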