Benchmarking xAI’s Grok-1
March 26, 2024
By Adil Asif, Matthew Choi, Mark Coatsworth, Jacob Junqi Tian, and John Willes
Grok-1 was open-sourced by xAI on March 17, 2024. To date, it is the largest language model ever made publicly available, weighing in at a whopping 314B parameters. Grok-1 implements the increasingly popular Mixture of Experts (MoE) architecture that is speculated to underpin Google and OpenAI’s largest models, Gemini and GPT-4.
Does Grok-1 represent a new state of the art in open-source? The model was first announced in a blog post in November 2023, but its weights were quietly released last week alongside a barebones inference implementation. To find out, members of Vector’s AI Engineering team benchmarked the model against leading alternatives while looking at the considerations for its responsible use.
Let’s first take a look at Grok-1 under the hood.
Grok-1 is a 314B parameter MoE model with a maximum context length of 8192 tokens. This places Grok-1 at ~4.5 times the size of Llama-2-70B and ~6.7 times the size of Mixtral 8x7B. xAI has not yet released any information detailing the data or hardware that was used to train Grok-1, although they claim that it was trained in 2 months using a custom Jax, Rust, and Kubernetes-based training stack.
The MoE architecture implements eight experts, two of which are used per token. This means that only ~25% of the parameters (~79B) are active for any given token. MoE models replace dense feed-forward layers with sparse MoE layers and a routing mechanism, which allows them to achieve significantly faster pre-training and inference than dense models with similar parameter counts.
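The routing described above can be illustrated with a minimal sketch. This is a toy top-2 MoE layer in NumPy, not Grok-1’s actual implementation: each token’s router logits pick two of eight experts, the two selected logits are softmaxed into mixing weights, and only those two experts’ feed-forward transforms are applied. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def top2_moe_layer(x, expert_weights, router_weights):
    """Toy top-2 Mixture-of-Experts layer (illustrative, not Grok-1's code).

    x:              (tokens, d_model) input activations
    expert_weights: (n_experts, d_model, d_model) one linear map per expert
    router_weights: (d_model, n_experts) router projection
    """
    logits = x @ router_weights                         # (tokens, n_experts)
    top2 = np.argsort(logits, axis=-1)[:, -2:]          # 2 best experts per token
    sel = np.take_along_axis(logits, top2, axis=-1)     # their logits
    gate = np.exp(sel) / np.exp(sel).sum(-1, keepdims=True)  # softmax over the pair
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                         # only 2 of n_experts run
        for k in range(2):                              # per token, hence sparsity
            e = top2[t, k]
            out[t] += gate[t, k] * (x[t] @ expert_weights[e])
    return out
```

Because only two expert transforms execute per token, the per-token FLOPs scale with the active parameter count (~79B here), not the full 314B.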
Grok-1’s weights are the raw pre-training stage output. The model weights have not undergone any fine-tuning or alignment training using methods such as reinforcement learning with human feedback (RLHF). Without these additional training stages, we should not expect Grok-1 to be performant in chat applications.
Grok-1 was released with the permissive Apache 2.0 license, meaning that you can use the model weights for commercial purposes.
To run Grok-1 inference, users need 8x A100 80GB GPUs or hardware with equivalent VRAM. While only two of the eight experts are in use for any given token, and weights in the official release were quantized to reduce the memory footprint, significant VRAM is still required to load all experts into memory.
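A back-of-envelope calculation makes the VRAM requirement concrete. Note that all 314B parameters must be resident in memory even though only two experts fire per token, because routing decisions differ from token to token. The byte-per-parameter figures below are standard rules of thumb, not official xAI numbers:

```python
# Rough VRAM estimate for holding Grok-1's weights (assumptions, not
# official figures): weights alone, ignoring activations and KV cache.
params = 314e9                    # total parameter count
bytes_int8 = params * 1           # 8-bit quantized weights
bytes_bf16 = params * 2           # unquantized bfloat16 weights
a100_vram = 80e9                  # one A100 80GB

print(f"int8 weights: {bytes_int8 / a100_vram:.1f} A100s")  # ~3.9 cards
print(f"bf16 weights: {bytes_bf16 / a100_vram:.1f} A100s")  # ~7.9 cards
```

Even quantized, the weights fill roughly half of an 8x A100 node, and activation memory, KV cache, and communication buffers account for much of the rest, which is consistent with the 8x A100 80GB recommendation.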
Alongside the model weights, xAI released a lightweight inference script that prioritized correctness of the MoE layer implementation over optimization. Expect inference to be slow — even on powerful hardware — until optimized JAX inferencing is released or developed by the open-source community.
To get an early glimpse of Grok-1’s capabilities, we benchmarked it on the Massive Multitask Language Understanding (MMLU) dataset. Due to inference speed constraints, we benchmarked only three MMLU subjects: high school mathematics, college mathematics, and global facts. Using the 5-shot evaluation scheme, we compared Grok-1 to several generations of large models from the past couple of years.
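The 5-shot scheme prepends five solved examples from the subject’s dev split before the test question. The following is a minimal sketch of how such a prompt is typically assembled; the template follows the common MMLU evaluation style, and the dictionary field names are illustrative assumptions:

```python
def build_5shot_prompt(dev_examples, question, choices):
    """Assemble a 5-shot MMLU-style prompt (illustrative template).

    dev_examples: list of dicts with 'question', 'choices' (4 strings),
                  and 'answer' (a letter A-D); field names are assumptions.
    """
    letters = "ABCD"
    parts = []
    for ex in dev_examples[:5]:  # five worked examples with answers
        opts = "\n".join(f"{l}. {c}" for l, c in zip(letters, ex["choices"]))
        parts.append(f"{ex['question']}\n{opts}\nAnswer: {ex['answer']}")
    # the test question ends at "Answer:" so the model completes the letter
    opts = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    parts.append(f"{question}\n{opts}\nAnswer:")
    return "\n\n".join(parts)
```

The model’s completion (or its most probable next token among A–D) is then compared against the gold answer letter to score each question.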
MMLU (5-Shot)
Model (Parameters) | High School Mathematics | College Mathematics | Global Facts |
---|---|---|---|
BLOOM (176B) | 27.0% | 25.0% | – |
OPT (175B) | 24.4% | 33.0% | – |
Llama-2 (70B) | 35.6% | 40.0% | 48.0% |
Mixtral 8x7B (47B) | 38.5% | 46.0% | 51.0% |
Grok-1 (314B) | 39.6% | 41.0% | 44.0% |
For a model with such a significant hardware requirement, the results are underwhelming. Grok-1 outperforms older models but falls short of the current state-of-the-art open-source models on this subset of MMLU.
When evaluating a new model, performance is only one dimension that should be considered. It is equally important to understand whether a model is safe to deploy. Will the model produce toxic, biased, or otherwise unsafe output? Similar to the performance benchmarking, we evaluated Grok-1 on the Challenging subset of the RealToxicityPrompts dataset. We computed the average toxicity score of prompt completions and compared against top models from closed and open-source. Toxicity scores were obtained from the Perspective API.
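The evaluation loop itself is simple: generate a completion per prompt, score each completion with the Perspective API, and average. The sketch below shows the general shape; the request-body fields follow the public Perspective API documentation but should be treated as assumptions, and the scorer is injected as a callable so the averaging logic stays runnable offline:

```python
import statistics

def perspective_request_body(text):
    """Request body for the Perspective API's comments:analyze endpoint.
    Field names follow the public docs; treat them as assumptions."""
    return {"comment": {"text": text},
            "requestedAttributes": {"TOXICITY": {}}}

def average_toxicity(completions, score_fn):
    """Mean toxicity over model completions.

    score_fn wraps the real API call (POST the body above, then read the
    TOXICITY summary score from the response) and returns a float in [0, 1].
    """
    return statistics.mean(score_fn(text) for text in completions)
```

Keeping the API call behind `score_fn` also makes it easy to add retry logic or rate limiting, which the Perspective API’s quotas typically require for large prompt sets.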
Average Toxicity – RealToxicityPrompts – Challenging
Model | Average Toxicity Score |
---|---|
GPT-3.5 | 0.255 |
GPT-4 | 0.222 |
Mixtral 8x7B | 0.378 |
Grok-1 | 0.355 |
Our results suggest that Grok-1 could be significantly more toxic than closed-source models but similar to state-of-the-art open-source models. This level of toxicity may be intentional; xAI advertised Grok as a model with “a bit of wit and a rebellious streak.” An important caveat to remember is that Grok-1 is the raw pre-trained model and does not benefit from secondary training stages which often target toxicity reduction.
As it stands, Grok-1 is the largest open-source AI model ever made available. It is multiple times larger than current state-of-the-art open models such as Llama-2-70B and Mixtral 8x7B. However, early results suggest that Grok-1 falls flat in performance comparisons with the open-source state of the art, and its generated output is significantly more toxic than that of closed-source alternatives.