Benchmarking xAI’s Grok-1

March 26, 2024

Generative AI Research

xAI just released a 314B MoE model called Grok-1. So what?

By Adil Asif, Matthew Choi, Mark Coatsworth, Jacob Junqi Tian, and John Willes

Grok-1 was open-sourced by xAI on March 17, 2024. To date, it is the largest language model ever made publicly available, weighing in at a whopping 314B parameters. Grok-1 implements the increasingly popular Mixture of Experts (MoE) architecture, which is speculated to underpin Google’s and OpenAI’s largest models, Gemini and GPT-4.
Does Grok-1 represent the new state-of-the-art in open-source? The model was first announced in a blog post in November 2023, but the model weights were quietly released last week alongside a barebones inference implementation. To find out, members of Vector’s AI Engineering team benchmarked the model, compared it to leading models, and looked at the considerations for its responsible use.

Let’s first take a look at Grok-1 under the hood.

Grok-1: Explained

Grok-1 is a 314B parameter MoE model with a maximum context length of 8192 tokens. This places Grok-1 at ~4.5 times the size of Llama-2-70B and ~6.7 times the size of Mixtral 8x7B. xAI has not yet released any information detailing the data or hardware that was used to train Grok-1, although they claim that it was trained in 2 months using a custom Jax, Rust, and Kubernetes-based training stack.
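The size ratios quoted above follow directly from the parameter counts. A minimal sketch of the arithmetic, assuming the 47B total for Mixtral 8x7B listed in the benchmark table later in this post:

```python
# Parameter counts in billions, as quoted in this post.
PARAMS_B = {"Grok-1": 314, "Llama-2-70B": 70, "Mixtral 8x7B": 47}


def size_ratio(model: str, baseline: str) -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS_B[model] / PARAMS_B[baseline]


for baseline in ("Llama-2-70B", "Mixtral 8x7B"):
    ratio = size_ratio("Grok-1", baseline)
    print(f"Grok-1 is ~{ratio:.1f}x the size of {baseline}")
```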

The MoE architecture implements eight experts, two of which are used per token. This means that only ~25% (or ~79B) of the parameters are active for any given token. MoE models replace dense feed-forward layers with sparse MoE layers and a routing mechanism. The advantage of MoE models is that both pre-training and inference are significantly faster than for dense models with similar parameter counts.
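To make the routing idea concrete, here is a toy top-2 router in plain Python. It is an illustrative sketch, not xAI’s implementation: the toy experts, the function names, and the choice to apply softmax over only the two selected router scores are all assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 8  # from the post
TOP_K = 2        # from the post


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: normalize over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits):
    """Weighted sum of the selected experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for idx, weight in route(router_logits):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
print("selected experts:", [i for i, _ in route(logits)])
print("layer output:", moe_layer([1.0, 2.0], experts, logits))
print(f"experts evaluated per token: {TOP_K / NUM_EXPERTS:.0%}")
```

Only two of the eight expert functions are ever called for a token, which is where the compute savings come from. All eight still have to exist in memory, because a different pair may be selected for the next token.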

Grok-1’s weights are the raw pre-training stage output. The model weights have not undergone any fine-tuning or alignment training using methods such as reinforcement learning with human feedback (RLHF). Without these additional training stages, we should not expect Grok-1 to be performant in chat applications.

Grok-1 was released with the permissive Apache 2.0 license, meaning that you can use the model weights for commercial purposes.

Grok-1: In Practice

To run Grok-1 inference, users need 8x A100 80GB GPUs or hardware with equivalent VRAM. While only two of the eight experts are in use for any given token, and weights in the official release were quantized to reduce the memory footprint, significant VRAM is still required to load all experts into memory.
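A back-of-the-envelope estimate shows why this much hardware is needed. The sketch below counts memory for the weights alone, under assumed precisions of 2 bytes and 1 byte per parameter. Those precisions are assumptions for illustration, and the estimate ignores activations, the KV cache, and framework overhead.

```python
PARAMS = 314e9    # Grok-1 parameter count, from the post
GPU_VRAM_GB = 80  # one A100 80GB
NUM_GPUS = 8


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_vram_gb = GPU_VRAM_GB * NUM_GPUS

# Assumed precisions; the actual storage format may differ.
for label, nbytes in [("16-bit", 2), ("8-bit", 1)]:
    need = weight_memory_gb(PARAMS, nbytes)
    gpus = need / GPU_VRAM_GB
    print(f"{label}: ~{need:.0f} GB of weights, ~{gpus:.1f} A100-80GB "
          f"equivalents, out of {total_vram_gb} GB available")
```

Even at one byte per parameter, the weights alone exceed the memory of three 80GB GPUs, because every expert must be resident regardless of which two are selected for a given token.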

Alongside the model weights, xAI released a lightweight inference script that prioritized correctness of the MoE layer implementation over optimization. Expect inference to be slow — even on powerful hardware — until optimized JAX inferencing is released or developed by the open-source community.

To get an early glimpse of Grok-1’s capabilities, we benchmarked it on the Massive Multitask Language Understanding (MMLU) dataset. Due to inference speed constraints, we benchmarked only three MMLU subjects: high school mathematics, college mathematics, and global facts. Using the 5-shot evaluation scheme, we compared Grok-1 to a few generations of large models from the past couple of years.
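For readers unfamiliar with 5-shot evaluation: each test question is preceded by five solved examples from the same subject, and the model’s predicted answer letter is compared with the reference. The sketch below shows one common way to build such a prompt and score the results. The prompt template, the helper names, and the toy questions are assumptions for illustration, not the exact harness used for the numbers in this post.

```python
CHOICES = "ABCD"


def format_example(question, options, answer=None):
    """Render one question; include the answer letter for solved examples."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(subject, shots, question, options):
    """Concatenate the solved examples, then the unanswered test question."""
    header = ("The following are multiple choice questions (with answers) "
              f"about {subject}.")
    blocks = [format_example(q, o, a) for q, o, a in shots]
    blocks.append(format_example(question, options))
    return header + "\n\n" + "\n\n".join(blocks)


def accuracy(predictions, references):
    """Fraction of predicted answer letters that match the references."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Five made-up solved examples (hypothetical, not taken from MMLU).
shots = [
    (f"What is {n} + {n}?",
     [str(2 * n), str(2 * n + 1), str(2 * n + 2), str(2 * n + 3)],
     "A")
    for n in range(1, 6)
]
prompt = build_prompt("high school mathematics", shots,
                      "What is 7 + 7?", ["12", "13", "14", "15"])
print(prompt)
print("accuracy:", accuracy(["A", "C", "B"], ["A", "C", "D"]))
```

The prompt ends with a bare "Answer:" so that the model’s next token can be read as its chosen letter.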

MMLU (5-Shot)

Model (Parameters)  | High School Mathematics | College Mathematics | Global Facts
BLOOM (176B)        | 27.0%                   | 25.0%               |
OPT (175B)          | 24.4%                   | 33.0%               |
Llama-2 (70B)       | 35.56%                  | 40%                 | 48%
Mixtral 8x7B (47B)  | 38.5%                   | 46.0%               | 51%
Grok-1 (314B)       | 39.63%                  | 41%                 | 44%

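For a model with such a significant hardware requirement, the results are underwhelming. Grok-1 outperforms older models and narrowly leads on high school mathematics, but it trails Mixtral 8x7B on college mathematics and both Llama-2 and Mixtral 8x7B on global facts. Overall, it falls short of the current state-of-the-art open-source models on this subset of MMLU.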

When evaluating a new model, performance is only one dimension that should be considered. It is equally important to understand whether a model is safe to deploy. Will the model produce toxic, biased, or otherwise unsafe output? As with the performance benchmarking, we evaluate Grok-1 on a subset of a public dataset, in this case the Challenging subset of RealToxicityPrompts. We compute the average toxicity score of prompt completions and compare against top closed-source and open-source models. Toxicity scores are obtained from the Perspective API.
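The metric itself is a simple mean over per-completion scores. The sketch below shows the aggregation with a pluggable scoring function. The keyword-based `toy_score` is a hypothetical stand-in so the example runs offline; in the evaluation described here, the per-completion score comes from the Perspective API instead.

```python
from statistics import mean
from typing import Callable, Iterable


def average_toxicity(completions: Iterable[str],
                     score_fn: Callable[[str], float]) -> float:
    """Mean toxicity over completions; score_fn returns a value in [0, 1]."""
    scores = [score_fn(text) for text in completions]
    if not scores:
        raise ValueError("no completions to score")
    return mean(scores)


def toy_score(text: str) -> float:
    """Hypothetical scorer: fraction of words found in a tiny flag list."""
    flagged = {"idiot", "stupid"}
    words = text.lower().split()
    hits = sum(w.strip(".,!?") in flagged for w in words)
    return hits / max(len(words), 1)


# Made-up completions for illustration only.
completions = ["You are an idiot.", "Have a nice day."]
print(f"average toxicity: {average_toxicity(completions, toy_score):.3f}")
```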

Average Toxicity – RealToxicityPrompts – Challenging 

Model         | Average Toxicity Score
GPT-3.5       | 0.255
GPT-4         | 0.222
Mixtral 8x7B  | 0.378
Grok-1        | 0.355

Our results suggest that Grok-1 could be significantly more toxic than closed-source models but similar to state-of-the-art open-source models. This level of toxicity may be intentional; xAI advertised Grok as a model with “a bit of wit and a rebellious streak.” An important caveat to remember is that Grok-1 is the raw pre-trained model and does not benefit from secondary training stages which often target toxicity reduction.
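To put the comparison in relative terms, the short sketch below divides Grok-1’s average score by each of the other models’ scores from the toxicity table. It only restates the reported numbers as ratios and says nothing about statistical significance.

```python
# Average toxicity scores as reported in this post.
TOXICITY = {
    "GPT-3.5": 0.255,
    "GPT-4": 0.222,
    "Mixtral 8x7B": 0.378,
    "Grok-1": 0.355,
}


def relative_to(model: str, baseline: str) -> float:
    """Ratio of the model's average toxicity to the baseline's."""
    return TOXICITY[model] / TOXICITY[baseline]


for baseline in ("GPT-3.5", "GPT-4", "Mixtral 8x7B"):
    ratio = relative_to("Grok-1", baseline)
    print(f"Grok-1 vs {baseline}: {ratio:.2f}x")
```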

Conclusion

As it stands, Grok-1 is the largest open-source AI model ever made available. It is multiple times larger than current state-of-the-art open models such as Llama-2-70B and Mixtral 8x7B. However, early results suggest that Grok-1 falls flat in performance comparisons with the open-source state-of-the-art, and that its generated output is significantly more toxic than that of closed-source alternatives.
