
Mixtral MoE: First open-source model to beat GPT-3.5

Recently, Mistral AI made a big splash with the release of their latest model, Mixtral-8x7B: a Mixture-of-Experts (MoE) model.

First, because Mixtral became the first open-source model to beat GPT-3.5 on LMSys's Chatbot Arena.

Then, because the MoE architecture is amazing (and funny). It's basically 8 models in a trenchcoat: the feedforward layers of the decoder blocks are divided into 8 experts, and for each token, a router decides which 2 of the 8 experts will process it.
The advantage of this architecture is that even though the model has 47B parameters in total (not 8x7B = 56B, since only the feedforward layers are replicated per expert while the attention layers are shared), it is much cheaper and faster to run: only 2 of the 8 experts are activated per token, so roughly 13B parameters are actually used for each prediction.
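To make the mechanism concrete, here is a minimal PyTorch sketch of a top-2 routed Mixture-of-Experts feedforward layer. The class name, dimensions, and expert structure are illustrative assumptions for this post, not Mixtral's actual code or configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Illustrative sparse MoE feedforward block with top-2 routing."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a small linear layer that scores each expert per token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary feedforward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                           # (n_tokens, n_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # keep the 2 best experts per token
        top_w = F.softmax(top_w, dim=-1)                  # normalize their mixing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += top_w[mask, k : k + 1] * expert(x[mask])
        return out

layer = MoEFeedForward()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])

The point of the sketch: all 8 experts' parameters must be stored, but each token only flows through 2 of them, so the per-token compute is close to that of a much smaller dense model.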

But how do you maintain good performance with only a quarter of the model running at a time? The picture below gives a view of the answer: there is a marked specialization between experts, with one stronger on logic, another on history, and so on. The router knows which expert is good at each subject and, like an excellent TV host, carefully picks its experts to always get a good answer.

