Mixtral MoE: the first open-source model to beat GPT-3.5
Recently, Mistral AI made a big splash with the release of their latest model, Mixtral-8x7B, a Mixture-of-Experts (MoE) model.
First, because Mixtral became the first open-source model to beat GPT-3.5 on LMSYS's Chatbot Arena.
Then, because the MoE architecture is amazing (and funny). It's basically 8 models in a trenchcoat: the feedforward layers of the decoder blocks are split into 8 experts, and for each token, a router decides which 2 of them handle the computation.
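To make the routing concrete, here is a minimal sketch of a top-2 MoE feedforward layer in PyTorch. The dimensions, class name, and looped dispatch are illustrative choices for readability, not Mixtral's actual configuration or code; the point is simply that a small router scores each token and only the two selected expert networks ever run.

```python
# Minimal sketch of a Mixture-of-Experts feedforward layer with top-2 routing.
# Hyperparameters are illustrative, not Mixtral's real dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One feedforward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # The router scores each token against each expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        # Each expert only processes the tokens routed to it.
        for i, expert in enumerate(self.experts):
            token_idx, slot = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

# Example: route a batch of 4 token embeddings through the sparse layer.
layer = MoEFeedForward()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```

The Python loop over experts is just for clarity; production implementations group the tokens assigned to each expert and run them in batched, fused kernels instead.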
The advantage of this architecture is that even though the model has roughly 47B parameters in total (not the 8x7B = 56B the name suggests, since the experts only replace the feedforward layers and share everything else), it is much cheaper and faster to run: only 2 of the 8 experts are activated per token, so only about 13B parameters are used for each prediction.
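A quick back-of-envelope calculation shows where those numbers come from, using the configuration reported for Mixtral-8x7B (hidden size 4096, 32 layers, SwiGLU feedforward of width 14336, 8 experts, grouped-query attention with 8 KV heads, 32k vocabulary). Norms and router weights are ignored, so the figures are approximate.

```python
# Rough parameter count for Mixtral-8x7B (approximate, small terms ignored).
d_model, n_layers, d_ff, n_experts = 4096, 32, 14336, 8
n_kv_heads, head_dim, vocab = 8, 128, 32000

ffn_per_expert = 3 * d_model * d_ff                      # SwiGLU: gate, up, down projections
attn_per_layer = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim  # Q, O + smaller K, V
embeddings     = 2 * vocab * d_model                     # input embedding + output head

total  = n_layers * (n_experts * ffn_per_expert + attn_per_layer) + embeddings
active = n_layers * (2 * ffn_per_expert + attn_per_layer) + embeddings        # only 2 experts per token

print(f"total ≈ {total/1e9:.1f}B, active per token ≈ {active/1e9:.1f}B")
# total ≈ 46.7B, active per token ≈ 12.9B
```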
But how do you maintain good performance with only a quarter of your model running at any one time? The picture below (from a post linked in the comments) gives us a view of the answer: there is a marked specialization between experts, with one being stronger on logic, another on history, and so on. The router knows which expert is good at which subject and, like an excellent TV host, carefully picks its experts to always get a good answer.
Join Upaspro to get email updates on news in AI and finance.