We are incredibly excited to announce one more addition to the Shisa V2 family: Shisa V2 405B!
- Shisa V2 405B is the highest-performing LLM ever developed in Japan, and surpasses GPT-4 (0613) and GPT-4 Turbo (2024-04-09) in our eval battery. (It also goes toe-to-toe with GPT-4o (2024-11-20) and DeepSeek-V3 (0324) on Japanese MT-Bench!)
- Like all other Shisa V2 models, Shisa V2 405B is open-weight, commercial-ready, and available now on Hugging Face.
- Our core datasets are also open source (Apache 2.0), and after successfully applying them to models ranging from 7B to 405B parameters, we can confidently say that they should be able to improve the Japanese language capabilities of practically any model.
- If you’d like to practice (or learn!) 日本語, you can chat with Shisa V2 405B right now at: chat.shisa.ai.
Shisa V2 405B is a slightly special version of Shisa V2. Firstly, it is massive. Using Llama 3.1 405B Instruct as the base model, it required >50x the compute for SFT+DPO vs Shisa V2 70B. And while it uses the same Japanese data mix as the other Shisa V2 models, it also includes contributed Korean (KO) and Traditional Chinese (ZH-TW) language data, making it explicitly multilingual rather than just bilingual.
Most notably, Shisa V2 405B not only outperforms Shisa V2 70B on our eval suite, but GPT-4 (0613) and GPT-4 Turbo (2024-04-09) as well. Shisa V2 405B also goes toe-to-toe with GPT-4o (2024-11-20) and DeepSeek-V3 (0324) on Japanese MT-Bench. Based on these evaluation results, we believe that Shisa V2 405B is the highest-performing LLM ever trained in Japan, and that our results help point the way towards how even smaller Japanese AI labs can excel on the global stage.
Sovereign AI
There’s been a lot of talk recently about Sovereign AI: the ability of nations to develop and run their own AI systems. Interestingly, here at Shisa.AI, we’re actually all immigrants who have actively chosen to make Japan our home. Although we were drawn here for different reasons, all of us share a deep appreciation for Japanese culture, language, and the other aspects that make Japan a unique and wonderful place.
We strongly believe that it’s important for homegrown AI to be developed in Japan (and globally!), not just for the sake of cultural diversity and linguistic preservation, but also for data privacy and security, geopolitical resilience, and ultimately, independence.
We believe the open-source approach is the only realistic way to achieve sovereignty in AI, not just for Japan, or even for nation states, but for the global community at large.
Benchmarks/Evals
While it’s tempting to go overboard reciting benchmark numbers (we get it, we’re proud of our results!), we’ll instead point those interested in the full details to our Shisa V2 405B Model Card or our Overview Report.
One thing that we would like to mention about evals is that we took great care with our Shisa V2 datasets to minimize contamination and avoid “benchmaxxing.” Besides using a diverse set of evals, we also developed brand-new Japanese evals to test specific real-world downstream use-cases, and we believe that our improved scores accurately reflect improved model capabilities. Our goal, ultimately, is to make useful models; benchmark results should ideally act simply as a guide to help us get there.
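For those curious what a contamination check can look like in practice, here is a minimal, purely illustrative sketch of a simple n-gram-overlap filter. It is not a description of our actual pipeline; the function names, threshold, and n-gram size are hypothetical.

```python
# Illustrative sketch of one common contamination check: flag training samples
# whose n-grams overlap heavily with eval prompts. This is a generic example,
# not a description of our exact filtering pipeline.

def ngrams(text: str, n: int = 8) -> set:
    """Character n-grams work reasonably for Japanese, which has no whitespace tokens."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def is_contaminated(train_text: str, eval_texts: list[str],
                    n: int = 8, threshold: float = 0.2) -> bool:
    """Return True if a training sample shares too many n-grams with any eval prompt."""
    train_grams = ngrams(train_text, n)
    if not train_grams:
        return False
    for ev in eval_texts:
        overlap = len(train_grams & ngrams(ev, n)) / len(train_grams)
        if overlap >= threshold:
            return True
    return False

# Usage: drop any training sample that trips the check.
# clean = [s for s in train_samples if not is_contaminated(s["text"], eval_prompts)]
```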
Data, Data, Data
It should come as no surprise that the single most important factor in improving our model performance was improving data quality.
While we spent some time chasing down the most intriguing new training techniques from arXiv papers (most failed to replicate for us), what worked best in reliably improving model performance, and gave the best “bang for the buck” in terms of time and resources, was simply focusing on improving our datasets.
Since our first models, Shisa has always been about synthetic data, and we pushed that further than ever with extensive filtering, rating, annotation, and generations from multiple SOTA open models. Since our focus is multilingual, we also experimented with different approaches for native language generation (translation, prompting, etc), and also with different language pair ordering and curriculum learning while training.
The results were not always intuitive. One of our experiments involved pairwise training: we expected that keeping matched multilingual samples paired together would give better results, but in the end simple random shuffling worked better (see the sketch below)!
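As an illustration of the two orderings we compared, here is a toy sketch; the data and field names are hypothetical, and the real training data is of course far larger and more varied.

```python
import random

# Illustrative sketch (not our actual pipeline): two ways to order matched
# multilingual samples before SFT. Field names and data are hypothetical.
paired_samples = [
    {"ja": "富士山は日本で一番高い山です。", "en": "Mt. Fuji is the tallest mountain in Japan."},
    {"ja": "東京は日本の首都です。", "en": "Tokyo is the capital of Japan."},
    # ... thousands more matched pairs
]

def pairwise_order(pairs):
    """Keep each translation pair adjacent (JA immediately followed by its EN match)."""
    out = []
    for p in pairs:
        out.append({"lang": "ja", "text": p["ja"]})
        out.append({"lang": "en", "text": p["en"]})
    return out

def shuffled_order(pairs, seed=42):
    """Flatten all samples and shuffle them independently of their pairing."""
    flat = [{"lang": lang, "text": p[lang]} for p in pairs for lang in ("ja", "en")]
    random.Random(seed).shuffle(flat)
    return flat

# In our experiments, the simple shuffled ordering trained better than
# keeping matched pairs adjacent.
```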
Training Details
The compute for training Shisa V2 405B was generously provided by Ubitus K.K. and was significantly higher than with the other Shisa V2 models.
Our training expanded to a 32-node (256 GPU) H100 Slurm cluster. For SFT, 30 nodes (240 H100s) were used for training, with 2 nodes allocated for continuous evaluation of checkpoints. DPO was trained on all 32 nodes (256 H100s). Overall training time for the final 405B model was over 65,000 H100 hours for SFT and 4,000 H100 hours for DPO, over the course of about 2 weeks in April 2025. As a point of comparison, training Shisa V2 70B took only 1,200 H100 hours for SFT+DPO.
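As a quick back-of-envelope sanity check on those figures (a rough sketch that assumes near-continuous utilization and uses only the numbers quoted above):

```python
# Back-of-envelope check of the compute figures quoted above.
# All numbers are taken from this post; near-full utilization is an assumption.

sft_gpu_hours = 65_000       # SFT on ~240 H100s (30 nodes x 8 GPUs)
dpo_gpu_hours = 4_000        # DPO on 256 H100s (32 nodes x 8 GPUs)
shisa_70b_gpu_hours = 1_200  # Shisa V2 70B, SFT+DPO combined

total_405b = sft_gpu_hours + dpo_gpu_hours
print(f"405B vs 70B compute ratio: ~{total_405b / shisa_70b_gpu_hours:.0f}x")  # ~58x, i.e. ">50x"

# Rough wall-clock time implied by the GPU-hour totals
sft_days = sft_gpu_hours / 240 / 24   # ~11.3 days
dpo_days = dpo_gpu_hours / 256 / 24   # ~0.7 days
print(f"Implied wall clock: ~{sft_days + dpo_days:.0f} days")  # ~12 days, i.e. about 2 weeks
```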
There are very few teams that have publicly done full-parameter fine-tuning of the Llama 3.1 405B model (we corresponded with two of the teams that did, AI2 and Nous Research, during the course of our development), and we faced a number of unique challenges along the way. We share some of the details in our Overview Report, but plan on going into full depth in a future technical report.
Try Shisa V2 405B!
Shisa V2 405B is available for download now under the Llama 3.1 Community License at our Hugging Face repository:
Of course, running the full FP16 model requires at least 800GB of memory, which is a substantial ask for most (2x H100 nodes; 1x H200 node; 1x MI300X node, etc.). We have created FP8, INT8, and several GGUF versions, but realize that even those are still a challenge for most (the IQ3_XS GGUF is still pushing 150GB); a minimal serving sketch for the FP8 version follows the list:
- shisa-ai/shisa-v2-llama3.1-405b-FP8-Dynamic
- shisa-ai/shisa-v2-llama3.1-405b-W8A8-INT8
- shisa-ai/shisa-v2-llama3.1-405b-GGUF
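If you do have the hardware, here is a minimal sketch of serving the FP8 checkpoint with vLLM. It assumes an 8-GPU node with enough aggregate memory for the roughly 400GB of FP8 weights; the context length and sampling settings are illustrative, not a recommended configuration.

```python
# Minimal serving sketch (illustrative, not an official recipe):
# load the FP8-Dynamic checkpoint with vLLM across 8 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="shisa-ai/shisa-v2-llama3.1-405b-FP8-Dynamic",
    tensor_parallel_size=8,   # shard the ~400GB of FP8 weights across 8 GPUs
    max_model_len=8192,       # cap context length to keep KV-cache memory manageable
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# "Please briefly tell me about Japan's four seasons."
messages = [{"role": "user", "content": "日本の四季について簡単に教えてください。"}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```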
If you’d simply like to chat with Shisa V2 405B, we have an FP8 version running now thanks to compute generously provided by Jon Durbin and chutes.ai:
Acknowledgments
@lhl: I’ll take a small break from the “editorial we” to write a brief personal missive. The Shisa V2 research team was basically just two people, who’ve been grinding away for almost 5 months straight to make it happen. Shisa V2 405B represents a culmination of our efforts, and of course, I’m incredibly proud of what we’ve done, especially considering both the other commitments we’ve had as a growing startup and some of the out-of-the-ordinary behind-the-scenes challenges we’ve faced.
So, first, a big thank you to Adam Lensenmayer for his incredible contributions to this project. Adam’s linguistic expertise, tireless work ethic, and ability to roll with the punches have been invaluable. Adam: if there is ever anyone who undervalues or fails to recognize what you’ve done, know that I hold your efforts in the highest esteem, and am profoundly thankful for your hard work.
Ahem.
Although we were a small development team, of course we didn’t do it alone. Besides the administrative support and constant cranking from Jia Shen and the rest of the Shisa.AI team, we should again mention Ubitus for their contributions as compute sponsor of this project.
Shisa V2 405B, of course, benefits from Llama 3.1 405B’s 30M+ H100 hours of training, so a big thank you to Meta Llama and all the teams that provide their base models to the open source community.
We also extend our thanks to all open source AI developers and researchers: without their publicly shared research, tooling, and datasets, none of our work would be possible. We hope that our own contributions will further support the broader community.
A special thanks to Jon Durbin for his work on Shisa V1 and to chutes.ai for providing additional compute for inference hosting and evaluations of the 70B and 405B models.
What’s Next?
Shisa V2 405B represents not just a technical achievement, but a statement: that with the help of the open source community, Japan can compete at the highest levels of AI development.
We will be publishing more technical details in the near future.
Also, look for more open source code and evals once we get a chance to clean some things up.
And, there might be a V2.1 in the near future…
Overview Report
We’re sharing the full Overview Report, which we’ve authored in both English and Japanese, here:
You can also easily view them as images online: