LLMs are lousy at reading Asian languages, finds Singapore's Grab


Proprietary large language models are bad at interpreting Asian languages, according to Singaporean super-app company Grab, which has built its own model instead.

Grab’s super-app offers ride-sharing, food delivery, shopping, and even some financial services. The company is so prominent and dominant in some Asian countries that Uber sold its Southeast Asian operations to Grab and took a stake in the Singaporean company rather than compete directly.

Today, Grab is a major player in Singapore, Malaysia, Indonesia, the Philippines, Vietnam, Thailand, Cambodia, and Myanmar. Several of those countries write in scripts other than the Latin alphabet used by English, and Vietnamese uses Latin letters dense with diacritics.

In a Tuesday post on its Engineering blog, four Grab staffers explained that the company needs to accurately extract information from ID cards, driver’s licenses, and registration certificates for compliance chores like know-your-customer checks. Grab tried Optical Character Recognition (OCR) systems, but its chosen tech “struggled with the variety of document templates it had to process.”

It's 2025, so the org investigated whether large language models could solve its problem.

“While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA [South East Asian] languages, produced errors, hallucinations, and had high latency,” the post reveals. “On the other hand, open-sourced Vision LLMs were more efficient but not accurate enough for production.”

The company decided its best option was to build its own Vision LLM, a model that encodes images into representations from which a large language model can extract text.

“We evaluated a range of LLMs capable of performing OCR and Key Information Extraction (KIE),” the post states, and chose Alibaba Cloud’s Qwen2-VL 2B for reasons including:

  • Efficient size: It is small enough for full fine-tuning on GPUs with limited VRAM resources.
  • SEA language support: Its tokenizer is efficient for languages like Thai and Vietnamese, indicating decent native vocabulary coverage.
  • Dynamic resolution: Unlike models that require fixed-size image inputs, Qwen2-VL can process images in their native resolution. This is crucial for OCR tasks as it prevents the distortion of text characters that can happen when images are resized or cropped.

To build its model, Grab extracted SEA language content from the Common Crawl, an open collection of data scraped from the web, then built what the authors describe as “an in-house synthetic data pipeline to generate text images by rendering SEA text contents in various fonts, backgrounds and augmentations.”
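The post does not publish that pipeline, but a minimal sketch of the idea, rendering crawled SEA text into noisy images with Pillow, might look like the following; the fonts, sizes, and augmentations here are placeholder assumptions.

```python
# Hypothetical sketch of a synthetic OCR data generator in the spirit Grab
# describes: render SEA-language text into images with varied fonts and
# backgrounds, then apply simple augmentations to mimic scans and photos.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

FONTS = ["fonts/NotoSansThai-Regular.ttf", "fonts/NotoSans-Regular.ttf"]  # assumed font files

def render_text_image(text: str, size=(640, 96)) -> Image.Image:
    bg_gray = random.randint(200, 255)                        # light, slightly varied background
    img = Image.new("RGB", size, (bg_gray, bg_gray, bg_gray))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(24, 40))
    draw.text((10, 20), text, fill=(0, 0, 0), font=font)

    # Light augmentations: blur and a small rotation mimic capture noise
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.3, 1.2)))
    if random.random() < 0.5:
        img = img.rotate(random.uniform(-3, 3), expand=True, fillcolor=(bg_gray,) * 3)
    return img

sample = render_text_image("ตัวอย่างข้อความภาษาไทย")  # an example Thai string from crawled text
sample.save("synthetic_sample.png")
```

Pairing each rendered image with the text it was drawn from gives the ground-truth labels a supervised OCR fine-tune needs, without any manual annotation.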

The team next tried to fine-tune a Vision LLM using Qwen2VL and Low-Rank Adaptation (LoRA), a technique they found “efficient because it allows lightweight updates to the model’s parameters, minimizing the need for extensive computational resources.”
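As a rough illustration of what LoRA fine-tuning involves, the sketch below attaches low-rank adapters to Qwen2-VL 2B with the peft library; the rank, dropout, and target modules are assumptions, since the post does not publish Grab’s configuration.

```python
# Hypothetical sketch: wrap Qwen2-VL-2B with LoRA adapters using peft.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)
lora_config = LoraConfig(
    r=16,                      # low-rank dimension keeps the update matrices small
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights receive gradients
```

Because only the adapter matrices are updated, optimizer state stays small, which is what makes the approach workable on GPUs with limited VRAM.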

“We trained the model on our curated document data, which included various document templates in multiple languages. The performance was promising for documents with Latin scripts. Our experiment of LoRA fine-tuned Qwen2VL-2B achieved high field-level accuracy for Indonesian documents.”

Thai and Vietnamese remained hard to recognize, as did documents with unstructured layouts and small, dense text.

Further experiments suggested that existing vision LLMs had seen too little “visual text in SEA languages during vision encoder and joint training.” Grab’s team therefore decided to perform full-parameter fine-tuning of its model.

“We first trained the vision components of the model using synthetic OCR datasets that we created for Bahasa Indonesia, Thai, Vietnamese, and English. This helps the model to learn the unique visual patterns of SEA scripts,” the team wrote. Next came full-parameter fine-tuning to refine all components of the model with task-specific document data.
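A hypothetical sketch of that two-stage schedule is below, assuming the Hugging Face Qwen2-VL module layout in which the vision tower’s parameters carry “visual” in their names; the data loading and training loops themselves are omitted.

```python
# Hypothetical sketch of the staged schedule described in the post. The
# "visual" name check assumes the Hugging Face Qwen2-VL module layout.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Stage 1: train only the vision components on synthetic SEA-script OCR data
for name, param in model.named_parameters():
    param.requires_grad = "visual" in name
# ... run the OCR training loop here ...

# Stage 2: unfreeze everything for full-parameter fine-tuning on document data
for param in model.parameters():
    param.requires_grad = True
# ... run the document extraction training loop here ...
```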

Grab rated the resulting model a success but admitted the fine-tuning process “pushed the limits of GPUs.”

“To optimize resources used and to create a model perfectly tailored to our needs, we decided to build a lightweight Vision LLM (~1B parameters) from scratch.”

Grab’s post explains the process it used to create its model, and the results – performance better than OCR tools, Qwen2, ChatGPT, and Google’s Gemini.

The company concluded that “strategic training with high-quality data enables smaller, specialized models to achieve remarkable efficiency and effectiveness.”

Grab now plans more of its own models.

“We’re developing Chain of Thought-based OCR and Key Information Extraction (KIE) models to strengthen generalisation capabilities and tackle even more diverse document scenarios,” the post states, and will also extend its advanced document processing tech “to Myanmar, Cambodia, and beyond.”

Grab’s experience aligns with predictions this Vulture often hears about the future of AI in the enterprise, namely that many organizations will develop their own models to handle specialized tasks that general-purpose models weren’t built to address. ®
