Grab, the Singapore-based superapp company, recently disclosed its decision to develop an in-house artificial intelligence (AI) model for internal use: a specialized, lightweight vision large language model (LLM) designed to scan documents and extract key information efficiently. The company said the move was necessitated by the significant shortcomings of both proprietary and open-source AI models in understanding Southeast Asian languages. The disclosure underscores persistent concerns about the accessibility and reliability of advanced AI models from tech giants such as Google, OpenAI, and Anthropic.
The Lingering Challenge of AI in Non-English Languages
In a detailed blog post outlining the architecture and training of its custom vision model, Grab described the challenges it encountered when trying to rely on external AI solutions. The company noted that “While powerful proprietary Large Language Models (LLMs) were an option, they often fell short in understanding SEA languages, produced errors, hallucinations, and had high latency. On the other hand, open-sourced Vision LLMs were more efficient but not accurate enough for production.”
The difficulty AI models face with non-English languages is a long-standing issue that researchers have consistently highlighted and that AI developers have been striving to address. Despite achieving basic proficiency in widely spoken languages such as Hindi, Japanese, Spanish, and Chinese, these models frequently struggle to grasp the subtler nuances of those languages. This makes them adequate for general conversation but severely restricts their usefulness for specialized enterprise or research applications.
For instance, a recent academic paper found that even AI models developed by Chinese firms perform poorly on Chinese minority languages, mirroring the struggles observed in Western models. The issue persists across proprietary offerings from Google, OpenAI, Meta, and Anthropic as well as various open-source alternatives.
The primary reason for this struggle is the scarcity of readily available, sufficiently large datasets for training models on these languages. This data gap has prompted major AI companies to partner with institutions in countries like India to gather more Indic-language data. Google, for example, partnered with IIT Bombay in July to develop Indic-language AI speech models. Meta is reportedly paying contractors $55 an hour to train its models in Hindi. OpenAI has also announced a $500,000 research collaboration with IIT Madras aimed at improving its models’ understanding of diverse languages.
While collecting such extensive datasets is costly, it is achievable for prominent Asian and other major languages. Minority languages, however, particularly those without official scheduled status in countries like India, will likely remain a significant challenge for these models to master. Without adequate competence in these languages, the accessibility and practical usefulness of AI will remain fundamentally limited for a large segment of the global population.