The Equitable AI Imperative: Why Collaboration is Key to Avoiding Bias in Nepal's GenAI Model

Nepal stands at a critical juncture in the global Artificial Intelligence revolution. With the establishment of the National AI Centre and a focus on developing a localized Generative AI (GenAI) model, the nation has the opportunity to harness technology for development and good governance. However, if this model is trained on unrepresentative data, it risks becoming a digital engine for reinforcing, rather than resolving, the country’s deep-seated socio-economic, gender, and linguistic biases.

The solution is not just technical; it is fundamentally social and political, demanding comprehensive collaboration between the National AI Centre and the private sector for ethical data collection and curation.


The Challenge of 124 Languages: Linguistic Bias


Nepal is a land of rich linguistic diversity, boasting 124 different mother tongues. Yet digital data is overwhelmingly dominated by standard Nepali, effectively sidelining the more than 55% of the population who speak languages like Maithili, Bhojpuri, Tharu, Tamang, and Nepal Bhasa (Newari).

The Risk: An AI model trained on this skewed dataset will inevitably perform poorly for regional and indigenous dialect speakers. This is more than a technical glitch; it is a serious risk of cultural assimilation and deepening the digital divide for rural and indigenous communities.


The Solution: The National AI Centre must leverage private sector partners (media houses, telecom providers, local software firms) to actively crowdsource, annotate, and validate diverse linguistic datasets. This ensures the GenAI model can serve all of Nepal. 
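To make the validation step concrete, here is a minimal sketch, in Python with only the standard library, of how a curated corpus's language tags could be compared against census mother-tongue shares to quantify under-representation. The record layout and the ISO 639-style language codes are illustrative assumptions, not a schema the National AI Centre has published.

    from collections import Counter

    # Hypothetical annotated records; the "lang" tags (ISO 639-style codes
    # for Nepali, Maithili, Bhojpuri, Tharu, Tamang, Newari) are assumed here.
    corpus = [
        {"text": "...", "lang": "ne"},
        {"text": "...", "lang": "mai"},
        {"text": "...", "lang": "ne"},
        {"text": "...", "lang": "bho"},
    ]

    # Mother-tongue shares from the 2021 Census (fraction of population).
    census_share = {"ne": 0.4486, "mai": 0.1105, "bho": 0.0624,
                    "thl": 0.0588, "taj": 0.0488, "new": 0.0296}

    counts = Counter(doc["lang"] for doc in corpus)
    total = sum(counts.values())

    # Report how far each language's corpus share falls short of its census share.
    print(f"{'lang':<6}{'corpus':>9}{'census':>9}{'gap':>9}")
    for lang, target in census_share.items():
        observed = counts.get(lang, 0) / total
        print(f"{lang:<6}{observed:>9.1%}{target:>9.1%}{observed - target:>+9.1%}")

In practice the tags themselves would come from crowdsourced annotation and language identification, which is exactly where media houses, telecom providers, and local software firms can contribute raw text and validation effort at scale.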


The Gender Gap: Stereotypes in the Training Data


Gender disparities in literacy and digital access translate directly into an unrepresentative training dataset. With a significant gap between male literacy (83.6%) and female literacy (69.4%), the volume of digital content produced by and about women is correspondingly smaller.

The Risk: AI models learn from historical context, and when that context is patriarchal, the AI reinforces it. This can manifest as the model associating male pronouns with high-status, technical, or leadership roles, and female pronouns with traditional, domestic roles. Such sexist outputs actively perpetuate gender inequality in a new technological layer.

The Solution: Targeted data curation is necessary, including active efforts to source content from women’s digital initiatives and gender-focused organizations. Furthermore, engaging sociologists and anthropologists to audit the training data is crucial for flagging and mitigating implicit patriarchal stereotypes before the model is deployed.  
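As a minimal illustration of what such an audit could flag, the sketch below counts how often gendered pronouns co-occur with high-status versus domestic role words in the same sentence. It is deliberately simplified: the tokens are English and the lexicons hypothetical, whereas a real audit would run on Nepali text with term lists curated by the sociologists and anthropologists involved.

    import re
    from collections import Counter

    # Hypothetical role lexicons; real lists would be curated in Nepali.
    HIGH_STATUS = {"engineer", "doctor", "director", "minister"}
    DOMESTIC = {"cook", "caregiver", "homemaker", "cleaner"}
    MALE, FEMALE = {"he", "him", "his"}, {"she", "her", "hers"}

    def pronoun_role_counts(sentences):
        """Count same-sentence co-occurrences of pronoun and role sets,
        a crude proxy for stereotyped pairings in training text."""
        counts = Counter()
        for sentence in sentences:
            tokens = set(re.findall(r"[a-z']+", sentence.lower()))
            for gender, pronouns in (("male", MALE), ("female", FEMALE)):
                for role, words in (("high_status", HIGH_STATUS),
                                    ("domestic", DOMESTIC)):
                    if tokens & pronouns and tokens & words:
                        counts[(gender, role)] += 1
        return counts

    sample = ["He was promoted to hospital director.",
              "She stayed home as a caregiver."]
    print(pronoun_role_counts(sample))
    # A heavily skewed ratio (e.g. far more male/high_status pairings)
    # flags passages that need re-balancing or exclusion.

Even a crude counter like this gives auditors a ranked list of documents to review, rather than leaving the search for stereotypes entirely manual.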

The Urban-Caste Divide: Socio-Economic Inequity

Nepal’s complex ethnic and caste mosaic, comprising 142 distinct groups, introduces deep socio-economic bias into digital representation. Data is often skewed towards dominant groups like the Khas-Arya (39.4% of the population), while marginalized groups like Dalits (8.12%) are severely underrepresented in the digital sphere. Internet access in Kathmandu stands at a high 79.3%, contrasting sharply with the low 17.4% in rural areas.

The Risk: AI models will therefore be highly effective for urban, privileged users but deliver ineffective, culturally insensitive, or even discriminatory outputs in rural, low-resource, or marginalized caste contexts. This underscores the need for caste auditing in GenAI development.

The Solution: The strategy must center on inclusivity. Private sector and NGO collaboration is essential to collect ethically sourced, high-quality data from rural communities. Diverse community representatives must be integrated into the auditing process to check for outputs that could perpetuate caste discrimination.

The Foundational Role of Collaboration and Auditing


The path to trustworthy and equitable AI in Nepal hinges on transforming the National AI Centre into a nexus of multi-stakeholder collaboration. The statistics in this article are drawn primarily from the Nepal Population and Housing Census 2021; to underline the scale of the challenge for the GenAI model, the following sections break the linguistic and socio-economic groups down in detail.

Linguistic Diversity: The Scale of Non-Nepali Speakers

As noted earlier, Nepali dominates the digital sphere, potentially sidelining non-Nepali speakers. The 2021 Census gives concrete numbers for the top mother tongues in Nepal (total population 29,164,578), which highlight the magnitude of the linguistic data gap: Nepali 44.86%, Maithili 11.05%, Bhojpuri 6.24%, Tharu 5.88%, Tamang 4.88%, Bajjika 3.89%, Avadhi 2.96%, and Nepal Bhasa (Newari) 2.96%. This data confirms that the five non-Nepali languages highlighted earlier (Maithili, Bhojpuri, Tharu, Tamang, and Nepal Bhasa) are spoken as a mother tongue by over 9 million people (approximately 31% of the population), all of whom are at high risk of being underserved by an AI model trained primarily on Nepali data.
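These headline figures follow from a few lines of arithmetic; the snippet below, a simple check rather than any official methodology, reproduces the combined share and speaker count for the five highlighted languages.

    total_pop = 29_164_578  # 2021 Census total population

    # Census mother-tongue shares for the five highlighted languages.
    shares = {"Maithili": 0.1105, "Bhojpuri": 0.0624, "Tharu": 0.0588,
              "Tamang": 0.0488, "Nepal Bhasa": 0.0296}

    combined = sum(shares.values())           # ≈ 0.3101
    speakers = round(total_pop * combined)    # ≈ 9,043,936
    print(f"Combined share: {combined:.2%}")  # Combined share: 31.01%
    print(f"Speakers: {speakers:,}")          # Speakers: 9,043,936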


Socio-Economic and Caste Skewness

The data confirms the stark contrasts in representation among Nepal's 142 distinct caste and ethnic groups:

• Dominant Groups (Khas-Arya): The Khas-Arya group, which includes Chhetri (16.45%) and Brahman-Hill (11.29%), represents the single largest bloc in the country. Data generated by and about these groups is likely over-represented in the digital sphere, leading to the risk of AI models optimizing for their specific contexts.

• Marginalized Groups (Dalit and Janajati): The Dalit caste group constitutes 8.12% of the population, while major Janajati (Indigenous Nationalities) groups such as Magar (6.9%), Tharu (6.2%), and Tamang (5.62%) remain vulnerable to digital underrepresentation despite their substantial share of the populace.

The need for caste auditing is underscored by the fact that the most dominant and the most marginalized groups coexist within this digital divide.

Gender Gap: Literacy Statistics

These figures are based on the population aged 5 years and above in the 2021 Census:

• Overall literacy rate: 76.2%

• Male literacy rate: 83.6%

• Female literacy rate: 69.4%

The 14.2 percentage point gap (83.6% - 69.4%) in literacy rate directly impacts the volume and nature of digital content created by women, reinforcing the risk that the GenAI model will learn and perpetuate historical gender stereotypes.

These statistics collectively highlight that the challenge for the National AI Centre is not abstract; it involves ensuring that the technology serves the roughly 1.7 million Tharu speakers, the 14.9 million women, and the 8.12% of the population identifying as Dalit, all of whom are at risk of being digitally sidelined.

The model’s success depends not just on technical expertise, but on social responsibility. By forging strong partnerships with the private sector for data collection and actively engaging sociologists, ethicists, and community leaders for rigorous auditing, Nepal can develop a localized GenAI model that truly reflects the richness and diversity of its people, ultimately bridging the digital divide rather than deepening it.