Building & Fine-Tuning an SLM (Small Language Model) for the Chotanagpuri Language, with Government Scheme Information as the Use Case
The advent of artificial intelligence (AI) and machine learning (ML) has opened new avenues for delivering personalized information to people across different regions and languages. One of the most impactful uses of AI in the Indian context is disseminating information on government schemes to underrepresented communities, which often face barriers of language and literacy. In states like Jharkhand, where Chotanagpuri (a dialect of Sadri) is widely spoken, fine-tuning a model to deliver information about government schemes in this local language can help rural populations access benefits they are often unaware of.
This article explores the process of fine-tuning a small language model for government schemes in Chotanagpuri and the sources of data that can be used to ensure the model's accuracy and relevance. By tailoring models to the needs of marginalized communities, this approach aims to enhance their awareness and access to social welfare programs, including subsidies, loans, healthcare, and education.
The Importance of Language in Government Scheme Awareness
Despite significant efforts by the Indian government to provide welfare schemes aimed at improving the lives of rural citizens, a large portion of the population remains unaware of their eligibility and entitlements. This information gap is often exacerbated by the use of official languages such as Hindi and English, which many tribal communities, including Chotanagpuri speakers, may not fully comprehend. By providing critical information in local dialects, such as Chotanagpuri, AI models can bridge this divide and ensure that people are informed about their rights and opportunities.
Understanding the Chotanagpuri Language and Its Context
Chotanagpuri, a dialect of Sadri, is widely spoken among the tribal communities in the Chotanagpur plateau, including the districts of Ranchi, Khunti, Gumla, and Simdega in Jharkhand. The language is an amalgamation of several regional influences, including Mundari, Kurukh, and Nagpuri. It is primarily spoken in informal settings and lacks extensive written documentation, making it challenging to build a robust corpus for AI model training. However, local folk traditions, oral histories, and cultural materials offer rich data sources that can be used to fine-tune language models for this dialect.
Why Fine-Tune a Model for Chotanagpuri?
Large language models, such as LLaMA or the GPT models behind ChatGPT, are powerful tools, but they are designed for general language tasks and often lack the specificity needed for hyper-localized contexts. Fine-tuning allows a model to adapt to a specific domain and language, such as Chotanagpuri, and deliver more accurate and relevant outputs.
For example, a fine-tuned model for Chotanagpuri could provide personalized information about government schemes such as:
1. Pradhan Mantri Kisan Samman Nidhi (PM-KISAN) – Financial support to small and marginal farmers.
2. Ayushman Bharat Yojana – Access to healthcare benefits.
3. Pradhan Mantri Awas Yojana (PMAY) – Housing schemes for the economically weaker sections.
4. MGNREGA (Mahatma Gandhi National Rural Employment Guarantee Act) – Employment guarantee scheme for rural populations.
5. Jharkhand Mukhyamantri Maiyan Samman Yojna (JMMSY) – Provides ₹1,000 per month to eligible women between 21 and 50 years of age from families living below the poverty line.
By making this information available in Chotanagpuri, the model can address critical gaps in awareness and ensure that beneficiaries are informed in a language they understand.
Fine-Tuning a Model for Chotanagpuri
The process starts with data collection, followed by preprocessing, model selection, training, and evaluation of the fine-tuned model before it is deployed across the state. Each step is described below.
Data Collection
Data collection is perhaps the most critical step in fine-tuning a model. For a language like Chotanagpuri, which is underrepresented in digital spaces, the process can be challenging. However, the following sources can provide useful data for training the model:
- Government Portals in Hindi: Many government portals, such as those for PM-KISAN, MGNREGA, and the Public Distribution System (PDS), provide information in Hindi, which can be translated and adapted to Chotanagpuri. The fine-tuned model can leverage this Hindi content and make it accessible in the local language.
- Local Newspapers and Media: Regional newspapers, radio broadcasts, and community programs that provide information on government schemes in colloquial Hindi or Sadri can be valuable. Some local newspapers and magazines already cater to Chotanagpuri speakers, and their archives can be a useful data source.
- Cultural Documentation: The rich oral traditions and folk literature of the Chotanagpuri-speaking community can provide insights into common language structures and phrases. Folktales, songs, and oral histories recorded by linguists or local organizations can help build a dataset that captures the essence of the dialect.
- Community-Based Organizations (CBOs): Many NGOs and community organizations work directly with Chotanagpuri-speaking populations to promote awareness of government schemes. These organizations often publish materials in local dialects or conduct outreach programs that can serve as training data for the model.
- Crowdsourcing: Engaging local speakers through crowdsourcing platforms to contribute Chotanagpuri phrases, explanations of government schemes, and cultural insights can significantly improve the quality and relevance of the dataset.
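Because the material comes from such different sources, it helps to record where each piece of text came from. The following is a minimal sketch of one way to do that, not a prescribed format: the record fields (language, source_type, source_name, consent) and the helper names are illustrative assumptions, and the sample text is placeholder content rather than real Chotanagpuri.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CorpusRecord:
    """One collected passage plus its provenance (all field names are hypothetical)."""
    text: str          # the passage itself
    language: str      # e.g. "chotanagpuri", or "hindi" if still awaiting translation
    source_type: str   # e.g. "government_portal", "newspaper", "crowdsourced"
    source_name: str   # portal, publication, organization, or contributor ID
    consent: bool      # whether permission to use the text was recorded


def usable(records):
    """Keep only records that have non-empty text and recorded consent."""
    return [r for r in records if r.text.strip() and r.consent]


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII text human-readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


if __name__ == "__main__":
    sample = [
        CorpusRecord("placeholder passage", "chotanagpuri", "crowdsourced", "volunteer-01", True),
        CorpusRecord("placeholder passage", "hindi", "newspaper", "example paper", False),
    ]
    print(to_jsonl(usable(sample)))
```

Keeping the source type on every record makes it possible to check later how much of the dataset is translated Hindi and how much is original Chotanagpuri.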
Data Preprocessing
Once the data has been collected, it needs to be preprocessed. This involves cleaning the text by removing noise, such as irrelevant content or language code-switching (e.g., mixing Hindi and English). Tokenization, which breaks down the text into smaller units like words or subwords, is critical to handling the complex morphology and syntactic structure of Chotanagpuri. Preprocessing should also include normalizing variations in spelling and grammar, given that Chotanagpuri lacks standardized orthography.
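The cleaning and normalization steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a complete pipeline: it assumes the text is stored as Unicode (for example, in Devanagari), and the SPELLING_VARIANTS table is a hypothetical stand-in for a mapping that native speakers and linguists would need to supply.

```python
import re
import unicodedata

# Hypothetical variant table: maps a spelling variant to the form chosen as canonical.
# Real entries would have to come from native speakers; this one is only illustrative.
SPELLING_VARIANTS = {
    "yojna": "yojana",
}


def normalize_text(text, variants=SPELLING_VARIANTS):
    """Apply Unicode NFC, strip leftover HTML tags, collapse whitespace, unify spellings."""
    text = unicodedata.normalize("NFC", text)   # one code-point sequence per visible character
    text = re.sub(r"<[^>]+>", " ", text)        # drop tags left over from scraped pages
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of spaces and newlines
    words = [variants.get(word.lower(), word) for word in text.split(" ")]
    return " ".join(words)


def deduplicate(lines):
    """Normalize every line and keep the first occurrence of each non-empty result."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = normalize_text(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique


if __name__ == "__main__":
    print(deduplicate(["PMAY   <b>Yojna</b>", "PMAY Yojna", ""]))
```

Unicode normalization matters here because the same Devanagari letter can be encoded in more than one way, and a tokenizer would otherwise treat the two encodings as different tokens.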
Model Selection and Training
For the actual fine-tuning process, an open-weight pre-trained model such as LLaMA can be used as a starting point. (The models behind ChatGPT cannot be downloaded and can only be fine-tuned through the provider's API where that is offered, which makes them a less natural fit for a small, locally hosted model.) Such models have already been trained on large, general datasets in multiple languages, so they have a foundational understanding of grammar, syntax, and context. However, whether a feasible prototype for Chotanagpuri emerges will depend largely on the overall architecture of the solution.
- Transfer Learning: Transfer learning allows the model to retain its general language understanding while focusing on the specific Chotanagpuri dataset. By using the Chotanagpuri dataset for additional training, the model learns the nuances of the language and can generate more accurate, context-aware responses.
- Training for Domain-Specific Tasks: In addition to fine-tuning the model for the Chotanagpuri language, it is essential to train the model for domain-specific tasks such as answering questions about government schemes. This involves providing examples of common queries (e.g., "How do I apply for the PMAY scheme?" in Chotanagpuri) and their correct responses.
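The question-and-answer examples described above are usually stored in a fixed prompt format before training. The sketch below shows one possible way to build such a dataset and hold out a validation set. The prompt template, the function names, and the 10% validation fraction are assumptions made for illustration; the example question is the English one quoted above and would be written in Chotanagpuri in practice.

```python
import json
import random

# Hypothetical prompt layout; any consistent template would serve the same purpose.
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Answer:\n{answer}"


def format_example(question, answer):
    """Turn one question/answer pair into a single training record."""
    return {"text": PROMPT_TEMPLATE.format(question=question.strip(), answer=answer.strip())}


def train_val_split(examples, val_fraction=0.1, seed=0):
    """Shuffle reproducibly and hold out a fraction of the examples for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    if len(shuffled) > 1:
        n_val = max(1, n_val)   # always hold out at least one example when possible
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    pairs = [("How do I apply for the PMAY scheme?", "placeholder answer")] * 10
    train, val = train_val_split([format_example(q, a) for q, a in pairs])
    print(len(train), len(val))
    print(json.dumps(train[0], ensure_ascii=False))
```

Holding out examples the model never sees during training is what makes the later evaluation step meaningful.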
Evaluation and Testing
After training the model, it must be evaluated to ensure accuracy and contextual relevance. Evaluation can be done by running a series of tests in which users provide real-life queries in Chotanagpuri and compare the model's responses to the expected answers. Metrics such as BLEU (word-sequence overlap with reference answers) or perplexity (how well the model predicts held-out text) can measure the model's linguistic accuracy, but real-world testing is critical to assess its utility in providing information on government schemes.
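To make the two metrics concrete, here is a simplified sketch of each. Perplexity is the exponential of the negative mean log-probability the model assigns to each token; the second function computes the clipped n-gram precision that BLEU is built from, not the full BLEU score (it omits the brevity penalty and the averaging over n-gram sizes). The function names and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def clipped_ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference, with clipped counts."""
    cand = candidate.split()   # whitespace tokenization, a simplification
    ref = reference.split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


if __name__ == "__main__":
    print(perplexity([math.log(0.25)] * 4))
    print(clipped_ngram_precision("a b c", "a b d", n=2))
```

A lower perplexity and a higher overlap both suggest a better model, but neither says whether an answer about a scheme is factually correct, which is why testing with real users remains necessary.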
Deployment
Once the fine-tuned model and the overall solution have been developed and evaluated, the model can be deployed in several ways:
- Chatbots: Chatbots can be integrated into mobile apps or websites, allowing users to interact with the model in Chotanagpuri and get real-time answers about government schemes. The chatbot can either be integrated with an existing platform such as WhatsApp or be built as a standalone mobile application.
- Interactive Voice Response (IVR) Systems: For areas with low literacy rates, an IVR system that allows users to ask questions in Chotanagpuri can be a valuable tool for disseminating information.
- Mobile Applications: A mobile app that provides alerts, reminders, and updates on government schemes in Chotanagpuri can ensure timely and actionable information reaches rural populations.
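All three channels can share the same request-handling logic and differ only in how the question arrives and how the answer is delivered. The sketch below illustrates that idea under stated assumptions: generate is a hypothetical callable standing in for the fine-tuned model, and the fallback message and the 500-character limit are placeholders (the fallback would be written in Chotanagpuri in practice).

```python
from typing import Callable

# Placeholder fallback text; in a real deployment this would be in Chotanagpuri.
FALLBACK = "Sorry, no answer is available right now. Please contact your local office."


def answer_query(message: str, generate: Callable[[str], str], max_chars: int = 500) -> str:
    """Channel-independent handler: validate the question, call the model, bound the reply."""
    question = message.strip()
    if not question:
        return FALLBACK
    try:
        reply = generate(question).strip()   # `generate` stands in for the fine-tuned model
    except Exception:
        return FALLBACK                      # never surface a raw error to the user
    if not reply:
        return FALLBACK
    return reply[:max_chars]                 # keep replies short enough for chat or IVR


if __name__ == "__main__":
    def echo_model(question: str) -> str:
        return "placeholder answer to: " + question

    print(answer_query("How do I apply for the PMAY scheme?", echo_model))
```

A chatbot would pass typed text to this handler, while an IVR system would first convert speech to text and then read the returned answer aloud.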
Challenges and Solutions
While fine-tuning models for Chotanagpuri has great potential, there are several challenges to consider:
1. Lack of Digitized Data: Given that Chotanagpuri is primarily an oral language, finding enough written data for training purposes can be difficult. Solutions include crowdsourcing and leveraging existing content in Hindi or Sadri, which can be adapted to Chotanagpuri.
2. Limited Linguistic Resources: Unlike Hindi or English, there are few linguistic resources available for Chotanagpuri. Collaborating with local universities, linguists, and community organizations can help fill this gap by contributing language data and insights.
3. Regional Variations: Chotanagpuri, like many dialects, varies across regions. Collecting data from several districts and normalizing spelling variants can help the model account for these variations.
Fine-tuning a language model for Chotanagpuri to disseminate information about government schemes can have a profound impact on rural communities in Jharkhand. By providing localized, accessible information, AI can help bridge the knowledge gap, empowering people to make informed decisions about their entitlements and benefits. While there are challenges in data collection and language variation, a concerted effort from linguists, technologists, and the local community can create a valuable tool for the people of the Chotanagpur plateau.