What started as a small experiment in Malayalam RAG quietly turned into a Kerala Legislative Digital Doc RAG
**"So... I think I'm done?"**
Weird feeling to type that out.
Two months ago, I asked myself a dumb question: "Why can't my mom just *ask* for what she needs instead of flipping through 200 pages?"
Today, I have... an answer? A working answer? I think?
**Let me rewind.**
The thing nobody tells you about building AI projects:
It's not the code that breaks you. It's the *"wait, why doesn't this work in Malayalam?"* moments at 2 AM.
I started simple. Built an English RAG. Felt smart. Then my brain went: "Cool. Now do it in Malayalam."
**The problem I was actually trying to solve:**
My mom is judicial court staff. She spends hours every week hunting through old legal documents for specific sections. Physical books. Dense PDFs. The whole nightmare.
Meanwhile, I'm over here asking ChatGPT to explain quantum physics in haikus.
Felt... wrong.
Then during a college competition, my team needed to cite Kerala legislative documents. We had a 60-page Malayalam PDF. Formal language. Zero structure.
We basically gave up and guessed.
That's when it stopped being a "fun experiment" and became: *this is a real problem that real people have every single day.*
**What I built (in between the breakdowns):**
Niyamadarshini. A RAG chatbot for Kerala's legislative documents.
You ask a question in Malayalam or English. It:
- Finds the relevant bill/act
- Explains it in simple language (Malayalam/English)
- Shows you the exact source with page numbers
No more PDF diving. No more ctrl+F-ing through scanned documents that aren't even searchable.
**The pipeline:**
- Tesseract + PaddleOCR for handling scanned Malayalam documents
- BGE-M3 embeddings that don't butcher the language
- Gemini handling the multilingual conversation layer
- A pipeline that detects if a PDF is text or scan and routes accordingly
- Local storage so you don't lose your chat history
**The parts that almost killed me:**
- Getting chunks that made *sense* in Malayalam (not just random cut-offs mid-sentence)
- Making sure the LLM doesn't hallucinate legal information (kinda important)
- Cross-language queries that don't lose meaning in translation
- Retrieval that pulls the *right* context, not just *similar* words
**Where I'm at now:**
The core works. Like, actually works.
I'm still smoothing edges, fixing weird cases, making the UI less "developer made this" and more "humans can use this."
Once it's properly deployed, I'll drop the link.

I don't know if this will help hundreds of people or just my mom and a few researchers.
But watching her spend hours on something I can now answer in 30 seconds?
Worth every debugging session. Every failed model. Every "why is this still not working" moment.
Sometimes you build things because they're cool.
Sometimes you build them because you're tired of watching someone you care about struggle with something that feels... solvable.
If you've been following along—thank you. For real.
And if you know someone who deals with Kerala legislative documents regularly (lawyers, researchers, students, govt staff), keep an eye out. This might actually help.
More soooon.... ✨