A Journey Through Failed Experiments (and What Finally Appears to Be Working)
When I see many friends working day and night to build an AI chatbot for ERPNext, I feel a strange mix of admiration and quiet sadness. Admiration — because they are genuinely smart, driven, and technically strong. Sadness — because I recognize the road they are walking on. I’ve walked it already. Slowly. Painfully. Often alone.
This work wasn’t a side project or a weekend experiment. It was done for real clients, with real data, real expectations, and very little tolerance for hallucinations. And after many months of failed experiments, broken assumptions, and uncomfortable truths, I feel an urge to speak — not to discourage anyone, but to save them from losing a year learning the same lessons the hard way.
This is not a success story. It’s a log of broken assumptions, failed architectures, and a faint sense that something — at last — might be working.
During an ERP implementation for a large group, the idea of building an AI agent first surfaced in a conversation with the owner. He considered himself tech-savvy, and to be fair, he was curious, engaged, and eager to sound ahead of the curve. Technical jargon came easily to him, often delivered with confidence, almost as a badge of credibility.
“We already have the data,” he said more than once. “Why can’t we just build an AI agent on top of it? Something we can talk to like a human. Ask questions. Get answers. Directly from the system.”
There was a certainty in the way he spoke, as if the problem had already been solved somewhere else and we were simply late to adopt it. Everything was becoming AI, he insisted. We couldn’t afford to fall behind. And in between these declarations, more words would appear — machine learning, transformers, models — floating in the air, disconnected from any real understanding of what they meant in the context of an ERP system, or what it actually takes to make one behave reliably in front of business users.
At the time, I listened quietly. I had heard this tone before — optimistic, impatient, convinced that intelligence emerges naturally the moment data exists. I didn’t argue much. I didn’t know then how long that idea would stay with me, or how many months I would spend trying to make it real, only to slowly learn how wrong the assumptions were.
Soon after, the contract was signed. And almost immediately, without ceremony or planning, I found myself doing what many before me had done — asking AI chatbots how to build an AI chatbot.
To my surprise, ChatGPT responded with complete code for building a chatbot using the OpenAI API, followed by a confident five-hundred-word explanation. A simple YouTube search returned thousands of videos — mostly smart men, a few women — patiently explaining how to build AI agents, many of them with hundreds of thousands of views already. I added myself to that number, watching, pausing, rewinding, quietly wondering how many of those viewers had actually built an agent that worked in the real world, and whether the world was already overflowing with AI chatbots long before I had even written my first serious line of code.
Approach #1: OpenAI API (And the First Crack in the Illusion)

If you have watched any of those million-view videos, you already know how effortless it looks — asking a question to OpenAI using a few lines of Python and receiving an answer from what appears to be an all-knowing machine. I did exactly that. On top of it, I wired up an ugly but functional chat box directly inside an ERPNext screen. It wasn’t elegant, it wasn’t proud work, but it worked. Questions went out through the API, answers came back. For a brief moment, it felt like progress. Like I had crossed some invisible line.
But that moment didn’t last long — only until I stopped and asked myself what I was actually achieving here.
What I did was simple. I asked a question like, “How much tomato did we sell last month?” The chatbot sent an API call to OpenAI to generate ORM/SQL (sorry — this is where the jargon begins), and I executed that SQL locally to get a result. Cool, right?
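For the curious, here is a minimal sketch of that first wiring, reconstructed from memory rather than copied from the project. It assumes the official OpenAI Python SDK and a local MariaDB connection; the model name, credentials, and prompt are placeholders, not what we actually ran.

```python
# Minimal sketch of Approach #1 (illustrative; reconstructed, not our
# production code). Assumes the official OpenAI Python SDK and a local
# MariaDB connection; model name and credentials are placeholders.
from openai import OpenAI
import pymysql

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str):
    # Ask the model to guess the SQL. It has never seen the real schema.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Translate the user's question into a single MariaDB "
                        "SELECT statement for an ERPNext database. Return SQL only."},
            {"role": "user", "content": question},
        ],
    )
    sql = resp.choices[0].message.content.strip()

    # Execute whatever came back against the local ERPNext database.
    conn = pymysql.connect(host="localhost", user="erpnext",
                           password="secret", database="erpnext")
    with conn.cursor() as cur:
        cur.execute(sql)  # usually fails, or worse, succeeds with the wrong answer
        return cur.fetchall()

print(answer("How much tomato did we sell last month?"))
```

A dozen lines, and it really did return answers.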
What was actually happening underneath was far less impressive.
First, to generate correct SQL, OpenAI needed to know the exact table names, field names, and joins. It guessed correctly maybe once in a thousand attempts. The remaining 999 times, it confidently produced something that looked valid, sounded intelligent, and was completely wrong.
(By the way, hallucination is the word all those million-view videos taught me to use for this phenomenon.)
Second, we had custom apps and custom fields. The table names were wrong. The field names were wrong. The joins were imaginary. Needless to say, the generated queries had no relationship with the actual ERPNext database.
Third — and this was the hardest truth — without knowing master data, accuracy was fundamentally impossible. What does tomato mean in this system? Is it an item name? A variant? A translated label? An item group? What is its actual item code? Without that context, there was simply no path beyond 5% accuracy.
Sending the entire master data was not an option. This client alone had more than 10,000 item records. Even sending a partial dataset was rejected outright. The business owner didn’t trust AI companies. He was convinced — perhaps not entirely wrongly — that his data would either leak, be reused, or quietly end up training someone else’s model.
I remember trying to explain safeguards and policies, but he stopped me halfway.
“I told you,” he said calmly, “I read enough about this stuff.”
Yes. The guy knew too much.
Another, quieter reason for failure revealed itself over time: context decay. Each question lived in isolation. The AI had no durable memory of business rules, fiscal calendars, inventory logic, regional nuances, or the thousand small assumptions that live only in people’s heads. Every prompt was a fresh amnesia. What felt like intelligence was really just polite guessing, wrapped in confident language. And in a business system like ERPNext — where meaning is layered, historical, and painfully specific — guessing is not just wrong, it’s dangerous.
This approach failed — completely and irreversibly. There was no room for iteration, no clever prompt that could save it, and no architectural tweak that could push it into something reliable. The foundation itself was wrong. At that point, it became clear that continuing down this path would only produce more demos, more illusions, and more disappointment.
We had to abandon it entirely and look for a fundamentally different approach.
Before jumping into Approach #2, I should admit something honestly. This method did give us some results — enough to keep the illusion alive for a while. We could translate item names. We could ask the system to add descriptions to project tasks. We could even generate compatibility matrices for vehicle spare parts. These were desired outcomes, the kind that look impressive in demos and screenshots.
But they worked precisely because they didn’t touch the dangerous core of ERPNext — the data, the joins, the business logic, the things that actually matter. These were side quests. Helpful, yes. But not the mission we were hired for.
Approach #2: MCP-Based Integrations (Where Accuracy Was Promised but Never Delivered)

Once Approach #1 failed, you can safely assume I did what everyone else does — I went back to the same YouTubers, the same AI chatbots, the same overconfident tutorials, looking for a newer, smarter, more “enterprise-ready” solution, one that promised far better accuracy this time.
They all pointed me in the same direction: Claude, n8n, and something proudly called MCP.
(Not that MCP — the kind the world already has an embarrassing surplus of. This MCP was supposedly different.)
According to them, MCP was the missing piece. A magical bridge. A structured way to let the AI “understand” your system without really exposing it. Agent talks to tool, tool talks to system, system whispers truth back to the agent. Clean. Elegant. Very convincing — especially when explained with diagrams and calm voices over dark-themed terminals.
So naturally, I believed them.
So, if you haven’t got it yet: n8n (and Claude, depending on how people wire it) is a workflow-automation / orchestration platform — or whatever label makes you feel less guilty about the number of arrows in your diagram. The idea is simple on paper. Your ERPNext “chat window” doesn’t talk to the database directly. Instead, it sends the question to n8n (or to a Claude-driven workflow), and that workflow sends SQL back to your ERP server through a connector — MCP — and then returns an answer. And yes, ERPNext already has an MCP.
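For concreteness, here is a hypothetical sketch of the kind of ERP-side endpoint such a connector ultimately needs: a whitelisted Frappe method that runs whatever SQL arrives. The function name and the SELECT-only guard below are mine, invented for illustration, not the actual ERPNext MCP implementation, which exposes richer, more structured tools.

```python
# Hypothetical ERP-side endpoint (illustrative only; the function name and
# guard are mine, not the actual ERPNext MCP implementation).
import frappe

@frappe.whitelist()
def run_readonly_query(sql: str):
    # The only guardrail most tutorials even mention: allow bare SELECTs only.
    if not sql.strip().lower().startswith("select"):
        frappe.throw("Read-only access: SELECT statements only")
    # Executes whatever the external workflow generated -- which is exactly
    # what no client was willing to allow.
    return frappe.db.sql(sql, as_dict=True)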
The idea was rejected almost immediately. No serious business was willing to open its database to an external AI tool, regardless of how many security guarantees were promised on paper. There was a deep, almost instinctive skepticism toward AI companies — they steal data, they train on it, they reuse it, they leak it someday. Every client I spoke to echoed the same fear, sometimes softly, sometimes bluntly, but always firmly.
And honestly, I couldn’t blame them. These were not paranoid people. These were owners who had spent decades building their businesses, guarding their customer lists, pricing logic, and operational data like family heirlooms. Asking them to “just trust” an AI platform felt naïve in hindsight.
Still, I tried.
But the accuracy never really improved. There were methods to export master tables into the platform, but again, no one was willing to do that. And even if they had agreed, keeping the data in sync — frequent updates, constant re-exports, fine-tuning every few minutes — was simply not practical.
Custom apps and custom fields barely worked. Hallucination piled on top of hallucination, until it became clear that the system was guessing more than it was understanding.
Yes, we managed to get a few things working. Creating a basic report. Changing a logo on the screen. Small, cosmetic wins that looked impressive in demos. But when it came to real usage, there were no takers.
Approach #3: Agent-to-Agent Architectures (The Most Elegant Failure)

This one looked beautiful on paper. Almost poetic. Instead of trusting a single AI to magically know everything, we decided to let agents talk to each other. One agent lived outside — inside n8n — handling the conversation and intent. Another agent lived inside ERPNext — closer to the database, the schema, the ugly realities.
The flow sounded intelligent. A user asks a question like, “Who hasn’t paid last month?” The n8n agent tries to form an SQL query, then pauses and asks the ERPNext agent: “Which table holds payments? Which field represents outstanding?” The internal agent responds with corrections, table names, joins. The SQL gets refined. In theory, accuracy should improve.
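Stripped of the network plumbing, the loop looked roughly like this. Everything below is a toy reconstruction: both agents are stubbed as plain functions, where in the real build each was an LLM call on its own side of the boundary, and all the names are hypothetical.

```python
# Toy, self-contained reconstruction of the two-agent loop. Both "agents"
# are stubbed as plain functions; in the real build each was an LLM call
# on its own side of the network boundary. All names are hypothetical.
SCHEMA = {"tabSales Invoice": ["customer", "outstanding_amount", "posting_date"]}

def internal_agent_review(sql: str) -> list[str]:
    """ERPNext-side agent: if the draft references unknown tables,
    answer with the tables that actually exist."""
    if all(f"`{table}`" in sql for table in SCHEMA):
        return []
    return list(SCHEMA)

def external_agent_draft(question: str, corrections: list[str] | None = None) -> str:
    """n8n-side agent (stubbed): draft SQL, or redraft using corrections."""
    if corrections:
        table = corrections[0]
        return (f"SELECT customer, outstanding_amount FROM `{table}` "
                f"WHERE outstanding_amount > 0")
    return "SELECT * FROM tabPayments WHERE paid = 0"  # confident hallucination

def answer(question: str, max_rounds: int = 3) -> str:
    sql = external_agent_draft(question)
    for _ in range(max_rounds):
        corrections = internal_agent_review(sql)
        if not corrections:
            break
        sql = external_agent_draft(question, corrections)
    return sql  # with real LLMs: often two agents agreeing on a wrong answer

print(answer("Who hasn't paid last month?"))
```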
In reality, what we built was a slow, fragile conversation between two systems that barely trusted each other. Latency grew. Context got lost. Small misunderstandings multiplied. One agent hallucinated confidently, the other tried to be helpful but lacked full context. Instead of one wrong answer, we now had two agents politely agreeing on the wrong one.
But again, reality intervened. Businesses were uncomfortable with even an external agent talking to an internal one. Different name, same fear. External access is still external access. The mistrust didn’t change just because the architecture was clever. And with that, this approach — perhaps the most intellectually satisfying of them all — met the same quiet end as the others.
Elegant. Logical. Unusable.
Approach #4: A Locally Orchestrated NLP Pipeline (Where We Almost Made It)

This was the first time things started to feel… serious. No external AI vendors. No black-box APIs. Just open-source models, fine-tuned on Hugging Face and run on Replicate. We explained this method in another blog post: https://cloud.erpgulf.com/blog/blogs/changai-ai-for-erpnext
The first stage used RoBERTa, trained to answer only one question: what is the user actually talking about? Not SQL. Not reports. Just intent — Sales Invoice, Purchase Order, Stock Entry, Project, Task. It worked surprisingly well. Once the DocType was identified, sBERT stepped in, quietly narrowing down which fields were relevant in that context. Not everything — just the few that might matter. Then FLAN-T5 took over, doing the delicate work of extracting exact fields and shaping them into something that could eventually become a real Frappe query. No guessing joins across the universe. No hallucinated columns. Just incremental narrowing, each stage reducing the uncertainty of the previous one. Finally, the query was executed locally inside ERPNext, where truth still lives — in tables, rows, and actual data.
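For a sense of the shape, here is a compressed sketch of the three stages, with generic public checkpoints standing in for our fine-tuned models. The field list and prompt format are illustrative, not the production setup.

```python
# Compressed sketch of the three-stage pipeline. Checkpoint names, the field
# list, and the prompt format are illustrative stand-ins for our fine-tuned
# models, not the production setup.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Stage 1 -- intent: which DocType is the user talking about?
# In production this was a RoBERTa classifier fine-tuned on labelled ERP
# questions; an off-the-shelf checkpoint won't know these labels, so the
# result is hard-coded here.
doctype = "Sales Invoice"

# Stage 2 -- sBERT narrows the candidate fields by semantic similarity.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
FIELDS = ["customer", "posting_date", "grand_total", "item_code", "qty"]

def narrow_fields(question: str, top_k: int = 3) -> list[str]:
    scores = util.cos_sim(sbert.encode(question), sbert.encode(FIELDS))[0]
    ranked = sorted(zip(FIELDS, scores.tolist()), key=lambda p: -p[1])
    return [field for field, _ in ranked[:top_k]]

# Stage 3 -- FLAN-T5 extracts concrete values into a query skeleton.
extractor = pipeline("text2text-generation", model="google/flan-t5-base")

question = "How much tomato did we sell last month?"
fields = narrow_fields(question)
prompt = (f"DocType: {doctype}. Fields: {', '.join(fields)}. "
          f"Extract the filters implied by: {question}")
print(extractor(prompt, max_new_tokens=64)[0]["generated_text"])
```

Each stage only shrinks the search space for the next one; none of them is asked to imagine the whole system at once.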
It was good. Accuracy improved significantly, and for the first time, everyone felt genuinely hopeful that this approach might work. But we couldn’t push accuracy beyond 40%. And for serious businesses, that number is still far too low.
Approach #5: Retrieval Before Reasoning (Where Accuracy Finally Began to Bend)

This was the first time we stopped asking the model to imagine the system, and instead forced it to remember it. We broke the database schema into small, opinionated “cards” — field cards, table and join cards, master-entity cards — each carrying just enough meaning: names, synonyms, descriptions, roles. These weren’t prompts. They were facts. We embedded them, stored them in a local FAISS vector store, and retrieved only what mattered for a given question. No full schema dumps. No blind guessing. Just selective recall.
When a user asked a question, the model no longer stared into the void. It first retrieved relevant tables, fields, joins, and master entities, then injected only those into the prompt. Reasoning came after grounding. SQL generation came after context. For the first time, hallucinations dropped noticeably. Accuracy crossed 70%. Not perfect. Not safe enough yet. But enough to feel the direction shift — from forcing intelligence, to constraining it.
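To make that concrete, here is a minimal sketch of the retrieval step, assuming sentence-transformers for embeddings and a flat FAISS index. The card texts and their format below are invented for illustration; the real cards were richer and generated from the schema.

```python
# Minimal sketch of the "cards" retrieval step: sentence-transformers for
# embeddings, a flat FAISS index for recall. Card texts are invented for
# illustration; the real cards were richer and generated from the schema.
import faiss
from sentence_transformers import SentenceTransformer

CARDS = [
    "table: tabSales Invoice | sales, billing, revenue | one row per customer invoice",
    "field: tabSales Invoice.outstanding_amount | unpaid, due, balance",
    "join: tabSales Invoice Item.parent -> tabSales Invoice.name",
    "master: Item 'Tomato - Fresh' | code ITM-0042 | group Vegetables",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(CARDS).astype("float32")
faiss.normalize_L2(emb)                     # cosine similarity via inner product
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

def retrieve(question: str, k: int = 3) -> list[str]:
    q = model.encode([question]).astype("float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [CARDS[i] for i in ids[0]]

question = "How much tomato did we sell last month?"
context = "\n".join(retrieve(question))
prompt = f"Schema facts:\n{context}\n\nWrite SQL for: {question}"
print(prompt)  # grounding first; SQL generation happens only after this
```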
For the first time in this journey, failure stopped feeling inevitable.
After all these failures, false starts, and architectures that looked beautiful on whiteboards but collapsed under real business pressure, we finally found ourselves standing on something solid. Not perfect. Not magical. But acceptable — to the only people who actually matter in this story: serious ERP users. The accuracy crossed 90%, not by adding more “intelligence,” but by removing bravado. By keeping everything local. By respecting the fact that business data is not training fuel, and trust is not something you negotiate with a privacy policy PDF.
What will emerge will not be a flashy chatbot, but a quiet one. One that will answer only when it knows. One that will rely on real schema, real masters, real constraints. No external access. No data leaving the system. No hallucinated confidence. And for the first time in this long journey, users will stop double-checking every answer. They will pause. They will nod. They will trust it. And that — more than any benchmark or demo — will have been the real destination all along.
After I shared this article, many readers mentioned that it lacks deep technical details — especially around the final approach. That feedback is fair. This piece was never meant to be a technical deep dive; it was meant to document the journey, the failures, and the reasoning behind them. The detailed technical breakdown of the final approach is currently being prepared by my colleagues who are working hands-on with the project. A full technical blog and an explainer video will be published soon as part of the beta release. Please follow the ERPGulf LinkedIn page ( https://www.linkedin.com/company/erpgulf/ ) for updates.
While this narrative is written in the first person, the “I” in the final approach is actually a “We.” This journey would not have reached its destination without the brilliant team beside me: Hyrin, our Product Manager, who kept the vision steady; Rishikesh, who contributed with relentless curiosity; Amandeep and Ayisha, the engineers who patiently turned failed experiments into working code; and Raifa, our Data Builder, who meticulously laid the foundation our model finally learned to trust.

