We Built a Self-Hosted AI Assistant You Can Drop Into Any React App
No cloud. No API keys. No per-seat pricing. Just 3 lines of code and a floating bubble that actually understands your app — powered by Ollama running on your machine.

Hi, my name is Oussama, and I am a self-taught full-stack JavaScript developer with an interest in computers. I like to expand my knowledge and learn new things each day because I always see the beauty in mystery.
Most AI assistants work the same way: you sign up, pay per seat, paste a script tag, and send your users' conversations to someone else's server. It works, but you're renting someone else's AI and giving up your data in the process.
We wanted something different: an assistant that lives inside your app, knows your app, and runs entirely on your own infrastructure. No cloud. No subscriptions. No data leaving your machine.
So we built Lume, an open-source npm package that injects a context-aware AI bubble into any React app, powered by Ollama running locally.
What it looks like
Two steps. That's the integration:
npm install @ovt2/lume

import { AssistantWidget } from '@ovt2/lume'

function App() {
  return (
    <AssistantWidget
      model="gemma3"
      systemPrompt="You are a support assistant for Acme."
      context={{ currentPage, userPlan }}
      knowledgeBase={docs}
    />
  )
}
A floating bubble appears in the corner of your app. Click it and a chat panel slides up. The assistant already knows what page the user is on. Ask it a question about your product and it pulls the answer from your docs. No training. No fine-tuning. No backend.
The Ollama discovery
Before we talk architecture, we need to talk about Ollama, because building with it was genuinely one of the most fun engineering experiences we've had in a while.
Ollama is a runtime that lets you run open-weight LLMs locally. You install it, pull a model, and you have your own AI server running at localhost:11434. That's it.
brew install ollama
ollama serve          # start the server and keep it running
ollama pull gemma3    # in a second terminal: download the model
The moment we hit that endpoint for the first time, something clicked. No API key. No rate limits. No cold start waiting on a remote server. No usage bill at the end of the month. Just a model running on the machine in front of us, responding in real time, with zero data leaving the room.
Running a model locally felt like a superpower we didn't know we had. It changed how we thought about what was possible to build.
The API is as clean as any cloud provider's:
POST http://localhost:11434/api/chat
{
  "model": "gemma3",
  "stream": true,
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user", "content": "..." }
  ]
}
With stream: true, tokens arrive as newline-delimited JSON chunks. Real-time streaming output, no infrastructure complexity. The same UX as ChatGPT, running entirely offline.
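To make the streaming format concrete, here is a minimal sketch of reading those newline-delimited chunks in TypeScript. The helper names (createNdjsonParser, streamChat) are ours for illustration and are not part of Lume's public API; the sketch also assumes each streamed line carries a message.content field and a done flag, and only models those two fields.

```typescript
// Shape of one streamed line from /api/chat (only the fields this sketch uses).
interface ChatChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Hypothetical helper: buffers raw text and returns every complete JSON line.
// A network chunk can end mid-line, so the trailing partial line is kept
// in the buffer until the rest of it arrives.
function createNdjsonParser(): (text: string) => ChatChunk[] {
  let buffer = "";
  return (text: string): ChatChunk[] => {
    buffer += text;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last element is an incomplete line (or "")
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as ChatChunk);
  };
}

// Hypothetical helper: streams a chat response and calls onToken per text piece.
async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (token: string) => void,
  model = "gemma3",
): Promise<void> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, stream: true, messages }),
  });
  if (!res.ok || !res.body) {
    throw new Error(`Ollama request failed with status ${res.status}`);
  }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  const parse = createNdjsonParser();
  const emit = (text: string): void => {
    for (const chunk of parse(text)) {
      if (chunk.message) onToken(chunk.message.content);
    }
  };

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    emit(decoder.decode(value, { stream: true }));
  }
  emit("\n"); // flush a final line that arrived without a trailing newline
}
```

In a React component, onToken would typically append each token to the message being rendered, which is what produces the typing effect.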
How the assistant understands your app
The assistant doesn't get trained on your app. It gets told about your app -- at query time, fresh on every request.
The system prompt is assembled from three layers before each call:
Layer 1 -- Your persona and base instructions
What you pass in systemPrompt. It sets the assistant's identity, tone, and scope.
Layer 2 -- Live context
Whatever you pass in the context prop: page name, user plan, feature flags, error state, anything. The assistant reads this before every response.
<AssistantWidget
  context={{
    currentPage: 'Billing',
    userPlan: 'Pro',
    lastError: null,
  }}
/>
Layer 3 -- RAG from your knowledge base
You pass your docs as plain text chunks. Before every query, Lume scores all chunks against the user's question by keyword frequency and injects the most relevant ones into the prompt.
const docs = [
  {
    title: 'Billing and invoices',
    content: 'Your subscription renews automatically on the billing date...',
  },
  {
    title: 'Cancellation policy',
    content: 'You can cancel anytime from Settings → Billing...',
  },
]

<AssistantWidget knowledgeBase={docs} />
No embeddings. No vector database. No Python service running in the background. Just keyword scoring in pure JavaScript, fast enough to run synchronously before each request.
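To picture how the three layers might come together, here is a rough sketch of a prompt builder. The function name buildSystemPrompt, the section labels ("Current app context", "Relevant documentation"), and the exact formatting are assumptions made for illustration; Lume's internal prompt layout may well differ.

```typescript
type AppContext = Record<string, unknown>;

// Hypothetical helper: joins the three layers into one system prompt string.
function buildSystemPrompt(
  basePrompt: string,    // Layer 1: persona and base instructions
  context: AppContext,   // Layer 2: live app context
  retrievedDocs: string, // Layer 3: text of the top-scoring doc chunks
): string {
  const sections: string[] = [basePrompt.trim()];

  const entries = Object.entries(context);
  if (entries.length > 0) {
    // JSON.stringify keeps strings quoted and renders null explicitly.
    const lines = entries.map(([key, value]) => `- ${key}: ${JSON.stringify(value)}`);
    sections.push(`Current app context:\n${lines.join("\n")}`);
  }

  if (retrievedDocs.trim() !== "") {
    sections.push(`Relevant documentation:\n${retrievedDocs.trim()}`);
  }

  // Empty layers are skipped, so a bare persona prompt passes through unchanged.
  return sections.join("\n\n");
}

const prompt = buildSystemPrompt(
  "You are a support assistant for Acme.",
  { currentPage: "Billing", userPlan: "Pro", lastError: null },
  "## Cancellation policy\nYou can cancel anytime from Settings → Billing...",
);
console.log(prompt);
```

Because the prompt is rebuilt on every request, a change to the context prop or to the retrieved chunks shows up in the very next answer without any retraining.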
Pushing live context
The context prop covers static state -- but apps are dynamic. A user hits an error, a payment fails, a background job finishes. You want the assistant to know.
That's what pushContext is for:
const assistantRef = useRef<AssistantHandle>(null)

// Anywhere in your app: error boundary, event handler, anywhere
assistantRef.current?.pushContext({
  type: 'error',
  message: 'Payment failed: card declined',
  code: 'card_declined',
})

<AssistantWidget ref={assistantRef} ... />
The next time the user opens the panel and types anything — even just "help" — the assistant already knows what happened. You push context silently and let the user reach out at their own pace.
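One way to think about this pattern is as a small event buffer that fills up silently and is read when the next message goes out. The sketch below is purely illustrative: createContextBuffer, the maxEvents cap, and the choice to clear events after one request are all assumptions of ours, not a description of how Lume stores pushed context.

```typescript
type ContextEvent = Record<string, unknown>;

// Hypothetical buffer: collects pushed events until the next chat request.
function createContextBuffer(maxEvents = 10) {
  let events: ContextEvent[] = [];

  return {
    // Called from anywhere in the app; nothing is sent to the model yet.
    push(event: ContextEvent): void {
      events.push(event);
      // Keep only the most recent events so the prompt stays bounded.
      if (events.length > maxEvents) {
        events = events.slice(-maxEvents);
      }
    },

    // Called right before a request: returns prompt text and clears the buffer.
    drain(): string {
      const lines = events.map((e) => `- ${JSON.stringify(e)}`);
      events = [];
      return lines.length === 0 ? "" : `Recent events:\n${lines.join("\n")}`;
    },
  };
}

const buffer = createContextBuffer();
buffer.push({
  type: "error",
  message: "Payment failed: card declined",
  code: "card_declined",
});
console.log(buffer.drain());
```

Clearing on drain is a design choice: it avoids repeating stale errors in later prompts, at the cost of relying on conversation history to remember them.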
The RAG implementation
Here's the full scoring function — no magic, no dependencies:
function scoreChunk(chunk, query) {
  // Keep only query terms longer than three characters to skip most filler words
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter(t => t.length > 3);
  const text = chunk.toLowerCase();
  // Score = total number of times any query term occurs in the chunk text
  return terms.reduce((score, term) => {
    const matches = (text.match(new RegExp(term, 'g')) || []).length;
    return score + matches;
  }, 0);
}

export function retrieveChunks(query, knowledgeBase, topN = 4) {
  return knowledgeBase
    .map(chunk => ({ ...chunk, score: scoreChunk(chunk.content, query) }))
    .sort((a, b) => b.score - a.score)
    .filter(c => c.score > 0)
    .slice(0, topN)
    .map(c => `## ${c.title}\n${c.content}`)
    .join('\n\n');
}
The retrieved chunks get appended to the system prompt before every request. The model sees them as part of its instructions — not conversation history.
Is it as precise as cosine similarity over dense embeddings? No. Is it fast, zero-dependency, and accurate enough for docs a developer manually curated? Absolutely.
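As a worked example of that trade-off, the sketch below restates the two functions with TypeScript types (so it runs on its own) and applies them to the two sample docs from earlier. The DocChunk type name is ours. The second query shows one limitation to keep in mind: terms of three characters or fewer are dropped, so a question made only of short words retrieves nothing.

```typescript
interface DocChunk {
  title: string;
  content: string;
}

function scoreChunk(text: string, query: string): number {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
  const haystack = text.toLowerCase();
  // Sum of occurrences of every remaining query term in the chunk text.
  return terms.reduce(
    (score, term) => score + (haystack.match(new RegExp(term, "g")) ?? []).length,
    0,
  );
}

function retrieveChunks(query: string, knowledgeBase: DocChunk[], topN = 4): string {
  return knowledgeBase
    .map((chunk) => ({ ...chunk, score: scoreChunk(chunk.content, query) }))
    .sort((a, b) => b.score - a.score)
    .filter((c) => c.score > 0)
    .slice(0, topN)
    .map((c) => `## ${c.title}\n${c.content}`)
    .join("\n\n");
}

const docs: DocChunk[] = [
  {
    title: "Billing and invoices",
    content: "Your subscription renews automatically on the billing date...",
  },
  {
    title: "Cancellation policy",
    content: "You can cancel anytime from Settings → Billing...",
  },
];

// Terms kept: "cancel", "plan". Only the cancellation doc contains "cancel".
const hit = retrieveChunks("How do I cancel my plan?", docs);
console.log(hit);
// ## Cancellation policy
// You can cancel anytime from Settings → Billing...

// Every word here is three characters or shorter, so no terms survive the filter.
const miss = retrieveChunks("How do I pay?", docs);
console.log(miss === ""); // true
```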
What we learned
The system prompt is the product
The quality of the assistant experience has almost nothing to do with which model you use and almost everything to do with how well you construct the context. A mediocre model with a well-loaded prompt beats a great model with a vague one every time. Invest in your context layers.
Streaming is the baseline, not a feature
Users have been trained by ChatGPT. A chat widget that shows a spinner for 8 seconds before dumping a full response feels broken, even if the answer is good. First token in under a second is the expectation now. Implement streaming from day one.
Local AI removes the "should I even build this" question
With a cloud AI, there's always a calculation: is this feature worth the per-token cost? Is this data safe to send to a third party? Does this use case justify the latency?
With Ollama running locally, all of those questions disappear. We prototyped features we'd never have tried with a cloud API because there was no cost and no data risk. That creative freedom is genuinely underrated — and it's the core reason Lume is built the way it is.
Try it
npm install @ovt2/lume

import { AssistantWidget } from '@ovt2/lume'

<AssistantWidget
  model="gemma3"
  systemPrompt="You are a helpful assistant for this app."
/>
You need Ollama running locally (ollama serve) and a model pulled. Everything else is in the package.
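If the bubble stays silent, a quick way to check both prerequisites is to ask Ollama which models it has locally. The sketch below uses Ollama's GET /api/tags endpoint; the helper names (hasModel, checkOllama) and the three status strings are our own, and the code assumes a bare model name such as gemma3 refers to the :latest tag.

```typescript
// Only the field this sketch reads from the /api/tags response.
interface TagsResponse {
  models: { name: string }[];
}

// Assumption: a bare model name such as "gemma3" means "gemma3:latest".
function normalizeModelName(name: string): string {
  return name.includes(":") ? name : `${name}:latest`;
}

// Hypothetical helper: true if the requested model appears in the local list.
function hasModel(tags: TagsResponse, model: string): boolean {
  const wanted = normalizeModelName(model);
  return tags.models.some((m) => normalizeModelName(m.name) === wanted);
}

type OllamaStatus = "ok" | "model-missing" | "server-down";

// Hypothetical helper: reports whether the server is reachable and the model is pulled.
async function checkOllama(
  model: string,
  baseUrl = "http://localhost:11434",
): Promise<OllamaStatus> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    if (!res.ok) return "server-down";
    const tags = (await res.json()) as TagsResponse;
    return hasModel(tags, model) ? "ok" : "model-missing";
  } catch {
    // A rejected fetch usually means nothing is listening: is `ollama serve` running?
    return "server-down";
  }
}
```

A "model-missing" result points to running ollama pull for that model, while "server-down" points to starting ollama serve.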
📦 npm → npmjs.com/package/@ovt2/lume
⭐ GitHub → github.com/ovt2/lume
v0.1.5 — a lot more coming. Feedback welcome.



