How We Built a Local AI Assistant for React Apps with Ollama

Most AI assistants work the same way: you sign up, pay per seat, paste a script tag, and send your users' conversations to someone else's server. It works, but you're renting someone else's AI and giving up your data in the process.

We wanted something different: an assistant that lives inside your app, knows your app, and runs entirely on your own infrastructure. No cloud. No subscriptions. No data leaving your machine.

So we built Lume - an open-source npm package that injects a context-aware AI bubble into any React app, powered by Ollama running locally.

What it looks like

Two steps. That's the integration:

npm install @ovt2/lume

import { AssistantWidget } from '@ovt2/lume'

function App() {
  return (
    <AssistantWidget
      model="gemma3"
      systemPrompt="You are a support assistant for Acme."
      context={{ currentPage, userPlan }}
      knowledgeBase={docs}
    />
  )
}

A floating bubble appears in the corner of your app. Click it - a chat panel slides up. The assistant already knows what page the user is on. Ask it a question about your product -- it pulls the answer from your docs. No training. No fine-tuning. No backend.

The Ollama discovery

Before we talk architecture, we need to talk about Ollama -- because building with it was genuinely one of the most fun engineering experiences we've had in a while.

Ollama is a runtime that lets you run open-weight LLMs locally. You install it, pull a model, and you have your own AI server running at localhost:11434. That's it.

brew install ollama
ollama pull gemma3
ollama serve

The moment we hit that endpoint for the first time, something clicked. No API key. No rate limits. No cold start waiting on a remote server. No usage bill at the end of the month. Just a model running on the machine in front of us, responding in real time, with zero data leaving the room.

Running a model locally felt like a superpower we didn't know we had. It changed how we thought about what was possible to build.

The API is as clean as any cloud provider's:

POST http://localhost:11434/api/chat

{
  "model": "gemma3",
  "stream": true,
  "messages": [
    { "role": "system", "content": "..." },
    { "role": "user",   "content": "..." }
  ]
}

With stream: true, tokens arrive as newline-delimited JSON chunks. Real-time streaming output, no infrastructure complexity. The same UX as ChatGPT, running entirely offline.
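To make the streaming format concrete, here is a minimal sketch of reading those newline-delimited chunks in the browser. The helper names createNdjsonParser and streamChat are ours for illustration and are not part of Lume's API; the sketch assumes each streamed line carries a message.content fragment and a done flag, and that every line ends with a newline.

```typescript
// Shape of one streamed line from /api/chat (only the fields used here).
interface ChatChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Returns a function that accepts raw text fragments and returns the parsed
// objects for every complete line seen so far. A partial trailing line is
// buffered until the rest of it arrives in a later fragment.
function createNdjsonParser<T>(): (fragment: string) => T[] {
  let buffer = "";
  return (fragment: string): T[] => {
    buffer += fragment;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last element is an incomplete line (or "")
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as T);
  };
}

// Sends a chat request and calls onToken for each piece of generated text.
async function streamChat(
  messages: { role: string; content: string }[],
  onToken: (text: string) => void,
  model = "gemma3",
  baseUrl = "http://localhost:11434",
): Promise<void> {
  const res = await fetch(`${baseUrl}/api/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, stream: true, messages }),
  });
  if (!res.ok || !res.body) {
    throw new Error(`Ollama request failed with status ${res.status}`);
  }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  const parse = createNdjsonParser<ChatChunk>();

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // Network chunks do not align with line boundaries, so the parser buffers.
    for (const chunk of parse(decoder.decode(value, { stream: true }))) {
      if (chunk.message?.content) onToken(chunk.message.content);
      if (chunk.done) return;
    }
  }
}
```

Calling streamChat(messages, (t) => setReply((prev) => prev + t)) from a component is enough to render tokens as they arrive; the buffering step matters because a single JSON line can be split across two network reads.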

How the assistant understands your app

The assistant doesn't get trained on your app. It gets told about your app -- at query time, fresh on every request.

The system prompt is assembled from three layers before each call:

Layer 1 -- Your persona and base instructions. This is what you pass in systemPrompt. It sets the assistant's identity, tone, and scope.

Layer 2 -- Live context. This is whatever you pass in the context prop: page name, user plan, feature flags, error state, anything. The assistant reads this before every response.

<AssistantWidget
  context={{
    currentPage: 'Billing',
    userPlan: 'Pro',
    lastError: null,
  }}
/>

Layer 3 -- RAG from your knowledge base. You pass your docs as plain text chunks. Before every query, Lume scores all chunks against the user's question by keyword frequency and injects the most relevant ones into the prompt.

const docs = [
  {
    title: 'Billing and invoices',
    content: 'Your subscription renews automatically on the billing date...',
  },
  {
    title: 'Cancellation policy',
    content: 'You can cancel anytime from Settings → Billing...',
  },
]

<AssistantWidget knowledgeBase={docs} />

No embeddings. No vector database. No Python service running in the background. Just keyword scoring in pure JavaScript, fast enough to run synchronously before each request.
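As a rough illustration of how the three layers could be combined into one system prompt, here is a small sketch. The function name buildSystemPrompt, the section headings, and the ordering are assumptions made for this example and are not taken from Lume's source.

```typescript
type AppContext = Record<string, unknown>;

// Combines the three layers into a single system prompt string.
// Headings and layout are illustrative only.
function buildSystemPrompt(
  basePrompt: string,
  context: AppContext,
  retrievedDocs: string,
): string {
  // Layer 1: persona and base instructions.
  const sections: string[] = [basePrompt.trim()];

  // Layer 2: live context, skipping keys that are null or undefined.
  const contextLines = Object.entries(context)
    .filter(([, value]) => value !== null && value !== undefined)
    .map(([key, value]) =>
      `- ${key}: ${typeof value === "string" ? value : JSON.stringify(value)}`,
    );
  if (contextLines.length > 0) {
    sections.push(`Current app state:\n${contextLines.join("\n")}`);
  }

  // Layer 3: chunks retrieved from the knowledge base for this query.
  if (retrievedDocs.trim() !== "") {
    sections.push(`Relevant documentation:\n${retrievedDocs.trim()}`);
  }

  return sections.join("\n\n");
}

const assembledPrompt = buildSystemPrompt(
  "You are a support assistant for Acme.",
  { currentPage: "Billing", userPlan: "Pro", lastError: null },
  "## Billing and invoices\nYour subscription renews automatically on the billing date...",
);
```

Because the prompt is rebuilt on every request, a change to the context prop or the knowledge base takes effect on the very next message with no retraining step.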

Pushing live context

The context prop covers static state -- but apps are dynamic. A user hits an error, a payment fails, a background job finishes. You want the assistant to know.

That's what pushContext is for:

const assistantRef = useRef<AssistantHandle>(null)

// Anywhere in your app — error boundary, event handler, anywhere
assistantRef.current?.pushContext({
  type: 'error',
  message: 'Payment failed: card declined',
  code: 'card_declined',
})

<AssistantWidget ref={assistantRef} ... />

The next time the user opens the panel and types anything — even just "help" — the assistant already knows what happened. You push context silently and let the user reach out at their own pace.
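One way to picture the mechanism is a small buffer that collects pushed events and renders them as extra prompt text on the next request. The sketch below is a guess at the general shape, not Lume's implementation: ContextBuffer, its maxEvents cap, and the "Recent events" heading are all hypothetical.

```typescript
interface ContextEvent {
  type: string;
  message: string;
  [extra: string]: unknown; // e.g. an error code
}

// Holds the most recent pushed events until the next chat request.
class ContextBuffer {
  private events: ContextEvent[] = [];
  private readonly maxEvents: number;

  constructor(maxEvents = 10) {
    this.maxEvents = maxEvents;
  }

  // Called from anywhere in the app; nothing is sent to the model yet.
  push(event: ContextEvent): void {
    this.events.push(event);
    // Keep only the newest events so the prompt stays small.
    if (this.events.length > this.maxEvents) {
      this.events = this.events.slice(-this.maxEvents);
    }
  }

  // Renders buffered events as a prompt section, or "" if there are none.
  toPromptSection(): string {
    if (this.events.length === 0) return "";
    const lines = this.events.map((e) => `- [${e.type}] ${e.message}`);
    return `Recent events:\n${lines.join("\n")}`;
  }
}
```

The cap is a design choice worth keeping in any variant: without it, a noisy error loop could crowd the persona and docs out of the model's context window.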

The RAG implementation

Here's the full scoring function — no magic, no dependencies:

function scoreChunk(chunk, query) {
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter(t => t.length > 3);

  const text = chunk.toLowerCase();

  return terms.reduce((score, term) => {
    const matches = (text.match(new RegExp(term, 'g')) || []).length;
    return score + matches;
  }, 0);
}

export function retrieveChunks(query, knowledgeBase, topN = 4) {
  return knowledgeBase
    .map(chunk => ({ ...chunk, score: scoreChunk(chunk.content, query) }))
    .sort((a, b) => b.score - a.score)
    .filter(c => c.score > 0)
    .slice(0, topN)
    .map(c => `## ${c.title}\n${c.content}`)
    .join('\n\n');
}

The retrieved chunks get appended to the system prompt before every request. The model sees them as part of its instructions — not conversation history.

Is it as precise as cosine similarity over dense embeddings? No. Is it fast, zero-dependency, and accurate enough for docs a developer manually curated? Absolutely.
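To see how the scoring behaves on real input, here is a short worked example. It restates the same keyword-frequency idea with types so it runs on its own; rankDocs is a hypothetical helper added only to expose the scores.

```typescript
interface Doc {
  title: string;
  content: string;
}

// Same keyword-frequency scoring as above, restated with types.
function scoreChunk(text: string, query: string): number {
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 3); // drops short words such as "how" or "my"
  const haystack = text.toLowerCase();
  return terms.reduce(
    (score, term) => score + (haystack.match(new RegExp(term, "g")) ?? []).length,
    0,
  );
}

// Returns each title with its score, highest first, for inspection.
function rankDocs(query: string, docs: Doc[]): { title: string; score: number }[] {
  return docs
    .map((d) => ({ title: d.title, score: scoreChunk(d.content, query) }))
    .sort((a, b) => b.score - a.score);
}

const docs: Doc[] = [
  {
    title: "Billing and invoices",
    content: "Your subscription renews automatically on the billing date...",
  },
  {
    title: "Cancellation policy",
    content: "You can cancel anytime from Settings → Billing...",
  },
];

const ranked = rankDocs("When does billing renew?", docs);
// "billing" and "renew" each match once in the first doc (score 2);
// only "billing" matches in the second doc (score 1).
```

Two properties are visible here. Matching is by substring, so "renew" hits "renews", and only the content is scored, so a keyword that appears solely in a title contributes nothing. Words like "when" and "does" pass the length filter but simply score zero when absent.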

What we learned

The system prompt is the product

The quality of the assistant experience has almost nothing to do with which model you use and almost everything to do with how well you construct the context. A mediocre model with a well-loaded prompt beats a great model with a vague one every time. Invest in your context layers.

Streaming is the baseline, not a feature

Users have been trained by ChatGPT. A chat widget that shows a spinner for 8 seconds before dumping a full response feels broken, even if the answer is good. First token in under a second is the expectation now. Implement streaming from day one.

Local AI removes the "should I even build this" question

With a cloud AI, there's always a calculation: is this feature worth the per-token cost? Is this data safe to send to a third party? Does this use case justify the latency?

With Ollama running locally, all of those questions disappear. We prototyped features we'd never have tried with a cloud API because there was no cost and no data risk. That creative freedom is genuinely underrated — and it's the core reason Lume is built the way it is.

Try it

npm install @ovt2/lume

import { AssistantWidget } from '@ovt2/lume'

<AssistantWidget
  model="gemma3"
  systemPrompt="You are a helpful assistant for this app."
/>

You need Ollama running locally (ollama serve) and a model pulled. Everything else is in the package.
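If the bubble stays silent, the usual cause is that Ollama is not running or the model has not been pulled. A quick preflight check can tell the two apart. This sketch queries Ollama's /api/tags endpoint, which lists locally available models; the helper names hasModel and checkOllama are ours and are not part of Lume.

```typescript
// Only the field we need from the /api/tags response.
interface TagsResponse {
  models: { name: string }[];
}

// True if the model is present, with or without an explicit tag
// (for example "gemma3" matches "gemma3:latest").
function hasModel(tags: TagsResponse, model: string): boolean {
  return tags.models.some(
    (m) => m.name === model || m.name.startsWith(`${model}:`),
  );
}

type OllamaStatus = "ok" | "missing-model" | "unreachable";

// Reports whether Ollama is reachable and whether the model is pulled.
async function checkOllama(
  model = "gemma3",
  baseUrl = "http://localhost:11434",
): Promise<OllamaStatus> {
  try {
    const res = await fetch(`${baseUrl}/api/tags`);
    if (!res.ok) return "unreachable";
    const tags = (await res.json()) as TagsResponse;
    return hasModel(tags, model) ? "ok" : "missing-model";
  } catch {
    // A network error most likely means "ollama serve" is not running.
    return "unreachable";
  }
}
```

A result of "missing-model" points to running ollama pull, while "unreachable" points to starting ollama serve.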

📦 npm → npmjs.com/package/@ovt2/lume
⭐ GitHub → github.com/ovt2/lume

v0.1.5 — a lot more coming. Feedback welcome.
