Which model for a RAG chatbot needing to get information from a database?

Hi everyone,

I am rather new here, so please excuse basic questions or requests for clarification.

We are a small foundation and we operate an observatory providing information on political parties and their funding. We were recently granted a VPS as an in-kind donation and we are trying to use it to run a chatbot that would help answer users’ questions.

The idea is that, since the data can be complex, users could directly quiz the bot to get specific answers and, ideally, to draw charts based on this data.

For now we have AnythingLLM set up on the VPS and we are starting to play with it. Now we are trying to figure out what model we can use (with an inference endpoint, right?) to answer the queries.

Any comments and suggestions are welcome!


When the primary use case involves database lookups, correctly integrating the LLM with its surrounding components often becomes more critical than the LLM’s own granular performance.

Regarding LLM selection itself, models that are relatively new (post-2025), have 7B parameters or more (if possible), and are Instruct models (pre-fine-tuned as chatbots) generally pose few issues.
https://huggingface.co/models?num_parameters=min:3B&apps=vllm,ollama&sort=trending

This happens to be the latest model, released just a few days ago, and since it seems to excel at tool calling, it might be usable even for use cases like this one. It’s a super-compact 270M LLM, so I think it should be over 20 times faster than a 7B LLM…

Of course, its actual capabilities as a chatbot probably aren’t very high. :sweat_smile:

Thanks for the replies and links, @John6666 . A couple of follow-up questions, then:

  1. Using the search link you provided in your first message, what are the next search criteria to find the right model? Inference availability is a requirement, right?
  2. For the model you list in your second message, is there a free way to use it? I tried entering “unsloth/functiongemma-270m-it-GGUF” in AnythingLLM where the previous model’s name was, but when I try the agent, it says “400 The requested model ‘unsloth/functiongemma-270m-it-GGUF’ is not supported by any provider you have enabled.”

Oh… I see. The previously available free Inference API has effectively been discontinued and migrated to Inference Providers, so there is essentially no way to use it for free anymore. https://huggingface.co/docs/inference-providers/en/pricing

You’ll need to either run your own LLM backend server locally for inference or find a free service on another site.

You are on the right track. Starting with a VPS and AnythingLLM is smart, especially for structured stuff like political funding. Honestly, clean and well-organized data matters way more than the model itself. AI won’t make charts on its own; you just need a simple layer to turn questions into visuals. And since it’s political data, always show your sources to stay trustworthy. Later, you can explore setups like CustomGPT where the AI sticks to verified info. For now, keep it simple and clear, and just keep improving. You’ve got this.


Thanks @liam255 and sorry for the delayed reply – we were off for a bit. However, we are actually a bit stuck at the moment. Running our own LLM backend server locally for inference, as suggested by @John6666, feels like a pretty tall order (and would most likely require a GPU, wouldn’t it?). Short of access to an API, what options would be available?


Short of access to an API, what options would be available?

In such cases, I think services like the ones below would be useful, but personally, I’m not very familiar with them… :sweat_smile:


You have two main paths for “host an open-weight model behind an OpenAI-style API”:

  1. Managed inference providers (cheapest and simplest). You send tokens, they run the model.
  2. Routers / gateways (one OpenAI-compatible endpoint that can switch providers). You still pay usage, but you gain flexibility and fallbacks.

For a RAG chatbot that answers from your database, the highest-leverage requirements are:

  • Low latency for chat (streaming matters more than raw throughput).
  • Good instruction-following (so it uses retrieved context and does not hallucinate).
  • Tool or function calling if you want “ask DB with SQL” style access (optional but common).
  • Embeddings and reranking (often separate endpoints/models from the chat model).
  • Predictable cost (RAG can be prompt-heavy because you stuff retrieved passages into context).

Below are services that fit “OSS LLMs + low cost + OpenAI-compatible endpoint”.


Recommended picks for your case

If you want the simplest “works now” setup for a RAG chatbot

Hugging Face Inference Providers (router) + a cheap generator provider

  • HF gives you a single OpenAI-style endpoint, small included credits, and pay-as-you-go routing with no markup. (Hugging Face)
  • Base URL is https://huggingface.co/proxy/router.huggingface.co/v1. (Hugging Face)
  • Best when you are still experimenting and want to switch models/providers without code churn.

If you want the best interactive speed (and a real free on-ramp)

Groq

  • OpenAI-compatible base URL: https://api.groq.com/openai/v1. (GroqCloud)
  • Groq explicitly positions GroqCloud as free for developers to experiment (with rate limits). (Groq Community)
  • It is routinely cited as extremely fast in third-party benchmarking and by Groq’s own benchmark writeups (tokens/sec). (Groq)
  • Tradeoff: you may still want another provider for embeddings/reranking or specialty models.
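Since streaming is what makes chat feel fast, here is a minimal streaming sketch against Groq’s endpoint, assuming a key in `GROQ_API_KEY`; the model ID is illustrative:

```python
# Sketch: streaming chat via Groq's OpenAI-compatible endpoint.
# Assumes a key in GROQ_API_KEY; the model ID is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "In two sentences, what is party funding disclosure?"}],
    stream=True,  # tokens arrive as they are generated, so output appears quickly
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```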

If you want the lowest $/token on a strong 70B “Turbo” model

DeepInfra

  • Llama 3.3 70B Instruct Turbo shown at $0.10/M input and $0.32/M output on its model API page. (Deep Infra)
  • OpenAI-compatible chat completions endpoint is shown directly in their docs/examples. (Deep Infra)
  • Best when cost dominates and you can tolerate “normal” hyperscaler-style latency.

If you want cost controls that matter specifically for RAG (prompt-heavy)

Fireworks

  • OpenAI-compatible base URL: https://api.fireworks.ai/inference/v1. (Fireworks AI Docs)
  • Serverless pricing includes $1 free credits. (Fireworks AI)
  • Two RAG-relevant cost levers: prompt caching and batch inference, both at 50% of standard pricing. (Fireworks AI)
  • Best when you have repeated system prompts, repeated templates, or “same retrieved chunks show up often”.
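One practical consequence: prompt caching typically matches on a stable prefix, so keep static instructions at the front of the prompt and put volatile material (retrieved chunks, the user’s question) at the end. A sketch of that layout; `build_messages` is a hypothetical helper for your own pipeline:

```python
# Sketch: cache-friendly prompt layout for prompt-heavy RAG.
# Prefix caching generally matches on the longest stable prefix, so keep
# static instructions first and volatile text (chunks, question) last.

SYSTEM = (
    "You answer questions about political party funding. "
    "Use only the provided context and cite the source of every figure."
)  # static across requests -> cacheable prefix

def build_messages(question: str, chunks: list[str]) -> list[dict]:
    """Hypothetical helper: assemble OpenAI-style chat messages for one query."""
    context = "\n\n".join(chunks)  # changes per request -> keep after the prefix
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```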

If you want “one vendor” that covers chat + embeddings + rerank + tool calling

Together

  • Explicit OpenAI compatibility and base URL https://api.together.xyz/v1. (Together.ai Docs)
  • Their docs show embeddings, structured outputs, and function calling via the OpenAI client. (Together.ai Docs)
  • Their pricing page lists common open models and also includes embeddings and rerank categories. (together.ai)
  • Best when you want fewer moving parts for a full RAG pipeline.
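For the embeddings side, a sketch through the same OpenAI client, assuming a key in `TOGETHER_API_KEY`; the embedding model ID is only an example:

```python
# Sketch: embeddings for the retrieval index via Together's endpoint.
# Assumes a key in TOGETHER_API_KEY; the embedding model ID is an example.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-8k-retrieval",
    input=[
        "Total donations to Party X in 2023",
        "Party X annual financial report, 2023 edition",
    ],
)
vectors = [item.embedding for item in resp.data]
print(len(vectors), "embeddings of dimension", len(vectors[0]))
```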

If you want a router with strong fallback logic and lots of providers

OpenRouter

  • OpenAI SDK base URL: https://openrouter.ai/api/v1. (OpenRouter)
  • Routes while respecting capabilities like tools, max tokens, etc., and can optimize for uptime/price/latency depending on settings. (OpenRouter)
  • Pass-through inference pricing (no markup) but fees when purchasing credits, and BYOK has a defined fee model. (OpenRouter)
  • Best when reliability and provider diversity matter more than absolute lowest cost.
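As I understand their routing docs, fallbacks ride in the request body as an OpenRouter-specific `models` list; a hedged sketch (model slugs illustrative, key in `OPENROUTER_API_KEY`):

```python
# Sketch: OpenRouter with a fallback list. The "models" field is an
# OpenRouter-specific body parameter, not part of the OpenAI SDK itself;
# the model slugs here are illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Which filings list in-kind donations?"}],
    # if the primary model/provider is unavailable, OpenRouter tries the next entry
    extra_body={"models": [
        "meta-llama/llama-3.3-70b-instruct",
        "qwen/qwen-2.5-72b-instruct",
    ]},
)
print(resp.choices[0].message.content)
```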

Comparison table for your case (RAG chatbot from a database)

Notes:

  • “Representative cost” is anchored on Llama 3.3 70B where the provider publishes a specific number. Otherwise it is a tier.
  • Real cost for RAG is dominated by input tokens (retrieved context). So caching and reranking can beat “cheaper output tokens”.
| Service | OpenAI-compatible base URL | Free tier / included credits | Representative cost level | Performance profile | Pros for RAG-from-DB | Cons / pitfalls |
| --- | --- | --- | --- | --- | --- | --- |
| Groq | https://api.groq.com/openai/v1 (GroqCloud) | Free for dev experimentation (rate limited). (Groq Community) | Example: Llama 3.3 70B cited at $0.59/M in, $0.79/M out. (Groq) | Extremely low latency; very fast tokens/sec commonly reported. (Groq) | Great UX for chat + streaming; good for “answer quickly using retrieved context” | Feature gaps vs the full OpenAI surface (see unsupported items list). (GroqCloud) You may still want a separate embeddings/rerank provider |
| DeepInfra | Shown via OpenAI-compatible chat completions in docs/examples. (Deep Infra) | Typically paid usage | Llama 3.3 70B Turbo: $0.10/M in, $0.32/M out. (Deep Infra) | Standard cloud latency; value-oriented | Very low token cost; good default for “cheap 70B answer synthesis” | Fewer “platform extras” than some vendors; you still need to engineer cost controls in your app |
| Fireworks | https://api.fireworks.ai/inference/v1 (Fireworks AI Docs) | $1 free credits. (Fireworks AI) | >16B models: $0.90 per 1M tokens (tiered). (Fireworks AI) | Strong production posture; supports cost-saving modes | Prompt caching + batch both at 50% pricing (Fireworks AI); very relevant for prompt-heavy RAG | If your prompts are highly unique (low cache hit rate), you may not realize the savings |
| Together | https://api.together.xyz/v1 (Together.ai Docs) | “Register for free” (usage is priced). (Together.ai Docs) | Llama 3.3 70B Instruct-Turbo: $0.88/M in, $0.88/M out. (together.ai) | Broad model catalog; production-focused | Single vendor for chat + embeddings + function calling + structured outputs (Together.ai Docs); also offers a rerank category. (together.ai) | Not the cheapest on this specific 70B SKU; you pay for convenience and breadth |
| HF Inference Providers router | https://huggingface.co/proxy/router.huggingface.co/v1 (Hugging Face) | Free users $0.10/mo; Pro $2/mo; no-markup pass-through. (Hugging Face) | Pass-through (depends on chosen provider/model). (Hugging Face) | Depends on provider selected | Fast iteration: switch providers/models quickly; consolidated billing; broad provider list. (Hugging Face) | Free credits are tiny; pay-as-you-go beyond that varies by provider/model |
| OpenRouter | https://openrouter.ai/api/v1 (OpenRouter) | “Get started for free” plan exists (see pricing page). (OpenRouter) | Pass-through inference pricing; fees on credit purchase; BYOK fee model. (OpenRouter) | Can route by uptime/price/latency; supports tool-aware routing. (OpenRouter) | Excellent fallbacks and provider diversity; one endpoint for many models | Slight overhead vs direct-to-provider; an extra platform-fee layer when buying credits. (OpenRouter) |

What I would do in your exact situation

You said “low cost or free tier” and “OpenAI compatible endpoint” and you are building a RAG chatbot that reads from a database.

Phase 1: build and iterate cheaply

  • Groq for the chat model (fast and dev-friendly). (Groq Community)
  • Together or Fireworks for embeddings/rerank (because their OpenAI-compatible examples include embeddings, and Together explicitly covers embeddings + function calling patterns via OpenAI client). (Together.ai Docs)
  • If you want to avoid committing early, use HF router first, then pin a provider later. (Hugging Face)

Phase 2: cut unit cost once behavior is correct

  • Switch answer-synthesis to DeepInfra Llama 3.3 70B Turbo when you can tolerate slightly higher latency and want the lowest published $/token on that 70B tier.
  • Add caching/batching if your workflow has repetition (Fireworks makes these savings explicit).

Phase 3: harden reliability

  • Put OpenRouter (or HF router) in front when you need provider failover and routing policies (tools-aware routing, latency/price sorting).

Practical pitfalls and tips for RAG cost and quality

  1. RAG is input-token heavy.
    Most spend is “prompt tokens” because you inject retrieved chunks. This is why caching and reranking matter more than shaving output price.

  2. Rerank before you stuff context.
    Even a cheap reranker can cut your prompt size a lot. Together explicitly surfaces a rerank category in pricing, which is a hint they expect this pattern.

  3. Use tool/function calling only when needed.
    If your “database” is structured, function calling can route the flow: interpret question → generate SQL → execute → summarize (see the sketch just after this list). Together’s OpenAI-compat docs show function calling patterns.

  4. Prefer routers when you are unsure.
    HF router gives pass-through pricing and small credits, and OpenRouter offers routing and fallbacks.
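
To make pitfall 3 concrete, here is a hedged sketch of the question → SQL → execute → summarize loop. The `donations` schema, the `funding.db` file, and the model ID are all hypothetical, and in production the generated SQL should run under a read-only database user with validation:

```python
# Hedged sketch of pitfall 3: question -> SQL (tool call) -> execute -> summarize.
# The donations schema, funding.db file, and model ID are hypothetical.
import json, os, sqlite3
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.environ["GROQ_API_KEY"])
MODEL = "llama-3.3-70b-versatile"  # illustrative

tools = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": ("Run a read-only SQL query against the donations database. "
                        "Schema: donations(party TEXT, donor TEXT, amount REAL, year INTEGER)."),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "One SELECT statement."}},
            "required": ["query"],
        },
    },
}]

def run_sql(query: str) -> str:
    # Crude guard; real deployments should enforce this at the DB-permission level.
    if not query.lstrip().lower().startswith("select"):
        return "Refused: only SELECT statements are allowed."
    with sqlite3.connect("file:funding.db?mode=ro", uri=True) as db:
        return json.dumps(db.execute(query).fetchall())

messages = [{"role": "user", "content": "Which party received the most donations in 2023?"}]
first = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool
messages += [
    first.choices[0].message,
    {"role": "tool", "tool_call_id": call.id,
     "content": run_sql(json.loads(call.function.arguments)["query"])},
]
final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
print(final.choices[0].message.content)  # the model's summary of the query result
```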


Summary

  • Best “free-ish and fast” for chat: Groq. (Groq Community)
  • Best cheapest published 70B Turbo pricing: DeepInfra.
  • Best RAG cost levers (cache, batch): Fireworks.
  • Best all-in-one (chat + embeddings + function calling patterns): Together.
  • Best “experiment and switch providers easily”: HF Inference Providers router.
  • Best “routing + fallbacks across many providers”: OpenRouter.

Thanks a lot @John6666. That’s a lot to go through, but I will try and get to it very soon and get back to you!!

