Help with my questions. Very new at this

I have no idea how to properly formulate my questions, so I will just go ahead and write them out.

Is there an AI with world info that I can store locally, that is interactive in a way I can customize, with memory and without corporate influence?

Speed is not of the essence here; I am a very patient person. So the program (AI, LLM) does not have to be super fast in its interactions with me; it can even take 5 minutes to answer a simple question or command.

I quickly tried out this: bartowski/Meta-Llama-3.1-70B-Instruct-GGUF

Just through my own browser, but I want something a bit more interactive, so to speak.

I only have 12 GB of VRAM, 64 GB of DDR5 RAM, and a Ryzen 7 9800X3D.

Storage: slightly under 3 TB across two M.2 drives.

Windows 11

I will most likely buy a new GPU this year or next.

I do have an older PC/setup with Windows 10, if it is best to just have it on another system, so to speak. Does the operating system matter?

  • Llama-3.1-70B will run on your system, but the layers offloaded to RAM will perform extremely slowly. If real-time performance is required, it's better to find a model small enough to fit in VRAM (with 12 GB of VRAM, I recommend 3B to 12B models). Generally, smaller models run faster.
  • The ability to instantly acquire and retain "new" world knowledge or long-term conversation history isn't inherent to the LLM itself. This is usually handled by the framework (the outer software layer). Therefore, when selecting an LLM, it's best to assume its memory is limited to a fixed set of common knowledge. While it is possible to add memory to the LLM itself, it requires significant money, time, and expertise, and the accuracy of that memory isn't particularly promising. Conversely, even a small model can be usable if it handles conversation and information retrieval without issues.

What you’re actually asking for (and what “world info locally” means)

You can get a local, interactive assistant with:

  • Local inference (no cloud calls)
  • Customizable behavior (system prompts, personas, tools)
  • Persistent memory (preferences/facts saved between chats)
  • Local knowledge (“world info” as an offline library + your own documents via RAG)

But it’s rarely “one model that contains all world knowledge.” Instead it’s a stack:

  1. LLM (the “brain”) – answers and reasons
  2. Knowledge base / RAG (the “library”) – your offline reference docs, Wikipedia dumps, notes, manuals
  3. Memory (the “profile”) – stable facts about you and your preferences

This approach is also how you reduce “corporate influence”: run offline + use open tooling + choose permissively licensed models.


The best-fit solution for your case (Windows 11, 12GB VRAM, 64GB RAM)

Recommended stack: Ollama (runner) + Open WebUI (interactive UI)

Why this matches your requirements:

  • You can import your existing GGUF into Ollama using a Modelfile (FROM /path/to/file.gguf). (Ollama)
  • Open WebUI supports long-term memory with model-callable tools like add_memory, search_memories, replace_memory_content. (Open WebUI)
  • Open WebUI provides Knowledge collections and RAG (local docs, web content, etc.). (Open WebUI)
  • It’s designed to be self-hosted, and the docs explicitly call out persistent storage via Docker volume mapping so your data survives restarts. (Open WebUI)

Open WebUI licensing note: the project is “free and permissively licensed,” with a branding requirement; older code was BSD-3. (Open WebUI Community)

A very good alternative (if you want “desktop app simplicity”): AnythingLLM Desktop

  • Positioned as a “single-player” install for local LLMs + RAG + Agents with “full privacy.” (docs.anythingllm.com)
  • Clear documentation of where your local data sits on Windows (AppData) and what’s stored (documents, vector DB, models, SQLite DB). (docs.anythingllm.com)
  • Default local vector DB behavior and supported local vector DB options are documented. (docs.useanything.com)
  • Open source under MIT (repo license). (GitHub)

Other viable “interactive shell” options (trade-offs)

  • Jan: open-source offline ChatGPT-like desktop experience (good UI; less “RAG-first” than Open WebUI/AnythingLLM depending on your workflow). (GitHub)
  • LM Studio: explicitly supports operating fully offline including “chatting with documents” (simple and polished). (LM Studio)

Model strategy that makes your setup feel good

Why your 70B feels “not interactive”

70B GGUF on 12GB VRAM usually means heavy CPU/RAM offload and slow tokens/sec. It works, but the UI experience becomes “ask → wait.”
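A rough back-of-the-envelope check (approximate figures; exact GGUF file sizes vary by quant pack, and the context/KV cache adds more on top):

$$
\text{file size} \approx \frac{\text{params} \times \text{bits per weight}}{8}
\qquad\Rightarrow\qquad
70\text{B} \times \frac{\approx 4.8 \text{ (Q4\_K\_M)}}{8} \approx 42\ \text{GB}
$$

With 12GB of VRAM, roughly three quarters of those weights have to live in system RAM, and every generated token streams through the CPU-offloaded layers, which is exactly where the "ask → wait" feeling comes from.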

What I would do: “two-model workflow”

  1. Daily driver (fast enough, still smart): 7B–14B
  2. Slow quality mode: your 70B for deep answers, rewriting, long-form synthesis

This is the most practical way to get both “interactive” and “high quality” on your hardware.

Quantization sizing rule of thumb (directly relevant)

Many GGUF packs recommend choosing a quant with a file size ~1–2GB smaller than your GPU VRAM (and using system RAM to push quality/size higher if needed). (Hugging Face)
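Applied to your card (illustrative numbers; how much you reserve for context/KV cache and the desktop varies):

$$
12\ \text{GB VRAM} - 1\text{–}2\ \text{GB} \approx 10\text{–}11\ \text{GB budget}
\qquad\Rightarrow\qquad
\text{14B at Q4\_K\_M} \approx 9\ \text{GB fits};\quad \text{8B at Q5/Q6} \approx 6\text{–}7\ \text{GB leaves headroom}
$$

The headroom matters if you want long contexts, since the KV cache grows with context length.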

Models I’d pick (Hugging Face + license considerations)

If “without corporate influence” includes avoiding restrictive model licenses, prefer Apache-2.0 families where possible.

  • Qwen3-8B (Apache-2.0 license file present) (Hugging Face)

  • Qwen2.5-14B-Instruct (Apache-2.0 license) (Hugging Face)

    • A common GGUF choice for 12GB VRAM: Q4_K_M is ~8.99GB in a widely used pack. (Hugging Face)
  • Mistral 7B Instruct v0.3 is widely distributed under Apache (Ollama library page explicitly states Apache; Mistral’s inference repo is Apache-2.0). (Ollama)

About your current Llama 3.1 70B choice (license reality)

If license freedom matters to you:

  • FSF states Llama 3.1 Community License is not a free software license. (fsf.org)
  • OSI argues Llama 3.x is not open source under the Open Source Definition. (Open Source Initiative)

You can still run it locally; the point is: it may not meet your “without corporate influence” criterion if that criterion includes licensing and usage constraints.


“World info I can store locally”: what actually works

Option A (recommended): Offline library + RAG

  • Keep offline sources (PDFs, books, manuals, curated web exports, etc.)
  • Use RAG inside Open WebUI/AnythingLLM to retrieve relevant passages into context (Open WebUI)

This produces better factuality than relying on an LLM’s baked-in knowledge.

Option B: Offline Wikipedia via Kiwix (good “world info” baseline)

Kiwix is explicitly positioned as an offline gateway to Wikipedia and other educational content. (Kiwix)

Practical storage reality (from a major mirror listing):

  • wikipedia_en_all_maxi_2025-08.zim is 111G (English Wikipedia with images, “maxi”). (ftp.fau.de)
  • wikipedia_en_all_nopic_2025-08.zim is 43G (no images). (ftp.fau.de)
  • wikipedia_en_all_mini_2025-12.zim is 11G (smaller subset). (ftp.fau.de)

Best practice: use Kiwix for offline browsing/search, and only ingest selected subsets into RAG. Trying to embed “all maxi” into a vector DB is doable but becomes a major indexing/tuning project.
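If you go the Kiwix route, serving a ZIM locally is a one-liner. A minimal sketch, assuming the kiwix-tools `kiwix-serve` binary and the "nopic" file from the mirror listing above (the path is a placeholder; check your installed version's flags):

```
# Serve offline English Wikipedia on port 8080; browse and search it at
# http://localhost:8080 (or http://<server-ip>:8080 from other devices on your LAN).
kiwix-serve --port=8080 C:\kiwix\wikipedia_en_all_nopic_2025-08.zim
```

The Kiwix desktop app can also open the same ZIM directly if you'd rather not run a server.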


Memory (how it works in practice)

There are two “memory” types you should plan for:

  1. Assistant memory (preferences + stable facts)
    Open WebUI supports this as a feature with dedicated tools (add_memory, search_memories, etc.). (Open WebUI)

  2. Knowledge memory (your library)
    Open WebUI “Knowledge” is the persistent reference store; RAG pulls from it. (Open WebUI)

Practical note: Open WebUI memory is labeled experimental and there are open issues/discussions about retrieval quality (e.g., reranking). (Open WebUI)


OS and “should I use the older PC?”

Does Windows 10 vs 11 matter?

Not much, as long as you’re on supported versions and drivers:

  • Ollama on Windows: Windows 10 22H2 or newer; specific driver requirements for NVIDIA/AMD are documented. (Ollama)

When the older PC is worth using

Use a second machine as a “local AI server” if you want:

  • Your main PC to stay responsive while the LLM thinks
  • Long-running indexing jobs (RAG embeddings, large library processing) without disruption
  • Always-on access via browser from any device on your network

Open WebUI works well in a “server + browser client” pattern.
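A minimal sketch of that pattern (the IP address is a placeholder for your older PC; OLLAMA_HOST is Ollama's bind address, and Open WebUI's OLLAMA_BASE_URL setting points the UI at a remote Ollama instance; verify both against the current docs):

```
# On the older PC (the "server"): make Ollama listen on the LAN instead of localhost only,
# then restart the Ollama app/service.
setx OLLAMA_HOST "0.0.0.0:11434"

# From any device on your network you can then reach:
#   http://192.168.1.50:11434   (Ollama API)
#   http://192.168.1.50:3000    (Open WebUI, if you run it on that box as in the setup plan below)
```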


A concrete setup plan for you (minimal friction)

1) Make the system interactive first (UI + runner)

  • Install Ollama (Windows supported versions/drivers per docs). (Ollama)
  • Run Open WebUI with persistent storage (-v open-webui:/app/backend/data). (Open WebUI)
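For reference, the stock invocation looks roughly like this (a sketch based on the Open WebUI docs; check them for the current image tag and flags, and note it assumes Docker Desktop on Windows):

```
# Start Open WebUI with a named volume so chats, Knowledge collections, and memories
# survive container restarts; the UI then lives at http://localhost:3000.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
```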

2) Add two models

  • Daily: Qwen3-8B (Apache) or Qwen2.5-14B-Instruct GGUF Q4_K_M (~8.99GB) (Hugging Face)
  • Quality mode: your Llama 3.1 70B GGUF, imported into Ollama via Modelfile (FROM /path/to/file.gguf). (Ollama)
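A sketch of both steps with the Ollama CLI (the model tag, GGUF path, and model name are examples; check the Ollama library for the exact tag you want and point FROM at your actual file):

```
# Daily driver: pull a quant that fits comfortably in 12 GB of VRAM.
ollama pull qwen2.5:14b-instruct-q4_K_M

# Quality mode: import the 70B GGUF you already downloaded.
# Create a plain-text file named "Modelfile" containing a single line, e.g.:
#   FROM C:\models\Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
ollama create llama31-70b-quality -f Modelfile

# Both models then show up in Open WebUI's model picker, or can be run directly:
ollama run qwen2.5:14b-instruct-q4_K_M
```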

3) Add “world info” progressively

  • Start: your own documents + curated references in Knowledge/RAG (Open WebUI)
  • Add: Kiwix ZIM (English “nopic” is a good space/performance compromise at 43G). (ftp.fau.de)

4) Turn on memory (and control it)

  • Enable Open WebUI memory and decide what you want stored (preferences, recurring projects, stable facts). (Open WebUI)

The key trade-off you can’t avoid

Even with zero cloud usage, “no corporate influence” is not absolute because:

  • Most strong open-weight models are trained by large orgs (corporate/academic) and inherit their training data choices.
  • You can reduce external influence by running offline and choosing permissive licenses (Apache/MIT tooling), but you can’t erase bias without your own training/fine-tuning.

If you want to push further in that direction, the next step is a small fine-tune (LoRA) on your own writing/preferences + strict RAG-only factual answering for “world info.”


If you want one “best default” recommendation given everything you said: Open WebUI + Ollama, daily model Qwen2.5-14B-Instruct (Q4_K_M), and use your 70B only as a “slow quality mode,” with offline “world info” coming from Kiwix (ZIM) + curated RAG collections.

Thank you for your reply and all this info, I do appreciate it :slight_smile: What should I look out for in newer/future models? Thinking of the license and just overall capabilities. Like, when should I replace my daily model?


Newer models typically offer higher performance or excel at specific tasks when compared to models of the same size. Even a few months' difference can be significant. Furthermore, newer models often possess up-to-date knowledge of current events and scientific findings. In addition, each LLM has its own response tendencies and unique way of thinking.

Therefore, when a promising model emerges (promising models often get discussed in Posts, on the Blog, or on the HF Discord, etc.; leaderboards are also a reliable resource if you want to find candidates yourself for your specific purpose), it's worth trying it out.
If you like it, adopt it as your daily model.
Since models are just files you keep on disk, you can load whichever one you need. Rather than thinking of it as a replacement, consider it more like adding an option.

Nice, thank you for taking the time and helping me out, really appreciate it :slight_smile:


Is there an official invite link to the HF Discord? I could not find it here. I might also be blind :stuck_out_tongue:


Maybe this? https://huggingface.co/join/discord

It gives me a 404: "Organization not found."


Oh… True. Seems it changed?
https://huggingface.co/discord-community
