Llama-3.1-70B will run on your system, but a model that spills into system RAM runs extremely slowly. If real-time performance is required, it's better to pick a model small enough to fit in VRAM (with 12GB VRAM, I recommend 3B to 12B models). Generally, smaller models run faster.
- The ability to instantly acquire and retain "new" knowledge (world info, long-term conversation history) isn't inherent to the LLM itself. This is usually handled by the framework (the outer software layer). Therefore, when selecting an LLM, it's best to assume its memory is limited to a fixed set of common knowledge. While it is possible to add memory to the model itself (e.g., by fine-tuning), that takes significant money, time, and expertise, and the resulting recall accuracy isn't particularly reliable. Conversely, even a small model can be perfectly usable if it handles conversation and information retrieval well.
What you're actually asking for (and what "world info locally" means)
You can get a local, interactive assistant with:
- Local inference (no cloud calls)
- Customizable behavior (system prompts, personas, tools)
- Persistent memory (preferences/facts saved between chats)
- Local knowledge ("world info" as an offline library + your own documents via RAG)
But it's rarely "one model that contains all world knowledge." Instead it's a stack:
- LLM (the "brain"): answers and reasons
- Knowledge base / RAG (the "library"): your offline reference docs, Wikipedia dumps, notes, manuals
- Memory (the "profile"): stable facts about you and your preferences
This approach is also how you reduce "corporate influence": run offline + use open tooling + choose permissively licensed models.
The best-fit solution for your case (Windows 11, 12GB VRAM, 64GB RAM)
Recommended stack: Ollama (runner) + Open WebUI (interactive UI)
Why this matches your requirements:
- You can import your existing GGUF into Ollama using a Modelfile (FROM /path/to/file.gguf). (Ollama)
- Open WebUI supports long-term memory with model-callable tools like add_memory, search_memories, replace_memory_content. (Open WebUI)
- Open WebUI provides Knowledge collections and RAG (local docs, web content, etc.). (Open WebUI)
- It's designed to be self-hosted, and the docs explicitly call out persistent storage via Docker volume mapping so your data survives restarts. (Open WebUI)
Open WebUI licensing note: the project is "free and permissively licensed," with a branding requirement; older code was BSD-3. (Open WebUI Community)
A very good alternative (if you want "desktop app simplicity"): AnythingLLM Desktop
- Positioned as a "single-player" install for local LLMs + RAG + Agents with "full privacy." (docs.anythingllm.com)
- Clear documentation of where your local data sits on Windows (AppData) and what's stored (documents, vector DB, models, SQLite DB). (docs.anythingllm.com)
- Default local vector DB behavior and supported local vector DB options are documented. (docs.useanything.com)
- Open source under MIT (repo license). (GitHub)
Other viable "interactive shell" options (trade-offs)
- Jan: open-source offline ChatGPT-like desktop experience (good UI; less "RAG-first" than Open WebUI/AnythingLLM depending on your workflow). (GitHub)
- LM Studio: explicitly supports operating fully offline, including "chatting with documents" (simple and polished). (LM Studio)
Model strategy that makes your setup feel good
Why your 70B feels "not interactive"
A 70B GGUF on 12GB VRAM usually means heavy CPU/RAM offload and slow tokens/sec: even at ~4-bit quantization the weights alone are on the order of 40GB, so most layers run from system RAM at CPU memory bandwidth. It works, but the UI experience becomes "ask, then wait."
What I would do: "two-model workflow"
- Daily driver (fast enough, still smart): 7B–14B
- Slow quality mode: your 70B for deep answers, rewriting, long-form synthesis
This is the most practical way to get both "interactive" and "high quality" on your hardware.
Quantization sizing rule of thumb (directly relevant)
Many GGUF packs recommend choosing a quant with a file size ~1–2GB smaller than your GPU VRAM (and using system RAM to push quality/size higher if needed). (Hugging Face) With 12GB VRAM, that points to quant files of roughly 10–11GB at most, which is why a 14B Q4_K_M at ~9GB fits comfortably while larger quants start spilling into RAM.
Models I'd pick (Hugging Face + license considerations)
If "without corporate influence" includes avoiding restrictive model licenses, prefer Apache-2.0 families where possible.
- Qwen3-8B (Apache-2.0 license file present) (Hugging Face)
- Qwen2.5-14B-Instruct (Apache-2.0 license) (Hugging Face)
- A common GGUF choice for 12GB VRAM: Q4_K_M is ~8.99GB in a widely used pack (see the download sketch after this list). (Hugging Face)
- Mistral 7B Instruct v0.3 is widely distributed under Apache (Ollama library page explicitly states Apache; Mistral's inference repo is Apache-2.0). (Ollama)
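If you download one of the GGUFs above yourself (instead of pulling through the Ollama library), a minimal sketch looks like the following. The repo and filename are assumptions based on one widely used quant pack (bartowski's); verify the exact names on Hugging Face before running.
```bash
# Sketch: fetch a single Q4_K_M GGUF from Hugging Face
# (repo and filename are illustrative; check the model page for the current names)
pip install -U "huggingface_hub[cli]"

huggingface-cli download \
  bartowski/Qwen2.5-14B-Instruct-GGUF \
  Qwen2.5-14B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```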
About your current Llama 3.1 70B choice (license reality)
If license freedom matters to you:
- FSF states Llama 3.1 Community License is not a free software license. (fsf.org)
- OSI argues Llama 3.x is not open source under the Open Source Definition. (Open Source Initiative)
You can still run it locally; the point is that it may not meet your "without corporate influence" criterion if that criterion includes licensing and usage constraints.
"World info I can store locally": what actually works
Option A (recommended): Offline library + RAG
- Keep offline sources (PDFs, books, manuals, curated web exports, etc.)
- Use RAG inside Open WebUI/AnythingLLM to retrieve relevant passages into context (Open WebUI)
This produces better factuality than relying on an LLM's baked-in knowledge.
Option B: Offline Wikipedia via Kiwix (good "world info" baseline)
Kiwix is explicitly positioned as an offline gateway to Wikipedia and other educational content. (Kiwix)
Practical storage reality (from a major mirror listing):
- wikipedia_en_all_maxi_2025-08.zim is 111G (English Wikipedia with images, "maxi"). (ftp.fau.de)
- wikipedia_en_all_nopic_2025-08.zim is 43G (no images). (ftp.fau.de)
- wikipedia_en_all_mini_2025-12.zim is 11G (smaller subset). (ftp.fau.de)
Best practice: use Kiwix for offline browsing/search, and only ingest selected subsets into RAG. Trying to embed "all maxi" into a vector DB is doable but becomes a major indexing/tuning project.
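A minimal sketch of the Kiwix route, assuming kiwix-tools (which includes kiwix-serve) is installed and that the mirror path and ZIM filename match the listing above (verify both before downloading ~43GB):
```bash
# Sketch: grab the "nopic" English Wikipedia ZIM and serve it locally for offline browsing/search
# (mirror path is an assumption; adjust it to the mirror's actual directory layout)
wget https://ftp.fau.de/kiwix/zim/wikipedia/wikipedia_en_all_nopic_2025-08.zim

# Browse at http://localhost:8090
kiwix-serve --port=8090 wikipedia_en_all_nopic_2025-08.zim
```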
Memory (how it works in practice)
There are two "memory" types you should plan for:
- Assistant memory (preferences + stable facts): Open WebUI supports this as a feature with dedicated tools (add_memory, search_memories, etc.). (Open WebUI)
- Knowledge memory (your library): Open WebUI "Knowledge" is the persistent reference store; RAG pulls from it. (Open WebUI)
Practical note: Open WebUI memory is labeled experimental and there are open issues/discussions about retrieval quality (e.g., reranking). (Open WebUI)
OS and "should I use the older PC?"
Does Windows 10 vs 11 matter?
Not much, as long as you're on supported versions and drivers:
- Ollama on Windows: Windows 10 22H2 or newer; specific driver requirements for NVIDIA/AMD are documented. (Ollama)
When the older PC is worth using
Use a second machine as a "local AI server" if you want:
- Your main PC to stay responsive while the LLM thinks
- Long-running indexing jobs (RAG embeddings, large library processing) without disruption
- Always-on access via browser from any device on your network
Open WebUI works well in a "server + browser client" pattern.
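A minimal sketch of that pattern, assuming Ollama runs on the older PC and Open WebUI points at it over the LAN (the IP address below is illustrative):
```bash
# On the older PC acting as the LLM server: expose Ollama on the LAN instead of localhost only
# (on Windows, set OLLAMA_HOST as a user environment variable rather than using export)
export OLLAMA_HOST=0.0.0.0
ollama serve

# Then, wherever Open WebUI runs, add this to its docker run command (IP is illustrative):
#   -e OLLAMA_BASE_URL=http://192.168.1.50:11434
```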
A concrete setup plan for you (minimal friction)
1) Make the system interactive first (UI + runner)
- Install Ollama (Windows supported versions/drivers per docs). (Ollama)
- Run Open WebUI with persistent storage (-v open-webui:/app/backend/data); see the Docker sketch below. (Open WebUI)
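A minimal sketch of that step, based on the standard Docker invocation in the Open WebUI docs (port mapping, container name, and the host-gateway flag are the common defaults; adjust as needed):
```bash
# Sketch: run Open WebUI in Docker with a named volume so chats, memories,
# and Knowledge collections survive container restarts
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
# The --add-host line lets the container reach an Ollama instance running on the host.
```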
2) Add two models
- Daily: Qwen3-8B (Apache) or Qwen2.5-14B-Instruct GGUF Q4_K_M (~8.99GB) (Hugging Face)
- Quality mode: your Llama 3.1 70B GGUF, imported into Ollama via Modelfile (FROM /path/to/file.gguf); a sketch covering both models follows this list. (Ollama)
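A minimal sketch of both model additions, assuming the Ollama library tags still exist as shown and using an illustrative local path for your 70B GGUF (on plain Windows, create the Modelfile in a text editor instead of using the heredoc):
```bash
# Daily driver: pull a quantized instruct model from the Ollama library
# (tags are examples; check the library pages for current ones)
ollama pull qwen2.5:14b        # or: ollama pull qwen3:8b

# Quality mode: import your existing 70B GGUF via a Modelfile
cat > Modelfile <<'EOF'
FROM C:/models/llama-3.1-70b-instruct.Q4_K_M.gguf
EOF

ollama create llama31-70b-local -f Modelfile
ollama run llama31-70b-local
```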
3) Add "world info" progressively
- Start: your own documents + curated references in Knowledge/RAG (Open WebUI)
- Add: Kiwix ZIM (English "nopic" is a good space/performance compromise at 43G). (ftp.fau.de)
4) Turn on memory (and control it)
- Enable Open WebUI memory and decide what you want stored (preferences, recurring projects, stable facts). (Open WebUI)
The key trade-off you can't avoid
Even with zero cloud usage, "no corporate influence" is not absolute because:
- Most strong open-weight models are trained by large orgs (corporate/academic) and inherit their training data choices.
- You can reduce external influence by running offline and choosing permissive licenses (Apache/MIT tooling), but you can't erase bias without your own training/fine-tuning.
If you want to push further in that direction, the next step is a small fine-tune (LoRA) on your own writing/preferences + strict RAG-only factual answering for "world info."
If you want one "best default" recommendation given everything you said: Open WebUI + Ollama, daily model Qwen2.5-14B-Instruct (Q4_K_M), and use your 70B only as a "slow quality mode," with offline "world info" coming from Kiwix (ZIM) + curated RAG collections.