AI Integration Guide

A comprehensive guide for setting up, testing, and integrating AI with epic_engine.


Testing the AI

Test if Qwen actually generates text using one of these methods:

Option A: Use curl (in a new terminal)

curl -X POST http://localhost:8000/api/generate \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"Write one sentence about a knight.\"}"

Option B: Use the browser

Go to: http://localhost:8000/docs

This opens FastAPI's built-in test interface where you can try the /api/generate endpoint with a form.
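Option C: Use a Python script

For scripted testing, the same request can be built with the standard library. This is a sketch that assumes the request shape from the curl example above:

```python
import json
from urllib import request

def build_generate_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for /api/generate (payload shape taken from the curl example)."""
    return request.Request(
        f"{base_url}/api/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("Write one sentence about a knight.")
# With the server running: print(request.urlopen(req).read().decode())
```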

Test It

  1. Restart your server (Ctrl+C, then uv run python -m api.main)
  2. Go to http://localhost:8000/docs
  3. Try the new POST /api/generate/stream endpoint

Vector Store Sync

Is Your Existing KG Data Vectorized?

No. Your existing Knowledge Graph data is NOT synced to the vector store yet. Sync notifications only fire for changes made after the sync code was added:

  • New entities are created
  • Existing entities are updated
  • Entities are deleted

To sync existing data, call the full novel sync endpoint for each novel:

POST http://localhost:8000/api/sync/novel
Body: { "novel_id": "your-novel-uuid-here" }

This will:

  • Fetch all entities for that novel from backend
  • Chunk each entity
  • Generate embeddings
  • Store in ChromaDB
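
The four steps above can be sketched as a pipeline. The function names here are illustrative stubs, not the real epic_engine API:

```python
def sync_novel(novel_id, fetch_entities, chunk, embed, store):
    """Fetch all entities for a novel, chunk them, embed each chunk, store the vectors."""
    stored = 0
    for entity in fetch_entities(novel_id):
        for piece in chunk(entity):
            store(novel_id, piece, embed(piece))  # e.g. a ChromaDB upsert
            stored += 1
    return stored

# Toy stubs to show the flow end to end
entities = [{"name": "Arthur", "description": "A knight sworn to the realm."}]
vectors = []
count = sync_novel(
    "novel-1",
    fetch_entities=lambda nid: entities,
    chunk=lambda e: [e["name"], e["description"]],
    embed=lambda text: [float(len(text))],   # stand-in for a real embedding model
    store=lambda nid, piece, vec: vectors.append((piece, vec)),
)
```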

Check Sync Status

GET http://localhost:8000/api/sync/stats

If it returns collection_count: 0, nothing is vectorized yet.


VS Code / IDE Setup

Note: These are VS Code/Pylance IDE warnings, not actual Python errors. The packages are installed correctly (you verified with pip show), but VS Code is looking at a different Python interpreter.

How to Fix the IDE Warnings

  1. Press Ctrl+Shift+P in VS Code
  2. Type "Python: Select Interpreter"
  3. Choose the interpreter at:
C:\Users\arman\AppData\Roaming\Python\Python313\python.exe

After selecting the correct interpreter, the red squiggly lines should disappear within a few seconds as Pylance re-analyzes the files.

Why This Happens

VS Code might be pointing to:

  • A different Python version
  • A virtual environment that doesn't have the packages
  • The wrong system Python

The pip show output confirmed packages are in C:\Users\arman\AppData\Roaming\Python\Python313\site-packages, so you need to tell VS Code to use that same Python.


Note: Log in with armandoblancq@gmail.com (or sign in with Google).

Provider Link
OpenAI https://platform.openai.com/logs
Anthropic https://console.anthropic.com/workspaces/default/logs
Google Gemini https://aistudio.google.com/app/logs and https://console.cloud.google.com/billing
DeepSeek https://platform.deepseek.com/usage
Groq https://console.groq.com/keys
Serper https://serper.dev/logs

Supported Providers

  • OpenAI
  • Anthropic
  • Google Gemini
  • DeepSeek
  • Groq
  • Serper

Package Installation

Additional dependencies for vector operations:

uv add "numba>=0.59" "umap-learn>=0.5.5"

How epic_engine Integrates with aiservice

Step 1: Install epic_engine as a package

cd package/epic_engine
pip install -e .

The -e flag means "editable" - it installs the package in development mode, linking to your source code rather than copying it.

Step 2: Import in aiservice

Your aiservice files would import from epic_engine instead of their local modules:

# Before (current aiservice)
from retrieval.hybrid_retriever import HybridRetriever
from engine.rag_engine import RAGEngine
from vectorstore.vector_store import VectorStore

# After (using epic_engine)
from epic_engine.retrieval import HybridRetriever
from epic_engine.rag import RAGEngine
from epic_engine.vectors import VectorStore

Step 3: aiservice becomes a thin API layer

Your aiservice would only contain:

  • Flask/FastAPI routes (api/routes.py)
  • Request/response schemas (api/schemas.py)
  • Server configuration (api/server.py)
  • Any app-specific logic not in the engine

Will Updates to epic_engine Reflect in aiservice?

YES - if you installed with pip install -e . (editable mode)

Installation Method Updates Reflected?
pip install -e . (editable) Yes - immediately linked to source
pip install . (regular) No - need to reinstall
pip install epic-engine (PyPI) No - need to upgrade

With editable install:

  1. Edit epic_engine/rag/engine.py
  2. Restart aiservice
  3. Changes are immediately available

Without editable install:

  1. Edit epic_engine/rag/engine.py
  2. Must run pip install . again
  3. Then restart aiservice

Scenario Recommendation
During development Use pip install -e . so changes reflect immediately
For production Use regular pip install . or version from PyPI
For other projects They can pip install epic-engine independently

What Stays in aiservice vs epic_engine

aiservice (API Layer) epic_engine (Core Library)
Flask/FastAPI routes RAG engine
HTTP request handling Vector store
Authentication Knowledge graph
API schemas Providers (OpenAI, etc.)
Server startup Agents
App-specific configs Prompts

Key Benefit: This separation means you could build a completely different app (CLI tool, desktop app, another API) using the same epic_engine library.


UV Package Manager

pyproject.toml Configuration

Path dependencies need to be in a separate [tool.uv.sources] section:

[project]
dependencies = [
    "epic-engine",  # keep as string here
    "fastapi",
    "uvicorn",
    "httpx",
]

[tool.uv.sources]
epic-engine = { path = "../package/epic_engine" }

Warning: Putting epic-engine = { path = "..." } inside the dependencies array will fail: PEP 621 requires each dependencies entry to be a plain requirement string, so path sources belong in [tool.uv.sources].

Transitive Dependencies

The uv.lock file contains all transitive dependencies - not just your 4 direct dependencies, but everything those packages depend on:

Even though your pyproject.toml only lists 4 packages, the full dependency tree is ~143 packages.

When to Run uv sync

You only need to run uv sync again if:

  • You add/remove dependencies in epic_engine's pyproject.toml
  • You change the package structure (add new submodules to __init__.py)

For normal code changes (fixing bugs, improving logic, adding functions to existing files), just save and restart the server.

Reinstalling epic-engine

When uv sync uses a cached version that doesn't have your new changes:

uv sync --reinstall-package epic-engine

This tells uv to rebuild and reinstall epic-engine from the source path.


Running the AI Service

Quick Start Commands

# Reinstall epic-engine (after making changes)
uv sync --reinstall-package epic-engine

# Start the AI service
uv run python -m api.main

After starting, test these endpoints:

Endpoint Expected Response
http://localhost:8000/ "Epic AI Service is running"
http://localhost:8000/api/health healthy status

Note: The Qwen model uses lazy loading - it loads on first use, not at startup.
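
Lazy loading means the first request pays the load cost. A minimal version of the pattern (illustrative, not the service's actual code):

```python
class LazyModel:
    """Defer an expensive model load until the first access."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None

    @property
    def model(self):
        if self._model is None:
            self._model = self._loader()  # runs once, on first use
        return self._model

loads = []
def fake_loader():
    loads.append("loaded")
    return "qwen-2.5-7b"

qwen = LazyModel(fake_loader)
# Nothing is loaded at startup; the first .model access triggers the load.
```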

Resync Novel Data

Open a new terminal while aiservice is running:

# PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/sync/novel" `
  -Method POST `
  -ContentType "application/json" `
  -Body '{"novel_id": "cmise0o310000h21zkiofct47"}'

Test Interface

# Launch the test GUI
uv run python test_interface.py

# Or the EPIC Tester
uv run python EPIC_Tester.py

Note: Make sure your backend is running (npm run dev in the backend folder) since the test interface needs to connect to the Knowledge Graph API.

Start the backend first:

cd backend
npm run dev

Prisma Migrations

To create the PlotThread table in the database:

cd backend

# Quick development sync
npx prisma db push

# Or proper migration
npx prisma migrate dev --name add_plot_thread

The db push is quicker for development - it syncs your schema without creating migration files.


Retriever Comparison

How to Switch Retrievers

In routes.py line 381, change:

# Current: LLM reranking, dual search
_DEFAULT_RETRIEVER_TYPE = "advanced"

# Alternative: Faster, score-based RRF fusion
_DEFAULT_RETRIEVER_TYPE = "hybrid"
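
One way to make the switch configurable is a small factory keyed on that setting. This is a sketch; the retriever classes below are empty stubs standing in for the real ones:

```python
class HybridRetriever: ...
class AdvancedRetriever: ...
class RVRGRetriever: ...

_RETRIEVERS = {
    "hybrid": HybridRetriever,
    "advanced": AdvancedRetriever,
    "rvrg": RVRGRetriever,
}

def make_retriever(retriever_type="advanced"):
    """Instantiate the retriever named by a _DEFAULT_RETRIEVER_TYPE-style string."""
    try:
        return _RETRIEVERS[retriever_type]()
    except KeyError:
        raise ValueError(f"unknown retriever type: {retriever_type!r}")
```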

HybridRetriever

Feature Has It? Details
Query Rewriting Yes Uses QueryRewriter (lines 60-65 in hybrid.py)
Vector Search Yes Parallel with KG search
KG Search Yes Parallel with vector search
KG Traversal Yes From seed entities
Dual Query Search No Only searches with ONE query (rewritten if enabled)
LLM Reranking No Uses RRF score fusion (mathematical, no LLM)
ThreadPoolExecutor Workers 2

AdvancedRetriever

Feature Has It? Details
Query Rewriting Yes Uses QueryRewriter
Vector Search Yes Parallel
KG Search Yes Parallel
KG Traversal Yes From seed entities
Dual Query Search Yes Searches with BOTH original AND rewritten queries
LLM Reranking Yes Uses Reranker with LLM to judge relevance
LLM Validation No
LLM Reasoning No
ThreadPoolExecutor Workers 3

RVRGRetriever (Retrieve-Validate-Reason-Generate)

The most advanced retriever with full LLM-powered pipeline for highest quality context.

Feature Has It? Details
Query Rewriting Yes Uses QueryRewriter
Vector Search Yes Parallel (original + rewritten queries)
KG Search Yes Parallel
KG Traversal Yes From seed entities
Dual Query Search Yes Searches with BOTH original AND rewritten queries
LLM Reranking Yes Uses Reranker with LLM to judge relevance
LLM Validation Yes Filters out irrelevant results with explanations
LLM Reasoning Yes Extracts insights, connections, and identifies gaps
ThreadPoolExecutor Workers 3

RVRG Pipeline Stages:

  1. Retrieve - Query rewrite → Vector search → KG search → KG traversal → Rerank
  2. Validate - LLM evaluates each result for relevance (score 0-1, with explanations)
  3. Reason - LLM analyzes validated context to extract insights, find connections, identify gaps
  4. Generate - Final answer generation with reasoning summary as guidance
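
The four stages can be wired together as a simple skeleton. The stage functions here are injected stubs, not the real epic_engine implementations:

```python
def rvrg_pipeline(query, retrieve, validate, reason, generate):
    """Retrieve -> Validate -> Reason -> Generate, as described above."""
    candidates = retrieve(query)
    validated = [c for c in candidates if validate(query, c)["relevant"]]
    reasoning = reason(query, validated)   # insights, connections, gaps
    return generate(query, validated, reasoning)

answer = rvrg_pipeline(
    "Who trained the knight?",
    retrieve=lambda q: ["Arthur was trained by Merlin.", "The castle has four towers."],
    validate=lambda q, c: {"relevant": "trained" in c, "score": 0.9},
    reason=lambda q, ctx: "Merlin is the trainer.",
    generate=lambda q, ctx, summary: summary,
)
```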

Unique Features:

  • Validation with explanations - Each result gets a relevance score and reason why it's relevant/irrelevant
  • Gap identification - Detects what information is missing to fully answer the question
  • Confidence scoring - Reports whether context is sufficient to answer (can_answer, answer_confidence)
  • Key insights extraction - Pulls out important facts from context with source attribution
  • Connection discovery - Finds relationships between entities/events in context

The ACTUAL Key Differences

Feature HybridRetriever AdvancedRetriever RVRGRetriever
Queries Used for Search 1 (rewritten only) 2 (original + rewritten) 2 (original + rewritten)
Ranking Method RRF score fusion (math) LLM judges relevance LLM judges relevance
Validation Step No No Yes (filters irrelevant results)
Reasoning Step No No Yes (extracts insights & gaps)
LLM Calls During Retrieval 1 (rewriting only) 2 (rewriting + reranking) 4 (rewrite + rerank + validate + reason)
Speed Fastest Medium Slowest
Cost Lowest Medium Highest
Quality Good Better Best

All three have query rewriting. The key progression is: Hybrid (fast, math-based) → Advanced (adds LLM reranking) → RVRG (adds validation and reasoning for highest quality).

Which Retriever is Better?

Priority Better Choice
Highest Quality RVRGRetriever
Complex queries RVRGRetriever
Good quality + speed AdvancedRetriever
Speed HybridRetriever
Cost (API calls) HybridRetriever
Simple lookups HybridRetriever
Know if answer exists RVRGRetriever (has can_answer flag)

What is ThreadPoolExecutor Workers?

ThreadPoolExecutor is Python's way to run multiple tasks in parallel (at the same time).

  • Workers = 2 means 2 tasks can run simultaneously
  • Workers = 3 means 3 tasks can run simultaneously

In your retrievers:

  • HybridRetriever (2 workers): Runs Vector Search + KG Search in parallel
  • AdvancedRetriever (3 workers): Runs Original Query Search + Rewritten Query Search + KG Search in parallel

More workers = faster retrieval (tasks don't wait for each other).
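
A minimal sketch of the two-worker pattern HybridRetriever uses (the search functions are stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

def vector_search(query):
    return [f"vec:{query}"]   # stand-in for the real vector store query

def kg_search(query):
    return [f"kg:{query}"]    # stand-in for the real Knowledge Graph query

with ThreadPoolExecutor(max_workers=2) as pool:
    vec = pool.submit(vector_search, "knight")   # both tasks run
    kg = pool.submit(kg_search, "knight")        # at the same time
    results = vec.result() + kg.result()
```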

LLM Reranking vs RRF Score Fusion: Which is More Accurate?

LLM Reranking (AdvancedRetriever) - More Accurate

How it works: An LLM reads each result and the original query, then judges: "Is this result actually relevant to what the user asked?"

Pros:

  • Understands semantic meaning and context
  • Can recognize when a result looks related but isn't actually helpful
  • Handles nuance, synonyms, and intent

Cons:

  • Slower (requires LLM API call)
  • Costs money (API tokens)
  • Can hallucinate or make mistakes

RRF Score Fusion (HybridRetriever) - Faster, but Less Accurate

How it works: Mathematical formula that combines rankings from different sources:

RRF_score = Σ (1 / (k + rank_i))

Where k is typically 60, and rank_i is the position in each result list.
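
In code, the fusion is just a few lines (a sketch, with k=60 as noted above):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank) across lists."""
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["knight", "castle", "dragon"]   # ranked output of vector search
kg_hits     = ["dragon", "knight", "sword"]    # ranked output of KG search
fused = rrf_fuse([vector_hits, kg_hits])
# "knight" wins: ranked #1 and #2; "dragon" (#3 and #1) comes second
```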

Pros:

  • Instant (pure math, no LLM call)
  • Free (no API costs)
  • Consistent and predictable

Cons:

  • Doesn't understand meaning - just combines numbers
  • A result ranked #1 in both lists wins, even if it's not actually relevant
  • Can't handle cases where high-scoring results are semantically wrong

Bottom Line

Method Accuracy Speed Cost
LLM Reranking Higher Slower Higher
RRF Score Fusion Lower Faster Free

LLM reranking is more accurate because it actually understands the query and results. RRF just does math on rankings without understanding content. For a creative writing app like EPIC where context quality matters, AdvancedRetriever with LLM reranking will give better results - but at the cost of speed and API calls.

Speed & Cost Comparison

Speed Difference

Retriever LLM Calls During Retrieval Estimated Time
HybridRetriever 1 call (query rewriting) ~1-2 seconds
AdvancedRetriever 2 calls (rewriting + reranking) ~3-5 seconds
RVRGRetriever 4 calls (rewrite + rerank + validate + reason) ~6-12 seconds

The validation and reasoning steps add significant time because:

  • Validation sends all reranked results to LLM for relevance scoring with explanations
  • Reasoning analyzes validated context to extract insights, connections, and gaps
  • Each step requires a full LLM inference pass

Rough estimate: AdvancedRetriever is 2-3x slower than Hybrid, and RVRGRetriever is 2-3x slower than Advanced.

API Cost Difference

All three retrievers use the user-selected model from the test interface dropdown (e.g., openai/gpt-4o-mini, anthropic/claude-3-haiku, deepseek/deepseek-chat). The model is passed through the provider/model settings.

Example pricing (GPT-4o-mini as of early 2025):

  • Input: ~$0.15 per 1M tokens
  • Output: ~$0.60 per 1M tokens

Retriever Tokens per Query (estimate) Cost per Query
HybridRetriever ~500 tokens (rewrite only) ~$0.0001
AdvancedRetriever ~2000-3000 tokens (rewrite + rerank 20 results) ~$0.0003-0.0005
RVRGRetriever ~4000-6000 tokens (rewrite + rerank + validate + reason) ~$0.0006-0.001

RVRGRetriever costs roughly 6-10x more per query than HybridRetriever due to the additional LLM calls for validation and reasoning.
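
These per-query figures follow directly from token counts and the per-million rates. A back-of-the-envelope helper, using the GPT-4o-mini prices quoted above and assumed token splits:

```python
def query_cost_usd(input_tokens, output_tokens, in_rate=0.15, out_rate=0.60):
    """Cost of one query given per-1M-token rates (defaults: GPT-4o-mini)."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

hybrid   = query_cost_usd(400, 100)    # rewrite only -> ~$0.0001
advanced = query_cost_usd(2500, 200)   # rewrite + rerank -> ~$0.0005
```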

Real-World Impact

Usage HybridRetriever AdvancedRetriever RVRGRetriever
100 queries/day ~$0.01/day ~$0.03-0.05/day ~$0.06-0.10/day
1000 queries/day ~$0.10/day ~$0.30-0.50/day ~$0.60-1.00/day

Bottom line: The cost difference is negligible for personal use. RVRGRetriever provides the highest quality context with validation and reasoning, but at 2-3x the cost of AdvancedRetriever. The speed difference is more noticeable than the cost.

Can Qwen 2.5 7B Handle AdvancedRetriever?

Yes, but with caveats.

What Qwen 2.5 7B Needs to Do

Task Difficulty for 7B Model
Query Rewriting Easy - Simple text transformation
LLM Reranking Harder - Must evaluate 10-20 results against query

Potential Issues

1. Reranking Quality

  • Reranking requires the model to understand relevance at a deeper level
  • 7B models can do this, but not as well as GPT-4o-mini or Claude
  • You might get slightly worse ranking than with a cloud model

2. Context Window Pressure

  • Reranking sends ALL candidate results (up to 20 entities) to the LLM
  • Each entity has name + description + metadata
  • Could easily be 2000-4000 tokens just for the reranking prompt
  • Your Qwen is configured with 8192 context window - should be fine, but tight

3. Speed

  • Local 7B model on RTX 3050 6GB is slower than cloud APIs
  • Reranking adds another full inference pass
  • Expect 5-15 seconds per RAG query with AdvancedRetriever on local

Recommendation

Scenario Use
Testing/Development HybridRetriever (faster iteration)
Final output quality matters AdvancedRetriever
Using cloud provider (OpenAI, etc.) AdvancedRetriever works great
Using local Qwen only HybridRetriever is probably better tradeoff

Best of Both Worlds

Your current setup already defaults to openai/gpt-4o-mini for the rewriter and reranker in AdvancedRetriever (see advanced_retriever.py:121-122). So even if you use Qwen for the final generation, the retrieval/reranking still uses the cheap, fast cloud model for best quality context selection.


Developer Tools

Tracing Imports & Dependencies

Terminal Commands

grep / ripgrep (rg)

# Find all files importing a specific module
grep -r "from module_name import" .
grep -r "import module_name" .

# ripgrep is faster
rg "from config import"
rg "import config"

Python-specific

# Show module dependencies
python -c "import module_name; print(module_name.__file__)"

# Use pydeps to visualize dependencies
pydeps your_module.py

# Use pipdeptree for package dependencies
pipdeptree

Node.js/TypeScript-specific

# Find imports of a file
grep -r "from './filename'" .
grep -r "require('./filename')" .

# Use madge for dependency graphs
npx madge --circular src/
npx madge src/index.ts

IDE Features

  • VSCode: Right-click -> "Find All References" (Shift+F12)
  • VSCode: Right-click -> "Go to References"
  • PyCharm/WebStorm: Right-click -> "Find Usages" (Alt+F7)

Specialized Tools

Tool Language What it does
madge JS/TS Dependency graphs, circular detection
pydeps Python Visual dependency graphs
import-js JS Import analysis
vulture Python Find unused code/imports
ts-unused-exports TS Find unused exports

Quick One-Liners

# Count how many files import something
rg -l "import.*SomeClass" | wc -l

# See the actual import lines with context
rg -C 2 "from epic_engine"

Practical Examples

Finding Python Imports

# Find everything that imports from epic_engine
rg "from epic_engine" .

# Find what imports the config module specifically
rg "from epic_engine.core.config import"

# Find any file importing the reranker
rg "import.*reranker" --type py

Finding TypeScript/JavaScript Imports

# Find what imports the aiChat service
rg "from.*aiChat" frontend/

# Find all files importing from a specific hooks folder
rg "from.*AIChatModuleHooks" .

# Find require statements
rg "require\(.*aiChat" .

Finding Who Uses a Specific Function/Class

# Find all usages of a function called "retrieve_context"
rg "retrieve_context" --type py

# Find where AdvancedRetriever is used
rg "AdvancedRetriever" .

# Find with surrounding context (2 lines before/after)
rg -C 2 "useAIChat" frontend/

Checking Circular Dependencies

# For JavaScript/TypeScript projects
npx madge --circular frontend/

# For Python
pip install pydeps
pydeps aiservice/ --show-cycles

VS Code Keyboard Shortcuts

Multi-Cursor & Selection

Shortcut What It Does
Ctrl+D Select next occurrence of current word (keep pressing for more)
Ctrl+Shift+L Select ALL occurrences of current word at once
Alt+Click Add cursor at click location
Ctrl+Alt+Up/Down Add cursor above/below current line
Shift+Alt+I Add cursor at end of each selected line
Ctrl+U Undo last cursor operation

Line Manipulation

Shortcut What It Does
Alt+Up/Down Move entire line up/down
Shift+Alt+Up/Down Duplicate line up/down
Ctrl+Shift+K Delete entire line
Ctrl+Enter Insert blank line below
Ctrl+Shift+Enter Insert blank line above
Ctrl+] / Ctrl+[ Indent/outdent line

Navigation

Shortcut What It Does
Ctrl+G Go to specific line number
Ctrl+P Quick open file by name
Ctrl+Shift+O Go to symbol in current file
Ctrl+T Go to symbol across entire workspace
F12 Go to definition
Alt+F12 Peek definition (inline popup)
Shift+F12 Find all references
Ctrl+Shift+\ Jump to matching bracket
Alt+Left/Right Navigate back/forward (history)

Selection Expansion

Shortcut What It Does
Shift+Alt+Right Expand selection (word -> line -> block -> function)
Shift+Alt+Left Shrink selection
Ctrl+L Select entire current line
Ctrl+Shift+[ / ] Fold/unfold code block

Search & Replace

Shortcut What It Does
Ctrl+F Find in file
Ctrl+H Find and replace in file
Ctrl+Shift+F Find across all files
Ctrl+Shift+H Find and replace across all files
F3 / Shift+F3 Next/previous match

Code Actions

Shortcut What It Does
Ctrl+. Quick fix / show code actions (auto-imports, refactors)
F2 Rename symbol (updates all references)
Ctrl+Space Trigger IntelliSense/autocomplete
Ctrl+Shift+Space Show parameter hints
Shift+Alt+F Format entire document
Ctrl+K Ctrl+F Format selected code only

Commenting

Shortcut What It Does
Ctrl+/ Toggle line comment
Shift+Alt+A Toggle block comment

Editor Management

Shortcut What It Does
Ctrl+\ Split editor
Ctrl+1/2/3 Focus editor group 1/2/3
Ctrl+W Close current tab
Ctrl+K Z Zen mode (distraction-free)
Ctrl+B Toggle sidebar visibility
Ctrl+J Toggle terminal panel

Top Recommendations for Speed

  1. Ctrl+D - Essential for quick renames
  2. Alt+Up/Down - Move lines without cut/paste
  3. Ctrl+Shift+K - Delete lines instantly
  4. F2 - Smart rename across files
  5. Ctrl+. - Auto-fix problems, add imports
  6. Ctrl+P - Navigate files without touching mouse
  7. Shift+Alt+Up/Down - Duplicate code instantly

System Resource Monitoring

The EPIC Tester includes a real-time System Resource Monitor panel that displays CPU, RAM, GPU, and VRAM usage as you interact with the AI.

What the Monitor Tracks

Metric Description Update Interval
CPU Overall CPU usage percentage across all cores 1 second
RAM System memory usage (used GB / total GB) 1 second
GPU Graphics card utilization percentage 1 second
VRAM Video memory usage (used GB / total GB) 1 second

The monitor uses:

  • psutil for CPU, RAM, and Disk metrics
  • GPUtil for GPU and VRAM metrics (NVIDIA GPUs)
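
One polling tick of the panel can be sketched with psutil alone (GPU/VRAM would come from GPUtil on NVIDIA hardware; field names here are illustrative):

```python
import psutil

def sample_system():
    """Snapshot CPU and RAM usage, as the monitor polls every second."""
    mem = psutil.virtual_memory()
    return {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_gb": round(mem.used / 1e9, 1),
        "ram_total_gb": round(mem.total / 1e9, 1),
    }

snapshot = sample_system()
```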

Understanding GPU/VRAM Usage

When GPU/VRAM Shows 0%

If you see GPU and VRAM staying at 0% while making queries, it means the model is NOT running on your local GPU. This happens when:

Scenario GPU/VRAM Usage
Using cloud providers (OpenAI, Anthropic, etc.) 0% - Model runs on provider's servers
Using local/qwen Active - Model runs on your GPU
Using Ollama with GPU offload Active - Model runs on your GPU

When You WILL See GPU/VRAM Activity

GPU and VRAM graphs will show activity when:

  1. Local Qwen model (local/qwen) - The model loads into VRAM and runs inference on your GPU
  2. Embedding generation - If using a local embedding model
  3. Any local LLM that uses GPU acceleration

Resource Usage by Provider

Provider CPU Usage RAM Usage GPU Usage VRAM Usage
openai/gpt-4o-mini Low (network I/O) Low (Python + response parsing) None None
anthropic/claude-3-haiku Low (network I/O) Low (Python + response parsing) None None
deepseek/deepseek-chat Low (network I/O) Low (Python + response parsing) None None
local/qwen Medium (preprocessing) Medium (model loading) High (inference) High (model weights)

Cloud Provider Resource Pattern

When using cloud providers like OpenAI:

CPU:  [====------] 20-40%  (Python + async networking)
RAM:  [===-------] 15-30%  (Python runtime + response buffers)
GPU:  [----------] 0%      (Not used - inference in cloud)
VRAM: [----------] 0%      (Not used - model not loaded locally)

Local Model Resource Pattern

When using local/qwen:

CPU:  [=====-----] 40-60%  (Token processing + context)
RAM:  [======----] 50-70%  (Model metadata + context window)
GPU:  [========--] 70-95%  (Matrix operations during inference)
VRAM: [=======---] 60-80%  (Model weights + KV cache)

Hardware Requirements

Minimum Requirements for Local Models

Component Minimum Recommended Notes
GPU VRAM 4GB 6GB+ Qwen 2.5 7B needs ~5-6GB VRAM
System RAM 8GB 16GB+ For model loading + context
GPU NVIDIA GTX 1060 RTX 3050+ CUDA support required

VRAM Usage Estimates by Model Size

Model Size Estimated VRAM (FP16) Estimated VRAM (Q4 Quantized)
3B params ~6GB ~2GB
7B params ~14GB ~4-5GB
13B params ~26GB ~8GB
70B params ~140GB ~40GB

Note: EPIC uses quantized models (Q4_K_M) for efficiency. Your RTX 3050 6GB can comfortably run Qwen 2.5 7B quantized.
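
The table values follow a simple rule of thumb: weights-only VRAM is roughly parameter count times bytes per parameter (FP16 = 2 bytes, Q4 roughly 0.5 bytes), with KV cache and runtime overhead on top:

```python
def weights_vram_gb(params_billion, bytes_per_param):
    """Rough VRAM for model weights alone (KV cache and overhead are extra)."""
    return params_billion * bytes_per_param

fp16_7b = weights_vram_gb(7, 2.0)   # ~14 GB, matching the FP16 column
q4_7b   = weights_vram_gb(7, 0.5)   # ~3.5 GB; with overhead, the ~4-5 GB in the table
```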

Troubleshooting GPU Detection

GPU/VRAM Shows 0% Even with Local Model

  1. Check GPUtil installation:

    pip install gputil
    
  2. Verify NVIDIA drivers:

    nvidia-smi
    

    This should show your GPU and current usage.

  3. Check that llama-cpp-python is importable (this confirms the installation; it does not by itself prove GPU offload was compiled in):

    python -c "from llama_cpp import Llama; print('llama-cpp-python installed')"
    
  4. Verify CUDA is available:

    python -c "import torch; print(torch.cuda.is_available())"
    

Common Issues

Issue Cause Solution
GPUtil not detecting GPU Missing NVIDIA drivers Install/update NVIDIA drivers
GPU shows but VRAM is 0 Model not using GPU Reinstall llama-cpp-python with CUDA
High RAM but no GPU usage Model running on CPU Check CUDA installation
Monitor panel missing Import error Check pip install psutil gputil

Force GPU Usage for Local Models

If your local model is running on CPU instead of GPU, you may need to reinstall llama-cpp-python with CUDA support:

# Uninstall existing
pip uninstall llama-cpp-python

# Reinstall with CUDA support
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

For Windows:

$env:CMAKE_ARGS="-DLLAMA_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir