AI Integration Guide
A comprehensive guide for setting up, testing, and integrating AI with epic_engine.
Table of Contents
- Testing the AI
- Vector Store Sync
- VS Code / IDE Setup
- API Usage Links
- Package Installation
- epic_engine Integration
- UV Package Manager
- Running the AI Service
- Retriever Comparison
- Developer Tools
- System Resource Monitoring
Testing the AI
Test if Qwen actually generates text using one of these methods:
Option A: Use curl (in a new terminal)
curl -X POST http://localhost:8000/api/generate \
-H "Content-Type: application/json" \
-d "{\"prompt\": \"Write one sentence about a knight.\"}"
Option B: Use the browser
Go to: http://localhost:8000/docs
This opens FastAPI's built-in test interface where you can try the /api/generate endpoint with a form.
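Option C (a sketch): call the endpoint from Python using only the standard library. The request shape matches the curl example above; the "text" response key is an assumption, so check the /docs schema for the real field name.

```python
import json
from urllib import request

def build_request(prompt: str, url: str = "http://localhost:8000/api/generate"):
    """Build the same POST request the curl example sends."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return request.Request(url, data=body,
                           headers={"Content-Type": "application/json"})

def generate(prompt: str) -> str:
    # Generous timeout: the Qwen model lazy-loads on the first request
    with request.urlopen(build_request(prompt), timeout=120) as resp:
        payload = json.loads(resp.read())
    # "text" is an assumed response key - confirm it against /docs
    return payload.get("text", "")

# With the server running:
# generate("Write one sentence about a knight.")
```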
Test It
- Restart your server (Ctrl+C, then uv run python -m api.main)
- Go to http://localhost:8000/docs
- Try the new POST /api/generate/stream endpoint
Vector Store Sync
Is Your Existing KG Data Vectorized?
No. Your existing Knowledge Graph data is NOT synced to the vector store yet. The sync notifications only trigger for changes made after the sync code was added:
- New entities are created
- Existing entities are updated
- Entities are deleted
To sync existing data, you need to:
Call the full novel sync endpoint for each novel:
POST http://localhost:8000/api/sync/novel
Body: { "novel_id": "your-novel-uuid-here" }
This will:
- Fetch all entities for that novel from backend
- Chunk each entity
- Generate embeddings
- Store in ChromaDB
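Those four steps can be sketched as a small pipeline. Every name here (fetch_entities, chunk_entity, embed, and the ChromaDB-style collection.add signature) is illustrative, not the actual sync implementation:

```python
def sync_novel(novel_id, fetch_entities, chunk_entity, embed, collection):
    """Fetch -> chunk -> embed -> store, mirroring the steps above."""
    stored = 0
    for entity in fetch_entities(novel_id):
        for i, chunk in enumerate(chunk_entity(entity)):
            collection.add(
                ids=[f"{entity['id']}-{i}"],  # stable per-chunk IDs
                documents=[chunk],
                embeddings=[embed(chunk)],
                metadatas=[{"novel_id": novel_id, "entity_id": entity["id"]}],
            )
            stored += 1
    return stored  # number of chunks written to the vector store
```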
Check Sync Status
GET http://localhost:8000/api/sync/stats
If it returns collection_count: 0, nothing is vectorized yet.
VS Code / IDE Setup
Note: Those are VS Code/Pylance IDE warnings, not actual Python errors. The packages are installed correctly (verified with pip show), but VS Code is looking at a different Python interpreter.
How to Fix the IDE Warnings
- Press Ctrl+Shift+P in VS Code
- Type "Python: Select Interpreter"
- Choose the interpreter at: C:\Users\arman\AppData\Roaming\Python\Python313\python.exe
After selecting the correct interpreter, the red squiggly lines should disappear within a few seconds as Pylance re-analyzes the files.
Why This Happens
VS Code might be pointing to:
- A different Python version
- A virtual environment that doesn't have the packages
- The wrong system Python
The pip show output confirmed packages are in C:\Users\arman\AppData\Roaming\Python\Python313\site-packages, so you need to tell VS Code to use that same Python.
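To confirm which interpreter a given terminal is actually using, a quick stdlib check helps (fastapi below is just an example package name):

```python
# Run inside the VS Code integrated terminal to inspect the active Python.
import sys
import importlib.util

print(sys.executable)  # should match the interpreter you selected above

# Can this interpreter see a given package?
spec = importlib.util.find_spec("fastapi")
print("fastapi installed:", spec is not None)
```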
API Usage Links
Note: Log in with armandoblancq@gmail.com (or via Google sign-in)
| Provider | Link |
|---|---|
| OpenAI | https://platform.openai.com/logs |
| Anthropic | https://console.anthropic.com/workspaces/default/logs |
| Google Gemini | https://aistudio.google.com/app/logs and https://console.cloud.google.com/billing |
| DeepSeek | https://platform.deepseek.com/usage |
| Groq | https://console.groq.com/keys |
| Serper | https://serper.dev/logs |
Supported Providers
- OpenAI
- Anthropic
- Google Gemini
- DeepSeek
- Groq
- Serper
Package Installation
Additional dependencies for vector operations:
uv add "numba>=0.59" "umap-learn>=0.5.5"
How epic_engine Integrates with aiservice
Step 1: Install epic_engine as a package
cd package/epic_engine
pip install -e .
The -e flag means "editable" - it installs the package in development mode, linking to your source code rather than copying it.
Step 2: Import in aiservice
Your aiservice files would import from epic_engine instead of their local modules:
# Before (current aiservice)
from retrieval.hybrid_retriever import HybridRetriever
from engine.rag_engine import RAGEngine
from vectorstore.vector_store import VectorStore
# After (using epic_engine)
from epic_engine.retrieval import HybridRetriever
from epic_engine.rag import RAGEngine
from epic_engine.vectors import VectorStore
Step 3: aiservice becomes a thin API layer
Your aiservice would only contain:
- Flask/FastAPI routes (api/routes.py)
- Request/response schemas (api/schemas.py)
- Server configuration (api/server.py)
- Any app-specific logic not in the engine
Will Updates to epic_engine Reflect in aiservice?
YES - if you installed with pip install -e . (editable mode)
| Installation Method | Updates Reflected? |
|---|---|
| pip install -e . (editable) | Yes - immediately linked to source |
| pip install . (regular) | No - need to reinstall |
| pip install epic-engine (PyPI) | No - need to upgrade |
With editable install:
- Edit epic_engine/rag/engine.py
- Restart aiservice
- Changes are immediately available
Without editable install:
- Edit epic_engine/rag/engine.py
- Must run pip install . again
- Then restart aiservice
Recommended Workflow
| Scenario | Recommendation |
|---|---|
| During development | Use pip install -e . so changes reflect immediately |
| For production | Use regular pip install . or version from PyPI |
| For other projects | They can pip install epic-engine independently |
What Stays in aiservice vs epic_engine
| aiservice (API Layer) | epic_engine (Core Library) |
|---|---|
| Flask/FastAPI routes | RAG engine |
| HTTP request handling | Vector store |
| Authentication | Knowledge graph |
| API schemas | Providers (OpenAI, etc.) |
| Server startup | Agents |
| App-specific configs | Prompts |
Key Benefit: This separation means you could build a completely different app (CLI tool, desktop app, another API) using the same epic_engine library.
UV Package Manager
pyproject.toml Configuration
Path dependencies need to be in a separate [tool.uv.sources] section:
[project]
dependencies = [
"epic-engine", # keep as string here
"fastapi",
"uvicorn",
"httpx",
]
[tool.uv.sources]
epic-engine = { path = "../package/epic_engine" }
Warning: Putting epic-engine = { path = "..." } inside the dependencies array will fail: entries in [project] dependencies must be plain requirement strings, so path sources belong in [tool.uv.sources].
Transitive Dependencies
The uv.lock file contains all transitive dependencies - not just your 4 direct dependencies, but everything those packages depend on:
Even though your pyproject.toml only lists 4 packages, the full dependency tree is ~143 packages.
When to Run uv sync
You only need to run uv sync again if:
- You add/remove dependencies in epic_engine's pyproject.toml
- You change the package structure (add new submodules to __init__.py)
For normal code changes (fixing bugs, improving logic, adding functions to existing files), just save and restart the server.
Reinstalling epic-engine
When uv sync uses a cached version that doesn't have your new changes:
uv sync --reinstall-package epic-engine
This tells uv to rebuild and reinstall epic-engine from the source path.
Running the AI Service
Quick Start Commands
# Reinstall epic-engine (after making changes)
uv sync --reinstall-package epic-engine
# Start the AI service
uv run python -m api.main
After starting, test these endpoints:
| Endpoint | Expected Response |
|---|---|
| http://localhost:8000/ | "Epic AI Service is running" |
| http://localhost:8000/api/health | healthy status |
Note: The Qwen model uses lazy loading - it loads on first use, not at startup.
Resync Novel Data
Open a new terminal while aiservice is running:
# PowerShell
Invoke-RestMethod -Uri "http://localhost:8000/api/sync/novel" `
-Method POST `
-ContentType "application/json" `
-Body '{"novel_id": "cmise0o310000h21zkiofct47"}'
Test Interface
# Launch the test GUI
uv run python test_interface.py
# Or the EPIC Tester
uv run python EPIC_Tester.py
Note: Make sure your backend is running (npm run dev in the backend folder) since the test interface needs to connect to the Knowledge Graph API.
Start the backend first:
cd backend
npm run dev
Prisma Migrations
To create the PlotThread table in the database:
cd backend
# Quick development sync
npx prisma db push
# Or proper migration
npx prisma migrate dev --name add_plot_thread
The db push is quicker for development - it syncs your schema without creating migration files.
Retriever Comparison
How to Switch Retrievers
In routes.py line 381, change:
# Current: LLM reranking, dual search
_DEFAULT_RETRIEVER_TYPE = "advanced"
# Alternative: Faster, score-based RRF fusion
_DEFAULT_RETRIEVER_TYPE = "hybrid"
HybridRetriever
| Feature | Has It? | Details |
|---|---|---|
| Query Rewriting | Yes | Uses QueryRewriter (lines 60-65 in hybrid.py) |
| Vector Search | Yes | Parallel with KG search |
| KG Search | Yes | Parallel with vector search |
| KG Traversal | Yes | From seed entities |
| Dual Query Search | No | Only searches with ONE query (rewritten if enabled) |
| LLM Reranking | No | Uses RRF score fusion (mathematical, no LLM) |
| ThreadPoolExecutor Workers | 2 | Vector search + KG search run in parallel |
AdvancedRetriever
| Feature | Has It? | Details |
|---|---|---|
| Query Rewriting | Yes | Uses QueryRewriter |
| Vector Search | Yes | Parallel |
| KG Search | Yes | Parallel |
| KG Traversal | Yes | From seed entities |
| Dual Query Search | Yes | Searches with BOTH original AND rewritten queries |
| LLM Reranking | Yes | Uses Reranker with LLM to judge relevance |
| LLM Validation | No | |
| LLM Reasoning | No | |
| ThreadPoolExecutor Workers | 3 | Original query + rewritten query + KG search run in parallel |
RVRGRetriever (Retrieve-Validate-Reason-Generate)
The most advanced retriever with full LLM-powered pipeline for highest quality context.
| Feature | Has It? | Details |
|---|---|---|
| Query Rewriting | Yes | Uses QueryRewriter |
| Vector Search | Yes | Parallel (original + rewritten queries) |
| KG Search | Yes | Parallel |
| KG Traversal | Yes | From seed entities |
| Dual Query Search | Yes | Searches with BOTH original AND rewritten queries |
| LLM Reranking | Yes | Uses Reranker with LLM to judge relevance |
| LLM Validation | Yes | Filters out irrelevant results with explanations |
| LLM Reasoning | Yes | Extracts insights, connections, and identifies gaps |
| ThreadPoolExecutor Workers | 3 | |
RVRG Pipeline Stages:
- Retrieve - Query rewrite → Vector search → KG search → KG traversal → Rerank
- Validate - LLM evaluates each result for relevance (score 0-1, with explanations)
- Reason - LLM analyzes validated context to extract insights, find connections, identify gaps
- Generate - Final answer generation with reasoning summary as guidance
Unique Features:
- Validation with explanations - Each result gets a relevance score and reason why it's relevant/irrelevant
- Gap identification - Detects what information is missing to fully answer the question
- Confidence scoring - Reports whether context is sufficient to answer (can_answer, answer_confidence)
- Key insights extraction - Pulls out important facts from context with source attribution
- Connection discovery - Finds relationships between entities/events in context
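The four stages can be sketched end-to-end. All six callables below are injected stand-ins for the real LLM-backed components; the names and the 0.5 relevance threshold are assumptions, not the actual RVRGRetriever internals:

```python
def rvrg(query, rewrite, search, rerank, validate, reason, generate):
    """Retrieve -> Validate -> Reason -> Generate, per the stages above."""
    # Retrieve: search with BOTH the original and the rewritten query
    rewritten = rewrite(query)
    candidates = rerank(query, search(query) + search(rewritten))
    # Validate: keep only results the LLM scores as relevant (threshold assumed)
    validated = [r for r in candidates
                 if validate(query, r)["score"] >= 0.5]
    # Reason: extract insights/gaps, then Generate with that guidance
    reasoning = reason(query, validated)
    return generate(query, validated, reasoning)
```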
The ACTUAL Key Differences
| Feature | HybridRetriever | AdvancedRetriever | RVRGRetriever |
|---|---|---|---|
| Queries Used for Search | 1 (rewritten only) | 2 (original + rewritten) | 2 (original + rewritten) |
| Ranking Method | RRF score fusion (math) | LLM judges relevance | LLM judges relevance |
| Validation Step | No | No | Yes (filters irrelevant results) |
| Reasoning Step | No | No | Yes (extracts insights & gaps) |
| LLM Calls During Retrieval | 1 (rewriting only) | 2 (rewriting + reranking) | 4 (rewrite + rerank + validate + reason) |
| Speed | Fastest | Medium | Slowest |
| Cost | Lowest | Medium | Highest |
| Quality | Good | Better | Best |
All three have query rewriting. The key progression is: Hybrid (fast, math-based) → Advanced (adds LLM reranking) → RVRG (adds validation and reasoning for highest quality).
Which Retriever is Better?
| Priority | Better Choice |
|---|---|
| Highest Quality | RVRGRetriever |
| Complex queries | RVRGRetriever |
| Good quality + speed | AdvancedRetriever |
| Speed | HybridRetriever |
| Cost (API calls) | HybridRetriever |
| Simple lookups | HybridRetriever |
| Know if answer exists | RVRGRetriever (has can_answer flag) |
What is ThreadPoolExecutor Workers?
ThreadPoolExecutor is Python's way to run multiple tasks in parallel (at the same time).
- Workers = 2 means 2 tasks can run simultaneously
- Workers = 3 means 3 tasks can run simultaneously
In your retrievers:
- HybridRetriever (2 workers): Runs Vector Search + KG Search in parallel
- AdvancedRetriever (3 workers): Runs Original Query Search + Rewritten Query Search + KG Search in parallel
More workers = faster retrieval (tasks don't wait for each other).
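A minimal sketch of the HybridRetriever's 2-worker pattern (the two search callables are stand-ins, not the real epic_engine functions):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_search(query, vector_search, kg_search):
    """Run vector search and KG search at the same time with 2 workers."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query)
        kg_future = pool.submit(kg_search, query)
        # .result() blocks until each task finishes
        return vec_future.result(), kg_future.result()

# Trivial stand-ins to show the flow:
vec, kg = parallel_search("knight",
                          lambda q: [f"vector hit for {q}"],
                          lambda q: [f"kg hit for {q}"])
```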
LLM Reranking vs RRF Score Fusion: Which is More Accurate?
LLM Reranking (AdvancedRetriever) - More Accurate
How it works: An LLM reads each result and the original query, then judges: "Is this result actually relevant to what the user asked?"
Pros:
- Understands semantic meaning and context
- Can recognize when a result looks related but isn't actually helpful
- Handles nuance, synonyms, and intent
Cons:
- Slower (requires LLM API call)
- Costs money (API tokens)
- Can hallucinate or make mistakes
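Schematically, a reranking call looks like the sketch below; the prompt wording and the llm callable are illustrative, not the actual Reranker internals:

```python
def llm_rerank(query, candidates, llm):
    """Ask an LLM to score each candidate 0-1, then sort best-first."""
    scored = []
    for cand in candidates:
        prompt = (f"Query: {query}\nResult: {cand}\n"
                  "On a scale of 0 to 1, how relevant is this result? "
                  "Answer with a number only.")
        try:
            score = float(llm(prompt).strip())
        except ValueError:
            score = 0.0  # unparseable reply: treat as irrelevant
        scored.append((score, cand))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored]
```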
RRF Score Fusion (HybridRetriever) - Faster, but Less Accurate
How it works: Mathematical formula that combines rankings from different sources:
RRF_score = Σ (1 / (k + rank_i))
Where k is typically 60, and rank_i is the position in each result list.
Pros:
- Instant (pure math, no LLM call)
- Free (no API costs)
- Consistent and predictable
Cons:
- Doesn't understand meaning - just combines numbers
- A result ranked #1 in both lists wins, even if it's not actually relevant
- Can't handle cases where high-scoring results are semantically wrong
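The RRF formula above fits in a few lines. This sketch assumes each input list is ordered best-first and uses the typical k = 60:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) across result lists."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):  # rank is 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Pure arithmetic on positions - the content is never inspected
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["knight", "castle", "dragon"]
kg_hits = ["castle", "knight", "queen"]
fused = rrf_fuse([vector_hits, kg_hits])
```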
Bottom Line
| Method | Accuracy | Speed | Cost |
|---|---|---|---|
| LLM Reranking | Higher | Slower | Higher |
| RRF Score Fusion | Lower | Faster | Free |
LLM reranking is more accurate because it actually understands the query and results. RRF just does math on rankings without understanding content. For a creative writing app like EPIC where context quality matters, AdvancedRetriever with LLM reranking will give better results - but at the cost of speed and API calls.
Speed & Cost Comparison
Speed Difference
| Retriever | LLM Calls During Retrieval | Estimated Time |
|---|---|---|
| HybridRetriever | 1 call (query rewriting) | ~1-2 seconds |
| AdvancedRetriever | 2 calls (rewriting + reranking) | ~3-5 seconds |
| RVRGRetriever | 4 calls (rewrite + rerank + validate + reason) | ~6-12 seconds |
The validation and reasoning steps add significant time because:
- Validation sends all reranked results to LLM for relevance scoring with explanations
- Reasoning analyzes validated context to extract insights, connections, and gaps
- Each step requires a full LLM inference pass
Rough estimate: AdvancedRetriever is 2-3x slower than Hybrid, and RVRGRetriever is 2-3x slower than Advanced.
API Cost Difference
Both retrievers use the user-selected model from the test interface dropdown (e.g., openai/gpt-4o-mini, anthropic/claude-3-haiku, deepseek/deepseek-chat, etc.). The model is passed through the provider/model settings.
Example pricing (GPT-4o-mini as of early 2025):
- Input: ~$0.15 per 1M tokens
- Output: ~$0.60 per 1M tokens
| Retriever | Tokens per Query (estimate) | Cost per Query |
|---|---|---|
| HybridRetriever | ~500 tokens (rewrite only) | ~$0.0001 |
| AdvancedRetriever | ~2000-3000 tokens (rewrite + rerank 20 results) | ~$0.0003-0.0005 |
| RVRGRetriever | ~4000-6000 tokens (rewrite + rerank + validate + reason) | ~$0.0006-0.001 |
RVRGRetriever costs roughly 6-10x more per query than HybridRetriever due to the additional LLM calls for validation and reasoning.
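These per-query figures follow from simple token arithmetic. A worked version using the example GPT-4o-mini rates above (the token counts are rough assumptions, as in the table):

```python
def query_cost(input_tokens, output_tokens,
               in_rate=0.15, out_rate=0.60):
    """Cost in dollars at per-1M-token rates (GPT-4o-mini example pricing)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# ~2000 input + ~200 output tokens for an AdvancedRetriever query:
advanced = query_cost(2000, 200)  # 0.00042 dollars
# ~450 input + ~50 output tokens for a HybridRetriever rewrite-only query:
hybrid = query_cost(450, 50)      # roughly 0.0001 dollars, matching the table
```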
Real-World Impact
| Usage | HybridRetriever | AdvancedRetriever | RVRGRetriever |
|---|---|---|---|
| 100 queries/day | ~$0.01/day | ~$0.03-0.05/day | ~$0.06-0.10/day |
| 1000 queries/day | ~$0.10/day | ~$0.30-0.50/day | ~$0.60-1.00/day |
Bottom line: The cost difference is negligible for personal use. RVRGRetriever provides the highest quality context with validation and reasoning, but at 2-3x the cost of AdvancedRetriever. The speed difference is more noticeable than the cost.
Can Qwen 2.5 7B Handle AdvancedRetriever?
Yes, but with caveats.
What Qwen 2.5 7B Needs to Do
| Task | Difficulty for 7B Model |
|---|---|
| Query Rewriting | Easy - Simple text transformation |
| LLM Reranking | Harder - Must evaluate 10-20 results against query |
Potential Issues
1. Reranking Quality
- Reranking requires the model to understand relevance at a deeper level
- 7B models can do this, but not as well as GPT-4o-mini or Claude
- You might get slightly worse ranking than with a cloud model
2. Context Window Pressure
- Reranking sends ALL candidate results (up to 20 entities) to the LLM
- Each entity has name + description + metadata
- Could easily be 2000-4000 tokens just for the reranking prompt
- Your Qwen is configured with 8192 context window - should be fine, but tight
3. Speed
- Local 7B model on RTX 3050 6GB is slower than cloud APIs
- Reranking adds another full inference pass
- Expect 5-15 seconds per RAG query with AdvancedRetriever on local
Recommendation
| Scenario | Use |
|---|---|
| Testing/Development | HybridRetriever (faster iteration) |
| Final output quality matters | AdvancedRetriever |
| Using cloud provider (OpenAI, etc.) | AdvancedRetriever works great |
| Using local Qwen only | HybridRetriever is probably better tradeoff |
Best of Both Worlds
Your current setup already defaults to openai/gpt-4o-mini for the rewriter and reranker in AdvancedRetriever (see advanced_retriever.py:121-122). So even if you use Qwen for the final generation, the retrieval/reranking still uses the cheap, fast cloud model for best quality context selection.
Developer Tools
Tracing Imports & Dependencies
Terminal Commands
grep / ripgrep (rg)
# Find all files importing a specific module
grep -r "from module_name import" .
grep -r "import module_name" .
# ripgrep is faster
rg "from config import"
rg "import config"
Python-specific
# Show module dependencies
python -c "import module_name; print(module_name.__file__)"
# Use pydeps to visualize dependencies
pydeps your_module.py
# Use pipdeptree for package dependencies
pipdeptree
Node.js/TypeScript-specific
# Find imports of a file
grep -r "from './filename'" .
grep -r "require('./filename')" .
# Use madge for dependency graphs
npx madge --circular src/
npx madge src/index.ts
IDE Features
- VSCode: Right-click -> "Find All References" (Shift+F12)
- VSCode: Right-click -> "Go to References"
- PyCharm/WebStorm: Right-click -> "Find Usages" (Alt+F7)
Specialized Tools
| Tool | Language | What it does |
|---|---|---|
| madge | JS/TS | Dependency graphs, circular detection |
| pydeps | Python | Visual dependency graphs |
| import-js | JS | Import analysis |
| vulture | Python | Find unused code/imports |
| ts-unused-exports | TS | Find unused exports |
Quick One-Liners
# Count how many files import something
rg -l "import.*SomeClass" | wc -l
# See the actual import lines with context
rg -C 2 "from epic_engine"
Practical Examples
Finding Python Imports
# Find everything that imports from epic_engine
rg "from epic_engine" .
# Find what imports the config module specifically
rg "from epic_engine.core.config import"
# Find any file importing the reranker
rg "import.*reranker" --type py
Finding TypeScript/JavaScript Imports
# Find what imports the aiChat service
rg "from.*aiChat" frontend/
# Find all files importing from a specific hooks folder
rg "from.*AIChatModuleHooks" .
# Find require statements
rg "require\(.*aiChat" .
Finding Who Uses a Specific Function/Class
# Find all usages of a function called "retrieve_context"
rg "retrieve_context" --type py
# Find where AdvancedRetriever is used
rg "AdvancedRetriever" .
# Find with surrounding context (2 lines before/after)
rg -C 2 "useAIChat" frontend/
Checking Circular Dependencies
# For JavaScript/TypeScript projects
npx madge --circular frontend/
# For Python
pip install pydeps
pydeps aiservice/ --show-cycles
VS Code Keyboard Shortcuts
Multi-Cursor & Selection
| Shortcut | What It Does |
|---|---|
| Ctrl+D | Select next occurrence of current word (keep pressing for more) |
| Ctrl+Shift+L | Select ALL occurrences of current word at once |
| Alt+Click | Add cursor at click location |
| Ctrl+Alt+Up/Down | Add cursor above/below current line |
| Shift+Alt+I | Add cursor at end of each selected line |
| Ctrl+U | Undo last cursor operation |
Line Manipulation
| Shortcut | What It Does |
|---|---|
| Alt+Up/Down | Move entire line up/down |
| Shift+Alt+Up/Down | Duplicate line up/down |
| Ctrl+Shift+K | Delete entire line |
| Ctrl+Enter | Insert blank line below |
| Ctrl+Shift+Enter | Insert blank line above |
| Ctrl+] / Ctrl+[ | Indent/outdent line |
Navigation
| Shortcut | What It Does |
|---|---|
| Ctrl+G | Go to specific line number |
| Ctrl+P | Quick open file by name |
| Ctrl+Shift+O | Go to symbol in current file |
| Ctrl+T | Go to symbol across entire workspace |
| F12 | Go to definition |
| Alt+F12 | Peek definition (inline popup) |
| Shift+F12 | Find all references |
| Ctrl+Shift+\ | Jump to matching bracket |
| Alt+Left/Right | Navigate back/forward (history) |
Selection Expansion
| Shortcut | What It Does |
|---|---|
| Shift+Alt+Right | Expand selection (word -> line -> block -> function) |
| Shift+Alt+Left | Shrink selection |
| Ctrl+L | Select entire current line |
| Ctrl+Shift+[ / ] | Fold/unfold code block |
Search & Replace
| Shortcut | What It Does |
|---|---|
| Ctrl+F | Find in file |
| Ctrl+H | Find and replace in file |
| Ctrl+Shift+F | Find across all files |
| Ctrl+Shift+H | Find and replace across all files |
| F3 / Shift+F3 | Next/previous match |
Code Actions
| Shortcut | What It Does |
|---|---|
| Ctrl+. | Quick fix / show code actions (auto-imports, refactors) |
| F2 | Rename symbol (updates all references) |
| Ctrl+Space | Trigger IntelliSense/autocomplete |
| Ctrl+Shift+Space | Show parameter hints |
| Shift+Alt+F | Format entire document |
| Ctrl+K Ctrl+F | Format selected code only |
Commenting
| Shortcut | What It Does |
|---|---|
| Ctrl+/ | Toggle line comment |
| Shift+Alt+A | Toggle block comment |
Editor Management
| Shortcut | What It Does |
|---|---|
| Ctrl+\ | Split editor |
| Ctrl+1/2/3 | Focus editor group 1/2/3 |
| Ctrl+W | Close current tab |
| Ctrl+K Z | Zen mode (distraction-free) |
| Ctrl+B | Toggle sidebar visibility |
| Ctrl+J | Toggle terminal panel |
Top Recommendations for Speed
- Ctrl+D - Essential for quick renames
- Alt+Up/Down - Move lines without cut/paste
- Ctrl+Shift+K - Delete lines instantly
- F2 - Smart rename across files
- Ctrl+. - Auto-fix problems, add imports
- Ctrl+P - Navigate files without touching the mouse
- Shift+Alt+Up/Down - Duplicate code instantly
System Resource Monitoring
The EPIC Tester includes a real-time System Resource Monitor panel that displays CPU, RAM, GPU, and VRAM usage as you interact with the AI.
What the Monitor Tracks
| Metric | Description | Update Interval |
|---|---|---|
| CPU | Overall CPU usage percentage across all cores | 1 second |
| RAM | System memory usage (used GB / total GB) | 1 second |
| GPU | Graphics card utilization percentage | 1 second |
| VRAM | Video memory usage (used GB / total GB) | 1 second |
The monitor uses:
- psutil for CPU, RAM, and Disk metrics
- GPUtil for GPU and VRAM metrics (NVIDIA GPUs)
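A hedged sketch of one polling tick. The dict keys and GB formatting are illustrative, not the actual monitor code, and the psutil/GPUtil calls degrade to defaults when the libraries are missing:

```python
def format_usage(used_bytes, total_bytes):
    """Render 'used GB / total GB' the way the monitor shows RAM/VRAM."""
    gb = 1024 ** 3
    return f"{used_bytes / gb:.1f} / {total_bytes / gb:.1f} GB"

def sample():
    """Collect one snapshot of CPU/RAM/GPU/VRAM usage."""
    stats = {"cpu": 0.0, "ram": "n/a", "gpu": 0.0, "vram": "n/a"}
    try:
        import psutil
        mem = psutil.virtual_memory()
        stats["cpu"] = psutil.cpu_percent(interval=1.0)  # 1-second interval
        stats["ram"] = format_usage(mem.used, mem.total)
    except ImportError:
        pass
    try:
        import GPUtil
        gpus = GPUtil.getGPUs()
        if gpus:  # NVIDIA only; fields stay at defaults otherwise
            stats["gpu"] = gpus[0].load * 100
            # GPUtil reports memory in MB; convert to bytes for formatting
            stats["vram"] = format_usage(gpus[0].memoryUsed * 1024**2,
                                         gpus[0].memoryTotal * 1024**2)
    except ImportError:
        pass
    return stats
```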
Understanding GPU/VRAM Usage
When GPU/VRAM Shows 0%
If you see GPU and VRAM staying at 0% while making queries, it means the model is NOT running on your local GPU. This happens when:
| Scenario | GPU/VRAM Usage |
|---|---|
| Using cloud providers (OpenAI, Anthropic, etc.) | 0% - Model runs on provider's servers |
| Using local/qwen | Active - Model runs on your GPU |
| Using Ollama with GPU offload | Active - Model runs on your GPU |
When You WILL See GPU/VRAM Activity
GPU and VRAM graphs will show activity when:
- Local Qwen model (local/qwen) - The model loads into VRAM and runs inference on your GPU
- Embedding generation - If using a local embedding model
- Any local LLM that uses GPU acceleration
Resource Usage by Provider
| Provider | CPU Usage | RAM Usage | GPU Usage | VRAM Usage |
|---|---|---|---|---|
| openai/gpt-4o-mini | Low (network I/O) | Low (Python + response parsing) | None | None |
| anthropic/claude-3-haiku | Low (network I/O) | Low (Python + response parsing) | None | None |
| deepseek/deepseek-chat | Low (network I/O) | Low (Python + response parsing) | None | None |
| local/qwen | Medium (preprocessing) | Medium (model loading) | High (inference) | High (model weights) |
Cloud Provider Resource Pattern
When using cloud providers like OpenAI:
CPU: [====------] 20-40% (Python + async networking)
RAM: [===-------] 15-30% (Python runtime + response buffers)
GPU: [----------] 0% (Not used - inference in cloud)
VRAM: [----------] 0% (Not used - model not loaded locally)
Local Model Resource Pattern
When using local/qwen:
CPU: [=====-----] 40-60% (Token processing + context)
RAM: [======----] 50-70% (Model metadata + context window)
GPU: [========--] 70-95% (Matrix operations during inference)
VRAM: [=======---] 60-80% (Model weights + KV cache)
Hardware Requirements
Minimum Requirements for Local Models
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| GPU VRAM | 4GB | 6GB+ | Qwen 2.5 7B needs ~5-6GB VRAM |
| System RAM | 8GB | 16GB+ | For model loading + context |
| GPU | NVIDIA GTX 1060 | RTX 3050+ | CUDA support required |
VRAM Usage Estimates by Model Size
| Model Size | Estimated VRAM (FP16) | Estimated VRAM (Q4 Quantized) |
|---|---|---|
| 3B params | ~6GB | ~2GB |
| 7B params | ~14GB | ~4-5GB |
| 13B params | ~26GB | ~8GB |
| 70B params | ~140GB | ~40GB |
Note: EPIC uses quantized models (Q4_K_M) for efficiency. Your RTX 3050 6GB can comfortably run Qwen 2.5 7B quantized.
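The table's numbers follow from bytes-per-weight arithmetic (FP16 is 2 bytes per parameter; Q4-style quantization is roughly 4-5 bits per weight), plus runtime overhead for the KV cache. A rough estimator, with the 4.5 bits/weight figure being an approximation:

```python
def vram_gb(params_billion, bits_per_weight):
    """Approximate VRAM for the weights alone (no KV cache or overhead)."""
    bytes_total = params_billion * 1e9 * (bits_per_weight / 8)
    return bytes_total / 1024**3

fp16_7b = vram_gb(7, 16)   # ~13 GB; the table's ~14GB includes overhead
q4_7b = vram_gb(7, 4.5)    # ~3.7 GB at ~4.5 bits/weight (Q4-style)
```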
Troubleshooting GPU Detection
GPU/VRAM Shows 0% Even with Local Model
- Check GPUtil installation: pip install gputil
- Verify NVIDIA drivers: nvidia-smi (this should show your GPU and current usage)
- Check that llama-cpp-python is importable: python -c "from llama_cpp import Llama; print('llama-cpp-python OK')" (a successful import alone does not prove GPU support - see the reinstall steps below)
- Verify CUDA is available: python -c "import torch; print(torch.cuda.is_available())"
Common Issues
| Issue | Cause | Solution |
|---|---|---|
| GPUtil not detecting GPU | Missing NVIDIA drivers | Install/update NVIDIA drivers |
| GPU shows but VRAM is 0 | Model not using GPU | Reinstall llama-cpp-python with CUDA |
| High RAM but no GPU usage | Model running on CPU | Check CUDA installation |
| Monitor panel missing | Import error | Check pip install psutil gputil |
Force GPU Usage for Local Models
If your local model is running on CPU instead of GPU, you may need to reinstall llama-cpp-python with CUDA support:
# Uninstall existing
pip uninstall llama-cpp-python
# Reinstall with CUDA support
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
For Windows:
$env:CMAKE_ARGS="-DLLAMA_CUDA=on"
pip install llama-cpp-python --force-reinstall --no-cache-dir