AI LLM for Financial Data Analysis
Technical Proposal — On-Premise Deployment on Windows Server 2019
| Available RAM | CPU Cores | Storage | Software Cost |
|---|---|---|---|
| 16 GB | 4 vCPU | 5.3 TB | €0 |
Project Goals:
- Financial AI — Process financial data and provide accurate responses via REST API
- General AI — ChatGPT-like web interface for general topics, user documents, and team queries
Current Server Analysis
| Component | Current Value | Status |
|---|---|---|
| Operating System | Windows Server 2019 Standard (Build 17763) 64-bit | OK |
| CPU | Intel Xeon E312xx (Sandy Bridge) — 4 vCPUs @ 2.2 GHz | Limited |
| RAM | 16 GB (13.9 GB free) | Tight |
| Storage | 5.3 TB NTFS (5.26 TB free) | Excellent |
| GPU | None (Virtual Display Adapter) | N/A |
| Server Type | Rented VM (KVM/QEMU) — corporate hosting provider | Confirmed |
| Nested Virtualization | Disabled (Hyper-V not available) | Use WSL2/Docker |
CPU Note: Sandy Bridge (2011) supports AVX but not AVX2, the instruction set modern LLM inference engines rely on for speed. Expect roughly 40-60% lower inference throughput than on a comparable modern CPU.
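This can be verified from inside WSL2 by inspecting the CPU flags Linux exposes; a minimal stdlib-only check (the path parameter exists only to make the function testable):

```python
def cpu_has_flag(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Check whether the CPU advertises an instruction-set flag (Linux/WSL2).

    Sandy Bridge reports "avx" but not "avx2".
    """
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return flag in line.split()
    except FileNotFoundError:
        pass  # not running on Linux, or /proc unavailable
    return False

print("AVX2:", "yes" if cpu_has_flag("avx2") else "not detected")
```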
1. Hardware Estimation
Memory Budget on 16 GB RAM
| Component | RAM Usage | Notes |
|---|---|---|
| Windows OS + Services | ~3.0 GB | Base system overhead |
| WSL2 (Ubuntu) | ~0.5 GB | Linux subsystem kernel |
| LLM Model (Phi-3-mini Q4) | ~4.0 GB | Model weights in memory |
| Embedding Model (MiniLM) | ~0.3 GB | For vector search |
| ChromaDB (Vector Store) | ~2.0 GB | Document embeddings |
| FastAPI + RAG Pipeline | ~1.0 GB | Application layer |
| Open WebUI (Chat Interface) | ~0.5 GB | Web interface |
| Total Estimated | ~11.3 GB | Buffer: ~4.7 GB |
Verdict: 16 GB is tight but workable with a lightweight 3-4B parameter model. Storage (5.3 TB) is more than sufficient.
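The budget above can be sanity-checked with quick arithmetic (figures taken directly from the table):

```python
# Memory budget from the table, in GB
budget = {
    "Windows OS + services": 3.0,
    "WSL2 kernel": 0.5,
    "Phi-3-mini Q4 weights": 4.0,
    "MiniLM embeddings": 0.3,
    "ChromaDB": 2.0,
    "FastAPI + RAG": 1.0,
    "Open WebUI": 0.5,
}
total = round(sum(budget.values()), 1)   # 11.3 GB
buffer = round(16.0 - total, 1)          # 4.7 GB headroom
print(f"Total: {total} GB, buffer: {buffer} GB")
```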
Model Options for 16 GB RAM
| Model | Size | RAM | Quality | Speed | Response | Verdict |
|---|---|---|---|---|---|---|
| Phi-3-mini 3.8B Q4 | 2.3 GB | ~4 GB | 6/10 | 2-3 tok/s | ~40s | Best Choice |
| Qwen2.5-3B Q4 | 2.0 GB | ~3.5 GB | 6/10 | 2-3 tok/s | ~40s | Alternative |
| Gemma-2 2B Q4 | 1.5 GB | ~3 GB | 5/10 | 3-4 tok/s | ~30s | Faster, less accurate |
| Qwen2.5-1.5B Q4 | 1.0 GB | ~2.5 GB | 4/10 | 3-5 tok/s | ~25s | Lightweight fallback |
| Mistral 7B Q4 | 4.4 GB | ~8 GB | 8/10 | 0.5-1 tok/s | ~3m | Too large |
RAG Advantage: With Retrieval-Augmented Generation, even a 3.8B model can produce accurate answers about your financial data. The RAG pipeline retrieves the relevant documents and feeds them to the model as context, so the model synthesizes from your own data instead of relying on what it memorized during training.
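The retrieve-then-generate flow can be sketched in a few lines. This toy version substitutes bag-of-words cosine similarity for MiniLM embeddings and omits the LLM call, purely to show where retrieved context enters the prompt:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for MiniLM: a bag-of-words vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query, keep top-k
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Q3 revenue grew 12% year over year.",
    "The cafeteria menu changes weekly.",
    "Operating margin in Q3 was 18%.",
]
print(build_prompt("What was Q3 revenue growth?", docs))
```

In the real pipeline, MiniLM-L6-v2 replaces `embed`, ChromaDB replaces the in-memory ranking, and the assembled prompt is sent to Phi-3-mini via Ollama.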
Performance Expectations
All response times below assume ~100 tokens per response.
Concurrent Users
| Users | Response Time | Experience |
|---|---|---|
| 1 user | ~45 seconds | Functional |
| 2 users | ~90 seconds | Slow but works |
| 3+ users | Timeout risk | Not recommended |
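The degradation is easy to model: with no GPU and all four vCPUs saturated by a single generation, concurrent requests effectively queue and complete one after another (a simplification; an inference server may instead interleave requests, which slows every user down proportionally):

```python
def queued_response_time(users: int, single_user_s: float = 45.0) -> float:
    """Worst-case wait for the last user in a FIFO queue:
    all earlier requests must finish first."""
    return users * single_user_s

for n in (1, 2, 3):
    print(f"{n} user(s): ~{queued_response_time(n):.0f}s")
```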
Quality: Model Only vs Model + RAG
| Task | Model Only | Model + RAG |
|---|---|---|
| General Chat | Decent | Good |
| Financial Terminology | Weak | Good |
| Financial Reasoning | Poor | Moderate |
| Document Summarization | OK | Good |
| Number Interpretation | Weak | Moderate |
2. Required Software
Total Software Cost: €0 — All components are free and open-source.
| Component | Role | Cost |
|---|---|---|
| WSL2 + Ubuntu 22.04 | Linux environment | Free |
| Docker Engine | Container runtime | Free |
| Ollama | LLM inference engine | Free |
| llama.cpp | CPU inference backend | Free |
| Phi-3-mini 3.8B | LLM model (Microsoft) | Free (MIT) |
| ChromaDB | Vector database | Free |
| Python 3.11+ | Application runtime | Free |
| FastAPI | REST API framework | Free |
| Open WebUI | Chat interface | Free |
| LangChain | RAG orchestration | Free |
| sentence-transformers | Embedding generation | Free |
| Unstructured / PyMuPDF | Document parsing | Free |
3. Proposed Architecture Design
All services run inside WSL2 on the existing Windows Server. No external network connections. ASP.NET connects via REST API on localhost.
User Interface Layer
- ASP.NET App: REST API calls
- Open WebUI: chat interface (:8080)
- REST API: external apps (:8000)

Application Layer (WSL2 — Ubuntu 22.04)
- FastAPI Gateway: query router + RAG (:8000)
- RAG Pipeline: search + context injection

AI Engine Layer
- Ollama: LLM inference (:11434)
- Phi-3-mini 3.8B: quantized Q4 model
- MiniLM-L6-v2: embedding model

Data Layer
- ChromaDB: vector embeddings
- Document Store: PDF / Word / HTML / Excel
- Query Logs: audit + monitoring

Infrastructure
- Air-gapped: no external API calls
- Windows Server 2019: KVM/QEMU VM host
- 5.3 TB NTFS: persistent storage
Data Flow
| Step | Action | Component | Time |
|---|---|---|---|
| 1 | User submits query (API or Web UI) | ASP.NET / Open WebUI | Instant |
| 2 | Query embedded into vector | MiniLM-L6-v2 | ~0.5s |
| 3 | Semantic search in document index | ChromaDB | ~1-2s |
| 4 | Top-K documents retrieved and ranked | RAG Pipeline | ~0.5s |
| 5 | Context + query sent to LLM | FastAPI Gateway | Instant |
| 6 | Model generates response | Ollama + Phi-3-mini | ~35-50s |
| 7 | Response returned with source citations | FastAPI Gateway | Instant |
| | Total end-to-end | | ~40-55 seconds |
4. Activity Plan — 30 Days
Week 1 — Days 1-7
Foundation & First Working Chat
- Day 1-2: Set up WSL2 + Docker on Windows Server, install Ubuntu 22.04
- Day 3-4: Install Ollama, test Phi-3-mini and Qwen2.5, benchmark speed on your CPU
- Day 5-6: Install ChromaDB, configure embedding model, test vector search
- Day 7: Deploy Open WebUI, connect to Ollama — first working chat interface
Deliverable: Working chatbot on your server (general conversation mode)
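Once Ollama is running (Day 3-4), it can be smoke-tested over its local REST API. A minimal stdlib-only client; the `/api/generate` endpoint and payload shape follow Ollama's documented API, while the model tag may differ on your install:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "phi3") -> bytes:
    # stream=False returns one JSON object instead of a token stream
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()

def ask_ollama(prompt: str, model: str = "phi3",
               url: str = "http://localhost:11434/api/generate") -> str:
    req = urllib.request.Request(
        url,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama instance with the model pulled):
# print(ask_ollama("Summarize: revenue grew 12% in Q3."))
```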
Week 2 — Days 8-14
RAG Pipeline & Document Indexing
- Day 8-9: Build document ingestion pipeline (PDF, Word, HTML, Excel)
- Day 10-11: Index sample Intranet documents into ChromaDB
- Day 12-13: Build FastAPI gateway with hybrid search (semantic + keyword)
- Day 14: Test RAG — verify answers cite your documents
Deliverable: Chatbot answering from your internal documents with source citations
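The ingestion step on Days 8-9 reduces to splitting parsed documents into overlapping chunks before embedding. A minimal word-window chunker; chunk size and overlap are illustrative defaults, not tuned values:

```python
def chunk_text(text: str, chunk_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word windows.

    Overlap keeps sentences that straddle a boundary retrievable
    from both neighboring chunks.
    """
    words = text.split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break
    return chunks
```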
Week 3 — Days 15-21
Financial Mode & REST API
- Day 15-16: Optimize prompt templates for financial data processing
- Day 17-18: Build and test REST API for ASP.NET integration
- Day 19-20: Financial query testing, response quality optimization
- Day 21: Deploy monitoring dashboard (query logs, performance metrics)
Deliverable: REST API ready for ASP.NET + financial query mode operational
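The prompt work on Days 15-16 amounts to wrapping retrieved context in a template that pins the model to cited sources. One possible template (the wording and the `render` helper are illustrative, not the final prompt):

```python
FINANCIAL_PROMPT = """You are a financial analysis assistant.
Answer strictly from the context below. If the answer is not in the
context, say so. Cite the source document for every figure you quote.

Context:
{context}

Question: {question}
Answer:"""

def render(question: str, chunks: list[tuple[str, str]]) -> str:
    # chunks: (source_name, text) pairs coming from the retriever
    context = "\n".join(f"[{src}] {text}" for src, text in chunks)
    return FINANCIAL_PROMPT.format(context=context, question=question)

print(render("What was the Q3 margin?",
             [("q3-report.pdf", "Operating margin in Q3 was 18%.")]))
```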
Week 4 — Days 22-30
Testing, Documentation & Handover
- Day 22-23: LoRA fine-tuning dataset preparation (if hardware allows)
- Day 24-25: Full system testing, edge cases, performance optimization
- Day 26-27: Complete documentation (architecture, operations, troubleshooting)
- Day 28-29: Training sessions with IT team
- Day 30: Final deployment, handover, support plan
Deliverable: Complete system + documentation + trained IT team
Upgrade Path
The architecture scales with the VM: upgrade the hardware, and larger models can be swapped in with no code changes.
| Tier | Specs | Model | Response | Users |
|---|---|---|---|---|
| Current | 4 cores, 16 GB | Phi-3-mini 3.8B Q4 | ~45s | 1-2 |
| Tier 1 | 16 cores, 32 GB | Mistral 7B Q4 | ~15-20s | 3-5 |
| Tier 2 | 32 cores, 64 GB | Mistral 7B Full | ~8-12s | 5-10 |
| Tier 3 (GPU) | 16c, 64 GB, RTX 4060 | Mistral 7B + LoRA | ~2-4s | 10-20 |
Cost Comparison
Cloud API vs On-Premise (1,000 queries/day)
Key Advantage: Unlimited queries at zero marginal cost. Full data privacy — nothing leaves the building.