AI LLM for Financial Data Analysis

Technical Proposal — On-Premise Deployment on Windows Server 2019

At a glance:
  • Available RAM: 16 GB
  • CPU cores: 4 vCPU
  • Storage: 5.3 TB
  • Software cost: €0
Project Goals:
  1. Financial AI — Process financial data and provide accurate responses via REST API
  2. General AI — ChatGPT-like web interface for general topics, user documents, and team queries

Current Server Analysis

| Component | Current Value | Status |
|---|---|---|
| Operating System | Windows Server 2019 Standard (Build 17763), 64-bit | OK |
| CPU | Intel Xeon E312xx (Sandy Bridge), 4 vCPUs @ 2.2 GHz | Limited |
| RAM | 16 GB (13.9 GB free) | Tight |
| Storage | 5.3 TB NTFS (5.26 TB free) | Excellent |
| GPU | None (virtual display adapter) | N/A |
| Server Type | Rented VM (KVM/QEMU), corporate hosting provider | Confirmed |
| Nested Virtualization | Disabled (Hyper-V not available) | Use WSL2/Docker |
CPU Note: Sandy Bridge (2011) lacks the AVX2 instruction set that modern LLM inference engines rely on for fast matrix math. Expect roughly 40-60% lower inference throughput than on a modern CPU.
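The AVX2 gap can be confirmed from inside WSL2 by inspecting /proc/cpuinfo. A minimal sketch (the helper name is ours, not part of any library):

```python
def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if an 'avx2' flag appears in /proc/cpuinfo-style output."""
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            return "avx2" in line.split()
    return False

# Sandy Bridge exposes AVX but not AVX2, so this CPU reports False:
sandy_bridge = "flags : fpu vme sse sse2 avx"
print(has_avx2(sandy_bridge))  # → False
```

On the real server, pass `open("/proc/cpuinfo").read()` from an Ubuntu shell under WSL2.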

1. Hardware Estimation

Memory Budget on 16 GB RAM

| Component | RAM Usage | Notes |
|---|---|---|
| Windows OS + services | ~3.0 GB | Base system overhead |
| WSL2 (Ubuntu) | ~0.5 GB | Linux subsystem kernel |
| LLM model (Phi-3-mini Q4) | ~4.0 GB | Model weights in memory |
| Embedding model (MiniLM) | ~0.3 GB | For vector search |
| ChromaDB (vector store) | ~2.0 GB | Document embeddings |
| FastAPI + RAG pipeline | ~1.0 GB | Application layer |
| Open WebUI (chat interface) | ~0.5 GB | Web interface |
| Total estimated | ~11.3 GB | Buffer: ~4.7 GB |
Verdict: 16 GB is tight but workable with a lightweight 3-4B parameter model. Storage (5.3 TB) is more than sufficient.
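The budget above can be sanity-checked with simple arithmetic (figures taken directly from the table):

```python
# Estimated RAM budget (GB), per component.
budget = {
    "Windows OS + services": 3.0,
    "WSL2 (Ubuntu)": 0.5,
    "LLM model (Phi-3-mini Q4)": 4.0,
    "Embedding model (MiniLM)": 0.3,
    "ChromaDB (vector store)": 2.0,
    "FastAPI + RAG pipeline": 1.0,
    "Open WebUI": 0.5,
}

total = round(sum(budget.values()), 1)   # estimated usage
buffer = round(16.0 - total, 1)          # headroom on a 16 GB VM
print(total, buffer)  # → 11.3 4.7
```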

Model Options for 16 GB RAM

| Model | Size | RAM | Quality | Speed | Response | Verdict |
|---|---|---|---|---|---|---|
| Phi-3-mini 3.8B Q4 | 2.3 GB | ~4 GB | 6/10 | 2-3 tok/s | ~40s | Best choice |
| Qwen2.5-3B Q4 | 2.0 GB | ~3.5 GB | 6/10 | 2-3 tok/s | ~40s | Alternative |
| Gemma-2 2B Q4 | 1.5 GB | ~3 GB | 5/10 | 3-4 tok/s | ~30s | Faster, less accurate |
| Qwen2.5-1.5B Q4 | 1.0 GB | ~2.5 GB | 4/10 | 3-5 tok/s | ~25s | Lightweight fallback |
| Mistral 7B Q4 | 4.4 GB | ~8 GB | 8/10 | 0.5-1 tok/s | ~3m | Too large |
RAG Advantage: With Retrieval-Augmented Generation, even a 3.8B model produces accurate financial answers. The RAG pipeline retrieves relevant documents and feeds them as context — the model synthesizes from your own data.
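The retrieve-then-generate idea can be sketched in a few lines. This toy uses bag-of-words vectors and cosine similarity in place of MiniLM embeddings and ChromaDB; all names and documents are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real pipeline uses MiniLM vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Q3 revenue increased 12 percent year over year",
    "The cafeteria menu changes on Mondays",
]

query = "What was the Q3 revenue growth"
# Retrieval: pick the most similar document as context.
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))

# Context injection: the model answers from your data, not from memory.
prompt = f"Context:\n{best}\n\nQuestion: {query}\nAnswer using only the context."
print(best)  # → Q3 revenue increased 12 percent year over year
```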

Performance Expectations

Response Time Comparison (~100 tokens per response)

| Deployment | Response Time |
|---|---|
| ChatGPT (cloud) | 2-3 s |
| Local + GPU (RTX 4060) | 3-5 s |
| Local + modern CPU (32 cores) | 10-15 s |
| Your server (current) | 40-55 s |

Concurrent Users

| Users | Response Time | Experience |
|---|---|---|
| 1 user | ~45 seconds | Functional |
| 2 users | ~90 seconds | Slow but works |
| 3+ users | Timeout risk | Not recommended |
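These figures follow directly from token throughput. A back-of-the-envelope check, assuming ~100 tokens per answer and that concurrent users roughly split the available CPU:

```python
def response_time(tokens: int, tok_per_s: float) -> float:
    """Seconds to generate `tokens` at a given throughput."""
    return tokens / tok_per_s

# One user at ~2.2 tok/s on this CPU:
print(response_time(100, 2.2))  # ≈ 45 seconds
# Two users sharing 4 cores roughly halve per-user throughput:
print(response_time(100, 1.1))  # ≈ 91 seconds
```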

Quality: Model Only vs Model + RAG

| Task | Model Only | Model + RAG |
|---|---|---|
| General chat | Decent | Good |
| Financial terminology | Weak | Good |
| Financial reasoning | Poor | Moderate |
| Document summarization | OK | Good |
| Number interpretation | Weak | Moderate |

2. Required Software

Total Software Cost: €0 — All components are free and open-source.
| Component | Role | Cost |
|---|---|---|
| WSL2 + Ubuntu 22.04 | Linux environment | Free |
| Docker Engine | Container runtime | Free |
| Ollama | LLM inference engine | Free |
| llama.cpp | CPU inference backend | Free |
| Phi-3-mini 3.8B | LLM model (Microsoft) | Free (MIT) |
| ChromaDB | Vector database | Free |
| Python 3.11+ | Application runtime | Free |
| FastAPI | REST API framework | Free |
| Open WebUI | Chat interface | Free |
| LangChain | RAG orchestration | Free |
| sentence-transformers | Embedding generation | Free |
| Unstructured / PyMuPDF | Document parsing | Free |

3. Proposed Architecture Design

All services run inside WSL2 on the existing Windows Server. No external network connections. ASP.NET connects via REST API on localhost.

User Interface Layer
  • ASP.NET App: REST API calls
  • Open WebUI: chat interface (:8080)
  • REST API: external apps (:8000)

Application Layer (WSL2 — Ubuntu 22.04)
  • FastAPI Gateway: query router + RAG (:8000)
  • RAG Pipeline: semantic search + context injection

AI Engine Layer
  • Ollama: LLM inference (:11434)
  • Phi-3-mini 3.8B: quantized Q4 model
  • MiniLM-L6-v2: embedding model

Data Layer
  • ChromaDB: vector embeddings
  • Document Store: PDF / Word / HTML / Excel
  • Query Logs: audit + monitoring

Infrastructure
  • Air-gapped: no external API calls
  • Windows Server 2019: KVM/QEMU VM host
  • 5.3 TB NTFS: persistent storage
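Because every service binds to localhost, the ASP.NET app (or any client) reaches the stack over plain HTTP. A minimal stdlib sketch of a request against Ollama's /api/generate endpoint; the model tag `phi3:mini` is an assumption, and the actual network call is left commented out:

```python
import json
import urllib.request

# Ollama listens on localhost:11434 inside WSL2; no traffic leaves the host.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "phi3:mini") -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Summarize the Q3 revenue drivers.")
# urllib.request.urlopen(req) would return the model's JSON response.
print(req.full_url)  # → http://localhost:11434/api/generate
```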

Data Flow

| Step | Action | Component | Time |
|---|---|---|---|
| 1 | User submits query (API or Web UI) | ASP.NET / Open WebUI | Instant |
| 2 | Query embedded into a vector | MiniLM-L6-v2 | ~0.5s |
| 3 | Semantic search in document index | ChromaDB | ~1-2s |
| 4 | Top-K documents retrieved and ranked | RAG Pipeline | ~0.5s |
| 5 | Context + query sent to LLM | FastAPI Gateway | Instant |
| 6 | Model generates response | Ollama + Phi-3-mini | ~35-50s |
| 7 | Response returned with source citations | FastAPI Gateway | Instant |
|  | Total end-to-end |  | ~40-55 seconds |
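The steps above can be sketched as a single pipeline. Every function here is a stub standing in for the real service (MiniLM, ChromaDB, Ollama); names and return values are purely illustrative:

```python
def embed(query):             # step 2: MiniLM-L6-v2 (stub)
    return [hash(w) % 97 for w in query.split()]

def search(vector, top_k=3):  # steps 3-4: ChromaDB search + ranking (stub)
    return ["2024-Q3-report.pdf: revenue grew 12% YoY"][:top_k]

def generate(prompt):         # step 6: Ollama + Phi-3-mini (stub; slowest step)
    return "Revenue grew 12% YoY. [source: 2024-Q3-report.pdf]"

def answer(query):            # steps 1, 5, 7: FastAPI gateway glue
    context = "\n".join(search(embed(query)))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(answer("How did Q3 revenue develop?"))
```

In the real deployment, step 6 dominates the latency; everything else completes in a few seconds.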

4. Activity Plan — 30 Days

Week 1 — Days 1-7

Foundation & First Working Chat

  • Day 1-2: Set up WSL2 + Docker on Windows Server, install Ubuntu 22.04
  • Day 3-4: Install Ollama, test Phi-3-mini and Qwen2.5, benchmark speed on your CPU
  • Day 5-6: Install ChromaDB, configure embedding model, test vector search
  • Day 7: Deploy Open WebUI, connect to Ollama — first working chat interface

Deliverable: Working chatbot on your server (general conversation mode)

Week 2 — Days 8-14

RAG Pipeline & Document Indexing

  • Day 8-9: Build document ingestion pipeline (PDF, Word, HTML, Excel)
  • Day 10-11: Index sample Intranet documents into ChromaDB
  • Day 12-13: Build FastAPI gateway with hybrid search (semantic + keyword)
  • Day 14: Test RAG — verify answers cite your documents

Deliverable: Chatbot answering from your internal documents with source citations

Week 3 — Days 15-21

Financial Mode & REST API

  • Day 15-16: Optimize prompt templates for financial data processing
  • Day 17-18: Build and test REST API for ASP.NET integration
  • Day 19-20: Financial query testing, response quality optimization
  • Day 21: Deploy monitoring dashboard (query logs, performance metrics)

Deliverable: REST API ready for ASP.NET + financial query mode operational

Week 4 — Days 22-30

Testing, Documentation & Handover

  • Day 22-23: LoRA fine-tuning dataset preparation (if hardware allows)
  • Day 24-25: Full system testing, edge cases, performance optimization
  • Day 26-27: Complete documentation (architecture, operations, troubleshooting)
  • Day 28-29: Training sessions with IT team
  • Day 30: Final deployment, handover, support plan

Deliverable: Complete system + documentation + trained IT team

Upgrade Path

The architecture scales without code changes: upgrading the VM is enough to make larger, more capable models viable.

| Tier | Specs | Model | Response | Users |
|---|---|---|---|---|
| Current | 4 cores, 16 GB | Phi-3-mini 3.8B Q4 | ~45s | 1-2 |
| Tier 1 | 16 cores, 32 GB | Mistral 7B Q4 | ~15-20s | 3-5 |
| Tier 2 | 32 cores, 64 GB | Mistral 7B full precision | ~8-12s | 5-10 |
| Tier 3 (GPU) | 16 cores, 64 GB, RTX 4060 | Mistral 7B + LoRA | ~2-4s | 10-20 |

Cost Comparison

Cloud API vs On-Premise (1,000 queries/day)

| Option | Monthly Cost |
|---|---|
| OpenAI GPT-4 API | €300-600/mo |
| Claude API | €250-500/mo |
| On-premise (your server) | €0/mo |
Key Advantage: Unlimited queries with no per-query cost, and full data privacy; nothing leaves the building.

VM Upgrade Estimates (future)

| Tier | Estimated Extra Cost |
|---|---|
| Current VM (16 GB / 4 cores) | Included |
| Tier 1 (32 GB / 16 cores) | €40-80/mo |
| Tier 2 (64 GB / 32 cores) | €80-150/mo |
| Tier 3 + GPU | €150-300/mo |