AI LLM for Financial Data Analysis

Technical Proposal — On-Premise Deployment on Windows Server 2019

At a glance:
  • Available RAM: 16 GB
  • CPU cores: 4 vCPU
  • Storage: 5.3 TB
  • Software cost: €0
Project Goals:
  1. Financial AI — Process financial data and provide accurate responses via REST API
  2. General AI — ChatGPT-like web interface for general topics, user documents, and team queries

Current Server Analysis

| Component | Current Value | Status |
|---|---|---|
| Operating System | Windows Server 2019 Standard (Build 17763), 64-bit | OK |
| CPU | Intel Xeon E312xx (Sandy Bridge), 4 vCPUs @ 2.2 GHz | Limited |
| RAM | 16 GB (13.9 GB free) | Tight |
| Storage | 5.3 TB NTFS (5.26 TB free) | Excellent |
| GPU | None (virtual display adapter) | N/A |
| Server Type | Rented VM (KVM/QEMU), corporate hosting provider | Confirmed |
| Nested Virtualization | Disabled (Hyper-V not available) | Use WSL2/Docker |
CPU Note: Sandy Bridge (2011) lacks the AVX2 instruction set that modern LLM inference engines rely on for fast matrix math. Expect roughly 40-60% lower inference throughput than on a modern CPU.
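The AVX2 gap can be confirmed from inside WSL2 by inspecting /proc/cpuinfo. A minimal sketch (the helper name is ours, not part of any library):

```python
def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if an 'avx2' flag appears in /proc/cpuinfo-style output."""
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("flags"):
            return "avx2" in line.split()
    return False

# Sandy Bridge exposes AVX but not AVX2, so this CPU reports False:
sandy_bridge = "flags : fpu vme sse sse2 avx"
print(has_avx2(sandy_bridge))  # → False
```

On the real server, pass `open("/proc/cpuinfo").read()` from an Ubuntu shell under WSL2.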

1. Hardware Estimation

Memory Budget on 16 GB RAM

| Component | RAM Usage | Notes |
|---|---|---|
| Windows OS + services | ~3.0 GB | Base system overhead |
| WSL2 (Ubuntu) | ~0.5 GB | Linux subsystem kernel |
| LLM model (Phi-3-mini Q4) | ~4.0 GB | Model weights in memory |
| Embedding model (MiniLM) | ~0.3 GB | For vector search |
| ChromaDB (vector store) | ~2.0 GB | Document embeddings |
| FastAPI + RAG pipeline | ~1.0 GB | Application layer |
| Open WebUI (chat interface) | ~0.5 GB | Web interface |
| Total estimated | ~11.3 GB | Buffer: ~4.7 GB |
Verdict: 16 GB is tight but workable with a lightweight 3-4B parameter model. Storage (5.3 TB) is more than sufficient.
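The budget above can be sanity-checked with simple arithmetic (figures taken directly from the table):

```python
# Estimated RAM budget (GB), per component.
budget = {
    "Windows OS + services": 3.0,
    "WSL2 (Ubuntu)": 0.5,
    "LLM model (Phi-3-mini Q4)": 4.0,
    "Embedding model (MiniLM)": 0.3,
    "ChromaDB (vector store)": 2.0,
    "FastAPI + RAG pipeline": 1.0,
    "Open WebUI": 0.5,
}

total = round(sum(budget.values()), 1)   # estimated usage
buffer = round(16.0 - total, 1)          # headroom on a 16 GB VM
print(total, buffer)  # → 11.3 4.7
```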

Model Options for 16 GB RAM

| Model | Size | RAM | Quality | Speed | Response | Verdict |
|---|---|---|---|---|---|---|
| Phi-3-mini 3.8B Q4 | 2.3 GB | ~4 GB | 6/10 | 2-3 tok/s | ~40s | Best choice |
| Qwen2.5-3B Q4 | 2.0 GB | ~3.5 GB | 6/10 | 2-3 tok/s | ~40s | Alternative |
| Gemma-2 2B Q4 | 1.5 GB | ~3 GB | 5/10 | 3-4 tok/s | ~30s | Faster, less accurate |
| Qwen2.5-1.5B Q4 | 1.0 GB | ~2.5 GB | 4/10 | 3-5 tok/s | ~25s | Lightweight fallback |
| Mistral 7B Q4 | 4.4 GB | ~8 GB | 8/10 | 0.5-1 tok/s | ~3m | Too large |
RAG Advantage: With Retrieval-Augmented Generation, even a 3.8B model produces accurate financial answers. The RAG pipeline retrieves relevant documents and feeds them as context — the model synthesizes from your own data.
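The retrieve-then-generate idea can be sketched in a few lines. This toy uses bag-of-words vectors and cosine similarity in place of MiniLM embeddings and ChromaDB; all names and documents are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the real pipeline uses MiniLM vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Q3 revenue increased 12 percent year over year",
    "The cafeteria menu changes on Mondays",
]

query = "What was the Q3 revenue growth"
# Retrieval: pick the most similar document as context.
best = max(docs, key=lambda d: cosine(embed(query), embed(d)))

# Context injection: the model answers from your data, not from memory.
prompt = f"Context:\n{best}\n\nQuestion: {query}\nAnswer using only the context."
print(best)  # → Q3 revenue increased 12 percent year over year
```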

Performance Expectations

Response Time Comparison (~100 tokens per response)

| Deployment | Response Time |
|---|---|
| ChatGPT (cloud) | 2-3 s |
| Local + GPU (RTX 4060) | 3-5 s |
| Local + modern CPU (32 cores) | 10-15 s |
| Your server (current) | 40-55 s |

Concurrent Users

| Users | Response Time | Experience |
|---|---|---|
| 1 user | ~45 seconds | Functional |
| 2 users | ~90 seconds | Slow but works |
| 3+ users | Timeout risk | Not recommended |
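These figures follow directly from token throughput. A back-of-the-envelope check, assuming ~100 tokens per answer and that concurrent users roughly split the available CPU:

```python
def response_time(tokens: int, tok_per_s: float) -> float:
    """Seconds to generate `tokens` at a given throughput."""
    return tokens / tok_per_s

# One user at ~2.2 tok/s on this CPU:
print(response_time(100, 2.2))  # ≈ 45 seconds
# Two users sharing 4 cores roughly halve per-user throughput:
print(response_time(100, 1.1))  # ≈ 91 seconds
```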

Quality: Model Only vs Model + RAG

| Task | Model Only | Model + RAG |
|---|---|---|
| General chat | Decent | Good |
| Financial terminology | Weak | Good |
| Financial reasoning | Poor | Moderate |
| Document summarization | OK | Good |
| Number interpretation | Weak | Moderate |

2. Required Software

Total Software Cost: €0 — All components are free and open-source.
| Component | Role | Cost |
|---|---|---|
| WSL2 + Ubuntu 22.04 | Linux environment | Free |
| Docker Engine | Container runtime | Free |
| Ollama | LLM inference engine | Free |
| llama.cpp | CPU inference backend | Free |
| Phi-3-mini 3.8B | LLM model (Microsoft) | Free (MIT) |
| ChromaDB | Vector database | Free |
| Python 3.11+ | Application runtime | Free |
| FastAPI | REST API framework | Free |
| Open WebUI | Chat interface | Free |
| LangChain | RAG orchestration | Free |
| sentence-transformers | Embedding generation | Free |
| Unstructured / PyMuPDF | Document parsing | Free |

3. Proposed Architecture Design

All services run inside WSL2 on the existing Windows Server. No external network connections. ASP.NET connects via REST API on localhost.

User Interface Layer
  • ASP.NET App: REST API calls
  • Open WebUI: chat interface (:8080)
  • REST API: external apps (:8000)

Application Layer (WSL2 — Ubuntu 22.04)
  • FastAPI Gateway: query router + RAG (:8000)
  • RAG Pipeline: semantic search + context injection

AI Engine Layer
  • Ollama: LLM inference (:11434)
  • Phi-3-mini 3.8B: quantized Q4 model
  • MiniLM-L6-v2: embedding model

Data Layer
  • ChromaDB: vector embeddings
  • Document Store: PDF / Word / HTML / Excel
  • Query Logs: audit + monitoring

Infrastructure
  • Air-gapped: no external API calls
  • Windows Server 2019: KVM/QEMU VM host
  • 5.3 TB NTFS: persistent storage
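Because every service binds to localhost, the ASP.NET app (or any client) reaches the stack over plain HTTP. A minimal stdlib sketch of a request against Ollama's /api/generate endpoint; the model tag `phi3:mini` is an assumption, and the actual network call is left commented out:

```python
import json
import urllib.request

# Ollama listens on localhost:11434 inside WSL2; no traffic leaves the host.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "phi3:mini") -> urllib.request.Request:
    """Build a POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Summarize the Q3 revenue drivers.")
# urllib.request.urlopen(req) would return the model's JSON response.
print(req.full_url)  # → http://localhost:11434/api/generate
```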

Data Flow

| Step | Action | Component | Time |
|---|---|---|---|
| 1 | User submits query (API or Web UI) | ASP.NET / Open WebUI | Instant |
| 2 | Query embedded into a vector | MiniLM-L6-v2 | ~0.5s |
| 3 | Semantic search in document index | ChromaDB | ~1-2s |
| 4 | Top-K documents retrieved and ranked | RAG Pipeline | ~0.5s |
| 5 | Context + query sent to LLM | FastAPI Gateway | Instant |
| 6 | Model generates response | Ollama + Phi-3-mini | ~35-50s |
| 7 | Response returned with source citations | FastAPI Gateway | Instant |
|  | Total end-to-end |  | ~40-55 seconds |
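The steps above can be sketched as a single pipeline. Every function here is a stub standing in for the real service (MiniLM, ChromaDB, Ollama); names and return values are purely illustrative:

```python
def embed(query):             # step 2: MiniLM-L6-v2 (stub)
    return [hash(w) % 97 for w in query.split()]

def search(vector, top_k=3):  # steps 3-4: ChromaDB search + ranking (stub)
    return ["2024-Q3-report.pdf: revenue grew 12% YoY"][:top_k]

def generate(prompt):         # step 6: Ollama + Phi-3-mini (stub; slowest step)
    return "Revenue grew 12% YoY. [source: 2024-Q3-report.pdf]"

def answer(query):            # steps 1, 5, 7: FastAPI gateway glue
    context = "\n".join(search(embed(query)))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")

print(answer("How did Q3 revenue develop?"))
```

In the real deployment, step 6 dominates the latency; everything else completes in a few seconds.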

4. Activity Plan — 30 Days

Week 1 — Days 1-7

Foundation & First Working Chat

  • Day 1-2: Set up WSL2 + Docker on Windows Server, install Ubuntu 22.04
  • Day 3-4: Install Ollama, test Phi-3-mini and Qwen2.5, benchmark speed on your CPU
  • Day 5-6: Install ChromaDB, configure embedding model, test vector search
  • Day 7: Deploy Open WebUI, connect to Ollama — first working chat interface

Deliverable: Working chatbot on your server (general conversation mode)

Week 2 — Days 8-14

RAG Pipeline & Document Indexing

  • Day 8-9: Build document ingestion pipeline (PDF, Word, HTML, Excel)
  • Day 10-11: Index sample Intranet documents into ChromaDB
  • Day 12-13: Build FastAPI gateway with hybrid search (semantic + keyword)
  • Day 14: Test RAG — verify answers cite your documents

Deliverable: Chatbot answering from your internal documents with source citations

Week 3 — Days 15-21

Financial Mode & REST API

  • Day 15-16: Optimize prompt templates for financial data processing
  • Day 17-18: Build and test REST API for ASP.NET integration
  • Day 19-20: Financial query testing, response quality optimization
  • Day 21: Deploy monitoring dashboard (query logs, performance metrics)

Deliverable: REST API ready for ASP.NET + financial query mode operational

Week 4 — Days 22-30

Testing, Documentation & Handover

  • Day 22-23: LoRA fine-tuning dataset preparation (if hardware allows)
  • Day 24-25: Full system testing, edge cases, performance optimization
  • Day 26-27: Complete documentation (architecture, operations, troubleshooting)
  • Day 28-29: Training sessions with IT team
  • Day 30: Final deployment, handover, support plan

Deliverable: Complete system + documentation + trained IT team

Upgrade Path

The architecture scales without code changes: upgrading the VM is enough to make larger, more capable models viable.

| Tier | Specs | Model | Response | Users |
|---|---|---|---|---|
| Current | 4 cores, 16 GB | Phi-3-mini 3.8B Q4 | ~45s | 1-2 |
| Tier 1 | 16 cores, 32 GB | Mistral 7B Q4 | ~15-20s | 3-5 |
| Tier 2 | 32 cores, 64 GB | Mistral 7B full precision | ~8-12s | 5-10 |
| Tier 3 (GPU) | 16 cores, 64 GB, RTX 4060 | Mistral 7B + LoRA | ~2-4s | 10-20 |

Cost Comparison

Cloud API vs On-Premise (1,000 queries/day)

| Option | Monthly Cost |
|---|---|
| OpenAI GPT-4 API | €300-600/mo |
| Claude API | €250-500/mo |
| On-premise (your server) | €0/mo |
Key Advantage: Unlimited queries with no per-query cost, and full data privacy; nothing leaves the building.

VM Upgrade Estimates (future)

| Tier | Estimated Extra Cost |
|---|---|
| Current VM (16 GB / 4 cores) | Included |
| Tier 1 (32 GB / 16 cores) | €40-80/mo |
| Tier 2 (64 GB / 32 cores) | €80-150/mo |
| Tier 3 + GPU | €150-300/mo |