The XO GPT Answer Generation model uses Retrieval-Augmented Generation (RAG) to generate accurate, contextually relevant answers from domain-specific data. It is a fine-tuned LLM that addresses key limitations of using commercial models out of the box.

Challenges with Commercial Models

| Challenge | Impact |
|---|---|
| Latency | High processing times affect user experience in real-time or high-volume scenarios. |
| Cost | Per-request pricing scales poorly for large deployments. |
| Data Governance | Sending queries to external models raises privacy and security concerns. |
| Lack of Customization | General-purpose models are not tuned for specific industries or use cases. |
| Limited Control | Minimal ability to correct or refine model behavior for incorrect outputs. |
| Compliance Constraints | Some industries have regulatory requirements that commercial LLM providers don't fully support. |

How It Works

The XO GPT Answer Generation model activates immediately after the retrieval phase in the RAG pipeline. It takes the retrieved data chunks and generates accurate, contextually relevant answers from them. (Figure: RAG Framework)

Key assumptions:
  • Answers are generated from text-based data chunks only (not images or video).
  • Input queries have been rephrased by the XO GPT User Query Rephrasing Model.
  • Retrieved data chunks are assumed to be accurate and relevant.
  • Responses are based solely on text; content within links or embedded images is not included.
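The generation step described above can be sketched in plain Python. The function name, prompt wording, and chunk format here are illustrative assumptions, not the actual XO GPT pipeline; the sketch only shows how retrieved chunks might be assembled into a grounded prompt that asks the model to cite chunk IDs, as in the sample output later in this page.

```python
# Hypothetical sketch of prompt assembly after retrieval.
# Chunk IDs and wording are illustrative, not the XO GPT API.

def build_answer_prompt(query, chunks):
    """Assemble a grounded prompt: answer only from the supplied
    chunks and cite the chunk id for each fact."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer the question using ONLY the context below. "
        "Cite the chunk id for each fact. If the context does not "
        "contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = build_answer_prompt(
    "How do you compute EMI?",
    {"chk-4": "EMI = [P x R x (1+R)^N] / [(1+R)^N - 1] ..."},
)
print(prompt)
```

The resulting prompt string would then be sent to the fine-tuned LLM; anything outside the supplied context is explicitly ruled out, which matches the context-only behavior described in the usage notes below.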

Benefits


Consistent and Accurate

RAG-based retrieval delivers contextually relevant and precise answers. See Model Benchmarks for latency and accuracy metrics.

Cost-Effective

For Enterprise Tier customers, XO GPT eliminates commercial model usage costs. Example cost comparison, annualized over 365 days (10 input tokens per question, 80 output tokens per answer, 10,000 daily Q&A pairs):

| Model | Input $/MTok | Output $/MTok | Input $/Year | Output $/Year | Total $/Year |
|---|---|---|---|---|---|
| GPT-4 | $30 | $60 | $1,095 | $17,520 | $18,615 |
| GPT-4 Turbo | $10 | $30 | $365 | $8,760 | $9,125 |
| GPT-4o Mini | $0.15 | $0.60 | $5.48 | $175.20 | $180.68 |
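The yearly figures follow from straightforward token arithmetic; this short sketch reproduces them:

```python
# Reproduce the annual cost figures: 10,000 Q&A pairs per day for
# 365 days, 10 input tokens per question, 80 output tokens per answer.
DAILY_PAIRS, DAYS = 10_000, 365
IN_TOKENS, OUT_TOKENS = 10, 80

in_mtok = DAILY_PAIRS * DAYS * IN_TOKENS / 1e6    # 36.5 MTok input/year
out_mtok = DAILY_PAIRS * DAYS * OUT_TOKENS / 1e6  # 292 MTok output/year

def annual_cost(input_per_mtok, output_per_mtok):
    """Total yearly spend given per-million-token prices."""
    return in_mtok * input_per_mtok + out_mtok * output_per_mtok

print(annual_cost(30, 60))       # GPT-4: 1,095 + 17,520 = 18,615
print(annual_cost(0.15, 0.60))   # GPT-4o Mini: ~180.68
```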

Enhanced Security

No client or user data is used for model retraining. Guardrails: Content moderation, behavioral guidelines, response oversight, input validation, and usage controls. AI Safety: Ethical guidelines, bias monitoring, transparency, and continuous improvement.
Performance, features, and language support may vary by implementation. Test thoroughly in your environment before production use.

Use Cases

| Domain | Use Cases |
|---|---|
| Healthcare | Answering patient queries from medical records; assisting with evidence-based diagnosis; providing drug information and interactions |
| Finance | Answering loan eligibility questions; providing personalized financial advice; assisting with tax queries |
| E-commerce | Recommending products; answering shipping/returns/order status questions; providing product comparisons |
| Education | Assisting with homework; explaining complex concepts; recommending study materials |
| Legal | Answering legal queries from case law; assisting with contract review; providing info on legal rights |
| Customer Support | Automating FAQ responses; troubleshooting common issues; providing personalized support |
| Human Resources | Answering policy and benefits questions; assisting with recruitment; guiding performance evaluations |

Sample Output

Ingested chunks:
  • chk-1: A home loan is a secured loan obtained to purchase property by offering it as collateral. Calculation involves loan amount, interest rate, tenure, and borrower profile.
  • chk-4: EMI = [P × R × (1+R)^N] / [(1+R)^N−1], where P = principal, R = monthly interest rate, N = number of installments.
Q: How do you compute EMI?
The EMI is calculated using the formula: EMI = [P × R × (1+R)^N] / [(1+R)^N−1], where P is the principal loan amount, R is the monthly interest rate, and N is the number of monthly installments [chk-4].
Q: What’s a home loan? What’s photosynthesis?
I found only the answer to one question ‘what’s a home loan?’: A home loan is a secured loan obtained to purchase property by offering it as collateral [chk-1]. The remaining questions cannot be answered from the given context.
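The EMI formula from chk-1 and chk-4 can be sanity-checked numerically. The principal, interest rate, and tenure below are made-up example values, not figures from the source data:

```python
# EMI = [P * R * (1+R)^N] / [(1+R)^N - 1]
# Example: P = 1,000,000 principal, 9% annual interest
# (so R = 0.09/12 per month), N = 240 monthly installments.
P, R, N = 1_000_000, 0.09 / 12, 240

growth = (1 + R) ** N
emi = P * R * growth / (growth - 1)
print(round(emi, 2))  # roughly 9,000 per month for these inputs
```

As expected, the EMI exceeds the interest-only payment P × R, and total repayment EMI × N exceeds the principal.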

Model Building Process

See Model Building Process.

Model Benchmarks

| Version | Accuracy | TPS | Latency (s) | Benchmark | Test Data |
|---|---|---|---|---|---|
| v3.0 | 97% | 37 | 0.92 | Summary v3 | Results v3 |
| v2.0 | 96% | 54 | 1.03 | Summary v2 | Results v2 |
| v1.0 | 94% | 20 | 1.36 | Summary v1 | Results v1 |

Version 3.0

Model Choice

Base model: Llama 3.1 8B Instruct
| Base Model | Developer | Language | Release Date | Status | Knowledge Cutoff |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | Meta | Multi-lingual | July 2024 | Static | December 2023 |

Prompt Tuning

Prompts are designed to produce clear, well-structured outputs with a consistent tone. Each prompt variation is evaluated across multiple categories (toxicity, bias, ambiguity, hallucination, logical consistency, robustness) in English and multiple translated languages. The prompt with the highest accuracy and reliability across all scenarios is selected.

AWQ Model Quantization

AWQ (Activation-aware Weight Quantization) reduces memory and compute requirements while maintaining accuracy.
| Parameter | Description | Value |
|---|---|---|
| Zero Point | Include zero-point for better weight representation | True |
| Quantization Group Size | Weight group size for quantization | 128 |
| Weight Precision | Bits used to represent weights | 4 |
| Quantization Version | AWQ version optimized for GEMM | "GEMM" |
| Computation Data Type | Data type for inference | torch.float16 |
| Model Loading | Reduced CPU memory usage | {"low_cpu_mem_usage": True} |
| Tokenizer Loading | Remote code compatibility | trust_remote_code=True |
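These settings map naturally onto the open-source AutoAWQ library. The sketch below is an illustrative assumption about how such a quantization run could be configured (the model path, function name, and exact library usage are not from the source):

```python
# AWQ quantization settings from the table above, in AutoAWQ's
# quant_config format. The quantize_model() helper is hypothetical
# and requires the `autoawq` package plus a GPU; it is not run here.
quant_config = {
    "zero_point": True,   # include zero-point
    "q_group_size": 128,  # quantization group size
    "w_bit": 4,           # weight precision in bits
    "version": "GEMM",    # AWQ kernel version
}

def quantize_model(model_path, quant_path):
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(
        model_path, low_cpu_mem_usage=True
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True
    )
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
```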

Model Usage Notes

  • Context-only responses: The model responds based solely on the source document. It does not use external knowledge.
  • Language consistency: Query and source document must be in the same language.
  • Output formatting: Supports formatting cues in the query (e.g., “provide the answer in bullet points”, “explain step-by-step”).

Benchmarks Summary v3

Comparison models: Llama 3.1 8B, Claude 3.5 Sonnet, Mistral 7B v2. See Test Data and Results v3 for full details.

Version 2.0

Model Choice

Base model: Llama 3.1 8B Instruct
| Base Model | Developer | Language | Release Date | Status | Knowledge Cutoff |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | Meta | Multi-lingual | July 2024 | Static | December 2023 |

Fine-Tuning Parameters

| Parameter | Description | Value |
|---|---|---|
| Load in 4-bit Precision | Reduce memory by loading weights at 4-bit | True |
| Use Double Quantization | Improve accuracy with double quantization | True |
| 4-bit Quantization Type | Type of 4-bit quantization | nf4 |
| Computation Data Type | Data type for 4-bit quantized weights | torch.float16 |
| LoRA Rank | Rank of low-rank decomposition | 32 |
| LoRA Alpha | LoRA scaling factor | 16 |
| LoRA Dropout Rate | Dropout to prevent overfitting | 0.05 |
| Bias Term Inclusion | Add bias terms in LoRA layers | |
| Task Type | LoRA task type | CAUSAL_LM |
| Targeted Modules | Model layers where LoRA is applied | ['k_proj', 'q_proj', 'v_proj', 'o_proj'] |
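Expressed with the Hugging Face transformers and peft libraries, the table above corresponds to a standard QLoRA-style setup. This is a sketch under the assumption that those libraries are used; it is not the actual training script:

```python
# QLoRA-style configuration matching the fine-tuning table above.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # load weights in 4-bit precision
    bnb_4bit_use_double_quant=True,       # double quantization
    bnb_4bit_quant_type="nf4",            # 4-bit quantization type
    bnb_4bit_compute_dtype=torch.float16, # computation data type
)

lora_config = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=16,     # LoRA scaling factor
    lora_dropout=0.05, # dropout to prevent overfitting
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj"],
)
```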

General Parameters

Infrastructure: 2× A10 GPUs. Requires an Agent AI License.
| Parameter | Description | Value |
|---|---|---|
| Learning Rate | Rate toward loss minimum | 2e-4 (0.0002) |
| Batch Size | Examples per training step | 2 |
| Epochs | Passes over training data | 4 |
| Max Sequence Length | Maximum input length | 32768 |
| Optimizer | Optimization algorithm | paged_adamw_8bit |
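For reference, these hyperparameters map onto Hugging Face TrainingArguments as follows; the output directory is a hypothetical placeholder, and the max sequence length is typically passed to the tokenizer or trainer rather than here:

```python
# Training hyperparameters from the table above; an illustrative
# mapping, not the actual training script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="xo-gpt-answer-gen",  # hypothetical output path
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    num_train_epochs=4,
    optim="paged_adamw_8bit",
)
```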

AWQ Model Quantization

Same configuration as v3.0. See AWQ parameters above.

Benchmarks Summary v2

Comparison models: Mistral 7B v2, Llama 3.1 8B, GPT-4o Mini, Claude 3.5 Sonnet. See Test Data and Results v2 for full details.

Version 1.0

Model Choice

Base model: Mistral 7B Instruct v0.2
| Base Model | Developer | Language | Release Date | Status | Knowledge Cutoff |
|---|---|---|---|---|---|
| Mistral 7B Instruct v0.2 | Mistral AI | Multi-lingual | September 2024 | Static | September 2024 |

Fine-Tuning Parameters

| Parameter | Description | Value |
|---|---|---|
| Load in 4-bit Precision | Reduce memory by loading weights at 4-bit | True |
| Use Double Quantization | Improve accuracy with double quantization | True |
| 4-bit Quantization Type | Type of 4-bit quantization | nf4 |
| Computation Data Type | Data type for 4-bit quantized weights | torch.float16 |
| LoRA Rank | Rank of low-rank decomposition | 32 |
| LoRA Alpha | LoRA scaling factor | 16 |
| LoRA Dropout Rate | Dropout to prevent overfitting | 0.05 |
| Bias Term Inclusion | Add bias terms in LoRA layers | |
| Task Type | LoRA task type | CAUSAL_LM |
| Targeted Modules | Model layers where LoRA is applied | ['k_proj', 'q_proj', 'v_proj', 'o_proj'] |

General Parameters

Infrastructure: 2× A10 GPUs. Requires an Agent AI License.
| Parameter | Description | Value |
|---|---|---|
| Learning Rate | Rate toward loss minimum | 1e-3 (0.001) |
| Batch Size | Examples per training step | 1 |
| Epochs | Passes over training data | 3 |
| Max Sequence Length | Maximum input length | 32768 |
| Optimizer | Optimization algorithm | paged_adamw_8bit |

Benchmarks Summary v1

Comparison models: Llama 3.1 8B, GPT-4o Mini, Claude 3.5 Sonnet. See Test Data and Results v1 for full details.