LLM Model Configuration

DeepCritical supports multiple LLM backends through a unified OpenAI-compatible interface. This guide covers configuration and usage of different LLM providers.

Supported Providers

DeepCritical supports any OpenAI-compatible API server:

  • vLLM: High-performance inference server for local models
  • llama.cpp: Efficient C++ inference for GGUF models
  • Text Generation Inference (TGI): Hugging Face's optimized inference server
  • Custom OpenAI-compatible servers: Any server implementing the OpenAI Chat Completions API
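
All of these expose the same Chat Completions wire protocol, so one client works against any of them. A minimal sketch using the official openai Python client (the base_url, model name, and dummy key are placeholders for your own server):

from openai import OpenAI

# Point the standard OpenAI client at any compatible server;
# local servers without auth typically accept a dummy key such as "EMPTY".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)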

Configuration Files

LLM configurations are stored in the configs/llm/ directory:

configs/llm/
├── vllm_pydantic.yaml      # vLLM server configuration
├── llamacpp_local.yaml     # llama.cpp server configuration
└── tgi_local.yaml          # TGI server configuration

Configuration Schema

All LLM configurations follow this Pydantic-validated schema:

Basic Configuration

# Provider identifier
provider: "vllm"  # or "llamacpp", "tgi", "custom"

# Model identifier
model_name: "meta-llama/Llama-3-8B"

# Server endpoint
base_url: "http://localhost:8000/v1"

# Optional API key (set to null for local servers)
api_key: null

# Connection settings
timeout: 60.0        # Request timeout in seconds (1-600)
max_retries: 3       # Maximum retry attempts (0-10)
retry_delay: 1.0     # Delay between retries in seconds
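
To illustrate how these three settings interact, here is a sketch of a retry loop built on the requests library (the function and its names are illustrative, not DeepCritical's internal implementation):

import time
import requests

def post_with_retries(url: str, payload: dict, timeout: float = 60.0,
                      max_retries: int = 3, retry_delay: float = 1.0) -> dict:
    """Issue a POST, retrying up to max_retries times on failure."""
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            time.sleep(retry_delay)  # back off before the next attempt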

Generation Parameters

generation:
  temperature: 0.7           # Sampling temperature (0.0-2.0)
  max_tokens: 512           # Maximum tokens to generate (1-32000)
  top_p: 0.9                # Nucleus sampling threshold (0.0-1.0)
  frequency_penalty: 0.0    # Penalize token frequency (-2.0 to 2.0)
  presence_penalty: 0.0     # Penalize token presence (-2.0 to 2.0)

Provider-Specific Configurations

vLLM Configuration

# configs/llm/vllm_pydantic.yaml
provider: "vllm"
model_name: "meta-llama/Llama-3-8B"
base_url: "http://localhost:8000/v1"
api_key: null  # with auth disabled, vLLM accepts any key (clients conventionally send "EMPTY")

generation:
  temperature: 0.7
  max_tokens: 512
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

timeout: 60.0
max_retries: 3
retry_delay: 1.0

Starting vLLM server:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B \
  --port 8000
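
Once the server is up, a quick way to confirm it is serving the expected model is the standard /v1/models listing (the same check works for the llama.cpp and TGI servers below on their respective ports):

import requests

# List the models exposed by the OpenAI-compatible endpoint
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])  # e.g. ['meta-llama/Llama-3-8B']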

llama.cpp Configuration

# configs/llm/llamacpp_local.yaml
provider: "llamacpp"
model_name: "llama"  # Default name used by llama.cpp server
base_url: "http://localhost:8080/v1"
api_key: null

generation:
  temperature: 0.7
  max_tokens: 512
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

timeout: 60.0
max_retries: 3
retry_delay: 1.0

Starting llama.cpp server:

./llama-server \
  --model models/llama-3-8b.gguf \
  --port 8080 \
  --ctx-size 4096

TGI Configuration

# configs/llm/tgi_local.yaml
provider: "tgi"
model_name: "bigscience/bloom-560m"
base_url: "http://localhost:3000/v1"
api_key: null

generation:
  temperature: 0.7
  max_tokens: 512
  top_p: 0.9
  frequency_penalty: 0.0
  presence_penalty: 0.0

timeout: 60.0
max_retries: 3
retry_delay: 1.0

Starting TGI server:

docker run -p 3000:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id bigscience/bloom-560m

Python API Usage

Loading Models from Configuration

from omegaconf import DictConfig, OmegaConf
from DeepResearch.src.models import OpenAICompatibleModel

# Load configuration
config = OmegaConf.load("configs/llm/vllm_pydantic.yaml")

# Type guard: ensure config is a DictConfig (not ListConfig)
assert OmegaConf.is_dict(config), "Config must be a dict"
dict_config: DictConfig = config  # type: ignore

# Create model from configuration
model = OpenAICompatibleModel.from_config(dict_config)

# Or use provider-specific methods
model = OpenAICompatibleModel.from_vllm(dict_config)
model = OpenAICompatibleModel.from_llamacpp(dict_config)
model = OpenAICompatibleModel.from_tgi(dict_config)

Direct Instantiation

from omegaconf import DictConfig, OmegaConf
from DeepResearch.src.models import OpenAICompatibleModel

# Create model with direct parameters (no config file needed)
model = OpenAICompatibleModel.from_vllm(
    base_url="http://localhost:8000/v1",
    model_name="meta-llama/Llama-3-8B"
)

# Override config parameters from file
config = OmegaConf.load("configs/llm/vllm_pydantic.yaml")

# Type guard before using config
assert OmegaConf.is_dict(config), "Config must be a dict"
dict_config: DictConfig = config  # type: ignore

model = OpenAICompatibleModel.from_config(
    dict_config,
    model_name="override-model",  # Override model name
    timeout=120.0                 # Override timeout
)

Environment Variables

Use environment variables for sensitive data:

# In your config file
base_url: ${oc.env:LLM_BASE_URL,http://localhost:8000/v1}
api_key: ${oc.env:LLM_API_KEY}

# Set the environment variables in your shell
export LLM_BASE_URL="http://my-server:8000/v1"
export LLM_API_KEY="your-api-key"
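
OmegaConf resolves the oc.env interpolation lazily on access, using the value after the comma as a fallback when the variable is unset. A small self-contained demonstration:

import os
from omegaconf import OmegaConf

os.environ["LLM_BASE_URL"] = "http://my-server:8000/v1"

config = OmegaConf.create(
    {"base_url": "${oc.env:LLM_BASE_URL,http://localhost:8000/v1}"}
)
# Resolved on access; with LLM_BASE_URL unset, the fallback after the
# comma would be returned instead.
print(config.base_url)  # http://my-server:8000/v1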

Configuration Validation

All configurations are validated using Pydantic models at runtime:

LLMModelConfig

from DeepResearch.src.datatypes.llm_models import LLMModelConfig, LLMProvider

config = LLMModelConfig(
    provider=LLMProvider.VLLM,
    model_name="meta-llama/Llama-3-8B",
    base_url="http://localhost:8000/v1",
    timeout=60.0,
    max_retries=3
)

Validation rules:

  • model_name: Non-empty string (whitespace stripped)
  • base_url: Non-empty string (whitespace stripped)
  • timeout: Positive float (1-600 seconds)
  • max_retries: Integer (0-10)
  • retry_delay: Positive float
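
Assuming the model enforces these bounds as described, an out-of-range value raises a standard Pydantic ValidationError that you can catch to fail fast with a readable message:

from pydantic import ValidationError
from DeepResearch.src.datatypes.llm_models import LLMModelConfig, LLMProvider

try:
    LLMModelConfig(
        provider=LLMProvider.VLLM,
        model_name="meta-llama/Llama-3-8B",
        base_url="http://localhost:8000/v1",
        timeout=0.0,  # outside the 1-600 second range -> rejected
    )
except ValidationError as exc:
    print(exc)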

GenerationConfig

from DeepResearch.src.datatypes.llm_models import GenerationConfig

gen_config = GenerationConfig(
    temperature=0.7,
    max_tokens=512,
    top_p=0.9,
    frequency_penalty=0.0,
    presence_penalty=0.0
)

Validation rules:

  • temperature: Float (0.0-2.0)
  • max_tokens: Positive integer (1-32000)
  • top_p: Float (0.0-1.0)
  • frequency_penalty: Float (-2.0 to 2.0)
  • presence_penalty: Float (-2.0 to 2.0)
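
The same pattern applies here; for example, a temperature above the 2.0 ceiling should be rejected at construction time (again assuming the bounds above are enforced by the Pydantic model):

from pydantic import ValidationError
from DeepResearch.src.datatypes.llm_models import GenerationConfig

try:
    GenerationConfig(temperature=3.0)  # above the 0.0-2.0 range -> rejected
except ValidationError as exc:
    print(exc)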

Command Line Overrides

Override LLM configuration from the command line:

# Override model name
uv run deepresearch \
  llm.model_name="different-model" \
  question="Your question"

# Override server URL
uv run deepresearch \
  llm.base_url="http://different-server:8000/v1" \
  question="Your question"

# Override generation parameters
uv run deepresearch \
  llm.generation.temperature=0.9 \
  llm.generation.max_tokens=1024 \
  question="Your question"

Testing LLM Configurations

Test your LLM configuration before use:

# tests/test_models.py
from omegaconf import DictConfig, OmegaConf
from DeepResearch.src.models import OpenAICompatibleModel

def test_vllm_config():
    """Test vLLM model configuration."""
    config = OmegaConf.load("configs/llm/vllm_pydantic.yaml")

    # Type guard: ensure config is a DictConfig
    assert OmegaConf.is_dict(config), "Config must be a dict"
    dict_config: DictConfig = config  # type: ignore

    model = OpenAICompatibleModel.from_vllm(dict_config)

    assert model.model_name == "meta-llama/Llama-3-8B"
    assert "localhost:8000" in model.base_url

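If you want the suite to tolerate a server that is not running, a small connectivity check can skip instead of fail (a sketch; adjust the endpoint to match your config):

import pytest
import requests

def test_server_reachable():
    """Skip rather than fail when the local LLM server is down."""
    try:
        response = requests.get("http://localhost:8000/v1/models", timeout=5)
    except requests.ConnectionError:
        pytest.skip("LLM server is not running")
    assert response.status_code == 200
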
Run tests:

# Run all model tests
uv run pytest tests/test_models.py -v

# Test specific provider
uv run pytest tests/test_models.py::TestOpenAICompatibleModelWithConfigs::test_from_vllm_with_actual_config_file -v

Troubleshooting

Connection Errors

Problem: ConnectionError: Failed to connect to server

Solutions:

  1. Verify the server is running: curl http://localhost:8000/v1/models
  2. Check base_url in the configuration
  3. Increase the timeout value
  4. Check firewall settings

Type Validation Errors

Problem: ValidationError: Invalid type for model_name

Solutions:

  1. Ensure model_name is a non-empty string
  2. Check for trailing whitespace (it is stripped automatically)
  3. Verify the configuration file syntax

Model Not Found

Problem: Model 'xyz' not found

Solutions:

  1. Verify the model is loaded on the server
  2. Check that model_name matches the server's model identifier
  3. For llama.cpp, use the default name "llama"

Best Practices

  1. Configuration Management
     • Keep separate configs for development, staging, and production
     • Use environment variables for sensitive data
     • Version-control your configuration files

  2. Performance Tuning
     • Adjust max_tokens based on the use case
     • Use an appropriate temperature for creativity vs. consistency
     • Set reasonable timeout values for your network

  3. Error Handling
     • Configure max_retries based on server reliability
     • Set an appropriate retry_delay to avoid overwhelming servers
     • Implement proper error logging

  4. Testing
     • Test configurations in a development environment first
     • Validate that generation parameters produce the expected output
     • Monitor server response times
