Configuration Guide

DeepCritical's configuration system is built on Hydra and composes small, reusable configuration components into a complete run configuration. This guide explains how the configuration is organized and how to customize DeepCritical for your needs.

Configuration Structure

The configuration system is organized into several key areas:

configs/
├── config.yaml                 # Main configuration file
├── app_modes/                  # Application execution modes
├── bioinformatics/             # Bioinformatics-specific configurations
├── challenge/                  # Challenge and experimental configurations
├── db/                         # Database connection configurations
├── deep_agent/                 # Deep agent configurations
├── deepsearch/                 # Deep search configurations
├── prompts/                    # Prompt templates for all agents
├── rag/                        # RAG system configurations
├── statemachines/              # Workflow state machine configurations
├── vllm/                       # VLLM model configurations
└── workflow_orchestration/     # Advanced workflow configurations

Main Configuration (config.yaml)

The main configuration file defines the core parameters for DeepCritical:

# Research parameters
question: "Your research question here"
plan: ["step1", "step2", "step3"]
retries: 3
manual_confirm: false

# Flow control
flows:
  prime:
    enabled: true
    params:
      adaptive_replanning: true
      manual_confirmation: false
      tool_validation: true
  bioinformatics:
    enabled: true
    data_sources:
      go:
        enabled: true
        evidence_codes: ["IDA", "EXP"]
        year_min: 2022
        quality_threshold: 0.9
      pubmed:
        enabled: true
        max_results: 50
        include_full_text: true

# Output management
hydra:
  run:
    dir: outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: multirun/${now:%Y-%m-%d}/${now:%H-%M-%S}
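
The component directories listed above are normally wired into config.yaml through Hydra's defaults list, which is what enables the composition shown later in this guide. The entries below are a minimal sketch; the actual group names and defaults used in the repository may differ.

# Hypothetical defaults list at the top of configs/config.yaml
defaults:
  - db: neo4j                 # composes configs/db/neo4j.yaml
  - rag/vector_store: chroma  # composes configs/rag/vector_store/chroma.yaml
  - rag/llm: openai           # composes configs/rag/llm/openai.yaml
  - vllm: default             # composes configs/vllm/default.yaml
  - _self_                    # values in config.yaml override the groups above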

Application Modes (app_modes/)

Different execution modes for various research scenarios:

Single REACT Mode

# configs/app_modes/single_react.yaml
question: "What is machine learning?"
flows:
  prime:
    enabled: false
  bioinformatics:
    enabled: false
  deepsearch:
    enabled: false

Multi-Level REACT Mode

# configs/app_modes/multi_level_react.yaml
question: "Analyze machine learning in drug discovery"
flows:
  prime:
    enabled: true
    params:
      nested_loops: 3
  bioinformatics:
    enabled: true
  deepsearch:
    enabled: true

Nested Orchestration Mode

# configs/app_modes/nested_orchestration.yaml
question: "Design comprehensive research framework"
flows:
  prime:
    enabled: true
    params:
      nested_loops: 5
      subgraphs_enabled: true
  bioinformatics:
    enabled: true
  deepsearch:
    enabled: true

Loss-Driven Mode

# configs/app_modes/loss_driven.yaml
question: "Optimize research quality"
flows:
  prime:
    enabled: true
    params:
      loss_functions: ["quality", "efficiency", "comprehensiveness"]
  bioinformatics:
    enabled: true
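
Any of these mode files can be selected at launch without editing config.yaml. The commands below assume the directory is exposed as a Hydra config group named app_modes; adjust the group name if the repository wires it differently.

# Run a specific application mode (group name assumed)
uv run deepresearch app_modes=single_react

# Combine a mode with a parameter override
uv run deepresearch app_modes=multi_level_react flows.prime.params.nested_loops=2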

Bioinformatics Configuration (bioinformatics/)

Agent Configuration

# configs/bioinformatics/agents.yaml
agents:
  data_fusion:
    model: "anthropic:claude-sonnet-4-0"
    temperature: 0.7
    max_tokens: 2000
  go_annotation:
    model: "anthropic:claude-sonnet-4-0"
    temperature: 0.5
    max_tokens: 1500
  reasoning:
    model: "anthropic:claude-sonnet-4-0"
    temperature: 0.3
    max_tokens: 3000

Data Sources Configuration

# configs/bioinformatics/data_sources.yaml
data_sources:
  go:
    enabled: true
    api_base_url: "https://api.geneontology.org"
    evidence_codes: ["IDA", "EXP", "TAS", "IMP"]
    year_min: 2020
    quality_threshold: 0.85
    max_annotations: 1000

  pubmed:
    enabled: true
    api_base_url: "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
    max_results: 100
    include_abstracts: true
    year_min: 2020
    relevance_threshold: 0.7

  geo:
    enabled: false
    max_datasets: 10
    sample_threshold: 50

  cmap:
    enabled: false
    max_profiles: 100
    correlation_threshold: 0.8
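
Disabled sources such as GEO and CMAP can be switched on for a single run with command-line overrides instead of editing the file. The override paths below assume these settings are mounted under flows.bioinformatics.data_sources, matching the main config.yaml shown earlier.

# Enable GEO for one run and raise its dataset limit (override path assumed)
uv run deepresearch \
  question="Expression changes under heat shock" \
  flows.bioinformatics.data_sources.geo.enabled=true \
  flows.bioinformatics.data_sources.geo.max_datasets=25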

Workflow Configuration

# configs/bioinformatics/workflow.yaml
workflow:
  steps:
    - name: "parse_query"
      agent: "query_parser"
      timeout: 30

    - name: "fuse_data"
      agent: "data_fusion"
      timeout: 120
      retry_on_failure: true

    - name: "assess_quality"
      agent: "data_quality"
      timeout: 60

    - name: "reason_integrate"
      agent: "reasoning"
      timeout: 180

  quality_thresholds:
    data_fusion: 0.8
    cross_reference: 0.75
    evidence_integration: 0.85

Database Configurations (db/)

Neo4j Configuration

# configs/db/neo4j.yaml
neo4j:
  uri: "bolt://localhost:7687"
  user: "neo4j"
  password: "${oc.env:NEO4J_PASSWORD}"
  database: "neo4j"

  connection:
    max_connection_lifetime: 3600
    max_connection_pool_size: 50
    connection_acquisition_timeout: 60

  queries:
    default_timeout: 30
    max_query_complexity: 1000

PostgreSQL Configuration

# configs/db/postgres.yaml
postgres:
  host: "localhost"
  port: 5432
  database: "deepcritical"
  user: "${oc.env:POSTGRES_USER}"
  password: "${oc.env:POSTGRES_PASSWORD}"

  connection:
    pool_size: 20
    max_overflow: 30
    pool_timeout: 30

  tables:
    research_state: "research_states"
    execution_history: "execution_history"
    tool_results: "tool_results"

Deep Agent Configurations (deep_agent/)

Basic Configuration

# configs/deep_agent/basic.yaml
deep_agent:
  enabled: true
  model: "anthropic:claude-sonnet-4-0"
  temperature: 0.7

  capabilities:
    - "file_system"
    - "web_search"
    - "code_execution"

  tools:
    - "read_file"
    - "search_web"
    - "run_terminal_cmd"

Comprehensive Configuration

# configs/deep_agent/comprehensive.yaml
deep_agent:
  enabled: true
  model: "anthropic:claude-sonnet-4-0"
  temperature: 0.5
  max_tokens: 4000

  capabilities:
    - "file_system"
    - "web_search"
    - "code_execution"
    - "data_analysis"
    - "document_processing"

  tools:
    - "read_file"
    - "write_file"
    - "search_web"
    - "run_terminal_cmd"
    - "analyze_data"
    - "process_document"

  context_window: 8000
  memory_enabled: true
  memory_size: 100
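
To switch between the basic and comprehensive profiles at launch, the two files can be selected as alternatives in a Hydra config group. The group name deep_agent below is an assumption based on the directory name.

# Select the comprehensive deep agent profile (group name assumed)
uv run deepresearch deep_agent=comprehensive question="Summarize recent literature on enzyme thermostability"

# Lower the sampling temperature for a more deterministic run
uv run deepresearch deep_agent=comprehensive deep_agent.temperature=0.2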

Prompt Templates (prompts/)

PRIME Parser Prompt

# configs/prompts/prime_parser.yaml
system_prompt: |
  You are an expert research query parser for the PRIME protein engineering system.
  Your task is to analyze research questions and extract key scientific intent,
  identify relevant protein engineering domains, and structure the query for
  optimal tool selection and workflow planning.

  Focus on:
  1. Scientific domain identification (immunology, enzymology, etc.)
  2. Query intent classification (design, analysis, prediction, etc.)
  3. Key entities and relationships
  4. Required computational methods

instructions: |
  Parse the research question and return structured output with:
  - scientific_domain: Primary domain of research
  - query_intent: Main objective (design, analyze, predict, etc.)
  - key_entities: Important proteins, genes, or molecules mentioned
  - required_methods: Computational approaches needed
  - complexity_level: low, medium, high
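
For illustration, a parsed result following the fields listed above might look like the example below; the values are hypothetical and only show the expected shape of the structured output.

# Hypothetical parser output for "Design a thermostable variant of Cas9"
scientific_domain: "protein_engineering"
query_intent: "design"
key_entities: ["Cas9"]
required_methods: ["structure_prediction", "de_novo_design"]
complexity_level: "high"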

RAG Configuration (rag/)

Vector Store Configuration

# configs/rag/vector_store/chroma.yaml
vector_store:
  type: "chroma"
  collection_name: "deepcritical_docs"
  persist_directory: "./chroma_db"

  embedding:
    model: "all-MiniLM-L6-v2"
    dimension: 384
    batch_size: 32

  search:
    k: 5
    score_threshold: 0.7
    include_metadata: true

LLM Configuration

# configs/rag/llm/openai.yaml
llm:
  provider: "openai"
  model: "gpt-4"
  temperature: 0.1
  max_tokens: 1000
  api_key: "${oc.env:OPENAI_API_KEY}"

  parameters:
    top_p: 0.9
    frequency_penalty: 0.0
    presence_penalty: 0.0
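
Because the vector store and LLM settings live in separate subdirectories, they can be swapped independently. The command below assumes they are exposed as nested Hydra config groups named rag/vector_store and rag/llm, mirroring the directory layout.

# Compose a RAG setup from individual components (group names assumed)
uv run deepresearch \
  rag/vector_store=chroma \
  rag/llm=openai \
  question="Summarize GO annotations for TP53"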

State Machine Configurations (statemachines/)

Flow Configurations

# configs/statemachines/flows/prime.yaml
enabled: true
params:
  adaptive_replanning: true
  manual_confirmation: false
  tool_validation: true
  scientific_intent_detection: true

  domain_heuristics:
    - immunology
    - enzymology
    - cell_biology

  tool_categories:
    - knowledge_query
    - sequence_analysis
    - structure_prediction
    - molecular_docking
    - de_novo_design
    - function_prediction

Orchestrator Configuration

# configs/statemachines/orchestrators/config.yaml
orchestrators:
  primary:
    type: "react"
    max_iterations: 10
    convergence_threshold: 0.95

  sub_orchestrators:
    - name: "search"
      type: "linear"
      max_steps: 5

    - name: "analysis"
      type: "tree"
      branching_factor: 3

VLLM Configurations (vllm/)

Default Configuration

# configs/vllm/default.yaml
vllm:
  model: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
  tensor_parallel_size: 1
  dtype: "auto"

  generation:
    temperature: 0.7
    top_p: 0.9
    max_tokens: 512
    repetition_penalty: 1.1

  performance:
    max_model_len: 2048
    max_num_seqs: 16
    max_paddings: 256

Workflow Orchestration (workflow_orchestration/)

Primary Workflow

# configs/workflow_orchestration/primary_workflow/react_primary.yaml
workflow:
  type: "react"
  max_iterations: 10
  convergence_threshold: 0.95

  steps:
    - name: "thought"
      type: "reasoning"
      required: true

    - name: "action"
      type: "tool_execution"
      required: true

    - name: "observation"
      type: "result_processing"
      required: true

Multi-Agent Systems

# configs/workflow_orchestration/multi_agent_systems/default_multi_agent.yaml
multi_agent:
  enabled: true
  max_agents: 5
  communication_protocol: "message_passing"

  agents:
    - role: "coordinator"
      model: "anthropic:claude-sonnet-4-0"
      capabilities: ["planning", "monitoring"]

    - role: "specialist"
      model: "anthropic:claude-sonnet-4-0"
      capabilities: ["analysis", "execution"]

Configuration Composition

DeepCritical supports flexible configuration composition:

# Select a specific top-level configuration
# (Hydra accepts a single --config-path; components such as the
# bioinformatics/ and rag/ configs are composed through its defaults list)
uv run deepresearch \
  --config-name=config_with_modes \
  --config-path=configs \
  question="Bioinformatics research query"

# Override specific parameters
uv run deepresearch \
  question="Custom question" \
  flows.prime.enabled=true \
  flows.bioinformatics.data_sources.go.year_min=2023 \
  model.temperature=0.8

Environment Variables

Many configuration values can be supplied through environment variables using OmegaConf's oc.env resolver:

# In any config file
api_keys:
  anthropic: "${oc.env:ANTHROPIC_API_KEY}"
  openai: "${oc.env:OPENAI_API_KEY}"

database:
  password: "${oc.env:DATABASE_PASSWORD}"
  host: "${oc.env:DATABASE_HOST,localhost}"
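
The referenced variables must be set in the environment before launching a run. In a POSIX shell they can be exported directly; the variable names below are the ones used in the examples in this guide.

# Provide credentials for the examples above before launching
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export NEO4J_PASSWORD="change-me"
export DATABASE_HOST="db.example.internal"   # optional; falls back to localhost
uv run deepresearch question="test"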

Best Practices

  1. Start Simple: Begin with basic configurations and add complexity as needed
  2. Use Composition: Leverage Hydra's composition features for reusable components
  3. Environment Variables: Use environment variables for sensitive data
  4. Documentation: Document custom configurations for team use
  5. Validation: Test configurations before production deployment
  6. Version Control: Keep configuration files in version control
  7. Backups: Maintain backups of critical configurations

Troubleshooting

Common Configuration Issues

Missing Required Parameters:

# Check configuration structure
uv run deepresearch --cfg job

# Validate against schemas
uv run deepresearch --config-name=my_config --cfg job

Environment Variable Issues:

# Check environment variable resolution
export MY_VAR="test_value"
uv run deepresearch hydra.verbose=true question="test"

Configuration Conflicts:

# Inspect the composed defaults list to see which configs take precedence
uv run deepresearch --info defaults

# Use specific config files
uv run deepresearch --config-path=configs/bioinformatics question="test"

For more detailed information about specific configuration areas, see the API Reference and individual flow documentation.