Tool Development Guide
This guide provides comprehensive instructions for developing, testing, and integrating new tools into the DeepCritical ecosystem.
Overview
DeepCritical's tool system is designed to be extensible, allowing researchers and developers to add new capabilities seamlessly. Tools can be written in any language and integrate with various external services and APIs.
Tool Architecture
Core Components
Every DeepCritical tool consists of three main components:
- Tool Specification: Metadata describing the tool's interface
- Tool Runner: The actual implementation that executes the tool
- Tool Registration: Integration with the tool registry
Tool Specification
The tool specification defines the tool's interface using the ToolSpec class:
from deepresearch.src.datatypes.tools import ToolSpec, ToolCategory

tool_spec = ToolSpec(
    name="sequence_alignment",
    description="Performs pairwise or multiple sequence alignment",
    category=ToolCategory.SEQUENCE_ANALYSIS,
    inputs={
        "sequences": {
            "type": "list",
            "description": "List of DNA/RNA/protein sequences",
            "required": True,
            "schema": {
                "type": "array",
                "items": {"type": "string", "minLength": 1}
            }
        },
        "algorithm": {
            "type": "string",
            "description": "Alignment algorithm to use",
            "required": False,
            "default": "blast",
            "enum": ["blast", "clustal", "muscle", "mafft"]
        },
        "output_format": {
            "type": "string",
            "description": "Output format",
            "required": False,
            "default": "fasta",
            "enum": ["fasta", "clustal", "phylip", "nexus"]
        }
    },
    outputs={
        "alignment": {
            "type": "string",
            "description": "Aligned sequences in specified format"
        },
        "score": {
            "type": "number",
            "description": "Alignment quality score"
        },
        "metadata": {
            "type": "object",
            "description": "Additional alignment metadata",
            "properties": {
                "execution_time": {"type": "number"},
                "algorithm_version": {"type": "string"},
                "warnings": {"type": "array", "items": {"type": "string"}}
            }
        }
    },
    metadata={
        "version": "1.0.0",
        "author": "Bioinformatics Team",
        "license": "MIT",
        "tags": ["alignment", "bioinformatics", "sequence"],
        "dependencies": ["biopython", "numpy"],
        "timeout": 300,  # 5 minutes
        "memory_limit_mb": 1024,
        "gpu_required": False
    }
)
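The `required`, `default`, and `enum` keys in the `inputs` mapping above carry enough information to check a call before the runner ever sees it. The following is a minimal sketch of that idea, assuming the inputs are available as a plain dict shaped like the example. `apply_input_spec` is a hypothetical helper written for illustration, not part of DeepCritical, and the framework may already perform its own validation.

```python
from typing import Any, Dict


def apply_input_spec(inputs: Dict[str, Dict[str, Any]],
                     params: Dict[str, Any]) -> Dict[str, Any]:
    """Fill defaults and reject missing, unknown, or out-of-enum values."""
    unknown = set(params) - set(inputs)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    resolved: Dict[str, Any] = {}
    for name, rule in inputs.items():
        if name in params:
            value = params[name]
        elif rule.get("required", False):
            raise ValueError(f"Missing required parameter: {name}")
        elif "default" in rule:
            value = rule["default"]
        else:
            continue  # optional and no default: leave it out
        if "enum" in rule and value not in rule["enum"]:
            raise ValueError(f"{name} must be one of {rule['enum']}")
        resolved[name] = value
    return resolved


# Trimmed copy of the "sequence_alignment" inputs shown above.
alignment_inputs = {
    "sequences": {"type": "list", "required": True},
    "algorithm": {
        "type": "string",
        "required": False,
        "default": "blast",
        "enum": ["blast", "clustal", "muscle", "mafft"],
    },
}

resolved = apply_input_spec(alignment_inputs, {"sequences": ["ATCG", "ATGG"]})
print(resolved)
```

Resolving defaults in one place keeps the `parameters.get("algorithm", "blast")` style fallbacks in the runner from drifting away from the values declared in the specification.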
Tool Runner Implementation
The tool runner implements the actual functionality:
import time
from typing import Any, Dict

from deepresearch.src.tools.base import ToolRunner, ExecutionResult
from deepresearch.src.datatypes.tools import ToolSpec, ToolCategory


class SequenceAlignmentTool(ToolRunner):
    """Tool for performing sequence alignments."""

    def __init__(self):
        super().__init__(ToolSpec(
            name="sequence_alignment",
            description="Performs pairwise or multiple sequence alignment",
            category=ToolCategory.SEQUENCE_ANALYSIS,
            # ... inputs, outputs, metadata as above
        ))

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        """Execute the sequence alignment."""
        start_time = time.time()
        try:
            # Extract parameters
            sequences = parameters["sequences"]
            algorithm = parameters.get("algorithm", "blast")
            output_format = parameters.get("output_format", "fasta")

            # Validate inputs
            if not sequences or len(sequences) < 2:
                return ExecutionResult(
                    success=False,
                    error="At least 2 sequences required for alignment",
                    error_type="ValidationError"
                )

            # Perform alignment
            alignment_result = self._perform_alignment(
                sequences, algorithm, output_format
            )
            execution_time = time.time() - start_time
            return ExecutionResult(
                success=True,
                data={
                    "alignment": alignment_result["alignment"],
                    "score": alignment_result["score"],
                    "metadata": {
                        "execution_time": execution_time,
                        "algorithm_version": "1.0.0",
                        "warnings": alignment_result.get("warnings", [])
                    }
                },
                execution_time=execution_time
            )
        except Exception as e:
            execution_time = time.time() - start_time
            return ExecutionResult(
                success=False,
                error=str(e),
                error_type=type(e).__name__,
                execution_time=execution_time
            )

    def _perform_alignment(self, sequences, algorithm, output_format):
        """Perform the actual alignment logic."""
        # Implementation here - would use BioPython or other alignment libraries
        # This is a simplified example that returns placeholder data
        if algorithm == "blast":
            # BLAST alignment logic
            pass
        elif algorithm == "clustal":
            # Clustal Omega alignment logic
            pass
        # ... other algorithms
        return {
            "alignment": ">seq1\nATCG...\n>seq2\nATCG...",
            "score": 85.5,
            "warnings": []
        }
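Both the success and failure paths of `run` above return the same result type, so callers can branch on `success` instead of catching exceptions. As a rough sketch of that contract, the stand-in below mirrors only the fields this guide passes to `ExecutionResult`. `DemoResult` and `timed_call` are hypothetical names, and the real class in `deepresearch.src.tools.base` may define more or different fields.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class DemoResult:
    """Stand-in for ExecutionResult; fields inferred from the examples above."""
    success: bool
    data: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    error_type: Optional[str] = None
    execution_time: float = 0.0


def timed_call(func: Callable[[], Dict[str, Any]]) -> DemoResult:
    """Run func, converting any exception into a failed result."""
    start = time.perf_counter()
    try:
        data = func()
    except Exception as exc:
        return DemoResult(
            success=False,
            error=str(exc),
            error_type=type(exc).__name__,
            execution_time=time.perf_counter() - start,
        )
    return DemoResult(
        success=True,
        data=data,
        execution_time=time.perf_counter() - start,
    )


ok = timed_call(lambda: {"score": 85.5})
failed = timed_call(lambda: {}["missing"])  # raises KeyError inside the call
print(ok.success, failed.error_type)
```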
Development Workflow
1. Planning Your Tool
Before implementing a tool, consider:
- Purpose: What problem does this tool solve?
- Inputs/Outputs: What data does it need and produce?
- Dependencies: What external libraries or services are required?
- Performance: What's the expected execution time and resource usage?
- Error Cases: What can go wrong and how should it be handled?
2. Creating the Tool Specification
Start by defining a clear, comprehensive specification:
def create_tool_spec() -> ToolSpec:
    """Create tool specification for a BLAST search tool."""
    return ToolSpec(
        name="blast_search",
        description="Perform BLAST sequence similarity searches",
        category=ToolCategory.SEQUENCE_ANALYSIS,
        inputs={
            "sequence": {
                "type": "string",
                "description": "Query sequence in FASTA format",
                "required": True,
                "minLength": 10,
                "maxLength": 10000
            },
            "database": {
                "type": "string",
                "description": "Target database to search",
                "required": False,
                "default": "nr",
                "enum": ["nr", "refseq", "swissprot", "pdb"]
            },
            "e_value_threshold": {
                "type": "number",
                "description": "E-value threshold for results",
                "required": False,
                "default": 1e-5,
                "minimum": 0,
                "maximum": 1
            },
            "max_results": {
                "type": "integer",
                "description": "Maximum number of results to return",
                "required": False,
                "default": 100,
                "minimum": 1,
                "maximum": 1000
            }
        },
        outputs={
            "results": {
                "type": "array",
                "description": "List of BLAST hit results",
                "items": {
                    "type": "object",
                    "properties": {
                        "accession": {"type": "string"},
                        "description": {"type": "string"},
                        "e_value": {"type": "number"},
                        "identity": {"type": "number"},
                        "alignment_length": {"type": "integer"}
                    }
                }
            },
            "search_info": {
                "type": "object",
                "description": "Search metadata and statistics",
                "properties": {
                    "database_size": {"type": "integer"},
                    "search_time": {"type": "number"},
                    "total_hits": {"type": "integer"}
                }
            }
        },
        metadata={
            "version": "2.0.0",
            "author": "NCBI Tools Team",
            "license": "Public Domain",
            "tags": ["blast", "similarity", "search", "sequence"],
            "dependencies": ["biopython", "requests"],
            "timeout": 600,  # 10 minutes
            "memory_limit_mb": 2048,
            "network_required": True
        }
    )
3. Implementing the Tool Runner
Implement the core logic with proper error handling:
import time
import urllib.error
from typing import Any, Dict

import requests
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

from deepresearch.src.tools.base import ToolRunner, ExecutionResult


class BlastSearchTool(ToolRunner):
    """NCBI BLAST search tool."""

    def __init__(self):
        super().__init__(create_tool_spec())

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        """Execute BLAST search."""
        start_time = time.time()
        try:
            # Extract and validate parameters
            sequence = self._validate_sequence(parameters["sequence"])
            database = parameters.get("database", "nr")
            e_threshold = parameters.get("e_value_threshold", 1e-5)
            max_results = parameters.get("max_results", 100)

            # Perform BLAST search
            result_handle = NCBIWWW.qblast(
                program="blastp" if self._is_protein(sequence) else "blastn",
                database=database,
                sequence=sequence,
                expect=e_threshold,
                hitlist_size=max_results
            )

            # Parse results
            blast_records = NCBIXML.parse(result_handle)
            results = self._parse_blast_results(blast_records, max_results)
            execution_time = time.time() - start_time
            return ExecutionResult(
                success=True,
                data={
                    "results": results,
                    "search_info": {
                        "database_size": self._get_database_size(database),
                        "search_time": execution_time,
                        "total_hits": len(results)
                    }
                },
                execution_time=execution_time
            )
        # Biopython's qblast talks to NCBI through urllib, so URLError is
        # caught alongside the requests exceptions.
        except (requests.exceptions.RequestException, urllib.error.URLError) as e:
            return ExecutionResult(
                success=False,
                error=f"Network error during BLAST search: {e}",
                error_type="NetworkError",
                execution_time=time.time() - start_time
            )
        except Exception as e:
            return ExecutionResult(
                success=False,
                error=f"BLAST search failed: {e}",
                error_type=type(e).__name__,
                execution_time=time.time() - start_time
            )

    def _validate_sequence(self, sequence: str) -> str:
        """Validate and clean input sequence."""
        # Remove FASTA header if present
        lines = sequence.strip().split('\n')
        if lines[0].startswith('>'):
            sequence = '\n'.join(lines[1:])

        # Remove whitespace and validate
        sequence = ''.join(sequence.split()).upper()
        if len(sequence) < 10:
            raise ValueError("Sequence too short (minimum 10 characters)")
        if len(sequence) > 10000:
            raise ValueError("Sequence too long (maximum 10000 characters)")

        # Validate sequence characters: IUPAC nucleotide codes plus the
        # standard amino-acid letters, so protein queries are accepted too.
        # X stays rejected, matching the validation test in this guide.
        valid_chars = set('ABCDEFGHIKLMNPQRSTUVWYZ-')
        if not all(c in valid_chars for c in sequence):
            raise ValueError("Invalid characters in sequence")
        return sequence

    def _is_protein(self, sequence: str) -> bool:
        """Determine if sequence is protein or nucleotide."""
        # Simple heuristic: these letters never occur in nucleotide codes
        protein_chars = set('EFILPQXZ')
        return any(c in protein_chars for c in sequence.upper())

    def _parse_blast_results(self, blast_records, max_results):
        """Parse BLAST XML results into structured format."""
        results = []
        for blast_record in blast_records:
            for alignment in blast_record.alignments[:max_results]:
                for hsp in alignment.hsps:
                    results.append({
                        "accession": alignment.accession,
                        "description": alignment.title,
                        "e_value": hsp.expect,
                        "identity": (hsp.identities / hsp.align_length) * 100,
                        "alignment_length": hsp.align_length,
                        "query_start": hsp.query_start,
                        "query_end": hsp.query_end,
                        "subject_start": hsp.sbjct_start,
                        "subject_end": hsp.sbjct_end
                    })
                    if len(results) >= max_results:
                        break
                if len(results) >= max_results:
                    break
        return results

    def _get_database_size(self, database: str) -> int:
        """Get approximate database size."""
        # This would typically query NCBI for actual database statistics
        db_sizes = {
            "nr": 500000000,      # 500M sequences
            "refseq": 100000000,  # 100M sequences
            "swissprot": 500000,  # 500K sequences
            "pdb": 100000         # 100K sequences
        }
        return db_sizes.get(database, 0)
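As written, the `break` statements in `_parse_blast_results` appear to leave only the two inner loops, so a result containing several BLAST records could return slightly more than `max_results` hits. One possible alternative, sketched here under that assumption, is to flatten the hits with a generator and cap them with `itertools.islice`. The `SimpleNamespace` objects are stand-ins that copy only the attribute names used above; they are not Biopython classes, and `iter_hits` and `parse_hits` are hypothetical names.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Iterator, List


def iter_hits(blast_records: Iterable[Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per HSP, across all records and alignments."""
    for record in blast_records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                yield {
                    "accession": alignment.accession,
                    "e_value": hsp.expect,
                    "identity": (hsp.identities / hsp.align_length) * 100,
                    "alignment_length": hsp.align_length,
                }


def parse_hits(blast_records: Iterable[Any], max_results: int) -> List[Dict[str, Any]]:
    """Collect at most max_results hits, stopping as soon as the cap is hit."""
    return list(islice(iter_hits(blast_records), max_results))


def fake_record(accessions: List[str]) -> SimpleNamespace:
    """Build a stand-in record with one HSP per alignment."""
    hsp = SimpleNamespace(expect=1e-10, identities=90, align_length=100)
    alignments = [SimpleNamespace(accession=a, hsps=[hsp]) for a in accessions]
    return SimpleNamespace(alignments=alignments)


records = [fake_record(["A1", "A2"]), fake_record(["B1", "B2"])]
hits = parse_hits(records, max_results=3)
print([hit["accession"] for hit in hits])
```

Because the generator is lazy, records beyond the cap are never parsed, which also matters when `NCBIXML.parse` is streaming a large result.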
4. Testing Your Tool
Create comprehensive tests for your tool:
import pytest
from unittest.mock import patch, MagicMock


class TestBlastSearchTool:
    @pytest.fixture
    def tool(self):
        """Create tool instance for testing."""
        return BlastSearchTool()

    def test_tool_specification(self, tool):
        """Test tool specification is correctly defined."""
        spec = tool.get_spec()
        assert spec.name == "blast_search"
        assert spec.category == ToolCategory.SEQUENCE_ANALYSIS
        assert "sequence" in spec.inputs
        assert "results" in spec.outputs

    def test_sequence_validation(self, tool):
        """Test sequence input validation."""
        # Valid sequence
        valid_seq = tool._validate_sequence("ATCGATCGATCGATCGATCG")
        assert valid_seq == "ATCGATCGATCGATCGATCG"

        # Sequence with FASTA header
        fasta_seq = ">test\nATCGATCG\nATCGATCG"
        cleaned = tool._validate_sequence(fasta_seq)
        assert cleaned == "ATCGATCGATCGATCG"

        # Invalid sequences
        with pytest.raises(ValueError, match="too short"):
            tool._validate_sequence("ATCG")
        # Long enough to pass the length check, so the character check runs
        with pytest.raises(ValueError, match="Invalid characters"):
            tool._validate_sequence("ATCGATCGXATCG")  # X is invalid

    @patch('Bio.Blast.NCBIWWW.qblast')
    def test_successful_search(self, mock_qblast, tool):
        """Test successful BLAST search."""
        # Mock BLAST response
        mock_result = MagicMock()
        mock_qblast.return_value = mock_result

        # Mock parsing
        with patch.object(tool, '_parse_blast_results', return_value=[
            {
                "accession": "XP_001234",
                "description": "Test protein",
                "e_value": 1e-10,
                "identity": 95.5,
                "alignment_length": 100
            }
        ]):
            result = tool.run({
                "sequence": "ATCGATCGATCGATCGATCGATCGATCGATCGATCG"
            })

        assert result.success
        assert "results" in result.data
        assert len(result.data["results"]) == 1
        assert result.data["results"][0]["accession"] == "XP_001234"

    @patch('Bio.Blast.NCBIWWW.qblast')
    def test_network_error_handling(self, mock_qblast, tool):
        """Test network error handling."""
        from requests.exceptions import ConnectionError
        mock_qblast.side_effect = ConnectionError("Network timeout")

        result = tool.run({
            "sequence": "ATCGATCGATCGATCGATCGATCGATCGATCGATCG"
        })

        assert not result.success
        assert "Network error" in result.error
        assert result.error_type == "NetworkError"

    def test_protein_detection(self, tool):
        """Test protein vs nucleotide sequence detection."""
        # Nucleotide sequence
        assert not tool._is_protein("ATCGATCGATCG")
        # Protein sequence
        assert tool._is_protein("MEEPQSDPSVEPPLSQETFSDLWK")
        # Mixed/ambiguous
        assert tool._is_protein("ATCGLEUF")  # Contains E, F

    @pytest.mark.parametrize("database,expected_size", [
        ("nr", 500000000),
        ("swissprot", 500000),
        ("unknown", 0)
    ])
    def test_database_size_lookup(self, tool, database, expected_size):
        """Test database size lookup."""
        assert tool._get_database_size(database) == expected_size
5. Registering Your Tool
Register the tool with the system:
from deepresearch.src.utils.tool_registry import ToolRegistry


def register_blast_tool():
    """Register the BLAST search tool."""
    registry = ToolRegistry.get_instance()
    tool = BlastSearchTool()
    registry.register_tool(tool.get_spec(), tool)
    print(f"Registered tool: {tool.get_spec().name}")


# Register during module import or application startup
register_blast_tool()
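The registration call above relies on a shared registry instance that maps tool names to runners. To make that flow concrete, here is a deliberately simplified stand-in. `DemoRegistry` is hypothetical, stores plain callables rather than `ToolSpec`/`ToolRunner` pairs, and its duplicate-name policy is an assumption; the real `ToolRegistry` may behave differently.

```python
from typing import Any, Callable, Dict, Optional

Runner = Callable[[Dict[str, Any]], Any]


class DemoRegistry:
    """Stand-in singleton registry keyed by tool name."""

    _instance: Optional["DemoRegistry"] = None

    def __init__(self) -> None:
        self._tools: Dict[str, Runner] = {}

    @classmethod
    def get_instance(cls) -> "DemoRegistry":
        # Create the shared instance lazily on first use.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def register_tool(self, name: str, runner: Runner, replace: bool = False) -> None:
        # Refuse silent overwrites unless the caller opts in.
        if name in self._tools and not replace:
            raise ValueError(f"Tool already registered: {name}")
        self._tools[name] = runner

    def execute_tool(self, name: str, parameters: Dict[str, Any]) -> Any:
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name](parameters)


registry = DemoRegistry.get_instance()
registry.register_tool("echo", lambda params: {"echo": params["text"]})
print(registry.execute_tool("echo", {"text": "hello"}))
```

Registering at import time means the call runs once per process when the module is first imported, so a registry generally needs a clear rule for duplicate names like the `replace` flag sketched here.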
Advanced Tool Features
Asynchronous Execution
For tools that perform long-running operations:
import asyncio
from deepresearch.src.tools.base import AsyncToolRunner


class AsyncBlastTool(AsyncToolRunner):
    """Asynchronous BLAST search tool."""

    async def run_async(self, parameters: Dict[str, Any]) -> ExecutionResult:
        """Execute BLAST search asynchronously."""
        # Implementation using async HTTP requests
        # This allows better concurrency and resource utilization
        pass
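The `run_async` body above is left open. One way to fill it when the underlying client is blocking, as `NCBIWWW.qblast` is, would be to push the call onto a worker thread with `asyncio.to_thread` (Python 3.9+). This is a sketch only: `blocking_search` is a hypothetical stand-in that sleeps instead of contacting NCBI, and a native async HTTP client would be another valid route.

```python
import asyncio
import time
from typing import Any, Dict, List


def blocking_search(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a blocking network call."""
    time.sleep(0.05)  # simulate network latency
    return {"query": parameters["sequence"], "total_hits": 0}


async def run_async(parameters: Dict[str, Any]) -> Dict[str, Any]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_search, parameters)


async def run_many(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # gather preserves the order of its arguments in the returned list.
    return await asyncio.gather(*(run_async(p) for p in batch))


results = asyncio.run(run_many([{"sequence": "ATCG"}, {"sequence": "GGCC"}]))
print([r["query"] for r in results])
```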
Streaming Results
For tools that produce large amounts of data:
from deepresearch.src.tools.base import StreamingToolRunner


class StreamingAlignmentTool(StreamingToolRunner):
    """Tool that streams alignment results."""

    def run_streaming(self, parameters: Dict[str, Any]):
        """Execute alignment and stream results."""
        # Yield results as they become available
        for partial_result in self._perform_incremental_alignment(parameters):
            yield partial_result
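The streaming runner above leaves the shape of each partial result undefined. A plausible shape, sketched here as an assumption, is a batch of output plus progress counters so that a consumer can report progress without holding the full result in memory. `stream_batches` is a hypothetical function, and upper-casing stands in for real alignment work.

```python
from typing import Any, Dict, Iterator, List


def stream_batches(sequences: List[str], batch_size: int = 2) -> Iterator[Dict[str, Any]]:
    """Yield processed sequences in batches with progress information."""
    total = len(sequences)
    for start in range(0, total, batch_size):
        batch = sequences[start:start + batch_size]
        yield {
            "aligned": [s.upper() for s in batch],  # placeholder for alignment
            "done": min(start + batch_size, total),
            "total": total,
        }


chunks = list(stream_batches(["atcg", "ggcc", "ttaa"], batch_size=2))
for chunk in chunks:
    print(f"{chunk['done']}/{chunk['total']}: {chunk['aligned']}")
```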
Tool Dependencies
Handle tools that depend on other tools:
class DependentAnalysisTool(ToolRunner):
    """Tool that depends on other tools."""

    def __init__(self, registry: ToolRegistry):
        # tool_spec is this tool's ToolSpec, assumed to be defined elsewhere
        super().__init__(tool_spec)
        self.registry = registry

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        # First, use a BLAST search tool
        blast_result = self.registry.execute_tool("blast_search", {
            "sequence": parameters["sequence"]
        })
        if not blast_result.success:
            return ExecutionResult(
                success=False,
                error=f"BLAST search failed: {blast_result.error}"
            )

        # Then perform analysis on the results
        analysis = self._analyze_blast_results(blast_result.data["results"])
        return ExecutionResult(success=True, data={"analysis": analysis})
Tool Configuration
Support configurable tool behavior:
class ConfigurableBlastTool(ToolRunner):
    """BLAST tool with runtime configuration."""

    def __init__(self, config: Dict[str, Any]):
        self.max_retries = config.get("max_retries", 3)
        self.timeout = config.get("timeout", 600)
        self.api_key = config.get("api_key")
        super().__init__(create_tool_spec())

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        # Use configuration in execution
        # Implementation here
        pass
Tool Packaging and Distribution
Tool Modules
Organize tools into modules:
deepresearch/src/tools/
├── bioinformatics/
│ ├── blast_search.py
│ ├── sequence_alignment.py
│ └── __init__.py
├── chemistry/
│ ├── molecular_docking.py
│ └── property_prediction.py
└── search/
├── web_search.py
└── document_search.py
Tool Discovery
Enable automatic tool discovery:
# In __init__.py
from deepresearch.src.tools.base import ToolRunner
from deepresearch.src.utils.tool_registry import ToolRegistry


def discover_and_register_tools():
    """Automatically discover and register tools."""
    registry = ToolRegistry.get_instance()

    # Import tool modules
    from . import bioinformatics, chemistry, search

    # Register all tools in modules
    tool_modules = [bioinformatics, chemistry, search]
    for module in tool_modules:
        for attr_name in dir(module):
            attr = getattr(module, attr_name)
            if (isinstance(attr, type) and
                    issubclass(attr, ToolRunner) and
                    attr is not ToolRunner):
                # Create instance and register
                # (requires a constructor that takes no arguments)
                tool_instance = attr()
                registry.register_tool(
                    tool_instance.get_spec(),
                    tool_instance
                )


# Auto-discover tools on import
discover_and_register_tools()
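Scanning `dir(module)` as above also picks up any `ToolRunner` subclass that a module merely imports from elsewhere, as well as abstract intermediate classes that cannot be instantiated. The sketch below shows one way to narrow the scan, assuming tools live in the scanned package or its submodules. `DemoRunner` and `find_tool_classes` are hypothetical names, and the throwaway module is built in memory purely for demonstration.

```python
import inspect
import types
from typing import List, Type


class DemoRunner:
    """Stand-in for the ToolRunner base class."""


def find_tool_classes(module: types.ModuleType) -> List[Type[DemoRunner]]:
    """Return concrete DemoRunner subclasses defined in module or below it."""
    prefix = module.__name__ + "."
    found = []
    for _, cls in inspect.getmembers(module, inspect.isclass):
        if not issubclass(cls, DemoRunner) or cls is DemoRunner:
            continue
        defined_here = cls.__module__ == module.__name__
        defined_below = cls.__module__.startswith(prefix)
        if not (defined_here or defined_below):
            continue  # imported from an unrelated module
        if inspect.isabstract(cls):
            continue  # cannot be instantiated
        found.append(cls)
    return found


class ExternalTool(DemoRunner):
    """Defined outside demo_tools, so discovery should skip it."""


# Build an in-memory module that mimics a tools package.
demo = types.ModuleType("demo_tools")
demo.DemoRunner = DemoRunner
demo.ExternalTool = ExternalTool
demo.BlastSearchTool = type(
    "BlastSearchTool", (DemoRunner,), {"__module__": "demo_tools.blast_search"}
)
demo.Helper = type("Helper", (), {"__module__": "demo_tools"})

print([cls.__name__ for cls in find_tool_classes(demo)])
```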
Performance Optimization
Caching
Implement result caching for expensive operations:
from deepresearch.src.utils.cache import ToolCache


class CachedBlastTool(ToolRunner):
    """BLAST tool with result caching."""

    def __init__(self):
        # tool_spec is this tool's ToolSpec, assumed to be defined elsewhere
        super().__init__(tool_spec)
        self.cache = ToolCache(ttl_seconds=3600)  # 1 hour cache

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        # Create cache key from parameters
        cache_key = self.cache.create_key(parameters)

        # Check cache first (compare with None so falsy results still count)
        cached_result = self.cache.get(cache_key)
        if cached_result is not None:
            return cached_result

        # Execute tool
        result = self._execute_blast(parameters)

        # Cache successful results
        if result.success:
            self.cache.set(cache_key, result)
        return result
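The caching pattern above depends on two properties: equal parameters must map to the same key regardless of dict ordering, and entries must expire after the TTL. The following stand-in illustrates both under stated assumptions. `DemoCache` is hypothetical, keeps entries in process memory, and assumes JSON-serializable parameters; the real `ToolCache` may work differently.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Optional, Tuple


class DemoCache:
    """Stand-in TTL cache keyed by a hash of the parameters."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without waiting
        self._entries: Dict[str, Tuple[float, Any]] = {}

    @staticmethod
    def create_key(parameters: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of dict insertion order.
        canonical = json.dumps(parameters, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


now = [0.0]  # fake clock, advanced by hand
cache = DemoCache(ttl_seconds=60, clock=lambda: now[0])
key_a = cache.create_key({"sequence": "ATCG", "database": "nr"})
key_b = cache.create_key({"database": "nr", "sequence": "ATCG"})
cache.set(key_a, {"total_hits": 3})
print(key_a == key_b, cache.get(key_b))
```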
Resource Management
Handle resource-intensive operations properly:
import psutil
import os


class ResourceAwareBlastTool(ToolRunner):
    """BLAST tool with resource monitoring."""

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        # Check available memory
        available_memory = psutil.virtual_memory().available / (1024 * 1024)  # MB
        if available_memory < self.get_spec().metadata.get("memory_limit_mb", 1024):
            return ExecutionResult(
                success=False,
                error="Insufficient memory for BLAST search",
                error_type="ResourceError"
            )

        # Monitor memory usage during execution
        process = psutil.Process(os.getpid())
        initial_memory = process.memory_info().rss
        result = self._execute_blast(parameters)
        final_memory = process.memory_info().rss
        memory_used = (final_memory - initial_memory) / (1024 * 1024)  # MB

        # Add memory usage to result metadata
        if result.success and "metadata" in result.data:
            result.data["metadata"]["memory_used_mb"] = memory_used
        return result
Error Handling and Recovery
Comprehensive Error Handling
import time

# ValidationError, NetworkError and TemporaryError are assumed to be
# application-defined exception classes; TimeoutError is the Python built-in.


class RobustBlastTool(ToolRunner):
    """BLAST tool with comprehensive error handling."""

    def run(self, parameters: Dict[str, Any]) -> ExecutionResult:
        try:
            # Input validation
            validated_params = self._validate_parameters(parameters)
            # Pre-flight checks
            self._check_prerequisites(validated_params)
            # Execute with retries
            result = self._execute_with_retries(validated_params)
            # Post-processing validation
            self._validate_results(result)
            return result
        except ValidationError as e:
            return ExecutionResult(
                success=False,
                error=f"Input validation failed: {e}",
                error_type="ValidationError"
            )
        except NetworkError as e:
            return ExecutionResult(
                success=False,
                error=f"Network error: {e}",
                error_type="NetworkError"
            )
        except TimeoutError as e:
            return ExecutionResult(
                success=False,
                error=f"Operation timed out: {e}",
                error_type="TimeoutError"
            )
        except Exception as e:
            # Log unexpected errors
            self._log_error(e, parameters)
            return ExecutionResult(
                success=False,
                error=f"Unexpected error: {e}",
                error_type="InternalError"
            )

    def _validate_parameters(self, parameters):
        """Validate input parameters."""
        # Implementation here
        pass

    def _check_prerequisites(self, parameters):
        """Check system prerequisites."""
        # Check network connectivity, API availability, etc.
        pass

    def _execute_with_retries(self, parameters, max_retries=3):
        """Execute with automatic retries."""
        for attempt in range(max_retries):
            try:
                return self._execute_blast(parameters)
            except TemporaryError:
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # Exponential backoff
                else:
                    raise

    def _validate_results(self, result):
        """Validate execution results."""
        # Check result structure, data integrity, etc.
        pass

    def _log_error(self, error, parameters):
        """Log errors for debugging."""
        # Implementation here
        pass
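The retry loop in `_execute_with_retries` above can be exercised in isolation. This sketch restates it as a free function with an injectable `sleep`, so the backoff schedule can be checked without waiting. `TemporaryError` is defined locally as a stand-in because the guide does not show where it comes from, and the `base_delay` parameter is an addition for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TemporaryError(Exception):
    """Stand-in for a transient failure that is worth retrying."""


def execute_with_retries(operation: Callable[[], T],
                         max_retries: int = 3,
                         base_delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on TemporaryError with exponential backoff."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return operation()
        except TemporaryError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky() -> str:
    """Fail twice, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TemporaryError("try again")
    return "ok"


delays: List[float] = []
outcome = execute_with_retries(flaky, max_retries=3, sleep=delays.append)
print(outcome, delays)
```

Only the transient error type is retried; validation failures and other exceptions propagate immediately, which matches the ordering of the `except` clauses in `run`.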
Testing Best Practices
Test Categories
- Unit Tests: Test individual methods and functions
- Integration Tests: Test tool interaction with external services
- Performance Tests: Test execution time and resource usage
- Error Handling Tests: Test various error conditions
- Edge Case Tests: Test boundary conditions and unusual inputs
Test Fixtures
@pytest.fixture
def sample_blast_parameters():
    """Provide sample BLAST search parameters."""
    return {
        "sequence": "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP",
        "database": "swissprot",
        "e_value_threshold": 1e-5,
        "max_results": 50
    }


@pytest.fixture
def mock_blast_response():
    """Mock BLAST search response."""
    return {
        "results": [
            {
                "accession": "P04637",
                "description": "Cellular tumor antigen p53",
                "e_value": 1e-150,
                "identity": 100.0,
                "alignment_length": 393
            }
        ],
        "search_info": {
            "database_size": 500000,
            "search_time": 2.5,
            "total_hits": 1
        }
    }
Mocking External Dependencies
@patch('Bio.Blast.NCBIWWW.qblast')
def test_blast_search_with_mock(mock_qblast, tool, sample_blast_parameters, mock_blast_response):
    """Test BLAST search with mocked NCBI API."""
    # Setup mock
    mock_result = MagicMock()
    mock_qblast.return_value = mock_result

    # Mock result parsing
    with patch.object(tool, '_parse_blast_results', return_value=mock_blast_response["results"]):
        result = tool.run(sample_blast_parameters)

    assert result.success
    assert result.data["results"] == mock_blast_response["results"]
    mock_qblast.assert_called_once()
Documentation
Tool Documentation
Provide comprehensive documentation for your tool:
def get_tool_documentation():
    """Get detailed documentation for the BLAST search tool."""
    return {
        "name": "NCBI BLAST Search",
        "description": "Perform sequence similarity searches using NCBI BLAST",
        "version": "2.0.0",
        "author": "NCBI Tools Team",
        "license": "Public Domain",
        "usage_examples": [
            {
                "description": "Basic protein BLAST search",
                "parameters": {
                    "sequence": "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP",
                    "database": "swissprot"
                }
            },
            {
                "description": "Nucleotide BLAST with custom parameters",
                "parameters": {
                    "sequence": "ATCGATCGATCGATCGATCGATCG",
                    "database": "nr",
                    "e_value_threshold": 1e-10,
                    "max_results": 100
                }
            }
        ],
        "limitations": [
            "Requires internet connection for NCBI API access",
            "Subject to NCBI usage policies and rate limits",
            "Large searches may take significant time"
        ],
        "troubleshooting": {
            "NetworkError": "Check internet connection and NCBI service status",
            "TimeoutError": "Reduce sequence length or increase timeout limit",
            "ValidationError": "Ensure sequence format is correct"
        }
    }
Deployment and Distribution
Tool Packaging
Package tools for distribution:
# setup.py (the same metadata can also be declared in pyproject.toml)
from setuptools import setup

setup(
    name="deepcritical-blast-tool",
    version="2.0.0",
    packages=["deepresearch.tools.bioinformatics"],
    install_requires=[
        "deepresearch>=1.0.0",
        "biopython>=1.80",
        "requests>=2.28.0"
    ],
    entry_points={
        "deepresearch.tools": [
            "blast_search = deepresearch.tools.bioinformatics.blast_search:BlastSearchTool"
        ]
    }
)
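The `entry_points` declaration above only advertises the tool; something on the host side still has to look the group up and import the classes. Whether DeepCritical does this is not stated in this guide, so the following is a sketch of how such a lookup could be written with the standard library's `importlib.metadata`. `load_tool_classes` is a hypothetical helper, and the group name is taken from the packaging example.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def load_tool_classes(group: str = "deepresearch.tools") -> Dict[str, Any]:
    """Import every object advertised under the given entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: a plain dict keyed by group name
        selected = eps.get(group, [])
    # ep.load() imports the module and returns the referenced attribute.
    return {ep.name: ep.load() for ep in selected}


tools = load_tool_classes()
print(sorted(tools))
```

In an environment where no installed distribution declares the group, the function simply returns an empty dict.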
CI/CD Integration
Integrate tool testing into CI/CD:
# .github/workflows/test-tools.yml
name: Test Tools
on: [push, pull_request]
jobs:
  test-bioinformatics-tools:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -e .[dev]
      - name: Run bioinformatics tool tests
        run: pytest tests/tools/test_bioinformatics/ -v
      - name: Test tool registration
        run: python -c "from deepresearch.tools.bioinformatics import register_tools; register_tools()"
Best Practices Summary
- Clear Specifications: Define comprehensive input/output specifications
- Robust Error Handling: Handle all error conditions gracefully
- Comprehensive Testing: Test all code paths and edge cases
- Performance Awareness: Monitor and optimize resource usage
- Good Documentation: Provide clear usage examples and limitations
- Version Compatibility: Maintain backward compatibility
- Security Conscious: Validate inputs and handle sensitive data properly
- Modular Design: Keep tools focused on single responsibilities
Related Documentation
- Tool Registry Guide - Tool registration and management
- Testing Guide - Testing best practices
- Contributing Guide - Contribution guidelines
- API Reference - Complete tool API documentation