Tensor Cache Architecture

Semantic caching for LLM responses with cost tracking and background eviction. Module 10 of Neumann.

The tensor_cache module provides multi-layer caching optimized for LLM workloads. It combines O(1) exact hash lookups with O(log n) semantic similarity search via HNSW indices. All cache entries are stored as TensorData in a shared TensorStore, following the tensor-native paradigm used by tensor_vault and tensor_blob.

Design Principles

| Principle | Description |
|---|---|
| Multi-Layer Caching | Exact O(1), semantic O(log n), and embedding O(1) lookups |
| Cost-Aware | Tracks tokens and estimates savings using tiktoken |
| Background Eviction | Async eviction with configurable strategies |
| TTL Expiration | Time-based entry expiration with min-heap tracking |
| Thread-Safe | All operations are concurrent via DashMap |
| Zero-Allocation Lookup | Embeddings stored inline, not as pointers |
| Sparse-Aware | Automatic sparse storage for vectors with >50% zeros |

Key Types

Core Types

| Type | Description |
|---|---|
| `Cache` | Main API: multi-layer LLM response cache |
| `CacheConfig` | Configuration (capacity, TTL, eviction, metrics) |
| `CacheHit` | Successful cache lookup result |
| `CacheStats` | Thread-safe statistics with atomic counters |
| `StatsSnapshot` | Point-in-time snapshot for reporting |
| `CacheLayer` | Enum: `Exact`, `Semantic`, `Embedding` |
| `CacheError` | Error types for cache operations |

Configuration Types

| Type | Description |
|---|---|
| `EvictionStrategy` | `LRU`, `LFU`, `CostBased`, `Hybrid` |
| `EvictionManager` | Background eviction task controller |
| `EvictionScorer` | Calculates eviction priority scores |
| `EvictionHandle` | Handle for controlling background eviction |
| `EvictionConfig` | Interval, batch size, and strategy settings |

Token Counting

| Type | Description |
|---|---|
| `TokenCounter` | GPT-4-compatible token counting via tiktoken |
| `ModelPricing` | Predefined pricing for GPT-4, Claude 3, etc. |

Index Types (Internal)

| Type | Description |
|---|---|
| `CacheIndex` | HNSW wrapper with key-to-node mapping |
| `IndexSearchResult` | Semantic search result with similarity score |

Architecture Diagram

```
+--------------------------------------------------+
|                Cache (Public API)                |
|   - get(prompt, embedding) -> CacheHit           |
|   - put(prompt, embedding, response, ...)        |
|   - stats(), evict(), clear()                    |
+--------------------------------------------------+
     |              |               |
+--------+    +----------+    +-----------+
| Exact  |    | Semantic |    | Embedding |
| Cache  |    |  Cache   |    |   Cache   |
| O(1)   |    | O(log n) |    |   O(1)    |
+--------+    +----------+    +-----------+
     |              |               |
     +--------------+---------------+
                    |
          +------------------+
          |    CacheIndex    |
          |  (HNSW wrapper)  |
          +------------------+
                    |
          +------------------+
          |   tensor_store   |
          |     hnsw.rs      |
          +------------------+
```

Multi-Layer Cache Lookup Algorithm

The cache lookup algorithm is designed to maximize hit rates while minimizing latency. It follows a hierarchical approach, checking faster layers first before falling back to more expensive operations.

Lookup Flow Diagram

```mermaid
flowchart TD
    A[get prompt, embedding] --> B{Exact Cache Hit?}
    B -->|Yes| C[Return CacheHit layer=Exact]
    B -->|No| D[Record Exact Miss]
    D --> E{Embedding Provided?}
    E -->|No| F[Return None]
    E -->|Yes| G{Auto-Select Metric?}
    G -->|Yes| H{Sparsity >= Threshold?}
    G -->|No| I[Use Configured Metric]
    H -->|Yes| J[Use Jaccard]
    H -->|No| I
    J --> K[HNSW Search with Metric]
    I --> K
    K --> L{Results Above Threshold?}
    L -->|No| M[Record Semantic Miss]
    M --> F
    L -->|Yes| N{Entry Expired?}
    N -->|Yes| M
    N -->|No| O[Return CacheHit layer=Semantic]
```

Exact Cache Lookup (O(1))

The exact cache uses a hash-based key derived from the prompt text:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Key generation using DefaultHasher
fn exact_key(prompt: &str) -> String {
    let mut hasher = DefaultHasher::new();
    prompt.hash(&mut hasher);
    let hash = hasher.finish();
    format!("_cache:exact:{:016x}", hash)
}
```

The lookup sequence:

  1. Generate hash key from prompt
  2. Query TensorStore with key
  3. Check expiration timestamp
  4. Return hit or proceed to semantic lookup
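The sequence above can be sketched in standalone form, with a plain `HashMap` standing in for `TensorStore` and a `(response, expires_at_millis)` tuple as the stored entry (the entry layout here is a simplification of the storage format described later; `expires_at = 0` means "never expires"):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn exact_key(prompt: &str) -> String {
    let mut hasher = DefaultHasher::new();
    prompt.hash(&mut hasher);
    format!("_cache:exact:{:016x}", hasher.finish())
}

fn exact_lookup(
    store: &HashMap<String, (String, u64)>,
    prompt: &str,
    now_millis: u64,
) -> Option<String> {
    let key = exact_key(prompt);                   // 1. generate hash key
    let (response, expires_at) = store.get(&key)?; // 2. query the store
    if *expires_at != 0 && *expires_at <= now_millis {
        return None;                               // 3. expired -> treat as miss
    }
    Some(response.clone())                         // 4. hit
}

fn main() {
    let mut store = HashMap::new();
    store.insert(exact_key("What is 2+2?"), ("4".to_string(), 2_000));
    assert_eq!(exact_lookup(&store, "What is 2+2?", 1_000), Some("4".into()));
    assert_eq!(exact_lookup(&store, "What is 2+2?", 3_000), None); // expired
}
```

In the real cache, step 4's miss path falls through to the semantic layer rather than returning immediately.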

Semantic Cache Lookup (O(log n))

The semantic cache uses HNSW (Hierarchical Navigable Small World) graphs for approximate nearest neighbor search:

```mermaid
flowchart LR
    A[Query Vector] --> B[HNSW Entry Point]
    B --> C[Layer 2: Coarse Search]
    C --> D[Layer 1: Refined Search]
    D --> E[Layer 0: Fine Search]
    E --> F[Top-k Candidates]
    F --> G[Re-score with Metric]
    G --> H[Filter by Threshold]
    H --> I[Return Best Match]
```

Re-scoring Strategy: The HNSW index retrieves candidates using cosine similarity, then re-scores them with the requested metric. This allows using different metrics without rebuilding the index:

```rust
// Retrieve more candidates than needed for re-scoring
let ef = (k * 3).max(10);
let candidates = index.search(query, ef);

// Re-score with specified metric
let similarity = match &embedding {
    EmbeddingStorage::Dense(dense) => {
        let stored_sparse = SparseVector::from_dense(dense);
        let raw = metric.compute(&query_sparse, &stored_sparse);
        metric.to_similarity(raw)
    }
    EmbeddingStorage::Sparse(sparse) => {
        let raw = metric.compute(&query_sparse, sparse);
        metric.to_similarity(raw)
    }
    // ...handles Delta and TensorTrain storage types
};
```

Automatic Metric Selection

When auto_select_metric is enabled, the cache automatically selects the optimal distance metric based on embedding sparsity:

```rust
fn select_metric(&self, embedding: &[f32]) -> DistanceMetric {
    if !self.config.auto_select_metric {
        return self.config.distance_metric.clone();
    }

    let sparse = SparseVector::from_dense(embedding);
    if sparse.sparsity() >= self.config.sparsity_metric_threshold {
        DistanceMetric::Jaccard  // Better for sparse vectors
    } else {
        self.config.distance_metric.clone()  // Default (usually Cosine)
    }
}
```

Cache Layers

Exact Cache (O(1))

Hash-based lookup for identical queries. Keys are generated from prompt text using DefaultHasher. Stored with prefix _cache:exact:.

When to use: Repetitive queries with exact same prompts (e.g., FAQ systems, chatbots with canned responses).

Semantic Cache (O(log n))

HNSW-based similarity search for semantically similar queries. Uses configurable distance metrics (Cosine, Jaccard, Euclidean, Angular). Stored with prefix _cache:sem:.

When to use: Natural language queries with variations (e.g., “What’s the weather?” vs “How’s the weather today?”).

Embedding Cache (O(1))

Stores precomputed embeddings to avoid redundant embedding API calls. Keys combine source and content hash. Stored with prefix _cache:emb:.

When to use: When embedding computation is expensive and the same content is embedded multiple times.

Storage Format

Cache entries are stored as TensorData with standardized fields:

| Field | Type | Description |
|---|---|---|
| `_response` | String | Cached response text |
| `_embedding` | Vector/Sparse | Embedding (semantic/embedding layers) |
| `_embedding_dim` | Int | Embedding dimension |
| `_input_tokens` | Int | Input token count |
| `_output_tokens` | Int | Output token count |
| `_model` | String | Model identifier |
| `_layer` | String | Cache layer (exact/semantic/embedding) |
| `_created_at` | Int | Creation timestamp (millis) |
| `_expires_at` | Int | Expiration timestamp (millis) |
| `_access_count` | Int | Access count for LFU |
| `_last_access` | Int | Last access timestamp for LRU |
| `_version` | String | Optional version tag |
| `_source` | String | Embedding source identifier |
| `_content_hash` | Int | Content hash for deduplication |

Sparse Storage Optimization

Embeddings with high sparsity (>50% zeros) are automatically stored in sparse format to reduce memory usage:

```rust
fn should_use_sparse(vector: &[f32]) -> bool {
    if vector.is_empty() {
        return false;
    }
    let nnz = vector.iter().filter(|&&v| v.abs() > 1e-6).count();
    // Use sparse if non-zero count <= half of total length
    nnz * 2 <= vector.len()
}
```

Distance Metrics

Configurable distance metrics for semantic similarity:

| Metric | Best For | Range | Formula |
|---|---|---|---|
| Cosine | Dense embeddings (default) | -1 to 1 | dot(a,b) / (‖a‖ · ‖b‖) |
| Angular | Linear angle relationships | 0 to π | acos(cosine_sim) |
| Jaccard | Sparse/binary embeddings | 0 to 1 | ‖A ∩ B‖ / ‖A ∪ B‖ |
| Euclidean | Absolute distances | 0 to ∞ | sqrt(sum((a-b)^2)) |
| WeightedJaccard | Sparse with magnitudes | 0 to 1 | Weighted set similarity |

Auto-selection: When auto_select_metric is true, the cache automatically selects Jaccard for sparse embeddings (sparsity >= threshold, default 70%) and the configured metric otherwise.

Eviction Strategies

Strategy Comparison

| Strategy | Description | Score Formula | Best For |
|---|---|---|---|
| LRU | Evicts entries that haven't been accessed recently | -last_access_secs | General purpose |
| LFU | Evicts entries with lowest access count | access_count | Stable workloads |
| CostBased | Evicts entries with lowest cost savings per byte | cost_per_hit / size_bytes | Cost optimization |
| Hybrid | Combines all strategies with configurable weights | Weighted combination | Production systems |

Hybrid Eviction Score Algorithm

The Hybrid strategy combines recency, frequency, and cost factors:

```rust
pub fn score(
    &self,
    last_access_secs: f64,
    access_count: u64,
    cost_per_hit: f64,
    size_bytes: usize,
) -> f64 {
    match self.strategy {
        EvictionStrategy::LRU => -last_access_secs,
        EvictionStrategy::LFU => access_count as f64,
        EvictionStrategy::CostBased => {
            if size_bytes == 0 { 0.0 }
            else { cost_per_hit / size_bytes as f64 }
        }
        EvictionStrategy::Hybrid { lru_weight, lfu_weight, cost_weight } => {
            let total = f64::from(lru_weight) + f64::from(lfu_weight) + f64::from(cost_weight);
            let recency_w = f64::from(lru_weight) / total;
            let frequency_w = f64::from(lfu_weight) / total;
            let cost_w = f64::from(cost_weight) / total;

            let age_minutes = last_access_secs / 60.0;
            let recency_score = 1.0 / (1.0 + age_minutes);    // Decays with age
            let frequency_score = (1.0 + access_count as f64).log2();  // Log scale
            let cost_score = cost_per_hit;

            recency_score * recency_w + frequency_score * frequency_w + cost_score * cost_w
        }
    }
}
```

Lower scores are evicted first. The hybrid formula:

  • recency_score: Decays as 1/(1 + age_in_minutes) - newer entries score higher
  • frequency_score: Grows logarithmically with access count - frequently accessed entries score higher
  • cost_score: Direct cost per hit - higher cost savings score higher
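A standalone worked example of the hybrid formula with the default 40/30/30 weights (this re-derives the formula above rather than calling the crate; the two entries are hypothetical):

```rust
// Hybrid score with weights already normalized to 0.4 / 0.3 / 0.3.
fn hybrid_score(last_access_secs: f64, access_count: u64, cost_per_hit: f64) -> f64 {
    let (recency_w, frequency_w, cost_w) = (0.4, 0.3, 0.3);
    let recency_score = 1.0 / (1.0 + last_access_secs / 60.0); // decays with age
    let frequency_score = (1.0 + access_count as f64).log2();  // log scale
    recency_score * recency_w + frequency_score * frequency_w + cost_per_hit * cost_w
}

fn main() {
    // A: touched a minute ago, but hit only once.
    let a = hybrid_score(60.0, 1, 0.01);
    // B: idle for an hour, but hit 63 times.
    let b = hybrid_score(3600.0, 63, 0.01);
    // Lower score is evicted first: the hot-but-old entry survives.
    assert!(a < b);
    println!("a = {a:.3}, b = {b:.3}"); // a ≈ 0.503, b ≈ 1.810
}
```

The log-scaled frequency term dominates here, which is the intended behavior: frequently reused entries outlive recently created but rarely hit ones.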

Background Eviction Flow

```mermaid
flowchart TD
    A[EvictionManager::start] --> B[Spawn Tokio Task]
    B --> C[Initialize Interval Timer]
    C --> D{Select Event}
    D -->|Timer Tick| E[Call evict_fn batch_size]
    D -->|Shutdown Signal| F[Set running=false]
    E --> G{Evicted > 0?}
    G -->|Yes| H[Record Eviction Stats]
    G -->|No| D
    H --> D
    F --> I[Break Loop]
```

```rust
// Starting background eviction
let handle = manager.start(move |batch_size| {
    cache.evict(batch_size)
});

// Later: graceful shutdown
handle.shutdown().await;
```

Configuration

Default Configuration

```rust
CacheConfig {
    exact_capacity: 10_000,
    semantic_capacity: 5_000,
    embedding_capacity: 50_000,
    default_ttl: Duration::from_secs(3600),
    max_ttl: Duration::from_secs(86400),
    semantic_threshold: 0.92,
    embedding_dim: 1536,
    eviction_strategy: EvictionStrategy::Hybrid {
        lru_weight: 40,
        lfu_weight: 30,
        cost_weight: 30
    },
    eviction_interval: Duration::from_secs(60),
    eviction_batch_size: 100,
    input_cost_per_1k: 0.0015,
    output_cost_per_1k: 0.002,
    inline_threshold: 4096,
    distance_metric: DistanceMetric::Cosine,
    auto_select_metric: true,
    sparsity_metric_threshold: 0.7,
}
```

Configuration Presets

| Preset | Use Case | Exact Capacity | Semantic Capacity | Embedding Capacity | Eviction Batch |
|---|---|---|---|---|---|
| default() | General purpose | 10,000 | 5,000 | 50,000 | 100 |
| high_throughput() | High-traffic server | 50,000 | 20,000 | 100,000 | 500 |
| low_memory() | Memory-constrained | 1,000 | 500 | 5,000 | 50 |
| development() | Dev/testing | 100 | 50 | 200 | 10 |
| sparse_embeddings() | Sparse vectors | 10,000 | 5,000 | 50,000 | 100 |

Configuration Validation

The config validates on cache creation:

```rust
pub fn validate(&self) -> Result<(), String> {
    if self.semantic_threshold < 0.0 || self.semantic_threshold > 1.0 {
        return Err("semantic_threshold must be between 0.0 and 1.0".to_string());
    }
    if self.embedding_dim == 0 {
        return Err("embedding_dim must be greater than 0".to_string());
    }
    if self.eviction_batch_size == 0 {
        return Err("eviction_batch_size must be greater than 0".to_string());
    }
    if self.default_ttl > self.max_ttl {
        return Err("default_ttl cannot exceed max_ttl".to_string());
    }
    if self.sparsity_metric_threshold < 0.0 || self.sparsity_metric_threshold > 1.0 {
        return Err("sparsity_metric_threshold must be between 0.0 and 1.0".to_string());
    }
    Ok(())
}
```

Usage Examples

Basic Usage

```rust
use tensor_cache::{Cache, CacheConfig};

let mut config = CacheConfig::default();
config.embedding_dim = 3;
let cache = Cache::with_config(config).unwrap();

// Store a response
let embedding = vec![0.1, 0.2, 0.3];
cache.put("What is 2+2?", &embedding, "4", "gpt-4", None).unwrap();

// Look up (tries exact first, then semantic)
if let Some(hit) = cache.get("What is 2+2?", Some(&embedding)) {
    println!("Cached: {}", hit.response);
}
```

Explicit Metric Queries

```rust
use tensor_cache::DistanceMetric;

let hit = cache.get_with_metric(
    "query",
    Some(&embedding),
    Some(&DistanceMetric::Euclidean),
);

if let Some(hit) = hit {
    println!("Metric used: {:?}", hit.metric_used);
}
```

Embedding Cache with Compute Fallback

```rust
// Get cached embedding or compute on miss
let embedding = cache.get_or_compute_embedding(
    "openai",                  // source
    "Hello, world!",           // content
    "text-embedding-3-small",  // model
    || {
        // Compute function called only on cache miss
        Ok(compute_embedding("Hello, world!"))
    }
)?;
```

Token Counting and Cost Estimation

```rust
use tensor_cache::{TokenCounter, ModelPricing};

// Count tokens in text
let tokens = TokenCounter::count("Hello, world!");

// Count tokens in chat messages (includes overhead)
let messages = vec![("user", "Hello"), ("assistant", "Hi there!")];
let total = TokenCounter::count_messages(&messages);

// Estimate cost with custom rates
let cost = TokenCounter::estimate_cost(1000, 500, 0.01, 0.03);

// Use predefined model pricing
let pricing = ModelPricing::GPT4O;
let cost = pricing.estimate(1000, 500);

// Look up pricing by model name
if let Some(pricing) = ModelPricing::for_model("gpt-4o-mini") {
    println!("Cost: ${:.4}", pricing.estimate(1000, 500));
}
```

Statistics and Monitoring

```rust
let stats = cache.stats_snapshot();

// Hit rates by layer
println!("Exact hit rate: {:.2}%", stats.hit_rate(CacheLayer::Exact) * 100.0);
println!("Semantic hit rate: {:.2}%", stats.hit_rate(CacheLayer::Semantic) * 100.0);

// Tokens and cost saved
println!("Input tokens saved: {}", stats.tokens_saved_in);
println!("Output tokens saved: {}", stats.tokens_saved_out);
println!("Cost saved: ${:.2}", stats.cost_saved_dollars);

// Cache utilization
println!("Total entries: {}", stats.total_entries());
println!("Evictions: {}", stats.evictions);
println!("Expirations: {}", stats.expirations);
println!("Uptime: {} seconds", stats.uptime_secs);
```

Shared TensorStore Integration

```rust
use tensor_store::TensorStore;
use tensor_cache::{Cache, CacheConfig};

// Share store with other engines
let store = TensorStore::new();
let cache = Cache::with_store(store.clone(), CacheConfig::default())?;

// Other engines can use the same store
let vault = Vault::with_store(store.clone(), VaultConfig::default())?;
```

Token Counting Implementation

The TokenCounter uses tiktoken’s cl100k_base encoding, which is compatible with GPT-4, GPT-3.5-turbo, and text-embedding-ada-002.

Lazy Encoder Initialization

```rust
use std::sync::OnceLock;

use tiktoken_rs::CoreBPE;

static CL100K_ENCODER: OnceLock<Option<CoreBPE>> = OnceLock::new();

impl TokenCounter {
    fn encoder() -> Option<&'static CoreBPE> {
        CL100K_ENCODER
            .get_or_init(|| tiktoken_rs::cl100k_base().ok())
            .as_ref()
    }
}
```

Fallback Estimation

If tiktoken is unavailable, the counter falls back to character-based estimation (~4 characters per token for English text):

```rust
const fn estimate_tokens(text: &str) -> usize {
    text.len().div_ceil(4)
}
```

Message Token Counting

Chat messages include overhead tokens per message (role markers, separators):

```rust
pub fn count_message(role: &str, content: &str) -> usize {
    Self::encoder().map_or_else(
        || Self::estimate_tokens(role) + Self::estimate_tokens(content) + 4,
        |enc| {
            let role_tokens = enc.encode_ordinary(role).len();
            let content_tokens = enc.encode_ordinary(content).len();
            role_tokens + content_tokens + 4  // 4 tokens overhead per message
        },
    )
}

pub fn count_messages(messages: &[(&str, &str)]) -> usize {
    let mut total = 0;
    for (role, content) in messages {
        total += Self::count_message(role, content);
    }
    total + 3  // 3 tokens for assistant reply priming
}
```

Cost Calculation Formulas

```rust
// Basic cost calculation
pub fn estimate_cost(
    input_tokens: usize,
    output_tokens: usize,
    input_rate: f64,   // $/1000 tokens
    output_rate: f64,  // $/1000 tokens
) -> f64 {
    (input_tokens as f64 / 1000.0) * input_rate +
    (output_tokens as f64 / 1000.0) * output_rate
}

// For atomic operations (avoids floating point accumulation errors)
pub fn estimate_cost_microdollars(...) -> u64 {
    let dollars = Self::estimate_cost(...);
    (dollars * 1_000_000.0) as u64
}
```

Model Pricing

| Model | Input/1K | Output/1K | Notes |
|---|---|---|---|
| GPT-4o | $0.005 | $0.015 | Best for complex tasks |
| GPT-4o mini | $0.00015 | $0.0006 | Cost-effective |
| GPT-4 Turbo | $0.01 | $0.03 | High capability |
| GPT-3.5 Turbo | $0.0005 | $0.0015 | Budget option |
| Claude 3 Opus | $0.015 | $0.075 | Highest quality |
| Claude 3 Sonnet | $0.003 | $0.015 | Balanced |
| Claude 3 Haiku | $0.00025 | $0.00125 | Fast and cheap |
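A quick sanity check of the per-1K pricing formula using the GPT-4o row above (the function mirrors `estimate_cost` in standalone form rather than calling the crate):

```rust
// Per-1K-token pricing, as in the cost calculation formulas above.
fn estimate_cost(input_tokens: usize, output_tokens: usize, input_rate: f64, output_rate: f64) -> f64 {
    (input_tokens as f64 / 1000.0) * input_rate
        + (output_tokens as f64 / 1000.0) * output_rate
}

fn main() {
    // 1,000 input + 500 output tokens at GPT-4o rates ($0.005 / $0.015 per 1K)
    let cost = estimate_cost(1000, 500, 0.005, 0.015);
    assert!((cost - 0.0125).abs() < 1e-12);
    println!("${cost:.4}"); // $0.0125
}
```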

Model Name Matching

```rust
pub fn for_model(model: &str) -> Option<Self> {
    let model_lower = model.to_lowercase();
    if model_lower.contains("gpt-4o-mini") {
        Some(Self::GPT4O_MINI)
    } else if model_lower.contains("gpt-4o") {
        Some(Self::GPT4O)
    } else if model_lower.contains("gpt-4-turbo") {
        Some(Self::GPT4_TURBO)
    } else if model_lower.contains("gpt-3.5") {
        Some(Self::GPT35_TURBO)
    } else if model_lower.contains("claude-3-opus") || model_lower.contains("claude-opus") {
        Some(Self::CLAUDE3_OPUS)
    } else if model_lower.contains("claude-3-sonnet") || model_lower.contains("claude-sonnet") {
        Some(Self::CLAUDE3_SONNET)
    } else if model_lower.contains("claude-3-haiku") || model_lower.contains("claude-haiku") {
        Some(Self::CLAUDE3_HAIKU)
    } else {
        None
    }
}
```

Semantic Search Index Internals

CacheIndex Structure

```rust
pub struct CacheIndex {
    index: RwLock<HNSWIndex>,            // HNSW graph
    config: HNSWConfig,                  // For recreation on clear
    key_to_node: DashMap<String, usize>, // Cache key -> HNSW node
    node_to_key: DashMap<usize, String>, // HNSW node -> Cache key
    dimension: usize,                    // Expected embedding dimension
    entry_count: AtomicUsize,            // Entry count
    distance_metric: DistanceMetric,     // Default metric
}
```

Insert Strategies

```rust
// Dense embedding insert
pub fn insert(&self, key: &str, embedding: &[f32]) -> Result<usize>;

// Sparse embedding insert (memory efficient)
pub fn insert_sparse(&self, key: &str, embedding: &SparseVector) -> Result<usize>;

// Auto-select based on sparsity threshold
pub fn insert_auto(
    &self,
    key: &str,
    embedding: &[f32],
    sparsity_threshold: f32,
) -> Result<usize>;
```

Key Orphaning on Re-insert

When a key is re-inserted, the old HNSW node is orphaned (not deleted) because HNSW doesn’t support efficient deletion:

#![allow(unused)]
fn main() {
let is_new = !self.key_to_node.contains_key(key);
if !is_new {
    // Remove mapping but leave HNSW node (will be ignored in search)
    self.key_to_node.remove(key);
}
}

Memory Statistics

```rust
pub fn memory_stats(&self) -> Option<HNSWMemoryStats> {
    self.index.read().ok().map(|index| index.memory_stats())
}
// Returns: dense_count, sparse_count, delta_count, embedding_bytes, etc.
```

Error Types

| Error | Description | Recovery |
|---|---|---|
| `NotFound` | Cache entry not found | Check key exists |
| `DimensionMismatch` | Embedding dimension does not match config | Verify embedding size |
| `StorageError` | Underlying tensor store error | Check store health |
| `SerializationError` | Serialization/deserialization failed | Verify data format |
| `TokenizerError` | Token counting failed | Falls back to estimation |
| `CacheFull` | Cache capacity exceeded | Run eviction or increase capacity |
| `InvalidConfig` | Invalid configuration provided | Fix config values |
| `Cancelled` | Operation was cancelled | Retry operation |
| `LockPoisoned` | Internal lock was poisoned | Restart cache |

Error Conversion

```rust
impl From<tensor_store::TensorStoreError> for CacheError {
    fn from(e: TensorStoreError) -> Self {
        Self::StorageError(e.to_string())
    }
}

impl From<bitcode::Error> for CacheError {
    fn from(e: bitcode::Error) -> Self {
        Self::SerializationError(e.to_string())
    }
}
```

Performance

Benchmarks (10,000 entries, 128-dim embeddings)

| Operation | Time | Notes |
|---|---|---|
| Exact lookup (hit) | ~50 ns | Hash lookup + TensorStore get |
| Exact lookup (miss) | ~30 ns | Hash lookup only |
| Semantic lookup | ~5 µs | HNSW search + re-scoring |
| Put (exact + semantic) | ~10 µs | Two stores + HNSW insert |
| Eviction (100 entries) | ~200 µs | Batch deletion |
| Clear (full index) | ~1 ms | HNSW recreation |

Distance Metric Performance (128-dim, 1000 entries)

| Metric | Search Time | Notes |
|---|---|---|
| Cosine | 21 µs | Default, best for dense |
| Jaccard | 18 µs | Best for sparse |
| Angular | 23 µs | +acos overhead |
| Euclidean | 19 µs | Absolute distance |

Auto-Selection Overhead

| Operation | Time |
|---|---|
| Sparsity check | ~50 ns |
| Metric selection | ~10 ns |

Memory Efficiency

| Storage Type | Memory per Entry | Best For |
|---|---|---|
| Dense Vector | 4 × dim bytes | Low sparsity (<50% zeros) |
| Sparse Vector | 8 × nnz bytes | High sparsity (>50% zeros) |
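A back-of-envelope check of the table above for a 1536-dim embedding with roughly 80% zeros, using the 4 × dim (dense) vs 8 × nnz (sparse) figures (the vector shape here is hypothetical):

```rust
// Dense: 4 bytes (f32) per component; sparse: ~8 bytes (index + value) per non-zero.
fn dense_bytes(dim: usize) -> usize { 4 * dim }
fn sparse_bytes(nnz: usize) -> usize { 8 * nnz }

fn main() {
    let dim = 1536;
    let nnz = 307; // ~20% of 1536 components are non-zero
    println!("dense: {} B, sparse: {} B", dense_bytes(dim), sparse_bytes(nnz));
    // dense: 6144 B, sparse: 2456 B
    assert!(sparse_bytes(nnz) < dense_bytes(dim) / 2);
}
```

At 50% zeros (nnz × 2 = dim) the two layouts break even, which is exactly the `should_use_sparse` threshold.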

Edge Cases and Gotchas

TTL Behavior

  • Entries with expires_at = 0 never expire
  • Expired entries return None on lookup but remain in storage until cleanup
  • cleanup_expired() must be called explicitly or via background eviction

Capacity Limits

  • put() fails with CacheFull when capacity is reached
  • Capacity is checked per-layer (exact, semantic, embedding)
  • No automatic eviction on put - must be explicit

Hash Collisions

  • Extremely unlikely with 64-bit hashes (~1 in 18 quintillion)
  • If collision occurs, exact cache will return wrong response
  • Semantic cache provides fallback for semantically different queries

Metric Re-scoring

  • HNSW always uses cosine similarity for graph navigation
  • Re-scoring with different metrics may change result order
  • Retrieves 3x candidates to account for re-ranking

Sparse Storage Threshold

  • Uses sparse format when nnz * 2 <= len (50% zeros)
  • Different from auto-metric selection threshold (default 70%)
  • Both thresholds are configurable

Performance Tips and Best Practices

Configuration Tuning

  1. Semantic Threshold: Start with 0.92, lower to 0.85 for fuzzy matching
  2. Eviction Weights: Increase cost_weight if API costs matter most
  3. Batch Size: Larger batches (500+) for high-throughput systems
  4. TTL: Match to your content freshness requirements
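The tuning tips above can be sketched against the documented config fields; the specific values here are illustrative, not recommendations:

```rust
use std::time::Duration;
use tensor_cache::{Cache, CacheConfig, EvictionStrategy};

let mut config = CacheConfig::high_throughput();
config.semantic_threshold = 0.85;                  // fuzzier semantic matching
config.eviction_strategy = EvictionStrategy::Hybrid {
    lru_weight: 20,
    lfu_weight: 20,
    cost_weight: 60,                               // prioritize API-cost savings
};
config.eviction_batch_size = 500;                  // larger high-throughput batches
config.default_ttl = Duration::from_secs(15 * 60); // 15-minute freshness window
let cache = Cache::with_config(config)?;
```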

Memory Optimization

  1. Use sparse_embeddings() preset for sparse data
  2. Set inline_threshold based on typical response sizes
  3. Enable auto_select_metric for mixed workloads
  4. Monitor memory_stats() to track sparse vs dense ratio

Hit Rate Optimization

  1. Normalize prompts before caching (lowercase, trim whitespace)
  2. Use versioning for model/prompt template changes
  3. Set appropriate semantic threshold for your domain
  4. Consider domain-specific embeddings
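One way to implement tip 1 is a small normalizer applied before every `get`/`put` (a hypothetical helper, not part of the tensor_cache API): lowercase, trim, and collapse runs of whitespace so trivially different prompts hash to the same exact-cache key.

```rust
// Hypothetical prompt normalizer: lowercase, trim, collapse whitespace.
fn normalize_prompt(prompt: &str) -> String {
    prompt
        .to_lowercase()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let a = normalize_prompt("  What's the  Weather? ");
    let b = normalize_prompt("what's the weather?");
    // Both variants now produce the same exact-cache key input.
    assert_eq!(a, b);
    println!("{a}");
}
```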

Cost Tracking

  1. Use estimate_cost_microdollars() for atomic accumulation
  2. Record cost per cache hit for ROI analysis
  3. Compare tokens_saved against capacity costs

Shell Commands

```
CACHE INIT     Initialize semantic cache
CACHE STATS    Show cache statistics
CACHE CLEAR    Clear all cache entries
```

API Reference

Cache Methods

| Method | Description |
|---|---|
| `new()` | Create with default config |
| `with_config(config)` | Create with custom config |
| `with_store(store, config)` | Create with shared TensorStore |
| `get(prompt, embedding)` | Look up cached response |
| `get_with_metric(prompt, embedding, metric)` | Look up with explicit metric |
| `put(prompt, embedding, response, model, ttl)` | Store response |
| `get_embedding(source, content)` | Get cached embedding |
| `put_embedding(source, content, embedding, model)` | Store embedding |
| `get_or_compute_embedding(source, content, model, compute)` | Get or compute embedding |
| `get_simple(key)` | Simple key-value lookup |
| `put_simple(key, value)` | Simple key-value store |
| `invalidate(prompt)` | Remove exact entry |
| `invalidate_version(version)` | Remove entries by version |
| `invalidate_embeddings(source)` | Remove embeddings by source |
| `evict(count)` | Manually evict entries |
| `cleanup_expired()` | Remove expired entries |
| `clear()` | Clear all entries |
| `stats()` | Get statistics reference |
| `stats_snapshot()` | Get statistics snapshot |
| `config()` | Get configuration reference |
| `len()` | Total cached entries |
| `is_empty()` | Check if cache is empty |

CacheHit Fields

| Field | Type | Description |
|---|---|---|
| `response` | `String` | Cached response text |
| `layer` | `CacheLayer` | Which layer matched |
| `similarity` | `Option<f32>` | Similarity score (semantic only) |
| `input_tokens` | `usize` | Input tokens saved |
| `output_tokens` | `usize` | Output tokens saved |
| `cost_saved` | `f64` | Estimated cost saved (dollars) |
| `metric_used` | `Option<DistanceMetric>` | Metric used (semantic only) |

StatsSnapshot Fields

| Field | Type | Description |
|---|---|---|
| `exact_hits` | `u64` | Exact cache hits |
| `exact_misses` | `u64` | Exact cache misses |
| `semantic_hits` | `u64` | Semantic cache hits |
| `semantic_misses` | `u64` | Semantic cache misses |
| `embedding_hits` | `u64` | Embedding cache hits |
| `embedding_misses` | `u64` | Embedding cache misses |
| `tokens_saved_in` | `u64` | Total input tokens saved |
| `tokens_saved_out` | `u64` | Total output tokens saved |
| `cost_saved_dollars` | `f64` | Total cost saved |
| `evictions` | `u64` | Total evictions |
| `expirations` | `u64` | Total expirations |
| `exact_size` | `usize` | Current exact cache size |
| `semantic_size` | `usize` | Current semantic cache size |
| `embedding_size` | `usize` | Current embedding cache size |
| `uptime_secs` | `u64` | Cache uptime in seconds |

Dependencies

| Crate | Purpose |
|---|---|
| tensor_store | HNSW index implementation, TensorStore |
| tiktoken-rs | GPT-compatible token counting |
| dashmap | Concurrent hash maps |
| tokio | Async runtime for background eviction |
| uuid | Unique ID generation |
| thiserror | Error type derivation |
| serde | Configuration serialization |
| bincode | Binary serialization |

Related Modules

  • tensor_store - Backing storage and HNSW index
  • query_router - Cache integration for query execution
  • neumann_shell - CLI commands for cache management