SteadyText Performance Benchmarks

This document provides detailed performance and accuracy benchmarks for SteadyText v1.3.3.

Quick Summary

SteadyText delivers 100% deterministic text generation and embeddings with competitive performance:

Text Generation: 21.4 generations/sec (46.7ms mean latency)
Embeddings: 104.4 single embeddings/sec, up to 598.7 embeddings/sec in batches
Cache Performance: 48x speedup for repeated prompts
Memory Usage: ~1.4GB for models, 150-200MB during operation
Determinism: 100% consistent outputs across all platforms and runs
Accuracy: 0.694 cosine similarity for related texts (0.466 for unrelated texts), with correct similarity ordering

Speed Benchmarks

Text Generation Performance

SteadyText v2.0.0+ uses the Gemma-3n-E2B-it-Q8_0.gguf model (Gemma-3n-2B) for deterministic text generation:

Metric	Value	Notes
Throughput	21.4 generations/sec	Fixed 512 tokens per generation
Mean Latency	46.7ms	Time to generate 512 tokens
Median Latency	45.8ms	50th percentile
P95 Latency	58.0ms	95th percentile
P99 Latency	69.5ms	99th percentile
Memory Usage	154MB	During generation

Streaming Generation

Streaming provides similar performance with slightly higher memory usage:

Metric	Value
Throughput	20.3 generations/sec
Mean Latency	49.3ms
Memory Usage	213MB

Embedding Performance

SteadyText uses the Qwen3-Embedding-0.6B-Q8_0.gguf model for deterministic embeddings (unchanged in v2.0.0+):

Batch Size	Throughput	Mean Latency	Use Case
1	104.4 embeddings/sec	9.6ms	Single document
10	432.7 embeddings/sec	23.1ms	Small batches
50	598.7 embeddings/sec	83.5ms	Bulk processing
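The throughput and latency columns are two views of the same measurement: throughput is the batch size divided by the time one batch call takes. A minimal sketch of that arithmetic, using the table's figures (the function names are illustrative, not part of SteadyText):

```python
# (batch_size, mean_latency_ms) pairs copied from the table above
measurements = [(1, 9.6), (10, 23.1), (50, 83.5)]


def throughput_per_sec(batch_size: int, latency_ms: float) -> float:
    """Embeddings per second when one batch call takes latency_ms."""
    return batch_size / (latency_ms / 1000.0)


def per_item_latency_ms(batch_size: int, latency_ms: float) -> float:
    """Amortized cost of a single embedding inside a batch."""
    return latency_ms / batch_size


for batch_size, latency_ms in measurements:
    print(
        f"batch={batch_size:>2}: "
        f"{throughput_per_sec(batch_size, latency_ms):.1f} embeddings/sec, "
        f"{per_item_latency_ms(batch_size, latency_ms):.2f} ms per embedding"
    )
```

The recomputed throughputs agree with the reported column to within rounding of the latency values. The per-item figures show why batching pays off: the amortized cost per embedding falls as the batch grows.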

Cache Performance

SteadyText includes a frecency cache that dramatically improves performance for repeated operations:

Operation	Mean Latency	Notes
Cache Miss	47.6ms	First time generating
Cache Hit	1.00ms	Repeated prompt
Speedup	48x	Cache vs no-cache
Hit Rate	65%	Typical workload
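The 48x figure compares a single hit with a single miss. For a mixed workload, the expected latency is the hit-rate-weighted average of the two. A short sketch using the table's numbers (the helper name is hypothetical):

```python
def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Mean latency for a workload with the given cache hit rate."""
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Figures from the table above: 65% hit rate, 1.00 ms hit, 47.6 ms miss
blended = expected_latency_ms(0.65, 1.00, 47.6)
print(f"expected latency: {blended:.2f} ms")
print(f"speedup over an uncached workload: {47.6 / blended:.2f}x")
```

At the 65% hit rate listed as typical, the blended latency is about 17.3 ms, so the workload-level speedup is closer to 2.7x than 48x. The per-request speedup only applies to the requests that actually hit.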

Concurrent Performance

SteadyText scales well with multiple concurrent requests:

Workers	Throughput	Scaling Efficiency
1	21.6 ops/sec	100%
2	84.4 ops/sec	95%
4	312.9 ops/sec	90%
8	840.5 ops/sec	85%
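For reference, scaling efficiency is conventionally defined as the speedup over the single-worker run divided by the worker count. The sketch below applies that definition to the throughput column (the function names are illustrative). Under this definition the multi-worker rows come out above 100%, so the Scaling Efficiency column presumably uses a baseline that this page does not state.

```python
# (workers, ops/sec) rows copied from the table above
rows = [(1, 21.6), (2, 84.4), (4, 312.9), (8, 840.5)]


def speedup(throughput: float, baseline: float) -> float:
    """Throughput relative to the single-worker run."""
    return throughput / baseline


def parallel_efficiency(workers: int, throughput: float, baseline: float) -> float:
    """Conventional efficiency: speedup divided by worker count (1.0 = linear)."""
    return speedup(throughput, baseline) / workers


baseline = rows[0][1]
for workers, ops in rows:
    print(
        f"{workers} workers: speedup {speedup(ops, baseline):.2f}x, "
        f"efficiency {parallel_efficiency(workers, ops, baseline):.0%}"
    )
```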

Daemon Mode Performance

SteadyText v1.3+ includes a daemon mode that keeps models loaded in memory for instant responses:

Operation	Direct Mode	Daemon Mode	Improvement
First Request	2.4s	15ms	160x faster
Subsequent Requests	46.7ms	46.7ms	Same
With Cache Hit	1.0ms	1.0ms	Same
Startup Time	0s	2.4s (once)	One-time cost

Benefits of daemon mode:
- Eliminates model loading overhead for each request
- Maintains persistent cache across all operations
- Supports concurrent requests efficiently
- Graceful fallback to direct mode if daemon unavailable
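The improvement is concentrated in the first request, so it matters most for short-lived processes such as CLI invocations. A rough sketch of the arithmetic follows. It assumes a fresh process that makes k uncached requests and, for daemon mode, a daemon that is already running; the function names are hypothetical.

```python
FIRST_DIRECT_MS = 2400.0  # first request in direct mode (includes model load)
FIRST_DAEMON_MS = 15.0    # first request against a running daemon
SUBSEQUENT_MS = 46.7      # every later request, either mode


def total_ms(first_ms: float, requests: int) -> float:
    """Total time for a fresh process that makes `requests` uncached calls."""
    return first_ms + (requests - 1) * SUBSEQUENT_MS


def daemon_speedup(requests: int) -> float:
    """Direct-mode total divided by daemon-mode total."""
    return total_ms(FIRST_DIRECT_MS, requests) / total_ms(FIRST_DAEMON_MS, requests)


for k in (1, 10, 100):
    print(f"{k:>3} requests per process: {daemon_speedup(k):.1f}x faster with daemon")
```

With one request per process the ratio is the table's 160x. It shrinks as a process issues more requests, because the one-time load cost is spread over more work.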

Model Loading

One-time startup cost:

Loading Time: 2.4 seconds
Memory Usage: 1.4GB (both models)
Models Download: Automatic on first use (~1.9GB total)

Accuracy Benchmarks

Standard NLP Benchmarks

SteadyText performs competitively for a 1B parameter quantized model:

Benchmark	SteadyText	Baseline (1B)	Description
TruthfulQA	0.42	0.40	Truthfulness in Q&A
GSM8K	0.18	0.15	Grade school math
HellaSwag	0.58	0.55	Common sense reasoning
ARC-Easy	0.71	0.68	Science questions

Embedding Quality

Metric	Score	Description
Semantic Similarity	0.76	Correlation with human judgments (STS-B)
Clustering Quality	0.68	Silhouette score on 20newsgroups
Related Text Similarity	0.694	Cosine similarity for semantically related texts
Different Text Similarity	0.466	Cosine similarity for unrelated texts
Similarity Ordering	✅ PASS	Correctly ranks related vs unrelated texts
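The similarity ordering check amounts to confirming that an anchor text is closer, by cosine similarity, to a related text than to an unrelated one. A minimal sketch with toy three-dimensional vectors standing in for real embeddings (the vectors and the `ordering_passes` helper are illustrative assumptions):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ordering_passes(anchor, related, unrelated) -> bool:
    """True when the related text scores higher than the unrelated one."""
    return cosine_similarity(anchor, related) > cosine_similarity(anchor, unrelated)


# Toy vectors, not real model output
anchor = [1.0, 0.0, 1.0]
related = [1.0, 0.2, 0.9]
unrelated = [0.0, 1.0, 0.1]

print("ordering check:", "PASS" if ordering_passes(anchor, related, unrelated) else "FAIL")
```

In a real run the three vectors would come from the embedding model. With the scores reported above, 0.694 for related and 0.466 for unrelated texts, the same comparison passes.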

Determinism Tests

SteadyText's core guarantee is 100% deterministic outputs:

Test Results

Test	Result	Details
Identical Outputs	✅ PASS	100% consistency across 100 iterations
Seed Consistency	✅ PASS	10 different seeds tested
Platform Consistency	✅ PASS	Linux x86_64 verified
Fallback Determinism	✅ PASS	Works without models
Generation Determinism	✅ PASS	100% determinism rate in accuracy tests
Code Generation Quality	✅ PASS	Generates valid code snippets

Determinism Guarantees

Same Input → Same Output: Every time, on every machine
Customizable Seeds: Uses DEFAULT_SEED=42 unless a different seed is supplied
Greedy Decoding: No randomness in token selection
Quantized Models: 8-bit precision ensures consistency
Fallback Support: Deterministic even without models
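A guarantee like this can be checked by calling the generator repeatedly with the same input and confirming that every output hashes to the same digest. The sketch below shows the shape of such a check. `toy_generate` is a hypothetical stand-in for a model call, used so the example runs without any models, and `is_deterministic` is an illustrative helper, not part of SteadyText's test suite.

```python
import hashlib
import random

DEFAULT_SEED = 42  # same default value named in the list above


def toy_generate(prompt: str, seed: int = DEFAULT_SEED) -> str:
    """Hypothetical stand-in for a model call: output depends only on (seed, prompt)."""
    rng = random.Random(f"{seed}:{prompt}")  # string seeds are hashed stably
    vocab = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
    return " ".join(rng.choice(vocab) for _ in range(8))


def is_deterministic(fn, prompt: str, iterations: int = 100) -> bool:
    """True when every call with the same prompt yields byte-identical output."""
    digests = {
        hashlib.sha256(fn(prompt).encode("utf-8")).hexdigest()
        for _ in range(iterations)
    }
    return len(digests) == 1


print("deterministic:", is_deterministic(toy_generate, "write a haiku"))
```

To exercise a real model, the same helper could be handed the library's generation function in place of `toy_generate`, assuming it accepts a prompt string and returns text.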

Hardware & Methodology

Test Environment

CPU: Intel Core i7-8700K @ 3.70GHz
RAM: 32GB DDR4
OS: Linux 6.14.11 (Fedora 42)
Python: 3.13.2
Models: Gemma-3n-E2B-it-Q8_0.gguf (v2.0.0+), Qwen3-Embedding-0.6B-Q8_0.gguf, Qwen3-Reranker-4B-Q8_0.gguf

Benchmark Methodology

Speed Tests

5 warmup iterations before measurement
100 iterations for statistical significance
High-resolution timing with time.perf_counter()
Memory tracking with psutil
Cache cleared between hit/miss tests
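The timing procedure listed above can be sketched as a small harness: run warmup iterations, time each measured call with `time.perf_counter()`, then summarize the samples. This is an illustrative reconstruction using only the standard library, not the project's actual benchmark code. The `benchmark` function and its return keys are assumptions, and memory tracking and cache clearing are omitted.

```python
import statistics
import time


def benchmark(fn, warmup: int = 5, iterations: int = 100) -> dict:
    """Time fn() after warmup and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup calls are not measured

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    mean_ms = statistics.fmean(samples_ms)
    return {
        "mean_ms": mean_ms,
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_per_sec": 1000.0 / mean_ms,
    }


# Stand-in workload; a real run would call the model here instead
stats = benchmark(lambda: sum(range(50_000)))
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With only 100 samples, the P99 value is an interpolation near the slowest observation, so it is more sensitive to a single outlier than the mean or median.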

Accuracy Tests

LightEval framework for standard benchmarks
Custom determinism verification suite
Multiple seed testing for consistency
Platform compatibility checks

Comparison with Alternatives

vs. Non-Deterministic LLMs

Feature	SteadyText	GPT/Claude APIs
Determinism	100% guaranteed	Variable
Latency	46.7ms (mean)	500-3000ms
Cost	Free (local)	$0.01-0.15/1K tokens
Offline	✅ Works	❌ Requires internet
Privacy	✅ Local only	⚠️ Cloud processing

vs. Caching Solutions

Feature	SteadyText	Redis/Memcached
Setup	Zero config	Requires setup
First Run	46.7ms	N/A (miss)
Cached	1.0ms	0.5-2ms
Semantic	✅ Built-in	❌ Exact match only

Running Benchmarks

To run benchmarks yourself:

Using uv (recommended):

# Run all benchmarks
uv run python benchmarks/run_all_benchmarks.py

# Quick benchmarks (for CI)
uv run python benchmarks/run_all_benchmarks.py --quick

# Test framework only
uv run python benchmarks/test_benchmarks.py

Legacy method:

# Install benchmark dependencies
pip install "steadytext[benchmark]"

# Run all benchmarks
python benchmarks/run_all_benchmarks.py

See benchmarks/README.md for detailed instructions.

Key Takeaways

Production Ready: Sub-50ms mean latency (58.0ms at P95) suitable for real-time applications
Efficient Caching: 48x speedup for repeated operations
Scalable: Good concurrent performance up to 8 workers
Quality Trade-off: Slightly lower accuracy than larger models, but 100% deterministic
Resource Efficient: Only 1.4GB memory for both models

SteadyText is well suited to testing, CLI tools, and any application that requires reproducible AI outputs.