Benchmarks
AI Research OS benchmarks measure three core capabilities: paper import speed, search latency, and RAG answer quality. The numbers below were collected on a standard research workstation (8-core CPU, 16 GB RAM, SSD).
[!NOTE] These benchmarks are self-measured using the built-in `PerformanceProfiler` and `BenchmarkComparator`. Run `rairos benchmark` to reproduce.
Paper Import Pipeline
| Stage | Operation | Avg Time | Notes |
|---|---|---|---|
| arXiv metadata fetch | HTTP GET | 0.8s | Cached after first fetch |
| PDF download | HTTP GET | 1.2s | Per paper, depends on size |
| Text extraction | PyMuPDF | 0.3s | Per page, parallelized |
| Database write | SQLite INSERT | 0.05s | Batched for bulk imports |
| LLM analysis | API call | 2–8s | Optional, depends on model |
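Per-stage averages like those in the table above can be gathered by timing each stage separately. The following is a minimal sketch using only the standard library; `StageTimer` and the stage names are hypothetical and do not represent the project's `PerformanceProfiler` API.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock samples per named stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed seconds even if the stage raises.
            self.samples[name].append(time.perf_counter() - start)

    def averages(self):
        """Return the mean duration in seconds for each stage."""
        return {name: sum(vals) / len(vals) for name, vals in self.samples.items()}


timer = StageTimer()
for _ in range(3):
    with timer.stage("text_extraction"):
        time.sleep(0.01)   # stand-in for real extraction work
    with timer.stage("database_write"):
        time.sleep(0.001)  # stand-in for a real INSERT

stage_averages = timer.averages()
```

Using `time.perf_counter()` rather than `time.time()` keeps the measurement monotonic, which matters for sub-second stages such as the database write.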
Bulk Import Throughput
10 papers: ~45s (0.22 papers/sec)
50 papers: ~3m (0.27 papers/sec) ← parallel downloads kick in
100 papers: ~5m (0.33 papers/sec)
Speedup after first run comes from metadata caching and smart cache deduplication.
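To illustrate why throughput rises once downloads run in parallel, here is a small sketch that simulates network-bound downloads with `time.sleep` and compares one worker against four. The `fake_download` and `import_batch` functions, the sleep duration, and the worker counts are all assumptions made for illustration; the project's actual import code and concurrency settings may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(paper_id: str) -> str:
    """Stand-in for a network-bound PDF download (hypothetical)."""
    time.sleep(0.05)
    return paper_id


def import_batch(paper_ids, max_workers=1):
    """Download a batch and return (results, papers_per_second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order.
        results = list(pool.map(fake_download, paper_ids))
    elapsed = time.perf_counter() - start
    return results, len(results) / elapsed


ids = [f"paper-{i}" for i in range(8)]
serial_results, serial_rate = import_batch(ids, max_workers=1)
parallel_results, parallel_rate = import_batch(ids, max_workers=4)
print(f"serial: {serial_rate:.1f} papers/sec, parallel: {parallel_rate:.1f} papers/sec")
```

Threads suit this workload because the time is spent waiting on I/O rather than computing, so the interpreter lock is not the bottleneck.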
Search Performance
| Query Type | Latency (p50) | Latency (p95) | Notes |
|---|---|---|---|
| Keyword FTS5 | 12ms | 35ms | BM25 ranking |
| Semantic vector | 28ms | 80ms | 768-dim embeddings |
| Hybrid (FTS + vector) | 40ms | 110ms | Reciprocal rank fusion (RRF) |
| Filtered search | 8ms | 22ms | With index on tag/status |
Results were measured against a database of 1,000 papers.
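The hybrid row combines the keyword and vector result lists with reciprocal rank fusion. The sketch below shows the general technique, in which each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is a commonly used default and an assumption here, as are the function name and the sample paper IDs; the project's own fusion code may use different parameters.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID for a deterministic tie-break.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["p3", "p1", "p7"]  # illustrative output of a keyword query
vector_hits = ["p1", "p9", "p3"]   # illustrative output of an embedding search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Because the fusion uses only ranks, it needs no normalization between BM25 scores and vector similarities, which live on different scales.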
RAG Answer Quality
AI Research OS uses BenchmarkComparator to evaluate answer quality by cross-referencing cited papers against ground-truth experiment tables.
| Metric | Score | Description |
|---|---|---|
| Citation Accuracy | 94.2% | Claims backed by cited paper's table data |
| Factual Recall | 87.6% | Key numbers recalled from imported papers |
| Hallucination Rate | 3.1% | Unverified claims per answer |
[!TIP] Run `python -m llm.benchmark --compare <paper_id1> <paper_id2>` to compare benchmark results across any two imported papers.
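As a rough illustration of how metrics like those in the table could be derived from labeled claims, the sketch below computes a citation accuracy and a hallucination rate. The `Claim` record, its fields, the exact metric definitions, and the sample claims are all assumptions made for this example; they do not describe how `BenchmarkComparator` computes its scores.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical record for one claim in a generated answer."""
    text: str
    cited: bool      # the claim carries a citation
    supported: bool  # the cited paper's table data backs the claim


def citation_accuracy(claims):
    """Fraction of cited claims that the cited table data supports."""
    cited = [c for c in claims if c.cited]
    return sum(c.supported for c in cited) / len(cited) if cited else 0.0


def hallucination_rate(claims):
    """Fraction of all claims that no table data supports."""
    return sum(not c.supported for c in claims) / len(claims) if claims else 0.0


# Made-up claims, for illustration only.
claims = [
    Claim("Model A reaches 91.2 on benchmark X", cited=True, supported=True),
    Claim("Model B uses 1.3B parameters", cited=True, supported=True),
    Claim("Model A trains twice as fast", cited=True, supported=False),
    Claim("Model C is the current best", cited=False, supported=False),
]
print(f"citation accuracy: {citation_accuracy(claims):.1%}")
print(f"hallucination rate: {hallucination_rate(claims):.1%}")
```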
Core Module Test Coverage
| Module | Coverage | Test Count |
|---|---|---|
| core/ | 91% | 312 tests |
| llm/ | 78% | 485 tests |
| parsers/ | 85% | 124 tests |
| db/ | 89% | 67 tests |
| Total | 83% | 3,827 tests |
Run `python -m pytest --ignore=neuraloperator_fork -q` to execute the full test suite.
Evolution System
The self-evolution system improves over time. Gene/capsule patterns that fail to produce useful insights are faded out.
| Metric | Value |
|---|---|
| Active capsules | 24 |
| Capsules evolved this month | 7 |
| Average insight quality score | 6.4 / 10 |
| Self-improvement rate | +0.3 / month |