Benchmarks

AI Research OS benchmarks measure three core capabilities: paper import speed, search latency, and RAG answer quality. The numbers below were collected on a standard research workstation (8-core CPU, 16 GB RAM, SSD).

[!NOTE] These benchmarks are self-measured using the built-in `PerformanceProfiler` and `BenchmarkComparator`. Run `rairos benchmark` to reproduce.

Paper Import Pipeline

Stage	Operation	Avg Time	Notes
arXiv metadata fetch	HTTP GET	0.8s	Cached after first fetch
PDF download	HTTP GET	1.2s	Per paper, depends on size
Text extraction	PyMuPDF	0.3s	Per page, parallelized
Database write	SQLite INSERT	0.05s	Batched for bulk imports
LLM analysis	API call	2–8s	Optional, depends on model
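
Per-stage averages like the ones in this table can be gathered with a small timing helper. The sketch below is an illustration only: `StageTimer` and the stage names are hypothetical and are not the project's `PerformanceProfiler`, and the `time.sleep` calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class StageTimer:
    """Collects wall-clock durations per named stage (hypothetical helper)."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        # perf_counter is a monotonic, high-resolution clock suited to timing.
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.samples[name].append(time.perf_counter() - start)

    def averages(self):
        """Return the mean duration in seconds for every recorded stage."""
        return {name: mean(values) for name, values in self.samples.items()}


timer = StageTimer()
for _ in range(3):
    with timer.stage("text_extraction"):
        time.sleep(0.01)   # stand-in for real extraction work
    with timer.stage("database_write"):
        time.sleep(0.001)  # stand-in for a real INSERT

for name, avg in timer.averages().items():
    print(f"{name}: {avg:.3f}s")
```

Averaging over several runs matters most for the network-bound stages, where a single measurement can be dominated by one slow response.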

Bulk Import Throughput

10 papers:   ~45s  (0.22 papers/sec)
50 papers:   ~3m   (0.27 papers/sec)  ← parallel downloads kick in
100 papers:  ~5m   (0.33 papers/sec)

Throughput rises with batch size because downloads run in parallel. Repeat runs are faster still, thanks to metadata caching and smart cache deduplication.
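
Parallel downloads and cache-based deduplication can be sketched together as follows. This is a minimal illustration, not the project's importer: `import_bulk`, `fake_fetch`, the in-memory `cache` dict, and the `max_workers` value are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def import_bulk(paper_ids, fetch, cache, max_workers=8):
    """Fetch each unique paper once, in parallel, reusing cached results."""
    # dict.fromkeys drops duplicate IDs while preserving first-seen order.
    unique_ids = list(dict.fromkeys(paper_ids))
    missing = [pid for pid in unique_ids if pid not in cache]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as `missing`.
        for pid, result in zip(missing, pool.map(fetch, missing)):
            cache[pid] = result

    return [cache[pid] for pid in unique_ids]


calls = []


def fake_fetch(paper_id):
    """Stand-in for a real metadata/PDF download (hypothetical)."""
    calls.append(paper_id)
    return {"id": paper_id, "title": f"Paper {paper_id}"}


cache = {}
first = import_bulk(["2601.00155", "2302.00763", "2601.00155"], fake_fetch, cache)
second = import_bulk(["2601.00155", "2302.00763"], fake_fetch, cache)
print(len(first), len(calls))
```

Threads suit this workload because the download stages are I/O-bound. The second call performs no fetches at all, which is the effect a warm cache has on a repeat run.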

Search Performance

Query Type	Latency (p50)	Latency (p95)	Notes
Keyword FTS5	12ms	35ms	BM25 ranking
Semantic vector	28ms	80ms	768-dim embeddings
Hybrid (FTS + vector)	40ms	110ms	RRF fusion
Filtered search	8ms	22ms	With index on tag/status

Latencies were measured with 1,000 papers in the database.
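
The hybrid row relies on reciprocal rank fusion (RRF), which merges ranked lists by summing `1 / (k + rank)` for each document across the lists. The sketch below shows the general technique under stated assumptions: `rrf_fuse`, the paper IDs, and `k=60` (a commonly used default) are illustrative, and the project's actual constant and tie-breaking may differ.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["p3", "p1", "p7"]  # e.g. an FTS5/BM25 result list
vector_hits = ["p1", "p9", "p3"]   # e.g. a vector-similarity result list

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)  # ['p1', 'p3', 'p9', 'p7']
```

RRF uses only rank positions, so BM25 scores and vector similarities never need to be normalized onto a common scale. Documents found by both retrievers (`p1`, `p3`) rise above those found by only one.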

RAG Answer Quality

AI Research OS uses `BenchmarkComparator` to evaluate answer quality by cross-referencing cited papers against ground-truth experiment tables.

Metric	Score	Description
Citation Accuracy	94.2%	Claims backed by cited paper's table data
Factual Recall	87.6%	Key numbers recalled from imported papers
Hallucination Rate	3.1%	Share of claims in an answer that could not be verified
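
To make the metric definitions concrete, the sketch below computes two of them from hand-labeled claims. It reflects one plausible reading of the table, not the `BenchmarkComparator` implementation: the `Claim` structure, both formulas, and the sample claims are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    cited: bool     # the answer attaches a paper citation to this claim
    verified: bool  # the claim matches the cited paper's table data


def citation_accuracy(claims):
    """Fraction of cited claims that the cited paper's data supports."""
    cited = [c for c in claims if c.cited]
    if not cited:
        return 0.0
    return sum(c.verified for c in cited) / len(cited)


def hallucination_rate(claims):
    """Fraction of all claims in the answer that could not be verified."""
    if not claims:
        return 0.0
    return sum(not c.verified for c in claims) / len(claims)


# Made-up labels for a single answer, for illustration only.
answer = [
    Claim("claim 1", cited=True, verified=True),
    Claim("claim 2", cited=True, verified=False),
    Claim("claim 3", cited=False, verified=False),
    Claim("claim 4", cited=True, verified=True),
]

print(f"citation accuracy:  {citation_accuracy(answer):.1%}")
print(f"hallucination rate: {hallucination_rate(answer):.1%}")
```

Under these definitions the two metrics use different denominators (cited claims versus all claims), so they need not sum to 100%.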

[!TIP] Run `python -m llm.benchmark --compare <paper_id1> <paper_id2>` to compare benchmark results across any two imported papers.

Core Module Test Coverage

Module	Coverage	Test Count
`core/`	91%	312 tests
`llm/`	78%	485 tests
`parsers/`	85%	124 tests
`db/`	89%	67 tests
Total	83%	3,827 tests

Run `python -m pytest --ignore=neuraloperator_fork -q` to execute the full test suite and reproduce these counts.

Evolution System

The self-evolution system improves over time. Gene/capsule patterns that fail to produce useful insights are faded out.

Metric	Value
Active capsules	24
Capsules evolved this month	7
Average insight quality score	6.4 / 10
Self-improvement rate	+0.3 / month
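
The source does not describe how unproductive gene/capsule patterns are faded out. One way such a mechanism could work is multiplicative weight decay with a retirement floor. Everything in the sketch below is hypothetical, including `fade_step`, the `decay` and `floor` values, and the capsule names. It is not the project's evolution logic.

```python
def fade_step(weights, useful, decay=0.8, floor=0.1):
    """Apply one evaluation round and return the surviving capsule weights.

    Capsules marked useful keep their weight; the rest are multiplied by
    `decay`. Any capsule whose weight falls below `floor` is dropped.
    """
    updated = {}
    for name, weight in weights.items():
        new_weight = weight if useful.get(name, False) else weight * decay
        if new_weight >= floor:
            updated[name] = new_weight
    return updated


# Hypothetical capsules: one productive, one that keeps failing.
weights = {"summarize_tables": 1.0, "guess_citations": 0.12}
useful = {"summarize_tables": True, "guess_citations": False}

weights = fade_step(weights, useful)
print(weights)  # {'summarize_tables': 1.0}
```

Gradual decay, as opposed to immediate removal, gives a pattern several rounds to recover before it is retired.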

Reproducing These Numbers

# Full test suite
python -m pytest tests/ -q --ignore=neuraloperator_fork

# Profile paper import
rairos benchmark --time-import 2601.00155

# Compare two papers' benchmark tables
python -m llm.benchmark --compare 2601.00155 2302.00763