Architecture¶

Overview¶

CLI (python -m cli)  /  rairos-cli (Rust)
    │
    ├── cli/              # Python CLI (argparse-based subcommands)
    ├── core/             # Core Python modules (database, embedding, KG)
    ├── llm/              # LLM clients, gap detection, gene pool, evolution
    ├── parsers/          # arXiv, DOI, PDF parsing
    ├── db/               # SQLite database layer (Rust pyo3 extension)
    ├── research_loop/    # Deep research agent, paper2code, RAG pipeline
    ├── web/              # FastAPI web UI
    └── crates/           # Rust rewrite (103 crates, 50k+ LOC)
        ├── rairos-core   # Core data structures + database
        ├── rairos-llm    # LLM clients, gene pool, evolution
        ├── rairos-cli    # Rust CLI (48 commands)
        ├── rairos-web    # Rust web server
        └── ...

AI Research OS is a local-first research tool with no cloud dependency: all data stays in ~/.ai_research_os/.

Modules¶

CLI Entry Point¶

cli/ registers its subcommands via argparse. Every invocation takes a subcommand:

python -m cli <subcommand> [args]   # recommended
airos-cli <subcommand> [args]       # via entry point
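
As a rough illustration of argparse-based subcommand registration, the sketch below wires up two of the commands mentioned in this document (`search` and `gap`). The function name `build_parser` and the exact argument layout are assumptions made for the example, not the project's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a top-level parser with one sub-parser per subcommand."""
    parser = argparse.ArgumentParser(prog="airos-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # "search" takes a free-text query (argument name is an assumption).
    search = subparsers.add_parser("search", help="full-text search")
    search.add_argument("query")

    # "gap" carries its own nested subcommands: list and extract.
    gap = subparsers.add_parser("gap", help="Gene Pool operations")
    gap_sub = gap.add_subparsers(dest="gap_command", required=True)
    gap_list = gap_sub.add_parser("list")
    gap_list.add_argument("--status", choices=["active", "consumed", "archived"])
    gap_extract = gap_sub.add_parser("extract")
    gap_extract.add_argument("paper_id")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["gap", "list", "--status", "active"])
print(args.command, args.gap_command, args.status)
```

Setting `required=True` on the sub-parsers makes a bare invocation fail with a usage message, which matches the rule that every invocation takes a subcommand.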

`parsers/`¶

Module	Input	Output
`arxiv.py`	arXiv ID or URL	`Paper` object
`crossref.py`	DOI	`Paper` + optional arXiv ID
`pdf.py`	Local PDF path	`Paper` with full text
`openalex.py`	OpenAlex work ID	Citation graph data
`input_detection.py`	Arbitrary string	Detects arXiv ID, DOI, or file path
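
The detection step in the last row can be pictured as a small classifier over the input string. The following is a minimal sketch, assuming new-style arXiv IDs and a `.pdf` suffix for local files; the regular expressions and the name `detect_input` are illustrative and may differ from what `input_detection.py` actually does.

```python
import re

# New-style arXiv IDs such as 2301.00001 or 2301.00001v2, bare or in a URL.
ARXIV_URL_RE = re.compile(
    r"https?://arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?",
    re.IGNORECASE,
)
ARXIV_ID_RE = re.compile(r"(?:arxiv:)?(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
# DOIs: "10.", a registrant code, a slash, then a suffix; optionally a doi.org URL.
DOI_RE = re.compile(r"(?:https?://(?:dx\.)?doi\.org/)?(10\.\d{4,9}/\S+)", re.IGNORECASE)


def detect_input(text: str) -> tuple[str, str]:
    """Classify a string as ("doi" | "arxiv" | "pdf" | "unknown", normalized value)."""
    text = text.strip()
    match = DOI_RE.fullmatch(text)
    if match:
        return ("doi", match.group(1))
    match = ARXIV_URL_RE.fullmatch(text) or ARXIV_ID_RE.fullmatch(text)
    if match:
        # Drop the version suffix so v1 and v2 map to the same paper.
        return ("arxiv", match.group(1))
    if text.lower().endswith(".pdf"):
        return ("pdf", text)
    return ("unknown", text)


for sample in ["2301.00001v2", "https://doi.org/10.1000/xyz123", "papers/attention.pdf"]:
    print(sample, "->", detect_input(sample))
```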

`database.py`¶

SQLite wrapper with FTS5 full-text search.

Schema:

Table	Purpose
`papers`	One row per unique paper
`papers_fts`	FTS5 virtual table — BM25 full-text search
`papers_embeddings`	768-dim vectors via Ollama `nomic-embed-text`
`citations`	(source_id, target_id) directed edges

Key methods:

- upsert_paper(): insert or update; handles duplicates
- search_papers(query): FTS5 with BM25 ranking, plus category/date filters
- set_embedding() / find_similar(): semantic deduplication
- add_citation() / get_citations(): citation graph traversal
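
To make the upsert and BM25 search behavior concrete, here is a minimal sketch using Python's built-in `sqlite3` module (it assumes the bundled SQLite was compiled with FTS5 and supports `ON CONFLICT` upserts). The column layout is a simplified assumption; the real `papers` schema and method signatures are not shown in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (
        id TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        abstract TEXT NOT NULL DEFAULT ''
    );
    -- The id column is stored but not tokenized.
    CREATE VIRTUAL TABLE papers_fts USING fts5(id UNINDEXED, title, abstract);
    """
)


def upsert_paper(paper_id: str, title: str, abstract: str) -> None:
    """Insert a paper, or update it in place if the ID already exists."""
    conn.execute(
        "INSERT INTO papers (id, title, abstract) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET "
        "title = excluded.title, abstract = excluded.abstract",
        (paper_id, title, abstract),
    )
    # Keep the FTS index in sync: drop any stale row, then re-index.
    conn.execute("DELETE FROM papers_fts WHERE id = ?", (paper_id,))
    conn.execute(
        "INSERT INTO papers_fts (id, title, abstract) VALUES (?, ?, ?)",
        (paper_id, title, abstract),
    )


def search_papers(query: str, limit: int = 10) -> list[str]:
    """Return paper IDs ranked by BM25 (lower bm25() means a better match)."""
    rows = conn.execute(
        "SELECT id FROM papers_fts WHERE papers_fts MATCH ? "
        "ORDER BY bm25(papers_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]


upsert_paper("2301.00001", "Sparse attention", "efficient transformer attention")
upsert_paper("2301.00002", "Graph learning", "message passing networks")
print(search_papers("attention"))
```

Calling `upsert_paper()` again with an existing ID updates the row rather than adding a second one, which is the duplicate handling described above.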

`embed.py`¶

Ollama embedding client at http://localhost:11434/api/embeddings.

Model: nomic-embed-text (768-dim)
Used for: semantic deduplication, similarity search
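
A minimal sketch of such a client, using only the standard library, is shown below. The request and response shapes follow Ollama's `/api/embeddings` endpoint (`model` and `prompt` in, `embedding` out); the helper names and the cosine-similarity comparison are assumptions for illustration, since this document does not say how `embed.py` scores similarity.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text"


def build_request(text: str) -> urllib.request.Request:
    """Build the POST request for one embedding call."""
    payload = json.dumps({"model": MODEL, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, timeout: float = 30.0) -> list[float]:
    """Return the embedding vector; requires a running local Ollama server."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return json.loads(resp.read())["embedding"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With Ollama running, `cosine_similarity(embed(title_a), embed(title_b))` gives a score that a deduplication step could compare against a threshold of its choosing.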

`citation.py`¶

OpenAlex API client. It uses a custom SSL context to work around SSL failures behind Windows proxies.

GET /works?filter=doi:{doi}              → resolve → OpenAlex ID
GET /works/{id}/references               → backward citations
GET /works?filter=cites:{id}            → forward citations
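
The three requests above can be assembled as in the sketch below. The paths are copied from the list as written; the base URL `https://api.openalex.org`, the helper names, and the way the SSL context is built are assumptions, since this document does not describe how `citation.py` configures its context.

```python
import json
import ssl
import urllib.request
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://api.openalex.org"


def resolve_doi_url(doi: str) -> str:
    """URL that resolves a DOI to an OpenAlex work."""
    return f"{BASE_URL}/works?filter=doi:{quote(doi, safe='/')}"


def references_url(work_id: str) -> str:
    """URL for backward citations (works this paper cites)."""
    return f"{BASE_URL}/works/{work_id}/references"


def cited_by_url(work_id: str) -> str:
    """URL for forward citations (works citing this paper)."""
    return f"{BASE_URL}/works?filter=cites:{work_id}"


def make_ssl_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Verifying context; ca_file can point at a proxy's CA bundle (assumption)."""
    return ssl.create_default_context(cafile=ca_file)


def fetch_json(url: str, context: Optional[ssl.SSLContext] = None) -> dict:
    """GET a URL and decode the JSON body, using the given SSL context."""
    context = context or make_ssl_context()
    with urllib.request.urlopen(url, context=context, timeout=30) as resp:
        return json.loads(resp.read())


print(resolve_doi_url("10.1000/xyz123"))
```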

`notes/`¶

Structured knowledge output from paper processing:

Structure	Scope	Description
P-Note	Per paper	One key insight per paper
C-Note	Per concept/tag	Aggregated notes across papers sharing a tag
M-Note	3+ papers	Comparison across papers that share a tag
Radar	Per tag	Topic frequency heat score
Timeline	Per tag	Year-based research evolution

`llm/insight/` — Gene Pool¶

Self-evolving research gap memory. Tracks which gaps have been identified, validated, or consumed.

Core concepts:

Concept	Description
CapsuleGene	A single research gap entry — `gap_type`, `gap_title`, `keywords`, `outcome_success_score`
Gene Pool	Dual-store: `gene_pool.jsonl` (tracker) + `capsules.json` (web UI)
consumed (closed loop)	Suggestions carry `source_cap_id`; when one is accepted, its source capsule is marked `consumed`
Capsule merge	Same `gap_type` + Jaccard keyword overlap ≥ 0.80 → merge into winner
Auto-archive	Capsule with `low_score_streak ≥ 3` and score < 0.30 → auto-archived

Lifecycle states: active → consumed (closed loop) or archived (auto-archive)
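
The merge and auto-archive rules above can be sketched as follows. The thresholds (0.80, streak of 3, score below 0.30) come from the table; how the winner is chosen, what happens to the loser, and whether keywords are unioned are not specified here, so those parts are labeled assumptions. This `CapsuleGene` is a simplified stand-in for the dataclass in `llm/insight/gene.py`.

```python
from dataclasses import dataclass


@dataclass
class CapsuleGene:
    gap_type: str
    gap_title: str
    keywords: set[str]
    outcome_success_score: float
    status: str = "active"
    low_score_streak: int = 0


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_merge(a: CapsuleGene, b: CapsuleGene, threshold: float = 0.80) -> bool:
    """Same gap_type and keyword overlap at or above the threshold."""
    return a.gap_type == b.gap_type and jaccard(a.keywords, b.keywords) >= threshold


def merge(a: CapsuleGene, b: CapsuleGene) -> CapsuleGene:
    """Merge into the winner. Assumption: the higher-scoring capsule wins."""
    winner, loser = (a, b) if a.outcome_success_score >= b.outcome_success_score else (b, a)
    winner.keywords |= loser.keywords  # assumption: winner absorbs keywords
    loser.status = "archived"          # assumption: loser is retired
    return winner


def maybe_archive(capsule: CapsuleGene) -> str:
    """Archive an active capsule with a streak of 3+ low scores and score < 0.30."""
    if (
        capsule.status == "active"
        and capsule.low_score_streak >= 3
        and capsule.outcome_success_score < 0.30
    ):
        capsule.status = "archived"
    return capsule.status
```

For example, two `method` capsules with keywords {a, b, c, d, e} and {a, b, c, d} overlap at 4/5 = 0.80, which just meets the merge threshold.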

Key files:

File	Role
`llm/insight/gene.py`	`CapsuleGene` dataclass with `status`, `low_score_streak`
`llm/insight/tracker.py`	`EvolutionTracker` — encodes/finds/archives capsules
`llm/insight/evolution.py`	`InsightEvolution` — merge + auto-archive on each `apply()`
`llm/paper_gap_extractor.py`	Extract gap from paper via LLM → Gene Pool
`llm/briefing_generator.py`	`_match_gene_pool()` — match paper to existing capsules

CLI access:

airos-cli gap list [--status active|consumed|archived]  # list Gene Pool
airos-cli gap extract <paper_id>                       # extract gap from paper

Web UI: the /paper/{id} page shows Gene Pool relevance and an "Extract Gap" button.

`kg/`¶

Knowledge graph — paper-level and concept-level nodes with citation edges.

`research_loop/`¶

Autonomous research loop (Plan → Act → Observe → Learn):

paper2code_integration/ — generate code from paper
rag_pipeline/ — RAG: paper → code → tests → benchmark
evoskill_integration/ — feedback-driven skill discovery

Data Flow¶

import PAPER_ID
    │
    ▼
parse (arXiv | DOI | PDF)
    │
    ▼
upsert_paper() → papers table
    │
    ├──► index_paper()     → papers_fts (FTS5)
    └──► embed.py          → papers_embeddings (Ollama)

search "query"
    │
    ▼
FTS5 BM25(query) → ranked results

cite-fetch PAPER_ID
    │
    ▼
OpenAlex API → add_citation() → citations table

research "topic"
    │
    ▼
Plan → Act → Observe → Learn → (repeat)

Database Location¶

Default: ~/.ai_research_os/papers.db

Override with:

export AI_RESEARCH_OS_DB=/path/to/papers.db
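
In Python terms, the lookup amounts to checking the environment variable and falling back to the default. A minimal sketch, where the helper name `resolve_db_path` is hypothetical:

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_DB = Path.home() / ".ai_research_os" / "papers.db"


def resolve_db_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the AI_RESEARCH_OS_DB override if set, else the default path."""
    env = os.environ if env is None else env
    override = env.get("AI_RESEARCH_OS_DB")
    # An unset or empty variable falls back to the default location.
    return Path(override).expanduser() if override else DEFAULT_DB


print(resolve_db_path())
```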
