Usage¶
Initialize¶
Import Papers¶
# Single paper by arXiv ID
python -m cli import 2601.00155
# Multiple papers
python -m cli import 2601.00155 2301.12345 10.1038/nature12373
# Batch from file (one ID per line)
python -m cli import --file ids.txt
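The batch file can be prepared with ordinary shell tools. The following is a minimal sketch that writes the three identifiers from the examples above to ids.txt, one per line, and counts them before the import. Whether the importer tolerates blank lines or comments in the file is not stated here, so the sketch writes identifiers only.

```shell
# Write one identifier per line (arXiv IDs and a DOI, taken from the examples above).
printf '%s\n' 2601.00155 2301.12345 10.1038/nature12373 > ids.txt

# Sanity check: count the identifiers before importing.
id_count=$(wc -l < ids.txt | tr -d ' ')
echo "ids.txt contains ${id_count} identifiers"
```

The resulting file can then be passed to `python -m cli import --file ids.txt`.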
List Papers¶
# List all papers
python -m cli list
# Filter by status
python -m cli list --status pending
python -m cli list --status done
# Filter by source
python -m cli list --source arxiv
Search¶
# Full-text search (FTS5 with BM25 ranking)
python -m cli search "transformer attention"
# Filter by category and date
python -m cli search --category cs.LG --date-from 2024-01-01
# Filter by parse status and sort by date
python -m cli search --status done --sort date
# Limit results
python -m cli search "LLM" --limit 50
Database Statistics¶
Export¶
# BibTeX (default)
python -m cli export > papers.bib
# JSON
python -m cli export --format json > papers.json
Queue¶
Merge Duplicates¶
# Dry run — preview what would merge
python -m cli merge --dry-run
# Preview auto-merge of high-similarity pairs (>= 0.95)
python -m cli merge --auto --dry-run
# Auto-merge for real
python -m cli merge --auto
# Keep specific paper when merging
python -m cli merge --keep newer 2301.00001 --dry-run
# Preview auto-merge with semantic preference (similarity >= 0.8 and matching titles)
python -m cli merge --keep semantic --auto --dry-run
Semantic Deduplication¶
Requires Ollama with the nomic-embed-text model.
# Check embedding coverage
python -m cli dedup-semantic --stats
# Generate embeddings for papers missing them
python -m cli dedup-semantic --generate
# Find similar papers for a specific paper
python -m cli dedup-semantic --paper 2601.00155
# Custom similarity threshold (higher = stricter match, default: 0.85)
python -m cli dedup-semantic --paper 2601.00155 --threshold 0.90
# Limit number of similar papers returned
python -m cli dedup-semantic --paper 2601.00155 --limit 5
# CSV output for pipeline integration
python -m cli dedup-semantic --generate --format csv
Citation Graph¶
Fetch Citations from OpenAlex¶
# Fetch for all papers in DB
python -m cli cite-fetch
# Fetch for specific paper
python -m cli cite-fetch 2601.00155
# Dry run — preview what would be imported
python -m cli cite-fetch --dry-run
# Only import citations where both papers are in local DB
python -m cli cite-fetch --skip-external
# Fetch only backward citations (papers cited by this paper)
python -m cli cite-fetch 2601.00155 --direction from
# Fetch only forward citations (papers citing this paper)
python -m cli cite-fetch 2601.00155 --direction to
# Rate limit (~9 req/s)
python -m cli cite-fetch --delay 0.11
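The "~9 req/s" figure follows from the delay value, assuming --delay is the pause in seconds between consecutive requests (the comment above implies this, but it is an assumption). A short sketch of the arithmetic:

```shell
# Requests per second is the reciprocal of the per-request delay in seconds.
delay=0.11
rate=$(LC_ALL=C awk -v d="$delay" 'BEGIN { printf "%.2f", 1 / d }')
echo "delay=${delay}s -> ~${rate} req/s"
```

The same calculation run in reverse gives the delay for a target rate: 1 divided by the desired requests per second.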
Extract References from Paper (plain-text)¶
# Extract references from a paper's plain text and print them
python -m cli cite-import --extract --paper 2601.00155
# Same, but preview the citation edges that would be imported into the DB (only arXiv IDs that exist in the DB)
python -m cli cite-import --extract --paper 2601.00155 --dry-run
# Import with duplicate reporting (uses upsert mode)
python -m cli cite-import --extract --paper 2601.00155 --dedup
Bulk Import Citations from JSON¶
# From stdin
cat citations.json | python -m cli cite-import
# From file
python -m cli cite-import --file citations.json
# Dry run
python -m cli cite-import --file citations.json --dry-run
# Skip edges where source/target is not in DB
python -m cli cite-import --file citations.json --skip-missing
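To illustrate what skipping missing edges means, here is a rough sketch that filters a list of citation edges against a list of known paper IDs. The file names and the whitespace-separated "source target" layout are hypothetical stand-ins chosen for the example; they are not the tool's JSON input format. The sketch also assumes an edge is kept only when both endpoints are known.

```shell
# IDs present in the local DB (hypothetical list for illustration).
printf '%s\n' 2601.00155 2301.12345 > known_ids.txt

# Candidate edges as "source target" pairs (hypothetical layout).
cat > candidate_edges.txt <<'EOF'
2601.00155 2301.12345
2601.00155 9999.00001
EOF

# First pass loads the known IDs; second pass keeps edges whose
# source and target are both known.
kept=$(awk 'NR == FNR { known[$1] = 1; next } ($1 in known) && ($2 in known)' \
  known_ids.txt candidate_edges.txt)
echo "$kept"
```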
JSON format:
Citation Statistics¶
# Global stats — total edges, unique citing/cited, avg per paper
python -m cli cite-stats
# Per-paper stats
python -m cli cite-stats --paper 2601.00155
# Sort by outgoing citations (papers that cite the most others)
python -m cli cite-stats --by citing
# Sort by cited-by (most cited papers)
python -m cli cite-stats --by cited
# CSV output
python -m cli cite-stats --format csv
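The global figures can be understood as simple counts over the citation edges. The sketch below computes total edges, unique citing papers, unique cited papers, and the most-cited paper from a small hypothetical edge list in "citing cited" form. It illustrates the quantities only and does not reflect how cite-stats is implemented. The average per paper is left out because its exact definition is not given here.

```shell
# Hypothetical edge list: each line is "citing cited".
cat > stats_edges.txt <<'EOF'
2601.00155 2301.12345
2601.00155 2301.00001
2301.12345 2301.00001
EOF

total_edges=$(wc -l < stats_edges.txt | tr -d ' ')
unique_citing=$(cut -d' ' -f1 stats_edges.txt | sort -u | wc -l | tr -d ' ')
unique_cited=$(cut -d' ' -f2 stats_edges.txt | sort -u | wc -l | tr -d ' ')

# Count how often each paper appears as a target and take the top entry.
most_cited=$(cut -d' ' -f2 stats_edges.txt | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')

echo "edges=${total_edges} citing=${unique_citing} cited=${unique_cited} top=${most_cited}"
```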
Citation Graph (plain-text)¶
# Extract references from a paper's plain text (by paper ID or from a file) and print them as a citation list
python -m cli cite-graph --plain-text --paper 2601.00155
python -m cli cite-graph --plain-text --file ./paper.txt
python -m cli cite-graph --plain-text --file ./paper.txt --verbose
Plain-text mode reads a paper's plain text, extracts all arXiv IDs, DOIs, PMIDs, and ISBNs, and prints them as a citation list. Verbose mode shows the context around each identifier. DOIs, PMIDs, and ISBNs are shown but cannot be linked as citation edges; only arXiv IDs are linked.
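As a rough illustration of identifier extraction from plain text, the sketch below pulls arXiv IDs and DOIs out of a sample file with grep. The sample text and both patterns are assumptions made for this example; the tool's actual matching rules are not documented here and may be broader (for instance, matching arXiv IDs without an "arXiv:" prefix, or old-style IDs).

```shell
# Sample text standing in for a paper's plain text (hypothetical content).
cat > sample_paper.txt <<'EOF'
We build on prior work (arXiv:2301.12345) and the method of
doi:10.1038/nature12373, extended in arXiv:2601.00155.
EOF

# New-style arXiv IDs: "arXiv:" followed by four digits, a dot, and four or five digits.
arxiv_ids=$(grep -oE 'arXiv:[0-9]{4}\.[0-9]{4,5}' sample_paper.txt | sed 's/^arXiv://' | sort -u)

# DOIs: "10.", a registrant code, a slash, then a suffix; strip trailing punctuation.
dois=$(grep -oE '10\.[0-9]{4,9}/[A-Za-z0-9._/-]+' sample_paper.txt | sed 's/[.,;]$//' | sort -u)

echo "arXiv IDs:"; echo "$arxiv_ids"
echo "DOIs:"; echo "$dois"
```

Running grep with -n instead of -o prints the full matching lines with line numbers, which is one simple way to see the context around each identifier.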
Citation Graph (database)¶
# Graph centered on a paper (depth=1 by default)
python -m cli cite-graph --paper 2601.00155
python -m cli cite-graph --paper 2601.00155 --depth 2
python -m cli cite-graph --paper 2601.00155 --max-nodes 50
# Output formats
python -m cli cite-graph --paper 2601.00155 --format json
python -m cli cite-graph --paper 2601.00155 --format mermaid
python -m cli cite-graph --paper 2601.00155 --format text
Deduplicate Papers¶
# Deduplicate by exact arXiv ID match
python -m cli dedup
# Deduplicate by semantic similarity (embedding-based)
python -m cli dedup-semantic
python -m cli dedup-semantic --paper 2601.00155
python -m cli dedup-semantic --paper 2601.00155 --threshold 0.85 --limit 5
dedup removes duplicate papers from the database by exact arXiv ID match. dedup-semantic finds papers with similar abstracts using embeddings: --threshold sets the similarity cutoff (0.0-1.0, default 0.85) and --limit caps the number of results per paper.
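The interaction between the cutoff and the cap can be sketched with a small scored list. The IDs and scores below are made up for illustration, and the sketch assumes the cutoff is inclusive and that results are ranked from most to least similar; neither detail is confirmed here.

```shell
# Hypothetical "paper_id,similarity" rows for one query paper.
cat > scores.csv <<'EOF'
2301.12345,0.97
2301.00001,0.91
2402.11111,0.86
2403.22222,0.79
EOF

threshold=0.85
limit=2

# Keep rows at or above the cutoff, rank by similarity, then cap the count.
matches=$(LC_ALL=C awk -F, -v t="$threshold" '$2 >= t' scores.csv \
  | LC_ALL=C sort -t, -k2,2rn \
  | head -n "$limit")
echo "$matches"
```

Raising the threshold shrinks the candidate set before the cap applies, so a strict cutoff can return fewer rows than the limit allows.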