Arxiv-Collector.AutoGen automates arXiv paper discovery, PDF download, summarization, and citation graph building, turning hours of manual research into a structured knowledge base that feeds directly into your AI agents.
**TL;DR:** `Arxiv-Collector.AutoGen` is a TIER 3 research pipeline that monitors arXiv for new papers matching your interests, downloads PDFs, generates structured summaries, and builds a citation graph — all automated on a schedule.
The 10-Second Pitch
- **Scheduled arXiv monitoring** — tracks cs.AI, cs.CL, cs.LG, and custom categories for new papers matching keyword filters
- **Multi-agent pipeline** — one agent fetches, one summarizes, one extracts figures/tables, one builds citation edges
- **Structured knowledge base output** — outputs markdown notes, citation graph (JSON), and a reading queue
- **RAG-ready format** — summaries formatted for direct ingestion into vector databases (Pinecone, Weaviate, Chroma)
- **AutoGen-native** — designed to run as an autonomous agentic workflow, not a one-shot script
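
The citation-edge step above can be pictured as building a small directed graph and serializing it to JSON. The sketch below is illustrative only: the `CitationGraph` class, the `nodes`/`edges` layout, and the placeholder paper IDs are assumptions, not the skill's actual output schema.

```python
import json
from collections import defaultdict


class CitationGraph:
    """Directed graph: an edge A -> B means paper A cites paper B."""

    def __init__(self):
        self.titles = {}               # paper_id -> title
        self.cites = defaultdict(set)  # paper_id -> set of ids it cites

    def add_paper(self, paper_id, title):
        self.titles[paper_id] = title

    def add_citation(self, citing_id, cited_id):
        self.cites[citing_id].add(cited_id)

    def citation_count(self, paper_id):
        """Number of papers in the graph that cite paper_id."""
        return sum(1 for targets in self.cites.values() if paper_id in targets)

    def to_json(self):
        """Serialize to a nodes/edges layout (hypothetical schema)."""
        nodes = [{"id": pid, "title": t} for pid, t in sorted(self.titles.items())]
        edges = [
            {"source": src, "target": dst}
            for src in sorted(self.cites)
            for dst in sorted(self.cites[src])
        ]
        return json.dumps({"nodes": nodes, "edges": edges}, indent=2)


# Placeholder papers, not real arXiv entries.
graph = CitationGraph()
graph.add_paper("example-a", "Placeholder paper A")
graph.add_paper("example-b", "Placeholder paper B")
graph.add_citation("example-b", "example-a")  # B cites A
print(graph.to_json())
```

Keeping the edges as plain `source`/`target` pairs makes the file easy to reload in another project, which is the reuse the pitch describes.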
Setup Directions
Step 1 — Install
mrt install "Arxiv-Collector.AutoGen"
Step 2 — Configure Your Research Interests
{
  "categories": ["cs.AI", "cs.CL", "cs.LG"],
  "keywords": ["reasoning", "chain-of-thought", "planning", "agentic"],
  "max_papers_per_run": 20,
  "output_format": "markdown",
  "rag_chunk_size": 512
}
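
To make the config concrete, here is a minimal sketch of how `categories`, `keywords`, and `max_papers_per_run` could map onto a query against the public arXiv API. The mapping itself is an assumption (the skill may build its queries differently), and `build_query_url` is a hypothetical helper. The endpoint, the `cat:` and `all:` field prefixes, and the `sortBy`/`sortOrder` parameters come from the arXiv API documentation.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

config = {
    "categories": ["cs.AI", "cs.CL", "cs.LG"],
    "keywords": ["reasoning", "chain-of-thought", "planning", "agentic"],
    "max_papers_per_run": 20,
}


def build_query_url(cfg):
    """Build an arXiv API URL: (any category) AND (any keyword), newest first."""
    cats = " OR ".join(f"cat:{c}" for c in cfg["categories"])
    # Quote each keyword so hyphenated or multi-word terms are treated as phrases.
    kws = " OR ".join(f'all:"{k}"' for k in cfg["keywords"])
    params = {
        "search_query": f"({cats}) AND ({kws})",
        "start": 0,
        "max_results": cfg["max_papers_per_run"],
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_query_url(config))
```

The response is an Atom feed, so a real fetch step would still need an XML or feed parser to pull out titles, abstracts, and PDF links.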
Step 3 — Run the Pipeline
claude --blueprint arxiv-collector --mode daily-digest --output ./research-base
The Exact Prompt for a Deep-Dive Research Session
Run arXiv paper discovery for the last 7 days in cs.CL and cs.AI.
Focus on: reasoning agents, planning, chain-of-thought prompting.
For each paper: download PDF, generate 500-word summary, extract
key claims and limitations, build citation edges to prior work.
Output a reading queue ranked by citation count and relevance score.
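
The prompt asks for a queue ranked by citation count and relevance score but does not say how the two are combined. One plausible reading is a weighted blend, sketched below. The `Paper` fields, the keyword-overlap relevance measure, and the `citation_weight` default are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    abstract: str
    citation_count: int


def relevance(paper, keywords):
    """Fraction of keywords found in the title or abstract (case-insensitive)."""
    text = f"{paper.title} {paper.abstract}".lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords) if keywords else 0.0


def rank_reading_queue(papers, keywords, citation_weight=0.5):
    """Sort papers by a weighted blend of normalized citations and relevance."""
    # Normalize citations by the largest count; fall back to 1 to avoid dividing by zero.
    max_cites = max((p.citation_count for p in papers), default=0) or 1

    def score(p):
        cites = p.citation_count / max_cites
        return citation_weight * cites + (1 - citation_weight) * relevance(p, keywords)

    return sorted(papers, key=score, reverse=True)


# Placeholder papers, not real arXiv entries.
papers = [
    Paper("Placeholder A", "A survey of planning methods.", citation_count=40),
    Paper("Placeholder B", "Chain-of-thought reasoning for planning agents.", citation_count=10),
    Paper("Placeholder C", "Image compression benchmarks.", citation_count=5),
]
keywords = ["reasoning", "chain-of-thought", "planning"]
for p in rank_reading_queue(papers, keywords):
    print(p.title)
```

Lowering `citation_weight` favors topical fit over popularity, which may suit new papers that have not yet accumulated citations.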
Pros & Cons
| Pros | Cons |
|---|---|
| Automated discovery: never miss a relevant paper again | arXiv API rate limits apply (arXiv asks for at most one request every 3 seconds) |
| The multi-agent pipeline processes papers in parallel, not sequentially | PDF download failures require retry logic |
| RAG-ready output means instant ingestion into your knowledge base | Keyword matching can miss conceptually relevant papers |
| Citation graph is reusable across research projects | Requires significant storage for the PDF cache |
| TIER 3: a serious pipeline, not a wrapper around one API call | May surface irrelevant papers from keyword noise |
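
Two of the cons above, rate limits and failed PDF downloads, are usually handled together with a pause between requests plus retries with exponential backoff. The following is a minimal sketch of that pattern, not the skill's own downloader. `fetch_with_retry` and `download_all` are hypothetical helpers, and the 3-second default follows arXiv's API terms of use, which ask for no more than one request every three seconds.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, sleep=time.sleep, max_attempts=4, base_delay=3.0):
    """Fetch url, retrying failed attempts with exponential backoff.

    `fetch` and `sleep` are injectable so the logic can be exercised offline.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=30) as resp:
                return resp.read()

    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


def download_all(urls, fetch=None, sleep=time.sleep, delay=3.0):
    """Download URLs one at a time, pausing `delay` seconds between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)
        results[url] = fetch_with_retry(url, fetch=fetch, sleep=sleep)
    return results
```

Passing a stub for `fetch` and a list's `append` for `sleep` lets you check the backoff schedule without touching the network.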
Verdict
Research is a recurring time sink: you either spend hours hunting for papers by hand or rely on something like arXiv's email alerts, which offer little customization beyond subject category. `Arxiv-Collector.AutoGen` offers a middle path that is automated, configurable, and structured enough that its output feeds directly into your AI systems. The RAG-ready format is the key. You're not just reading papers, you're building a knowledge base that makes your agents smarter.
**Best for:** AI researchers, ML engineers, and PhD students who need to stay current on fast-moving literature without the manual overhead.
**Alternative:** For manual paper discovery with human curation, use arXiv's RSS feeds or Connected Papers. For literature beyond arXiv, including paywalled journals and conference proceedings, look at integrating the Semantic Scholar API instead.
*TIER 3 skill. Available at [mr.technology/registry](/registry).*