Web Crawler: A Minimal, Real-Time Web Search CLI for JSON Results
You type a query; a few seconds later you get clean JSON — `title`, `url`, `published_date` — already sorted by recency. No pop-ups, no captchas in your face, no copy-pasting from a browser. Web Crawler is a tiny command-line utility that turns the living web into data you can pipe, parse, and feed into downstream tools (LLMs, dashboards, cron jobs). The goal is simple: one command → fresh, structured results you can trust in scripts and automations.
What It Is
Web Crawler is a minimal, real-time internet searcher that outputs newline-delimited JSON (NDJSON) or JSON arrays with the same small schema for every hit:
```json
{
  "title": "Example Headline",
  "url": "https://example.com/article",
  "published_date": "2025-08-26T11:12:13Z"
}
```
Because the output is standardized and time-sorted, you can drop it into any UNIX pipeline or application with zero glue. Think of it as “curl for the news & web,” designed for speed, determinism, and composability.
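Consuming that output takes nothing beyond the standard library. A minimal sketch in Python, assuming you saved an array to `results.json` and a stream to `results.ndjson` (the file names are illustrative, not something the tool creates for you):
```python
import json
from pathlib import Path

# JSON array form: one parse call, already sorted newest-first.
results = json.loads(Path("results.json").read_text())
for hit in results[:5]:
    print(hit["published_date"], hit["title"], hit["url"])

# NDJSON form: one object per line, handy for streaming pipelines.
with open("results.ndjson") as fh:
    for line in fh:
        if line.strip():
            hit = json.loads(line)
            print(hit["published_date"], hit["title"], hit["url"])
```
Either loop yields the same three fields, which is the whole point of the fixed schema.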
Why a Minimal Search CLI Matters
- Automation-first: Browsers are for reading; CLIs are for doing. A simple CLI lets you schedule, diff, and parse results without friction.
- Deterministic schema: Same three fields, always. That makes it safe to script (no brittle scraping logic).
- Recency by default: Many workflows (monitoring, alerts, market and security research) care about “what changed today.” Sorting by published time makes that trivial.
- LLM-ready: Clean JSON is exactly what agents and RAG pipelines want, no HTML cleanup step required.
How It Works (The 5-Stage Pipeline)
- Query Normalization — trims, de-dupes, and attaches optional time windows or domain filters.
- Fetchers — lightweight connectors query search endpoints and news feeds in parallel.
- Parsers — extract `title`, `url`, and `published_date` using robust rules; normalize timestamps to UTC ISO-8601.
- Deduper — collapses duplicates by canonical URL and fuzzy title matching.
- Sorter + Emitter — sorts by `published_date` (newest first) and emits JSON/NDJSON.
This architecture keeps the tool fast, predictable, and tiny.
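To make the stages concrete, here is a minimal end-to-end sketch. It is not the project's source; every helper name (`normalize_query`, `fetch_all`, `parse`, `dedupe`, `run`) is a hypothetical stand-in, and the fetch stage returns canned records so the example runs offline:
```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Hit:
    title: str
    url: str
    published_date: str  # UTC ISO-8601


def normalize_query(query: str) -> str:
    # Stage 1: trim whitespace and drop duplicate terms, preserving order.
    terms, seen = [], set()
    for term in query.split():
        if term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return " ".join(terms)


def fetch_all(query: str) -> list[dict]:
    # Stage 2: the real tool queries search endpoints and feeds in parallel.
    # Canned records keep this sketch runnable without network access.
    return [
        {"title": "Example A", "link": "https://example.com/a?utm_source=x",
         "published": "2025-08-26T11:12:13Z"},
        {"title": "Example A", "link": "https://example.com/a",
         "published": "2025-08-26T11:12:13Z"},
        {"title": "Example B", "link": "https://example.com/b",
         "published": "2025-08-25T09:00:00Z"},
    ]


def parse(raw: dict) -> Hit:
    # Stage 3: map source-specific fields onto the fixed schema and
    # normalize the timestamp to UTC ISO-8601.
    ts = datetime.fromisoformat(raw["published"].replace("Z", "+00:00"))
    return Hit(
        title=raw["title"].strip(),
        url=raw["link"],
        published_date=ts.astimezone(timezone.utc)
                         .isoformat().replace("+00:00", "Z"),
    )


def canonical(url: str) -> str:
    # Stage 4 helper: strip query strings and fragments so tracking
    # parameters don't create "different" URLs.
    scheme, netloc, path, _, _ = urlsplit(url)
    return urlunsplit((scheme, netloc.lower(), path.rstrip("/"), "", ""))


def dedupe(hits: list[Hit]) -> list[Hit]:
    unique, seen = [], set()
    for hit in hits:
        key = canonical(hit.url)
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


def run(query: str) -> list[Hit]:
    # Stage 5: sort newest-first and hand the list to the emitter.
    hits = [parse(raw) for raw in fetch_all(normalize_query(query))]
    return sorted(dedupe(hits), key=lambda h: h.published_date, reverse=True)


if __name__ == "__main__":
    for hit in run("solar  flare flare forecast"):
        print(hit)
```
The sketch dedupes on canonical URLs only; the description above also mentions fuzzy title matching, which would slot into `dedupe`.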
Core Features (and why you’ll care)
- Real-time recency ranking — See the newest items first. Perfect for alerts, changelogs, and “what’s new” tabs.
- Small, fixed schema — Every result has `title`, `url`, and `published_date`. Many tools fail because the schema drifts; this one doesn’t.
- Filters that map to real use — `--domains`, `--exclude-domains`, `--since`, `--until`, `--max-results` — the basics you actually need.
- Concurrency controls — Tune `--concurrency` for your environment. Faster on a CI runner? Dial it up. Rate-limited network? Dial it down.
- Respectful fetching — Built to be polite: backoff, timeouts, and an option to honor `robots.txt`. You get speed without behaving like a botnet (see the sketch after this list).
- Cache (optional) — Cache recent queries to avoid re-hitting endpoints — useful for iterative development and CI runs.
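As a rough picture of what “respectful fetching” means in practice, the sketch below shows backoff, timeouts, and a `robots.txt` check using only the Python standard library. It is illustrative, not the tool's implementation; the user-agent string and function names are invented:
```python
import time
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "web-crawler-example/0.1"  # hypothetical identifier for the sketch


def allowed_by_robots(url: str) -> bool:
    # Honoring robots.txt: fetch the host's robots.txt and ask it.
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        return True  # robots.txt unreachable; fall back to permissive
    return rp.can_fetch(USER_AGENT, url)


def polite_get(url: str, timeout: float = 10.0, retries: int = 3) -> bytes:
    # Timeouts bound every request; exponential backoff spaces out retries.
    if not allowed_by_robots(url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    delay = 1.0
    for attempt in range(retries):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```
Tuning `--timeout` and `--concurrency` on the CLI adjusts the same knobs from the outside.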
Output You Can Trust
JSON array (default)
web-crawler "solar flare forecast" --json > results.jsonNDJSON (streaming; great for jq, grep, and logs)
web-crawler "solar flare forecast" --ndjson | jq -r '.url'Top N, newest first
web-crawler "llm evaluation paper" --max-results 20 | jq '.[0:5]'Last 24 hours only
web-crawler "openssl vulnerability" --since 24hRestrict to a domain set
web-crawler "fiscal policy" --domains ft.com,economist.com,wsj.comQuick Start (Local)
From source (Python workflow)
If the project ships a `requirements.txt` or `pyproject.toml`, this is the typical path.
```bash
# 1) Get the source
git clone <repo> web-crawler
cd web-crawler

# 2) Create an isolated env
python -m venv .venv
source .venv/bin/activate

# 3) Install deps
pip install -r requirements.txt
# or: pip install .

# 4) Run
python -m web_crawler "quantum computing breakthrough" --max-results 30
```
From source (Node.js workflow)
If the project uses a `package.json`, use this path.
```bash
git clone <repo> web-crawler
cd web-crawler
npm install
npm run build   # if present
node cli.js "startup funding rounds" --ndjson
```
(The repository’s README will show the exact language/dependency set; use the matching block above.)
Containerized Run (One-liner)
```bash
# Build image (once)
docker build -t web-crawler:latest .

# Run a query
docker run --rm web-crawler:latest \
  web-crawler "cybersecurity breach report" --since 48h --max-results 50
```
Why containers? Reproducible environments in CI/CD, no local Python/Node juggling, and easy to pin a version.
Configuration & Flags You’ll Use Often
- `--since 24h|7d|2025-08-01` — time window filters; everything is normalized to UTC.
- `--until <timestamp>` — upper bound for historical slices.
- `--max-results <int>` — cap results to keep payloads small.
- `--domains a.com,b.org` / `--exclude-domains …` — scope or prune results.
- `--concurrency <int>` — parallel requests; defaults are conservative.
- `--timeout <sec>` — per-request timeouts.
- `--ndjson` — newline-delimited streaming output.
- `--no-cache` — force fresh fetching during critical runs.
- `--pretty` — pretty-printed JSON for humans.
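The same flags compose when the CLI is driven from a script rather than a shell. A small sketch, assuming `web-crawler` is on your `PATH` and `--json` emits the array form described above (the query and domain values are just examples):
```python
import json
import subprocess

# Run a scoped, time-bounded query and parse the JSON array it prints.
cmd = [
    "web-crawler", "openssl vulnerability",
    "--json",
    "--since", "7d",
    "--domains", "openssl.org,nvd.nist.gov",
    "--max-results", "25",
    "--timeout", "10",
]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
for hit in json.loads(proc.stdout):
    print(hit["published_date"], hit["url"])
```
From here it is one small step to a scheduled report or an alerting rule.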
Cookbook (Copy-Paste These)
Pipe URLs into a downloader
web-crawler "nextjs tutorial 2025" --ndjson \
| jq -r '.url' \
| xargs -n1 -P5 curl -OSend fresh links to Slack
web-crawler "postgres performance" --since 24h --ndjson \
| jq -r '"New: \(.title) — \(.url)"' \
| while read line; do
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"$line\"}" $SLACK_WEBHOOK_URL
doneFeed an LLM
web-crawler "carbon capture breakthrough 2025" --max-results 15 \
| python summarize.py # your script that composes a short briefDaily cron
0 7 * * * /usr/local/bin/web-crawler "ai governance regulation" --since 24h \
--max-results 40 > /var/data/briefs/ai_governance_$(date +\%F).jsonDeploying for Teams
Docker Compose (shared utility)
```yaml
services:
  crawler:
    image: web-crawler:latest
    command: ["web-crawler", "economic calendar", "--since", "48h", "--ndjson"]
    environment:
      - CRAWLER_CACHE_DIR=/cache
    volumes:
      - ./cache:/cache
```
Systemd service (single host)
```ini
[Unit]
Description=Web Crawler service

[Service]
ExecStart=/usr/local/bin/web-crawler "market microstructure" --since 12h --ndjson
Restart=on-failure
WorkingDirectory=/var/web-crawler

[Install]
WantedBy=multi-user.target
```
Best Practices (So It Stays Fast and Friendly)
- Be polite: respect backoff messages; don’t hammer the same query in tight loops.
- Use caching in dev: saves bandwidth and speeds iteration.
- Normalize times: everything is UTC ISO-8601 — keep it that way further down your pipeline (see the snippet after this list).
- Keep queries specific: better signal, fewer irrelevant hits, faster pipelines.
- Log your runs: NDJSON pairs beautifully with log aggregators and time-series stores.
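For the “normalize times” point: when post-processing results in Python, parse `published_date` into an aware UTC datetime rather than a naive local one, so comparisons and age calculations stay consistent. A small illustrative snippet, not part of the tool:
```python
from datetime import datetime, timezone

# Parse the tool's UTC ISO-8601 timestamps into timezone-aware datetimes.
raw = "2025-08-26T11:12:13Z"
ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))

assert ts.tzinfo is not None           # aware, not naive
age = datetime.now(timezone.utc) - ts  # safe arithmetic, both sides in UTC
print(f"{raw} is {age.total_seconds() / 3600:.1f} hours old")
```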
Troubleshooting
- Empty `published_date`: Some sources lack timestamps. Enable fallback heuristics or ignore undated results in your script (`select(.published_date != null)` in `jq`).
- Inconsistent duplicates: Turn on canonicalization or increase the dedupe threshold.
- Slow runs: Reduce `--max-results`, increase `--concurrency`, or narrow with `--since`.
- Blocked networks: Check corporate proxy settings and set `HTTP_PROXY`/`HTTPS_PROXY` if needed.
Conclusion
Web Crawler gives you a no-nonsense, automation-ready way to ask the web a question and get structured, time-sorted data back. It’s tiny, composable, and perfect for LLM ingestion, monitoring, and research. The immediate goal: run a single command today, capture fresh results as JSON, and wire it into one small workflow (a daily brief, a Slack notifier, or a research script). From there, let your imagination — and your pipelines — grow.
