
Web Crawler: A Minimal, Real-Time Web Search CLI for JSON Results

Aug 26, 2025


You type a query; a few seconds later you get clean JSON (title, url, published_date), already sorted by recency. No pop-ups, no captchas in your face, no copy-pasting from a browser. Web Crawler is a tiny command-line utility that turns the living web into data you can pipe, parse, and feed into downstream tools (LLMs, dashboards, cron jobs). The goal is simple: one command → fresh, structured results you can trust in scripts and automations.

What It Is

Web Crawler is a minimal, real-time internet searcher that outputs newline-delimited JSON (NDJSON) or JSON arrays with the same small schema for every hit:

{
  "title": "Example Headline",
  "url": "https://example.com/article",
  "published_date": "2025-08-26T11:12:13Z"
}

Because the output is standardized and time-sorted, you can drop it into any UNIX pipeline or application with zero glue. Think of it as “curl for the news & web,” designed for speed, determinism, and composability.

Why a Minimal Search CLI Matters

  • Automation-first: Browsers are for reading; CLIs are for doing. A simple CLI lets you schedule, diff, and parse results without friction.
  • Deterministic schema: Same three fields, always. That makes it safe to script (no brittle scraping logic).
  • Recency by default: Many workflows (monitoring, alerts, market and security research) care about “what changed today.” Sorting by published time makes that trivial.
  • LLM-ready: Clean JSON is exactly what agents and RAG pipelines want, no HTML cleanup step required.

How It Works (The 5-Stage Pipeline)

  1. Query Normalization — trims, de-dupes, and attaches optional time windows or domain filters.
  2. Fetchers — lightweight connectors query search endpoints and news feeds in parallel.
  3. Parsers — extract title, url, and published_date using robust rules; normalize timestamps to UTC ISO-8601.
  4. Deduper — collapses duplicates by canonical URL and fuzzy title matching.
  5. Sorter + Emitter — sorts by published_date (newest first) and emits JSON/NDJSON.

This architecture keeps the tool fast, predictable, and tiny.
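
If you want to see what the last two stages do to the data, the same dedupe-and-sort semantics can be approximated with jq over the tool's NDJSON output. This is a conceptual sketch of stages 4 and 5 (the query string is a placeholder), not the crawler's internal code:

# collapse duplicate URLs, then re-emit newest first
web-crawler "example query" --ndjson \
| jq -s 'unique_by(.url) | sort_by(.published_date) | reverse | .[]'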

Core Features (and why you’ll care)

  • Real-time recency ranking
    See the newest items first. Perfect for alerts, changelogs, and “what’s new” tabs.
  • Small, fixed schema
    Every result has title, url, and published_date. Many tools fail because the schema drifts; this one doesn’t (a defensive check is sketched just after this list).
  • Filters that map to real use
    --domains, --exclude-domains, --since, --until, --max-results—the basics you actually need.
  • Concurrency controls
    Tune --concurrency for your environment. Faster on a CI runner? Dial it up. Rate-limited network? Dial it down.
  • Respectful fetching
    Built to be polite: backoff, timeouts, and an option to honor robots.txt. You get speed without behaving like a botnet.
  • Cache (optional)
    Cache recent queries to avoid re-hitting endpoints — useful for iterative development and CI runs.
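
As a quick illustration of why the fixed schema matters, you can defensively keep only records that carry all three fields before scripting against them (the query string is just a placeholder):

# keep only well-formed records: title, url, and published_date all present
web-crawler "postgres performance" --ndjson \
| jq -c 'select(.title and .url and .published_date)'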

Output You Can Trust

JSON array (default)

web-crawler "solar flare forecast" --json > results.json

NDJSON (streaming; great for jq, grep, and logs)

web-crawler "solar flare forecast" --ndjson | jq -r '.url'

Top N, newest first

web-crawler "llm evaluation paper" --max-results 20 | jq '.[0:5]'

Last 24 hours only

web-crawler "openssl vulnerability" --since 24h

Restrict to a domain set

web-crawler "fiscal policy" --domains ft.com,economist.com,wsj.com

Quick Start (Local)

From source (Python workflow)

If the project ships a requirements.txt or pyproject.toml, this is the typical path.

# 1) Get the source
git clone <repo> web-crawler
cd web-crawler
# 2) Create an isolated env
python -m venv .venv
source .venv/bin/activate
# 3) Install deps
pip install -r requirements.txt
# or: pip install .
# 4) Run
python -m web_crawler "quantum computing breakthrough" --max-results 30

From source (Node.js workflow)

If the project uses a package.json, use this path.

git clone <repo> web-crawler
cd web-crawler
npm install
npm run build # if present
node cli.js "startup funding rounds" --ndjson

(The repository’s README will show the exact language/dependency set; use the matching block above.)

Containerized Run (One-liner)

# Build image (once)
docker build -t web-crawler:latest .
# Run a query
docker run --rm web-crawler:latest \
web-crawler "cybersecurity breach report" --since 48h --max-results 50

Why containers? Reproducible environments in CI/CD, no local Python/Node juggling, and easy to pin a version.
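
If you want the optional cache to survive between container runs, mount a host directory and point the tool at it. The CRAWLER_CACHE_DIR variable is borrowed from the Compose example further down; treating it as the cache setting here is an assumption:

docker run --rm \
  -v "$PWD/cache:/cache" -e CRAWLER_CACHE_DIR=/cache \
  web-crawler:latest \
  web-crawler "cybersecurity breach report" --since 48h --max-results 50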

Configuration & Flags You’ll Use Often

  • --since 24h|7d|2025-08-01 – time window filters; everything is normalized to UTC.
  • --until <timestamp> – upper bound for historical slices.
  • --max-results <int> – cap results to keep payloads small.
  • --domains a.com,b.org / --exclude-domains … – scope or prune results.
  • --concurrency <int> – parallel requests; defaults are conservative.
  • --timeout <sec> – per-request timeouts.
  • --ndjson – newline-delimited streaming output.
  • --no-cache – force fresh fetching during critical runs.
  • --pretty – pretty-printed JSON for humans.
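
Most of these compose naturally. A typical scoped, time-boxed run looks like the following (query, domains, and values are purely illustrative):

web-crawler "kubernetes CVE" \
  --since 7d --until 2025-08-26 \
  --domains nvd.nist.gov,github.com \
  --max-results 25 --concurrency 4 --timeout 10 \
  --no-cache --ndjson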

Cookbook (Copy-Paste These)

Pipe URLs into a downloader

web-crawler "nextjs tutorial 2025" --ndjson \
| jq -r '.url' \
| xargs -n1 -P5 curl -O

Send fresh links to Slack

web-crawler "postgres performance" --since 24h --ndjson \
| jq -r '"New: \(.title) — \(.url)"' \
| while read line; do
curl -X POST -H 'Content-type: application/json' \
--data "{\"text\":\"$line\"}" $SLACK_WEBHOOK_URL
done

Feed an LLM

web-crawler "carbon capture breakthrough 2025" --max-results 15 \
| python summarize.py # your script that composes a short brief

Daily cron

0 7 * * * /usr/local/bin/web-crawler "ai governance regulation" --since 24h --max-results 40 > /var/data/briefs/ai_governance_$(date +\%F).json

Deploying for Teams

Docker Compose (shared utility)

services:
  crawler:
    image: web-crawler:latest
    command: ["web-crawler", "economic calendar", "--since", "48h", "--ndjson"]
    environment:
      - CRAWLER_CACHE_DIR=/cache
    volumes:
      - ./cache:/cache

Systemd service (single host)

[Unit]
Description=Web Crawler service
[Service]
ExecStart=/usr/local/bin/web-crawler "market microstructure" --since 12h --ndjson
Restart=on-failure
WorkingDirectory=/var/web-crawler
[Install]
WantedBy=multi-user.target
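
Assuming the unit file is saved as /etc/systemd/system/web-crawler.service (adjust the path to your layout), reload systemd and start it with:

sudo systemctl daemon-reload
sudo systemctl enable --now web-crawler.service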

Best Practices (So It Stays Fast and Friendly)

  • Be polite: respect backoff messages; don’t hammer the same query in tight loops.
  • Use caching in dev: saves bandwidth and speeds iteration.
  • Normalize times: everything is UTC ISO-8601; keep it that way further down your pipeline.
  • Keep queries specific: better signal, fewer irrelevant hits, faster pipelines.
  • Log your runs: NDJSON pairs beautifully with log aggregators and time-series stores.
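
A minimal pattern for the last point: capture each run in a dated NDJSON log while still printing to stdout (the logs/ path is just an example):

web-crawler "ai governance regulation" --since 24h --ndjson \
| tee -a "logs/crawler_$(date +%F).ndjson"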

Troubleshooting

  • Empty published_date: Some sources lack timestamps. Enable fallback heuristics or ignore undated results in your script (select(.published_date!=null) in jq).
  • Inconsistent duplicates: Turn on canonicalization or increase the dedupe threshold.
  • Slow runs: Reduce --max-results, increase --concurrency, or narrow with --since.
  • Blocked networks: Check corporate proxy settings and set HTTP_PROXY/HTTPS_PROXY if needed.
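
If you are behind a corporate proxy, exporting the standard variables before the run is usually enough (the proxy host below is hypothetical):

export HTTP_PROXY=http://proxy.internal:8080
export HTTPS_PROXY=http://proxy.internal:8080
web-crawler "tls certificate expiry" --since 24h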

Conclusion

Web Crawler gives you a no-nonsense, automation-ready way to ask the web a question and get structured, time-sorted data back. It’s tiny, composable, and perfect for LLM ingestion, monitoring, and research. The immediate goal: run a single command today, capture fresh results as JSON, and wire it into one small workflow (a daily brief, a Slack notifier, or a research script). From there, let your imagination — and your pipelines — grow.

Written by Javier Calderon Jr

CTO, tech entrepreneur, and mad scientist with a passion for innovating solutions, specializing in Web3, Artificial Intelligence, and Cyber Security.
