Cariddi: Point It at Domains, Pull Out the Good Stuff
When you’re hunting for attack surface or doing housekeeping on your own web estate, two questions matter most: what’s out there, and what secrets did we accidentally ship? Cariddi answers both. Feed it a list of domains and it will crawl, collect URLs, and flag anything interesting — API endpoints, tokens and API keys, embedded secrets, juicy file extensions, and more — in real time. It’s fast, scriptable, and designed for the way security engineers and red teams actually work.
This guide explains what Cariddi does, why each capability matters, and exactly how to install, run, and automate it — responsibly.
Use responsibly. Only test assets you own or have explicit permission to assess. Always follow applicable laws and your organization’s policies.
What Cariddi Is (and why you’ll love it)
1) A focused crawler for security work
Cariddi crawls from one or more starting domains and harvests URLs from HTML, JavaScript, sitemaps, robots.txt, and in-page links. It’s tuned for breadth with control: you decide concurrency, timeouts, and delays, which URL patterns to ignore, and whether the crawl may extend to subdomains.
Why this matters: Generic web crawlers drown you in noise. Cariddi is tuned to surface things humans actually triage — endpoints, forms, and code-y bits — fast.
2) A secret/endpoint detector out of the box
As it crawls, Cariddi applies curated pattern detectors (regex + heuristics) to page content, scripts, and URLs to spot things like:
- API keys & tokens (common cloud providers and services)
- Auth artifacts (JWTs, bearer tokens)
- Credentials & secrets embedded in code or config
- Endpoints (REST, GraphQL, RPC) and interesting file types (e.g., `.bak`, `.old`, `.env`, `.zip`, `.sql`)
Why this matters: Secrets in client assets or public buckets remain a top cause of compromise. Catching them at crawl time collapses discovery → validation into one pass.
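For a feel of what those detectors do, here is a toy sketch using two well-known shapes (an AWS access key ID prefix and the three-part JWT format). This is an illustration only, nothing like the full curated set cariddi ships:

```
# Toy detector: flag AWS-style access key IDs and JWT-shaped strings in
# one fetched script. Real detectors pair many such regexes with heuristics.
curl -s https://example.com/app.js \
  | grep -oE 'AKIA[0-9A-Z]{16}|eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+'
```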
3) Filters that map to real workflows
Cariddi lets you ignore hosts, paths, and extensions by substring match; control delay, rate, and parallelism; and choose output formats (plain text, HTML report, or line-delimited JSON). You can also feed it custom pattern files to extend endpoint and secret hunting while it crawls.
Why this matters: Good tools bend to the environment you’re scanning — noisy SaaS pages, single-page apps, marketing CDNs — with just enough control to stay fast and relevant.
4) Built for pipelines and teams
Input can come from files, stdin, or other tools; output can stream to files, terminals, or log collectors. It’s easy to slot between tools: subdomain discovery → Cariddi → param miner/Burp Suite, or Cariddi → secrets scanner → ticketing.
Why this matters: The best scanner is the one that slots into your existing muscle memory without ceremony.
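For instance, a common recon chain (assuming subfinder and httpx, but any stdin-friendly tools work):

```
# Subdomain discovery → liveness probing → cariddi hunting secrets (-s)
# and juicy endpoints (-e), streamed as line-delimited JSON
subfinder -d example.com -silent \
  | httpx -silent \
  | cariddi -s -e -json > findings.json
```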
How It Works (under the hood)
- Seed: Read domains from a file/stdin.
- Fetch: Pull pages, robots.txt, sitemaps, and linked resources with a tuned HTTP client (timeouts, retries, proxy support).
- Parse: Extract in-page links, script src, inline JS strings, forms, and obvious endpoints.
- Detect: Apply pattern detectors for secrets, tokens, and suspicious extensions.
- Emit: Stream URLs and findings as they appear (text or JSON), deduplicated on the fly.
- Respect: Tunable delays, concurrency limits, and host scoping to play nicely.
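To make the Fetch and Parse stages concrete, here is roughly what happens for a single page, sketched by hand with curl and grep (cariddi’s parser is HTML-aware and does far more; this is just the idea):

```
# Fetch one page, then pull out href/src targets: the raw material the
# detectors run over. Cariddi repeats this concurrently per discovered URL.
curl -s https://example.com \
  | grep -oE '(href|src)="[^"]+"' \
  | cut -d'"' -f2 | sort -u
```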
Install & Run
Option A — Build from source (Go)
Cariddi is written in Go, so building a static binary is straightforward.
```
# 1) Get the source
git clone https://github.com/edoardottt/cariddi
cd cariddi

# 2) Build
go build -o cariddi ./cmd/cariddi

# 3) Put it on your PATH (optional)
mv cariddi /usr/local/bin/
```
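If you’d rather not clone at all, recent releases install as a Go module; assuming the v2 module path the project currently uses:

```
go install -v github.com/edoardottt/cariddi/v2/cmd/cariddi@latest
```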
Option B — Docker (repeatable environment)

```
# Build an image locally
docker build -t cariddi .

# Feed a domains file via stdin and capture findings
cat domains.txt | docker run --rm -i cariddi -s -e -json > found.json
```

Assuming the image’s entrypoint is the cariddi binary, flags pass straight through. Tip: If you prefer not to build, use any container registry image your team trusts; the flags are the same.
Quick Start Recipes
1) Crawl a list of domains and print all discovered URLs
```
cat domains.txt | cariddi
```

domains.txt contains one domain per line:

```
example.com
shop.example.com
api.example.net
```
2) Save URLs and security findings as JSON
```
cat domains.txt | cariddi -s -e -json > results.json
```

- Each line is one JSON object with the URL, HTTP metadata (method, status code, content type), and a matches object covering secrets, parameters, errors, and infos; see the README for the exact schema.
3) Limit scope and go faster
```
cat domains.txt | cariddi \
  -i jpg,png,svg,woff,woff2 \
  -c 20 \
  -t 8
```

- `-i` skips URLs containing the listed strings, `-c` sets concurrency, and `-t` the timeout in seconds. Cariddi scopes itself to each target host by default; add `-intensive` to widen the crawl to `*.target` subdomains. Keeps crawls tight, ignores asset noise, and speeds up discovery.
4) Hunt for interesting file extensions
```
cat domains.txt | cariddi -ext 2
```

- `-ext` takes a juiciness level from 1 (juiciest) to 7; low levels surface likely mistakes (backups, dumps, environment files: think `.bak`, `.old`, `.env`, `.zip`, `.sql`, `.log`).
5) Secrets-first mode (cut the chatter)
```
cat domains.txt | cariddi -s -json > secrets.json
```

- `-s` turns on secrets hunting; with `-json`, output streams one JSON object per line — perfect for piping into `jq` or a SIEM. Add `-plain` if you want results without the banner.
6) Extend the hunt with custom pattern files

```
cat domains.txt | cariddi -e -ef endpoints.txt
```

- `-e` hunts for juicy endpoints; `-ef` adds your own keywords, one per line (e.g. api, graphql, admin, config, backup), to the built-in set. A matching `-sf` flag takes custom secret regexes.
7) Route through a proxy (lab or corp network)
```
cat domains.txt | cariddi -proxy http://127.0.0.1:8080
```

Interpreting Results (triage without tears)
Green flags (actionable):
- A token with recognizable structure (e.g., JWT with valid header & payload).
- Cloud keys (format matches provider + checksum) in inline JS or HTML.
- Endpoints returning 200/401/403 with auth hints or schema descriptions.
- Exposed backups (`.zip`, `.sql`, `.bak`) with realistic sizes.
Yellow flags (verify):
- Strings that look like tokens but are short/highly random without context.
- Parameters that imply auth or signing without a visible secret.
- Client-side keys that are intended to be public (e.g., analytics). Label them and move on.
Always validate: Attempt benign requests (HEAD/GET), check CORS, and confirm scope before escalating. For secrets, verify whether they actually grant access (use controlled environments).
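For example, a HEAD request is usually enough to confirm an exposed artifact exists and gauge its size without downloading it (the URL here is a stand-in for your actual finding):

```
# Benign existence/size check: response headers only, no body download
curl -sI https://example.com/backup.zip \
  | grep -iE '^(HTTP|content-length|content-type)'
```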
Automating Cariddi in Your Pipeline
Daily job (cron)
```
0 3 * * * cat /opt/domains.txt | cariddi -s -e -json > /var/log/cariddi/$(date +\%F).json
```

With jq for quick views
Assuming the line-delimited schema above (a top-level `url` plus a `matches` object):

```
# Show endpoints discovered today
jq -r '.url' 2025-08-28.json | sort -u

# List only URLs where a secrets detector fired
jq -r 'select(.matches.secrets | length > 0) | .url' results.json
```

Send alerts to Slack (example)
```
cat domains.txt | cariddi -s -json \
  | jq -r 'select(.matches.secrets | length > 0) | "*Cariddi*: secret detector hit at " + .url' \
  | while read -r line; do
      curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$line\"}" "$SLACK_WEBHOOK_URL"
    done
```

The alert deliberately carries only the URL; per the handling guidance below, keep matched secret values out of chat.

Deploying for Teams
Docker Compose (shared runner)
Cariddi reads targets from stdin, so wrap the pipeline in a shell inside the container (assuming the image ships `/bin/sh`):

```
services:
  cariddi:
    build: .
    volumes:
      - ./input:/input
      - ./out:/out
    entrypoint: /bin/sh
    command:
      - -c
      - >
        cat /input/domains.txt
        | cariddi -s -e -json -c 25 -i jpg,png,svg,woff,woff2
        > /out/fresh.json
```

CI job (guardrails on new PRs)
- Maintain a small allowlist of marketing/CDN hosts to skip.
- Run Cariddi against preview environments; surface only newly discovered endpoints/secrets as annotations.
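A minimal version of that “only new findings” gate, assuming a committed baseline.txt of previously seen URLs (both filenames are hypothetical):

```
# Diff fresh crawl output against the baseline; keep only new URLs
# (bash: uses process substitution to sort the baseline on the fly)
jq -r '.url' out/fresh.json | sort -u > new.txt
comm -13 <(sort -u baseline.txt) new.txt > newly-discovered.txt
```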
Performance & Reliability Tips
- Start scoped, then widen. Map the primary hosts with a default run; rerun with `-intensive` on interesting targets to pull in their subdomains.
- Filter assets early. Excluding images/fonts with `-i` can cut runtime dramatically.
- Tune concurrency to the target. Corporate sites may rate-limit aggressively; reduce `-c`, add a `-d` delay, and increase `-t`.
- Deduplicate on the fly. Use the tool’s own dedupe and avoid piping through slow external sort during live runs.
- Version & pin. Keep a known-good binary in CI; upgrade intentionally and record detector changes.
Safe & Ethical Use
- Scope: Confirm which domains and subdomains you’re allowed to test.
- Rate limits: Don’t DoS your own site — tune concurrency (`-c`) and add a delay (`-d`).
- Handling secrets: Never paste raw credentials in tickets or chat. Store as secrets in your vault and reference by ID.
- Disclosure: If you find a real exposure, follow your organization’s or the vendor’s responsible disclosure process.
Troubleshooting
- “Too much noise.” Add `-i` filters for asset and marketing URLs, and tighten your input list. Turn on `-s` with `-plain` during triage to see only findings.
- “Missed SPA routes.” Cariddi parses JavaScript for endpoints; widen the hunt with `-ef` and likely app keywords (api, graphql).
- “Time-outs.” Increase `-t`, reduce `-c`, and try a closer egress region.
- “False positives.” Output JSON with detector metadata; tune post-filters (`jq`/SIEM) to drop known analytics keys or dummy tokens.
- “Blocked by WAF.” Coordinate with the blue team; add your IPs to an allowlist for scheduled scans.
Conclusion
Cariddi compresses hours of manual clicking into a single, repeatable crawl that surfaces the things you actually care about: endpoints, secrets, and misconfigurations. Your goal today is simple: run a shallow scan against a permitted domain list, export JSON, and wire up a tiny triage script that flags new critical findings. Once that loop is in place, deepen the scans and schedule them. You’ll spend less time collecting hay — and more time removing needles before they poke you.
