Crawl
tavily-ai/skills
This crawl skill enables users to extract content from multiple web pages, making it ideal for gathering documentation, knowledge bases, and site-wide information. It supports focused and deep crawling with customizable depth, path filtering, and instructions for targeted data retrieval, suiting both agentic context feeding and full-page archiving. The skill offers OAuth or API key authentication and is designed for developers, researchers, and knowledge managers who need to systematically collect or analyze website content.
Crawl Skill
Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
Authentication
The script uses OAuth via the Tavily MCP server. No manual setup is required - on first run, it will:
- Check for existing tokens in ~/.mcp-auth/
- If none are found, automatically open your browser for OAuth authentication
Note: You must have an existing Tavily account. The OAuth flow only supports login — account creation is not available through this flow. Sign up at tavily.com first if you don't have an account.
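To check whether cached credentials already exist before running the script, you can inspect the token directory (a quick sketch; the layout inside ~/.mcp-auth/ is managed by the MCP client, so treat this as read-only inspection):
ls -la ~/.mcp-auth/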
Alternative: API Key
If you prefer using an API key, get one at https://tavily.com and add it to ~/.claude/settings.json:
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
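If you plan to run the raw curl examples below in a shell session, exporting the key directly also works (a minimal sketch; replace the placeholder with your real key):
export TAVILY_API_KEY="tvly-your-api-key-here"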
Quick Start
Using the Script
./scripts/crawl.sh '<json>' [output_dir]
Examples:
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'
# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs
# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'
# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
When output_dir is provided, each crawled page is saved as a separate markdown file.
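After a run with output_dir, you can sanity-check what was written (a sketch, reusing the ./docs example above):
ls ./docs/*.md | wc -l   # number of pages saved as markdown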
Basic Crawl
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
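To pull just the list of crawled URLs out of a response like the one above, piping through jq works (a sketch, assuming jq is installed; the .results[].url fields match the Response Format documented below):
# List the URLs that were crawled
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://docs.example.com", "max_depth": 1, "limit": 20}' |
  jq -r '.results[].url'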
Focused Crawl with Instructions
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
API Reference
Endpoint
POST https://api.tavily.com/crawl
Headers
| Header | Value |
| --- | --- |
| Authorization | Bearer <TAVILY_API_KEY> |
| Content-Type | application/json |
Request Body
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | Required | Root URL to begin crawling |
| max_depth | integer | 1 | Levels deep to crawl (1-5) |
| max_breadth | integer | 20 | Links per page |
| limit | integer | 50 | Total pages cap |
| instructions | string | null | Natural language guidance for focus |
| chunks_per_source | integer | 3 | Chunks per page (1-5, requires instructions) |
| extract_depth | string | "basic" | basic or advanced |
| format | string | "markdown" | markdown or text |
| select_paths | array | null | Regex patterns to include |
| exclude_paths | array | null | Regex patterns to exclude |
| allow_external | boolean | true | Include external domain links |
| timeout | float | 150 | Max wait (10-150 seconds) |
Response Format
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
Depth vs Performance
| Depth | Typical Pages | Time |
| --- | --- | --- |
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |
Start with max_depth=1 and increase only if needed.
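A cheap way to gauge cost before going deeper is to time a shallow run first (a sketch, using the bundled script and the parameters documented above):
time ./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 1, "limit": 20}'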
Crawl for Context vs Data Collection
For agentic use (feeding results into context): Always use instructions + chunks_per_source. This returns only relevant chunks instead of full pages, preventing context window explosion.
For data collection (saving to files): Omit chunks_per_source to get full page content.
Examples
For Context: Agentic Research (Recommended)
Use when feeding crawl results into an LLM context:
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
Returns only the most relevant chunks (max 500 chars each) per page - fits in context without overwhelming it.
For Context: Targeted Technical Docs
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
For Data Collection: Full Page Archive
Use when saving content to files for later processing:
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
Returns full page content - use the script with output_dir to save as markdown files.
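If you are not using the bundled script, a rough equivalent of its save-to-files behavior can be sketched with jq and bash (the slug-based file naming here is illustrative, not necessarily what crawl.sh does):
# Save each crawled page's raw_content as its own markdown file
mkdir -p pages
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://example.com/blog", "max_depth": 2, "select_paths": ["/blog/.*"]}' |
  jq -c '.results[]' |
  while read -r page; do
    # Build a filename from the URL (illustrative slugging)
    slug=$(printf '%s' "$page" | jq -r '.url' | sed 's|^https://||; s|^http://||; s|[^A-Za-z0-9._-]|_|g')
    printf '%s' "$page" | jq -r '.raw_content' > "pages/${slug}.md"
  done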
Map API (URL Discovery)
Use map instead of crawl when you only need URLs, not content:
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
Returns URLs only (faster than crawl):
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
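Because map returns a flat array of URL strings, it pairs well with standard shell tools for a quick look at site structure before a full crawl (a sketch, assuming jq is installed; the tally of first path segments is illustrative):
# Count pages per top-level path segment to see where content lives
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://docs.example.com", "max_depth": 2}' |
  jq -r '.results[]' |
  sed 's|^https://||; s|^http://||' |
  cut -d/ -f2 |
  sort | uniq -c | sort -rn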
Tips
- Always use chunks_per_source for agentic workflows - prevents context explosion when feeding results to LLMs
- Omit chunks_per_source only for data collection - when saving full pages to files
- Start conservative (max_depth=1, limit=20) and scale up
- Use path patterns to focus on relevant sections
- Use Map first to understand site structure before full crawl
- Always set a limit to prevent runaway crawls
Owner: tavily-ai
Links:
- Website: http://tavily.com
- Email: support@tavily.com