OctoReport Docs

Search Sources

What Are Search Sources

Search sources query search engine APIs for your keywords and automatically collect the results. They are suited to monitoring specific topics, tracking brand mentions, and discovering industry trends.

Core Advantages:

  • Actively discover content (rather than passive subscription)
  • Support complex keyword combinations
  • Optional detail page scraping

Supported Search Engines

1. Google Search 🔍

Features:

  • World's largest search engine with broadest coverage
  • Uses Google Custom Search API
  • Supports detail page scraping (requires additional configuration)

Use Cases:

  • Global news monitoring
  • English content search
  • Broad topic coverage

Configuration Requirements:

  • Google API Key (obtain through Google Cloud Console)
  • Search Engine ID (create custom search engine)

Cost: ~10 credits/item

2. Jina AI Search 🤖

Features:

  • AI-driven semantic search
  • Focuses on high-quality content
  • Supports detail page scraping

Use Cases:

  • Technical documentation search
  • High-quality content filtering
  • Semantic relevance matching

Configuration Requirements:

  • Jina API Key (get from https://jina.ai)

Cost: ~10 credits/item

3. Firecrawl Search 🔥

Features:

  • Professional web scraping service
  • Native detail scraping support (returns Markdown content directly during search)
  • Returns structured Markdown format

Use Cases:

  • Scenarios requiring complete content
  • Structured data extraction
  • High-quality content cleaning

Configuration Requirements:

  • Firecrawl API Key (get from https://firecrawl.dev)

Cost:

  • Search: ~10 credits/item
  • Detail scraping: Included in search (no extra charge)

💡 Tip: Firecrawl is the only engine that returns Markdown directly during search, so no secondary scraping is needed.

4. Metaso Search (秘塔AI) 🌟

Features:

  • Chinese AI search engine
  • Focuses on Chinese content
  • Supports both web search and academic search modes
  • Does not support direct detail scraping (only returns summaries)

Use Cases:

  • Chinese content monitoring
  • Domestic news search
  • Academic literature discovery

Configuration Requirements:

  • Metaso API Key (get from https://metaso.cn)

Search Scope:

  • webpage: Web search (default)
  • academic: Academic search

Cost: ~3 credits per search

Placeholder: 4 search engine comparison cards

Configuration Parameters

1. Keywords

Required. The keywords or phrases to search for.

Examples:

"人工智能 大模型"
"OpenAI GPT-4"
"renewable energy policy"

Tips:

  • Use quotes for an exact match: "exact phrase"
  • Use a space for an AND relationship: AI GPT
  • Combine multiple keywords for better relevance
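
The quoting and spacing rules above can be sketched as a small helper. This is an illustrative sketch only: `build_query` is a hypothetical name, not part of OctoReport, and it assumes the Keywords field accepts a single string.

```python
def build_query(terms, exact_phrases=()):
    """Build a keyword string: quoted exact phrases first, then plain terms.

    Terms separated by spaces are treated as an AND relationship,
    as described in the tips above.
    """
    parts = [f'"{phrase.strip()}"' for phrase in exact_phrases if phrase.strip()]
    parts += [term.strip() for term in terms if term.strip()]
    return " ".join(parts)


print(build_query(["policy", "2024"], exact_phrases=["climate change"]))
# prints: "climate change" policy 2024
```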

2. Max Results

Optional. The maximum number of results returned per search.

Default: 10

Range:

  • Google Search: 1-10 (Google API limit)
  • Jina/Firecrawl/Metaso: 1-50

Example:

Placeholder: JSON example setting Max Results

Cost Tip: More results = more credits consumed (charged per item)
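
A minimal sketch of how the ranges above could be enforced before a search is submitted. The engine identifiers `google` and `jina` and the function name are assumptions for illustration; only the limits and the default of 10 come from this page.

```python
# Upper bounds per engine, taken from the Range list above.
MAX_RESULTS_LIMITS = {"google": 10, "jina": 50, "firecrawl": 50, "metaso": 50}
DEFAULT_MAX_RESULTS = 10


def clamp_max_results(engine, requested=None):
    """Return a Max Results value that fits the engine's allowed range."""
    limit = MAX_RESULTS_LIMITS[engine]
    if requested is None:
        return DEFAULT_MAX_RESULTS
    # Keep the value inside 1..limit.
    return max(1, min(requested, limit))


print(clamp_max_results("google", 25))  # prints: 10
print(clamp_max_results("jina", 25))    # prints: 25
```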

3. Fetch Detail

Optional. Whether to scrape the detail page content of each search result.

Default:

  • Google/Jina/Firecrawl: true (scrape by default)
  • Metaso: Not supported (always returns a summary)

How It Works:

  1. Firecrawl: Returns Markdown directly during search (no extra overhead)
  2. Google/Jina: After the search, uses the Firecrawl → Browserless fallback chain for secondary scraping
  3. Metaso: Returns only a snippet; detail scraping is not supported
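
The three cases above can be summarized as a lookup. The sketch below is hypothetical (the function and the lowercase engine identifiers are assumptions); it only restates which path each engine takes.

```python
def detail_path(engine, fetch_detail=True):
    """Describe where detail content comes from for a given engine."""
    if engine == "metaso" or not fetch_detail:
        # Metaso never scrapes details; other engines skip it when disabled.
        return "snippet only"
    if engine == "firecrawl":
        return "markdown returned with search"
    # Google and Jina need a second step after the search.
    return "secondary scrape: Firecrawl -> Browserless"


for name in ("firecrawl", "google", "jina", "metaso"):
    print(name, "=>", detail_path(name))
```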

Example:

Placeholder: JSON example setting Fetch Detail

Detail Scraping Mechanism

Scraping Strategy

Firecrawl First + Browserless Fallback:

  1. First, try the Firecrawl v2 Scrape API
  2. If that fails, automatically fall back to Browserless (headless Chrome)
  3. If that also fails, keep the original snippet
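
The fallback order above can be sketched as follows. This is not OctoReport's implementation: the two scrapers are passed in as plain callables (stubs here), so no real Firecrawl or Browserless API call is shown or assumed.

```python
def scrape_with_fallback(url, snippet, firecrawl, browserless):
    """Try each scraper in order and return (content, source).

    `firecrawl` and `browserless` are hypothetical callables that take a URL
    and return text, or raise / return None on failure.
    """
    for name, scraper in (("firecrawl", firecrawl), ("browserless", browserless)):
        try:
            content = scraper(url)
        except Exception:
            content = None  # treat any error as a failed attempt
        if content:
            return content, name
    # Both attempts failed: keep the original search snippet.
    return snippet, "snippet"


def failing_scraper(url):
    raise RuntimeError("blocked")


def stub_browserless(url):
    return "rendered page text"


print(scrape_with_fallback("https://example.com", "short snippet",
                           failing_scraper, stub_browserless))
# prints: ('rendered page text', 'browserless')
```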

CAPTCHA Detection

The system automatically detects CAPTCHA pages to avoid saving invalid content:

  • Detection keywords: "verify you are human", "captcha", "robot check"
  • When a CAPTCHA is detected, the summary is used instead of the detail content
  • CAPTCHA pages are not counted in the scraping success stats
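
A keyword check like the one described could look roughly like this. The keyword list comes from this page; case-insensitive substring matching and the function names are assumptions for illustration.

```python
CAPTCHA_KEYWORDS = ("verify you are human", "captcha", "robot check")


def looks_like_captcha(page_text):
    """Return True if the scraped text contains a known CAPTCHA marker."""
    lowered = page_text.lower()
    return any(keyword in lowered for keyword in CAPTCHA_KEYWORDS)


def choose_content(detail, summary):
    """Fall back to the search summary when the detail page is a CAPTCHA."""
    if detail is None or looks_like_captcha(detail):
        return summary
    return detail


print(choose_content("Please verify you are human to continue", "Search summary"))
# prints: Search summary
```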

Concurrency Limits

To avoid API rate limiting, the system controls concurrency automatically:

  • Firecrawl: Max 5 concurrent requests
  • Browserless: Max 3 concurrent requests
  • Adjustable in the admin panel (/admin/system-config)
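
One common way to cap concurrent requests is a semaphore. The sketch below uses Python's `asyncio.Semaphore` with the limits listed above; it is an illustration under that assumption, not a description of how OctoReport enforces its limits, and `fake_fetch` stands in for a real scraper call.

```python
import asyncio

# Limits from the list above; the dictionary itself is illustrative.
CONCURRENCY_LIMITS = {"firecrawl": 5, "browserless": 3}


async def run_limited(backend, urls, fetch):
    """Run fetch(url) for every URL, never exceeding the backend's limit."""
    semaphore = asyncio.Semaphore(CONCURRENCY_LIMITS[backend])

    async def fetch_one(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def demo():
    active = peak = 0

    async def fake_fetch(url):
        # Track how many fetches are in flight at once.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return url

    urls = [f"https://example.com/{i}" for i in range(10)]
    results = await run_limited("browserless", urls, fake_fetch)
    return peak, results


peak, results = asyncio.run(demo())
print("peak concurrent requests:", peak)  # never more than 3 for Browserless
```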

Statistics

After each search completes, the detail scraping stats are shown:

Detail Scraping Stats:
- Total: 10
- Success: 8
- Failed: 2
- Firecrawl: 6
- Browserless: 2
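
To show how such a summary relates to per-result outcomes, here is a small tally sketch. The outcome labels (`firecrawl`, `browserless`, `failed`) and the function names are hypothetical; the output layout mirrors the stats block above.

```python
from collections import Counter


def scraping_stats(outcomes):
    """Tally per-result outcomes into the stats fields shown above."""
    counts = Counter(outcomes)
    success = counts["firecrawl"] + counts["browserless"]
    return {
        "Total": len(outcomes),
        "Success": success,
        "Failed": len(outcomes) - success,
        "Firecrawl": counts["firecrawl"],
        "Browserless": counts["browserless"],
    }


def format_stats(stats):
    lines = ["Detail Scraping Stats:"]
    lines += [f"- {key}: {value}" for key, value in stats.items()]
    return "\n".join(lines)


outcomes = ["firecrawl"] * 6 + ["browserless"] * 2 + ["failed"] * 2
print(format_stats(scraping_stats(outcomes)))
```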

Placeholder: Detail scraping fallback flow diagram

Configuration Examples

Example 1: Google Search + Fetch Details

Placeholder: JSON configuration for Example 1

Description:

  • Search keywords: renewable energy policy 2024
  • Returns 10 results
  • Automatically scrapes the detail page for each result
  • Uses the Firecrawl → Browserless fallback chain
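
Based on the description above, the configuration might look roughly like the following. Only `fetch_detail` is a field name confirmed on this page; `keywords` and `max_results` are assumed spellings, so check the console for the exact keys.

```python
import json

# Hypothetical reconstruction of the Google search source described above.
google_source = {
    "keywords": "renewable energy policy 2024",  # assumed key name
    "max_results": 10,                           # assumed key name
    "fetch_detail": True,                        # field name used on this page
}

print(json.dumps(google_source, indent=2))
```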

Cost Estimate:

  • Search: 10 items × 10 credits = 100 credits
  • Detail scraping: Included in search
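
The arithmetic above generalizes to a small estimator. The rates come from the engine sections of this page; treating Metaso's charge as a flat amount per search is an interpretation of "~3 credits/time", and the function name and engine identifiers are hypothetical.

```python
# (credits, billing unit) per engine, approximate values from this page.
CREDIT_RATES = {
    "google": (10, "item"),
    "jina": (10, "item"),
    "firecrawl": (10, "item"),
    "metaso": (3, "search"),  # assumed to be a flat charge per search
}


def estimate_credits(engine, result_count):
    """Estimate credits for one search returning `result_count` items."""
    rate, unit = CREDIT_RATES[engine]
    return rate * result_count if unit == "item" else rate


print(estimate_credits("google", 10))  # prints: 100
print(estimate_credits("metaso", 10))  # prints: 3
```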

Example 2: Firecrawl Search (Recommended)

Placeholder: JSON configuration for Example 2

Description:

  • Uses Firecrawl search (select the firecrawl subtype)
  • Returns Markdown content directly during search
  • No secondary scraping needed, so it is faster
  • Highest content quality (professional cleaning)

Example 3: Metaso AI Search (Chinese)

Placeholder: JSON configuration for Example 3

Description:

  • Uses Metaso AI search (select the metaso subtype)
  • Web search mode (webpage)
  • Returns only a summary (fetch_detail is not supported)
  • Suitable for Chinese content monitoring
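
A Metaso configuration along these lines might be built as below. The scope values `webpage` and `academic` come from this page; the key names `keywords`, `max_results`, and `search_scope` and the helper itself are assumptions for illustration.

```python
import json

VALID_SCOPES = ("webpage", "academic")


def metaso_config(keywords, scope="webpage", max_results=10):
    """Build a hypothetical Metaso source config with a validated scope."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {VALID_SCOPES}, got {scope!r}")
    # fetch_detail is omitted because Metaso only returns summaries.
    return {
        "keywords": keywords,        # assumed key name
        "max_results": max_results,  # assumed key name
        "search_scope": scope,       # assumed key name
    }


print(json.dumps(metaso_config("人工智能 大模型"), ensure_ascii=False, indent=2))
```

Passing `scope="academic"` to the same helper would give the academic-mode variant.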

Example 4: Academic Search

Placeholder: JSON configuration for Example 4

Description:

  • Uses Metaso AI academic search
  • Search scope: academic (academic mode)
  • Returns papers and academic articles
  • Suitable for research and literature review

Best Practices

✅ Keyword Optimization

Use Exact Phrases:

  • ❌ AI (too broad)
  • ✅ "GPT-4 Turbo release notes" (exact match)

Combine Multiple Keywords:

  • ❌ news (too many results)
  • ✅ "climate change" policy 2024 (multiple keywords)

✅ Cost Optimization

Disable Unnecessary Detail Scraping:

  • If you only need the title and summary, set fetch_detail: false
  • Saves ~50% of the cost

Choose Appropriate Search Engine:

  • Chinese content → Metaso (~3 credits per search)
  • English content + details → Firecrawl (~10 credits/item)
  • Broad coverage → Google (~10 credits/item)

✅ Schedule Strategy

News Monitoring:

  • Schedule: Every 12 hours
  • Dedup strategy: KEEP_OLD (avoid duplicate scraping)

Keyword Tracking:

  • Schedule: 1-2 times daily
  • Dedup strategy: UPDATE (get latest version)
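
The difference between the two dedup strategies can be illustrated with a URL-keyed store. The semantics here are an assumption inferred from the descriptions above (KEEP_OLD leaves an existing entry untouched, UPDATE overwrites it); the function and record shape are hypothetical.

```python
def merge_items(store, incoming, strategy):
    """Merge newly collected items into `store`, keyed by URL."""
    if strategy not in ("KEEP_OLD", "UPDATE"):
        raise ValueError(f"unknown dedup strategy: {strategy!r}")
    for item in incoming:
        url = item["url"]
        if url in store and strategy == "KEEP_OLD":
            continue  # already collected: skip the duplicate
        store[url] = item  # new URL, or UPDATE replacing the old version
    return store


old = {"https://example.com/a": {"url": "https://example.com/a", "title": "v1"}}
new = [{"url": "https://example.com/a", "title": "v2"}]

print(merge_items(dict(old), new, "KEEP_OLD")["https://example.com/a"]["title"])  # prints: v1
print(merge_items(dict(old), new, "UPDATE")["https://example.com/a"]["title"])    # prints: v2
```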

⚠️ Common Issues

Issue 1: Search Results Fewer Than Expected

Reasons:

  • Keywords too specific
  • Search engine API limits

Solutions:

  • Broaden keywords
  • Try different search engines

Issue 2: High Detail Scraping Failure Rate

Reasons:

  • Target site has anti-scraping mechanisms
  • CAPTCHA verification present

Solutions:

  • Use Firecrawl search (higher bypass rate)
  • Disable fetch_detail and use the summary only

Issue 3: Duplicate Content

Reasons:

  • Schedule too frequent
  • Dedup strategy misconfigured

Solutions:

  • Reduce search frequency (for example, once daily)
  • Use KEEP_OLD dedup strategy

Next Steps

  • RSS Sources - Subscribe to website updates
  • Web & Email Sources - Scrape specific pages
  • Sources Overview - Learn about all source types