OctoReport Docs

Source Management - Overview

What is a Data Source

A data source defines where content is collected from and what content to collect.

After you create a data source, the system automatically collects content according to its schedule strategy and stores it in the associated libraries.
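Conceptually, a data source pairs a collection target with a schedule and one or more destination libraries. A minimal sketch (the field names are illustrative, not OctoReport's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a data source definition; field names and
# values are illustrative, not OctoReport's actual configuration schema.
@dataclass
class DataSource:
    name: str          # human-readable label
    source_type: str   # e.g. "rss", "search", "email", "web", "tender"
    target: str        # where to collect from (feed URL, keyword, mailbox)
    schedule: str      # when to collect ("every_6h", "weekly", "manual")
    libraries: list = field(default_factory=list)  # destination libraries

src = DataSource(
    name="36Kr Tech News",
    source_type="rss",
    target="https://36kr.com/feed",
    schedule="every_6h",
    libraries=["AI Industry News"],
)
```

Every source type described below fills in these same pieces; only the target and typical schedule differ.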

8+ Data Source Types

| Type | Use Case | Update Frequency | Cost |
| --- | --- | --- | --- |
| 🔍 Search Sources | Actively search keywords | 1-2 times/day | Medium |
| 📡 RSS Feeds | Subscribe to website updates | Every 1-6 hours | Low |
| 📧 Email Sources | Monitor mailbox emails | Every hour | Low |
| 🌐 Web Scraping | Scrape specific pages | 1-2 times/day | Medium |
| 📢 Tender Announcements | Monitor government procurement | Once/day | Medium |
| 📰 Google News | Global news monitoring | Every 2-6 hours | Low |

[Placeholder: data source type cards]

How to Choose Data Source Type

Decision Flow

What is your content source?
├─ Specific site has RSS → Use **RSS Feed** (most cost-effective)
├─ Need to search keywords → Use **Search Source**
├─ Monitor email notifications → Use **Email Source**
├─ Scrape specific pages → Use **Web Scraping**
├─ Government tender info → Use **Tender Source**
└─ Global news monitoring → Use **Google News**
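The decision flow above can be read as a simple selection function. A sketch, where the flag names and return labels are illustrative:

```python
def choose_source_type(has_rss: bool, need_keywords: bool, is_email: bool,
                       is_tender: bool, is_global_news: bool) -> str:
    """Mirror the decision tree: check the cheapest/most specific
    option first, fall back to web scraping for specific pages."""
    if has_rss:
        return "RSS Feed"       # most cost-effective
    if need_keywords:
        return "Search Source"
    if is_email:
        return "Email Source"
    if is_tender:
        return "Tender Source"
    if is_global_news:
        return "Google News"
    return "Web Scraping"       # scrape specific pages
```

The ordering encodes the cost guidance in the table above: prefer RSS when it exists, since it is the cheapest option.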

Recommended Combinations

News Aggregation Scenario:

  • RSS Feeds (main source, low cost)
  • Google News (supplement, cover more regions)
  • Search Sources (supplement specific keywords)

Tender Monitoring Scenario:

  • Tender Sources (government platforms)
  • Search Sources (company website tender pages)

[Placeholder: data source selection decision tree]

Common Configuration Options

1. Schedule Strategy

Interval Mode

  • Execute every X hours
  • Example: Every 6 hours (suitable for news)

Weekly Plan Mode

  • Specific days + time
  • Example: Mon/Wed/Fri 9:00 (suitable for periodic reports)

Manual Trigger

  • No auto-execution
  • Click the "Execute Now" button to trigger

How to Choose:

  • News/Real-time content → Interval mode (1-6 hours)
  • Tender/Periodic updates → Weekly plan (fixed days and time)
  • Temporary needs → Manual trigger
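As a sketch of how the two automatic modes differ, here is hypothetical next-run logic (not the actual scheduler):

```python
from datetime import datetime, timedelta

def next_run_interval(last_run: datetime, hours: int) -> datetime:
    """Interval mode: run again `hours` hours after the last run."""
    return last_run + timedelta(hours=hours)

def next_run_weekly(now: datetime, days: set, hour: int) -> datetime:
    """Weekly plan mode: next occurrence of `hour`:00 on one of the
    configured weekdays (0 = Monday). Illustrative only."""
    base = now.replace(minute=0, second=0, microsecond=0)
    for offset in range(8):                      # scan the coming week
        candidate = (base + timedelta(days=offset)).replace(hour=hour)
        if candidate.weekday() in days and candidate > now:
            return candidate
    raise ValueError("no scheduled day configured")
```

Interval mode drifts with each run's finish time, while weekly plan mode always lands on the configured wall-clock slots, which is why it suits periodic reports.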

2. Deduplication Strategy

UPDATE (Default)

  • When duplicate URL found, save new version
  • Old version marked as expired (`isExpired=true`)
  • Auto-filter expired content when generating reports

Use Case:

  • Need latest version (news updates, price changes)

KEEP_OLD

  • When duplicate URL found, only log, don't re-scrape
  • Keep original content

Use Case:

  • Content won't update (RSS news, tender announcements)
  • Save cost (avoid duplicate scraping)

Comparison:

| Strategy | Re-scrape | Cost | Use Case |
| --- | --- | --- | --- |
| UPDATE | ✅ Yes | High | Content updates |
| KEEP_OLD | ❌ No | Low | Content unchanged |
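A minimal sketch of how the two strategies could handle a duplicate URL (illustrative, not the actual implementation; the store shape is assumed):

```python
def handle_duplicate(store: dict, url: str, new_content: str, strategy: str) -> bool:
    """`store` maps URL -> list of stored versions.
    Return True if the page was re-scraped and stored."""
    if strategy == "KEEP_OLD":
        return False                        # only log; no re-scrape cost
    if strategy == "UPDATE":
        for version in store[url]:
            version["isExpired"] = True     # old versions filtered from reports
        store[url].append({"content": new_content, "isExpired": False})
        return True
    raise ValueError(f"unknown strategy: {strategy}")
```

Under UPDATE, nothing is deleted: old versions stay in the library but carry `isExpired=true`, so report generation can skip them.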

3. Content Cleaning

Enable Cleaning

  • Use LLM to extract title, summary, keywords
  • Remove HTML tags and irrelevant content
  • Extra cost: 10-20 credits per run

Disable Cleaning

  • Keep original HTML content
  • Clean later when needed (recommended)

How to Choose:

  • Need structured data immediately → Enable
  • Collect raw data first → Disable (can manually trigger cleaning later)
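The cost trade-off can be estimated with a back-of-the-envelope calculation. Here `base_cost` (credits per page for collection alone) is a hypothetical parameter, and 15 is used as the midpoint of the documented 10-20 credit cleaning cost:

```python
def collection_cost(pages: int, base_cost: int, clean: bool,
                    clean_cost: int = 15) -> int:
    """Rough credit estimate for one run. `base_cost` is hypothetical;
    `clean_cost` approximates the 10-20 credits/run LLM cleaning fee."""
    cost = pages * base_cost
    if clean:
        cost += pages * clean_cost      # cleaning charged per cleaned page
    return cost
```

This is why collecting raw first and cleaning selectively later is recommended: you only pay the cleaning fee for content you actually need structured.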

4. Associated Libraries

Each data source can be associated with one or more libraries.

Scenarios:

  • 1 source → 1 library (simple scenario)
  • 1 source → multiple libraries (categorize by topic)

Example:

Data Source: "36Kr Tech News"
  ├─ Associated Library: "AI Industry News"
  └─ Associated Library: "Startup Investment News"
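The one-source-to-many-libraries association amounts to a simple fan-out of each collected item. A sketch (the item fields and URL are illustrative):

```python
def fan_out(item: dict, libraries: list) -> list:
    """File one collected item into every associated library.
    Sketch of the 1-source -> N-libraries association."""
    return [(library, item) for library in libraries]

entries = fan_out(
    {"url": "https://36kr.com/p/example", "title": "AI funding round"},
    ["AI Industry News", "Startup Investment News"],
)
```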

[Placeholder: configuration diagram]

Best Practices

✅ News Content

  • Data Source: RSS Feed
  • Schedule: Every 6 hours
  • Dedup: KEEP_OLD (save cost)
  • Cleaning: Disabled (raw content sufficient)

✅ Tender Content

  • Data Source: Tender Source
  • Schedule: Once daily
  • Dedup: KEEP_OLD (no duplicate scraping)
  • Cleaning: Enabled (extract key info)

✅ Keyword Monitoring

  • Data Source: Search Source
  • Schedule: 2 times daily
  • Dedup: UPDATE (get latest)
  • Cleaning: Enabled (structured data)

Next Steps

  • Search Sources - Detailed config for 4 search engines
  • RSS Feeds - RSSHub advanced config
  • Web & Email Sources - Automated monitoring