Web & Email Sources

Web scraping and email monitoring are two powerful content-collection methods, suited to websites without RSS feeds and to email-notification scenarios.


🌐 Web Scraping (Scrape & Crawl)

OctoReport supports two web scraping modes: Single Page Scrape and Batch Crawl.

Single Page Scrape

Suitable for scraping a single fixed page, such as a company announcement page or a policy document page.

How It Works

  1. Periodically accesses the specified URL
  2. Uses the Firecrawl API to scrape the complete page content (JavaScript rendering is handled automatically)
  3. If Firecrawl fails, automatically falls back to Browserless (the backup solution; see the sketch after this list)
  4. Extracts the page's Markdown content and saves it to the knowledge base
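
A minimal sketch of this fallback chain (illustrative only; `primary` and `backup` stand in for the Firecrawl and Browserless API clients, which are not shown):

```python
from typing import Callable

def scrape_with_fallback(
    url: str,
    primary: Callable[[str], str],   # e.g. a Firecrawl client returning Markdown
    backup: Callable[[str], str],    # e.g. a Browserless client returning Markdown
) -> str:
    """Try the primary scraper first; fall back to the backup on any failure."""
    try:
        return primary(url)
    except Exception as primary_error:
        try:
            return backup(url)
        except Exception as backup_error:
            # Both scrapers failed: raise so the task log records the error
            raise RuntimeError(
                f"Scrape failed for {url}: "
                f"primary={primary_error!r}, backup={backup_error!r}"
            ) from backup_error
```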

Configuration Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| Target URL | Page address to scrape | `https://example.com/news` |
| Schedule Strategy | Scraping frequency | Every 6 hours / Daily at 9:00 |
| Deduplication Strategy | `UPDATE` (fetch the latest version) or `KEEP_OLD` (avoid duplicate scraping) | `UPDATE` |
| Content Cleaning | Whether to use an LLM to extract key information | Enable / Disable |

Configuration Example

An illustrative configuration (the field names here are indicative of the parameters above, not an exact schema):

```json
{
  "type": "scrape",
  "targetUrl": "https://example.com/news",
  "schedule": { "hours": 6 },
  "dedupStrategy": "UPDATE",
  "contentCleaning": true
}
```

Batch Crawl

Suitable for crawling multiple pages in bulk, such as an entire blog or a product catalog.

How It Works

  1. Starts crawling from the starting URL
  2. Automatically discovers and follows links on each page (depth is configurable)
  3. Scrapes all discovered pages in bulk (see the sketch after this list)
  4. Saves each page separately, with automatic deduplication
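
A minimal sketch of the crawl loop under these rules (illustrative; `fetch_links` is a hypothetical helper that returns the links found on a page):

```python
import re
from collections import deque
from typing import Callable, Iterable

def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical link extractor
    max_pages: int = 50,
    max_depth: int = 2,
    url_pattern: str = r".*",
) -> list[str]:
    """Breadth-first crawl: follow links up to max_depth, stop at max_pages,
    keep only URLs matching url_pattern, and never visit a URL twice."""
    pattern = re.compile(url_pattern)
    visited: set[str] = set()
    queue: deque[tuple[str, int]] = deque([(start_url, 0)])
    crawled: list[str] = []

    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        # The starting URL is always crawled; deeper URLs must match the filter
        if url in visited or (depth > 0 and not pattern.search(url)):
            continue
        visited.add(url)
        crawled.append(url)  # in the real system: scrape and save this page
        if depth < max_depth:
            for link in fetch_links(url):
                queue.append((link, depth + 1))
    return crawled
```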

Configuration Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| Starting URL | Entry page for the crawl | `https://blog.example.com` |
| Max Pages | Upper limit on the number of pages to crawl | 50 |
| Depth Limit | Maximum number of link layers to follow | 2 (start page → list page → detail page) |
| URL Filter Rule | Only crawl URLs matching this pattern (regex) | `/blog/.*` |
| Deduplication Strategy | `KEEP_OLD` is recommended (avoids duplicate crawling) | `KEEP_OLD` |

Configuration Example

An illustrative configuration (`maxPages`, `maxDepth`, and `urlPattern` match the FAQ below; the remaining field names are indicative):

```json
{
  "type": "crawl",
  "startUrl": "https://blog.example.com",
  "maxPages": 50,
  "maxDepth": 2,
  "urlPattern": "/blog/.*",
  "schedule": { "days": [1], "hour": 9, "minute": 0 },
  "dedupStrategy": "KEEP_OLD"
}
```

Scrape vs Crawl Comparison

| Feature | Scrape (Single Page) | Crawl (Batch) |
| --- | --- | --- |
| Use Case | Single fixed page | Multiple pages, whole-site crawling |
| Page Count | 1 page | Multiple (configurable limit) |
| Link Discovery | ❌ Not supported | ✅ Automatic discovery |
| Depth Control | ❌ Not applicable | ✅ Supported |
| Cost | Low (1-5 credits per run) | Medium (1-5 credits per page) |
| Recommended Deduplication | `UPDATE` (get the latest) | `KEEP_OLD` (avoid duplicates) |
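
For example, a crawl capped at 50 pages can cost 50-250 credits per run, while a single-page scrape of the same site stays at 1-5 credits.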

[Placeholder: Scrape vs Crawl workflow comparison diagram]


📧 Email Source (IMAP)

Suitable for monitoring a mailbox for emails such as subscription notifications and system alerts.

How It Works

  1. Connects to the mailbox via the IMAP protocol (checks every hour)
  2. Reads new emails from the specified folder
  3. Extracts each email's subject, sender, and body (see the sketch after this list)
  4. Saves them to the knowledge base (each email becomes one content item)
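
A minimal sketch of one collection pass using Python's standard imaplib and email modules (illustrative; the credentials are placeholders):

```python
import email
import imaplib
from email.header import decode_header

HOST, PORT = "imap.gmail.com", 993             # placeholder IMAP settings
USER, PASSWORD = "user@gmail.com", "app_password_123"

with imaplib.IMAP4_SSL(HOST, PORT) as conn:
    conn.login(USER, PASSWORD)
    conn.select("INBOX")                       # the folder to monitor

    _, data = conn.search(None, "UNSEEN")      # only messages not read yet
    for num in data[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])

        subject, enc = decode_header(msg.get("Subject", ""))[0]
        if isinstance(subject, bytes):
            subject = subject.decode(enc or "utf-8")
        sender = msg.get("From", "")           # a sender filter would check this

        body = ""
        for part in msg.walk():                # first text/plain part as the body
            if part.get_content_type() == "text/plain":
                body = part.get_payload(decode=True).decode(errors="replace")
                break
        print(subject, sender, body[:100])     # real system: save to knowledge base
```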

Configuration Parameters

| Parameter | Description | Example |
| --- | --- | --- |
| IMAP Server | Mail server address | `imap.gmail.com` |
| Port | IMAP port (usually 993) | 993 |
| Username | Email account | `user@gmail.com` |
| Password | Mailbox password or app-specific password | `app_password_123` |
| Email Folder | Folder to monitor (default `INBOX`) | `INBOX` |
| Sender Filter | Only collect emails from the specified sender | `alerts@example.com` |

Configuration Example

An illustrative configuration (field names are indicative; credentials are placeholders):

```json
{
  "type": "email",
  "imapServer": "imap.gmail.com",
  "port": 993,
  "username": "user@gmail.com",
  "password": "app_password_123",
  "folder": "INBOX",
  "senderFilter": "alerts@example.com",
  "schedule": { "hours": 1 }
}
```

ℹ️ Note for Gmail users: you need to enable "Allow less secure app access" or use an "app-specific password".

💡 Privacy Note: Email passwords are stored encrypted with AES-256-GCM and are used only for IMAP connections.
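
For reference, this is what the AES-256-GCM scheme looks like with Python's `cryptography` package (an illustrative sketch, not OctoReport's actual code):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 256-bit key, stored server-side
aesgcm = AESGCM(key)

nonce = os.urandom(12)                      # GCM nonce, unique per encryption
ciphertext = aesgcm.encrypt(nonce, b"app_password_123", None)

# The password is decrypted only at the moment an IMAP connection is opened
assert aesgcm.decrypt(nonce, ciphertext, None) == b"app_password_123"
```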

[Placeholder: IMAP email monitoring configuration interface]


Best Practices

✅ Single Page Scrape Scenario

Scenario: Monitor government policy document page

  • Data Source: Scrape
  • Schedule Strategy: Once per day (`{"hours": 24}`)
  • Deduplication Strategy: UPDATE (get latest version)
  • Content Cleaning: Enable (extract key policy information)

✅ Batch Crawl Scenario

Scenario: Crawl tech blog articles

  • Data Source: Crawl
  • Max Pages: 50
  • Depth Limit: 2 (list page → detail page)
  • Deduplication Strategy: KEEP_OLD (avoid duplicate crawling)
  • Schedule Strategy: Once per week (`{"days": [1], "hour": 9, "minute": 0}`)

✅ Email Monitor Scenario

Scenario: Monitor server alert emails

  • Data Source: Email (IMAP)
  • Sender Filter: alerts@example.com
  • Schedule Strategy: Hourly (`{"hours": 1}`)
  • Content Cleaning: Disable (raw email content is sufficient)

FAQ

Q1: What if Scrape fails?

A: The system falls back automatically:

  1. It first tries Firecrawl (supports JavaScript rendering)
  2. If that fails, it switches to Browserless
  3. If that also fails, the task log records the specific error

Q2: What if a crawl is too slow?

A: Adjust the configuration:

  • Reduce `maxPages` (e.g., from 100 to 50)
  • Reduce `maxDepth` (e.g., from 3 to 2)
  • Use `urlPattern` to filter out irrelevant pages

Q3: How can I avoid crawling the same pages repeatedly?

A: Use the `KEEP_OLD` deduplication strategy:

  • The system records every crawled URL
  • The next execution crawls only new pages
  • This significantly reduces cost (no duplicate scraping), as sketched below
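
A minimal sketch of that bookkeeping (illustrative; the real system persists the seen-URL set between runs):

```python
def keep_old(discovered: list[str], seen: set[str]) -> list[str]:
    """KEEP_OLD: return only URLs never crawled before, and remember them."""
    new_urls = [url for url in discovered if url not in seen]
    seen.update(new_urls)
    return new_urls

seen: set[str] = set()
print(keep_old(["https://blog.example.com/a", "https://blog.example.com/b"], seen))
# ['https://blog.example.com/a', 'https://blog.example.com/b']
print(keep_old(["https://blog.example.com/a", "https://blog.example.com/c"], seen))
# ['https://blog.example.com/c']  -- only the new page is scraped on the next run
```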

Q4: What if the IMAP connection fails?

A: Check the configuration:

  • ✅ The IMAP server address and port are correct
  • ✅ The username and password are correct (Gmail requires an app-specific password)
  • ✅ The email provider allows IMAP access (this may need to be enabled in the mailbox settings)

Next Steps

  • Data Sources Overview - Learn about all data source types
  • Government & News Sources - Tender announcements and Google News
  • Knowledge Base Management - Manage collected content