Web & Email Sources
Web scraping and email monitoring are two powerful content collection methods, suited to websites that do not offer RSS feeds and to email notification scenarios.
🌐 Web Scraping (Scrape & Crawl)
OctoReport supports two web scraping modes: Single Page Scrape and Batch Crawl.
Single Page Scrape
Suitable for scraping single fixed pages, such as company announcement pages, policy document pages, etc.
How It Works
- Periodically accesses the specified URL
- Uses the Firecrawl API to scrape the complete page content (automatically handles JavaScript rendering)
- If Firecrawl fails, automatically falls back to Browserless (the backup solution)
- Extracts the page's Markdown content and saves it to the knowledge base
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| Target URL | Page address to scrape | |
| Schedule Strategy | Scraping frequency | Every 6 hours / Daily at 9:00 |
| Deduplication Strategy | UPDATE (get latest version) or KEEP_OLD (avoid duplicate scraping) | UPDATE |
| Content Cleaning | Whether to use LLM to extract key information | Enable/Disable |
Configuration Example
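A minimal sketch of what such a configuration might look like. The field names (`sourceType`, `url`, `schedule`, `dedupStrategy`, `contentCleaning`) are illustrative assumptions rather than a confirmed schema; only the schedule format and the `UPDATE` value follow what is documented elsewhere on this page.

```json
{
  "sourceType": "SCRAPE",
  "url": "https://www.example.gov/announcements",
  "schedule": { "hours": 6 },
  "dedupStrategy": "UPDATE",
  "contentCleaning": true
}
```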
Batch Crawl
Suitable for batch crawling multiple pages, such as crawling entire blogs, product catalogs, etc.
How It Works
- Starts crawling from the starting URL
- Automatically discovers and follows links in pages (configurable depth)
- Batch-scrapes all discovered pages
- Saves each page separately, with automatic deduplication
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| Starting URL | Entry page for crawling | |
| Max Pages | Limit number of pages to crawl | 50 |
| Depth Limit | Maximum number of link layers to follow | 2 (start page → list page → detail page) |
| URL Filter Rule | Only crawl URLs matching pattern (regex) | |
| Deduplication Strategy | Recommended to use KEEP_OLD (avoid duplicate crawling) | KEEP_OLD |
Configuration Example
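A minimal illustrative sketch, with the same caveat as above: `maxPages`, `maxDepth`, and `urlPattern` appear in the FAQ below, but the remaining field names are assumptions, not a confirmed schema.

```json
{
  "sourceType": "CRAWL",
  "url": "https://blog.example.com",
  "maxPages": 50,
  "maxDepth": 2,
  "urlPattern": ".*/posts/.*",
  "schedule": { "hours": 24 },
  "dedupStrategy": "KEEP_OLD"
}
```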
Scrape vs Crawl Comparison
| Feature | Scrape (Single Page) | Crawl (Batch) |
|---|---|---|
| Use Case | Single fixed page | Multiple pages, entire site crawling |
| Page Count | 1 page | Multiple (configurable limit) |
| Link Discovery | ❌ Not supported | ✅ Automatic discovery |
| Depth Control | ❌ Not applicable | ✅ Supported |
| Cost | Low (1-5 credits per run) | Medium (1-5 credits per page) |
| Recommended Deduplication | UPDATE (get latest) | KEEP_OLD (avoid duplicates) |
(Diagram placeholder: Scrape vs Crawl workflow comparison)
📧 Email Source (IMAP)
Suitable for monitoring emails in mailboxes, such as subscription notifications, system alerts, etc.
How It Works
- Connects to the mailbox via the IMAP protocol (checks every hour)
- Reads new emails from the specified folder
- Extracts the email subject, sender, and body content
- Saves them to the knowledge base (each email becomes one content item)
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| IMAP Server | Email server address | |
| Port | IMAP port (usually 993) | 993 |
| Username | Email account | |
| Password | Email password or app-specific password | |
| Email Folder | Folder to monitor (default INBOX) | |
| Sender Filter | Only collect emails from the specified sender | |
Configuration Example
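A minimal illustrative sketch; the field names are assumptions derived from the parameter table above, and all values are placeholders.

```json
{
  "sourceType": "EMAIL_IMAP",
  "imapServer": "imap.example.com",
  "port": 993,
  "username": "me@example.com",
  "password": "app-specific-password",
  "folder": "INBOX",
  "senderFilter": "notifications@service.example.com",
  "schedule": { "hours": 1 }
}
```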
ℹ️ Note for Gmail users: you need to enable "Allow less secure app access" or use an app-specific password.
💡 Privacy Note: Email passwords are encrypted (AES-256-GCM) and only used for IMAP connections.
(Screenshot placeholder: IMAP email monitoring configuration interface)
Best Practices
✅ Single Page Scrape Scenario
Scenario: Monitor government policy document page
- Data Source: Scrape
- Schedule Strategy: Once per day (`{"hours": 24}`)
- Deduplication Strategy: UPDATE (get latest version)
- Content Cleaning: Enable (extract key policy information)
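Put together, this task could look like the following sketch (field names are illustrative, as in the configuration examples above; the schedule and deduplication values follow the formats documented on this page).

```json
{
  "sourceType": "SCRAPE",
  "url": "https://www.example.gov/policy-documents",
  "schedule": { "hours": 24 },
  "dedupStrategy": "UPDATE",
  "contentCleaning": true
}
```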
✅ Batch Crawl Scenario
Scenario: Crawl tech blog articles
- Data Source: Crawl
- Max Pages: 50
- Depth Limit: 2 (list page → detail page)
- Deduplication Strategy: KEEP_OLD (avoid duplicate crawling)
- Schedule Strategy: Once per week (`{"days": [1], "hour": 9, "minute": 0}`)
✅ Email Monitor Scenario
Scenario: Monitor server alert emails
- Data Source: Email (IMAP)
- Sender Filter: the alerting system's sender address
- Schedule Strategy: Hourly (`{"hours": 1}`)
- Content Cleaning: Disable (raw email content is sufficient)
FAQ
Q1: What if Scrape fails?
A: The system falls back automatically:
- It first tries Firecrawl (supports JavaScript rendering)
- If that fails, it automatically switches to Browserless
- If that still fails, the task log will show the specific error
Q2: Crawl is too slow?
A: Adjust strategy:
- Reduce `maxPages` (e.g., from 100 to 50)
- Reduce `maxDepth` (e.g., from 3 to 2)
- Use `urlPattern` to filter out irrelevant pages
Q3: How to avoid crawling same pages repeatedly?
A: Use `KEEP_OLD`:
- The system records crawled URLs
- Next execution only crawls new pages
- Significantly reduces cost (no duplicate scraping)
Q4: IMAP email connection fails?
A: Check configuration:
- ✅ IMAP server address and port are correct
- ✅ Username and password are correct (Gmail requires app-specific password)
- ✅ Email service provider allows IMAP access (needs to be enabled in email settings)
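For Gmail specifically, the standard IMAP endpoint is imap.gmail.com on port 993 (SSL), used together with an app-specific password. An illustrative configuration, with the same assumed field names as the examples above, might look like:

```json
{
  "imapServer": "imap.gmail.com",
  "port": 993,
  "username": "you@gmail.com",
  "password": "your-app-specific-password",
  "folder": "INBOX"
}
```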
Next Steps
- Data Sources Overview - Learn about all data source types
- Government & News Sources - Tender announcements and Google News
- Knowledge Base Management - Manage collected content