Web & Email Sources
Web scraping and email monitoring are two powerful content collection methods, suited to websites that do not offer RSS feeds and to email notification scenarios.
🌐 Web Scraping (Scrape & Crawl)
OctoReport supports two web scraping modes: Single Page Scrape and Batch Crawl.
Single Page Scrape
Suitable for scraping single fixed pages, such as company announcement pages, policy document pages, etc.
How It Works
- Periodically accesses the specified URL
- Uses the Firecrawl API to scrape the complete page content (automatically handles JavaScript rendering)
- If Firecrawl fails, automatically falls back to Browserless (the backup solution)
- Extracts the page's Markdown content and saves it to the knowledge base
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| Target URL | Page address to scrape | |
| Schedule Strategy | Scraping frequency | Every 6 hours / Daily at 9:00 |
| Deduplication Strategy | UPDATE (get latest version) or KEEP_OLD (avoid duplicate scraping) | UPDATE |
| Content Cleaning | Whether to use LLM to extract key information | Enable/Disable |
Configuration Example
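A minimal sketch of what such a configuration might look like. The field names (`sourceType`, `url`, `schedule`, `dedupStrategy`, `contentCleaning`) are illustrative assumptions rather than a confirmed schema; only the schedule format and the `UPDATE` value follow what is documented elsewhere on this page.

```json
{
  "sourceType": "SCRAPE",
  "url": "https://www.example.gov/announcements",
  "schedule": { "hours": 6 },
  "dedupStrategy": "UPDATE",
  "contentCleaning": true
}
```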
Batch Crawl
Suitable for batch crawling multiple pages, such as crawling entire blogs, product catalogs, etc.
How It Works
- Starts crawling from the starting URL
- Automatically discovers and follows links in pages (configurable depth)
- Batch-scrapes all discovered pages
- Saves each page separately, with automatic deduplication
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| Starting URL | Entry page for crawling | |
| Max Pages | Limit number of pages to crawl | 50 |
| Depth Limit | Maximum number of link layers to follow | 2 (start page → list page → detail page) |
| URL Filter Rule | Only crawl URLs matching pattern (regex) | |
| Deduplication Strategy | Recommended to use KEEP_OLD (avoid duplicate crawling) | KEEP_OLD |
Configuration Example
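A minimal illustrative sketch, with the same caveat as above: `maxPages`, `maxDepth`, and `urlPattern` appear in the FAQ below, but the remaining field names are assumptions, not a confirmed schema.

```json
{
  "sourceType": "CRAWL",
  "url": "https://blog.example.com",
  "maxPages": 50,
  "maxDepth": 2,
  "urlPattern": ".*/posts/.*",
  "schedule": { "hours": 24 },
  "dedupStrategy": "KEEP_OLD"
}
```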
Scrape vs Crawl Comparison
| Feature | Scrape (Single Page) | Crawl (Batch) |
|---|---|---|
| Use Case | Single fixed page | Multiple pages, entire site crawling |
| Page Count | 1 page | Multiple (configurable limit) |
| Link Discovery | ❌ Not supported | ✅ Automatic discovery |
| Depth Control | ❌ Not applicable | ✅ Supported |
| Cost | Low (1-5 credits per run) | Medium (1-5 credits per page) |
| Recommended Deduplication | UPDATE (get latest) | KEEP_OLD (avoid duplicates) |
(Diagram placeholder: Scrape vs Crawl workflow comparison)
📧 Email Source (IMAP)
Suitable for monitoring emails in mailboxes, such as subscription notifications, system alerts, etc.
How It Works
- Connects to the mailbox via the IMAP protocol (checks every hour)
- Reads new emails from the specified folder
- Extracts the email subject, sender, and body content
- Saves them to the knowledge base (each email becomes one content item)
Configuration Parameters
| Parameter | Description | Example |
|---|---|---|
| IMAP Server | Email server address | |
| Port | IMAP port (usually 993) | 993 |
| Username | Email account | |
| Password | Email password or app-specific password | |
| Email Folder | Folder to monitor (default INBOX) | |
| Sender Filter | Only collect emails from the specified sender | |
Configuration Example
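A minimal illustrative sketch; the field names are assumptions derived from the parameter table above, and all values are placeholders.

```json
{
  "sourceType": "EMAIL_IMAP",
  "imapServer": "imap.example.com",
  "port": 993,
  "username": "me@example.com",
  "password": "app-specific-password",
  "folder": "INBOX",
  "senderFilter": "notifications@service.example.com",
  "schedule": { "hours": 1 }
}
```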
ℹ️ Note for Gmail users: you need to enable "Allow less secure app access" or use an app-specific password.
💡 Privacy Note: Email passwords are encrypted (AES-256-GCM) and only used for IMAP connections.
(Screenshot placeholder: IMAP email monitoring configuration interface)
Best Practices
✅ Single Page Scrape Scenario
Scenario: Monitor government policy document page
- Data Source: Scrape
- Schedule Strategy: Once per day (`{"hours": 24}`)
- Deduplication Strategy: UPDATE (get latest version)
- Content Cleaning: Enable (extract key policy information)
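Put together, this task could look like the following sketch (field names are illustrative, as in the configuration examples above; the schedule and deduplication values follow the formats documented on this page).

```json
{
  "sourceType": "SCRAPE",
  "url": "https://www.example.gov/policy-documents",
  "schedule": { "hours": 24 },
  "dedupStrategy": "UPDATE",
  "contentCleaning": true
}
```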
✅ Batch Crawl Scenario
Scenario: Crawl tech blog articles
- Data Source: Crawl
- Max Pages: 50
- Depth Limit: 2 (list page → detail page)
- Deduplication Strategy: KEEP_OLD (avoid duplicate crawling)
- Schedule Strategy: Once per week (`{"days": [1], "hour": 9, "minute": 0}`)
✅ Email Monitor Scenario
Scenario: Monitor server alert emails
- Data Source: Email (IMAP)
- Sender Filter: the alerting system's sender address
- Schedule Strategy: Hourly (`{"hours": 1}`)
- Content Cleaning: Disable (raw email content is sufficient)
FAQ
Q1: What if Scrape fails?
A: The system falls back automatically:
- It first tries Firecrawl (supports JavaScript rendering)
- If that fails, it automatically switches to Browserless
- If that still fails, the task log will show the specific error
Q2: Crawl is too slow?
A: Adjust strategy:
- Reduce `maxPages` (e.g., from 100 to 50)
- Reduce `maxDepth` (e.g., from 3 to 2)
- Use `urlPattern` to filter out irrelevant pages
Q3: How to avoid crawling same pages repeatedly?
A: Use `KEEP_OLD`:
- The system records crawled URLs
- Next execution only crawls new pages
- Significantly reduces cost (no duplicate scraping)
Q4: IMAP email connection fails?
A: Check configuration:
- ✅ IMAP server address and port are correct
- ✅ Username and password are correct (Gmail requires app-specific password)
- ✅ Email service provider allows IMAP access (needs to be enabled in email settings)
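For Gmail specifically, the standard IMAP endpoint is imap.gmail.com on port 993 (SSL), used together with an app-specific password. An illustrative configuration, with the same assumed field names as the examples above, might look like:

```json
{
  "imapServer": "imap.gmail.com",
  "port": 993,
  "username": "you@gmail.com",
  "password": "your-app-specific-password",
  "folder": "INBOX"
}
```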
Next Steps
- Data Sources Overview - Learn about all data source types
- Government & News Sources - Tender announcements and Google News
- Knowledge Base Management - Manage collected content