Source Management - Overview
What is a Data Source
A data source defines where content is collected from and what content to collect.
After you create a data source, the system automatically collects content according to its schedule strategy and stores it in the specified library.
8+ Data Source Types
| Type | Use Case | Update Frequency | Cost |
|---|---|---|---|
| 🔍 Search Sources | Actively search keywords | 1-2 times/day | Medium |
| 📡 RSS Feeds | Subscribe to website updates | Every 1-6 hours | Low |
| 📧 Email Sources | Monitor mailbox emails | Every hour | Low |
| 🌐 Web Scraping | Scrape specific pages | 1-2 times/day | Medium |
| 📢 Tender Announcements | Monitor government procurement | Once/day | Medium |
| 📰 Google News | Global news monitoring | Every 2-6 hours | Low |
[Placeholder: data source type cards]
How to Choose Data Source Type
Decision Flow
```
What is your content source?
├─ Specific site has RSS       → RSS Feed (most cost-effective)
├─ Need to search keywords     → Search Source
├─ Monitor email notifications → Email Source
├─ Scrape specific pages       → Web Scraping
├─ Government tender info      → Tender Source
└─ Global news monitoring      → Google News
```
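The decision flow above can be sketched as a small helper. This is purely illustrative: the function name, flags, and the returned type names mirror this page, not any actual API of the system.

```python
# Hypothetical helper mirroring the decision flow above.
# Flag names and return values are illustrative assumptions.
def pick_source_type(has_rss: bool = False,
                     keyword_search: bool = False,
                     email_alerts: bool = False,
                     specific_pages: bool = False,
                     tenders: bool = False) -> str:
    """Return the recommended data source type for a content need."""
    if has_rss:
        return "RSS Feed"       # most cost-effective when available
    if keyword_search:
        return "Search Source"
    if email_alerts:
        return "Email Source"
    if specific_pages:
        return "Web Scraping"
    if tenders:
        return "Tender Source"
    return "Google News"        # fallback for broad news monitoring
```

Note the ordering: RSS is checked first because it is the cheapest option whenever the target site offers a feed.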
Recommended Combinations
News Aggregation Scenario:
- RSS Feeds (main source, low cost)
- Google News (supplement, cover more regions)
- Search Sources (supplement specific keywords)
Tender Monitoring Scenario:
- Tender Sources (government platforms)
- Search Sources (company website tender pages)
[Placeholder: data source selection decision tree]
Common Configuration Options
1. Schedule Strategy
Interval Mode
- Execute every X hours
- Example: Every 6 hours (suitable for news)
Weekly Plan Mode
- Specific days + time
- Example: Mon/Wed/Fri 9:00 (suitable for periodic reports)
Manual Trigger
- No auto-execution
- Click "Execute Now" button to trigger
How to Choose:
- News/Real-time content → Interval mode (1-6 hours)
- Tender/Periodic updates → Weekly plan (fixed time daily)
- Temporary needs → Manual trigger
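The three schedule modes above can be sketched as one small config object. The field names (`mode`, `interval_hours`, `days`, `time`) are assumptions for illustration, not the system's real configuration schema.

```python
# Minimal sketch of the three schedule modes described above.
# Field names are assumptions, not the actual configuration schema.
from dataclasses import dataclass, field

@dataclass
class Schedule:
    mode: str                                 # "interval" | "weekly" | "manual"
    interval_hours: int = 0                   # interval mode: run every N hours
    days: list = field(default_factory=list)  # weekly mode: e.g. ["Mon", "Wed", "Fri"]
    time: str = ""                            # weekly mode: e.g. "09:00"

# News / real-time content: interval mode
news_schedule = Schedule(mode="interval", interval_hours=6)

# Periodic reports: weekly plan at a fixed time
report_schedule = Schedule(mode="weekly", days=["Mon", "Wed", "Fri"], time="09:00")

# Temporary needs: manual trigger only, no auto-execution
manual_schedule = Schedule(mode="manual")
```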
2. Deduplication Strategy
UPDATE (Default)
- When a duplicate URL is found, re-scrape and save the new version
- The old version is marked as expired (`isExpired=true`)
- Expired content is automatically filtered out when generating reports
Use Case:
- Need latest version (news updates, price changes)
KEEP_OLD
- When a duplicate URL is found, only log it; don't re-scrape
- The original content is kept
Use Case:
- Content won't update (RSS news, tender announcements)
- Save cost (avoid duplicate scraping)
Comparison:
| Strategy | Re-scrape | Cost | Use Case |
|---|---|---|---|
| UPDATE | ✅ Yes | High | Content updates |
| KEEP_OLD | ❌ No | Low | Content unchanged |
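The behavioral difference between the two strategies can be sketched as follows. The `isExpired` flag comes from the description above; the store structure and function signature are assumptions for illustration.

```python
# Illustrative sketch of the two deduplication strategies. The store maps
# each URL to a list of stored versions; the isExpired flag mirrors the
# behavior described above, the rest is assumed.
def ingest(store: dict, url: str, content: str, strategy: str = "UPDATE") -> bool:
    """Return True if the new content was saved (i.e. a re-scrape happened)."""
    versions = store.setdefault(url, [])
    if versions:
        if strategy == "KEEP_OLD":
            return False            # duplicate URL: log only, keep original
        for v in versions:          # UPDATE: mark old versions expired
            v["isExpired"] = True
    versions.append({"content": content, "isExpired": False})
    return True
```

With `KEEP_OLD`, a duplicate URL is a no-op, which is where the cost saving comes from; with `UPDATE`, every duplicate triggers a re-scrape plus a new stored version.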
3. Content Cleaning
Enable Cleaning
- Use LLM to extract title, summary, keywords
- Remove HTML tags and irrelevant content
- Extra cost: 10-20 credits per run
Disable Cleaning
- Keep original HTML content
- Clean later when needed (recommended)
How to Choose:
- Need structured data immediately → Enable
- Collect raw data first → Disable (can manually trigger cleaning later)
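The effect of the cleaning toggle can be sketched like this. `extract_with_llm` is a hypothetical stand-in for the real LLM cleaning step (here faked with a regex so the sketch is runnable); the stored-item shape is an assumption.

```python
import re

def extract_with_llm(html: str) -> dict:
    # Hypothetical stand-in for the LLM cleaning pass (~10-20 credits per run).
    # A real implementation would extract title, summary, and keywords.
    title = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": title.group(1) if title else "",
            "summary": "", "keywords": []}

def store_item(html: str, clean: bool) -> dict:
    if clean:
        # Cleaning enabled: structured fields available immediately
        return {"raw": html, **extract_with_llm(html)}
    # Cleaning disabled: keep original HTML, clean manually later
    return {"raw": html}
```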
4. Associated Libraries
Each data source can be associated with one or more libraries.
Scenarios:
- 1 source → 1 library (simple scenario)
- 1 source → multiple libraries (categorize by topic)
Example:
```
Data Source: "36Kr Tech News"
├─ Associated Library: "AI Industry News"
└─ Associated Library: "Startup Investment News"
```
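The one-source-to-many-libraries association above can be represented as a plain mapping. The names come from the example; the structure itself is an assumption, not the system's data model.

```python
# Illustrative source-to-library association (structure is an assumption).
associations = {
    "36Kr Tech News": ["AI Industry News", "Startup Investment News"],
}

def libraries_for(source: str) -> list:
    """Return all libraries that receive content from a given data source."""
    return associations.get(source, [])
```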
[Placeholder: configuration diagram]
Best Practices
✅ News Content
- Data Source: RSS Feed
- Schedule: Every 6 hours
- Dedup: KEEP_OLD (save cost)
- Cleaning: Disabled (raw content sufficient)
✅ Tender Content
- Data Source: Tender Source
- Schedule: Once daily
- Dedup: KEEP_OLD (no duplicate scraping)
- Cleaning: Enabled (extract key info)
✅ Keyword Monitoring
- Data Source: Search Source
- Schedule: 2 times daily
- Dedup: UPDATE (get latest)
- Cleaning: Enabled (structured data)
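The three presets above, collected into one illustrative config dict. The keys and values mirror the bullet points; the dict layout is an assumption, not an export of the system's settings.

```python
# The three best-practice presets above as one illustrative config
# (keys and layout are assumptions mirroring the bullets).
BEST_PRACTICES = {
    "news": {"source": "RSS Feed", "schedule": "every 6 hours",
             "dedup": "KEEP_OLD", "cleaning": False},
    "tender": {"source": "Tender Source", "schedule": "once daily",
               "dedup": "KEEP_OLD", "cleaning": True},
    "keyword": {"source": "Search Source", "schedule": "twice daily",
                "dedup": "UPDATE", "cleaning": True},
}
```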
Next Steps
- Search Sources - Detailed config for 4 search engines
- RSS Feeds - RSSHub advanced config
- Web & Email Sources - Automated monitoring