Source Management - Overview
What is a Data Source
A data source defines where content is collected from and what content to collect.
After you create a data source, the system automatically collects content according to its schedule strategy and stores it in the specified library.
8+ Data Source Types
| Type | Use Case | Update Frequency | Cost |
|---|---|---|---|
| 🔍 Search Sources | Actively search keywords | 1-2 times/day | Medium |
| 📡 RSS Feeds | Subscribe to website updates | Every 1-6 hours | Low |
| 📧 Email Sources | Monitor mailbox emails | Every hour | Low |
| 🌐 Web Scraping | Scrape specific pages | 1-2 times/day | Medium |
| 📢 Tender Announcements | Monitor government procurement | Once/day | Medium |
| 📰 Google News | Global news monitoring | Every 2-6 hours | Low |
[Placeholder: data source type cards]
How to Choose Data Source Type
Decision Flow
```
What is your content source?
├─ Specific site has RSS       → RSS Feed (most cost-effective)
├─ Need to search keywords     → Search Source
├─ Monitor email notifications → Email Source
├─ Scrape specific pages       → Web Scraping
├─ Government tender info      → Tender Source
└─ Global news monitoring      → Google News
```
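The decision flow above can be sketched as a small helper. This is purely illustrative: the function name, flags, and the returned type names mirror this page, not any actual API of the system.

```python
# Hypothetical helper mirroring the decision flow above.
# Flag names and return values are illustrative assumptions.
def pick_source_type(has_rss: bool = False,
                     keyword_search: bool = False,
                     email_alerts: bool = False,
                     specific_pages: bool = False,
                     tenders: bool = False) -> str:
    """Return the recommended data source type for a content need."""
    if has_rss:
        return "RSS Feed"       # most cost-effective when available
    if keyword_search:
        return "Search Source"
    if email_alerts:
        return "Email Source"
    if specific_pages:
        return "Web Scraping"
    if tenders:
        return "Tender Source"
    return "Google News"        # fallback for broad news monitoring
```

Note the ordering: RSS is checked first because it is the cheapest option whenever the target site offers a feed.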
Recommended Combinations
News Aggregation Scenario:
- RSS Feeds (main source, low cost)
- Google News (supplement, cover more regions)
- Search Sources (supplement specific keywords)
Tender Monitoring Scenario:
- Tender Sources (government platforms)
- Search Sources (company website tender pages)
[Placeholder: data source selection decision tree]
Common Configuration Options
1. Schedule Strategy
Interval Mode
- Execute every X hours
- Example: Every 6 hours (suitable for news)
Weekly Plan Mode
- Specific days + time
- Example: Mon/Wed/Fri 9:00 (suitable for periodic reports)
Manual Trigger
- No auto-execution
- Click "Execute Now" button to trigger
How to Choose:
- News/Real-time content → Interval mode (1-6 hours)
- Tender/Periodic updates → Weekly plan (fixed time daily)
- Temporary needs → Manual trigger
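The three schedule modes above can be sketched as one small config object. The field names (`mode`, `interval_hours`, `days`, `time`) are assumptions for illustration, not the system's real configuration schema.

```python
# Minimal sketch of the three schedule modes described above.
# Field names are assumptions, not the actual configuration schema.
from dataclasses import dataclass, field

@dataclass
class Schedule:
    mode: str                                 # "interval" | "weekly" | "manual"
    interval_hours: int = 0                   # interval mode: run every N hours
    days: list = field(default_factory=list)  # weekly mode: e.g. ["Mon", "Wed", "Fri"]
    time: str = ""                            # weekly mode: e.g. "09:00"

# News / real-time content: interval mode
news_schedule = Schedule(mode="interval", interval_hours=6)

# Periodic reports: weekly plan at a fixed time
report_schedule = Schedule(mode="weekly", days=["Mon", "Wed", "Fri"], time="09:00")

# Temporary needs: manual trigger only, no auto-execution
manual_schedule = Schedule(mode="manual")
```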
2. Deduplication Strategy
UPDATE (Default)
- When a duplicate URL is found, re-scrape and save the new version
- The old version is marked as expired (`isExpired=true`)
- Expired content is automatically filtered out when generating reports
Use Case:
- Need latest version (news updates, price changes)
KEEP_OLD
- When a duplicate URL is found, only log it; don't re-scrape
- The original content is kept
Use Case:
- Content won't update (RSS news, tender announcements)
- Save cost (avoid duplicate scraping)
Comparison:
| Strategy | Re-scrape | Cost | Use Case |
|---|---|---|---|
| UPDATE | ✅ Yes | High | Content updates |
| KEEP_OLD | ❌ No | Low | Content unchanged |
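The behavioral difference between the two strategies can be sketched as follows. The `isExpired` flag comes from the description above; the store structure and function signature are assumptions for illustration.

```python
# Illustrative sketch of the two deduplication strategies. The store maps
# each URL to a list of stored versions; the isExpired flag mirrors the
# behavior described above, the rest is assumed.
def ingest(store: dict, url: str, content: str, strategy: str = "UPDATE") -> bool:
    """Return True if the new content was saved (i.e. a re-scrape happened)."""
    versions = store.setdefault(url, [])
    if versions:
        if strategy == "KEEP_OLD":
            return False            # duplicate URL: log only, keep original
        for v in versions:          # UPDATE: mark old versions expired
            v["isExpired"] = True
    versions.append({"content": content, "isExpired": False})
    return True
```

With `KEEP_OLD`, a duplicate URL is a no-op, which is where the cost saving comes from; with `UPDATE`, every duplicate triggers a re-scrape plus a new stored version.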
3. Content Cleaning
Enable Cleaning
- Use LLM to extract title, summary, keywords
- Remove HTML tags and irrelevant content
- Extra cost: 10-20 credits per run
Disable Cleaning
- Keep original HTML content
- Clean later when needed (recommended)
How to Choose:
- Need structured data immediately → Enable
- Collect raw data first → Disable (can manually trigger cleaning later)
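The effect of the cleaning toggle can be sketched like this. `extract_with_llm` is a hypothetical stand-in for the real LLM cleaning step (here faked with a regex so the sketch is runnable); the stored-item shape is an assumption.

```python
import re

def extract_with_llm(html: str) -> dict:
    # Hypothetical stand-in for the LLM cleaning pass (~10-20 credits per run).
    # A real implementation would extract title, summary, and keywords.
    title = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": title.group(1) if title else "",
            "summary": "", "keywords": []}

def store_item(html: str, clean: bool) -> dict:
    if clean:
        # Cleaning enabled: structured fields available immediately
        return {"raw": html, **extract_with_llm(html)}
    # Cleaning disabled: keep original HTML, clean manually later
    return {"raw": html}
```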
4. Associated Libraries
Each data source can be associated with one or more libraries.
Scenarios:
- 1 source → 1 library (simple scenario)
- 1 source → multiple libraries (categorize by topic)
Example:
```
Data Source: "36Kr Tech News"
├─ Associated Library: "AI Industry News"
└─ Associated Library: "Startup Investment News"
```
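The one-source-to-many-libraries association above can be represented as a plain mapping. The names come from the example; the structure itself is an assumption, not the system's data model.

```python
# Illustrative source-to-library association (structure is an assumption).
associations = {
    "36Kr Tech News": ["AI Industry News", "Startup Investment News"],
}

def libraries_for(source: str) -> list:
    """Return all libraries that receive content from a given data source."""
    return associations.get(source, [])
```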
[Placeholder: configuration diagram]
Best Practices
✅ News Content
- Data Source: RSS Feed
- Schedule: Every 6 hours
- Dedup: KEEP_OLD (save cost)
- Cleaning: Disabled (raw content sufficient)
✅ Tender Content
- Data Source: Tender Source
- Schedule: Once daily
- Dedup: KEEP_OLD (no duplicate scraping)
- Cleaning: Enabled (extract key info)
✅ Keyword Monitoring
- Data Source: Search Source
- Schedule: 2 times daily
- Dedup: UPDATE (get latest)
- Cleaning: Enabled (structured data)
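The three presets above, collected into one illustrative config dict. The keys and values mirror the bullet points; the dict layout is an assumption, not an export of the system's settings.

```python
# The three best-practice presets above as one illustrative config
# (keys and layout are assumptions mirroring the bullets).
BEST_PRACTICES = {
    "news": {"source": "RSS Feed", "schedule": "every 6 hours",
             "dedup": "KEEP_OLD", "cleaning": False},
    "tender": {"source": "Tender Source", "schedule": "once daily",
               "dedup": "KEEP_OLD", "cleaning": True},
    "keyword": {"source": "Search Source", "schedule": "twice daily",
                "dedup": "UPDATE", "cleaning": True},
}
```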
Next Steps
- Search Sources - Detailed config for 4 search engines
- RSS Feeds - RSSHub advanced config
- Web & Email Sources - Automated monitoring