URL Deduplication Technology
Intelligent URL deduplication mechanism that avoids duplicate scraping, reduces costs, and maintains content version history. This is one of OctoReport's core technical advantages.
💡 Core Value: URL deduplication can save 70-90% of redundant scraping costs while ensuring data accuracy and consistency.
1. Why URL Deduplication Is Needed
1.1 Problem Scenarios
Scenario 1: Daily RSS Feed Execution
- RSS Feed returns the latest 20 articles
- Day 1: Collect 20 items (all new content)
- Day 2: Collect 20 items (18 of which were already collected yesterday)
- Problem: Without deduplication, the same URLs would be scraped repeatedly, wasting 18 API calls
Scenario 2: Periodic Search Source Execution
- Keywords: "AI Large Models"
- Executes every 6 hours
- Many popular articles repeatedly appear in search results
- Problem: Without deduplication, the same articles would be scraped multiple times and saved as multiple copies
1.2 Consequences of Not Deduplicating
| Consequence | Impact | Cost Increase |
|---|---|---|
| Duplicate Scraping | Wasted API calls (Firecrawl, Browserless) | 10-50 credits per scrape |
| Duplicate Cleaning | Wasted LLM tokens | 10-20 credits per item |
| Data Redundancy | Same content saved as multiple copies | Database bloat |
| Report Quality Decline | Duplicate content in reports | Poor user experience |
Cost Comparison Example:
Scenario: RSS feed, daily execution, 20 items per return
Cost Without Deduplication (daily):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 20 items × 2 credits = 40 credits (18 duplicates)
- Day 3: 20 items × 2 credits = 40 credits (18 duplicates)
- Total: 120 credits / 3 days = 40 credits/day
Cost With Deduplication (daily, using KEEP_OLD):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 2 items × 2 credits = 4 credits (skip 18 existing)
- Day 3: 2 items × 2 credits = 4 credits (skip 18 existing)
- Total: 48 credits / 3 days = 16 credits/day
Savings: (40 - 16) / 40 = 60%
2. Deduplication Strategies Explained
OctoReport provides two deduplication strategies for different scenarios.
2.1 UPDATE Strategy (Default)
How It Works:
- New URL Found → Scrape and save directly
- Duplicate URL + Scraping Success → Mark old content as "expired", save new version
- Duplicate URL + Scraping Failure → Update old content's collection time, don't save new content
Suitable Scenarios:
- ✅ Content gets updated (e.g., product prices, inventory info, news corrections)
- ✅ Need to retain historical versions (audit, comparative analysis)
- ✅ Willing to pay update costs (re-scraping requires API calls)
Advantages:
- Always get the latest version
- Retain historical versions (old content marked as "expired" but not deleted)
- Keep old content when scraping fails (fault tolerance)
Cost:
- Re-scrape every duplicate URL
- Higher cost (but ensures data is current)
2.2 KEEP_OLD Strategy
How It Works:
- New URL Found → Scrape and save
- Duplicate URL → Skip directly (no scraping, no saving), only update old content's collection time
Suitable Scenarios:
- ✅ Content doesn't update (e.g., news articles, RSS Feeds, historical documents)
- ✅ Only care about new content (don't care about updates to collected content)
- ✅ Cost-sensitive (want to save API calls)
Advantages:
- Avoid duplicate scraping (save 70-90% cost)
- Faster execution (no waiting for duplicate URL scraping)
- Suitable for high-volume URL scenarios (e.g., RSS Feed, news sources)
Considerations:
- ⚠️ Won't re-scrape even if content actually updated
- ⚠️ Cannot get the latest version of content
2.3 Strategy Comparison
| Feature | UPDATE (Default) | KEEP_OLD |
|---|---|---|
| Duplicate URL Handling | Re-scrape | Skip scraping |
| Content Versions | Keep all versions (old marked expired) | Keep only first version |
| Cost | High (scrape every time) | Low (scrape new URLs only) |
| Speed | Slow (wait for scraping) | Fast (skip scraping) |
| Suitable Scenario | Content gets updated | Content doesn't update |
| Recommendation | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (RSS/News) |
[Figure: UPDATE vs KEEP_OLD flow comparison, showing the different processing flows for the same URL]
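To make the two flows concrete, the sketch below walks a single URL through the decision logic from sections 2.1 and 2.2. It is a minimal illustration, not OctoReport's actual implementation; every function it calls (`findByUrl`, `scrape`, `saveContent`, `markExpired`, `touchCollectedAt`) is a hypothetical stand-in for the real scraping and data layers:

```typescript
// Minimal sketch of the per-URL deduplication decision. All declared
// functions are hypothetical placeholders, not OctoReport's actual API.
type Strategy = "UPDATE" | "KEEP_OLD";
interface ContentRecord { id: string; sourceUrl: string; isExpired: boolean; }

declare function findByUrl(url: string): Promise<ContentRecord | null>;
declare function scrape(url: string): Promise<string>;
declare function saveContent(url: string, body: string): Promise<void>;
declare function markExpired(id: string): Promise<void>;      // isExpired=true, expiredAt=now
declare function touchCollectedAt(id: string): Promise<void>; // refresh collection time

async function processUrl(url: string, strategy: Strategy): Promise<void> {
  const existing = await findByUrl(url);

  // New URL: both strategies scrape and save.
  if (!existing) {
    await saveContent(url, await scrape(url));
    return;
  }

  // KEEP_OLD: skip duplicates entirely; only refresh collectedAt.
  if (strategy === "KEEP_OLD") {
    await touchCollectedAt(existing.id);
    return;
  }

  // UPDATE: re-scrape; expire the old version only if scraping succeeds.
  try {
    const body = await scrape(url);
    await markExpired(existing.id);
    await saveContent(url, body);
  } catch {
    // Fault tolerance: scraping failed, so keep the old content and
    // just update its collection time.
    await touchCollectedAt(existing.id);
  }
}
```

Note that the KEEP_OLD skip path and the UPDATE failure path end in the same place: the old content is kept and only its collection time is refreshed.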
3. Deduplication Timeline Examples
3.1 UPDATE Strategy Example
Scenario: Monitor product prices (need latest price)
Timeline:
Day 1, 9:00 AM - First Collection
- URL: https://example.com/product/123
- Price: $99
- Action: Save content A
- Cost: 10 credits
Day 2, 9:00 AM - Second Collection
- URL: https://example.com/product/123 (duplicate)
- Price: $89 (price dropped!)
- Actions:
  1. Re-scrape (success)
  2. Mark content A as "expired"
  3. Save content B (new version)
- Cost: 10 credits
Day 3, 9:00 AM - Third Collection
- URL: https://example.com/product/123 (duplicate)
- Scraping failed (site maintenance)
- Actions:
  1. Attempt scraping (failed)
  2. Update content B's collection time
  3. Don't save new content
- Cost: 0 credits (failed scraping not charged)
Result:
- 2 items in library (A expired, B not expired)
- Report generation only uses content B (latest price $89)
- Historical version still available (content A, price $99)
3.2 KEEP_OLD Strategy Example
Scenario: Subscribe to news RSS (content doesn't update)
Timeline:
Day 1, 9:00 AM - First Collection
- RSS returns 20 articles
- All new URLs
- Action: Save 20 items
- Cost: 20 × 2 credits = 40 credits
Day 2, 9:00 AM - Second Collection
- RSS returns 20 articles: 18 from yesterday (duplicates), 2 new articles
- Actions:
  1. Detect 18 duplicate URLs
  2. Skip these 18 (no scraping)
  3. Only scrape the 2 new articles
  4. Update the 18 old items' collection time
- Cost: 2 × 2 credits = 4 credits
Day 3, 9:00 AM - Third Collection
- RSS returns 20 articles: 18 from 2 days ago, 2 from yesterday, 0 new articles
- Actions:
  1. Detect 20 duplicate URLs
  2. Skip all (no scraping)
  3. Update these 20 items' collection time
- Cost: 0 credits
Result:
- 20 items in library (all not expired)
- 3-day total cost: 44 credits
- If using UPDATE: 120 credits
- Savings: 63%
4. Technical Implementation Details
4.1 URL Uniqueness Identification
Identification Method:
- Use complete URL as unique identifier
- URL normalization (remove trailing slash, unify protocol)
- Database index optimization (fast duplicate URL queries)
Special Handling:
- Query Parameters: Tracking parameters like `?utm_source=xxx` are preserved (different parameters are treated as different URLs)
- Fragment: Anchors like `#section` are removed (treated as the same URL)
- Case: Insensitive (`Example.com` and `example.com` are treated as the same)
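As an illustration of these rules, here is a minimal normalization sketch built on the standard WHATWG `URL` class; OctoReport's real normalizer is not shown here, and the protocol unification mentioned above is omitted for brevity:

```typescript
// Minimal sketch of the normalization rules above, using the standard
// URL class. Not OctoReport's actual implementation.
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the URL parser already lowercases the hostname
  url.hash = "";            // drop fragments: #section is treated as the same URL
  // Remove a trailing slash from the path (keep the root "/" as-is).
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  // Query parameters such as ?utm_source=xxx are deliberately preserved:
  // different parameters are treated as different URLs.
  return url.toString();
}

// normalizeUrl("https://Example.com/page/#section") -> "https://example.com/page"
// normalizeUrl("https://example.com/page?id=1")     -> unchanged (query kept)
```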
4.2 Version Management Mechanism
Content Fields:
- `isExpired`: Whether expired (boolean)
- `expiredAt`: Expiration time (timestamp)
- `collectedAt`: Last collection time (timestamp)
Version Management Flow:
- Detect Duplicate: Query database by URL
- Mark Expired (UPDATE strategy): Set `isExpired=true`, `expiredAt=now`
- Save New Version: Create a new record with `isExpired=false`
- Report Filtering: Queries automatically filter out `isExpired=true` content
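A sketch of what the expire-and-insert step might look like, wrapped in a single transaction so a reader never observes zero or two live versions of the same URL. It uses Prisma-style calls purely as an illustration and assumes a hypothetical `content` model with the fields above plus a `body` column:

```typescript
// Sketch of the UPDATE-strategy version swap inside one transaction.
// The `content` model and its fields are illustrative assumptions.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function saveNewVersion(sourceUrl: string, body: string): Promise<void> {
  await prisma.$transaction([
    // 1. Mark every live version of this URL as expired.
    prisma.content.updateMany({
      where: { sourceUrl, isExpired: false },
      data: { isExpired: true, expiredAt: new Date() },
    }),
    // 2. Insert the new version as the only live record.
    prisma.content.create({
      data: { sourceUrl, body, isExpired: false, collectedAt: new Date() },
    }),
  ]);
}

// Report generation then only reads live versions:
// prisma.content.findMany({ where: { isExpired: false } })
```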
4.3 Performance Optimization
Batch Detection:
- Detect multiple URLs in one query (reduce database round trips)
- Use an `IN` query (not individual queries)
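For example, a batch check might look like the following, again assuming the hypothetical Prisma `content` model from above; Prisma's `in` filter compiles to a single SQL `IN` clause, which the composite index described below can serve efficiently:

```typescript
// Sketch of batch duplicate detection: one IN query instead of N lookups.
import { PrismaClient } from "@prisma/client";

async function findExistingUrls(
  prisma: PrismaClient,
  sourceId: string,
  urls: string[],
): Promise<Set<string>> {
  const rows = await prisma.content.findMany({
    where: { sourceId, sourceUrl: { in: urls } },
    select: { sourceUrl: true },
  });
  return new Set(rows.map((r) => r.sourceUrl));
}

// Usage: partition a batch into new vs. existing URLs in one round trip.
// const existing = await findExistingUrls(prisma, sourceId, candidateUrls);
// const newUrls = candidateUrls.filter((u) => !existing.has(u));
```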
Index Optimization:
- Add index on the `sourceUrl` field
- Add index on the `isExpired` field
- Composite index: `(sourceId, sourceUrl, isExpired)`
Concurrency Control:
- Use database transactions (avoid race conditions)
- Optimistic locking mechanism (version number control)
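The optimistic-locking step could be sketched as a conditional update on a version column; the `version` field here is a hypothetical addition to the `content` model, shown only to illustrate the technique:

```typescript
// Sketch of optimistic locking via a version column: the update only
// matches if the row still has the version we read, so concurrent
// writers cannot silently overwrite each other.
import { PrismaClient } from "@prisma/client";

async function expireWithOptimisticLock(
  prisma: PrismaClient,
  id: string,
  expectedVersion: number,
): Promise<boolean> {
  const result = await prisma.content.updateMany({
    where: { id, version: expectedVersion }, // no-op if someone got there first
    data: {
      isExpired: true,
      expiredAt: new Date(),
      version: { increment: 1 },
    },
  });
  return result.count === 1; // false => lost the race; re-read and retry
}
```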
[Figure: Deduplication flow architecture, showing the complete flow from URL detection to content saving]
5. Practical Application Recommendations
5.1 How to Choose Strategy
Decision Tree:
Ask yourself: Will this URL's content be updated?
- Will update:
  - Price, inventory, ratings, etc. → Use UPDATE
  - News corrections, article revisions → Use UPDATE
  - Need to keep historical versions → Use UPDATE
- Won't update:
  - RSS Feed articles → Use KEEP_OLD ⭐
  - Social media posts → Use KEEP_OLD ⭐
  - Historical docs, archives → Use KEEP_OLD ⭐
  - News releases (fixed content) → Use KEEP_OLD ⭐
- Uncertain:
  - Start with UPDATE, adjust after observing for a few days
5.2 Strategy Switching
How to Switch:
- Edit data source configuration
- Modify the `update_strategy` field (`UPDATE` or `KEEP_OLD`)
- Save configuration
- Takes effect on next execution
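For illustration only, a data source configuration might look like this; apart from `update_strategy`, the fields are hypothetical placeholders rather than OctoReport's documented schema:

```typescript
// Hypothetical data source configuration showing the update_strategy field.
const dataSourceConfig = {
  name: "AI News RSS",
  type: "rss",
  url: "https://example.com/feed.xml",
  update_strategy: "KEEP_OLD", // or "UPDATE" (the default)
};
```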
Switching Impact:
- UPDATE → KEEP_OLD: Won't re-scrape existing URLs subsequently (cost reduction)
- KEEP_OLD → UPDATE: Will re-scrape all URLs subsequently (including existing ones)
- Already Saved Content: Not affected (strategy only affects subsequent executions)
5.3 View Deduplication Results
Task Logs:
- View "Task Logs" (Sidebar → Task Logs)
- Check the data source task's `message` field
- Shows: "Found X new URLs, skipped Y existing URLs"
Content Library:
- View library content (Library Management → View Content)
- Filter "expired" content (if using UPDATE strategy)
- Compare different versions of content
6. Frequently Asked Questions
Q1: Does KEEP_OLD strategy completely skip duplicate URLs?
A: Yes. When using the KEEP_OLD strategy, duplicate URLs are completely skipped (no scraping, no saving); only the existing content's `collectedAt` timestamp is updated.
Q2: Does UPDATE strategy keep all historical versions?
A: Yes. Old versions are marked as "expired" (`isExpired=true`) but not deleted, so historical versions can still be viewed in the content library.
Q3: How to clean up expired content?
A: The system doesn't automatically clean up expired content. If cleanup is needed:
- Manually delete expired content in content library
- Contact admin for batch cleanup (use with caution, irreversible)
Q4: Which strategy should RSS feeds use?
A: Strongly recommend using KEEP_OLD strategy. Reasons:
- RSS Feed article content typically doesn't update
- Only need to get newly published articles
- Can save 70-90% of costs
Q5: Is deduplication based on entire URL or just domain?
A: Based on complete URL (including path and query parameters). For example:
- `https://example.com/page?id=1` and `https://example.com/page?id=2` are treated as different URLs
- `https://example.com/page` and `https://example.com/page/` are treated as the same URL (trailing slash removed)
Next Steps
- Atomic Billing Mechanism - Learn about credit deduction reliability guarantees
- System Reliability - Learn about failover and health check mechanisms
- Configuration Tips - Optimize deduplication strategy selection