URL Deduplication Technology

OctoReport's intelligent URL deduplication mechanism avoids duplicate scraping, reduces costs, and maintains a content version history. It is one of OctoReport's core technical advantages.

💡 Core Value: URL deduplication can save 70-90% of redundant scraping costs while ensuring data accuracy and consistency.

1. Why URL Deduplication Is Needed

1.1 Problem Scenarios

Scenario 1: Daily RSS Feed Execution

  • RSS Feed returns the latest 20 articles
  • Day 1: Collect 20 items (all new content)
  • Day 2: Collect 20 items (18 of which were already collected yesterday)
  • Problem: Without deduplication, the same URLs would be scraped repeatedly, wasting 18 API calls

Scenario 2: Periodic Search Source Execution

  • Keywords: "AI Large Models"
  • Executes every 6 hours
  • Many popular articles repeatedly appear in search results
  • Problem: Without deduplication, the same articles would be scraped multiple times and saved as multiple copies

1.2 Consequences of Not Deduplicating

| Consequence | Impact | Cost Increase |
| --- | --- | --- |
| Duplicate Scraping | Wasted API calls (Firecrawl, Browserless) | 10-50 credits each time |
| Duplicate Cleaning | Wasted LLM tokens | 10-20 credits each time |
| Data Redundancy | Same content saved as multiple copies | Database bloat |
| Report Quality Decline | Duplicate content in reports | Poor user experience |

Cost Comparison Example:

Scenario: RSS feed, daily execution, 20 items returned per run

Cost Without Deduplication (daily):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 20 items × 2 credits = 40 credits (18 duplicates)
- Day 3: 20 items × 2 credits = 40 credits (18 duplicates)
- Total: 120 credits/3 days = 40 credits/day

Cost With Deduplication (daily, using KEEP_OLD):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 2 items × 2 credits = 4 credits (skip 18 existing)
- Day 3: 2 items × 2 credits = 4 credits (skip 18 existing)
- Total: 48 credits/3 days = 16 credits/day

Savings: (40 - 16) / 40 = 60%

2. Deduplication Strategies Explained

OctoReport provides two deduplication strategies for different scenarios.

2.1 UPDATE Strategy (Default)

How It Works:

  1. New URL Found → Scrape and save directly
  2. Duplicate URL + Scraping Success → Mark old content as "expired", save new version
  3. Duplicate URL + Scraping Failure → Update old content's collection time, don't save new content
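
As a rough sketch (not OctoReport's actual source; the Store interface and helper names like findByUrl and markExpired are hypothetical), the UPDATE flow looks like this in TypeScript:

```typescript
// Minimal sketch of UPDATE-strategy handling. All names are hypothetical.
interface ScrapeResult { ok: boolean; content?: string }
interface ContentRecord { id: string }

interface Store {
  findByUrl(url: string): Promise<ContentRecord | null>;
  saveContent(url: string, content: string): Promise<void>;
  markExpired(id: string): Promise<void>;       // sets isExpired=true, expiredAt=now
  touchCollectedAt(id: string): Promise<void>;  // refreshes collectedAt only
}

async function handleWithUpdate(
  store: Store,
  scrape: (url: string) => Promise<ScrapeResult>,
  url: string,
): Promise<void> {
  const existing = await store.findByUrl(url);
  const result = await scrape(url);             // duplicates are re-scraped under UPDATE
  if (result.ok) {
    if (existing) await store.markExpired(existing.id); // old version kept, marked "expired"
    await store.saveContent(url, result.content!);      // save the new version
  } else if (existing) {
    await store.touchCollectedAt(existing.id);          // scrape failed: keep old content
  }
}
```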

Suitable Scenarios:

  • ✅ Content gets updated (e.g., product prices, inventory info, news corrections)
  • ✅ Need to retain historical versions (audit, comparative analysis)
  • ✅ Willing to pay update costs (re-scraping requires API calls)

Advantages:

  • Always get the latest version
  • Retain historical versions (old content marked as "expired" but not deleted)
  • Keep old content when scraping fails (fault tolerance)

Cost:

  • Re-scrape every duplicate URL
  • Higher cost (but ensures data is current)

2.2 KEEP_OLD Strategy

How It Works:

  1. New URL Found → Scrape and save
  2. Duplicate URL → Skip directly (no scraping, no saving), only update old content's collection time
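
KEEP_OLD short-circuits before any scraping happens. Continuing the hypothetical Store and ScrapeResult shapes from the UPDATE sketch above:

```typescript
// Minimal sketch of KEEP_OLD-strategy handling (hypothetical names).
async function handleWithKeepOld(
  store: Store,
  scrape: (url: string) => Promise<ScrapeResult>,
  url: string,
): Promise<void> {
  const existing = await store.findByUrl(url);
  if (existing) {
    await store.touchCollectedAt(existing.id); // duplicate: no API call, no new record
    return;
  }
  const result = await scrape(url);            // only genuinely new URLs are scraped
  if (result.ok) await store.saveContent(url, result.content!);
}
```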

Suitable Scenarios:

  • ✅ Content doesn't update (e.g., news articles, RSS Feeds, historical documents)
  • ✅ Only care about new content (don't care about updates to collected content)
  • ✅ Cost-sensitive (want to save API calls)

Advantages:

  • Avoid duplicate scraping (save 70-90% cost)
  • Faster execution (no waiting for duplicate URL scraping)
  • Suitable for high-volume URL scenarios (e.g., RSS Feed, news sources)

Considerations:

  • ⚠️ Won't re-scrape even if the content has actually been updated
  • ⚠️ Cannot get the latest version of the content

2.3 Strategy Comparison

| Feature | UPDATE (Default) | KEEP_OLD |
| --- | --- | --- |
| Duplicate URL Handling | Re-scrape | Skip scraping |
| Content Versions | Keep all versions (old marked expired) | Keep only first version |
| Cost | High (scrape every time) | Low (scrape new URLs only) |
| Speed | Slow (wait for scraping) | Fast (skip scraping) |
| Suitable Scenario | Content gets updated | Content doesn't update |
| Recommendation | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ (RSS/News) |

[Figure placeholder: UPDATE vs KEEP_OLD flow comparison - the different processing flows for the same URL]

3. Deduplication Timeline Examples

3.1 UPDATE Strategy Example

Scenario: Monitor product prices (need latest price)

Timeline:

Day 1, 9:00 AM - First Collection
• URL: https://example.com/product/123
• Price: $99
• Action: Save content A
• Cost: 10 credits

Day 2, 9:00 AM - Second Collection
• URL: https://example.com/product/123 (duplicate)
• Price: $89 (price dropped!)
• Actions:
  1. Re-scrape (success)
  2. Mark content A as "expired"
  3. Save content B (new version)
• Cost: 10 credits

Day 3, 9:00 AM - Third Collection
• URL: https://example.com/product/123 (duplicate)
• Scraping failed (site maintenance)
• Actions:
  1. Attempt scraping (failed)
  2. Update content B's collection time
  3. Don't save new content
• Cost: 0 credits (failed scraping not charged)

Result:
• 2 items in library (A expired, B not expired)
• Report generation only uses content B (latest price $89)
• Can view historical version (content A, price $99)

3.2 KEEP_OLD Strategy Example

Scenario: Subscribe to news RSS (content doesn't update)

Timeline:

Day 1, 9:00 AM - First Collection
• RSS returns 20 articles
• All new URLs
• Action: Save 20 items
• Cost: 20 × 2 credits = 40 credits

Day 2, 9:00 AM - Second Collection
• RSS returns 20 articles
  - 18 from yesterday (duplicates)
  - 2 new articles
• Actions:
  1. Detect 18 duplicate URLs
  2. Skip these 18 (no scraping)
  3. Only scrape 2 new articles
  4. Update old 18 items' collection time
• Cost: 2 × 2 credits = 4 credits

Day 3, 9:00 AM - Third Collection
• RSS returns 20 articles
  - 18 from 2 days ago
  - 2 from yesterday
  - 0 new articles
• Actions:
  1. Detect 20 duplicate URLs
  2. Skip all (no scraping)
  3. Update these 20 items' collection time
• Cost: 0 credits

Result:
• 20 items in library (none expired)
• 3-day total cost: 44 credits
• If using UPDATE: 120 credits
• Savings: 63%

4. Technical Implementation Details

4.1 URL Uniqueness Identification

Identification Method:

  • Use complete URL as unique identifier
  • URL normalization (remove trailing slash, unify protocol)
  • Database index optimization (fast duplicate URL queries)

Special Handling:

  • Query Parameters: Tracking parameters like ?utm_source=xxx are preserved (different parameters are treated as different URLs)
  • Fragment: Anchors like #section are removed (treated as the same URL)
  • Case: Insensitive (Example.com and example.com are treated as the same URL)
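
These rules can be approximated with Node's built-in WHATWG URL parser. The sketch below illustrates the rules listed above; it is not OctoReport's exact implementation, and protocol unification is omitted:

```typescript
// Approximation of the normalization rules described above.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  u.hash = "";                                    // drop #section fragments
  u.hostname = u.hostname.toLowerCase();          // case-insensitive host
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);         // strip trailing slash
  }
  return u.toString();                            // ?utm_source=... query params are kept
}

// normalizeUrl("https://Example.com/page/")     === normalizeUrl("https://example.com/page")
// normalizeUrl("https://example.com/page?id=1") !== normalizeUrl("https://example.com/page?id=2")
```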

4.2 Version Management Mechanism

Content Fields:

  • isExpired: whether the content is expired (boolean)
  • expiredAt: expiration time (timestamp)
  • collectedAt: last collection time (timestamp)

Version Management Flow:

  1. Detect Duplicate: Query the database by URL
  2. Mark Expired (UPDATE strategy): Set isExpired=true, expiredAt=now
  3. Save New Version: Create a new record with isExpired=false
  4. Report Filtering: Queries automatically filter out isExpired=true content
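
Steps 2 and 3 must happen atomically so a report query never sees zero or two current versions. Here is a sketch using Postgres-flavored SQL; the contents table name and the TxDb/Tx client API are assumptions, and only the field names come from these docs:

```typescript
// Sketch of the UPDATE-strategy version swap as one transaction.
interface Tx { execute(sql: string, params?: unknown[]): Promise<void> }
interface TxDb { transaction(fn: (tx: Tx) => Promise<void>): Promise<void> }

async function saveNewVersion(db: TxDb, url: string, content: string): Promise<void> {
  await db.transaction(async (tx) => {
    // step 2: mark the current version as expired
    await tx.execute(
      `UPDATE contents SET "isExpired" = true, "expiredAt" = now()
        WHERE "sourceUrl" = $1 AND "isExpired" = false`,
      [url],
    );
    // step 3: insert the new, non-expired version
    await tx.execute(
      `INSERT INTO contents ("sourceUrl", "content", "isExpired", "collectedAt")
       VALUES ($1, $2, false, now())`,
      [url, content],
    );
  });
}
// step 4: report queries add ... WHERE "isExpired" = false
```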

4.3 Performance Optimization

Batch Detection:

  • Detect multiple URLs in one query (reduces database round trips)
  • Use an IN query rather than individual per-URL queries
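
For illustration, batch detection might look like the following (the contents table and the query client are assumptions; the one-IN-query approach is what these docs describe):

```typescript
// One IN query for N URLs instead of N single-row lookups.
interface QueryDb {
  query(sql: string, params?: unknown[]): Promise<Array<{ sourceUrl: string }>>;
}

async function findExistingUrls(db: QueryDb, urls: string[]): Promise<Set<string>> {
  if (urls.length === 0) return new Set();
  const placeholders = urls.map((_, i) => `$${i + 1}`).join(", "); // $1, $2, ...
  const rows = await db.query(
    `SELECT "sourceUrl" FROM contents WHERE "sourceUrl" IN (${placeholders})`,
    urls,
  );
  return new Set(rows.map((r) => r.sourceUrl)); // any URL not in this set is new
}
```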

Index Optimization:

  • Add an index on the sourceUrl field
  • Add an index on the isExpired field
  • Composite index: (sourceId, sourceUrl, isExpired)
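
In Postgres-flavored SQL, those indexes might be declared like this (index and table names are hypothetical; only the column names come from these docs):

```typescript
// Hypothetical DDL for the indexes described above.
const dedupIndexes: string[] = [
  `CREATE INDEX IF NOT EXISTS idx_contents_source_url ON contents ("sourceUrl")`,
  `CREATE INDEX IF NOT EXISTS idx_contents_is_expired ON contents ("isExpired")`,
  // composite index covering the dedup lookup path
  `CREATE INDEX IF NOT EXISTS idx_contents_dedup
      ON contents ("sourceId", "sourceUrl", "isExpired")`,
];
```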

Concurrency Control:

  • Use database transactions (avoid race conditions)
  • Optimistic locking mechanism (version number control)
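
A common shape for version-number optimistic locking, sketched under the assumption of a numeric version column (the docs state the mechanism, not the schema):

```typescript
// Optimistic locking: the UPDATE succeeds only if no concurrent writer bumped `version` first.
interface ExecDb { execute(sql: string, params?: unknown[]): Promise<number> } // returns affected row count

async function markExpiredOptimistic(db: ExecDb, id: string, expectedVersion: number): Promise<boolean> {
  const affected = await db.execute(
    `UPDATE contents
        SET "isExpired" = true, "expiredAt" = now(), "version" = "version" + 1
      WHERE "id" = $1 AND "version" = $2`,
    [id, expectedVersion],
  );
  return affected === 1; // false: another worker won the race; re-read and retry
}
```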

[Figure placeholder: Deduplication flow architecture - the complete flow from URL detection to content saving]

5. Practical Application Recommendations

5.1 How to Choose Strategy

Decision Tree:

Ask yourself: Will this URL's content be updated?

Will Update
  ├─ Price, inventory, ratings, etc. → Use UPDATE
  ├─ News corrections, article revisions → Use UPDATE
  └─ Need to keep historical versions → Use UPDATE

Won't Update
  ├─ RSS Feed articles → Use KEEP_OLD ⭐
  ├─ Social media posts → Use KEEP_OLD ⭐
  ├─ Historical docs, archives → Use KEEP_OLD ⭐
  └─ News releases (fixed content) → Use KEEP_OLD ⭐

Uncertain
  └─ Start with UPDATE, adjust after observing for a few days

5.2 Strategy Switching

How to Switch:

  1. Edit data source configuration
  2. Modify the update_strategy field (UPDATE or KEEP_OLD)
  3. Save configuration
  4. Takes effect on next execution
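
As a rough illustration, a data source configuration carrying this field might look like the object below. Only update_strategy and its two values come from these docs; every other field is invented for the example:

```typescript
type UpdateStrategy = "UPDATE" | "KEEP_OLD";

// Hypothetical data source configuration; only update_strategy is documented.
const rssSource: { name: string; type: string; update_strategy: UpdateStrategy } = {
  name: "Tech News RSS",
  type: "rss",
  update_strategy: "KEEP_OLD", // change to "UPDATE" and save; applies on the next run
};
```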

Switching Impact:

  • UPDATE → KEEP_OLD: Won't re-scrape existing URLs subsequently (cost reduction)
  • KEEP_OLD → UPDATE: Will re-scrape all URLs subsequently (including existing ones)
  • Already Saved Content: Not affected (strategy only affects subsequent executions)

5.3 View Deduplication Results

Task Logs:

  • View "Task Logs" (Sidebar → Task Logs)
  • Check the data source task's message field
  • Shows: "Found X new URLs, skipped Y existing URLs"

Content Library:

  • View library content (Library Management → View Content)
  • Filter "expired" content (if using UPDATE strategy)
  • Compare different versions of content

6. Frequently Asked Questions

Q1: Does KEEP_OLD strategy completely skip duplicate URLs?

A: Yes. With the KEEP_OLD strategy, duplicate URLs are skipped entirely (no scraping, no saving); only the collectedAt timestamp of the old content is updated. This significantly reduces costs.

Q2: Does UPDATE strategy keep all historical versions?

A: Yes. Old versions are marked as "expired" (isExpired=true) but not deleted. You can filter for "expired" content in the content library to view historical versions.

Q3: How to clean up expired content?

A: The system doesn't automatically clean up expired content. If cleanup is needed:

  • Manually delete expired content in content library
  • Contact admin for batch cleanup (use with caution, irreversible)

Q4: Which strategy should RSS feeds use?

A: Strongly recommend using KEEP_OLD strategy. Reasons:

  • RSS Feed article content typically doesn't update
  • Only need to get newly published articles
  • Can save 70-90% of costs

Q5: Is deduplication based on entire URL or just domain?

A: Deduplication is based on the complete URL (including path and query parameters). For example:

  • https://example.com/page?id=1 and https://example.com/page?id=2 are treated as different URLs
  • https://example.com/page and https://example.com/page/ are treated as the same URL (the trailing slash is removed)

Next Steps

  • Atomic Billing Mechanism - Learn about credit deduction reliability guarantees
  • System Reliability - Learn about failover and health check mechanisms
  • Configuration Tips - Optimize deduplication strategy selection