URL Deduplication Technology
Intelligent URL deduplication mechanism that avoids duplicate scraping, reduces costs, and maintains content version history. This is one of OctoReport's core technical advantages.
💡 Core Value: URL deduplication can save 70-90% of redundant scraping costs while ensuring data accuracy and consistency.
1. Why URL Deduplication Is Needed
1.1 Problem Scenarios
Scenario 1: Daily RSS Feed Execution
- RSS Feed returns the latest 20 articles
- Day 1: Collect 20 items (all new content)
- Day 2: Collect 20 items (18 of which were already collected yesterday)
- Problem: Without deduplication, the same URLs would be scraped repeatedly, wasting 18 API calls
Scenario 2: Periodic Search Source Execution
- Keywords: "AI Large Models"
- Executes every 6 hours
- Many popular articles repeatedly appear in search results
- Problem: Without deduplication, the same articles would be scraped multiple times and saved as multiple copies
1.2 Consequences of Not Deduplicating
| Consequence | Impact | Cost Increase |
|---|---|---|
| Duplicate Scraping | Wasted API calls (Firecrawl, Browserless) | 10-50 credits per scrape |
| Duplicate Cleaning | Wasted LLM tokens | 10-20 credits per item |
| Data Redundancy | Same content saved as multiple copies | Database bloat |
| Report Quality Decline | Duplicate content in reports | Poor user experience |
Cost Comparison Example:
Scenario: RSS feed, daily execution, 20 items per return
Cost Without Deduplication (daily):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 20 items × 2 credits = 40 credits (18 duplicates)
- Day 3: 20 items × 2 credits = 40 credits (18 duplicates)
- Total: 120 credits / 3 days = 40 credits/day
Cost With Deduplication (daily, using KEEP_OLD):
- Day 1: 20 items × 2 credits = 40 credits
- Day 2: 2 items × 2 credits = 4 credits (skip 18 existing)
- Day 3: 2 items × 2 credits = 4 credits (skip 18 existing)
- Total: 48 credits / 3 days = 16 credits/day
Savings: (40 - 16) / 40 = 60%
2. Deduplication Strategies Explained
OctoReport provides two deduplication strategies for different scenarios.
2.1 UPDATE Strategy (Default)
How It Works:
- New URL Found → Scrape and save directly
- Duplicate URL + Scraping Success → Mark old content as "expired", save new version
- Duplicate URL + Scraping Failure → Update old content's collection time, don't save new content
Suitable Scenarios:
- ✅ Content gets updated (e.g., product prices, inventory info, news corrections)
- ✅ Need to retain historical versions (audit, comparative analysis)
- ✅ Willing to pay update costs (re-scraping requires API calls)
Advantages:
- Always get the latest version
- Retain historical versions (old content marked as "expired" but not deleted)
- Keep old content when scraping fails (fault tolerance)
Cost:
- Re-scrape every duplicate URL
- Higher cost (but ensures data is current)
2.2 KEEP_OLD Strategy
How It Works:
- New URL Found → Scrape and save
- Duplicate URL → Skip directly (no scraping, no saving), only update old content's collection time
Suitable Scenarios:
- ✅ Content doesn't update (e.g., news articles, RSS Feeds, historical documents)
- ✅ Only care about new content (don't care about updates to collected content)
- ✅ Cost-sensitive (want to save API calls)
Advantages:
- Avoid duplicate scraping (save 70-90% cost)
- Faster execution (no waiting for duplicate URL scraping)
- Suitable for high-volume URL scenarios (e.g., RSS Feed, news sources)
Considerations:
- ⚠️ Won't re-scrape even if content actually updated
- ⚠️ Cannot get the latest version of content
2.3 Strategy Comparison
| Feature | UPDATE (Default) | KEEP_OLD |
|---|---|---|
| Duplicate URL Handling | Re-scrape | Skip scraping |
| Content Versions | Keep all versions (old marked expired) | Keep only first version |
| Cost | High (scrape every time) | Low (scrape new URLs only) |
| Speed | Slow (wait for scraping) | Fast (skip scraping) |
| Suitable Scenario | Content gets updated | Content doesn't update |
| Recommendation | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ (RSS/News) |
[Figure: UPDATE vs KEEP_OLD flow comparison, showing the different processing flows for the same URL]
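To make the two flows concrete, the sketch below walks a single URL through the decision logic from sections 2.1 and 2.2. It is a minimal illustration, not OctoReport's actual implementation; every function it calls (`findByUrl`, `scrape`, `saveContent`, `markExpired`, `touchCollectedAt`) is a hypothetical stand-in for the real scraping and data layers:

```typescript
// Minimal sketch of the per-URL deduplication decision. All declared
// functions are hypothetical placeholders, not OctoReport's actual API.
type Strategy = "UPDATE" | "KEEP_OLD";
interface ContentRecord { id: string; sourceUrl: string; isExpired: boolean; }

declare function findByUrl(url: string): Promise<ContentRecord | null>;
declare function scrape(url: string): Promise<string>;
declare function saveContent(url: string, body: string): Promise<void>;
declare function markExpired(id: string): Promise<void>;      // isExpired=true, expiredAt=now
declare function touchCollectedAt(id: string): Promise<void>; // refresh collection time

async function processUrl(url: string, strategy: Strategy): Promise<void> {
  const existing = await findByUrl(url);

  // New URL: both strategies scrape and save.
  if (!existing) {
    await saveContent(url, await scrape(url));
    return;
  }

  // KEEP_OLD: skip duplicates entirely; only refresh collectedAt.
  if (strategy === "KEEP_OLD") {
    await touchCollectedAt(existing.id);
    return;
  }

  // UPDATE: re-scrape; expire the old version only if scraping succeeds.
  try {
    const body = await scrape(url);
    await markExpired(existing.id);
    await saveContent(url, body);
  } catch {
    // Fault tolerance: scraping failed, so keep the old content and
    // just update its collection time.
    await touchCollectedAt(existing.id);
  }
}
```

Note that the KEEP_OLD skip path and the UPDATE failure path end in the same place: the old content is kept and only its collection time is refreshed.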
3. Deduplication Timeline Examples
3.1 UPDATE Strategy Example
Scenario: Monitor product prices (need latest price)
Timeline:
Day 1, 9:00 AM - First Collection
- URL: https://example.com/product/123
- Price: $99
- Action: Save content A
- Cost: 10 credits
Day 2, 9:00 AM - Second Collection
- URL: https://example.com/product/123 (duplicate)
- Price: $89 (price dropped!)
- Actions:
  1. Re-scrape (success)
  2. Mark content A as "expired"
  3. Save content B (new version)
- Cost: 10 credits
Day 3, 9:00 AM - Third Collection
- URL: https://example.com/product/123 (duplicate)
- Scraping failed (site maintenance)
- Actions:
  1. Attempt scraping (failed)
  2. Update content B's collection time
  3. Don't save new content
- Cost: 0 credits (failed scraping not charged)
Result:
- 2 items in library (A expired, B not expired)
- Report generation only uses content B (latest price $89)
- Historical version still available (content A, price $99)
3.2 KEEP_OLD Strategy Example
Scenario: Subscribe to news RSS (content doesn't update)
Timeline:
Day 1, 9:00 AM - First Collection
- RSS returns 20 articles
- All new URLs
- Action: Save 20 items
- Cost: 20 × 2 credits = 40 credits
Day 2, 9:00 AM - Second Collection
- RSS returns 20 articles: 18 from yesterday (duplicates), 2 new articles
- Actions:
  1. Detect 18 duplicate URLs
  2. Skip these 18 (no scraping)
  3. Only scrape the 2 new articles
  4. Update the 18 old items' collection time
- Cost: 2 × 2 credits = 4 credits
Day 3, 9:00 AM - Third Collection
- RSS returns 20 articles: 18 from 2 days ago, 2 from yesterday, 0 new articles
- Actions:
  1. Detect 20 duplicate URLs
  2. Skip all (no scraping)
  3. Update these 20 items' collection time
- Cost: 0 credits
Result:
- 20 items in library (all not expired)
- 3-day total cost: 44 credits
- If using UPDATE: 120 credits
- Savings: 63%
4. Technical Implementation Details
4.1 URL Uniqueness Identification
Identification Method:
- Use complete URL as unique identifier
- URL normalization (remove trailing slash, unify protocol)
- Database index optimization (fast duplicate URL queries)
Special Handling:
- Query Parameters: Tracking parameters like `?utm_source=xxx` are preserved (different parameters are treated as different URLs)
- Fragment: Anchors like `#section` are removed (treated as the same URL)
- Case: Insensitive (`Example.com` and `example.com` are treated as the same)
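As an illustration of these rules, here is a minimal normalization sketch built on the standard WHATWG `URL` class; OctoReport's real normalizer is not shown here, and the protocol unification mentioned above is omitted for brevity:

```typescript
// Minimal sketch of the normalization rules above, using the standard
// URL class. Not OctoReport's actual implementation.
function normalizeUrl(raw: string): string {
  const url = new URL(raw); // the URL parser already lowercases the hostname
  url.hash = "";            // drop fragments: #section is treated as the same URL
  // Remove a trailing slash from the path (keep the root "/" as-is).
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  // Query parameters such as ?utm_source=xxx are deliberately preserved:
  // different parameters are treated as different URLs.
  return url.toString();
}

// normalizeUrl("https://Example.com/page/#section") -> "https://example.com/page"
// normalizeUrl("https://example.com/page?id=1")     -> unchanged (query kept)
```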
4.2 Version Management Mechanism
Content Fields:
- `isExpired`: Whether expired (boolean)
- `expiredAt`: Expiration time (timestamp)
- `collectedAt`: Last collection time (timestamp)
Version Management Flow:
- Detect Duplicate: Query database by URL
- Mark Expired (UPDATE strategy): Set `isExpired=true`, `expiredAt=now`
- Save New Version: Create a new record with `isExpired=false`
- Report Filtering: Queries automatically filter out `isExpired=true` content
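A sketch of what the expire-and-insert step might look like, wrapped in a single transaction so a reader never observes zero or two live versions of the same URL. It uses Prisma-style calls purely as an illustration and assumes a hypothetical `content` model with the fields above plus a `body` column:

```typescript
// Sketch of the UPDATE-strategy version swap inside one transaction.
// The `content` model and its fields are illustrative assumptions.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function saveNewVersion(sourceUrl: string, body: string): Promise<void> {
  await prisma.$transaction([
    // 1. Mark every live version of this URL as expired.
    prisma.content.updateMany({
      where: { sourceUrl, isExpired: false },
      data: { isExpired: true, expiredAt: new Date() },
    }),
    // 2. Insert the new version as the only live record.
    prisma.content.create({
      data: { sourceUrl, body, isExpired: false, collectedAt: new Date() },
    }),
  ]);
}

// Report generation then only reads live versions:
// prisma.content.findMany({ where: { isExpired: false } })
```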
4.3 Performance Optimization
Batch Detection:
- Detect multiple URLs in one query (reduce database round trips)
- Use an `IN` query (not individual queries)
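For example, a batch check might look like the following, again assuming the hypothetical Prisma `content` model from above; Prisma's `in` filter compiles to a single SQL `IN` clause, which the composite index described below can serve efficiently:

```typescript
// Sketch of batch duplicate detection: one IN query instead of N lookups.
import { PrismaClient } from "@prisma/client";

async function findExistingUrls(
  prisma: PrismaClient,
  sourceId: string,
  urls: string[],
): Promise<Set<string>> {
  const rows = await prisma.content.findMany({
    where: { sourceId, sourceUrl: { in: urls } },
    select: { sourceUrl: true },
  });
  return new Set(rows.map((r) => r.sourceUrl));
}

// Usage: partition a batch into new vs. existing URLs in one round trip.
// const existing = await findExistingUrls(prisma, sourceId, candidateUrls);
// const newUrls = candidateUrls.filter((u) => !existing.has(u));
```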
Index Optimization:
- Add index on the `sourceUrl` field
- Add index on the `isExpired` field
- Composite index: `(sourceId, sourceUrl, isExpired)`
Concurrency Control:
- Use database transactions (avoid race conditions)
- Optimistic locking mechanism (version number control)
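The optimistic-locking step could be sketched as a conditional update on a version column; the `version` field here is a hypothetical addition to the `content` model, shown only to illustrate the technique:

```typescript
// Sketch of optimistic locking via a version column: the update only
// matches if the row still has the version we read, so concurrent
// writers cannot silently overwrite each other.
import { PrismaClient } from "@prisma/client";

async function expireWithOptimisticLock(
  prisma: PrismaClient,
  id: string,
  expectedVersion: number,
): Promise<boolean> {
  const result = await prisma.content.updateMany({
    where: { id, version: expectedVersion }, // no-op if someone got there first
    data: {
      isExpired: true,
      expiredAt: new Date(),
      version: { increment: 1 },
    },
  });
  return result.count === 1; // false => lost the race; re-read and retry
}
```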
[Figure: Deduplication flow architecture, showing the complete flow from URL detection to content saving]
5. Practical Application Recommendations
5.1 How to Choose Strategy
Decision Tree:
Ask yourself: Will this URL's content be updated?
- Will update:
  - Price, inventory, ratings, etc. → Use UPDATE
  - News corrections, article revisions → Use UPDATE
  - Need to keep historical versions → Use UPDATE
- Won't update:
  - RSS Feed articles → Use KEEP_OLD ⭐
  - Social media posts → Use KEEP_OLD ⭐
  - Historical docs, archives → Use KEEP_OLD ⭐
  - News releases (fixed content) → Use KEEP_OLD ⭐
- Uncertain:
  - Start with UPDATE, adjust after observing for a few days
5.2 Strategy Switching
How to Switch:
- Edit data source configuration
- Modify the `update_strategy` field (`UPDATE` or `KEEP_OLD`)
- Save configuration
- Takes effect on next execution
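For illustration only, a data source configuration might look like this; apart from `update_strategy`, the fields are hypothetical placeholders rather than OctoReport's documented schema:

```typescript
// Hypothetical data source configuration showing the update_strategy field.
const dataSourceConfig = {
  name: "AI News RSS",
  type: "rss",
  url: "https://example.com/feed.xml",
  update_strategy: "KEEP_OLD", // or "UPDATE" (the default)
};
```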
Switching Impact:
- UPDATE → KEEP_OLD: Won't re-scrape existing URLs subsequently (cost reduction)
- KEEP_OLD → UPDATE: Will re-scrape all URLs subsequently (including existing ones)
- Already Saved Content: Not affected (strategy only affects subsequent executions)
5.3 View Deduplication Results
Task Logs:
- View "Task Logs" (Sidebar → Task Logs)
- Check the data source task's `message` field
- Shows: "Found X new URLs, skipped Y existing URLs"
Content Library:
- View library content (Library Management → View Content)
- Filter "expired" content (if using UPDATE strategy)
- Compare different versions of content
6. Frequently Asked Questions
Q1: Does KEEP_OLD strategy completely skip duplicate URLs?
A: Yes. When using the KEEP_OLD strategy, duplicate URLs are completely skipped (no scraping, no saving); only the existing content's `collectedAt` timestamp is updated.
Q2: Does UPDATE strategy keep all historical versions?
A: Yes. Old versions are marked as "expired" (`isExpired=true`) but not deleted, so historical versions can still be viewed in the content library.
Q3: How to clean up expired content?
A: The system doesn't automatically clean up expired content. If cleanup is needed:
- Manually delete expired content in content library
- Contact admin for batch cleanup (use with caution, irreversible)
Q4: Which strategy should RSS feeds use?
A: Strongly recommend using KEEP_OLD strategy. Reasons:
- RSS Feed article content typically doesn't update
- Only need to get newly published articles
- Can save 70-90% of costs
Q5: Is deduplication based on entire URL or just domain?
A: Based on complete URL (including path and query parameters). For example:
- `https://example.com/page?id=1` and `https://example.com/page?id=2` are treated as different URLs
- `https://example.com/page` and `https://example.com/page/` are treated as the same URL (trailing slash removed)
Next Steps
- Atomic Billing Mechanism - Learn about credit deduction reliability guarantees
- System Reliability - Learn about failover and health check mechanisms
- Configuration Tips - Optimize deduplication strategy selection