System Reliability
High-availability system architecture based on multi-instance failover, task observability, and health checks to ensure stable service operation.
💡 Core Goal: 99.9% service availability, automatic failure recovery, complete task tracking capability.
1. Multi-Instance Failover
1.1 RSSHub Multi-Instance Architecture
Architecture Design:
- Support multiple RSSHub instances (official + self-hosted)
- Independent health checks for each instance
- Automatic failover (primary fails → backup instance)
- Instance priority sorting (self-hosted > official rsshub.app)
Instance Configuration Example:
Admin Panel → RSSHub Instance Management

Instance List:

| Instance URL | Priority | Status | Auth Mode |
|---|---|---|---|
| https://my-rsshub.com | 1 (High) | Healthy | BEARER |
| https://rsshub.app | 2 | Healthy | NONE |
| https://backup.rsshub.com | 3 | Down | KEY |
1.2 Failover Process
Automatic Failover Mechanism:
- Request primary instance (highest priority healthy instance)
- Detect failure (timeout, 500 error, connection failure)
- Mark instance as unhealthy (lower priority or temporarily disable)
- Try next instance (by priority order)
- Log failover (audit trail)
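The steps above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the record fields (`url`, `priority`, `healthy`), the function name `requestWithFailover`, and the injected `fetchRoute` callback are all assumptions made for the example.

```javascript
// Hypothetical instance records; the field names are assumptions for illustration.
const instances = [
  { url: "https://my-rsshub.com", priority: 1, healthy: true },
  { url: "https://rsshub.app", priority: 2, healthy: true },
  { url: "https://backup.rsshub.com", priority: 3, healthy: false },
];

// Try healthy instances in priority order. A failed instance is marked
// unhealthy, and each switch to the next instance is logged.
// fetchRoute is an injected async function (assumed) that resolves with the
// response body or throws on timeout, 500 error, or connection failure.
async function requestWithFailover(instances, route, fetchRoute, log = console.log) {
  const candidates = instances
    .filter((instance) => instance.healthy)
    .sort((a, b) => a.priority - b.priority);

  let previous = null;
  for (const instance of candidates) {
    if (previous) {
      log(`Failover from ${previous.url} to ${instance.url}`);
    }
    try {
      const body = await fetchRoute(instance.url + route);
      return { instance: instance.url, body };
    } catch (err) {
      instance.healthy = false; // passive check: mark on request failure
      previous = instance;
    }
  }
  throw new Error("All RSSHub instances are unavailable");
}
```

Injecting `fetchRoute` keeps the selection logic independent of any particular HTTP client, so the same function can be exercised with a stub in tests.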
Process Example:
Timeline:

10:00:00 - User creates RSS data source
• Selected instance: my-rsshub.com (priority 1)
• Request URL: https://my-rsshub.com/bilibili/user/video/123
• Result: ✅ Success (200 OK)

11:00:00 - Scheduled task execution
• Selected instance: my-rsshub.com (priority 1)
• Request URL: https://my-rsshub.com/bilibili/user/video/123
• Result: ❌ Failed (Connection Timeout)
• Actions:
  1. Mark my-rsshub.com as "unhealthy"
  2. Switch to backup instance rsshub.app (priority 2)
  3. Request URL: https://rsshub.app/bilibili/user/video/123
  4. Result: ✅ Success (200 OK)
• Log: "Failover from my-rsshub.com to rsshub.app"

12:00:00 - Health check recovery
• Detected my-rsshub.com has recovered
• Restore its priority
• Use my-rsshub.com again on next execution
1.3 Health Check Mechanism
Check Methods:
- Active Check: Send test request every 5 minutes
- Passive Check: Mark immediately on user request failure
- Recovery Check: Attempt recovery every 10 minutes for unhealthy instances
Check Metrics:
| Metric | Healthy Threshold | Unhealthy Threshold |
|---|---|---|
| Response Time | < 5 seconds | > 10 seconds |
| Error Rate | < 5% | > 20% |
| Connection Success Rate | > 95% | < 80% |
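As a rough sketch, the thresholds in the table could be applied like this. The table does not say how values between the two thresholds are treated or how the metrics combine, so the `"degraded"` label and the rule that any single unhealthy metric marks the instance unhealthy are assumptions, as are the function and field names.

```javascript
// Classify an instance from its recent metrics.
// Rates are fractions in [0, 1]; response time is in milliseconds.
function classifyInstance({ responseTimeMs, errorRate, connectionSuccessRate }) {
  // Assumption: crossing any one unhealthy threshold is enough.
  const unhealthy =
    responseTimeMs > 10_000 || errorRate > 0.2 || connectionSuccessRate < 0.8;
  if (unhealthy) return "unhealthy";

  // All three metrics must be inside the healthy range.
  const healthy =
    responseTimeMs < 5_000 && errorRate < 0.05 && connectionSuccessRate > 0.95;

  // Assumption: anything in between is reported as "degraded".
  return healthy ? "healthy" : "degraded";
}
```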
[Figure placeholder: Failover flow diagram showing switchover from primary to backup instance]
2. Task Observability
2.1 Complete Task Logs
Log Contents:
- `id`: Unique task ID
- `type`: Task type (COLLECT, CLEAN, REPORT_GENERATE)
- `status`: Status (PENDING, PROCESSING, SUCCESS, FAILED)
- `createdAt`: Creation time
- `startedAt`: Start time
- `completedAt`: Completion time
- `duration`: Execution duration (milliseconds)
- `creditsUsed`: Credits consumed
- `message`: Execution message
- `error`: Error information (if failed)
Log Viewing:
Sidebar → Task Logs

Supported Filters:
• By type: Data Collection / Content Cleaning / Report Generation
• By status: All / Success / Failed / In Progress
• By time: Last 24 hours / Last 7 days / Last 30 days
• By data source: Select specific source
• By report: Select specific template

Sample Logs:

| Time | Type | Status | Duration | Credits |
|---|---|---|---|---|
| 10:00:15 | Collect | Success | 2.3s | 20 |
| 10:05:30 | Report | Success | 45s | 150 |
| 10:10:00 | Collect | Failed | 0.5s | 0 |
| 10:15:45 | Ask | Success | 3s | 15 |

Click task to view details:

    {
      "id": "task_abc123",
      "type": "COLLECT",
      "status": "FAILED",
      "message": "Connection timeout",
      "error": "Failed to connect to rsshub.app after 3 retries",
      "startedAt": "2025-10-27T10:10:00Z",
      "duration": 500,
      "creditsUsed": 0
    }
2.2 Real-time Status Monitoring
Task State Machine:
PENDING (Waiting)
  ↓
PROCESSING (Running)
  ↓
SUCCESS / FAILED

State Transition Rules:
• PENDING → PROCESSING: Worker starts processing
• PROCESSING → SUCCESS: Execution succeeds
• PROCESSING → FAILED: Execution fails (timeout/error/insufficient balance)
• No state rollback (one-way flow)
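One way to enforce these one-way transitions is a small lookup table checked before every status change. The sketch below is illustrative only; `TRANSITIONS` and `transition` are hypothetical names, not part of the documented system.

```javascript
// Allowed next states for each task status. Terminal states allow none,
// which is what rules out any rollback.
const TRANSITIONS = {
  PENDING: ["PROCESSING"],
  PROCESSING: ["SUCCESS", "FAILED"],
  SUCCESS: [],
  FAILED: [],
};

// Return a copy of the task in the new status, or throw if the move is illegal.
function transition(task, next) {
  const allowed = TRANSITIONS[task.status] ?? [];
  if (!allowed.includes(next)) {
    throw new Error(`Illegal transition ${task.status} -> ${next}`);
  }
  return { ...task, status: next };
}
```

For example, `transition({ id: "task_abc123", status: "PENDING" }, "PROCESSING")` returns the task with status `PROCESSING`, while attempting `SUCCESS → PENDING` throws.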
In-Progress Tasks:
- Task log shows "In Progress" label
- Real-time execution duration updates
- Support viewing Worker logs (advanced feature)
2.3 Performance Metrics Tracking
Key Metrics:
| Metric | Normal Range | Abnormal Threshold | Impact |
|---|---|---|---|
| Data Collection Duration | 1-5 seconds | > 30 seconds | Slow data source response |
| Report Generation Duration | 10-60 seconds | > 5 minutes | Slow LLM response or too much content |
| Task Success Rate | > 95% | < 80% | Configuration error or service anomaly |
3. Failure Recovery Mechanism
3.1 Automatic Retry Strategy
Retry Scenarios:
- ✅ Network timeout
- ✅ Temporary service unavailable (503)
- ✅ Rate limit (429 Too Many Requests)
- ❌ Configuration error (404 Not Found, no retry)
- ❌ Authentication failure (401 Unauthorized, no retry)
Retry Strategy:
Exponential Backoff

1st failure → Wait 1 second → Retry
2nd failure → Wait 2 seconds → Retry
3rd failure → Wait 4 seconds → Retry
4th failure → Give up, mark task failed

Max retries: 3 times
Total timeout: 30 seconds

Sample Logs:

2025-10-27 10:00:00 [INFO] Attempting request (1/3)
2025-10-27 10:00:05 [WARN] Timeout, retrying in 1s (2/3)
2025-10-27 10:00:08 [WARN] Timeout, retrying in 2s (3/3)
2025-10-27 10:00:15 [ERROR] Max retries exceeded, task failed
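A minimal sketch of this backoff schedule, assuming errors carry either an HTTP `status` property or the Node.js `ETIMEDOUT` code (both assumptions about how failures are reported). The helper names are hypothetical, and the 30-second total timeout is left out to keep the example short.

```javascript
// Retry only transient failures: timeouts, 503 and 429. Errors such as
// 404 and 401 are not retried.
const RETRYABLE_STATUS = new Set([429, 503]);

function isRetryable(err) {
  return err.code === "ETIMEDOUT" || RETRYABLE_STATUS.has(err.status);
}

// Delay before retry n (1-based): 1s, 2s, 4s, ...
function backoffDelayMs(retryNumber, baseMs = 1000) {
  return baseMs * 2 ** (retryNumber - 1);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run operation; on a retryable failure wait and try again, up to maxRetries
// retries. With maxRetries = 3, the 4th failure is rethrown to the caller.
async function withRetry(operation, { maxRetries = 3, sleep = defaultSleep } = {}) {
  let failures = 0;
  while (true) {
    try {
      return await operation();
    } catch (err) {
      failures += 1;
      if (!isRetryable(err) || failures > maxRetries) throw err;
      await sleep(backoffDelayMs(failures));
    }
  }
}
```

Passing `sleep` as an option lets tests substitute a stub that records the delays instead of waiting.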
3.2 Degradation Strategy
Service Degradation Scenarios:
- Firecrawl unavailable → Auto-degrade to Browserless
- Primary LLM model unavailable → Switch to backup model (admin configured)
- RSSHub instance unavailable → Switch to other instances
Degradation Example (Web Scraping):
scrapePageDetail() function degradation flow:

1. Try Firecrawl (preferred)
   ↓ Failed (timeout/API error)
2. Log: "Firecrawl failed, falling back to Browserless"
   ↓
3. Try Browserless (backup)
   ↓ Success
4. Return result + indicate provider used

Task log shows:

    {
      "provider": "browserless",
      "fallback": true,
      "reason": "firecrawl_timeout"
    }
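The flow above might look something like the following. This is a sketch under stated assumptions: `firecrawl` and `browserless` are injected async functions standing in for the real clients (not their actual SDK calls), and the `"firecrawl_error"` reason value is invented for the non-timeout case.

```javascript
// Try the preferred provider first, then fall back to the backup.
// The returned object records which provider produced the content.
async function scrapePageDetail(url, { firecrawl, browserless, log = console.warn }) {
  try {
    const content = await firecrawl(url);
    return { content, provider: "firecrawl", fallback: false };
  } catch (err) {
    log("Firecrawl failed, falling back to Browserless");
    // If Browserless also fails, its error propagates to the caller.
    const content = await browserless(url);
    return {
      content,
      provider: "browserless",
      fallback: true,
      // "firecrawl_error" is an assumed value for failures other than timeouts.
      reason: err.code === "ETIMEDOUT" ? "firecrawl_timeout" : "firecrawl_error",
    };
  }
}
```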
3.3 Manual Intervention Capability
Admin Operations:
- Disable unhealthy instance (RSSHub Instance Management → Disable)
- Manually retry task (Task Logs → Click "Retry")
- Adjust priority (RSSHub Instance Management → Modify Priority)
User Operations:
- Manually trigger execution (Data Source/Report List → "Execute Now")
- View failure reason (Task Logs → View Error Details)
- Modify config and retry (Edit Data Source/Report → Save → Execute Now)
[Figure placeholder: Failure recovery flow showing the complete process from failure detection to auto-retry to degradation]
4. Data Consistency Guarantee
4.1 Transaction Protection
Critical Operations Use Transactions:
- ✅ Credit deduction + Task creation (atomic)
- ✅ Content saving + Deduplication (atomic)
- ✅ Report generation + Step result saving (atomic)
Transaction Example:
    // Save content and handle duplicates in a single transaction.
    // Illustrative names: the model `content`, the 'UPDATE' strategy value, and
    // the fields written on update are placeholders; only `sourceUrl` is named
    // elsewhere in this document.
    await prisma.$transaction(async (tx) => {
      // Look up an existing record by its unique source URL
      const existing = await tx.content.findUnique({ where: { sourceUrl: url } })

      // Duplicate found and the strategy allows updating it in place
      if (existing && strategy === 'UPDATE') {
        return tx.content.update({
          where: { id: existing.id },
          data: { title, updatedAt: new Date() }
        })
      }

      // Otherwise create a new record
      return tx.content.create({ data: { sourceUrl: url, title, content } })
    })
4.2 Concurrency Control
Prevent Data Races:
- Optimistic Lock: Use version number control (`version` field)
- Pessimistic Lock: Critical operations use `FOR UPDATE` (e.g., credit deduction)
- Unique Constraints: Database-level duplicate prevention (e.g., `sourceUrl` index)
Concurrency Scenario Example:
User submits 2 report generation tasks simultaneously:

Task A:
1. Query credit balance: 1000
2. Deduct 200 → Balance 800
3. Create Report A
4. Commit transaction ✅

Task B:
1. Query credit balance: 800 (Task A already deducted)
2. Deduct 200 → Balance 600
3. Create Report B
4. Commit transaction ✅

Result: Both tasks succeed, balance correct (600)

Without transaction protection:
Task A and B query balance simultaneously → Both 1000
Task A deducts → Balance 800
Task B deducts → Balance 800 (Wrong! Should be 600)
Result: Credit inconsistency ❌
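The lost update in the unprotected case can be reproduced, and prevented, with a version check. The sketch below illustrates optimistic locking on an in-memory object standing in for the credit record; a real system would perform the comparison in the database. The `account` object and both function names are hypothetical.

```javascript
// In-memory stand-in for a credit record with a version counter.
const account = { balance: 1000, version: 0 };

// Compare-and-set: apply the write only if the version read earlier is still current.
function deductIfUnchanged(account, snapshot, amount) {
  if (account.version !== snapshot.version) return false; // another writer got in first
  if (snapshot.balance < amount) throw new Error("Insufficient credits");
  account.balance = snapshot.balance - amount;
  account.version += 1;
  return true;
}

// Re-read and try again until the write applies.
function deductWithRetry(account, amount) {
  for (;;) {
    const snapshot = { ...account };
    if (deductIfUnchanged(account, snapshot, amount)) return account.balance;
  }
}

// Task A and Task B both read the balance at the same time (1000).
const snapshotA = { ...account };
const snapshotB = { ...account };

const appliedA = deductIfUnchanged(account, snapshotA, 200); // applies, balance 800
const appliedB = deductIfUnchanged(account, snapshotB, 200); // rejected, stale version

// Task B re-reads the current balance (800) and deducts from that.
if (!appliedB) deductWithRetry(account, 200);
```

Without the version check, Task B would have written 1000 - 200 = 800 over Task A's result. With it, the stale write is rejected and the final balance is 600.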
5. System Monitoring Metrics
5.1 Key Performance Indicators (KPIs)
| Metric | Target | Current | Monitoring Method |
|---|---|---|---|
| System Availability | > 99.9% | 99.95% | Health Checks |
| Task Success Rate | > 95% | 97.3% | Task Log Statistics |
| Average Response Time | < 3 seconds | 2.1 seconds | Performance Tracking |
| Data Consistency | 100% | 100% | Transaction Audit |
5.2 Alert Mechanism
Alert Triggers:
- ✅ Task success rate < 80% (within 1 hour)
- ✅ All RSSHub instances unavailable
- ✅ Database connection pool exhausted
- ✅ Redis connection failure
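As an illustration of the first trigger, the success rate over the last hour could be computed from task log records like this. It is a sketch only: it assumes each record has the `status` and `completedAt` fields described in the task log section, and the function names and the choice to skip alerting when no tasks finished in the window are assumptions.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Fraction of finished tasks in the window that succeeded, or null if none finished.
function successRate(tasks, now, windowMs = ONE_HOUR_MS) {
  const finished = tasks.filter((task) => {
    if (task.status !== "SUCCESS" && task.status !== "FAILED") return false;
    const age = now - Date.parse(task.completedAt);
    return age >= 0 && age <= windowMs;
  });
  if (finished.length === 0) return null;
  const succeeded = finished.filter((task) => task.status === "SUCCESS").length;
  return succeeded / finished.length;
}

// Alert when the success rate within the last hour drops below 80%.
function shouldAlert(tasks, now) {
  const rate = successRate(tasks, now);
  return rate !== null && rate < 0.8;
}
```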
Alert Notifications:
- Admin email notification
- Red warning badge in admin panel
- System log recording
6. User Experience Guarantee
6.1 Transparent Error Messages
User-Friendly Error Messages:
| Technical Error | User-Facing Message |
|---|---|
| | Network connection timeout, please retry later |
| | Insufficient credits, please top up and retry |
| | RSS route configuration error, please check route format |
| | API call rate too high, please retry in 5 minutes |
6.2 Self-Service Troubleshooting Tools
Provided Tools:
- Task Logs: View detailed execution logs and error information
- Consumption Details: Track credit consumption and refund records
- Health Status: View system component status (future feature)
- Documentation Search: Quickly find solutions to common issues
7. Frequently Asked Questions
Q1: What if all RSSHub instances are unavailable?
A: The system will:
- Mark task as failed
- Automatically refund (if already charged)
- Send alert to administrator
- Recommendation: Configure multiple instances (official + self-hosted) to reduce risk
Q2: What if a task stays in "In Progress" status?
A: Possible reasons:
- Worker process exception: Contact admin to check Worker status
- Task is actually executing: Complex reports may take 5-10 minutes
- Timeout not yet detected: The system automatically marks the task as timed out after 30 minutes
Q3: Why do tasks sometimes retry automatically?
A: System retries in these cases:
- Network timeout
- Temporary service unavailable (503)
- Rate limit (429, retry after waiting)
Won't retry for:
- Configuration errors (404, 401)
- Insufficient credits
- Business logic errors (Invalid data)
Q4: How do I view system health status?
A: Current viewing methods:
- Task Logs: View recent task success rate
- RSSHub Instance Management (Admin): View instance status
- Future Feature: System status page (display all component health)
Next Steps
- URL Deduplication Technology - Learn about content deduplication mechanisms
- Atomic Billing Mechanism - Learn about credit deduction guarantees
- Optimization & Troubleshooting - Solve common issues