Task #181
Generic Web Scraping Engine for Vidyarti
Start date: 03/17/2026
Due date:
% Done: 0%
Estimated time:
Description
Develop a centralized web scraping module that fetches data from multiple external sources and maps it into the appropriate Vidyarti module, such as:
- Syllabus
- Current Affairs
- Mock Test Questions
- Study Materials
The system should be configurable, reusable, and scalable.
Tables
vid_scraping_source_master
- id: INT (PK) - Source ID
- source_name: VARCHAR(150) - Website name
- base_url: VARCHAR(255) - Website URL
- module_type: ENUM('current_affairs','syllabus','mock_test','study_material') - Target module
- parsing_rules: TEXT - JSON rules for scraping
- status: BOOLEAN - Active/Inactive
- created_at: DATETIME - Created date
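To illustrate how parsing_rules might drive extraction, the sketch below stores a small JSON rule set and applies it with Python's standard html.parser module. The ticket does not define the rule format, so the keys title_tag and content_tag, the RuleBasedExtractor class, the extract helper, and the sample HTML are all hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical rule format: the ticket only says parsing_rules holds JSON.
PARSING_RULES = json.dumps({"title_tag": "h2", "content_tag": "p"})


class RuleBasedExtractor(HTMLParser):
    """Collects the text of the tags named in a parsing_rules JSON string."""

    def __init__(self, rules_json):
        super().__init__()
        rules = json.loads(rules_json)
        # Map tag name -> staging column that receives its text.
        self._fields = {
            rules["title_tag"]: "raw_title",
            rules["content_tag"]: "raw_content",
        }
        self._current = None  # column currently being captured, if any
        self.result = {"raw_title": [], "raw_content": []}

    def handle_starttag(self, tag, attrs):
        if tag in self._fields:
            self._current = self._fields[tag]
            self.result[self._current].append("")

    def handle_endtag(self, tag):
        if tag in self._fields:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.result[self._current][-1] += data


def extract(html, rules_json):
    """Apply the rules to an HTML string and return staging-style fields."""
    parser = RuleBasedExtractor(rules_json)
    parser.feed(html)
    parser.close()
    return {
        "raw_title": " ".join(t.strip() for t in parser.result["raw_title"]),
        "raw_content": "\n".join(c.strip() for c in parser.result["raw_content"]),
    }


sample = "<h2>Budget 2026</h2><p>First point.</p><p>Second point.</p>"
print(extract(sample, PARSING_RULES))
```

Keeping the rules as data rather than code is one way to meet the "configurable" goal: a new source would only need a new row, not a new scraper class. Real sites would likely need richer selectors than bare tag names.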
vid_scraped_data_staging
- id: INT (PK) - ID
- source_id: INT (FK) - Reference source
- module_type: VARCHAR(50) - Target module
- raw_title: TEXT - Extracted title
- raw_content: TEXT - Extracted content
- raw_data: JSON - Full raw scraped data
- source_url: VARCHAR(255) - Original link
- status: ENUM('pending','approved','rejected') - Workflow status
- created_at: DATETIME - Scraped time
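The staging status column implies a review workflow. The sketch below shows one possible way to guard status changes. The three status values come from the ticket; the rule that only pending rows can be approved or rejected, and the names ALLOWED_TRANSITIONS and change_status, are assumptions made for illustration.

```python
# Statuses come from the ticket's ENUM; the transition rules are an assumption.
ALLOWED_TRANSITIONS = {
    "pending": {"approved", "rejected"},
    "approved": set(),
    "rejected": set(),
}


def change_status(row, new_status):
    """Return a copy of a staging row with its status updated, or raise."""
    current = row["status"]
    if new_status not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {new_status!r}")
    if new_status not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {new_status!r}")
    # Copy the row so the caller's original dict is left untouched.
    return {**row, "status": new_status}


row = {"id": 1, "source_id": 1, "raw_title": "Sample title", "status": "pending"}
approved = change_status(row, "approved")
print(approved["status"])
```

If reviewers need to undo a decision, the empty sets for approved and rejected could be widened without touching the function.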
vid_scraping_logs
- id: INT (PK) - Log ID
- source_id: INT - Source reference
- status: VARCHAR(50) - Success/Failed
- message: TEXT - Error or success message
- run_time: DATETIME - Execution time
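The three tables above could be prototyped as follows. This is a minimal sketch that uses an in-memory SQLite database only so it runs without setup; the ticket's ENUM and BOOLEAN types suggest a different target database. The CHECK constraints stand in for ENUM, and the NOT NULL, DEFAULT, and UNIQUE clauses, the log_run helper, and the sample log message are assumptions, not part of the ticket.

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stand-in for the ticket's tables; ENUM becomes a CHECK constraint.
SCHEMA = """
CREATE TABLE vid_scraping_source_master (
    id INTEGER PRIMARY KEY,
    source_name VARCHAR(150) NOT NULL,
    base_url VARCHAR(255) NOT NULL,
    module_type TEXT NOT NULL CHECK (module_type IN
        ('current_affairs', 'syllabus', 'mock_test', 'study_material')),
    parsing_rules TEXT,
    status BOOLEAN NOT NULL DEFAULT 1,
    created_at DATETIME NOT NULL
);
CREATE TABLE vid_scraped_data_staging (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES vid_scraping_source_master(id),
    module_type VARCHAR(50) NOT NULL,
    raw_title TEXT,
    raw_content TEXT,
    raw_data TEXT,              -- JSON stored as text in this sketch
    source_url VARCHAR(255) UNIQUE,
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'approved', 'rejected')),
    created_at DATETIME NOT NULL
);
CREATE TABLE vid_scraping_logs (
    id INTEGER PRIMARY KEY,
    source_id INTEGER,
    status VARCHAR(50),
    message TEXT,
    run_time DATETIME
);
"""


def log_run(conn, source_id, ok, message):
    """Append one row to vid_scraping_logs for a finished scraping run."""
    conn.execute(
        "INSERT INTO vid_scraping_logs (source_id, status, message, run_time) "
        "VALUES (?, ?, ?, ?)",
        (source_id, "Success" if ok else "Failed", message,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
log_run(conn, 1, False, "sample failure message")
print(conn.execute(
    "SELECT source_id, status, message FROM vid_scraping_logs").fetchall())
```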
Validations
Backend
- source_name → required
- base_url → valid URL
- module_type → must be valid enum
- source_url → unique (avoid duplicates)
- Prevent duplicate data: reject a scraped record when an existing record has the same source_url OR the same title
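The backend rules above could be expressed roughly as follows. This is a sketch under assumptions, not the project's actual validator: the function names validate_source and is_duplicate are hypothetical, a "valid URL" is taken to mean an http or https URL with a host, and titles are compared case-insensitively after trimming whitespace.

```python
from urllib.parse import urlparse

# Enum values copied from the module_type column in the ticket.
VALID_MODULE_TYPES = {"current_affairs", "syllabus", "mock_test", "study_material"}


def validate_source(payload):
    """Return a list of error strings for a source master payload."""
    errors = []
    if not str(payload.get("source_name") or "").strip():
        errors.append("source_name is required")
    parsed = urlparse(str(payload.get("base_url") or ""))
    # Assumption: only http(s) URLs with a host count as valid.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("base_url must be a valid http(s) URL")
    if payload.get("module_type") not in VALID_MODULE_TYPES:
        errors.append("module_type must be a valid enum value")
    return errors


def is_duplicate(item, existing_rows):
    """True when an existing row has the same source_url OR the same title."""
    # Assumption: titles match case-insensitively, ignoring outer whitespace.
    title = item["raw_title"].strip().lower()
    return any(
        row["source_url"] == item["source_url"]
        or row["raw_title"].strip().lower() == title
        for row in existing_rows
    )


bad = {"source_name": "", "base_url": "ftp://x", "module_type": "news"}
print(validate_source(bad))
```

A unique index on source_url would still be worth having at the database level, since an application-side check alone can race when two scraping runs insert at the same time.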
Frontend
- Required fields: Source Name, URL, Module Type
- JSON validation for parsing rules
- Show a test-scraping preview (optional)
Updated by Sreemayi C M about 2 months ago
- Status changed from New to In Progress