This project consists of two microservices:
- Scraper Service (Golang): Fetches HTML content from the books.toscrape.com website and sends the raw HTML to the Parser Service.
- Parser Service (Golang): Receives HTML content and extracts product information such as Name, Availability, UPC, Price (excl. tax), and Tax, storing the data in a SQLite database.
Both services are deployed in Kubernetes and communicate via gRPC.
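As a rough illustration of this gRPC hop, the sketch below shows how the scraper could forward a page of HTML to the parser. The Kubernetes DNS name `parser-service:50051` and the generated `ParserClient`/`ParseRequest`/`ParseHTML` names are assumptions for illustration; the real names come from the project's `.proto` definition.

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	pb "example.com/bookscraper/parserpb" // hypothetical path to the generated gRPC stubs
)

// sendToParser forwards one page of raw HTML to the Parser Service.
// "parser-service:50051" is the parser's Kubernetes Service DNS name and port (assumed).
func sendToParser(html string) error {
	conn, err := grpc.Dial("parser-service:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	client := pb.NewParserClient(conn) // ParserClient / ParseHTML names are assumptions
	_, err = client.ParseHTML(context.Background(), &pb.ParseRequest{Html: html})
	return err
}
```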
- Scraper Service:
  - Scrapes paginated product lists from https://books.toscrape.com.
  - Sends HTML content to the Parser Service for data extraction.
  - Saves product data in a SQLite database.
- Parser Service:
  - Exposes a gRPC endpoint to receive HTML content.
  - Extracts product details from the raw HTML received from the Scraper Service (an extraction sketch follows the architecture diagram below).
[Scraper Service] --> [Parser Service] --> [SQLite Database]
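A minimal sketch of the extraction step is shown below. It assumes the goquery library and the markup of a books.toscrape.com product detail page; the CSS selectors and the `Product` struct are illustrative, not the project's confirmed implementation, and would need adjusting if the site changes.

```go
package main

import (
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// Product mirrors the fields listed above; the struct name and layout are illustrative.
type Product struct {
	Name         string
	Availability string
	UPC          string
	PriceExclTax string
	Tax          string
}

// extractProduct pulls product details out of a product-page HTML document.
// Selectors such as ".product_main h1" and "table.table-striped" are assumptions
// based on the site's current structure.
func extractProduct(html string) (Product, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		return Product{}, err
	}

	p := Product{Name: strings.TrimSpace(doc.Find(".product_main h1").Text())}

	// The product information table is a list of <th>/<td> rows.
	doc.Find("table.table-striped tr").Each(func(_ int, row *goquery.Selection) {
		key := strings.TrimSpace(row.Find("th").Text())
		val := strings.TrimSpace(row.Find("td").Text())
		switch key {
		case "UPC":
			p.UPC = val
		case "Price (excl. tax)":
			p.PriceExclTax = val
		case "Tax":
			p.Tax = val
		case "Availability":
			p.Availability = val
		}
	})
	return p, nil
}
```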
Deployment files are located in the `k8s` directory:

- `scraper-deployment.yaml`: Deploys the Scraper Service.
- `parser-deployment.yaml`: Deploys the Parser Service.
- `parser-service.yaml`: Exposes the Parser Service to other services.
- `sqlite-pv.yaml`: Persistent volume for storing the SQLite database.
Helper scripts are provided in the `scripts` directory to simplify the deployment process:

- Ensure Kubernetes is Running Locally: `minikube start`
- Build Docker Images: `./scripts/build.ps1`
- Deploy Services: `./scripts/deploy.ps1`
- Check Deployment Status: `kubectl get pods` and `kubectl get pvc`
- Access Logs: `kubectl logs <scraper-pod>` and `kubectl logs <parser-pod>`
- Check Scraper Service Logs: `kubectl logs -l app=scraper`
- Check Parser Service Logs: `kubectl logs -l app=parser`
- Verify Database Contents: Access the SQLite database stored in the persistent volume with `kubectl exec -it <scraper-pod> -- sqlite3 /data/products.db`, then query the database: `SELECT * FROM products;`
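For orientation, the sketch below shows one way the parsed fields could end up in `/data/products.db` using Go's database/sql with the mattn/go-sqlite3 driver. The driver choice and the column layout are assumptions, so the actual `SELECT * FROM products;` output may differ.

```go
package main

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // SQLite driver (assumed; the project may use another)
)

// saveProduct inserts one parsed product into /data/products.db.
// The column names mirror the fields listed earlier but are illustrative only.
func saveProduct(name, availability, upc, priceExclTax, tax string) error {
	db, err := sql.Open("sqlite3", "/data/products.db")
	if err != nil {
		return err
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS products (
		name TEXT, availability TEXT, upc TEXT, price_excl_tax TEXT, tax TEXT
	)`); err != nil {
		return err
	}

	_, err = db.Exec(
		`INSERT INTO products (name, availability, upc, price_excl_tax, tax) VALUES (?, ?, ?, ?, ?)`,
		name, availability, upc, priceExclTax, tax,
	)
	return err
}
```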
- Parser Service Timeout
  - Description: The scraper service may experience timeouts when waiting for the parser service to respond, especially if the parsing operation takes a long time or the parser service is under heavy load.
  - Solution:
    - Increase the scraper's request timeout configuration if needed (a deadline-handling sketch follows this list).
    - Monitor the resource utilization of the parser service and scale it accordingly.
- HTML Page Structure Changes
  - Description: The scraper service relies on a specific HTML structure to extract data. If the structure of the target website changes, the scraper may fail to extract data correctly or miss important information.
  - Solution:
    - Regularly monitor the target website for changes in its HTML structure.
    - Implement error handling to detect when parsing fails or yields unexpected results (a validation sketch follows this list).
    - Use flexible parsing methods that can adapt to minor changes in the HTML structure.
- Resource Limits and Out-of-Memory (OOM) Errors
  - Description: The scraper or parser service may be terminated due to exceeding memory or CPU limits.
  - Solution:
    - Set appropriate resource requests and limits in the deployment manifests.
    - Monitor the resource usage of the services using Kubernetes tools like `kubectl top pods` or Prometheus.
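A deadline-handling sketch for the timeout issue is shown below. The `PARSER_TIMEOUT` environment variable, the 60-second default, and the generated `ParserClient` names are assumptions for illustration.

```go
package main

import (
	"context"
	"errors"
	"os"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "example.com/bookscraper/parserpb" // hypothetical generated gRPC stubs
)

// callParser applies a per-request deadline read from PARSER_TIMEOUT (e.g. "90s");
// the variable name and the 60s default are assumptions, not the project's config.
func callParser(client pb.ParserClient, html string) error {
	timeout := 60 * time.Second
	if v := os.Getenv("PARSER_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			timeout = d
		}
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	_, err := client.ParseHTML(ctx, &pb.ParseRequest{Html: html})
	if status.Code(err) == codes.DeadlineExceeded {
		// The parser did not answer in time; retrying or scaling the parser may help.
		return errors.New("parser timed out; consider raising PARSER_TIMEOUT or scaling the parser")
	}
	return err
}
```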
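For detecting structure changes, the parser can refuse to store rows whose required fields came back empty. A sketch, with hypothetical parameters for the extracted fields:

```go
package main

import "fmt"

// validateProduct flags extractions that look like a silent HTML structure change.
// The parameter list is illustrative; the real code would validate whatever
// fields the parser actually extracts.
func validateProduct(name, upc, priceExclTax string) error {
	var missing []string
	if name == "" {
		missing = append(missing, "name")
	}
	if upc == "" {
		missing = append(missing, "upc")
	}
	if priceExclTax == "" {
		missing = append(missing, "price_excl_tax")
	}
	if len(missing) > 0 {
		// Surfacing this as an error makes structure changes visible in the logs
		// instead of silently writing empty rows to the database.
		return fmt.Errorf("possible HTML structure change: missing fields %v", missing)
	}
	return nil
}
```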
- Logging:
  - Add structured logging and log levels (INFO, ERROR) to help with debugging (a logging sketch follows this list).
- Database:
  - Migrate from SQLite to a more scalable database like PostgreSQL for production use.
- Security:
  - Implement TLS verification properly and avoid bypassing it in production environments (a TLS configuration sketch follows this list).
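For the logging improvement, a structured JSON logger built on the standard log/slog package (Go 1.21+) could be wired up roughly like this; the field names are illustrative:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON logs with a configurable minimum level make `kubectl logs` output
	// easy to filter and to ship to a log aggregator.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))

	logger.Info("scraped page",
		"url", "https://books.toscrape.com/catalogue/page-1.html",
		"products", 20)
	logger.Error("parser call failed", "error", "context deadline exceeded")
}
```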
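For the security improvement, the main change is to stop setting `InsecureSkipVerify: true` on the scraper's outbound HTTPS client. A sketch of a client that keeps certificate verification enabled (the standard library default):

```go
package main

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newHTTPClient returns a client that verifies server certificates.
// Do NOT set InsecureSkipVerify: true outside of local experiments.
func newHTTPClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				MinVersion: tls.VersionTLS12, // verification stays enabled by default
			},
		},
	}
}
```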