Web Scraping application demo

🔍 Overview

This project consists of two microservices:

  1. Scraper Service (Golang): Fetches HTML content from the books.toscrape.com website and sends the raw HTML to the Parser Service.

  2. Parser Service (Golang): Receives HTML content over gRPC and extracts product information such as Name, Availability, UPC, Price (excl. tax), and Tax; the extracted data is stored in a SQLite database.

Both services are deployed in Kubernetes and communicate via gRPC.
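
The gRPC contract itself is not shown in this README; as a rough sketch, the Go types generated from it might look something like the following (the package, message, and service names here are assumptions for illustration, not the repository's actual contract):

    // Hypothetical shape of the gRPC contract between the two services.
    // The real generated package in the repository may differ.
    package parserpb

    import (
        "context"

        "google.golang.org/grpc"
    )

    // ParseRequest carries the raw HTML of one product page.
    type ParseRequest struct {
        Html string
    }

    // ParseResponse carries the fields listed in the Overview.
    type ParseResponse struct {
        Name         string
        Availability string
        Upc          string
        PriceExclTax string
        Tax          string
    }

    // ParserClient is roughly what a protoc-generated client interface looks like.
    type ParserClient interface {
        Parse(ctx context.Context, in *ParseRequest, opts ...grpc.CallOption) (*ParseResponse, error)
    }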


⚙️ Architecture

  • Scraper Service:

    • Scrapes paginated product lists from https://books.toscrape.com (a fetch-loop sketch follows this list).
    • Sends HTML content to the Parser Service for data extraction.
    • Saves product data in a SQLite database.
  • Parser Service:

    • Exposes a gRPC endpoint to receive HTML content.
    • Extracts product details from the raw HTML received from the Scraper Service (an extraction sketch follows the deployment diagram below).
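
A minimal sketch of the scraper's fetch loop, assuming the page-N URL pattern used by books.toscrape.com (the repository's actual scraping code may be organized differently):

    // Sketch: walk the paginated catalogue until the site stops returning pages.
    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        for page := 1; ; page++ {
            url := fmt.Sprintf("https://books.toscrape.com/catalogue/page-%d.html", page)

            resp, err := http.Get(url)
            if err != nil {
                log.Fatalf("fetching %s: %v", url, err)
            }
            if resp.StatusCode == http.StatusNotFound {
                resp.Body.Close()
                break // past the last catalogue page
            }

            html, err := io.ReadAll(resp.Body)
            resp.Body.Close()
            if err != nil {
                log.Fatalf("reading %s: %v", url, err)
            }

            // In the real service, each product page's HTML would be sent to the
            // Parser Service over gRPC at this point.
            fmt.Printf("page %d: %d bytes\n", page, len(html))
        }
    }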

Deployment Diagram

[Scraper Service] --> [Parser Service] --> [SQLite Database]
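
On the parser side, the extraction sketched below uses the goquery library and the product-page markup currently served by books.toscrape.com; the repository's parser may use a different HTML library or selectors:

    // Sketch: pull the name and the product-information table from one page.
    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        // Example product page; in the deployed service the HTML arrives via gRPC.
        resp, err := http.Get("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        product := map[string]string{
            "Name": doc.Find("div.product_main h1").Text(),
        }
        // The product information table lists UPC, Price (excl. tax), Tax,
        // Availability, etc. as <th>/<td> pairs.
        doc.Find("table.table.table-striped tr").Each(func(_ int, row *goquery.Selection) {
            product[row.Find("th").Text()] = row.Find("td").Text()
        })

        fmt.Println(product)
    }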

📄 Deployment Files

Deployment files are located in the k8s directory:

  • scraper-deployment.yaml: Deploys the Scraper Service.
  • parser-deployment.yaml: Deploys the Parser Service.
  • parser-service.yaml: Exposes the Parser Service to other services.
  • sqlite-pv.yaml: Persistent volume for storing the SQLite database.

🚀 Deployment Instructions

Helper scripts are provided in the scripts directory to simplify the deployment process:

  1. Ensure Kubernetes is Running Locally:

    minikube start
  2. Build Docker Images:

    ./scripts/build.ps1
  3. Deploy Services:

    ./scripts/deploy.ps1
  4. Check Deployment Status:

    kubectl get pods
    kubectl get pvc
  5. Access Logs:

    kubectl logs <scraper-pod>
    kubectl logs <parser-pod>

🧪 Testing the Services

  1. Check Scraper Service Logs:

    kubectl logs -l app=scraper
  2. Check Parser Service Logs:

    kubectl logs -l app=parser
  3. Verify Database Contents:

    Access the SQLite database stored in the persistent volume:

    kubectl exec -it <scraper-pod> -- sqlite3 /data/products.db

    Query the database:

    SELECT * FROM products;
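
The same check can be scripted in Go; a minimal sketch, assuming the mattn/go-sqlite3 driver and the products table and /data/products.db path used above (the actual schema is defined by the services themselves):

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite3", "/data/products.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Count the stored rows rather than assuming specific column names.
        var count int
        if err := db.QueryRow("SELECT COUNT(*) FROM products").Scan(&count); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("products stored: %d\n", count)
    }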

🛠️ Possible Issues and Improvements

Issues

  1. Parser Service Timeout

    • Description: The scraper service may experience timeouts when waiting for the parser service to respond, especially if the parsing operation takes a long time or the parser service is under heavy load.
    • Solution:
      • Increase the scraper’s request timeout configuration if needed (see the timeout sketch after this list).
      • Monitor the resource utilization of the parser service and scale it accordingly.
  2. HTML Page Structure Changes

    • Description: The scraper service relies on a specific HTML structure to extract data. If the structure of the target website changes, the scraper may fail to extract data correctly or miss important information.
    • Solution:
      • Regularly monitor the target website for changes in its HTML structure.
      • Implement error handling to detect when parsing fails or yields unexpected results (see the validation sketch after this list).
      • Use flexible parsing methods that can adapt to minor changes in the HTML structure.
  3. Resource Limits and Out-of-Memory (OOM) Errors

    • Description: The scraper or parser service may be terminated due to exceeding memory or CPU limits.
    • Solution:
      • Set appropriate resource requests and limits in the deployment manifests.
      • Monitor the resource usage of the services using Kubernetes tools like kubectl top pods or Prometheus.
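
For issue 1, the scraper-side deadline is normally expressed as a context timeout that gRPC propagates to the parser; a minimal sketch, with the client interface standing in for the repository's actual generated gRPC client:

    package scraper

    import (
        "context"
        "time"
    )

    // parseClient stands in for the generated gRPC client (see the contract
    // sketch in the Overview section); the real signature will differ.
    type parseClient interface {
        Parse(ctx context.Context, html string) (map[string]string, error)
    }

    // parseWithTimeout bounds how long the scraper waits for the parser, so a
    // slow or overloaded parser surfaces as a clear error instead of a hang.
    func parseWithTimeout(ctx context.Context, c parseClient, html string) (map[string]string, error) {
        ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // tune to observed parsing latency
        defer cancel()
        return c.Parse(ctx, html)
    }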
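
For issue 2, one lightweight guard is to validate each extracted record before it is stored, so a change in the site's HTML surfaces as an error rather than as silently empty rows; a sketch using the field names from the Overview (the repository's own struct may differ):

    package parser

    import "fmt"

    type Product struct {
        Name         string
        Availability string
        UPC          string
        PriceExclTax string
        Tax          string
    }

    // validate reports an error when required fields are missing, the usual
    // symptom of the target site's page structure having changed.
    func validate(p Product) error {
        if p.Name == "" || p.UPC == "" || p.PriceExclTax == "" {
            return fmt.Errorf("incomplete product %+v: page structure may have changed", p)
        }
        return nil
    }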

Improvements

  1. Logging:

    • Add structured logging and log levels (INFO, ERROR) to help with debugging (a log/slog sketch follows this list).
  2. Database:

    • Migrate from SQLite to a more scalable database like PostgreSQL for production use (see the driver-swap sketch after this list).
  3. Security:

    • Implement TLS verification properly and avoid bypassing it in production environments (see the TLS sketch after this list).
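
For improvement 1, Go's standard log/slog package (Go 1.21+) provides structured, leveled output; a minimal sketch of what both services could adopt (log messages and values are placeholders):

    package main

    import (
        "log/slog"
        "os"
    )

    func main() {
        // JSON output with a configurable minimum level.
        logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
            Level: slog.LevelInfo,
        }))

        logger.Info("page scraped", "url", "https://books.toscrape.com/catalogue/page-1.html", "bytes", 12345)
        logger.Error("parse failed", "upc", "example-upc", "err", "missing price field")
    }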
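
For improvement 2, if the services go through Go's database/sql (an assumption, not confirmed by this README), moving to PostgreSQL is largely a driver and connection-string change; the DSN below is a placeholder:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/lib/pq" // replaces _ "github.com/mattn/go-sqlite3"
    )

    func main() {
        // In Kubernetes the DSN would normally come from a Secret or ConfigMap.
        db, err := sql.Open("postgres", "postgres://user:password@postgres:5432/products?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        if err := db.Ping(); err != nil {
            log.Fatal(err)
        }
        log.Println("connected to PostgreSQL")
    }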
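
For improvement 3, the gRPC connection can carry real TLS credentials instead of insecure ones or a client that skips verification; a sketch using google.golang.org/grpc/credentials (the service address and certificate path are placeholders):

    package main

    import (
        "log"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials"
    )

    func main() {
        // Trust the CA that signed the parser service's certificate rather than
        // disabling verification.
        creds, err := credentials.NewClientTLSFromFile("/etc/certs/ca.crt", "")
        if err != nil {
            log.Fatal(err)
        }

        conn, err := grpc.Dial("parser-service:50051", grpc.WithTransportCredentials(creds))
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }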
