feat: lesson about using a framework #1303
I got your point for avoiding type hints. However, in the case of the handler:

    @crawler.router.default_handler
    async def handle_listing(context):
        ...

It leaves the reader without any possibility for code completions or static analysis when working with the `context` object.
In my opinion, type hints should be included here. We have been using them across all docs & examples.
Just a suggestion for you to reconsider, not a request.
Other than that, good job 🙂, and the code seems to be working.
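
To make the point about completions concrete, here is a minimal sketch using only the standard library. `FakeCrawlingContext` is a hypothetical stand-in invented for this illustration, not Crawlee's actual context class; the only thing it demonstrates is what tooling can and cannot see with and without the annotation.

```python
import asyncio
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class FakeCrawlingContext:
    """Hypothetical stand-in for a crawler's context object (not a Crawlee class)."""

    url: str
    title: str


async def handle_untyped(context):
    # Nothing tells an editor what `context` is, so it cannot suggest `.title`.
    return context.title.strip()


async def handle_typed(context: FakeCrawlingContext) -> str:
    # The annotation lets editors and type checkers resolve `context.title` as a str.
    return context.title.strip()


# The same information a static analyzer relies on, inspected at runtime.
untyped_hints = get_type_hints(handle_untyped)
typed_hints = get_type_hints(handle_typed)

result = asyncio.run(
    handle_typed(FakeCrawlingContext(url="https://example.com", title="  Sales  "))
)
```

Both handlers behave the same at runtime; the difference is only in what the annotation exposes to tools, which is the trade-off against the extra import and the longer signature.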
Of the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter, and not just because we're the company financing its development.

We genuinely believe beginners to scraping will like it more, since it lets them create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
I think this is fine, but if you would want more reasons you can check out this PR.
Thanks for the review! I see your point and I will indeed reconsider adding the type hint, at least for the context. It would be an easier decision if the type name weren't 28 characters long, but you're right about the benefits for people with editors like VS Code, where we can assume some level of automatic code completion.
1. We perform imports and specify an asynchronous `main()` function.
1. Inside, we first create a crawler. The crawler objects control the scraping. This particular crawler is of the BeautifulSoup flavor.
1. In the middle, we give the crawler a nested asynchronous function `handle_listing()`. Using a Python decorator (that line starting with `@`), we tell it to treat it as a default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace.
1. The function ends with running the crawler with the product listing URL. We await the crawler to finish its work.
1. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery will run our `main()` function.
1. We perform imports and specify an asynchronous `main()` function.
2. Inside, we first create a crawler. The crawler objects control the scraping. This particular crawler is of the BeautifulSoup flavor.
3. In the middle, we give the crawler a nested asynchronous function `handle_listing()`. Using a Python decorator (that line starting with `@`), we tell it to treat it as a default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without whitespace.
4. The function ends with running the crawler with the product listing URL. We await the crawler to finish its work.
5. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery will run our `main()` function.
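
For readers of this thread who want to see the shape those five steps describe, here is a rough, standard-library-only sketch. `ToyCrawler` and `ToyRouter` are hypothetical classes made up for this illustration and do not reproduce Crawlee's API; the toy crawler makes no HTTP requests and simply hands each URL to the registered handler. The URL is a placeholder too.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Handler = Callable[[str], Awaitable[None]]


class ToyRouter:
    """Hypothetical router that remembers a single default handler."""

    def __init__(self) -> None:
        self.handler: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Used as a decorator: store the function and return it unchanged.
        self.handler = func
        return func


class ToyCrawler:
    """Hypothetical crawler; a real one would download and parse each page."""

    def __init__(self) -> None:
        self.router = ToyRouter()

    async def run(self, urls: list[str]) -> None:
        if self.router.handler is None:
            raise RuntimeError("no default handler registered")
        for url in urls:
            await self.router.handler(url)


seen: list[str] = []  # records what the handler received, for inspection


async def main() -> None:
    # Step 2: create the crawler.
    crawler = ToyCrawler()

    # Step 3: register a nested async function as the default handler.
    @crawler.router.default_handler
    async def handle_listing(url: str) -> None:
        seen.append(url)

    # Step 4: run the crawler and await its completion.
    await crawler.run(["https://example.com/listing"])


# Step 5: run main() only when the file is executed as a program.
if __name__ == "__main__":
    asyncio.run(main())
```

The decorator line is equivalent to calling `crawler.router.default_handler(handle_listing)` right after the function definition, which may be a useful way to explain it to beginners.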
This PR introduces a new lesson to the Python course for beginners in scraping. The lesson is about working with a framework. Decisions I made:
Crawlee feedback
Regarding Crawlee, I didn't have much trouble writing this lesson, apart from the part where I wanted to provide hints on how to do this:
I couldn't find a good example in the docs, and I was afraid that even if I provided pointers to all the individual pieces, the student wouldn't be able to figure it out.
Also, I wanted to link to the docs when pointing out that `enqueue_links()` has a `limit` argument, but I couldn't find `enqueue_links()` in the docs. I found this, which is weird. It's not clear what object is documented or what it is; it feels like internals rather than regular documentation of a method. I probably know why it's this way, but I don't think it's useful in this form, and I decided I don't want to send people from the course to that page.

One more thing: I do think that Crawlee should log some "progress" information about requests made or, especially, items scraped. It's so weird to run the program and then just stare at it as if it had hung, waiting to see whether anything happens. E.g. Scrapy logs how many items per minute I scraped, which I personally find super useful.