Async Crawl
We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.
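As a quick sanity check before running the examples, you can confirm the key is visible to your Python process. This minimal sketch assumes the environment variable is named SPIDER_API_KEY; if your setup from the Getting Started guide uses a different name, adjust accordingly.

import os

# Assumes the key was exported as SPIDER_API_KEY (adjust if your setup differs).
api_key = os.environ.get("SPIDER_API_KEY")
if not api_key:
    raise RuntimeError("SPIDER_API_KEY is not set; see the Getting Started guide.")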
Crawl a website asynchronously and return the content.
import asyncio
from spider import AsyncSpider

url = "https://spider.cloud"

async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
        return crawled_data

result = asyncio.run(async_crawl_url(url, params={"limit": 10}))
print(result)
We use the AsyncSpider class to create an asynchronous instance of the Spider client. We then use an async for loop to iterate over the results of the crawl_url method, which returns an asynchronous generator that yields the crawled data. We append each item to a list and return it. Simsalabim, we have crawled a website asynchronously.
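Because crawl_url yields each result as soon as it is crawled, you can also process pages as they arrive instead of buffering them in a list. Here is a minimal sketch using the same AsyncSpider and crawl_url calls shown above:

import asyncio
from spider import AsyncSpider

async def stream_crawl(url, params):
    async with AsyncSpider() as app:
        # Handle each page as soon as the async generator yields it.
        async for data in app.crawl_url(url, params=params):
            print(data)

asyncio.run(stream_crawl("https://spider.cloud", params={"limit": 10}))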
Next we will see how to crawl asynchronously with different parameters.
Async Crawl with different parameters
The crawl_url method has the following parameters:

- url (str): The URL of the website to crawl.

The following are recommended parameters and can be set in the params dictionary:

- limit (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
- request_timeout (int): The maximum amount of time to wait for a response from the website.
- stealth (bool): Whether to use stealth mode. Default is False on Chrome.
- A ton more; visit the documentation for more parameters.
import asyncio
from spider import AsyncSpider

url = "https://spider.cloud"

async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
        return crawled_data

result = asyncio.run(
    async_crawl_url(
        url,
        params={
            "limit": 10,
            "request_timeout": 10,
            "stealth": True,
            "return_format": "html",
        },
    )
)
print(result)
If you have a lot of params, setting them inline in the crawl_url call can be cumbersome. You can define them in a separate params variable with the RequestParamsDict type, which is also available in the spider package.
import asyncio
from spider import AsyncSpider, spider_types

url = "https://spider.cloud"

async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
        return crawled_data

params: spider_types.RequestParamsDict = {
    "limit": 10,
    "request_timeout": 10,
    "stealth": True,
    # Easier to read, and intellisense will help you with the available options
}

result = asyncio.run(async_crawl_url(url, params=params))
print(result)