Crawl
We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.
Crawl a website and return the content.
from spider import Spider
app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(url, params={"limit": 10})
print(crawled_data)
The crawl_url
method returns the content of the website in markdown format by default. The limit
parameter sets the maximum number of pages allowed to crawl per website; here we set it to 10. Remove the value or set it to 0
to crawl all pages.
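The response contains one entry per crawled page. As a minimal sketch, assuming each entry is a dictionary exposing the page's url and content (inspect the returned data for the exact keys in your SDK version), you could iterate over the result like this:
from spider import Spider

app = Spider()
url = "https://spider.cloud"

# Crawl up to 10 pages and inspect each returned entry.
crawled_data = app.crawl_url(url, params={"limit": 10})

for page in crawled_data:
    # "url" and "content" are assumed keys; print the raw entry if they differ.
    print(page.get("url"), len(page.get("content") or ""))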
Next we will see how to crawl with different parameters.
Crawl with different parameters
The crawl_url
method has the following parameter:
- url (str): The URL of the website to crawl.

The following are recommended parameters and can be set in the params
dictionary:
- limit (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
- request_timeout (int): The maximum amount of time to wait for a response from the website.
- stealth (bool): Whether to use stealth mode. Default is False on Chrome.
- Visit the documentation for more parameters.
from spider import Spider
app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(
    url, params={"limit": 10, "request_timeout": 10, "stealth": True}
)
print(crawled_data)
If you have a lot of params, setting them inside the crawl_url
method call can be cumbersome. You can set them in a separate params
variable of the RequestParamsDict
type, which is also available in the spider
package.
from spider import Spider, spider_types
params: spider_types.RequestParamsDict = {
    "limit": 10,
    "request_timeout": 10,
    "stealth": True,
    "return_format": ["raw", "markdown"],
    # Easier to read and intellisense will help you with the available options
}
app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(url, params)
print(crawled_data)
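A typed params variable is also easy to reuse across calls. As a minimal sketch, assuming the same dictionary is accepted by the SDK's single-page scrape_url method (check the documentation for the methods available in your version), you could reuse it like this:
from spider import Spider, spider_types

params: spider_types.RequestParamsDict = {
    "limit": 10,
    "request_timeout": 10,
    "stealth": True,
    "return_format": ["raw", "markdown"],
}

app = Spider()
url = "https://spider.cloud"

# Reuse the same typed params for a full crawl and a single-page scrape.
crawled_data = app.crawl_url(url, params)
scraped_data = app.scrape_url(url, params)  # assumed to accept the same params

print(len(crawled_data), len(scraped_data))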