# Scrape
We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.
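If you would rather not rely on the environment variable, the client can be given the key directly. The snippet below is a minimal sketch that assumes the `Spider` constructor accepts an `api_key` argument; see the Getting Started guide for the exact setup.

```python
from spider import Spider

# Assumption: the client accepts an explicit api_key argument as an
# alternative to reading the key from the environment.
app = Spider(api_key="your-api-key")
```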
Scrape a website and return the content.
```python
from spider import Spider

app = Spider()

url = 'https://spider.cloud'

scraped_data = app.scrape_url(url)
print(scraped_data)
```
The `scrape_url` method returns the content of the website in markdown format by default. Next, we will see how to scrape with different parameters.
## Scrape with different parameters
The `scrape_url` method has the following parameters:

- `url` (str): The URL of the website to scrape.

The following are optional parameters and can be set in the `params` dictionary:

- `request` (`"http"`, `"chrome"`, `"smart"`): The type of request to make. Default is `"http"`.
- `return_format` (`"raw"`, `"markdown"`, `"commonmark"`, `"html2text"`, `"text"`, `"bytes"`): The format in which to return the scraped data. Default is `"markdown"`.
- `stealth`, `anti_bot`, and a ton of other parameters that you can find in the documentation.
```python
from spider import Spider

app = Spider()

url = "https://spider.cloud"

scraped_data = app.scrape_url(url, params={"request_timeout": 10, "stealth": True})
print(scraped_data)
```
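The same pattern works for the other parameters listed above. As a sketch, the example below asks for a browser-based request and plain-text output using the `request` and `return_format` options (the values are taken from the parameter list above):

```python
from spider import Spider

app = Spider()

url = "https://spider.cloud"

# "chrome" switches from the default "http" request to a browser-based request;
# "text" returns plain text instead of the default markdown.
scraped_data = app.scrape_url(url, params={"request": "chrome", "return_format": "text"})
print(scraped_data)
```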
If you have a lot of params, setting them inside the `scrape_url` call can be cumbersome. You can set them in a separate `params` variable that has the `RequestParamsDict` type, which is also available in the `spider` package via `spider_types`.
```python
from spider import Spider, spider_types

params: spider_types.RequestParamsDict = {
    "request_timeout": 10,
    "stealth": True,
    # Easier to read, and intellisense will help you with the available options
}

app = Spider()

url = "https://spider.cloud"

scraped_data = app.scrape_url(url, params)
print(scraped_data)
```
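A nice side effect of keeping the parameters in their own variable is that the same configuration can be reused across several calls. A small sketch (the URLs are placeholders):

```python
from spider import Spider, spider_types

# One shared configuration, reused across several scrapes.
params: spider_types.RequestParamsDict = {
    "request_timeout": 10,
    "stealth": True,
}

app = Spider()

# Placeholder URLs: swap in whatever pages you want to scrape.
for url in ["https://spider.cloud", "https://example.com"]:
    print(app.scrape_url(url, params))
```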