Crawl
We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.
Crawl a website and return the content.
import { Spider } from "@spider-cloud/spider-client";

// The client picks up the API key you exported as an environment variable.
const app = new Spider();
const url = "https://spider.cloud";

// Crawl up to 10 pages starting from the given URL.
const scrapedData = await app.crawlUrl(url, { limit: 10 });
console.log(scrapedData);
The crawlUrl method returns the content of the website in markdown format by default. We set the limit parameter to 10 to cap the number of pages crawled per website; remove the value or set it to 0 to crawl all pages.
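If you want to work with individual pages rather than log the whole response, you can iterate over the results. This is a minimal sketch that assumes the response is an array of page objects with url and content fields; check the API reference for the exact response shape.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();

// Crawl a few pages and inspect each result individually.
const pages = await app.crawlUrl("https://spider.cloud", { limit: 10 });

// Assumption: each entry has `url` and `content` fields
// (verify against the API reference).
for (const page of pages ?? []) {
  console.log(page.url, page.content?.length ?? 0);
}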
Next, we will see how to crawl with different parameters.
Crawl with different parameters
The crawlUrl method has the following parameters:

url (str): The URL of the website to crawl.

The following are recommended parameters and can be set in the params dictionary (a sketch combining them follows the list):

- limit (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
- request_timeout (int): The maximum amount of time to wait for a response from the website.
- stealth (bool): Whether to use stealth mode. Default is False on Chrome.
- Visit the documentation for more parameters.
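For example, the recommended parameters above can be combined in a single call. The values here are illustrative, not defaults:

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();
const url = "https://spider.cloud";

// Illustrative values: crawl at most 25 pages, wait up to 30 seconds
// per request, and enable stealth mode.
const scrapedData = await app.crawlUrl(url, {
  limit: 25,
  request_timeout: 30,
  stealth: true,
});
console.log(scrapedData);

The next example enables anti-bot measures and switches the return format to raw content: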
import { Spider } from "@spider-cloud/spider-client";
const app = new Spider();
const url = "https://spider.cloud";
const scrapedData = await app.crawlUrl(url, {
limit: 10,
anti_bot: true,
return_format: "raw",
});
console.log(scrapedData);
If you have a lot of params, setting them inline in the crawlUrl method call can be cumbersome. You can instead define them in a separate params variable with the SpiderParams type, which is also available in the spider package. You will need to use TypeScript if you want type annotations.
import { Spider } from "@spider-cloud/spider-client";
import type { SpiderParams } from "@spider-cloud/spider-client/dist/config";

const app = new Spider();
const url = "https://spider.cloud";

// Typed params: request both raw and markdown content for each page,
// with anti-bot measures enabled.
const params: SpiderParams = {
  return_format: ["raw", "markdown"],
  anti_bot: true,
};

const scrapedData = await app.crawlUrl(url, params);
console.log(scrapedData);