Introduction

spider-client is a client library to use with the Spider Cloud web crawler and scraper.

  • Concurrent
  • Streaming
  • Headless Chrome
  • HTTP Proxies
  • Cron Jobs
  • Subscriptions
  • AI Scraping and Event Driven Actions
  • Blacklisting and Budgeting Depth
  • Exponential Backoff

Simple Example

This is a simple example of what you can do with the spider-client library.

Installation

To install the library, use pip for Python or npm for JavaScript (make sure Node.js is installed):

# for python
pip install spider-client
# for javascript
npm install @spider-cloud/spider-client

Usage

Here is an example of how to use the library. Replace your_api_key with your actual API key, which you can get from the spider.cloud website.

# Python
from spider import Spider

app = Spider(api_key='your_api_key')
url = 'https://spider.cloud'
scraped_data = app.scrape_url(url)
print(scraped_data)

// JavaScript
import { Spider } from "@spider-cloud/spider-client";

const app = new Spider({ apiKey: "your_api_key" });
const url = "https://spider.cloud";
const scrapedData = await app.scrapeUrl(url);
console.log(scrapedData);

Getting started

To use the Python SDK, you will (of course) have to install it :)

pip install spider-client

Here is the link to the package on PyPI.

Setting & Getting API Key

To use the SDK you will need an API key. You can get one by signing up on spider.cloud.

Then you need to set the API key in your environment variables.

export SPIDER_API_KEY=your_api_key

If you don't want to set the API key in your environment variables, you can pass it as an argument to the Spider class.

from spider import Spider
app = Spider(api_key='your_api_key')

We recommend setting the API key in your environment variables.
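If you want a clear error when the variable is missing, one option is to read the key yourself before constructing the client. The sketch below is plain Python and makes no SDK calls; the helper name load_api_key is hypothetical, and only the SPIDER_API_KEY variable name comes from the text above.

```python
import os


def load_api_key(env=None):
    """Return the Spider API key from the environment, or raise a clear error."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("SPIDER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SPIDER_API_KEY is not set; export it or pass api_key explicitly"
        )
    return key
```

The returned value could then be passed on as Spider(api_key=load_api_key()).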

Crawl

We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.

Crawl a website and return the content.

from spider import Spider

app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(url, params={"limit": 10})
print(crawled_data)

The crawl_url method returns the content of the website in markdown format by default. We set the limit parameter to 10 to cap the number of pages crawled; limit is the maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

Next we will see how to crawl with different parameters.

Crawl with different parameters

The crawl_url method has the following parameters:

  • url (str): The URL of the website to crawl.

The following are recommended parameters and can be set in the params dictionary:

  • limit (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • request_timeout (int): The maximum amount of time to wait for a response from the website.
  • stealth (bool): Whether to use stealth mode. Default is False on chrome.
  • Visit the documentation for more parameters.

from spider import Spider

app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(
    url, params={"limit": 10, "request_timeout": 10, "stealth": True}
)

print(crawled_data)

If you have a lot of params, setting them inside the crawl_url method can be cumbersome. You can set them in a separate params variable annotated with the RequestParamsDict type, which is available from spider_types in the spider package.

from spider import Spider, spider_types

params: spider_types.RequestParamsDict = {
    "limit": 10,
    "request_timeout": 10,
    "stealth": True,
    "return_format": [ "raw", "markdown" ],
    # Easier to read and intellisense will help you with the available options
}

app = Spider()
url = "https://spider.cloud"
crawled_data = app.crawl_url(url, params)

print(crawled_data)
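When parameters come from configuration or command-line flags, it can help to build the dictionary in one place and leave out anything that was not set. The sketch below is plain Python with a hypothetical helper name (build_crawl_params); it only uses the limit, request_timeout, and stealth keys described above and does not call the SDK.

```python
def build_crawl_params(limit=None, request_timeout=None, stealth=None):
    """Build a params dict, omitting options that were left unset."""
    candidates = {
        "limit": limit,
        "request_timeout": request_timeout,
        "stealth": stealth,
    }
    # Keep only the options the caller actually provided.
    params = {key: value for key, value in candidates.items() if value is not None}
    if "limit" in params and params["limit"] < 0:
        raise ValueError("limit must be 0 (all pages) or a positive integer")
    return params


print(build_crawl_params(limit=10, stealth=True))
# {'limit': 10, 'stealth': True}
```

The resulting dictionary can be passed as the params argument of crawl_url in the same way as the literal dictionaries above.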

Scrape

We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.

Scrape a website and return the content.

from spider import Spider

app = Spider()
url = 'https://spider.cloud'
scraped_data = app.scrape_url(url)

print(scraped_data)

The scrape_url method returns the content of the website in markdown format by default. Next we will see how to scrape with different parameters.

Scrape with different parameters

The scrape_url method has the following parameters:

  • url (str): The URL of the website to scrape.

The following are optional parameters and can be set in the params dictionary:

  • request ("http", "chrome", "smart") : The type of request to make. Default is "http".
  • return_format ("raw", "markdown", "commonmark", "html2text", "text", "bytes") : The format in which to return the scraped data. Default is "markdown".
  • stealth, anti_bot, and many other parameters that you can find in the documentation.

from spider import Spider

app = Spider()
url = "https://spider.cloud"
scraped_data = app.scrape_url(url, params={"request_timeout": 10, "stealth": True})

print(scraped_data)
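Because request and return_format take a fixed set of string values, a typo is easy to miss until the request is sent. A small local check along these lines may help; it is a sketch with a hypothetical helper name (check_scrape_params), and the allowed values are copied from the list above, so the API may accept values that are not listed here.

```python
# Values copied from the parameter list in this guide.
ALLOWED_REQUEST = {"http", "chrome", "smart"}
ALLOWED_RETURN_FORMAT = {"raw", "markdown", "commonmark", "html2text", "text", "bytes"}


def check_scrape_params(params):
    """Raise ValueError if `request` or `return_format` holds an unlisted value."""
    request = params.get("request", "http")
    if request not in ALLOWED_REQUEST:
        raise ValueError(f"unsupported request type: {request!r}")
    # return_format may be a single string or a list of strings.
    formats = params.get("return_format", "markdown")
    if isinstance(formats, str):
        formats = [formats]
    for fmt in formats:
        if fmt not in ALLOWED_RETURN_FORMAT:
            raise ValueError(f"unsupported return_format: {fmt!r}")
    return params
```

Calling check_scrape_params(params) just before scrape_url keeps the validation separate from the request itself.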

If you have a lot of params, setting them inside the scrape_url method can be cumbersome. You can set them in a separate params variable annotated with the RequestParamsDict type, which is available from spider_types in the spider package.

from spider import Spider, spider_types

params: spider_types.RequestParamsDict = {
    "request_timeout": 10,
    "stealth": True,
    # Easier to read and intellisense will help you with the available options
}

app = Spider()
url = "https://spider.cloud"
scraped_data = app.scrape_url(url, params)

print(scraped_data)

Async Crawl

We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.

Crawl a website asynchronously and return the content.

import asyncio

from spider import AsyncSpider

url = "https://spider.cloud"


async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
    return crawled_data


result = asyncio.run(async_crawl_url(url, params={"limit": 10}))
print(result)

We use the AsyncSpider class to create an asynchronous instance of the Spider client. We then use an async for loop to iterate over the results of the crawl_url method, which returns an asynchronous generator that yields the crawled data. We append the data to a list and return it. Simsalabim, we have crawled a website asynchronously.
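To see the async for pattern in isolation, here is a self-contained sketch that needs neither the SDK nor a network connection. The fake_crawl function is a hypothetical stand-in for a streaming crawl, and the shape of the yielded dictionaries is invented for illustration.

```python
import asyncio


async def fake_crawl(url, limit):
    """Stand-in for a streaming crawl: yields one dict per page."""
    for page in range(limit):
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield {"url": f"{url}/page-{page}", "content": f"page {page}"}


async def collect(url, limit):
    results = []
    # `async for` awaits each item as it becomes available.
    async for item in fake_crawl(url, limit):
        results.append(item)
    return results


pages = asyncio.run(collect("https://example.com", 3))
print(len(pages))  # 3
```

Collecting into a list is convenient for small crawls; for large ones, processing each item inside the loop avoids holding everything in memory.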

Next we will see how to crawl asynchronously with different parameters.

Async Crawl with different parameters

The crawl_url method has the following parameters:

  • url (str): The URL of the website to crawl.

The following are recommended parameters and can be set in the params dictionary:

  • limit (int): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • request_timeout (int): The maximum amount of time to wait for a response from the website.
  • stealth (bool): Whether to use stealth mode. Default is False on chrome.
  • Many more; visit the documentation for the full list of parameters.

import asyncio

from spider import AsyncSpider

url = "https://spider.cloud"


async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
    return crawled_data


result = asyncio.run(
    async_crawl_url(
        url,
        params={
            "limit": 10,
            "request_timeout": 10,
            "stealth": True,
            "return_format": "raw",
        },
    )
)
print(result)

If you have a lot of params, setting them inside the crawl_url method can be cumbersome. You can set them in a separate params variable annotated with the RequestParamsDict type, which is available from spider_types in the spider package.

import asyncio

from spider import AsyncSpider, spider_types

url = "https://spider.cloud"


async def async_crawl_url(url, params):
    async with AsyncSpider() as app:
        crawled_data = []
        async for data in app.crawl_url(url, params=params):
            crawled_data.append(data)
    return crawled_data


params: spider_types.RequestParamsDict = {
    "limit": 10,
    "request_timeout": 10,
    "stealth": True,
    # Easier to read and intellisense will help you with the available options
}

result = asyncio.run(async_crawl_url(url, params=params))
print(result)

Getting started

To use the JavaScript SDK, you will (of course) have to install it. You can do so with your package manager of choice.

# with npm
npm install @spider-cloud/spider-client
# or with yarn
yarn add @spider-cloud/spider-client

Here is the link to the package on npm.

Setting & Getting API Key

To use the SDK you will need an API key. You can get one by signing up on spider.cloud.

Then you need to set the API key in your environment variables.

export SPIDER_API_KEY=your_api_key

If you don't want to set the API key in your environment variables, you can pass it as an argument to the Spider class.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider({ apiKey: "your_api_key" });

We recommend setting the API key in your environment variables.

Crawl

We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.

Crawl a website and return the content.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();
const url = "https://spider.cloud";
const scrapedData = await app.crawlUrl(url, { limit: 10 });
console.log(scrapedData);

The crawlUrl method returns the content of the website in markdown format by default. We set the limit parameter to 10 to cap the number of pages crawled; limit is the maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.

Next we will see how to crawl with different parameters.

Crawl with different parameters

The crawlUrl method has the following parameters:

  • url (string): The URL of the website to crawl.

The following are recommended parameters and can be set in the params object:

  • limit (number): The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.
  • request_timeout (number): The maximum amount of time to wait for a response from the website.
  • stealth (boolean): Whether to use stealth mode. Default is false on chrome.
  • Visit the documentation for more parameters.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();
const url = "https://spider.cloud";
const scrapedData = await app.crawlUrl(url, {
  limit: 10,
  anti_bot: true,
  return_format: "raw",
});
console.log(scrapedData);

If you have a lot of params, setting them inside the crawlUrl method can be cumbersome. You can set them in a separate params variable that has the SpiderParams type, which is also available in the spider package. You will have to use TypeScript if you want type annotations.

import { Spider } from "@spider-cloud/spider-client";
import type { SpiderParams } from "@spider-cloud/spider-client/dist/config";

const app = new Spider();
const url = "https://spider.cloud";
const params: SpiderParams = {
  return_format: ["raw", "markdown"],
  anti_bot: true,
};
const scrapedData = await app.crawlUrl(url, params);
console.log(scrapedData);

Scrape

We will assume that you have installed the Spider package and exported your API key as an environment variable. If you haven't, please refer to the Getting Started guide.

Scrape a website and return the content.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();
const url = "https://spider.cloud";
const scrapedData = await app.scrapeUrl(url);
console.log(scrapedData);

The scrapeUrl method returns the content of the website in markdown format by default. Next we will see how to scrape with different parameters.

Scrape with different parameters

The scrapeUrl method has the following parameters:

  • url (string): The URL of the website to scrape.

The following are optional parameters and can be set in the params object:

  • request ("http", "chrome", "smart") : The type of request to make. Default is "http".
  • return_format ("raw", "markdown", "commonmark", "html2text", "text", "bytes") : The format in which to return the scraped data. Default is "markdown".
  • stealth, anti_bot, and many other parameters that you can find in the documentation.

import { Spider } from "@spider-cloud/spider-client";

const app = new Spider();
const url = "https://spider.cloud";
const scrapedData = await app.scrapeUrl(url, {
  return_format: "raw",
  anti_bot: true,
});
console.log(scrapedData);

If you have a lot of params, setting them inside the scrapeUrl method can be cumbersome. You can set them in a separate params variable that has the SpiderParams type, which is also available in the spider package. You will have to use TypeScript if you want type annotations.

import { Spider } from "@spider-cloud/spider-client";
import type { SpiderParams } from "@spider-cloud/spider-client/dist/config";

const app = new Spider();
const url = "https://spider.cloud";
const params: SpiderParams = {
  return_format: "raw",
  anti_bot: true,
};
const scrapedData = await app.scrapeUrl(url, params);
console.log(scrapedData);

Getting Started

The Spider Cloud Rust SDK offers a toolkit for straightforward website scraping, crawling at scale, and other utilities such as extracting links and taking screenshots, so you can collect data formatted for use with large language models (LLMs). It provides a simple interface for integrating with the Spider Cloud API.

Installation

To use the Spider Cloud Rust SDK, include the following in your Cargo.toml:

[dependencies]
spider-client = "0.1"

Usage

  1. Get an API key from spider.cloud
  2. Set the API key as an environment variable named SPIDER_API_KEY or pass it as an argument when creating an instance of the Spider struct.

Here's an example of how to use the SDK:

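use serde_json::json;
// Import path is an assumption: these types are expected at the crate root of spider-client 0.1.
use spider_client::{RequestParams, RequestType, Spider};
use std::env;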

#[tokio::main]
async fn main() {
    // Set the API key as an environment variable
    env::set_var("SPIDER_API_KEY", "your_api_key");

    // Initialize the Spider with your API key
    let spider = Spider::new(None).expect("API key must be provided");

    let url = "https://spider.cloud";

    // Scrape a single URL
    let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

    println!("Scraped Data: {:?}", scraped_data);

    // Crawl a website
    let crawler_params = RequestParams {
        limit: Some(1),
        proxy_enabled: Some(true),
        store_data: Some(false),
        metadata: Some(false),
        request: Some(RequestType::Http),
        ..Default::default()
    };

    let crawl_result = spider.crawl_url(url, Some(crawler_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

    println!("Crawl Result: {:?}", crawl_result);
}

Scraping a URL

To scrape data from a single URL:

let url = "https://example.com";
let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

Crawling a Website

To automate crawling a website:

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    request: Some(RequestType::Smart),
    ..Default::default()
};
let crawl_result = spider.crawl_url(url, Some(crawl_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

Crawl Streaming

Stream the crawl in chunks and handle each chunk with a callback, which scales better for large sites:

fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    store_data: Some(false),
    ..Default::default()
};

spider.crawl_url(
    url,
    Some(crawl_params),
    true,
    "application/json",
    Some(handle_json)
).await.expect("Failed to crawl the URL");

Search

Perform a search for websites to crawl or gather search results:

let query = "a sports website";
let crawl_params = RequestParams {
    request: Some(RequestType::Smart),
    search_limit: Some(5),
    limit: Some(5),
    fetch_page_content: Some(true),
    ..Default::default()
};
let crawl_result = spider.search(query, Some(crawl_params), false, "application/json").await.expect("Failed to perform search");

Links

Extract all links from a specified URL:

let url = "https://example.com";
let links = spider.links(url, None, false, "application/json").await.expect("Failed to retrieve links from URL");

Transform

Transform HTML to markdown or text lightning fast:

let data = vec![json!({"html": "<html><body><h1>Hello world</h1></body></html>"})];
let params = RequestParams {
    readability: Some(false),
    return_format: Some(ReturnFormat::Markdown),
    ..Default::default()
};
let result = spider.transform(data, Some(params), false, "application/json").await.expect("Failed to transform HTML to markdown");
println!("Transformed Data: {:?}", result);

Taking Screenshots of a URL(s)

Capture a screenshot of a given URL:

let url = "https://example.com";
let screenshot = spider.screenshot(url, None, false, "application/json").await.expect("Failed to take screenshot of URL");

Extracting Contact Information

Extract contact details from a specified URL:

let url = "https://example.com";
let contacts = spider.extract_contacts(url, None, false, "application/json").await.expect("Failed to extract contacts from URL");
println!("Extracted Contacts: {:?}", contacts);

Labeling Data from a URL(s)

Label the data extracted from a particular URL:

let url = "https://example.com";
let labeled_data = spider.label(url, None, false, "application/json").await.expect("Failed to label data from URL");
println!("Labeled Data: {:?}", labeled_data);

Checking Crawl State

You can check the crawl state of a specific URL:

let url = "https://example.com";
let state = spider.get_crawl_state(url, None, false, "application/json").await.expect("Failed to get crawl state for URL");
println!("Crawl State: {:?}", state);

Downloading Files

You can create a signed URL to download the stored results for a website:

let url = "https://example.com";
let options = hashmap!{
    "page" => 0,
    "limit" => 100,
    "expiresIn" => 3600 // Optional, add if needed
};
let response = spider.create_signed_url(Some(url), Some(options)).await.expect("Failed to create signed URL");
println!("Download URL: {:?}", response);

Checking Available Credits

You can check the remaining credits on your account:

let credits = spider.get_credits().await.expect("Failed to get credits");
println!("Remaining Credits: {:?}", credits);

Data Operations

The Spider client can now interact with specific data tables to create, retrieve, and delete data.

Retrieve Data from a Table

To fetch data from a specified table by applying query parameters:

let table_name = "pages";
let query_params = RequestParams {
    limit: Some(20),
    ..Default::default()
};
let response = spider.data_get(table_name, Some(query_params)).await.expect("Failed to retrieve data from table");
println!("Data from table: {:?}", response);

Delete Data from a Table

To delete data from a specified table based on certain conditions:

let table_name = "websites";
let delete_params = RequestParams {
    domain: Some("www.example.com".to_string()),
    ..Default::default()
};
let response = spider.data_delete(table_name, Some(delete_params)).await.expect("Failed to delete data from table");
println!("Delete Response: {:?}", response);

Streaming

If you need to use streaming, set the stream parameter to true and provide a callback function:

fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

spider.links(url, Some(crawler_params), true, "application/json").await.expect("Failed to retrieve links from URL");

Content-Type

The following Content-Type headers are supported through the content_type parameter:

  • application/json
  • text/csv
  • application/xml
  • application/jsonl

let url = "https://example.com";

let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    store_data: Some(false),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

// Stream JSON lines back to the client
spider.crawl_url(url, Some(crawler_params), true, "application/jsonl", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");
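With application/jsonl, each line of the response body is a standalone JSON document. As a language-neutral illustration of consuming that format, here is a short Python sketch; the helper name parse_jsonl is hypothetical, and the field names in the sample are invented for the example rather than taken from the API's response schema.

```python
import json


def parse_jsonl(text):
    """Parse a JSON Lines payload into a list of Python objects."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines between records
            records.append(json.loads(line))
    return records


# Illustrative payload only; real responses may use different fields.
sample = (
    '{"url": "https://example.com", "status": 200}\n'
    '{"url": "https://example.com/about", "status": 200}\n'
)
print(len(parse_jsonl(sample)))  # 2
```

Parsing line by line is what makes the format suitable for streaming, since each record can be handled as soon as its line is complete.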

Error Handling

The SDK handles errors returned by the Spider Cloud API and reports them to the caller. If an error occurs during a request, it is propagated to the caller with a descriptive error message. By default, requests use exponential backoff to retry as needed.
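For readers unfamiliar with the term, exponential backoff means waiting longer after each failed attempt, typically doubling the delay. The Python sketch below illustrates the general technique only; it is not the SDK's implementation, and the function name, attempt count, and delays are assumptions chosen for the example.

```python
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the last error to the caller
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

The sleep argument is injectable so the retry schedule can be checked in tests without actually waiting.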

Getting Started

Spider Cloud CLI is a command-line interface to interact with the Spider Cloud web crawler. It allows you to scrape, crawl, search, and perform various other web-related tasks through simple commands.

Installation

Install the CLI using Homebrew, or with Cargo from crates.io:

Homebrew

brew tap spider-rs/spider-cloud-cli
brew install spider-cloud-cli

Cargo

cargo install spider-cloud-cli

Usage

After installing, you can use the CLI by typing spider-cloud-cli followed by a command and its respective arguments.

Authentication

Before using most of the commands, you need to authenticate by providing an API key:

spider-cloud-cli auth --api_key YOUR_API_KEY

Commands

Scrape

Scrape data from a specified URL.

spider-cloud-cli scrape --url http://example.com

Crawl

Crawl a specified URL with an optional limit on the number of pages.

spider-cloud-cli crawl --url http://example.com --limit 10
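If you drive the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes. The sketch below is in Python and uses only the crawl command and the --url and --limit flags shown above; the helper name crawl_command is hypothetical, and nothing is executed.

```python
import shlex


def crawl_command(url, limit=None):
    """Build the argv list for `spider-cloud-cli crawl`."""
    argv = ["spider-cloud-cli", "crawl", "--url", url]
    if limit is not None:
        argv += ["--limit", str(limit)]
    return argv


print(shlex.join(crawl_command("http://example.com", limit=10)))
# spider-cloud-cli crawl --url http://example.com --limit 10
```

The list can be handed to subprocess.run(argv, check=True) once the CLI is installed and authenticated.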

Links

Fetch links from a specified URL.

spider-cloud-cli links --url http://example.com

Screenshot

Take a screenshot of a specified URL.

spider-cloud-cli screenshot --url http://example.com

Search

Search for a query.

spider-cloud-cli search --query "example query"

Transform

Transform specified data.

spider-cloud-cli transform --data "sample data"

Extract Contacts

Extract contact information from a specified URL.

spider-cloud-cli extract_contacts --url http://example.com

Label

Label data from a specified URL.

spider-cloud-cli label --url http://example.com

Get Crawl State

Get the crawl state of a specified URL.

spider-cloud-cli get_crawl_state --url http://example.com

Query

Query records of a specified domain.

spider-cloud-cli query --domain example.com

Get Credits

Fetch the account credits left.

spider-cloud-cli get_credits