
Getting Started

The Spider Cloud Rust SDK provides a toolkit for straightforward website scraping, crawling at scale, and utilities such as extracting links and taking screenshots, letting you collect data formatted for large language models (LLMs). It offers a simple interface for integrating with the Spider Cloud API.

Installation

To use the Spider Cloud Rust SDK, include the following in your Cargo.toml:

[dependencies]
spider-client = "0.1"
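
The examples below also use the tokio async runtime and serde_json. If your project does not already depend on them, a manifest capable of running the examples might look like this (versions are illustrative):

[dependencies]
spider-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"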

Usage

  1. Get an API key from spider.cloud
  2. Set the API key as an environment variable named SPIDER_API_KEY or pass it as an argument when creating an instance of the Spider struct (see the sketch below).
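
If you prefer to pass the key directly instead of relying on the environment variable, hand it to the constructor. This is a minimal sketch assuming Spider::new accepts an Option<String>, consistent with the Spider::new(None) call used below:

use spider_client::Spider;

// Assumed signature: `Spider::new(Option<String>)`, consistent with
// the `Spider::new(None)` call in the full example below.
let spider = Spider::new(Some("your_api_key".to_string())).expect("API key must be provided");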

Here's an example of how to use the SDK:

use spider_client::{RequestParams, RequestType, Spider};
use std::env;

#[tokio::main]
async fn main() {
    // Set the API key as an environment variable
    env::set_var("SPIDER_API_KEY", "your_api_key");

    // Initialize the Spider with your API key
    let spider = Spider::new(None).expect("API key must be provided");

    let url = "https://spider.cloud";

    // Scrape a single URL
    let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

    println!("Scraped Data: {:?}", scraped_data);

    // Crawl a website
    let crawler_params = RequestParams {
        limit: Some(1),
        proxy_enabled: Some(true),
        metadata: Some(false),
        request: Some(RequestType::Http),
        ..Default::default()
    };

    let crawl_result = spider.crawl_url(url, Some(crawler_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

    println!("Crawl Result: {:?}", crawl_result);
}

Scraping a URL

To scrape data from a single URL:

let url = "https://example.com";
let scraped_data = spider.scrape_url(url, None, false, "application/json").await.expect("Failed to scrape the URL");

Crawling a Website

To automate crawling a website:

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    request: Some(RequestType::Smart),
    ..Default::default()
};
let crawl_result = spider.crawl_url(url, Some(crawl_params), false, "application/json", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

Crawl Streaming

Stream the crawl in chunks and handle each result with a callback, which scales to larger sites:

fn handle_json(json_obj: serde_json::Value) {
    println!("Received chunk: {:?}", json_obj);
}

let url = "https://example.com";
let crawl_params = RequestParams {
    limit: Some(200),
    ..Default::default()
};

spider.crawl_url(
    url,
    Some(crawl_params),
    true,
    "application/json",
    Some(handle_json)
).await.expect("Failed to crawl the URL");

Search

Perform a search for websites to crawl or gather search results:

let query = "a sports website";
let crawl_params = RequestParams {
    request: Some(RequestType::Smart),
    search_limit: Some(5),
    limit: Some(5),
    fetch_page_content: Some(true),
    ..Default::default()
};
let crawl_result = spider.search(query, Some(crawl_params), false, "application/json").await.expect("Failed to perform search");

Retrieving Links

Extract all links from a specified URL:

let url = "https://example.com";
let links = spider.links(url, None, false, "application/json").await.expect("Failed to retrieve links from URL");

Transform

Transform HTML to markdown or text lightning fast:

let data = vec![json!({"html": "<html><body><h1>Hello world</h1></body></html>"})];
let params = RequestParams {
    readability: Some(false),
    return_format: Some(ReturnFormat::Markdown),
    ..Default::default()
};
let result = spider.transform(data, Some(params), false, "application/json").await.expect("Failed to transform HTML to markdown");
println!("Transformed Data: {:?}", result);

Taking Screenshots of a URL

Capture a screenshot of a given URL:

let url = "https://example.com";
let screenshot = spider.screenshot(url, None, false, "application/json").await.expect("Failed to take screenshot of URL");

Checking Available Credits

You can check the remaining credits on your account:

let credits = spider.get_credits().await.expect("Failed to get credits");
println!("Remaining Credits: {:?}", credits);

Streaming

If you need to use streaming, set the stream parameter to true. Endpoints that take a callback, such as crawl_url, can also be given one, as shown in the Crawl Streaming section above:

let url = "https://example.com";
let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

spider.links(url, Some(crawler_params), true, "application/json").await.expect("Failed to retrieve links from URL");

Content-Type

The following Content-Type headers are supported via the content_type parameter:

  • application/json
  • text/csv
  • application/xml
  • application/jsonl

For example, to stream crawl results back as JSON lines:

let url = "https://example.com";

let crawler_params = RequestParams {
    limit: Some(1),
    proxy_enabled: Some(true),
    metadata: Some(false),
    request: Some(RequestType::Http),
    ..Default::default()
};

// Stream JSON lines back to the client
spider.crawl_url(url, Some(crawler_params), true, "application/jsonl", None::<fn(serde_json::Value)>).await.expect("Failed to crawl the URL");

Error Handling

The SDK surfaces errors returned by the Spider Cloud API to the caller as Result values with descriptive error messages. By default, requests are retried with exponential backoff as needed.
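
For example, here is a minimal sketch of inspecting a failure instead of unwrapping it; it assumes the SDK's error type implements Display, which is typical for Rust error types:

use spider_client::Spider;

#[tokio::main]
async fn main() {
    let spider = Spider::new(None).expect("API key must be provided");

    // Match on the Result instead of calling .expect, so API errors
    // (invalid keys, rate limits, network failures) can be inspected.
    match spider.scrape_url("https://example.com", None, false, "application/json").await {
        Ok(data) => println!("Scraped Data: {:?}", data),
        Err(err) => eprintln!("Request failed: {err}"),
    }
}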