Introduction

Spider-RS is the fastest web crawler and indexer, written in Rust and ported to Node.js.

  • Concurrent
  • Streaming
  • Decentralization
  • Headless Chrome Rendering
  • HTTP Proxies
  • Cron Jobs
  • Subscriptions
  • Blacklisting and Budgeting Depth
  • Written in Rust for speed, safety, and simplicity

Spider powers several large tools and, with the correct setup, helps keep crawling downtime close to zero. View the spider project to learn more.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')

await website.crawl()

console.log(website.getLinks())

Getting Started

Make sure you have Node.js v10 or higher installed.

Install the package with your favorite package manager.

yarn add @spider-rs/spider-rs
# or
npm install @spider-rs/spider-rs

A simple example

The Rust project is ported over as a Node.js native addon built with napi.

The addon layer adds some performance overhead, but crawls are still lightning fast and efficient.

Usage

The examples below can help you get started with spider.

Basic

A basic example.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')

await website.crawl()
console.log(website.getLinks())

Events

You can pass a function, which may be async, as a parameter to crawl and scrape.

import { Website, type NPage } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')

const links: NPage[] = []

const onPageEvent = async (err: Error | null, page: NPage) => {
  links.push(page)
}

await website.crawl(onPageEvent)
console.log(website.getLinks())

Selector

The pageTitle helper allows you to extract the title of the page.

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')

const links = []

const onPageEvent = async (err, page) => {
  links.push({ title: pageTitle(page), url: page.url })
}

// params in order event, background, and headless chrome
await website.crawl(onPageEvent)

Shortcut

You can use the crawl shortcut method to collect content quickly without configuration.

import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://choosealicense.com')

console.log([links, pages])

Website

The Website class is the foundation of the spider.

Builder pattern

We use the builder pattern to configure the website for crawling.

Note: replace https://choosealicense.com in the examples below with your target website URL.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')

Return Page Links

Return links found on the page resource.

const website = new Website('https://choosealicense.com')
  .with_return_page_links(true)
  .build()

Custom Headers

Add custom HTTP headers to use when crawling/scraping.

const website = new Website('https://choosealicense.com')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .build()

Blacklist

Prevent crawling a set path, URL, or pattern with regex.

const website = new Website('https://choosealicense.com')
  .withBlacklistUrl(['/blog', new RegExp('/books').source, '/resume'])
  .build()

Whitelist

Only crawl set paths, URLs, or patterns with regex.

const website = new Website('https://choosealicense.com')
  .withWhitelistUrl(['/blog', new RegExp('/books').source, '/resume'])
  .build()

Crons

Set up a cron job that can run at any time in the background using cron syntax.

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()

View the cron section for details on how to use the cron.

Budget

Add a crawl budget that limits the number of pages crawled.

const website = new Website('https://choosealicense.com')
  .withBudget({
    '*': 1,
  })
  .build()
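
The budget keys are path prefixes. As a sketch (assuming '*' acts as the site-wide limit and other keys cap pages under the matching path, as in the underlying Rust crate), you can combine a global budget with per-path budgets:

import { Website } from '@spider-rs/spider-rs'

// hypothetical example values: '*' is assumed to cap the entire crawl,
// while '/licenses' is assumed to cap pages under that path prefix
const website = new Website('https://choosealicense.com')
  .withBudget({
    '*': 100,
    '/licenses': 10,
  })
  .build()

await website.crawl()
console.log(website.getLinks())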

Subdomains

Include subdomains in the crawl.

const website = new Website('https://choosealicense.com').withSubdomains(true).build()

TLD

Include TLDs in the crawl.

const website = new Website('https://choosealicense.com').withTlds(true).build()

External Domains

Add external domains to include with the website.

const website = new Website('https://choosealicense.com').withExternalDomains(['https://www.myotherdomain.com']).build()

Proxy

Use a proxy to crawl a website.

const website = new Website('https://choosealicense.com').withProxies(['https://www.myproxy.com']).build()

Delays

Add delays between pages. Defaults to none.

const website = new Website('https://choosealicense.com').withDelays(200).build()

Wait_For_Delay

Wait for a delay on the page. This should only be used for testing. This method does nothing if the chrome feature is not enabled. The first param is the delay in seconds and the second is the additional delay in nanoseconds.

// a delay of 2 seconds and 500 nanos
const website = new Website('https://choosealicense.com').with_wait_for_delay(2, 500).build()

Wait_For_Selector

Wait for a selector on the page with a max timeout. This method does nothing if the chrome feature is not enabled.

// wait for the selector '.news-feed' with a max timeout of 2 seconds and 500 nanoseconds
const website = new Website('https://choosealicense.com').with_wait_for_selector('.news-feed', 2, 500).build()

Wait_For_Idle_Network

Wait for the network to become idle, with a max timeout. This method does nothing if the chrome feature is not enabled.

// wait for an idle network with a max timeout of 2 seconds and 500 nanoseconds
const website = new Website('https://choosealicense.com').with_wait_for_idle_network(2, 500).build()

User-Agent

Use a custom User-Agent.

const website = new Website('https://choosealicense.com').withUserAgent('mybot/v1').build()

Chrome Remote Connection

Add a Chrome remote connection URL. This can be a JSON endpoint or a direct WebSocket connection.

const website = new Website('https://choosealicense.com').with_chrome_connection('http://localhost:9222/json/version').build()

OpenAI

Use OpenAI to generate dynamic scripts to run with headless Chrome. Make sure to set the OPENAI_API_KEY env variable.

const website = new Website('https://google.com')
  .withOpenAI({
    model: 'gpt-3.5-turbo',
    prompt: 'Search for movies',
    maxTokens: 300,
  })
  .build()

// make sure to crawl or scrape with the headless param set to true.
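
As a concrete sketch of the note above, the crawl call below enables headless rendering via the third parameter; it assumes the page-event callback can be omitted by passing undefined, since all params are optional.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://google.com')
  .withOpenAI({ model: 'gpt-3.5-turbo', prompt: 'Search for movies', maxTokens: 300 })
  .build()

// params in order: page event callback, background, headless chrome
await website.crawl(undefined, false, true)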

Screenshots

Take screenshots of pages during the crawl when using headless Chrome.

const website = new Website('https://google.com')
  .withScreenshot({
    params: {
      cdp_params: null,
      full_page: true,
      omit_background: false,
    },
    bytes: false,
    save: true,
    output_dir: null,
  })
  .build()

// make sure to crawl or scrape with the headless param set to true.

Request Timeout

Add a request timeout per page in milliseconds. The example shows 30 seconds.

const website = new Website('https://choosealicense.com').withRequestTimeout(30000).build()

Respect Robots

Respect the robots.txt file.

const website = new Website('https://choosealicense.com').withRespectRobotsTxt(true).build()

Http2 Prior Knowledge

Use HTTP/2 prior knowledge to connect if you know the website's server supports it.

const website = new Website('https://choosealicense.com').withHttp2PriorKnowledge(true).build()

Chrome Network Interception

Enable network interception when using Chrome to speed up requests.

const website = new Website('https://choosealicense.com').withChromeIntercept(true, true).build()

Redirect Limit

Set the redirect limit for requests.

const website = new Website('https://choosealicense.com').withRedirectLimit(2).build()

Depth Limit

Set the depth limit for the number of forward pages crawled.

const website = new Website('https://choosealicense.com').withDepth(3).build()

Cache

Enable HTTP caching; this is useful when running the spider on a server.

const website = new Website('https://choosealicense.com').withCaching(true).build()

Redirect Policy

Set the redirect policy for requests, either strict or loose (the default). Strict only allows redirects that match the domain.

const website = new Website('https://choosealicense.com').withRedirectPolicy(true).build()

Chaining

You can chain all of the configs together for simple configuration.

const website = new Website('https://choosealicense.com')
  .withSubdomains(true)
  .withTlds(true)
  .withUserAgent('mybot/v1')
  .withRespectRobotsTxt(true)
  .build()

Raw Content

Set the second param of the Website constructor to true to return content without UTF-8 conversion. This will populate rawContent and leave content empty when using subscriptions or the Page object.

const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)
await website.scrape()

Clearing Crawl Data

Use website.clear to remove the visited links and page data, or website.drainLinks to drain the visited links.

const website = new Website('https://choosealicense.com')
await website.crawl()
// links found ["https://...", "..."]
console.log(website.getLinks())
website.clear()
// links will be empty
console.log(website.getLinks())
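
drainLinks can be used instead when you want to consume the visited links while resetting them, as in this sketch (it assumes drainLinks returns the drained link list):

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com')
await website.crawl()

// assumed behavior: drainLinks returns the visited links and empties the internal set
const drained = website.drainLinks()
console.log(drained)

// the visited links are now cleared
console.log(website.getLinks())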

Storing and Exporting Data

Collecting data to store can be done with website.pushData() and website.exportJsonlData().

const website = new Website('https://choosealicense.com')

const onPageEvent = (_err, page) => {
  website.pushData(page)
}

await website.crawl(onPageEvent)

// uncomment to read the data.
// console.log(website.readData());

// only one export method is available at the moment. The file path is optional; by default all data goes to the storage directory.
await website.exportJsonlData('./storage/test.jsonl')
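
If you want to process the exported file afterwards, plain Node.js streaming works; this generic sketch parses one JSON record per line (the record shape depends on the page data you pushed):

import { createReadStream } from 'node:fs'
import { createInterface } from 'node:readline'

// stream the exported JSONL file line by line
const rl = createInterface({ input: createReadStream('./storage/test.jsonl') })

for await (const line of rl) {
  if (!line.trim()) continue
  // each line holds one JSON record
  console.log(JSON.parse(line))
}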

Stop crawl

To stop a crawl you can use website.stopCrawl(id): pass in the crawl ID to stop a specific run, or leave it empty to stop all crawls.

const website = new Website('https://choosealicense.com')

const onPageEvent = (_err, page) => {
  console.log(page)
  // stop the concurrent crawl when 8 pages are found.
  if (website.size >= 8) {
    website.stop()
  }
}

await website.crawl(onPageEvent)

Page

A single page on a website, useful if you need just the root URL.

New Page

Get a new page with content.

The first param is the URL, the second sets whether subdomains should be included, and the third sets whether TLDs should be included in links.

Calling page.fetch is needed to get the content.

import { Page } from '@spider-rs/spider-rs'

const page = new Page('https://choosealicense.com', false, false)
await page.fetch()

Get all the links related to a page.

const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const links = await page.getLinks()
console.log(links)

Page Html

Get the HTML markup for the page.

const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const html = page.getHtml()
console.log(html)

Page Bytes

Get the raw bytes of a page to store the files in a database.

const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const bytes = page.getBytes()
console.log(bytes)
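
For example, to persist the bytes to disk instead of a database, standard Node.js file APIs can be used (a sketch assuming getBytes returns a Buffer or typed array):

import { Page } from '@spider-rs/spider-rs'
import { writeFile } from 'node:fs/promises'

const page = new Page('https://choosealicense.com', false, false)
await page.fetch()

// write the raw bytes to a file; a database blob column would work the same way
await writeFile('./choosealicense.html', page.getBytes())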

Environment

Env variables to adjust the project.

CHROME_URL

You can set the chrome URL to connect remotely.

CHROME_URL=http://localhost:9222

Crawl

Crawl a website concurrently.

import { Website } from '@spider-rs/spider-rs'

// pass in the website url
const website = new Website('https://rsseau.fr')

await website.crawl()

// [ "https://rsseau.fr/blog", ...]
console.log(website.getLinks())

Async Event

You can pass an async function as the first param to the crawl function to stream real-time updates.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

await website.crawl(onPageEvent)

Background

You can run the request in the background and receive events by setting the second param to true.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

await website.crawl(onPageEvent, true)
// this will run instantly as the crawl is in the background

Subscriptions

You can set up multiple subscriptions to receive events when a crawl happens.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

const subscriptionID = website.subscribe(onPageEvent)

await website.crawl()

website.unsubscribe(subscriptionID)

Headless Chrome

Headless Chrome rendering can be enabled by setting the third param of crawl or scrape to true. If the CHROME_URL env variable is set, it will attempt to connect to Chrome running remotely, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

// all params are optional. The third param determines headless rendering.
await website.crawl(onPageEvent, false, true)
// make sure to call unsubscribe when finished, or else the instance is kept alive while events are set up.
website.unsubscribe()

Scrape

Scrape a website and collect the resource data.

import { Website } from '@spider-rs/spider-rs'

// pass in the website url
const website = new Website('https://rsseau.fr')

await website.scrape()

// [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]
console.log(website.getPages())

Headless Chrome

Headless Chrome rendering can be enabled by setting the third param of crawl or scrape to true. If the CHROME_URL env variable is set, it will attempt to connect to Chrome running remotely, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

// all params are optional. The third param determines headless rendering.
await website.scrape(onPageEvent, false, true)

Cron Jobs

Use a cron job that can run at any time of day to gather website data.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()

// stream the pages of the website as the cron runs.
const onPageEvent = (err, value) => {
  console.log(value)
}

const handle = await website.runCron(onPageEvent)
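
Continuing the snippet above, the returned handle controls the job; as a sketch (stop on the handle is an assumed method, mirroring the underlying Rust crate), you can shut the cron down when you are done:

// stop the cron job when it is no longer needed (assumed API)
await handle.stop()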

Storing Data

Storing data can be done to collect the raw content for a website.

This allows you to upload and download the content without UTF-8 conversion. The property only appears when setting the second param of the Website class constructor to true.

import { Website, type NPage } from '@spider-rs/spider-rs'

const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)

const links: Buffer[] = []

const onPageEvent = (_err: Error | null, page: NPage) => {
  if (page.rawContent) {
    // we can download or store the content to disk now.
    links.push(page.rawContent)
  }
}

await website.crawl(onPageEvent)

Benchmarks

Test url: https://espn.com (Mac M1, 10-core CPU, 64 GB RAM)

libraries                 pages      speed
spider(rust): crawl       150,387    1m
spider(nodejs): crawl     150,387    153s
spider(python): crawl     150,387    186s
scrapy(python): crawl     49,598     1h
crawlee(nodejs): crawl    18,779     30m

View the latest runs on GitHub.

-----------------------
Linux
2-core CPU
7 GB of RAM memory
-----------------------

Test url: https://choosealicense.com (small) 32 pages

libraries                      speed
spider-rs: crawl 10 samples    76ms
crawlee: crawl 10 samples      1s

Test url: https://rsseau.fr (medium) 211 pages

libraries                      speed
spider-rs: crawl 10 samples    0.5s
crawlee: crawl 10 samples      72s

-----------------------
Mac Apple M1 Max
10-core CPU
64 GB of RAM memory
-----------------------

Test url: https://choosealicense.com (small) 32 pages

libraries                      speed
spider-rs: crawl 10 samples    286ms
crawlee: crawl 10 samples      1.7s

Test url: https://rsseau.fr (medium) 211 pages

libraries                      speed
spider-rs: crawl 10 samples    2.5s
crawlee: crawl 10 samples      75s

The performance gap grows with larger websites and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.