Website
The Website class is the foundation of the spider.
Builder pattern
We use the builder pattern to configure the website for crawling.
Note: replace https://choosealicense.com in the examples below with your target website URL.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
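A minimal sketch of building and crawling the configured website, using the crawl and getLinks methods shown later in this section:
const website = new Website('https://choosealicense.com').build()

await website.crawl()
// the links visited during the crawl
console.log(website.getLinks())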
Return Page Links
Return the links found on each page resource.
const website = new Website('https://choosealicense.com')
.withReturnPageLinks(true)
.build()
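When enabled, the discovered links are attached to the page received in subscriptions. A hedged sketch follows; the pageLinks field name is an assumption, so check the Page object for the exact property:
const website = new Website('https://choosealicense.com').withReturnPageLinks(true).build()

const onPageEvent = (_err, page) => {
  // pageLinks is assumed here; verify the field name on the Page object
  console.log(page.pageLinks)
}

await website.crawl(onPageEvent)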
Custom Headers
Add custom HTTP headers to use when crawling/scraping.
const website = new Website('https://choosealicense.com')
.withHeaders({
authorization: 'somerandomjwt',
})
.build()
Blacklist
Prevent crawling a set of paths, URLs, or regex patterns.
const website = new Website('https://choosealicense.com')
.withBlacklistUrl(['/blog', new RegExp('/books').source, '/resume'])
.build()
Whitelist
Only crawl a set of paths, URLs, or regex patterns.
const website = new Website('https://choosealicense.com')
.withWhitelistUrl(['/blog', new RegExp('/books').source, '/resume'])
.build()
Crons
Set up a cron job that runs in the background using cron syntax.
const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()
View the cron section for details on how to use the cron.
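A minimal sketch of running the job, assuming the runCron method described in the cron section:
const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()

const onPageEvent = (_err, page) => {
  console.log(page)
}

// runs the cron in the background; see the cron section for stopping the handle
const handle = await website.runCron(onPageEvent)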
Budget
Add a crawl budget that limits the number of pages crawled.
const website = new Website('https://choosealicense.com')
.withBudget({
'*': 1,
})
.build()
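The '*' key applies to every path. A hedged sketch assuming path-specific keys are also honored, as in the underlying spider crate:
const website = new Website('https://choosealicense.com')
  .withBudget({
    // limit the whole crawl to 100 pages
    '*': 100,
    // assumed: cap pages under /blog separately
    '/blog': 10,
  })
  .build()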
Subdomains
Include subdomains in the crawl.
const website = new Website('https://choosealicense.com').withSubdomains(true).build()
TLD
Include other TLD variants of the domain in the crawl.
const website = new Website('https://choosealicense.com').withTlds(true).build()
External Domains
Add external domains to include in the crawl.
const website = new Website('https://choosealicense.com').withExternalDomains(['https://www.myotherdomain.com']).build()
Proxy
Use a proxy to crawl a website.
const website = new Website('https://choosealicense.com').withProxies(['https://www.myproxy.com']).build()
Delays
Add delays between pages. Defaults to none.
const website = new Website('https://choosealicense.com').withDelays(200).build()
Wait For Delay
Wait for a delay on the page. This should only be used for testing and does nothing if the chrome feature is not enabled.
The first parameter is the delay in seconds and the second is the additional nanoseconds to delay by.
// a delay of 2 seconds and 500 nanos
const website = new Website('https://choosealicense.com').withWaitForDelay(2, 500).build()
Wait For Selector
Wait for a selector on the page with a max timeout. This method does nothing if the chrome feature is not enabled.
// wait for the '.news-feed' selector with a max timeout of 2 seconds and 500 nanos
const website = new Website('https://choosealicense.com').withWaitForSelector('.news-feed', 2, 500).build()
Wait For Idle Network
Wait for the network to be idle. This method does nothing if the chrome feature is not enabled.
// wait for an idle network with a max timeout of 2 seconds and 500 nanos
const website = new Website('https://choosealicense.com').withWaitForIdleNetwork(2, 500).build()
User-Agent
Use a custom User-Agent.
const website = new Website('https://choosealicense.com').withUserAgent('mybot/v1').build()
Chrome Remote Connection
Add a Chrome remote connection URL. This can be a JSON endpoint or a direct WebSocket connection.
const website = new Website('https://choosealicense.com').withChromeConnection('http://localhost:9222/json/version').build()
OpenAI
Use OpenAI to generate dynamic scripts to use with headless Chrome. Make sure to set the OPENAI_API_KEY
environment variable.
const website = new Website('https://google.com')
.withOpenAI({
model: 'gpt-3.5-turbo',
prompt: 'Search for movies',
maxTokens: 300,
})
.build()
// make sure to crawl or scrape with the headless param set to true.
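Continuing the example above, a hedged sketch of enabling headless Chrome for the crawl; the assumption here is that the headless flag is the third argument to crawl, after the page callback and the background flag, so check the crawl section for the exact signature:
const onPageEvent = (_err, page) => {
  console.log(page)
}

// assumed signature: crawl(onPageEvent, background, headless)
await website.crawl(onPageEvent, false, true)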
Screenshots
Take screenshots of the pages on crawl when using headless Chrome.
const website = new Website('https://google.com')
.withScreenshot({
params: {
cdp_params: null,
full_page: true,
omit_background: false,
},
bytes: false,
save: true,
output_dir: null,
})
.build()
// make sure to crawl or scrape with the headless param set to true.
Request Timeout
Add a request timeout per page in milliseconds. The example shows 30 seconds.
const website = new Website('https://choosealicense.com').withRequestTimeout(30000).build()
Respect Robots
Respect the robots.txt file.
const website = new Website('https://choosealicense.com').withRespectRobotsTxt(true).build()
Http2 Prior Knowledge
Use HTTP/2 prior knowledge to connect if you know the website's server supports it.
const website = new Website('https://choosealicense.com').withHttp2PriorKnowledge(true).build()
Chrome Network Interception
Enable network interception when using Chrome to speed up requests.
const website = new Website('https://choosealicense.com').withChromeIntercept(true, true).build()
Redirect Limit
Set the redirect limit for requests.
const website = new Website('https://choosealicense.com').withRedirectLimit(2).build()
Depth Limit
Set the depth limit for how many pages deep to crawl from the start URL.
const website = new Website('https://choosealicense.com').withDepth(3).build()
Cache
Enable HTTP caching; this is useful when running the spider on a server.
const website = new Website('https://choosealicense.com').withCaching(true).build()
Redirect Policy
Set the redirect policy for requests, either Strict or Loose (the default). Strict only allows redirects that match the domain.
const website = new Website('https://choosealicense.com').withRedirectPolicy(true).build()
Chaining
You can chain all of the configuration methods together for a simple setup.
const website = new Website('https://choosealicense.com')
.withSubdomains(true)
.withTlds(true)
.withUserAgent('mybot/v1')
.withRespectRobotsTxt(true)
.build()
Raw Content
Set the second parameter of the Website constructor to true
to return content without UTF-8 decoding.
This populates rawContent
and leaves content
empty when using subscriptions or the Page object.
const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)
await website.scrape()
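A minimal sketch of reading the raw bytes from a subscription, using the rawContent field described above:
const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)

const onPageEvent = (_err, page) => {
  // raw, non-UTF-8 page content
  console.log(page.rawContent)
}

await website.crawl(onPageEvent)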
Clearing Crawl Data
Use website.clear
to remove the visited links and page data, or website.drainLinks
to drain the visited links.
const website = new Website('https://choosealicense.com')
await website.crawl()
// links found ["https://...", "..."]
console.log(website.getLinks())
website.clear()
// links will be empty
console.log(website.getLinks())
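A hedged sketch of drainLinks, assuming it returns the visited links while emptying the internal list:
const website = new Website('https://choosealicense.com')

await website.crawl()

// assumed: returns the visited links and leaves the internal list empty
const visited = website.drainLinks()
console.log(visited)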
Storing and Exporting Data
Collecting data to store can be done with website.pushData() and website.exportJsonlData().
const website = new Website('https://choosealicense.com')
const onPageEvent = (_err, page) => {
website.pushData(page)
}
await website.crawl(onPageEvent)
// uncomment to read the data.
// console.log(website.readData());
// exportJsonlData is currently the only export method. The file path is optional; by default data goes to the storage directory.
await website.exportJsonlData('./storage/test.jsonl')
Stop crawl
To stop a crawl you can use website.stopCrawl(id)
: pass in the crawl ID to stop a specific run, or leave it empty to stop all crawls.
const website = new Website('https://choosealicense.com')
const onPageEvent = (_err, page) => {
console.log(page)
// stop the concurrent crawl when 8 pages are found.
if (website.size >= 8) {
website.stop()
}
}
await website.crawl(onPageEvent)