Introduction
Spider-RS is the fastest web crawler and indexer written in Rust, ported to Node.js.
- Concurrent
- Streaming
- Decentralization
- Headless Chrome Rendering
- HTTP Proxies
- Cron Jobs
- Subscriptions
- Blacklisting and Budgeting Depth
- Written in Rust for speed, safety, and simplicity
Spider powers some big tools and, with the correct setup, helps keep crawling running with almost no downtime. View the spider project to learn more.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
await website.crawl()
console.log(website.getLinks())
Getting Started
Make sure you have Node.js v10 or higher installed.
Install the package with your favorite package manager.
yarn add @spider-rs/spider-rs
# or
npm install @spider-rs/spider-rs
A simple example
We use a native Node.js addon built with napi to port the Rust project to Node.js.
There are some performance drawbacks from the addon, but the crawls are still lightning fast and efficient.
Usage
The examples below can help you get started with spider.
Basic
A basic example.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
await website.crawl()
console.log(website.getLinks())
Events
You can pass a function (which may be async) as a param to crawl and scrape.
import { Website, type NPage } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
const links: NPage[] = []
const onPageEvent = async (err: Error | null, page: NPage) => {
links.push(page)
}
await website.crawl(onPageEvent)
console.log(website.getLinks())
Selector
The pageTitle function allows you to extract the title of the page.
import { Website, pageTitle } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
const links = []
const onPageEvent = async (err, page) => {
links.push({ title: pageTitle(page), url: page.url })
}
// params in order event, background, and headless chrome
await website.crawl(onPageEvent)
Shortcut
You can use the crawl shortcut method to collect contents quickly without configuration.
import { crawl } from '@spider-rs/spider-rs'
const { links, pages } = await crawl('https://choosealicense.com')
console.log([links, pages])
Website
The Website class is the foundation of the spider.
Builder pattern
We use the builder pattern to configure the website for crawling.
*note: Replace https://choosealicense.com in the examples below with your target website URL.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com')
Return Page Links
Return links found on the page resource.
const website = new Website('https://choosealicense.com')
.with_return_page_links(true)
.build()
Custom Headers
Add custom HTTP headers to use when crawling/scraping.
const website = new Website('https://choosealicense.com')
.withHeaders({
authorization: 'somerandomjwt',
})
.build()
Blacklist
Prevent crawling a set path, URL, or pattern with regex.
const website = new Website('https://choosealicense.com')
.withBlacklistUrl(['/blog', new RegExp('/books').source, '/resume'])
.build()
Whitelist
Only crawl set paths, URLs, or patterns with regex.
const website = new Website('https://choosealicense.com')
.withWhitelistUrl(['/blog', new RegExp('/books').source, '/resume'])
.build()
Crons
Set up a cron job that can run at any time in the background using cron syntax.
const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()
View the cron section for details on how to use the cron.
Budget
Add a crawl budget that limits the number of pages crawled.
const website = new Website('https://choosealicense.com')
.withBudget({
'*': 1,
})
.build()
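The budget maps path patterns to page limits. A hedged sketch combining a site-wide cap with a tighter per-path cap (the specific values and override behavior here are illustrative assumptions):
const website = new Website('https://choosealicense.com')
  .withBudget({
    '*': 200, // assumed site-wide page cap
    '/blog': 10, // assumed tighter cap for pages under /blog
  })
  .build()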
Subdomains
Include subdomains in the crawl.
const website = new Website('https://choosealicense.com').withSubdomains(true).build()
TLD
Include TLDs in the crawl.
const website = new Website('https://choosealicense.com').withTlds(true).build()
External Domains
Add external domains to include with the website.
const website = new Website('https://choosealicense.com').withExternalDomains(['https://www.myotherdomain.com']).build()
Proxy
Use a proxy to crawl a website.
const website = new Website('https://choosealicense.com').withProxies(['https://www.myproxy.com']).build()
Delays
Add delays between pages in milliseconds. Defaults to none.
const website = new Website('https://choosealicense.com').withDelays(200).build()
Wait_For_Delay
Wait for a delay on the page. This should only be used for testing. This method does nothing if the chrome feature is not enabled.
The first param is the seconds to delay and the second is the nanoseconds to delay by.
// a delay of 2 seconds and 500 nanoseconds
const website = new Website('https://choosealicense.com').with_wait_for_delay(2, 500).build()
Wait_For_Selector
Wait for a selector on the page with a max timeout. This method does nothing if the chrome feature is not enabled.
// wait for the '.news-feed' selector with a max timeout of 2 seconds and 500 nanoseconds
const website = new Website('https://choosealicense.com').with_wait_for_selector('.news-feed', 2, 500).build()
Wait_For_Idle_Network
Wait for the network to be idle with a max timeout. This method does nothing if the chrome feature is not enabled.
// a max timeout of 2 seconds and 500 nanoseconds
const website = new Website('https://choosealicense.com').with_wait_for_idle_network(2, 500).build()
User-Agent
Use a custom User-Agent.
const website = new Website('https://choosealicense.com').withUserAgent('mybot/v1').build()
Chrome Remote Connection
Add a Chrome remote connection URL. This can be a JSON endpoint or a direct WebSocket connection.
const website = new Website('https://choosealicense.com').with_chrome_connection("http://localhost:9222/json/version").build()
OpenAI
Use OpenAI to generate dynamic scripts to use with headless Chrome. Make sure to set the OPENAI_API_KEY env variable.
const website = new Website('https://google.com')
.withOpenAI({
model: 'gpt-3.5-turbo',
prompt: 'Search for movies',
maxTokens: 300,
})
.build()
// make sure to crawl or scrape with the headless param set to true.
Screenshots
Take screenshots of the pages during the crawl when using headless Chrome.
const website = new Website('https://google.com')
.withScreenshot({
params: {
cdp_params: null,
full_page: true,
omit_background: false,
},
bytes: false,
save: true,
output_dir: null,
})
.build()
// make sure to crawl or scrape with the headless param set to true.
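For example, a crawl with the headless param enabled so the screenshots are taken (params in order are event, background, and headless, matching the Headless Chrome section later on):
const onPageEvent = (err, page) => {
  console.log(page.url)
}
// the third param set to true enables headless Chrome rendering
await website.crawl(onPageEvent, false, true)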
Request Timeout
Add a request timeout per page in milliseconds. The example shows 30 seconds.
const website = new Website('https://choosealicense.com').withRequestTimeout(30000).build()
Respect Robots
Respect the robots.txt file.
const website = new Website('https://choosealicense.com').withRespectRobotsTxt(true).build()
Http2 Prior Knowledge
Use HTTP/2 prior knowledge to connect if you know the website's server supports it.
const website = new Website('https://choosealicense.com').withHttp2PriorKnowledge(true).build()
Chrome Network Interception
Enable network interception when using Chrome to speed up requests.
const website = new Website('https://choosealicense.com').withChromeIntercept(true, true).build()
Redirect Limit
Set the redirect limit for requests.
const website = new Website('https://choosealicense.com').withRedirectLimit(2).build()
Depth Limit
Set the depth limit for the number of forward pages crawled.
const website = new Website('https://choosealicense.com').withDepth(3).build()
Cache
Enable HTTP caching; this is useful when running the spider on a server.
const website = new Website('https://choosealicense.com').withCaching(true).build()
Redirect Policy
Set the redirect policy for requests, either strict or loose (the default). Strict only allows redirects that match the domain.
const website = new Website('https://choosealicense.com').withRedirectPolicy(true).build()
Chaining
You can chain all of the configs together for simple configuration.
const website = new Website('https://choosealicense.com')
.withSubdomains(true)
.withTlds(true)
.withUserAgent('mybot/v1')
.withRespectRobotsTxt(true)
.build()
Raw Content
Set the second param of the Website constructor to true to return content without UTF-8 conversion.
This will return rawContent and leave content unset when using subscriptions or the Page object.
const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)
await website.scrape()
Clearing Crawl Data
Use website.clear to remove the visited links and page data, or website.drainLinks to drain the visited links (a sketch of drainLinks follows the example below).
const website = new Website('https://choosealicense.com')
await website.crawl()
// links found ["https://...", "..."]
console.log(website.getLinks())
website.clear()
// links will be empty
console.log(website.getLinks())
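A minimal sketch of website.drainLinks, assuming it returns the visited links while emptying them from the instance:
// assumption: drainLinks() returns the visited links and clears them internally
const drained = website.drainLinks()
console.log(drained)
// the instance now reports no visited links
console.log(website.getLinks())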
Storing and Exporting Data
Collecting data to store can be done with website.pushData() and website.exportJsonlData().
const website = new Website('https://choosealicense.com')
const onPageEvent = (_err, page) => {
website.pushData(page)
}
await website.crawl(onPageEvent)
// uncomment to read the data.
// console.log(website.readData());
// currently the only export method. The file path is optional; by default all data goes to the storage directory.
await website.exportJsonlData('./storage/test.jsonl')
Stop crawl
To stop a crawl you can use website.stop(id), as shown below: pass in the crawl ID to stop a specific run, or leave it empty to stop all crawls.
const website = new Website('https://choosealicense.com')
const onPageEvent = (_err, page) => {
console.log(page)
// stop the concurrent crawl when 8 pages are found.
if (website.size >= 8) {
website.stop()
}
}
await website.crawl(onPageEvent)
Page
A single page on a website, useful if you need just the root url.
New Page
Get a new page with content.
The first param is the URL, the second is whether to include subdomains, and the third is whether to include TLDs in links.
Calling page.fetch is required to get the content.
import { Page } from '@spider-rs/spider-rs'
const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
Page Links
Get all the links related to a page.
const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const links = await page.getLinks()
console.log(links)
Page Html
Get the HTML markup for the page.
const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const html = page.getHtml()
console.log(html)
Page Bytes
Get the raw bytes of a page, for example to store the file in a database.
const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
const bytes = page.getBytes()
console.log(bytes)
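For instance, you could write the bytes to disk with Node's fs module before loading them into a database (treating the return value as something writeFile accepts is an assumption here):
import { writeFile } from 'node:fs/promises'
import { Page } from '@spider-rs/spider-rs'
const page = new Page('https://choosealicense.com', false, false)
await page.fetch()
// assumption: getBytes() returns a Buffer/Uint8Array that writeFile accepts
await writeFile('./choosealicense.bin', page.getBytes())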
Environment
Environment variables to adjust the project.
CHROME_URL
You can set the Chrome URL to connect to a remote instance.
CHROME_URL=http://localhost:9222
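A small illustrative sketch: with CHROME_URL exported, a headless crawl connects to the remote instance instead of launching Chrome locally (the env check below is only for demonstration):
import { Website } from '@spider-rs/spider-rs'
// illustrative: confirm the remote endpoint is configured
console.log('Remote Chrome endpoint:', process.env.CHROME_URL)
const website = new Website('https://choosealicense.com')
const onPageEvent = (err, page) => {
  console.log(page.url)
}
// the third param enables headless rendering, which uses CHROME_URL when set
await website.crawl(onPageEvent, false, true)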
Crawl
Crawl a website concurrently.
import { Website } from '@spider-rs/spider-rs'
// pass in the website url
const website = new Website('https://rsseau.fr')
await website.crawl()
// [ "https://rsseau.fr/blog", ...]
console.log(website.getLinks())
Async Event
You can pass an async function as the first param to the crawl function to stream real-time updates.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://rsseau.fr')
const onPageEvent = (err, value) => {
console.log(value)
}
await website.crawl(onPageEvent)
Background
You can run the request in the background and receive events with the second param set to true.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://rsseau.fr')
const onPageEvent = (err, value) => {
console.log(value)
}
await website.crawl(onPageEvent, true)
// this will run instantly as the crawl is in the background
Subscriptions
You can set up multiple subscriptions to run events when a crawl happens.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://rsseau.fr')
const onPageEvent = (err, value) => {
console.log(value)
}
const subscriptionID = website.subscribe(onPageEvent)
await website.crawl()
website.unsubscribe(subscriptionID)
Headless Chrome
Headless Chrome rendering can be enabled by setting the third param in crawl or scrape to true.
It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://rsseau.fr')
const onPageEvent = (err, value) => {
console.log(value)
}
// all params are optional. The third param determines headless rendering.
await website.crawl(onPageEvent, false, true)
// make sure to call unsubscribe when finished, or else the instance is kept alive when events are set up.
website.unsubscribe()
Scrape
Scrape a website and collect the resource data.
import { Website } from '@spider-rs/spider-rs'
// pass in the website url
const website = new Website('https://rsseau.fr')
await website.scrape()
// [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]
console.log(website.getPages())
Headless Chrome
Headless Chrome rendering can be enabled by setting the third param in crawl or scrape to true.
It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://rsseau.fr')
const onPageEvent = (err, value) => {
console.log(value)
}
// all params are optional. The third param determines headless rendering.
await website.scrape(onPageEvent, false, true)
Cron Jobs
Use a cron job that can run any time of day to gather website data.
import { Website } from '@spider-rs/spider-rs'
const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *').build()
// stream the pages of the website when the cron runs.
const onPageEvent = (err, value) => {
console.log(value)
}
const handle = await website.runCron(onPageEvent)
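The returned handle can be used to stop the scheduled runs later; the stop call below mirrors the Rust API and is an assumption for the Node binding:
// assumption: the cron handle exposes stop() to end the scheduled cron runs
await handle.stop()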
Storing Data
Storing data can be done to collect the raw content for a website. This allows you to upload and download the content without UTF-8 conversion. The rawContent property only appears when the second param of the Website class constructor is set to true.
import { Website, type NPage } from '@spider-rs/spider-rs'
const rawContent = true
const website = new Website('https://choosealicense.com', rawContent)
const links: Buffer[] = []
const onPageEvent = (_err: Error | null, page: NPage) => {
  if (page.rawContent) {
    // we can download or store the content to disk now.
    links.push(page.rawContent)
  }
}
await website.crawl(onPageEvent)
Benchmarks
Test url: https://espn.com
Mac M1, 64 GB RAM, 10-core CPU
| libraries | pages | speed |
|---|---|---|
| spider(rust): crawl | 150,387 | 1m |
| spider(nodejs): crawl | 150,387 | 153s |
| spider(python): crawl | 150,387 | 186s |
| scrapy(python): crawl | 49,598 | 1h |
| crawlee(nodejs): crawl | 18,779 | 30m |
View the latest runs on GitHub.
-----------------------
Linux
2-core CPU
7 GB of RAM
-----------------------
Test url: https://choosealicense.com (small), 32 pages
| libraries | speed |
|---|---|
| spider-rs: crawl 10 samples | 76ms |
| crawlee: crawl 10 samples | 1s |
Test url: https://rsseau.fr (medium), 211 pages
| libraries | speed |
|---|---|
| spider-rs: crawl 10 samples | 0.5s |
| crawlee: crawl 10 samples | 72s |
----------------------
Mac Apple M1 Max
10-core CPU
64 GB of RAM
-----------------------
Test url: https://choosealicense.com (small), 32 pages
| libraries | speed |
|---|---|
| spider-rs: crawl 10 samples | 286ms |
| crawlee: crawl 10 samples | 1.7s |
Test url: https://rsseau.fr (medium), 211 pages
| libraries | speed |
|---|---|
| spider-rs: crawl 10 samples | 2.5s |
| crawlee: crawl 10 samples | 75s |
Performance scales with the size of the website and whether throttling is needed. The Linux benchmarks are about 10x faster than macOS for spider-rs.