# Crawl
Crawl a website concurrently.
```ts
import { Website } from '@spider-rs/spider-rs'

// pass in the website url
const website = new Website('https://rsseau.fr')

await website.crawl()
// [ "https://rsseau.fr/blog", ...]
console.log(website.getLinks())
```
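Since each `Website` instance is independent, several sites can be crawled at once. A minimal sketch using placeholder URLs, assuming top-level `await` (ESM):

```ts
import { Website } from '@spider-rs/spider-rs'

// placeholder targets; swap in your own sites
const urls = ['https://rsseau.fr', 'https://example.com']

// crawl each site concurrently and gather the discovered links
const links = await Promise.all(
  urls.map(async (url) => {
    const website = new Website(url)
    await website.crawl()
    return website.getLinks()
  }),
)

console.log(links.flat())
```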
## Async Event
You can pass an async function as the first param to the crawl function to stream realtime updates.
```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

await website.crawl(onPageEvent)
```
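The callback can do more than log; for example, it can accumulate results. A minimal sketch that treats the page value as opaque, since its exact shape is not shown here:

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
const pages = []

// push every streamed page into an array as it arrives
const onPageEvent = (err, value) => {
  if (err) {
    console.error(err)
    return
  }
  pages.push(value)
}

await website.crawl(onPageEvent)
console.log(`received ${pages.length} page events`)
```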
## Background
You can run the crawl in the background and receive events by setting the second param to `true`.
```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

await website.crawl(onPageEvent, true)
// this will run instantly as the crawl is in the background
```
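Since the call returns as soon as the crawl is scheduled, you can keep doing other work while pages stream in. A minimal sketch that times only the call itself:

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

const start = Date.now()
// returns almost immediately; pages keep arriving via onPageEvent
await website.crawl(onPageEvent, true)
console.log(`crawl call returned after ${Date.now() - start}ms`)
```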
## Subscriptions
You can set up multiple subscriptions to handle events whenever a crawl runs.
```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

const subscriptionID = website.subscribe(onPageEvent)
await website.crawl()
// remove the listener once you no longer need events
website.unsubscribe(subscriptionID)
```
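Several subscriptions can listen to the same crawl, each removable by its own ID. A minimal sketch, assuming `subscribe` can be called more than once on the same instance:

```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
let pageCount = 0

// two independent listeners on the same crawl
const logId = website.subscribe((err, value) => console.log(value))
const countId = website.subscribe(() => {
  pageCount += 1
})

await website.crawl()
console.log(`saw ${pageCount} page events`)

// remove each listener by its ID
website.unsubscribe(logId)
website.unsubscribe(countId)
```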
## Headless Chrome
Headless Chrome rendering can be enabled by setting the third param of `crawl` or `scrape` to `true`. If the `CHROME_URL` env variable is set, it will attempt to connect to a remotely running Chrome, falling back to launching Chrome locally. Using a remote connection with `CHROME_URL` can drastically speed up runs.
```ts
import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (err, value) => {
  console.log(value)
}

// all params are optional. The third param enables headless rendering.
await website.crawl(onPageEvent, false, true)
// make sure to call unsubscribe when finished, or the instance is kept
// alive while event handlers are set up.
website.unsubscribe()
```
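For a remote connection, `CHROME_URL` must point at a reachable Chrome instance before the crawl starts. A minimal sketch; the debugging address is a placeholder, and setting the variable in-process rather than in the shell is an assumption:

```ts
import { Website } from '@spider-rs/spider-rs'

// placeholder address; typically set in the shell instead,
// e.g. CHROME_URL=http://localhost:9222 node crawl.mjs
process.env.CHROME_URL ??= 'http://localhost:9222'

const website = new Website('https://rsseau.fr')

// third param set to true enables headless Chrome rendering
await website.crawl(undefined, false, true)
console.log(website.getLinks())
```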