Crawl

Crawl a website concurrently.

```python
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://rsseau.fr")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
```

Async Event

You can pass an async function or any callable as the first param to the crawl function to receive real-time updates as pages are crawled.

```python
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
```
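The callback only needs to be callable with a single `page` argument, so it can also accumulate results instead of printing them. A minimal stand-alone sketch, using a namedtuple as a stand-in for the page object spider_rs would pass (the real object exposes at least `url` and `status_code`):

```python
from collections import namedtuple

class CollectingSubscription:
    """Callable subscription that records every page it is handed."""
    def __init__(self):
        self.pages = []

    def __call__(self, page):
        # spider_rs pages expose url and status_code attributes
        self.pages.append((page.url, page.status_code))

# Stand-in for the page object spider_rs passes to the callback
Page = namedtuple("Page", ["url", "status_code"])

sub = CollectingSubscription()
sub(Page("https://choosealicense.com/", 200))
sub(Page("https://choosealicense.com/licenses/", 200))
print(sub.pages)
```

Passing an instance of such a class to `website.crawl(...)` collects results in `sub.pages` for processing after the crawl completes.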

Background

You can run the crawl in the background and receive events by setting the second param to True.

```python
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)
    # this will run instantly as the crawl is in the background

asyncio.run(main())
```
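Because a background crawl returns immediately, the process can exit before any events arrive, so keep the event loop alive while events stream in. The fire-and-forget pattern itself is plain asyncio; a sketch with a simulated crawl coroutine (not spider_rs itself):

```python
import asyncio

events = []

async def fake_crawl(callback):
    # Simulates a background crawl emitting pages over time
    for url in ("https://example.com/", "https://example.com/about"):
        await asyncio.sleep(0.01)
        callback(url)

async def main():
    # Start the crawl as a task; this line returns immediately
    task = asyncio.create_task(fake_crawl(events.append))
    print("crawl started in the background")
    # Do other work here, then wait for the crawl to finish
    await task

asyncio.run(main())
print(events)
```

With spider_rs you would similarly `await asyncio.sleep(...)` or continue other work after the `crawl(..., True)` call instead of returning right away.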

Subscriptions

You can set up multiple subscriptions to receive events when a crawl happens.

```python
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    subscription_id = website.subscribe(Subscription())
    website.crawl()
    website.unsubscribe(subscription_id)

asyncio.run(main())
```
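The id returned by subscribe is what you later hand to unsubscribe, following a standard publisher pattern. A stand-alone sketch of that id-based lifecycle (illustrative only, not spider_rs internals):

```python
import itertools

class Publisher:
    """Minimal subscribe/unsubscribe registry keyed by id."""
    def __init__(self):
        self._subs = {}
        self._ids = itertools.count()

    def subscribe(self, callback):
        sub_id = next(self._ids)
        self._subs[sub_id] = callback
        return sub_id

    def unsubscribe(self, sub_id):
        self._subs.pop(sub_id, None)

    def emit(self, page):
        for callback in list(self._subs.values()):
            callback(page)

seen = []
pub = Publisher()
sub_id = pub.subscribe(seen.append)
pub.emit("https://choosealicense.com/")
pub.unsubscribe(sub_id)
pub.emit("https://choosealicense.com/licenses/")  # no longer delivered
print(seen)
```

Subscriptions added after a crawl starts only receive events from crawls that run while they are registered, which is why the example above subscribes between the two `crawl()` calls.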

Headless Chrome

Headless Chrome rendering can be enabled by setting the third param of crawl or scrape to True. If the CHROME_URL environment variable is set, it will attempt to connect to a remote Chrome instance, falling back to launching Chrome locally. Using a remote connection via CHROME_URL can drastically speed up runs.

```python
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), False, True)

asyncio.run(main())
```
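To use a remote browser, start Chrome with remote debugging enabled and point CHROME_URL at it before running your script. The flag, port, and script name below are typical placeholders, not spider_rs requirements; adjust them to your setup:

```shell
# Start Chrome with remote debugging (typical flag and port; adjust as needed)
chrome --headless --remote-debugging-port=9222 &

# Point spider_rs at the running instance, then run your crawl script
export CHROME_URL="http://localhost:9222"
python crawl.py
```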