Introduction

Spider-Py is the fastest web crawler and indexer, written in Rust and ported to Python.

  • Concurrent
  • Streaming
  • Decentralization
  • Headless Chrome Rendering
  • HTTP Proxies
  • Cron Jobs
  • Subscriptions
  • Blacklisting and Budgeting Depth
  • Written in Rust for speed, safety, and simplicity

Spider powers some big tools and, with the correct setup, helps bring crawling to almost no downtime. View the spider project to learn more.

Test url: https://espn.com

libraries                 pages     speed
spider-rs(python): crawl  150,387   186s
scrapy(python): crawl     49,598    1h

The benchmarks above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.

Getting Started

Make sure you have Python installed.

pip install spider_rs

Simple Example

We use PyO3 to port the Rust project to Python.

There are some performance drawbacks from the bindings; even so, the crawls are lightning fast and efficient.

Usage

The examples below can help you get started with spider.

Basic

import asyncio

from spider_rs import Website

async def main():
    website = Website("https://jeffmendez.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())

Events

You can pass an object, which may be async, as a parameter to crawl and scrape.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
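
The handler can also be asynchronous. A minimal sketch, assuming the bindings await async callables as the note above implies:

import asyncio
from spider_rs import Website

class AsyncSubscription:
    # hypothetical async handler; assumes async callables are awaited by the bindings
    async def __call__(self, page):
        await asyncio.sleep(0)  # stand-in for real async work, e.g. writing to a database
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(AsyncSubscription())

asyncio.run(main())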

Selector

The title method allows you to extract the title of the page.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - title: " + str(page.title()))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())

Website

The Website class is the foundation of the spider.

Builder pattern

We use the builder pattern to configure the website for crawling.

Note: replace https://choosealicense.com in the examples below with your target website URL.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())

Return the links found on the page resource.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_return_page_links(True)

asyncio.run(main())

Custom Headers

Add custom HTTP headers to use when crawling/scraping.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_headers({ "authorization": "mytoken"})

asyncio.run(main())

Blacklist

Prevent crawling a set path, URL, or regex pattern.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_blacklist_url(["/blog", "/resume"])

asyncio.run(main())
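
The blacklist also accepts regex-style patterns. A sketch, assuming the patterns follow the underlying Rust regex syntax (the pattern below is purely illustrative):

import asyncio
from spider_rs import Website

async def main():
    # skip the /blog section and any URL ending in .pdf (illustrative patterns)
    website = Website("https://choosealicense.com").with_blacklist_url(["/blog", ".*\\.pdf$"])
    website.crawl()
    print(website.get_links())

asyncio.run(main())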

Whitelist

Only crawl set paths, URLs, or regex patterns.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_whitelist_url(["/licenses"])

asyncio.run(main())

Crons

Set up a cron job that can run in the background at any time using cron syntax.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_cron("1/5 * * * * *")

asyncio.run(main())

View the cron section for details on how to use the cron job.

Budget

Add a crawl budget that limits the number of pages crawled.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_budget({
    "*": 1,
  })

asyncio.run(main())
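
The budget maps path patterns to page limits, with "*" acting as the catch-all. A sketch with per-path limits, assuming path keys are accepted alongside the wildcard (the paths and numbers are illustrative):

import asyncio
from spider_rs import Website

async def main():
    # cap the whole crawl at 100 pages and the /licenses section at 10 (illustrative values)
    website = Website("https://choosealicense.com").with_budget({
        "*": 100,
        "/licenses": 10,
    })
    website.crawl()
    print(website.get_links())

asyncio.run(main())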

Subdomains

Include subdomains in the request.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_subdomains(True)

asyncio.run(main())

TLD

Include TLDs in the request.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_tld(True)

asyncio.run(main())

External Domains

Add external domains to include with the website.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_external_domains(["https://www.myotherdomain.com"])

asyncio.run(main())

Proxy

Use a proxy to crawl a website.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_proxies(["https://www.myproxy.com"])

asyncio.run(main())

Depth Limit

Set the depth limit for the number of forward pages.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_depth(3)

asyncio.run(main())

Cache

Enable HTTP caching; this is useful when running the spider on a server.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_caching(True)

asyncio.run(main())

Delays

Add delays between pages. Defaults to none.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_delays(200)

asyncio.run(main())

User-Agent

Use a custom User-Agent.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_user_agent("mybot/v1")

asyncio.run(main())

Request Timeout

Add a request timeout per page in milliseconds. The example shows 30 seconds.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_request_timeout(30000)

asyncio.run(main())

Wait For Idle Network

You can wait for the network to become idle when using Chrome. This helps load all the data from client-side scripts. The first param enables or disables the wait, and the second is the maximum timeout in milliseconds.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_wait_for_idle_network(True, 12000)

asyncio.run(main())

Respect Robots

Respect the robots.txt file.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_respect_robots_txt(True)

asyncio.run(main())

Collect Full Resources

Collect all resources found, not just valid web pages.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_full_resources(True)

asyncio.run(main())

OpenAI

Use OpenAI to generate dynamic scripts to use with headless Chrome. Make sure to set the OPENAI_API_KEY env variable.

import asyncio
from spider_rs import Website

async def main():
    website = (
        Website("https://google.com")
        .with_openai({
            "model": "gpt-3.5-turbo",
            "prompt": "Search for movies",
            "maxTokens": 300
        })
    )

asyncio.run(main())
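
Since the key is read from the environment, one way to provide it is from Python before the crawl starts. A sketch; exporting OPENAI_API_KEY in the shell works just as well, and the headless flag is assumed here because the generated scripts target headless Chrome:

import asyncio
import os

from spider_rs import Website

# the key must be set before the crawl starts; replace the placeholder with your real key
os.environ["OPENAI_API_KEY"] = "sk-..."

async def main():
    website = (
        Website("https://google.com")
        .with_openai({
            "model": "gpt-3.5-turbo",
            "prompt": "Search for movies",
            "maxTokens": 300
        })
    )
    website.crawl(None, False, True)  # third param True enables headless Chrome

asyncio.run(main())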

Screenshots

Take a screenshot of the pages during the crawl when using headless Chrome.

import asyncio
from spider_rs import Website

async def main():
    website = (
        Website("https://choosealicense.com", False)
        .with_screenshot({
            "params": {
                "cdp_params": None,
                "full_page": True,
                "omit_background": False
            },
            "bytes": False,
            "save": True,
            "output_dir": None
        })
    )

asyncio.run(main())

Http2 Prior Knowledge

Use HTTP/2 prior knowledge to connect if you know the website's server supports it.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_http2_prior_knowledge(True)

asyncio.run(main())

Chaining

You can chain all of the configs together for simple configuration.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_subdomains(true).with_tlds(true).with_user_agent("mybot/v1").with_respect_robots_txt(true)

asyncio.run(main())

Raw Content

Set the second param of the Website constructor to True to return content without UTF-8 conversion. This will return raw_content and leave content unset when using subscriptions or the Page object.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com", True)
    website.scrape()

asyncio.run(main())

Clearing Crawl Data

Use website.clear to remove the visited links and page data, or website.drain_links to drain the visited links.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())
    website.clear()
    print(website.get_links())

asyncio.run(main())
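
A sketch of drain_links; that it returns the drained links is an assumption based on the name:

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    drained = website.drain_links()  # assumed to return the links it drains
    print(drained)
    print(website.get_links())  # the stored links are now empty

asyncio.run(main())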

Stop crawl

To stop a crawl you can use website.stop; pass in the crawl ID to stop a specific run, or call it without arguments to stop all crawls.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())
    # sleep for 2s and stop etc
    website.stop()

asyncio.run(main())
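
To make the comment above concrete, a minimal sketch that runs the crawl in the background (second param True, as in the Background section), lets it work for two seconds, and then stops it:

import asyncio
from spider_rs import Website

class Subscription:
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)  # run in the background
    await asyncio.sleep(2)               # let it crawl for a couple of seconds
    website.stop()                       # stop the background run

asyncio.run(main())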

Page

A single page on a website, useful if you only need the root URL.

New Page

Get a new page with content.

The first param is the URL, the second is whether subdomains should be included, and the third is whether to include TLDs in links.

Calling page.fetch is needed to get the content.

import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()

asyncio.run(main())
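
Following the parameter order described above, a page that also includes subdomains and TLDs in its links can be constructed like this (a sketch; both flags are optional):

import asyncio
from spider_rs import Page

async def main():
    # url, include subdomains, include TLDs (per the parameter order above)
    page = Page("https://choosealicense.com", True, True)
    page.fetch()
    print(page.get_links())

asyncio.run(main())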

Get all the links related to a page.

import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    links = page.get_links()
    print(links)

asyncio.run(main())

Page Html

Get the markup (HTML) of the page.

import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    html = page.get_html()
    print(html)

asyncio.run(main())

Page Bytes

Get the raw bytes of a page to store the files in a database.

import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    content = page.get_bytes()
    print(content)

asyncio.run(main())

Environment

Environment variables to adjust the project.

CHROME_URL

You can set the chrome URL to connect remotely.

CHROME_URL=http://localhost:9222
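
You can also set the variable from Python before the crawl starts. A sketch; exporting CHROME_URL in the shell works just as well:

import asyncio
import os

from spider_rs import Website

# must be set before the crawl starts so the bindings can pick it up
os.environ["CHROME_URL"] = "http://localhost:9222"

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(None, False, True)  # third param True enables headless Chrome
    print(website.get_links())

asyncio.run(main())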

Crawl

Crawl a website concurrently.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://rsseau.fr")
    website.crawl()
    print(website.get_links())

asyncio.run(main())

Async Event

You can pass an async function as the first param to the crawl function for streamed real-time updates.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())

Background

You can run the request in the background and receive events with the second param set to True.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)
    # this will run instantly as the crawl is in the background

asyncio.run(main())

Subscriptions

You can set up many subscriptions to run events when a crawl happens.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    subscription_id = website.subscribe(Subscription())
    website.crawl()
    website.unsubscribe(subscription_id)

asyncio.run(main())

Headless Chrome

Headless Chrome rendering can be done by setting the third param of crawl or scrape to True. It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), False, True)

asyncio.run(main())

Scrape

Scrape a website and collect the resource data.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape()
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())

Headless Chrome

Headless Chrome rendering can be done by setting the third param of crawl or scrape to True. It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape(None, None, True)
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())

Cron Jobs

Use a cron job that can run any time of day to gather website data.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Cron Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com").with_cron("1/5 * * * * *").build()
    handle = await website.run_cron(Subscription())

asyncio.run(main())
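
Because run_cron schedules work in the background, the sketch below simply keeps the event loop alive long enough for the cron to fire a few times (the sleep duration is arbitrary):

import asyncio
from spider_rs import Website

class Subscription:
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    # fires every 5 seconds per the cron expression
    website = Website("https://choosealicense.com").with_cron("1/5 * * * * *").build()
    handle = await website.run_cron(Subscription())
    await asyncio.sleep(30)  # keep the loop alive so the cron can run

asyncio.run(main())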

Storing Data

Storing data can be done to collect the raw content for a website.

This allows you to upload and download the content without UTF-8 conversion. The property only appears when setting the second param of the Website class constructor to True.

import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - bytes: " + str(page.raw_content))
        # do something with page.raw_content

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)

Benchmarks

Test url: https://espn.com (Mac M1, 10-core CPU, 64 GB of RAM)

libraries               pages     speed
spider(rust): crawl     150,387   1m
spider(nodejs): crawl   150,387   153s
spider(python): crawl   150,387   186s
scrapy(python): crawl   49,598    1h
crawlee(nodejs): crawl  18,779    30m

View the latest runs on GitHub.

Linux: 2-core CPU, 7 GB of RAM

Test url: https://choosealicense.com (small) 32 pages

libraries                    speed
spider-rs: crawl 10 samples  76ms
scrapy: crawl 10 samples     2s

Test url: https://rsseau.fr (medium) 211 pages

libraries                    speed
spider-rs: crawl 10 samples  3s
scrapy: crawl 10 samples     8s

Mac Apple M1 Max: 10-core CPU, 64 GB of RAM

Test url: https://choosealicense.com (small) 32 pages

libraries                    speed
spider-rs: crawl 10 samples  286ms
scrapy: crawl 10 samples     2.5s

Test url: https://rsseau.fr (medium) 211 pages

libraries                    speed
spider-rs: crawl 10 samples  2.5s
scrapy: crawl 10 samples     10s

Test url: https://a11ywatch.com (medium) 648 pages

libraries                    speed
spider-rs: crawl 10 samples  2s
scrapy: crawl 10 samples     7.7s

Test url: https://espn.com (large) 150,387 pages

libraries                    pages     speed
spider-rs: crawl 10 samples  150,387   186s
scrapy: crawl 10 samples     49,598    1h

Scrapy used too much memory; the crawl was cancelled after an hour.

Note: performance scales with the size of the website and whether throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.