Introduction
Spider-Py is the fastest web crawler and indexer, written in Rust and ported to Python.
- Concurrent
- Streaming
- Decentralization
- Headless Chrome Rendering
- HTTP Proxies
- Cron Jobs
- Subscriptions
- Blacklisting and Budgeting Depth
- Written in Rust for speed, safety, and simplicity
Spider powers some big tools and, with the correct setup, keeps crawling downtime close to zero. View the spider project to learn more.
Test url: https://espn.com
libraries | pages | speed |
---|---|---|
spider-rs(python): crawl | 150,387 | 186s |
scrapy(python): crawl | 49,598 | 1h |
The benches above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.
Getting Started
Make sure you have Python installed.
pip install spider_rs
Simple Example
We use PyO3 to port the Rust project to Python.
There are some performance drawbacks from the binding, but the crawls are still lightning fast and efficient.
Usage
The examples below can help get started with spider.
Basic
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://jeffmendez.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
Events
You can pass an object (which can be async) as a param to crawl and scrape.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
Selector
The title method allows you to extract the title of the page.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - title: " + str(page.title()))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
Website
The Website class is the foundation of the spider.
Builder pattern
We use the builder pattern to configure the website for crawling.
*note: Replace https://choosealicense.com in the examples below with your target website URL.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
Return Page Links
Return links found on the page resource.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_return_page_links(True)

asyncio.run(main())
Custom Headers
Add custom HTTP headers to use when crawling/scraping.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_headers({ "authorization": "mytoken" })

asyncio.run(main())
Blacklist
Prevent crawling a set of paths, URLs, or regex patterns.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_blacklist_url(["/blog", "/resume"])

asyncio.run(main())
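Since the blacklist also accepts patterns, here is a minimal sketch mixing a plain path fragment with a regex-style entry. The pagination pattern is hypothetical; the exact pattern syntax follows the underlying spider crate.

import asyncio
from spider_rs import Website

async def main():
    # assumption: regex-style patterns can be passed as plain strings alongside path fragments
    website = Website("https://choosealicense.com").with_blacklist_url([
        "/blog",              # skip a path
        ".*\\?page=[0-9]+",   # skip paginated URLs (hypothetical pattern)
    ])
    website.crawl()
    print(website.get_links())

asyncio.run(main())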
Whitelist
Only crawl a set of paths, URLs, or regex patterns.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_whitelist_url(["/licenses"])

asyncio.run(main())
Crons
Set up a cron job that can run in the background at any time using cron syntax.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_cron("1/5 * * * * *")

asyncio.run(main())
View the cron section for details on how to use the cron.
Budget
Add a crawl budget that limits the number of pages crawled.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_budget({
        "*": 1,
    })

asyncio.run(main())
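Budgets can also be set per path. A minimal sketch, assuming the keys are path prefixes and "*" acts as the site-wide cap, as in the main spider project; the "/licenses" limit is an illustrative value.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_budget({
        "*": 100,         # site-wide cap (assumption: applies to all paths)
        "/licenses": 10,  # tighter cap for a single section (illustrative)
    })
    website.crawl()
    print(website.get_links())

asyncio.run(main())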
Subdomains
Include subdomains in the request.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_subdomains(True)

asyncio.run(main())
TLD
Include TLDs in the request.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_tld(True)

asyncio.run(main())
Chrome Remote Connection
Add a Chrome remote connection URL. This can be a JSON endpoint or a direct WebSocket connection.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_chrome_connection("http://localhost:9222/json/version")

asyncio.run(main())
External Domains
Add external domains to include with the website.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_external_domains(["https://www.myotherdomain.com"])

asyncio.run(main())
Proxy
Use a proxy to crawl a website.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_proxies(["https://www.myproxy.com"])

asyncio.run(main())
Depth Limit
Set the depth limit for how many pages forward the crawl can go.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_depth(3)

asyncio.run(main())
Cache
Enable HTTP caching; this is useful when running the spider on a server.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_caching(True)

asyncio.run(main())
Delays
Add delays between pages in milliseconds. Defaults to none.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_delays(200)

asyncio.run(main())
User-Agent
Use a custom User-Agent.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_user_agent("mybot/v1")

asyncio.run(main())
Request Timeout
Add a request timeout per page in milliseconds. The example shows 30 seconds.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_request_timeout(30000)

asyncio.run(main())
Wait For Idle Network
You can wait for the network to become idle when using Chrome. This helps load all the data from client-side scripts. The first param enables the behavior and the second is the maximum timeout in milliseconds.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_wait_for_idle_network(True, 12000)

asyncio.run(main())
Respect Robots
Respect the robots.txt file.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_respect_robots_txt(True)

asyncio.run(main())
Collect Full Resources
Collect all resources found, not just valid web pages.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_full_resources(True)

asyncio.run(main())
OpenAI
Use OpenAI to generate dynamic scripts to use with headless Chrome. Make sure to set the OPENAI_API_KEY env variable.
import asyncio
from spider_rs import Website

async def main():
    website = (
        Website("https://google.com")
        .with_openai({
            "model": "gpt-3.5-turbo",
            "prompt": "Search for movies",
            "maxTokens": 300
        })
    )

asyncio.run(main())
Screenshots
Take a screenshot of the pages on crawl when using headless chrome.
import asyncio
from spider_rs import Website

async def main():
    website = (
        Website("https://choosealicense.com", False)
        .with_screenshot({
            "params": {
                "cdp_params": None,
                "full_page": True,
                "omit_background": False
            },
            "bytes": False,
            "save": True,
            "output_dir": None
        })
    )

asyncio.run(main())
Http2 Prior Knowledge
Use HTTP/2 prior knowledge to connect if you know the website's server supports it.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_http2_prior_knowledge(True)

asyncio.run(main())
Preserve Host
Preserve the HOST HTTP header.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com").with_preserve_host_header(True)

asyncio.run(main())
Chaining
You can chain all of the configs together for simple configuration.
import asyncio
from spider_rs import Website

async def main():
    website = (
        Website("https://choosealicense.com")
        .with_subdomains(True)
        .with_tld(True)
        .with_user_agent("mybot/v1")
        .with_respect_robots_txt(True)
    )

asyncio.run(main())
Raw Content
Set the second param of the Website constructor to True to return content without UTF-8 conversion. This will populate raw_content and leave content unset when using subscriptions or the Page object.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com", True)
    website.scrape()

asyncio.run(main())
Clearing Crawl Data
Use website.clear to remove the visited links and page data, or website.drain_links to drain the visited links.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    print(website.get_links())
    website.clear()
    print(website.get_links())

asyncio.run(main())
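For completeness, a minimal sketch of website.drain_links, assuming it empties the visited set and returns the drained links as described above.

import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    # assumption: drain_links empties the visited set and returns the links
    drained = website.drain_links()
    print(drained)
    print(website.get_links())  # expected to be empty after draining

asyncio.run(main())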
Stop crawl
To stop a crawl you can use website.stop(). Pass in the crawl ID to stop a specific run, or leave it empty to stop all crawls.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    # run in the background (second param) so the crawl can be stopped mid-run
    website.crawl(Subscription(), True)
    # sleep for 2s and stop
    await asyncio.sleep(2)
    website.stop()

asyncio.run(main())
Page
A single page on a website, useful if you only need the root URL.
New Page
Get a new page with content. The first param is the URL, the second is whether subdomains should be included, and the third is whether to include TLDs in links. Calling page.fetch is needed to get the content.
import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()

asyncio.run(main())
Page Links
Get all the links related to a page.
import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    links = page.get_links()
    print(links)

asyncio.run(main())
Page Html
Get the HTML markup for the page.
import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    html = page.get_html()
    print(html)

asyncio.run(main())
Page Bytes
Get the raw bytes of a page to store the files in a database.
import asyncio
from spider_rs import Page

async def main():
    page = Page("https://choosealicense.com")
    page.fetch()
    page_bytes = page.get_bytes()
    print(page_bytes)

asyncio.run(main())
Environment
Env variables to adjust the project.
CHROME_URL
You can set the chrome URL to connect remotely.
CHROME_URL=http://localhost:9222
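If you prefer to configure this from Python rather than the shell, here is a minimal sketch that sets the variable before crawling with headless Chrome, assuming CHROME_URL is read when the crawl starts.

import asyncio
import os
from spider_rs import Website

async def main():
    # assumption: CHROME_URL is picked up by the crawler at crawl time
    os.environ["CHROME_URL"] = "http://localhost:9222"
    website = Website("https://choosealicense.com")
    # no subscription, not in the background, third param enables headless Chrome
    website.crawl(None, False, True)
    print(website.get_links())

asyncio.run(main())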
Crawl
Crawl a website concurrently.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://rsseau.fr")
    website.crawl()
    print(website.get_links())

asyncio.run(main())
Async Event
You can pass an async function as the first param to the crawl function for real-time streamed updates.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription())

asyncio.run(main())
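Since the handler can be async, here is a minimal sketch using an async __call__, assuming coroutine handlers are awaited per page as the prose above describes.

import asyncio
from spider_rs import Website

class AsyncSubscription:
    def __init__(self):
        print("Subscription Created...")
    async def __call__(self, page):
        # assumption: coroutine handlers are awaited for each page event
        await asyncio.sleep(0)  # stand-in for real async work (db write, queue push, etc.)
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(AsyncSubscription())

asyncio.run(main())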
Background
You can run the request in the background and receive events with the second param set to True.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), True)
    # this will run instantly as the crawl is in the background

asyncio.run(main())
Subscriptions
You can set up many subscriptions to run events when a crawl happens.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl()
    subscription_id = website.subscribe(Subscription())
    website.crawl()
    website.unsubscribe(subscription_id)

asyncio.run(main())
Headless Chrome
Headless Chrome rendering can be enabled by setting the third param in crawl or scrape to True. It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com")
    website.crawl(Subscription(), False, True)

asyncio.run(main())
Scrape
Scrape a website and collect the resource data.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape()
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())
Headless Chrome
Headless Chrome rendering can be enabled by setting the third param in crawl or scrape to True. It will attempt to connect to Chrome running remotely if the CHROME_URL env variable is set, falling back to launching Chrome locally. Using a remote connection with CHROME_URL will drastically speed up runs.
import asyncio
from spider_rs import Website

async def main():
    website = Website("https://choosealicense.com")
    website.scrape(None, None, True)
    print(website.get_pages())
    # [ { url: "https://rsseau.fr/blog", html: "<html>...</html>"}, ...]

asyncio.run(main())
Cron Jobs
Use a cron job that can run any time of day to gather website data.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Cron Created...")
    def __call__(self, page):
        print(page.url + " - status: " + str(page.status_code))

async def main():
    website = Website("https://choosealicense.com").with_cron("1/5 * * * * *").build()
    handle = await website.run_cron(Subscription())

asyncio.run(main())
Storing Data
Storing data can be done to collect the raw content for a website. This allows you to upload and download the content without UTF-8 conversion. The raw_content property only appears when the second param of the Website class constructor is set to True.
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - bytes: " + str(page.raw_content))
        # do something with page.raw_content

async def main():
    # the second constructor param enables raw content (see note above)
    website = Website("https://choosealicense.com", True)
    website.crawl(Subscription())

asyncio.run(main())
Benchmarks
Test url: https://espn.com
Mac M1, 10-core CPU, 64 GB of RAM
libraries | pages | speed |
---|---|---|
spider(rust): crawl | 150,387 | 1m |
spider(nodejs): crawl | 150,387 | 153s |
spider(python): crawl | 150,387 | 186s |
scrapy(python): crawl | 49,598 | 1h |
crawlee(nodejs): crawl | 18,779 | 30m |
View the latest runs on GitHub.
Linux
- 2-core CPU
- 7 GB of RAM

Test url: https://choosealicense.com (small, 32 pages)
libraries | speed |
---|---|
spider-rs: crawl 10 samples | 76ms |
scrapy: crawl 10 samples | 2s |
Test url: https://rsseau.fr (medium, 211 pages)
libraries | speed |
---|---|
spider-rs: crawl 10 samples | 3s |
scrapy: crawl 10 samples | 8s |
Mac Apple M1 Max
- 10-core CPU
- 64 GB of RAM

Test url: https://choosealicense.com (small, 32 pages)
libraries | speed |
---|---|
spider-rs: crawl 10 samples | 286ms |
scrapy: crawl 10 samples | 2.5s |
Test url: https://rsseau.fr (medium, 211 pages)
libraries | speed |
---|---|
spider-rs: crawl 10 samples | 2.5s |
scrapy: crawl 10 samples | 10s |
Test url: https://a11ywatch.com (medium, 648 pages)
libraries | speed |
---|---|
spider-rs: crawl 10 samples | 2s |
scrapy: crawl 10 samples | 7.7s |
Test url: https://espn.com (large, 150,387 pages)
libraries | pages | speed |
---|---|---|
spider-rs: crawl 10 samples | 150,387 | 186s |
scrapy: crawl 10 samples | 49,598 | 1h |
Scrapy used too much memory; the crawl was cancelled after an hour.
Note: Performance scales with the size of the website and whether throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.