What is Scrapling?

admin

3 months ago

Scrapling is an adaptive web scraping framework for Python, designed to
handle everything from a single request to a large-scale crawl. Its main differences from
libraries such as BeautifulSoup or Scrapy:

🔄 Adaptive Scraping: Automatically relocates elements when a website changes its structure
🛡️ Anti-bot Bypass: Gets past Cloudflare Turnstile and captcha challenges without complex configuration
🕷️ Spider Framework: Scrapy-like API with concurrent crawling and pause/resume
🤖 MCP Server for AI: Integrates with Claude and Cursor for AI-assisted scraping
⚡ High performance: Faster than most Python scraping libraries, with JSON serialization reported as about 10x faster than the standard library

📌 Requirement: Python 3.10 or later

1. Installing Scrapling

Basic install (parser only)

pip install scrapling

Full install (fetchers + browsers)

# Install the fetchers
pip install "scrapling[fetchers]"

# Install the browsers (Chromium, Firefox)
scrapling install

Optional extras

# MCP Server for AI (Claude, Cursor)
pip install "scrapling[ai]"

# Interactive Shell
pip install "scrapling[shell]"

# Install everything
pip install "scrapling[all]"
scrapling install

Installing with Docker

# From Docker Hub
docker pull pyd4vinci/scrapling

# From GitHub Container Registry
docker pull ghcr.io/d4vinci/scrapling:latest

2. Three fetcher types: choosing the right tool

Scrapling provides three fetchers, each suited to a different situation:

Fetcher	Description	When to use
`Fetcher`	Fast HTTP requests with TLS fingerprint impersonation	Static websites, APIs
`StealthyFetcher`	Stealth mode, bypasses Cloudflare	Websites with anti-bot protection
`DynamicFetcher`	Full browser automation (Playwright)	Websites that need JavaScript rendering
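
The decision in the table can be written down as a tiny helper. This is only a sketch of the selection logic: `choose_fetcher` and its two flags are hypothetical names made up for illustration, not part of Scrapling, and the function returns the class name as a string so that it runs without the library installed.

```python
def choose_fetcher(needs_js: bool, has_antibot: bool) -> str:
    """Return the name of the fetcher the table suggests for a site.

    Hypothetical helper for illustration; it is not a Scrapling API.
    """
    if has_antibot:
        # Anti-bot protection is the strongest constraint, so check it first.
        return "StealthyFetcher"
    if needs_js:
        # Content that only appears after JavaScript runs needs a real browser.
        return "DynamicFetcher"
    # Static pages and plain APIs are fine with fast HTTP requests.
    return "Fetcher"


print(choose_fetcher(needs_js=False, has_antibot=False))  # Fetcher
print(choose_fetcher(needs_js=True, has_antibot=False))   # DynamicFetcher
print(choose_fetcher(needs_js=True, has_antibot=True))    # StealthyFetcher
```

Checking the anti-bot case first is a design choice of this sketch: a stealth browser can also render JavaScript, so it covers both needs at a higher cost per request.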

2.1 Fetcher: fast HTTP requests

from scrapling.fetchers import Fetcher

# Simple request
page = Fetcher.get('https://quotes.toscrape.com/')

# Extract data with a CSS selector
quotes = page.css('.quote .text::text').getall()
print(quotes)
# ['The world as we have created it...', 'It is our choices...', ...]

2.2 StealthyFetcher: bypassing Cloudflare

from scrapling.fetchers import StealthyFetcher

# Fetch a Cloudflare Turnstile demo page in a stealth browser
page = StealthyFetcher.fetch(
    'https://nopecha.com/demo/cloudflare',
    headless=True,
    solve_cloudflare=True,  # ask the fetcher to solve the Cloudflare challenge
    google_search=False
)

data = page.css('#padded_content a').getall()
print(f"Bypass succeeded, got {len(data)} links")

2.3 DynamicFetcher: full browser automation

from scrapling.fetchers import DynamicFetcher

# Render JavaScript fully
page = DynamicFetcher.fetch(
    'https://quotes.toscrape.com/',
    headless=True,
    disable_resources=False,
    network_idle=True    # Wait until the network is idle
)

# Both CSS and XPath selectors work
data_css = page.css('.quote .text::text').getall()
data_xpath = page.xpath('//span[@class="text"]/text()').getall()

3. Sessions: managing state across requests

Scrapling supports sessions that keep cookies and state across multiple requests:

HTTP Session

from scrapling.fetchers import FetcherSession

# The session keeps cookies across requests
with FetcherSession(impersonate='chrome') as session:
    # Request 1: load the login page (an actual login would also submit credentials)
    login_page = session.get('https://example.com/login', stealthy_headers=True)

    # Request 2: a page that requires auth (cookies are reused)
    dashboard = session.get('https://example.com/dashboard')
    data = dashboard.css('.user-data::text').get()

Stealth Session: bypassing anti-bot protection across requests

from scrapling.fetchers import StealthySession

# Keep the browser open across multiple requests
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://protected-site.com/page1')
    page2 = session.fetch('https://protected-site.com/page2')
    # The browser closes only when the context manager exits

Async Session: concurrent crawling

import asyncio
from scrapling.fetchers import AsyncStealthySession

async def crawl():
    async with AsyncStealthySession(max_pages=5) as session:
        tasks = []
        urls = [
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3',
        ]
        for url in urls:
            tasks.append(session.fetch(url))

        # Fetch concurrently
        results = await asyncio.gather(*tasks)

        for result in results:
            print(result.css('title::text').get())

asyncio.run(crawl())
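
To make the idea of a concurrency cap concrete, here is a minimal standard-library sketch of bounded concurrent fetching. It does not show how Scrapling implements `max_pages`; `fake_fetch`, `crawl_all`, and `max_concurrent` are hypothetical names, and the network call is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fake_fetch(url: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Pretend to fetch a URL while respecting the concurrency limit."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real network round trip
        stats["active"] -= 1
        return f"fetched {url}"


async def crawl_all(urls: list[str], max_concurrent: int = 2):
    """Fetch all URLs, never running more than max_concurrent at once."""
    limiter = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    # gather() returns results in the same order as the input URLs.
    results = await asyncio.gather(*(fake_fetch(u, limiter, stats) for u in urls))
    return results, stats["peak"]


demo_urls = [f"https://example.com/page{i}" for i in range(1, 6)]
demo_results, peak = asyncio.run(crawl_all(demo_urls))
print(demo_results[0])
print(f"peak concurrency: {peak}")
```

All five tasks are scheduled at once, but the semaphore lets only two of them hold a "page" at any moment, which is the general pattern behind a page or connection limit.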

4. Spiders: large-scale crawling

Scrapling has a Scrapy-like Spider framework that aims to be easier to use:

Basic spider

from scrapling.spiders import Spider, Response

class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10  # 10 concurrent requests

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
            }

        # Follow pagination
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

# Run the spider
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")

# Export to JSON and JSON Lines
result.items.to_json("quotes.json")
result.items.to_jsonl("quotes.jsonl")
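
For readers unfamiliar with the two export formats, the sketch below shows the difference using only the standard `json` module. It does not use Scrapling's exporters, and the sample items are made up.

```python
import json

# Made-up items with the same shape as the spider output above.
sample_items = [
    {"text": "First quote", "author": "Author A"},
    {"text": "Second quote", "author": "Author B"},
]

# JSON: one document containing the whole list.
as_json = json.dumps(sample_items, ensure_ascii=False)

# JSON Lines: one independent JSON object per line.
as_jsonl = "\n".join(json.dumps(item, ensure_ascii=False) for item in sample_items)

print(as_json)
print(as_jsonl)

# JSON Lines can be read back line by line without loading everything at once.
restored = [json.loads(line) for line in as_jsonl.splitlines()]
```

JSON Lines tends to suit long crawls better because items can be appended and processed one at a time, while a single JSON array has to be complete before it is valid.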

Multi-Session Spider: combining several fetcher types

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        # Fast session for ordinary pages
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth session for pages behind anti-bot protection
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                # Route through the stealth session
                yield Request(link, sid="stealth")
            else:
                # Route through the fast session
                yield Request(link, sid="fast", callback=self.parse)

Pause & Resume: stopping and continuing a crawl

# Run with a checkpoint directory
spider = QuotesSpider(crawldir="./crawl_data")
spider.start()

# Press Ctrl+C to pause; progress is saved automatically
# Run it again and the crawl resumes from where it stopped
Streaming Mode: real-time processing

import asyncio
from scrapling.spiders import Spider, Response

class StreamSpider(Spider):
    name = "stream"
    start_urls = ["https://quotes.toscrape.com/"]

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {"text": quote.css('.text::text').get()}

async def main():
    spider = StreamSpider()
    async for item in spider.stream():
        # Handle each item as soon as it arrives
        print(f"Got: {item}")

asyncio.run(main())

5. Adaptive scraping

Scrapling's most distinctive feature: when a website changes its HTML structure, Scrapling can relocate
the elements using similarity algorithms.

from scrapling.fetchers import StealthyFetcher

# Enable adaptive mode
StealthyFetcher.adaptive = True

# First run: save the structure of the matched elements
page = StealthyFetcher.fetch('https://example.com', headless=True)
products = page.css('.product', auto_save=True)  # Save a fingerprint of the elements

# Later, on a freshly fetched page after a redesign, pass adaptive=True to relocate them
products = page.css('.product', adaptive=True)
# Scrapling compares candidates with the saved fingerprint to find the elements again
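
To give a feel for what relocating by similarity means, here is a toy sketch built on the standard `difflib` module. It is not Scrapling's algorithm: the fingerprint format, the `relocate` function, and the 0.5 threshold are all assumptions chosen for illustration, and elements are plain dictionaries instead of parsed HTML.

```python
from difflib import SequenceMatcher


def fingerprint(element: dict) -> str:
    """Flatten an element's tag, classes, and text into one comparable string."""
    classes = " ".join(sorted(element.get("classes", [])))
    return f'{element["tag"]}|{classes}|{element.get("text", "")}'


def relocate(saved: dict, candidates: list[dict], threshold: float = 0.5):
    """Return the candidate most similar to the saved element, or None."""
    if not candidates:
        return None
    saved_fp = fingerprint(saved)
    scored = [
        (SequenceMatcher(None, saved_fp, fingerprint(c)).ratio(), c)
        for c in candidates
    ]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


# Element as it looked when the fingerprint was saved.
saved_product = {"tag": "div", "classes": ["product"], "text": "Blue Widget $19"}

# Elements found on the redesigned page: the old '.product' selector no longer matches.
new_page = [
    {"tag": "nav", "classes": ["menu"], "text": "Home About"},
    {"tag": "article", "classes": ["product-card"], "text": "Blue Widget $19"},
    {"tag": "footer", "classes": ["site-footer"], "text": "Contact us"},
]

match = relocate(saved_product, new_page)
print(match["tag"], match["classes"])  # article ['product-card']
```

A real implementation would compare many more properties (attributes, position in the tree, neighbours), but the principle is the same: score every candidate against what was saved and keep the best one if it is close enough.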

Finding similar elements

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
first_quote = page.css('.quote')[0]

# Find all elements similar to this one
similar = first_quote.find_similar()

# Find the elements below this one
below = first_quote.below_elements()

6. Advanced parsing

Several ways to select elements

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')

# CSS Selector
quotes = page.css('.quote')

# XPath
quotes = page.xpath('//div[@class="quote"]')

# BeautifulSoup-style
quotes = page.find_all('div', {'class': 'quote'})
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(class_='quote')

# Find by text
quotes = page.find_by_text('quote', tag='div')

Navigation: traversing the DOM

# Get the first element
first_quote = page.css('.quote')[0]

# Chained selectors
text = first_quote.css('.text::text').get()
author = first_quote.css('.author::text').get()

# DOM traversal
parent = first_quote.parent
next_el = first_quote.next_sibling
children = first_quote.children

Generating selectors automatically

# Auto-generate a CSS/XPath selector for any element
element = page.css('.quote')[0]
css_selector = element.generate_css_selector()
xpath_selector = element.generate_xpath()

Parsing HTML directly (no fetch needed)

from scrapling.parser import Selector

html = "<html><body><h1>Hello Scrapling!</h1></body></html>"
page = Selector(html)
title = page.css('h1::text').get()
print(title)  # "Hello Scrapling!"

7. Proxy Rotation

from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider
from scrapling.core import ProxyRotator

# Single proxy
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')

# Proxy rotation in a Spider
proxies = [
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
]

rotator = ProxyRotator(proxies, strategy='cyclic')

class MySpider(Spider):
    name = "proxy_spider"
    start_urls = ["https://example.com"]
    proxy_rotator = rotator
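
A cyclic strategy simply hands out proxies in order and wraps around at the end of the list. The sketch below shows that behaviour with `itertools.cycle`. It is an illustration of the concept, not Scrapling's `ProxyRotator`; `CyclicRotator` and `next_proxy` are hypothetical names.

```python
from itertools import cycle


class CyclicRotator:
    """Hand out proxies round-robin (illustrative, not the Scrapling class)."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        """Return the next proxy, wrapping around after the last one."""
        return next(self._pool)


demo_rotator = CyclicRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])

# Four requests against three proxies: the fourth reuses the first proxy.
picked = [demo_rotator.next_proxy() for _ in range(4)]
print(picked[-1])  # http://proxy1:8080
```

Round-robin spreads requests evenly but ignores proxy health; rotators used in production often also skip proxies that recently failed.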

8. CLI & Interactive Shell

Scrapling comes with a capable command-line interface:

Interactive Shell

# Open the interactive shell (IPython)
scrapling shell

Extracting directly from the terminal

# Extract content as Markdown
scrapling extract get 'https://example.com' content.md

# Extract plain text
scrapling extract get 'https://example.com' content.txt

# Extract with a specific CSS selector
scrapling extract get 'https://example.com' content.txt \
    --css-selector '#main-content' \
    --impersonate 'chrome'

# Bypass Cloudflare and extract
scrapling extract stealthy-fetch 'https://protected-site.com' data.html \
    --css-selector '#content' \
    --solve-cloudflare

9. MCP Server: AI integration

Scrapling ships with a built-in MCP (Model Context Protocol) server that lets AI assistants (Claude, Cursor,
Windsurf) scrape the web directly:

# Install the MCP feature
pip install "scrapling[ai]"
scrapling install

# Run the MCP server
scrapling mcp

Then configure Claude Desktop or Cursor to connect to it. The AI can use Scrapling to:

Extract web content before processing it (reduces token cost)
Scrape data from natural-language requests
Automatically choose a suitable fetcher
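
As a rough sketch of the configuration step, the script below prints an `mcpServers` entry of the kind MCP clients such as Claude Desktop read from their JSON config file. The entry name `scrapling` is an arbitrary label chosen here, and the sketch assumes the `scrapling` executable is on the PATH of the client; depending on the setup, a full path to the executable may be needed, and the config file location differs per client.

```python
import json

# Assumed entry: launch the MCP server with the same command shown above.
mcp_config = {
    "mcpServers": {
        "scrapling": {          # arbitrary label for this server
            "command": "scrapling",
            "args": ["mcp"],
        }
    }
}

config_text = json.dumps(mcp_config, indent=2)
print(config_text)
```

The printed JSON can be merged into the client's existing configuration rather than replacing it, so that other configured servers are kept.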

10. Feature and performance comparison

Feature	Scrapling	BeautifulSoup	Scrapy	Selenium
Parsing speed	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐
Anti-bot bypass	✅ Built-in	❌	Plugin	Partial
Adaptive scraping	✅	❌	❌	❌
Browser automation	✅ 3 levels	❌	Plugin	✅
Spider/Crawler	✅ Built-in	❌	✅ Core	❌
AI Integration	✅ MCP	❌	❌	❌
Async support	✅ Full	❌	✅ Twisted	❌
Pause/Resume	✅	❌	Plugin	❌
Memory usage	Low	High	Medium	Very high

Practical example: scraping e-commerce products

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        # Scrape quotes (replace with your real product selectors)
        for item in response.css('.quote'):
            yield {
                "text": item.css('.text::text').get(),
                "author": item.css('.author::text').get(),
                "tags": item.css('.tag::text').getall(),
            }

        # Pagination
        next_btn = response.css('.next a')
        if next_btn:
            yield response.follow(next_btn[0].attrib['href'])

# Run and export
result = ProductSpider().start()
print(f"✅ Finished scraping {len(result.items)} items")
result.items.to_json("products.json")
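
The pagination step above follows a relative `href` such as `/page/2/`, which has to be resolved against the current page URL. The sketch below shows that resolution with the standard `urljoin` function over a small in-memory stand-in for the site. `FAKE_SITE` and `walk_pages` are hypothetical, and the sketch does not reflect how `response.follow` is implemented.

```python
from urllib.parse import urljoin

# In-memory stand-in for the site: page URL -> relative href of the "next" link.
FAKE_SITE = {
    "https://quotes.toscrape.com/": "/page/2/",
    "https://quotes.toscrape.com/page/2/": "/page/3/",
    "https://quotes.toscrape.com/page/3/": None,  # last page has no "next" link
}


def walk_pages(start_url: str) -> list[str]:
    """Follow 'next' links until a page has none, returning the visited URLs."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = FAKE_SITE.get(url)
        # Resolve the relative link against the page it was found on.
        url = urljoin(url, next_href) if next_href else None
    return visited


pages = walk_pages("https://quotes.toscrape.com/")
print(pages[-1])  # https://quotes.toscrape.com/page/3/
```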

Summary

Scrapling is an all-in-one option for modern web scraping in Python:

✅ Simple: Familiar API (similar to Scrapy + BeautifulSoup)
✅ Powerful: 3 fetcher levels, anti-bot bypass, adaptive tracking
✅ AI-ready: MCP Server for Claude/Cursor
✅ Production-ready: Spider framework, proxy rotation, pause/resume
✅ High performance: Faster than BeautifulSoup, lighter than Selenium

📖 Official documentation: scrapling.readthedocs.io
📦 GitHub: github.com/D4Vinci/Scrapling