feat: initial commit

.env.test.example (new file)
@@ -0,0 +1,8 @@

# Scrapling service URL (default: http://localhost:8765)
SCRAPLING_SERVICE_URL=http://localhost:8765

# Optional API key for the Scrapling service
SCRAPLING_API_KEY=

# Test URL used in integration tests
SCRAPLING_TEST_URL=https://httpbin.org/html

.gitignore (new file)
@@ -0,0 +1,10 @@

node_modules/
dist/
.env.test
*.js.map
*.d.ts.map
coverage/
__pycache__/
*.pyc
.venv/
service/.venv/

README.md (new file)
@@ -0,0 +1,399 @@

# n8n-nodes-scrapling

A community node for [n8n](https://n8n.io) that integrates the [Scrapling](https://github.com/D4Vinci/Scrapling) library: a fast, adaptive web scraper supporting plain HTTP, a stealth mode (TLS fingerprint impersonation), and a full Playwright browser.

---

## Architecture

```
n8n (Node.js / TypeScript)
        │
        │  HTTP POST /scrape
        ▼
Scrapling Service (Python / FastAPI)
        │
        ├── Fetcher           - fast HTTP requests
        ├── StealthyFetcher   - TLS impersonation (curl-impersonate)
        └── PlayWrightFetcher - full Chromium browser
```

The n8n node talks to the Scrapling Service over HTTP. The Python service manages the scraper instances and returns structured JSON.
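
The hop in the diagram is a single HTTP call. A minimal sketch of it in Python, assuming the service is already running locally (`httpx` is an arbitrary client choice here; the payload fields follow the `POST /scrape` schema documented below):

```python
# Sketch of the n8n -> service hop; assumes a local service on port 8765.
import httpx

payload = {
    "url": "https://example.com",
    "fetcher_type": "http",   # or "stealth" / "dynamic"
    "return_html": False,
    "timeout": 30000,
}
# The X-API-Key header is only needed when API_KEY is set in the service.
resp = httpx.post("http://localhost:8765/scrape", json=payload, timeout=60)
result = resp.json()
print(result["status_code"], result["fetcher_used"], result["elapsed_ms"])
```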

---

## Requirements

| Component | Version  |
|-----------|----------|
| n8n       | ≥ 1.0    |
| Node.js   | ≥ 18     |
| Python    | ≥ 3.11   |
| Docker    | optional |

---

## Installation

### Option A: Docker Compose (recommended)

Starts n8n and the Scrapling Service with a single command.

```bash
# Copy and fill in the environment variables
cp .env.test.example .env

# Build and start
docker compose up -d
```

n8n is available at: `http://localhost:5678`
Scrapling Service: `http://localhost:8765`

### Option B: Manual installation

**1. Install the n8n node dependencies:**

```bash
npm install
npm run build
```

**2. Install the Python service:**

```bash
cd service
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/scrapling install  # downloads the Playwright browsers
```

**3. Start the service (from the repository root, so the `service` package's relative imports resolve):**

```bash
service/.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765
```

**4. Install the node in n8n:**

Copy the `dist/` directory into n8n's community-nodes directory:

```bash
# Default path for an npm-installed n8n
cp -r dist ~/.n8n/nodes/node_modules/n8n-nodes-scrapling
```

Then restart n8n.

---

## Credentials configuration

In n8n, add a new credential of type **Scrapling API**:

| Field       | Description                                      | Default                  |
|-------------|--------------------------------------------------|--------------------------|
| Service URL | URL of the Scrapling Service (no trailing slash) | `http://localhost:8765`  |
| API Key     | Optional authorization key (header `X-API-Key`)  | _(empty)_                |

If `API_KEY` is empty in the service, authorization is disabled.

---

## Using the node

### Resource: Page

Fetches whole pages. Returns the URL, HTTP status, fetch time, and (optionally) the raw HTML.

#### Operations

| Operation     | Fetcher             | Description                                                              |
|---------------|---------------------|--------------------------------------------------------------------------|
| Fetch         | `Fetcher`           | Fast HTTP request. The quickest option for static pages.                 |
| Fetch Stealth | `StealthyFetcher`   | TLS impersonation (curl-impersonate). Bypasses basic anti-bot measures.  |
| Fetch Dynamic | `PlayWrightFetcher` | Full Chromium browser (Playwright). For SPAs and pages that require JavaScript. |

**Example output:**

```json
{
  "url": "https://example.com",
  "status_code": 200,
  "html": "<html>...</html>",
  "data": {},
  "fetcher_used": "http",
  "elapsed_ms": 312.5
}
```

---

### Resource: Data

Extracts structured data from fetched pages.

#### Operations

| Operation      | Description                                                    |
|----------------|----------------------------------------------------------------|
| Extract        | Fetches a page and extracts data using CSS or XPath selectors. |
| Extract Tables | Fetches a page and returns all HTML tables as JSON arrays.     |

#### Selector configuration (Extract)

Each selector defines one field in the output `data` object:

| Field           | Description                                                            |
|-----------------|------------------------------------------------------------------------|
| Field Name      | Key name in the output JSON                                            |
| Selector        | CSS or XPath expression, e.g. `h1.title` or `//h1[@class="title"]`     |
| Type            | `css` or `xpath`                                                       |
| Attribute       | HTML attribute to read (e.g. `href`, `src`). Empty = text content.     |
| Return Multiple | Return an array of all matching elements                               |

**Example output for Extract:**

```json
{
  "url": "https://news.ycombinator.com",
  "status_code": 200,
  "data": {
    "titles": ["Show HN: ...", "Ask HN: ...", "..."],
    "top_link": "https://..."
  },
  "fetcher_used": "http",
  "elapsed_ms": 187.2
}
```
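
In service terms, each selector row becomes one entry in the request's `selectors` array. The output above could come from a configuration like this sketch (the `.titleline a` selector is borrowed from the CLI examples further down):

```python
# Sketch: the selector rows behind the Extract output above.
selectors = [
    # all matching titles, as an array of text contents
    {"name": "titles", "selector": ".titleline a", "selector_type": "css", "multiple": True},
    # a single href attribute from the first match
    {"name": "top_link", "selector": ".titleline a", "selector_type": "css",
     "attribute": "href", "multiple": False},
]
```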

---

### Common options

| Option                | Description                                                         | Default   |
|-----------------------|---------------------------------------------------------------------|-----------|
| Return Raw HTML       | Include the raw HTML in the response                                | `false`   |
| Timeout (ms)          | Maximum request time (1000–120000 ms)                               | `30000`   |
| Proxy                 | Proxy URL, e.g. `http://user:pass@proxy.example.com:8080`           | _(empty)_ |
| Extra Headers         | Additional HTTP headers as JSON, e.g. `{"Accept-Language": "pl"}`   | `{}`      |
| Wait for Selector     | CSS selector Playwright waits for before extracting                 | _(empty)_ |
| Wait for Network Idle | Wait for network activity to finish (Playwright)                    | `false`   |
| Headless Browser      | Run Playwright without a GUI                                        | `true`    |

> The **Wait for Selector**, **Wait for Network Idle**, and **Headless Browser** options only apply to the `dynamic` fetcher.
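
As a sketch, a dynamic fetch that uses these options maps onto the following `/scrape` payload (field names follow the request schema in the API section; the URL is a placeholder):

```python
# Sketch of a dynamic-fetcher payload; https://spa.example.com is a placeholder.
payload = {
    "url": "https://spa.example.com",
    "fetcher_type": "dynamic",
    "timeout": 60000,
    "wait_selector": "#content",  # Wait for Selector
    "network_idle": True,         # Wait for Network Idle
    "headless": True,             # Headless Browser
}
```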

---

## CLI Runner

For testing without running n8n:

```bash
# Basic usage
npx ts-node scripts/scrapling-run.ts https://example.com

# Stealth mode
npx ts-node scripts/scrapling-run.ts https://example.com --fetcher stealth

# Data extraction
npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com \
  --selector "titles:.titleline a" \
  --fetcher http \
  --format json

# Playwright: wait for an element
npx ts-node scripts/scrapling-run.ts https://spa.example.com \
  --fetcher dynamic \
  --wait "#content" \
  --html

# Or via the Taskfile
task scrape URL=https://example.com FETCHER=stealth
task scrape:dynamic URL=https://spa.example.com
```

The CLI reads credentials from `.env.test` or from environment variables:

```bash
SCRAPLING_SERVICE_URL=http://localhost:8765
SCRAPLING_API_KEY=  # optional
```

---

## Taskfile

| Task                   | Description                                        |
|------------------------|----------------------------------------------------|
| `task setup`           | npm install + copy `.env.test.example`             |
| `task build`           | Compile TypeScript to `dist/`                      |
| `task dev`             | Watch mode                                         |
| `task test`            | Run unit tests                                     |
| `task test:coverage`   | Tests with a coverage report                       |
| `task lint`            | ESLint                                             |
| `task check`           | Lint + test (pre-push)                             |
| `task service:install` | Install the Python service into `.venv`            |
| `task service:start`   | Start the service on port 8765                     |
| `task service:health`  | Check the health endpoint                          |
| `task scrape`          | CLI scraper `[URL=...] [FETCHER=...] [FORMAT=...]` |
| `task scrape:stealth`  | Scrape in stealth mode                             |
| `task scrape:dynamic`  | Scrape with the browser                            |
| `task docker:up`       | Start Docker Compose (n8n + service)               |
| `task docker:down`     | Stop Docker Compose                                |
| `task docker:logs`     | Tail logs                                          |

---

## Scrapling Service API

### `POST /scrape`

Fetches a page and optionally extracts data.

**Request body:**

```json
{
  "url": "https://example.com",
  "fetcher_type": "http",
  "selectors": [
    {
      "name": "title",
      "selector": "h1",
      "selector_type": "css",
      "attribute": null,
      "multiple": false
    }
  ],
  "return_html": false,
  "timeout": 30000,
  "proxy": null,
  "headers": {},
  "wait_selector": null,
  "network_idle": false,
  "headless": true
}
```

**Response:**

```json
{
  "url": "https://example.com",
  "status_code": 200,
  "html": null,
  "data": {
    "title": "Example Domain"
  },
  "fetcher_used": "http",
  "elapsed_ms": 245.3,
  "error": null
}
```

On error, the service responds with HTTP 200 and a populated `error` field (instead of raising an exception), which lets n8n handle the failure via its **Continue On Fail** option.

### `GET /health`

```json
{
  "status": "ok",
  "version": "0.1.0",
  "dynamic_session_ready": true
}
```

Authorization: if the `API_KEY` environment variable is set in the service, every request to `/scrape` requires the `X-API-Key` header.

---

## Environment variables

### Scrapling Service

| Variable  | Description                      | Default              |
|-----------|----------------------------------|----------------------|
| `API_KEY` | API key for authorizing requests | _(empty = disabled)_ |

### Docker Compose

| Variable            | Description                      | Default    |
|---------------------|----------------------------------|------------|
| `SCRAPLING_API_KEY` | Service API key (passed through) | _(empty)_  |
| `N8N_USER`          | n8n basic-auth username          | `admin`    |
| `N8N_PASSWORD`      | n8n basic-auth password          | `changeme` |

---

## Project structure

```
scrapling_n8n/
├── src/
│   ├── credentials/
│   │   └── ScraplingApi.credentials.ts   # Credentials: serviceUrl + apiKey
│   └── nodes/Scrapling/
│       ├── Scrapling.node.ts             # Main n8n node
│       ├── helpers.ts                    # HTTP client + types
│       ├── scrapling.svg                 # Node icon
│       └── __tests__/
│           ├── helpers.test.ts           # Helper tests
│           └── Scrapling.node.test.ts    # Node tests (mocked)
├── service/                              # Python FastAPI microservice
│   ├── main.py                           # Entry point + authorization
│   ├── routers/
│   │   ├── scrape.py                     # POST /scrape
│   │   └── health.py                     # GET /health
│   ├── scrapers/
│   │   ├── base.py                       # Abstraction + apply_selectors()
│   │   ├── fetcher.py                    # Fetcher wrapper
│   │   ├── stealthy.py                   # StealthyFetcher wrapper
│   │   └── dynamic.py                    # PlayWrightFetcher wrapper
│   ├── models/
│   │   ├── request.py                    # Pydantic ScrapeRequest
│   │   └── response.py                   # Pydantic ScrapeResponse
│   ├── pyproject.toml                    # Python dependencies
│   └── Dockerfile
├── scripts/
│   └── scrapling-run.ts                  # CLI runner (no n8n)
├── dist/                                 # Compiled JS (after npm run build)
├── docker-compose.yml
├── package.json
├── tsconfig.json
├── jest.config.js
├── Taskfile.yml
└── .env.test.example
```

---

## Development

```bash
# Install
task setup

# Watch mode (TypeScript)
task dev

# Tests
task test
task test:coverage

# Lint
task lint

# Run the service locally
task service:install
task service:start

# Quick end-to-end test
task scrape URL=https://httpbin.org/html
```

---

## License

MIT

Taskfile.yml (new file)
@@ -0,0 +1,138 @@

version: '3'

dotenv: ['.env.test']

vars:
  URL: 'https://example.com'
  FETCHER: 'http'
  FORMAT: 'pretty'
  SERVICE_URL: 'http://localhost:8765'

tasks:

  # ── Build ──────────────────────────────────────────────────────────────────

  build:
    desc: Compile TypeScript to dist/
    cmds:
      - npm run build

  dev:
    desc: Watch mode — recompile on change
    cmds:
      - npm run dev

  # ── Code quality ───────────────────────────────────────────────────────────

  lint:
    desc: Run ESLint
    cmds:
      - npm run lint

  format:
    desc: Format source with Prettier
    cmds:
      - npm run format

  # ── Tests ──────────────────────────────────────────────────────────────────

  test:
    desc: Run all unit tests
    cmds:
      - npm test

  test:watch:
    desc: Run tests in watch mode
    cmds:
      - npm run test:watch

  test:coverage:
    desc: Run tests with coverage report
    cmds:
      - npm run test:coverage

  # ── Python service ─────────────────────────────────────────────────────────

  service:install:
    desc: Install Python service dependencies (creates .venv in service/)
    dir: service
    cmds:
      - python3 -m venv .venv
      - .venv/bin/pip install -e ".[dev]"
      - .venv/bin/scrapling install

  service:start:
    desc: Start Scrapling microservice on port 8765
    cmds:
      # run from the repo root so the `service` package's relative imports resolve
      - service/.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765 --reload

  service:health:
    desc: Check Scrapling service health
    cmds:
      - curl -s {{.SERVICE_URL}}/health | python3 -m json.tool

  # ── Scrapling CLI runner ────────────────────────────────────────────────────

  scrape:
    desc: "Scrape a URL [URL=https://example.com] [FETCHER=http|stealth|dynamic] [FORMAT=pretty|json]"
    cmds:
      - >
        npx ts-node scripts/scrapling-run.ts {{.URL}}
        --fetcher {{.FETCHER}}
        --format {{.FORMAT}}

  scrape:html:
    desc: "Scrape and return raw HTML [URL=https://example.com]"
    cmds:
      - npx ts-node scripts/scrapling-run.ts {{.URL}} --html --format json

  scrape:stealth:
    desc: "Scrape with stealth fetcher [URL=https://example.com]"
    cmds:
      - npx ts-node scripts/scrapling-run.ts {{.URL}} --fetcher stealth --format {{.FORMAT}}

  scrape:dynamic:
    desc: "Scrape with Playwright browser [URL=https://example.com]"
    cmds:
      - npx ts-node scripts/scrapling-run.ts {{.URL}} --fetcher dynamic --format {{.FORMAT}}

  # ── Docker ─────────────────────────────────────────────────────────────────

  docker:build:
    desc: Build Docker image for Scrapling service
    cmds:
      # build from the repo root so the context matches docker-compose.yml
      - docker build -t scrapling-service -f service/Dockerfile .

  docker:up:
    desc: Start full stack (n8n + scrapling-service) via Docker Compose
    cmds:
      - docker compose up -d

  docker:down:
    desc: Stop Docker Compose stack
    cmds:
      - docker compose down

  docker:logs:
    desc: Tail Docker Compose logs
    cmds:
      - docker compose logs -f

  # ── Composite ─────────────────────────────────────────────────────────────

  check:
    desc: Lint + test (pre-push safety check)
    cmds:
      - task: lint
      - task: test

  setup:
    desc: Install all dependencies and copy .env.test.example if missing
    cmds:
      - npm install
      - |
        if [ ! -f .env.test ]; then
          cp .env.test.example .env.test
          echo ".env.test created — fill in your credentials"
        fi

docker-compose.yml (new file)
@@ -0,0 +1,35 @@

services:
  scrapling-service:
    build:
      context: .
      dockerfile: service/Dockerfile
    ports:
      - "8765:8765"
    environment:
      API_KEY: ${SCRAPLING_API_KEY:-}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8765/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  n8n:
    image: n8nio/n8n:latest
    ports:
      - "5678:5678"
    environment:
      N8N_BASIC_AUTH_ACTIVE: "true"
      N8N_BASIC_AUTH_USER: ${N8N_USER:-admin}
      N8N_BASIC_AUTH_PASSWORD: ${N8N_PASSWORD:-changeme}
      NODE_ENV: production
    volumes:
      - n8n_data:/home/node/.n8n
      - ./dist:/home/node/.n8n/nodes/node_modules/n8n-nodes-scrapling
    depends_on:
      scrapling-service:
        condition: service_healthy
    restart: unless-stopped

volumes:
  n8n_data:

index.ts (new file)
@@ -0,0 +1,2 @@

export { Scrapling } from './src/nodes/Scrapling/Scrapling.node';
export { ScraplingApi } from './src/credentials/ScraplingApi.credentials';

jest.config.js (new file)
@@ -0,0 +1,12 @@

/** @type {import('jest').Config} */
module.exports = {
  preset: 'ts-jest',
  testEnvironment: 'node',
  roots: ['<rootDir>/src'],
  testMatch: ['**/__tests__/**/*.test.ts'],
  collectCoverageFrom: [
    'src/**/*.ts',
    '!src/**/*.d.ts',
  ],
  moduleFileExtensions: ['ts', 'js', 'json'],
};

package-lock.json (generated, new file; 6543 lines, diff not shown)

package.json (new file)
@@ -0,0 +1,57 @@

{
  "name": "n8n-nodes-scrapling",
  "version": "0.1.0",
  "description": "n8n community node for Scrapling — fast, adaptive web scraping with HTTP, stealth and dynamic (Playwright) fetchers",
  "keywords": [
    "n8n-community-node-package",
    "scrapling",
    "scraping",
    "playwright",
    "web-scraping"
  ],
  "license": "MIT",
  "homepage": "https://github.com/paramah/n8n-nodes-scrapling",
  "author": {
    "name": "paramah"
  },
  "main": "index.js",
  "scripts": {
    "build": "tsc && npm run copy-icons",
    "copy-icons": "copyfiles -u 2 'src/nodes/**/*.svg' dist/nodes/",
    "dev": "tsc --watch",
    "format": "prettier src --write",
    "lint": "eslint src --ext .ts",
    "prepublishOnly": "npm run build && npm run lint",
    "test": "jest",
    "test:watch": "jest --watch",
    "test:coverage": "jest --coverage",
    "scrape": "ts-node scripts/scrapling-run.ts"
  },
  "files": [
    "dist"
  ],
  "n8n": {
    "n8nNodesApiVersion": 1,
    "credentials": [
      "dist/credentials/ScraplingApi.credentials.js"
    ],
    "nodes": [
      "dist/nodes/Scrapling/Scrapling.node.js"
    ]
  },
  "devDependencies": {
    "@types/jest": "^29.5.14",
    "@typescript-eslint/parser": "^6.0.0",
    "copyfiles": "^2.4.1",
    "dotenv": "^17.4.2",
    "jest": "^29.7.0",
    "n8n-workflow": "*",
    "prettier": "^3.0.0",
    "ts-jest": "^29.4.9",
    "ts-node": "^10.9.2",
    "typescript": "^5.0.0"
  },
  "peerDependencies": {
    "n8n-workflow": "*"
  }
}

scripts/scrapling-run.ts (new file)
@@ -0,0 +1,213 @@

#!/usr/bin/env ts-node
/**
 * Standalone Scrapling CLI runner — no n8n needed.
 *
 * Usage:
 *   npx ts-node scripts/scrapling-run.ts <url> [options]
 *
 * Options:
 *   --fetcher http|stealth|dynamic    (default: http)
 *   --selector <name>:<css-selector>  (can be repeated)
 *   --html                            include raw HTML in output
 *   --wait <css-selector>             wait for selector (dynamic only)
 *   --timeout <ms>                    (default: 30000)
 *   --format pretty|json              (default: pretty)
 *   --service <url>                   service URL override
 *
 * Credentials (from .env.test or environment):
 *   SCRAPLING_SERVICE_URL   default: http://localhost:8765
 *   SCRAPLING_API_KEY       optional
 *
 * Examples:
 *   npx ts-node scripts/scrapling-run.ts https://example.com
 *   npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com --fetcher stealth --selector "title:title"
 *   npx ts-node scripts/scrapling-run.ts https://spa.example.com --fetcher dynamic --wait "#app"
 */

import * as https from 'https';
import * as http from 'http';
import * as path from 'path';
import * as fs from 'fs';

// ── Load .env.test if present ─────────────────────────────────────────────────

const envFile = path.resolve(__dirname, '..', '.env.test');
if (fs.existsSync(envFile)) {
  const lines = fs.readFileSync(envFile, 'utf-8').split('\n');
  for (const line of lines) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue;
    const key = trimmed.slice(0, eq).trim();
    const value = trimmed.slice(eq + 1).trim().replace(/^["']|["']$/g, '');
    if (!process.env[key]) process.env[key] = value;
  }
}

// ── Parse CLI args ────────────────────────────────────────────────────────────

const args = process.argv.slice(2);
const url = args.find((a) => !a.startsWith('--'));

if (!url) {
  console.error('Usage: scrapling-run.ts <url> [--fetcher http|stealth|dynamic] [--selector name:selector] ...');
  process.exit(1);
}

function getArg(flag: string): string | undefined {
  const idx = args.indexOf(flag);
  return idx !== -1 ? args[idx + 1] : undefined;
}

function getFlag(flag: string): boolean {
  return args.includes(flag);
}

function getArgs(flag: string): string[] {
  const result: string[] = [];
  for (let i = 0; i < args.length; i++) {
    if (args[i] === flag && args[i + 1]) {
      result.push(args[i + 1]);
      i++;
    }
  }
  return result;
}

const fetcherType = (getArg('--fetcher') ?? 'http') as 'http' | 'stealth' | 'dynamic';
const returnHtml = getFlag('--html');
const waitSelector = getArg('--wait');
const timeout = parseInt(getArg('--timeout') ?? '30000', 10);
const outputFormat = (getArg('--format') ?? 'pretty') as 'pretty' | 'json';
const serviceUrl = (getArg('--service') ?? process.env.SCRAPLING_SERVICE_URL ?? 'http://localhost:8765').replace(/\/$/, '');
const apiKey = process.env.SCRAPLING_API_KEY ?? '';

const rawSelectors = getArgs('--selector');
const selectors = rawSelectors.map((raw) => {
  const colonIdx = raw.indexOf(':');
  if (colonIdx === -1) {
    console.error(`Invalid selector format: "${raw}". Expected "name:selector"`);
    process.exit(1);
  }
  return {
    name: raw.slice(0, colonIdx),
    selector: raw.slice(colonIdx + 1),
    selector_type: 'css' as const,
    multiple: false,
  };
});

// ── Minimal HTTP client ───────────────────────────────────────────────────────

function postJson(reqUrl: string, body: unknown, headers: Record<string, string>): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const bodyStr = JSON.stringify(body);
    const parsed = new URL(reqUrl);
    const isHttps = parsed.protocol === 'https:';
    const transport = isHttps ? https : http;

    const req = transport.request(
      {
        hostname: parsed.hostname,
        port: parsed.port || (isHttps ? 443 : 80),
        path: parsed.pathname + parsed.search,
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Content-Length': Buffer.byteLength(bodyStr).toString(),
          ...headers,
        },
      },
      (res) => {
        const chunks: Buffer[] = [];
        res.on('data', (c: Buffer) => chunks.push(c));
        res.on('end', () => {
          const text = Buffer.concat(chunks).toString('utf-8');
          if (res.statusCode && res.statusCode >= 400) {
            reject(Object.assign(new Error(`HTTP ${res.statusCode}: ${text}`), { statusCode: res.statusCode }));
          } else {
            try {
              resolve(JSON.parse(text));
            } catch {
              resolve(text);
            }
          }
        });
      },
    );
    req.on('error', reject);
    req.write(bodyStr);
    req.end();
  });
}

// ── Main ──────────────────────────────────────────────────────────────────────

interface ScrapeResponse {
  url: string;
  status_code: number;
  html?: string;
  data: Record<string, unknown>;
  fetcher_used: string;
  elapsed_ms: number;
  error?: string;
}

async function main(): Promise<void> {
  console.log(`Service: ${serviceUrl}`);
  console.log(`URL: ${url}`);
  console.log(`Fetcher: ${fetcherType}`);
  if (selectors.length) console.log(`Selectors: ${selectors.map((s) => `${s.name}:${s.selector}`).join(', ')}`);
  console.log();

  const payload: Record<string, unknown> = {
    url,
    fetcher_type: fetcherType,
    return_html: returnHtml,
    timeout,
    selectors,
  };

  if (waitSelector) payload.wait_selector = waitSelector;

  const requestHeaders: Record<string, string> = {};
  if (apiKey) requestHeaders['X-API-Key'] = apiKey;

  const response = (await postJson(`${serviceUrl}/scrape`, payload, requestHeaders)) as ScrapeResponse;

  if (outputFormat === 'json') {
    console.log(JSON.stringify(response, null, 2));
    return;
  }

  // Pretty output
  console.log(`Status: ${response.status_code}`);
  console.log(`Fetcher: ${response.fetcher_used}`);
  console.log(`Elapsed: ${response.elapsed_ms}ms`);

  if (response.error) {
    console.error(`\nError: ${response.error}`);
    return;
  }

  if (Object.keys(response.data).length > 0) {
    console.log('\nExtracted data:');
    console.log('─'.repeat(50));
    for (const [key, val] of Object.entries(response.data)) {
      const display = Array.isArray(val) ? `[${(val as unknown[]).length} items]` : String(val);
      console.log(`  ${key.padEnd(25)} ${display}`);
    }
  }

  if (response.html) {
    console.log('\nHTML preview (first 500 chars):');
    console.log('─'.repeat(50));
    console.log(response.html.slice(0, 500));
  }
}

main().catch((err) => {
  console.error('\nError:', (err as Error).message ?? err);
  process.exit(1);
});

service/Dockerfile (new file)
@@ -0,0 +1,18 @@

FROM python:3.12-slim

WORKDIR /app

# System deps for Playwright
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    && rm -rf /var/lib/apt/lists/*

# The build context is the repository root (see docker-compose.yml), so the
# sources land under /app/service and `service.main:app` resolves below.
COPY service/ ./service/
RUN pip install --no-cache-dir -e ./service \
    && scrapling install

EXPOSE 8765

CMD ["uvicorn", "service.main:app", "--host", "0.0.0.0", "--port", "8765"]

service/__init__.py (new, empty file)

service/main.py (new file)
@@ -0,0 +1,29 @@

from __future__ import annotations

import os

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security.api_key import APIKeyHeader

from .routers import health_router, scrape_router

API_KEY = os.getenv("API_KEY", "")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


async def verify_api_key(key: str | None = Security(api_key_header)) -> None:
    if API_KEY and key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")


app = FastAPI(
    title="Scrapling Service",
    description="HTTP microservice exposing Scrapling web-scraping fetchers to n8n",
    version="0.1.0",
)

app.include_router(health_router)
app.include_router(
    scrape_router,
    dependencies=[Depends(verify_api_key)],
)
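
The auth dependency can be sanity-checked with FastAPI's `TestClient`; a sketch (note that `API_KEY` must be exported before `service.main` is imported, because the module reads it at import time):

```python
# Sketch: exercising verify_api_key via FastAPI's TestClient.
import os

os.environ["API_KEY"] = "secret"  # must happen before the import below

from fastapi.testclient import TestClient

from service.main import app

client = TestClient(app)

# /health is mounted without the auth dependency.
assert client.get("/health").status_code == 200

# /scrape requires the X-API-Key header once API_KEY is set.
assert client.post("/scrape", json={"url": "https://example.com"}).status_code == 401
ok = client.post(
    "/scrape",
    json={"url": "https://example.com"},
    headers={"X-API-Key": "secret"},
)
assert ok.status_code == 200  # auth passed; the body may still carry an "error" field
```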

service/models/__init__.py (new file)
@@ -0,0 +1,4 @@

from .request import ScrapeRequest, SelectorDef
from .response import ScrapeResponse, HealthResponse

__all__ = ["ScrapeRequest", "SelectorDef", "ScrapeResponse", "HealthResponse"]

service/models/request.py (new file)
@@ -0,0 +1,41 @@

from __future__ import annotations

from typing import Literal

from pydantic import BaseModel, field_validator


class SelectorDef(BaseModel):
    name: str
    selector: str
    selector_type: Literal["css", "xpath"] = "css"
    attribute: str | None = None  # None = get text content
    multiple: bool = False


class ScrapeRequest(BaseModel):
    url: str
    fetcher_type: Literal["http", "stealth", "dynamic"] = "http"
    selectors: list[SelectorDef] = []
    return_html: bool = False
    timeout: int = 30000
    proxy: str | None = None
    headers: dict[str, str] = {}
    # dynamic-fetcher specific
    wait_selector: str | None = None
    network_idle: bool = False
    headless: bool = True

    @field_validator("url")
    @classmethod
    def url_must_have_scheme(cls, v: str) -> str:
        if not v.startswith(("http://", "https://")):
            raise ValueError("URL must start with http:// or https://")
        return v

    @field_validator("timeout")
    @classmethod
    def timeout_range(cls, v: int) -> int:
        if not (1000 <= v <= 120_000):
            raise ValueError("timeout must be between 1000 and 120000 ms")
        return v
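
These validators reject malformed input before any fetcher runs; a small sketch of the behavior:

```python
# Sketch: the field validators fire on model construction.
from pydantic import ValidationError

from service.models.request import ScrapeRequest

req = ScrapeRequest(url="https://example.com")  # defaults: http fetcher, 30000 ms timeout
assert req.fetcher_type == "http" and req.timeout == 30000

try:
    ScrapeRequest(url="ftp://example.com")  # rejected: scheme must be http(s)
except ValidationError as exc:
    print(exc.errors()[0]["msg"])

try:
    ScrapeRequest(url="https://example.com", timeout=500)  # below the 1000 ms floor
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```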

service/models/response.py (new file)
@@ -0,0 +1,21 @@

from __future__ import annotations

from typing import Any

from pydantic import BaseModel


class ScrapeResponse(BaseModel):
    url: str
    status_code: int
    html: str | None = None
    data: dict[str, Any] = {}
    fetcher_used: str
    elapsed_ms: float
    error: str | None = None


class HealthResponse(BaseModel):
    status: str
    version: str
    dynamic_session_ready: bool

service/pyproject.toml (new file)
@@ -0,0 +1,25 @@

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "scrapling-service"
version = "0.1.0"
description = "FastAPI microservice wrapping Scrapling for n8n integration"
requires-python = ">=3.11"
dependencies = [
    "scrapling[fetchers]>=0.2",
    "fastapi>=0.115",
    "uvicorn[standard]>=0.30",
    "pydantic>=2.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.23",
    "httpx>=0.27",
]

[tool.pytest.ini_options]
asyncio_mode = "auto"

service/routers/__init__.py (new file)
@@ -0,0 +1,4 @@

from .scrape import router as scrape_router
from .health import router as health_router

__all__ = ["scrape_router", "health_router"]

service/routers/health.py (new file)
@@ -0,0 +1,15 @@

from fastapi import APIRouter

from ..models.response import HealthResponse

router = APIRouter()

VERSION = "0.1.0"


@router.get("/health", response_model=HealthResponse)
async def health() -> HealthResponse:
    return HealthResponse(
        status="ok",
        version=VERSION,
        dynamic_session_ready=True,
    )

service/routers/scrape.py (new file)
@@ -0,0 +1,35 @@

from __future__ import annotations

from fastapi import APIRouter, HTTPException

from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from ..scrapers import DynamicScraper, HttpScraper, StealthyScraper

router = APIRouter()


@router.post("/scrape", response_model=ScrapeResponse)
async def scrape(req: ScrapeRequest) -> ScrapeResponse:
    try:
        if req.fetcher_type == "http":
            scraper = HttpScraper()
        elif req.fetcher_type == "stealth":
            scraper = StealthyScraper()
        elif req.fetcher_type == "dynamic":
            scraper = DynamicScraper()
        else:
            raise HTTPException(status_code=400, detail=f"Unknown fetcher_type: {req.fetcher_type}")

        return await scraper.scrape(req)

    except HTTPException:
        raise
    except Exception as exc:
        return ScrapeResponse(
            url=req.url,
            status_code=0,
            fetcher_used=req.fetcher_type,
            elapsed_ms=0,
            error=str(exc),
        )
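
Two failure modes fall out of this handler: schema violations are rejected by FastAPI with a 422 before the handler runs, while runtime scraping failures come back as HTTP 200 with `error` set. A sketch (assumes the service dependencies are installed and `API_KEY` is unset):

```python
# Sketch: the two failure modes of POST /scrape.
from fastapi.testclient import TestClient

from service.main import app

client = TestClient(app)

# An invalid fetcher_type never reaches the handler: pydantic rejects it.
bad = client.post("/scrape", json={"url": "https://example.com", "fetcher_type": "teleport"})
assert bad.status_code == 422

# A runtime failure (unreachable host) is folded into the response body.
down = client.post("/scrape", json={"url": "http://localhost:1/unreachable"})
assert down.status_code == 200
assert down.json()["error"] is not None
```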

service/scrapers/__init__.py (new file)
@@ -0,0 +1,5 @@

from .fetcher import HttpScraper
from .stealthy import StealthyScraper
from .dynamic import DynamicScraper

__all__ = ["HttpScraper", "StealthyScraper", "DynamicScraper"]

service/scrapers/base.py (new file)
@@ -0,0 +1,68 @@

from __future__ import annotations

import time
from abc import ABC, abstractmethod
from typing import Any

from ..models.request import ScrapeRequest, SelectorDef
from ..models.response import ScrapeResponse


def apply_selectors(page: Any, selectors: list[SelectorDef]) -> dict[str, Any]:
    """Extract data from a Scrapling page object using CSS/XPath selectors."""
    result: dict[str, Any] = {}
    for sel in selectors:
        try:
            if sel.selector_type == "css":
                if sel.multiple:
                    elements = page.css(sel.selector)
                else:
                    elements = [page.css_first(sel.selector)]
            else:
                if sel.multiple:
                    elements = page.xpath(sel.selector)
                else:
                    elements = [page.xpath_first(sel.selector)]

            def extract_value(el: Any) -> str | None:
                if el is None:
                    return None
                if sel.attribute:
                    return el.attrib.get(sel.attribute)
                return el.text

            if sel.multiple:
                result[sel.name] = [extract_value(el) for el in (elements or [])]
            else:
                result[sel.name] = extract_value(elements[0] if elements else None)
        except Exception as exc:
            result[sel.name] = None
            result[f"{sel.name}_error"] = str(exc)
    return result


class BaseScraper(ABC):
    @abstractmethod
    async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
        ...

    def _build_response(
        self,
        req: ScrapeRequest,
        page: Any,
        fetcher_name: str,
        start: float,
    ) -> ScrapeResponse:
        elapsed = (time.perf_counter() - start) * 1000
        html = page.html if req.return_html else None
        data = apply_selectors(page, req.selectors) if req.selectors else {}
        return ScrapeResponse(
            url=req.url,
            status_code=page.status if hasattr(page, "status") else 200,
            html=html,
            data=data,
            fetcher_used=fetcher_name,
            elapsed_ms=round(elapsed, 2),
        )
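
Because `apply_selectors` only touches `css`/`css_first` (or `xpath`/`xpath_first`) and the elements' `.text`/`.attrib`, it can be exercised without a real Scrapling page. A sketch with hypothetical stub objects standing in for that small slice of the page API:

```python
# Sketch: driving apply_selectors with stub objects instead of a Scrapling page.
from types import SimpleNamespace

from service.models.request import SelectorDef
from service.scrapers.base import apply_selectors


class StubPage:
    """Stands in for the slice of the page API that apply_selectors touches."""

    def __init__(self, elements):
        self._elements = elements

    def css(self, selector):
        return self._elements

    def css_first(self, selector):
        return self._elements[0] if self._elements else None


links = [
    SimpleNamespace(text="Home", attrib={"href": "/"}),
    SimpleNamespace(text="Docs", attrib={"href": "/docs"}),
]

data = apply_selectors(
    StubPage(links),
    [
        SelectorDef(name="first_text", selector="a"),
        SelectorDef(name="hrefs", selector="a", attribute="href", multiple=True),
    ],
)
assert data == {"first_text": "Home", "hrefs": ["/", "/docs"]}
```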

service/scrapers/dynamic.py (new file)
@@ -0,0 +1,31 @@

from __future__ import annotations

import time

from scrapling import PlayWrightFetcher

from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper


class DynamicScraper(BaseScraper):
    """Wraps Scrapling's PlayWrightFetcher — full browser via Playwright."""

    async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
        start = time.perf_counter()

        kwargs: dict = {
            "url": req.url,
            "headless": req.headless,
            "timeout": req.timeout,
            "network_idle": req.network_idle,
        }
        if req.wait_selector:
            kwargs["wait_selector"] = req.wait_selector
        if req.proxy:
            kwargs["proxy"] = req.proxy

        fetcher = PlayWrightFetcher(auto_match=False)
        page = await fetcher.async_fetch(**kwargs)
        return self._build_response(req, page, "dynamic", start)

service/scrapers/fetcher.py (new file)
@@ -0,0 +1,30 @@

from __future__ import annotations

import asyncio
import time

from scrapling import Fetcher

from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper


class HttpScraper(BaseScraper):
    """Wraps Scrapling's Fetcher — plain HTTP, fastest option."""

    async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
        start = time.perf_counter()
        fetcher = Fetcher(auto_match=False)

        kwargs: dict = {
            "url": req.url,
            "timeout": req.timeout / 1000,
        }
        if req.headers:
            kwargs["headers"] = req.headers
        if req.proxy:
            kwargs["proxy"] = req.proxy

        page = await asyncio.to_thread(fetcher.get, **kwargs)
        return self._build_response(req, page, "http", start)

service/scrapers/stealthy.py (new file)
@@ -0,0 +1,30 @@

from __future__ import annotations

import asyncio
import time

from scrapling import StealthyFetcher

from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper


class StealthyScraper(BaseScraper):
    """Wraps Scrapling's StealthyFetcher — TLS fingerprint impersonation."""

    async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
        start = time.perf_counter()
        fetcher = StealthyFetcher(auto_match=False)

        kwargs: dict = {
            "url": req.url,
            "timeout": req.timeout / 1000,
        }
        if req.headers:
            kwargs["extra_headers"] = req.headers
        if req.proxy:
            kwargs["proxy"] = req.proxy

        page = await asyncio.to_thread(fetcher.fetch, **kwargs)
        return self._build_response(req, page, "stealth", start)

src/credentials/ScraplingApi.credentials.ts (new file)
@@ -0,0 +1,28 @@

import { ICredentialType, INodeProperties } from 'n8n-workflow';

export class ScraplingApi implements ICredentialType {
  name = 'scraplingApi';
  displayName = 'Scrapling API';
  documentationUrl = 'https://github.com/D4Vinci/Scrapling';

  properties: INodeProperties[] = [
    {
      displayName: 'Service URL',
      name: 'serviceUrl',
      type: 'string',
      default: 'http://localhost:8765',
      placeholder: 'http://localhost:8765',
      description: 'URL of the Scrapling microservice (without trailing slash)',
      required: true,
    },
    {
      displayName: 'API Key',
      name: 'apiKey',
      type: 'string',
      typeOptions: { password: true },
      default: '',
      description: 'Optional API key for authenticating with the Scrapling service',
      required: false,
    },
  ];
}

src/nodes/Scrapling/Scrapling.node.ts (new file)
@@ -0,0 +1,372 @@

import {
  IExecuteFunctions,
  INodeExecutionData,
  INodeType,
  INodeTypeDescription,
  NodeOperationError,
  IDataObject,
} from 'n8n-workflow';

import { scraplingRequest, ScraplingRequestPayload } from './helpers';

export class Scrapling implements INodeType {
  description: INodeTypeDescription = {
    displayName: 'Scrapling',
    name: 'scrapling',
    icon: 'file:scrapling.svg',
    group: ['input'],
    version: 1,
    subtitle: '={{$parameter["operation"] + ": " + $parameter["resource"]}}',
    description: 'Scrape web pages using Scrapling — HTTP, stealth and Playwright fetchers',
    defaults: {
      name: 'Scrapling',
    },
    inputs: ['main'],
    outputs: ['main'],
    credentials: [
      {
        name: 'scraplingApi',
        required: true,
      },
    ],
    properties: [
      // ── Resource ──────────────────────────────────────────────────────
      {
        displayName: 'Resource',
        name: 'resource',
        type: 'options',
        noDataExpression: true,
        options: [
          { name: 'Page', value: 'page' },
          { name: 'Data', value: 'data' },
        ],
        default: 'page',
      },

      // ── Page operations ───────────────────────────────────────────────
      {
        displayName: 'Operation',
        name: 'operation',
        type: 'options',
        noDataExpression: true,
        displayOptions: { show: { resource: ['page'] } },
        options: [
          {
            name: 'Fetch',
            value: 'fetch',
            description: 'Fast HTTP fetch (Fetcher)',
            action: 'Fetch a page via HTTP',
          },
          {
            name: 'Fetch Stealth',
            value: 'fetchStealth',
            description: 'TLS fingerprint impersonation (StealthyFetcher)',
            action: 'Fetch a page with stealth mode',
          },
          {
            name: 'Fetch Dynamic',
            value: 'fetchDynamic',
            description: 'Full browser via Playwright (PlayWrightFetcher)',
            action: 'Fetch a page with a real browser',
          },
        ],
        default: 'fetch',
      },

      // ── Data operations ───────────────────────────────────────────────
      {
        displayName: 'Operation',
        name: 'operation',
        type: 'options',
        noDataExpression: true,
        displayOptions: { show: { resource: ['data'] } },
        options: [
          {
            name: 'Extract',
            value: 'extract',
            description: 'Fetch a page and extract data with CSS/XPath selectors',
            action: 'Extract structured data from a page',
          },
          {
            name: 'Extract Tables',
            value: 'extractTables',
            description: 'Fetch a page and extract all HTML tables as JSON',
            action: 'Extract HTML tables from a page',
          },
        ],
        default: 'extract',
      },

      // ── URL (all operations) ──────────────────────────────────────────
      {
        displayName: 'URL',
        name: 'url',
        type: 'string',
        default: '',
        required: true,
        placeholder: 'https://example.com',
        description: 'URL of the page to scrape',
      },

      // ── Fetcher type (data resource) ──────────────────────────────────
      {
        displayName: 'Fetcher',
        name: 'fetcherType',
        type: 'options',
        displayOptions: { show: { resource: ['data'] } },
        options: [
          { name: 'HTTP (fastest)', value: 'http' },
          { name: 'Stealth (TLS impersonation)', value: 'stealth' },
          { name: 'Dynamic (Playwright browser)', value: 'dynamic' },
        ],
        default: 'http',
        description: 'Which Scrapling fetcher to use for loading the page',
      },

      // ── Selectors (extract operation) ─────────────────────────────────
      {
        displayName: 'Selectors',
        name: 'selectors',
        type: 'fixedCollection',
        typeOptions: { multipleValues: true },
        displayOptions: { show: { operation: ['extract'] } },
        default: {},
        options: [
          {
            name: 'selector',
            displayName: 'Selector',
            values: [
              {
                displayName: 'Field Name',
                name: 'name',
                type: 'string',
                default: '',
                required: true,
                description: 'Name for this field in the output',
              },
              {
                displayName: 'Selector',
                name: 'selector',
                type: 'string',
                default: '',
                required: true,
                placeholder: 'h1.title',
                description: 'CSS selector or XPath expression',
              },
              {
                displayName: 'Type',
                name: 'selectorType',
                type: 'options',
                options: [
                  { name: 'CSS', value: 'css' },
                  { name: 'XPath', value: 'xpath' },
                ],
                default: 'css',
              },
              {
                displayName: 'Attribute',
                name: 'attribute',
                type: 'string',
                default: '',
                placeholder: 'href',
                description: 'HTML attribute to extract. Leave empty to get text content.',
              },
              {
                displayName: 'Return Multiple',
                name: 'multiple',
                type: 'boolean',
                default: false,
                description: 'Whether to return all matching elements as an array',
              },
            ],
          },
        ],
      },

      // ── Return HTML ────────────────────────────────────────────────────
      {
        displayName: 'Return Raw HTML',
        name: 'returnHtml',
        type: 'boolean',
        default: false,
        description: 'Whether to include the raw HTML in the response',
      },

      // ── Timeout ────────────────────────────────────────────────────────
      {
        displayName: 'Timeout (ms)',
        name: 'timeout',
        type: 'number',
        default: 30000,
        typeOptions: { minValue: 1000, maxValue: 120000 },
        description: 'Request timeout in milliseconds',
      },

      // ── Additional options (collapsible) ──────────────────────────────
      {
        displayName: 'Additional Options',
        name: 'additionalOptions',
        type: 'collection',
        placeholder: 'Add Option',
        default: {},
        options: [
          {
            displayName: 'Proxy',
            name: 'proxy',
            type: 'string',
            default: '',
            placeholder: 'http://user:pass@proxy.example.com:8080',
            description: 'Proxy URL to use for the request',
          },
          {
            displayName: 'Extra Headers',
            name: 'headers',
            type: 'json',
            default: '{}',
            description: 'Additional HTTP headers as a JSON object',
          },
          {
            displayName: 'Wait for Selector',
            name: 'waitSelector',
            type: 'string',
            default: '',
            placeholder: '#content',
            description: 'CSS selector to wait for before extracting (dynamic fetcher only)',
            displayOptions: { show: { '/operation': ['fetchDynamic'] } },
          },
          {
            displayName: 'Wait for Network Idle',
            name: 'networkIdle',
            type: 'boolean',
            default: false,
            description: 'Whether to wait for network activity to cease (dynamic fetcher only)',
            displayOptions: { show: { '/operation': ['fetchDynamic'] } },
          },
          {
            displayName: 'Headless Browser',
            name: 'headless',
            type: 'boolean',
            default: true,
            description: 'Whether to run the browser in headless mode (dynamic fetcher only)',
            displayOptions: { show: { '/operation': ['fetchDynamic'] } },
          },
        ],
      },
    ],
  };

  async execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
    const items = this.getInputData();
    const returnData: INodeExecutionData[] = [];

    const credentials = await this.getCredentials('scraplingApi');
    const serviceUrl = (credentials.serviceUrl as string).replace(/\/$/, '');
    const apiKey = (credentials.apiKey as string) || undefined;

    for (let i = 0; i < items.length; i++) {
      const resource = this.getNodeParameter('resource', i) as string;
      const operation = this.getNodeParameter('operation', i) as string;
      const url = this.getNodeParameter('url', i) as string;
      const returnHtml = this.getNodeParameter('returnHtml', i, false) as boolean;
      const timeout = this.getNodeParameter('timeout', i, 30000) as number;
      const additionalOptions = this.getNodeParameter('additionalOptions', i, {}) as IDataObject;

      try {
        const dataFetcherType = resource === 'data'
          ? (this.getNodeParameter('fetcherType', i, 'http') as 'http' | 'stealth' | 'dynamic')
          : undefined;

        const fetcherType = resolveFetcherType(resource, operation, dataFetcherType);

        const payload: ScraplingRequestPayload = {
          url,
          fetcher_type: fetcherType,
          return_html: returnHtml,
          timeout,
        };

        if (additionalOptions.proxy) {
          payload.proxy = additionalOptions.proxy as string;
        }

        if (additionalOptions.headers) {
          const raw = additionalOptions.headers as string;
          try {
            payload.headers = JSON.parse(raw) as Record<string, string>;
          } catch {
            throw new NodeOperationError(this.getNode(), 'Extra Headers must be valid JSON', { itemIndex: i });
          }
        }

        if (fetcherType === 'dynamic') {
          if (additionalOptions.waitSelector) {
            payload.wait_selector = additionalOptions.waitSelector as string;
          }
          payload.network_idle = (additionalOptions.networkIdle as boolean) ?? false;
          payload.headless = (additionalOptions.headless as boolean) ?? true;
        }

        if (operation === 'extract') {
          const rawSelectors = (this.getNodeParameter('selectors', i, { selector: [] }) as IDataObject)
            .selector as IDataObject[];

          if (rawSelectors && rawSelectors.length > 0) {
            payload.selectors = rawSelectors.map((s) => ({
              name: s.name as string,
              selector: s.selector as string,
              selector_type: (s.selectorType as 'css' | 'xpath') ?? 'css',
              attribute: (s.attribute as string) || undefined,
              multiple: (s.multiple as boolean) ?? false,
            }));
          }
        }

        if (operation === 'extractTables') {
          // Inject a built-in selector matching every <table> element;
          // the service returns the extracted values under data.__tables__
          payload.selectors = [
            { name: '__tables__', selector: 'table', selector_type: 'css', multiple: true },
          ];
        }

        const result = await scraplingRequest(this, serviceUrl, apiKey, payload);

        if (result.error) {
          if (this.continueOnFail()) {
            returnData.push({ json: { error: result.error }, pairedItem: { item: i } });
            continue;
          }
          throw new NodeOperationError(this.getNode(), `Scrapling error: ${result.error}`, { itemIndex: i });
        }

        returnData.push({ json: result as unknown as IDataObject, pairedItem: { item: i } });

      } catch (error) {
        if (this.continueOnFail()) {
          returnData.push({
            json: { error: (error as Error).message },
            pairedItem: { item: i },
          });
          continue;
        }
        throw error;
      }
    }

    return [returnData];
  }
}

// ── Helpers ───────────────────────────────────────────────────────────────────

function resolveFetcherType(
  resource: string,
  operation: string,
  dataFetcherType?: 'http' | 'stealth' | 'dynamic',
): 'http' | 'stealth' | 'dynamic' {
  if (resource === 'page') {
    if (operation === 'fetchStealth') return 'stealth';
    if (operation === 'fetchDynamic') return 'dynamic';
    return 'http';
  }
  return dataFetcherType ?? 'http';
}

src/nodes/Scrapling/__tests__/Scrapling.node.test.ts (new file)
@@ -0,0 +1,107 @@

import { Scrapling } from '../Scrapling.node';
import * as helpers from '../helpers';
import { IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';

jest.mock('../helpers');

const mockScraplingRequest = helpers.scraplingRequest as jest.MockedFunction<typeof helpers.scraplingRequest>;

function makeContext(overrides: Partial<Record<string, unknown>> = {}): IExecuteFunctions {
  const params: Record<string, unknown> = {
    resource: 'page',
    operation: 'fetch',
    url: 'https://example.com',
    returnHtml: false,
    timeout: 30000,
    additionalOptions: {},
    ...overrides,
  };

  return {
    getInputData: jest.fn().mockReturnValue([{ json: {} }]),
    getNodeParameter: jest.fn().mockImplementation((name: string) => params[name]),
    getCredentials: jest.fn().mockResolvedValue({
      serviceUrl: 'http://localhost:8765',
      apiKey: '',
    }),
    getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
    continueOnFail: jest.fn().mockReturnValue(false),
  } as unknown as IExecuteFunctions;
}

const successResponse = {
  url: 'https://example.com',
  status_code: 200,
  data: {},
  fetcher_used: 'http',
  elapsed_ms: 50,
};

describe('Scrapling node', () => {
  beforeEach(() => jest.clearAllMocks());

  it('calls scraplingRequest with http fetcher for page:fetch', async () => {
    mockScraplingRequest.mockResolvedValue(successResponse);
    const node = new Scrapling();
    const ctx = makeContext();
    const result = await node.execute.call(ctx);
    expect(mockScraplingRequest).toHaveBeenCalledTimes(1);
    const payload = mockScraplingRequest.mock.calls[0][3];
    expect(payload.fetcher_type).toBe('http');
    expect(payload.url).toBe('https://example.com');
    expect(result[0]).toHaveLength(1);
  });

  it('calls scraplingRequest with stealth fetcher for page:fetchStealth', async () => {
    mockScraplingRequest.mockResolvedValue({ ...successResponse, fetcher_used: 'stealth' });
    const node = new Scrapling();
    const ctx = makeContext({ operation: 'fetchStealth' });
    await node.execute.call(ctx);
    const payload = mockScraplingRequest.mock.calls[0][3];
    expect(payload.fetcher_type).toBe('stealth');
  });

  it('calls scraplingRequest with dynamic fetcher for page:fetchDynamic', async () => {
    mockScraplingRequest.mockResolvedValue({ ...successResponse, fetcher_used: 'dynamic' });
    const node = new Scrapling();
    const ctx = makeContext({ operation: 'fetchDynamic' });
    await node.execute.call(ctx);
    const payload = mockScraplingRequest.mock.calls[0][3];
    expect(payload.fetcher_type).toBe('dynamic');
  });

  it('returns error json when continueOnFail is true and service returns error', async () => {
    mockScraplingRequest.mockResolvedValue({ ...successResponse, error: 'connection refused', status_code: 0 });
    const node = new Scrapling();
    const ctx = makeContext();
    (ctx.continueOnFail as jest.Mock).mockReturnValue(true);
    const result = await node.execute.call(ctx);
    expect((result[0][0].json as Record<string, unknown>).error).toBe('connection refused');
  });

  it('throws when service returns error and continueOnFail is false', async () => {
    mockScraplingRequest.mockResolvedValue({ ...successResponse, error: 'timeout', status_code: 0 });
    const node = new Scrapling();
    const ctx = makeContext();
    await expect(node.execute.call(ctx)).rejects.toThrow('timeout');
  });

  it('passes selectors to payload for data:extract', async () => {
    mockScraplingRequest.mockResolvedValue({ ...successResponse, data: { title: 'Hello' } });
    const node = new Scrapling();
    const ctx = makeContext({
      resource: 'data',
      operation: 'extract',
      fetcherType: 'http',
      selectors: {
        selector: [
          { name: 'title', selector: 'h1', selectorType: 'css', attribute: '', multiple: false },
        ],
      },
    });
    await node.execute.call(ctx);
    const payload = mockScraplingRequest.mock.calls[0][3];
    expect(payload.selectors).toHaveLength(1);
    expect(payload.selectors![0].name).toBe('title');
  });
});
79
src/nodes/Scrapling/__tests__/helpers.test.ts
Normal file
@@ -0,0 +1,79 @@
import { scraplingRequest, ScraplingRequestPayload } from '../helpers';
import { IExecuteFunctions } from 'n8n-workflow';

function makeMockContext(responseBody: unknown): IExecuteFunctions {
  return {
    helpers: {
      request: jest.fn().mockResolvedValue(responseBody),
    },
    getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
  } as unknown as IExecuteFunctions;
}

function makeMockContextThrowing(statusCode: number): IExecuteFunctions {
  const err = Object.assign(new Error(`HTTP ${statusCode}`), { statusCode });
  return {
    helpers: {
      request: jest.fn().mockRejectedValue(err),
    },
    getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
  } as unknown as IExecuteFunctions;
}

const basePayload: ScraplingRequestPayload = {
  url: 'https://example.com',
  fetcher_type: 'http',
};

describe('scraplingRequest', () => {
  it('returns parsed response on success', async () => {
    const mockResponse = {
      url: 'https://example.com',
      status_code: 200,
      data: {},
      fetcher_used: 'http',
      elapsed_ms: 42,
    };
    const ctx = makeMockContext(mockResponse);
    const result = await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
    expect(result.url).toBe('https://example.com');
    expect(result.fetcher_used).toBe('http');
  });

  it('calls correct URL with POST', async () => {
    const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
    await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
    const requestMock = ctx.helpers.request as jest.Mock;
    expect(requestMock).toHaveBeenCalledTimes(1);
    const callArgs = requestMock.mock.calls[0][0];
    expect(callArgs.url).toBe('http://localhost:8765/scrape');
    expect(callArgs.method).toBe('POST');
  });

  it('includes X-API-Key header when apiKey is provided', async () => {
    const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
    await scraplingRequest(ctx, 'http://localhost:8765', 'secret123', basePayload);
    const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
    expect(callArgs.headers['X-API-Key']).toBe('secret123');
  });

  it('does not include X-API-Key header when apiKey is undefined', async () => {
    const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
    await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
    const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
    expect(callArgs.headers['X-API-Key']).toBeUndefined();
  });

  it('throws NodeOperationError on 401', async () => {
    const ctx = makeMockContextThrowing(401);
    await expect(scraplingRequest(ctx, 'http://localhost:8765', 'bad-key', basePayload))
      .rejects.toThrow('401');
  });

  it('strips trailing slash from serviceUrl', async () => {
    const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
    await scraplingRequest(ctx, 'http://localhost:8765/', undefined, basePayload);
    const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
    expect(callArgs.url).toBe('http://localhost:8765/scrape');
  });
});
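The 422 branch of `scraplingRequest` is not exercised by the suite above. A sketch of such a test, reusing the same helpers (illustrative only; not part of the commit):

```ts
it('throws a validation error on 422', async () => {
  // makeMockContextThrowing attaches statusCode to the rejected error,
  // which scraplingRequest maps to its 422 NodeOperationError message.
  const ctx = makeMockContextThrowing(422);
  await expect(scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload))
    .rejects.toThrow('validation error');
});
```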
92
src/nodes/Scrapling/helpers.ts
Normal file
@@ -0,0 +1,92 @@
import {
  IExecuteFunctions,
  IDataObject,
  NodeOperationError,
  JsonObject,
} from 'n8n-workflow';

export interface SelectorDef {
  name: string;
  selector: string;
  selectorType: 'css' | 'xpath';
  attribute?: string;
  multiple?: boolean;
}

export interface ScraplingRequestPayload extends IDataObject {
  url: string;
  fetcher_type: 'http' | 'stealth' | 'dynamic';
  selectors?: Array<{
    name: string;
    selector: string;
    selector_type: 'css' | 'xpath';
    attribute?: string;
    multiple?: boolean;
  }>;
  return_html?: boolean;
  timeout?: number;
  proxy?: string;
  headers?: Record<string, string>;
  wait_selector?: string;
  network_idle?: boolean;
  headless?: boolean;
}

export interface ScraplingResponse extends IDataObject {
  url: string;
  status_code: number;
  html?: string;
  data: IDataObject;
  fetcher_used: string;
  elapsed_ms: number;
  error?: string;
}

/**
 * Make a POST /scrape request to the Scrapling microservice.
 */
export async function scraplingRequest(
  context: IExecuteFunctions,
  serviceUrl: string,
  apiKey: string | undefined,
  payload: ScraplingRequestPayload,
): Promise<ScraplingResponse> {
  const url = `${serviceUrl.replace(/\/$/, '')}/scrape`;

  const headers: Record<string, string> = {
    'Content-Type': 'application/json',
  };
  if (apiKey) {
    headers['X-API-Key'] = apiKey;
  }

  try {
    const response = await context.helpers.request({
      method: 'POST',
      url,
      headers,
      // With `json: true` the request helper serializes the body itself;
      // passing a pre-stringified payload here would double-encode it.
      body: payload,
      json: true,
    });

    return response as ScraplingResponse;
  } catch (error) {
    const err = error as JsonObject;
    const statusCode = (err.statusCode as number) ?? 0;

    if (statusCode === 401) {
      throw new NodeOperationError(
        context.getNode(),
        'Scrapling service returned 401 Unauthorized. Check your API key.',
      );
    }
    if (statusCode === 422) {
      throw new NodeOperationError(
        context.getNode(),
        `Scrapling service validation error: ${JSON.stringify((err.error as JsonObject)?.detail ?? err)}`,
      );
    }

    throw error;
  }
}
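For illustration, a call site inside a node's `execute` could look like the sketch below. Only `scraplingRequest` and the payload shape come from the file above; the credential type name `scraplingApi` is an assumption:

```ts
// Hypothetical call site — sketch only, not part of the commit.
const credentials = await this.getCredentials('scraplingApi'); // assumed credential name
const response = await scraplingRequest(
  this,
  credentials.serviceUrl as string,            // e.g. 'http://localhost:8765'
  (credentials.apiKey as string) || undefined, // X-API-Key header is skipped when empty
  {
    url: 'https://example.com',
    fetcher_type: 'stealth',
    return_html: true,
    timeout: 30000,
  },
);
// response.status_code, response.html, response.elapsed_ms, response.error ...
```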
22
src/nodes/Scrapling/scrapling.svg
Normal file
@@ -0,0 +1,22 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 60 60" fill="none">
  <circle cx="30" cy="30" r="28" fill="#1a1a2e" stroke="#4a90d9" stroke-width="2"/>
  <!-- Spider web lines -->
  <line x1="30" y1="4" x2="30" y2="56" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
  <line x1="4" y1="30" x2="56" y2="30" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
  <line x1="10" y1="10" x2="50" y2="50" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
  <line x1="50" y1="10" x2="10" y2="50" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
  <!-- Concentric circles (web rings) -->
  <circle cx="30" cy="30" r="8" stroke="#4a90d9" stroke-width="0.8" opacity="0.5" fill="none"/>
  <circle cx="30" cy="30" r="16" stroke="#4a90d9" stroke-width="0.8" opacity="0.4" fill="none"/>
  <circle cx="30" cy="30" r="24" stroke="#4a90d9" stroke-width="0.8" opacity="0.3" fill="none"/>
  <!-- Spider body (ellipse takes rx/ry, not r) -->
  <ellipse cx="30" cy="30" rx="5" ry="6" fill="#4a90d9"/>
  <ellipse cx="30" cy="24" rx="3.5" ry="3" fill="#5ba3e8"/>
  <!-- Legs -->
  <path d="M25 28 Q18 24 14 20" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
  <path d="M25 30 Q17 29 13 28" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
  <path d="M25 32 Q18 34 14 38" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
  <path d="M35 28 Q42 24 46 20" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
  <path d="M35 30 Q43 29 47 28" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
  <path d="M35 32 Q42 34 46 38" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
</svg>
19
tsconfig.json
Normal file
@@ -0,0 +1,19 @@
{
  "compilerOptions": {
    "target": "ES2019",
    "module": "commonjs",
    "lib": ["ES2019"],
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "outDir": "dist",
    "rootDir": "src",
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true,
    "resolveJsonModule": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}
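Note that `"include": ["src/**/*"]` also compiles the `__tests__` directories into `dist/`. If the published package should not ship compiled tests, an extra `exclude` entry along these lines would keep them out (a sketch, not verified against this build):

```json
"exclude": ["node_modules", "dist", "src/**/__tests__/**"]
```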