feat: initial commit

commit 862c0d1703
2026-04-18 08:59:04 +02:00
32 changed files with 8492 additions and 0 deletions

.env.test.example Normal file (+8)

@@ -0,0 +1,8 @@
# Scrapling service URL (default: http://localhost:8765)
SCRAPLING_SERVICE_URL=http://localhost:8765
# Optional API key for the Scrapling service
SCRAPLING_API_KEY=
# Test URL used in integration tests
SCRAPLING_TEST_URL=https://httpbin.org/html

.gitignore vendored Normal file (+10)

@@ -0,0 +1,10 @@
node_modules/
dist/
.env.test
*.js.map
*.d.ts.map
coverage/
__pycache__/
*.pyc
.venv/
service/.venv/

README.md Normal file (+399)

@@ -0,0 +1,399 @@
# n8n-nodes-scrapling
Community node for [n8n](https://n8n.io) that integrates the [Scrapling](https://github.com/D4Vinci/Scrapling) library — a fast, adaptive web scraper supporting plain HTTP, a stealth mode (TLS fingerprint impersonation), and a full Playwright browser.

---

## Architecture

```
n8n (Node.js / TypeScript)
  │ HTTP POST /scrape
  ▼
Scrapling Service (Python / FastAPI)
  ├── Fetcher — fast HTTP requests
  ├── StealthyFetcher — TLS impersonation (curl-impersonate)
  └── PlayWrightFetcher — full Chromium browser
```

The n8n node talks to the Scrapling Service over HTTP. The Python service manages the scraper instances and returns structured JSON.
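
A minimal round trip against a locally running service (a sketch using `httpx`, which already appears in the service's dev dependencies; the payload follows the `POST /scrape` schema documented below):

```python
import httpx

# Assumes the service is up on its default port (see Installation below).
payload = {
    "url": "https://example.com",
    "fetcher_type": "http",  # "http" | "stealth" | "dynamic"
    "return_html": False,
}
resp = httpx.post("http://localhost:8765/scrape", json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()
print(body["status_code"], body["fetcher_used"], body["elapsed_ms"])
```
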
---
## Requirements

| Component | Version  |
|-----------|----------|
| n8n       | ≥ 1.0    |
| Node.js   | ≥ 18     |
| Python    | ≥ 3.11   |
| Docker    | optional |

---

## Installation

### Option A: Docker Compose (recommended)

Starts n8n and the Scrapling Service with a single command.
```bash
# Copy and fill in the environment variables
cp .env.test.example .env

# Build and start
docker compose up -d
```
n8n is available at `http://localhost:5678`
Scrapling Service: `http://localhost:8765`

### Option B: Manual installation

**1. Install the n8n node dependencies:**
```bash
npm install
npm run build
```
**2. Install the Python service:**
```bash
cd service
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/scrapling install  # downloads the Playwright browsers
```
**3. Start the service** (from the repository root; `main.py` uses package-relative imports, so it must run as `service.main`):
```bash
cd ..
service/.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765
```

**4. Install the node in n8n:**

Copy the `dist/` directory into n8n's community nodes directory:
```bash
# Default path for an npm-installed n8n
cp -r dist ~/.n8n/nodes/node_modules/n8n-nodes-scrapling
```
Then restart n8n.

---
## Credentials configuration

In n8n, add a new credential of type **Scrapling API**:

| Field       | Description                                      | Default                 |
|-------------|--------------------------------------------------|-------------------------|
| Service URL | URL of the Scrapling Service (no trailing slash) | `http://localhost:8765` |
| API Key     | Optional authorization key (`X-API-Key` header)  | _(empty)_               |

If `API_KEY` is empty on the service side, authorization is disabled.

---
## Using the node

### Resource: Page

Fetches whole pages. Returns the URL, HTTP status, fetch time, and (optionally) the raw HTML.

#### Operations

| Operation     | Fetcher             | Description                                                             |
|---------------|---------------------|-------------------------------------------------------------------------|
| Fetch         | `Fetcher`           | Fast HTTP request. The fastest option for static pages.                 |
| Fetch Stealth | `StealthyFetcher`   | TLS impersonation (curl-impersonate). Bypasses basic anti-bot checks.   |
| Fetch Dynamic | `PlayWrightFetcher` | Full Chromium browser (Playwright). For SPAs and pages that require JS. |

**Example output:**
```json
{
"url": "https://example.com",
"status_code": 200,
"html": "<html>...</html>",
"data": {},
"fetcher_used": "http",
"elapsed_ms": 312.5
}
```
---
### Resource: Data

Extracts structured data from fetched pages.

#### Operations

| Operation      | Description                                                    |
|----------------|----------------------------------------------------------------|
| Extract        | Fetches a page and extracts data using CSS or XPath selectors. |
| Extract Tables | Fetches a page and returns all HTML tables as JSON arrays.     |

#### Selector configuration (Extract)

Each selector defines one field in the output `data` object (each row maps onto the service request as sketched below):

| Field           | Description                                                            |
|-----------------|------------------------------------------------------------------------|
| Field Name      | Key name in the output JSON                                            |
| Selector        | CSS or XPath expression, e.g. `h1.title` or `//h1[@class="title"]`     |
| Type            | `css` or `xpath`                                                       |
| Attribute       | HTML attribute to extract (e.g. `href`, `src`). Empty = text content.  |
| Return Multiple | Return an array of all matching elements                               |
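
Illustratively, a single selector row becomes one entry of the `selectors` array in the service request (field names as defined in `service/models/request.py`):

```python
# UI row: Field Name = "titles", Selector = ".titleline a", Type = css,
# Attribute = (empty -> text content), Return Multiple = true
selector_entry = {
    "name": "titles",
    "selector": ".titleline a",
    "selector_type": "css",
    "attribute": None,  # None means: extract the text content
    "multiple": True,   # return all matches as an array
}
```
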
**Example output for Extract:**
```json
{
"url": "https://news.ycombinator.com",
"status_code": 200,
"data": {
"titles": ["Show HN: ...", "Ask HN: ...", "..."],
"top_link": "https://..."
},
"fetcher_used": "http",
"elapsed_ms": 187.2
}
```
---
### Common options

| Option                | Description                                                  | Default   |
|-----------------------|--------------------------------------------------------------|-----------|
| Return Raw HTML       | Include the raw HTML in the response                         | `false`   |
| Timeout (ms)          | Maximum request time (1000–120000 ms)                        | `30000`   |
| Proxy                 | Proxy URL, e.g. `http://user:pass@proxy.example.com:8080`    | _(empty)_ |
| Extra Headers         | Extra HTTP headers as JSON, e.g. `{"Accept-Language": "pl"}` | `{}`      |
| Wait for Selector     | CSS selector Playwright waits for before extraction          | _(empty)_ |
| Wait for Network Idle | Wait for network activity to cease (Playwright)              | `false`   |
| Headless Browser      | Run Playwright without a GUI                                 | `true`    |

> The **Wait for Selector**, **Wait for Network Idle**, and **Headless Browser** options only apply to the `dynamic` fetcher.
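
For the `dynamic` fetcher these options end up in the request body like this (a sketch with illustrative values):

```python
payload = {
    "url": "https://spa.example.com",
    "fetcher_type": "dynamic",
    "timeout": 30000,             # milliseconds, validated to 1000–120000
    "wait_selector": "#content",  # Wait for Selector
    "network_idle": True,         # Wait for Network Idle
    "headless": True,             # Headless Browser
}
```
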
---
## CLI Runner
For testing without running n8n:
```bash
# Basic usage
npx ts-node scripts/scrapling-run.ts https://example.com

# Stealth mode
npx ts-node scripts/scrapling-run.ts https://example.com --fetcher stealth

# Data extraction
npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com \
  --selector "titles:.titleline a" \
  --fetcher http \
  --format json

# Playwright: wait for an element
npx ts-node scripts/scrapling-run.ts https://spa.example.com \
  --fetcher dynamic \
  --wait "#content" \
  --html

# Or via the Taskfile
task scrape URL=https://example.com FETCHER=stealth
task scrape:dynamic URL=https://spa.example.com
```
The CLI reads its credentials from `.env.test` or from environment variables:
```bash
SCRAPLING_SERVICE_URL=http://localhost:8765
SCRAPLING_API_KEY=   # optional
```
---
## Taskfile

| Task                   | Description                                        |
|------------------------|----------------------------------------------------|
| `task setup`           | npm install + copy `.env.test.example`             |
| `task build`           | Compile TypeScript → `dist/`                       |
| `task dev`             | Watch mode                                         |
| `task test`            | Run the unit tests                                 |
| `task test:coverage`   | Tests with coverage report                         |
| `task lint`            | ESLint                                             |
| `task check`           | Lint + test (pre-push)                             |
| `task service:install` | Install the Python service into `.venv`            |
| `task service:start`   | Start the service on port 8765                     |
| `task service:health`  | Check the health endpoint                          |
| `task scrape`          | CLI scraper `[URL=...] [FETCHER=...] [FORMAT=...]` |
| `task scrape:stealth`  | Scrape in stealth mode                             |
| `task scrape:dynamic`  | Scrape with the browser                            |
| `task docker:up`       | Start Docker Compose (n8n + service)               |
| `task docker:down`     | Stop Docker Compose                                |
| `task docker:logs`     | Tail the logs                                      |

---
## Scrapling Service API

### `POST /scrape`

Fetches a page and optionally extracts data.

**Request body:**
```json
{
"url": "https://example.com",
"fetcher_type": "http",
"selectors": [
{
"name": "title",
"selector": "h1",
"selector_type": "css",
"attribute": null,
"multiple": false
}
],
"return_html": false,
"timeout": 30000,
"proxy": null,
"headers": {},
"wait_selector": null,
"network_idle": false,
"headless": true
}
```
**Response:**
```json
{
"url": "https://example.com",
"status_code": 200,
"html": null,
"data": {
"title": "Example Domain"
},
"fetcher_used": "http",
"elapsed_ms": 245.3,
"error": null
}
```
On error, the service returns HTTP 200 with the `error` field populated (instead of raising an exception) — this lets n8n handle the failure via the **Continue On Fail** option.
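
A client therefore inspects the `error` field rather than relying on the HTTP status (a sketch using `httpx`):

```python
import httpx

resp = httpx.post(
    "http://localhost:8765/scrape",
    json={"url": "https://example.com", "fetcher_type": "http"},
    timeout=60,
)
body = resp.json()
if body.get("error"):
    # mirrors what the n8n node does when Continue On Fail is enabled
    print("scrape failed:", body["error"])
else:
    print("ok:", body["status_code"])
```
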
### `GET /health`
```json
{
"status": "ok",
"version": "0.1.0",
"dynamic_session_ready": true
}
```
Authorization: if the `API_KEY` variable is set on the service, every request to `/scrape` requires the `X-API-Key` header.
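
With `API_KEY` set, a request looks like this (a sketch; `secret123` is a placeholder for your key):

```python
import httpx

resp = httpx.post(
    "http://localhost:8765/scrape",
    json={"url": "https://example.com", "fetcher_type": "http"},
    headers={"X-API-Key": "secret123"},  # must match the service's API_KEY
    timeout=60,
)
# A wrong or missing key is rejected with HTTP 401 by verify_api_key in service/main.py.
```
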
---
## Environment variables

### Scrapling Service

| Variable  | Description                      | Default              |
|-----------|----------------------------------|----------------------|
| `API_KEY` | API key for authorizing requests | _(empty — disabled)_ |

### Docker Compose

| Variable            | Description                      | Default    |
|---------------------|----------------------------------|------------|
| `SCRAPLING_API_KEY` | Service API key (passed through) | _(empty)_  |
| `N8N_USER`          | n8n basic-auth login             | `admin`    |
| `N8N_PASSWORD`      | n8n basic-auth password          | `changeme` |

---
## Project structure
```
scrapling_n8n/
├── src/
│ ├── credentials/
│ │ └── ScraplingApi.credentials.ts # Credentials: serviceUrl + apiKey
│ └── nodes/Scrapling/
│       ├── Scrapling.node.ts           # Main n8n node
│       ├── helpers.ts                  # HTTP client + types
│       ├── scrapling.svg               # Node icon
│       └── __tests__/
│           ├── helpers.test.ts         # Helper tests
│           └── Scrapling.node.test.ts  # Node tests (with mocks)
├── service/                            # Python FastAPI microservice
│   ├── main.py                         # Entry point + authorization
│   ├── routers/
│   │   ├── scrape.py                   # POST /scrape
│   │   └── health.py                   # GET /health
│   ├── scrapers/
│   │   ├── base.py                     # Abstraction + apply_selectors()
│   │   ├── fetcher.py                  # Fetcher wrapper
│   │   ├── stealthy.py                 # StealthyFetcher wrapper
│   │   └── dynamic.py                  # PlayWrightFetcher wrapper
│   ├── models/
│   │   ├── request.py                  # Pydantic ScrapeRequest
│   │   └── response.py                 # Pydantic ScrapeResponse
│   ├── pyproject.toml                  # Python dependencies
│   └── Dockerfile
├── scripts/
│   └── scrapling-run.ts                # CLI runner (no n8n required)
├── dist/                               # Compiled JS (after npm run build)
├── docker-compose.yml
├── package.json
├── tsconfig.json
├── jest.config.js
├── Taskfile.yml
└── .env.test.example
```
---
## Development

```bash
# Install
task setup

# Watch mode (TypeScript)
task dev

# Tests
task test
task test:coverage

# Lint
task lint

# Run the service locally
task service:install
task service:start

# Quick end-to-end test
task scrape URL=https://httpbin.org/html
```
---
## License
MIT

Taskfile.yml Normal file (+138)

@@ -0,0 +1,138 @@
version: '3'
dotenv: ['.env.test']
vars:
URL: 'https://example.com'
FETCHER: 'http'
FORMAT: 'pretty'
SERVICE_URL: 'http://localhost:8765'
tasks:
# ── Build ──────────────────────────────────────────────────────────────────
build:
desc: Compile TypeScript to dist/
cmds:
- npm run build
dev:
desc: Watch mode — recompile on change
cmds:
- npm run dev
# ── Code quality ───────────────────────────────────────────────────────────
lint:
desc: Run ESLint
cmds:
- npm run lint
format:
desc: Format source with Prettier
cmds:
- npm run format
# ── Tests ──────────────────────────────────────────────────────────────────
test:
desc: Run all unit tests
cmds:
- npm test
test:watch:
desc: Run tests in watch mode
cmds:
- npm run test:watch
test:coverage:
desc: Run tests with coverage report
cmds:
- npm run test:coverage
# ── Python service ─────────────────────────────────────────────────────────
service:install:
desc: Install Python service dependencies (creates .venv in service/)
dir: service
cmds:
- python3 -m venv .venv
- .venv/bin/pip install -e ".[dev]"
- .venv/bin/scrapling install
  service:start:
    desc: Start Scrapling microservice on port 8765
    cmds:
      # main.py uses package-relative imports, so run it as service.main from the repo root
      - service/.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765 --reload
service:health:
desc: Check Scrapling service health
cmds:
- curl -s {{.SERVICE_URL}}/health | python3 -m json.tool
# ── Scrapling CLI runner ────────────────────────────────────────────────────
scrape:
desc: "Scrape a URL [URL=https://example.com] [FETCHER=http|stealth|dynamic] [FORMAT=pretty|json]"
cmds:
- >
npx ts-node scripts/scrapling-run.ts {{.URL}}
--fetcher {{.FETCHER}}
--format {{.FORMAT}}
scrape:html:
desc: "Scrape and return raw HTML [URL=https://example.com]"
cmds:
- npx ts-node scripts/scrapling-run.ts {{.URL}} --html --format json
scrape:stealth:
desc: "Scrape with stealth fetcher [URL=https://example.com]"
cmds:
- npx ts-node scripts/scrapling-run.ts {{.URL}} --fetcher stealth --format {{.FORMAT}}
scrape:dynamic:
desc: "Scrape with Playwright browser [URL=https://example.com]"
cmds:
- npx ts-node scripts/scrapling-run.ts {{.URL}} --fetcher dynamic --format {{.FORMAT}}
# ── Docker ─────────────────────────────────────────────────────────────────
docker:build:
desc: Build Docker image for Scrapling service
cmds:
- docker build -t scrapling-service ./service
docker:up:
desc: Start full stack (n8n + scrapling-service) via Docker Compose
cmds:
- docker compose up -d
docker:down:
desc: Stop Docker Compose stack
cmds:
- docker compose down
docker:logs:
desc: Tail Docker Compose logs
cmds:
- docker compose logs -f
# ── Composite ─────────────────────────────────────────────────────────────
check:
desc: Lint + test (pre-push safety check)
cmds:
- task: lint
- task: test
setup:
desc: Install all dependencies and copy .env.test.example if missing
cmds:
- npm install
- |
if [ ! -f .env.test ]; then
cp .env.test.example .env.test
echo ".env.test created — fill in your credentials"
fi

docker-compose.yml Normal file (+35)

@@ -0,0 +1,35 @@
services:
scrapling-service:
build:
context: .
dockerfile: service/Dockerfile
ports:
- "8765:8765"
environment:
API_KEY: ${SCRAPLING_API_KEY:-}
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8765/health"]
interval: 30s
timeout: 10s
retries: 3
n8n:
image: n8nio/n8n:latest
ports:
- "5678:5678"
environment:
N8N_BASIC_AUTH_ACTIVE: "true"
N8N_BASIC_AUTH_USER: ${N8N_USER:-admin}
N8N_BASIC_AUTH_PASSWORD: ${N8N_PASSWORD:-changeme}
NODE_ENV: production
volumes:
- n8n_data:/home/node/.n8n
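      # Mounts the locally built node (run `task build` first) into n8n's
      # community-nodes directory.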
- ./dist:/home/node/.n8n/nodes/node_modules/n8n-nodes-scrapling
depends_on:
scrapling-service:
condition: service_healthy
restart: unless-stopped
volumes:
n8n_data:

index.ts Normal file (+2)

@@ -0,0 +1,2 @@
export { Scrapling } from './src/nodes/Scrapling/Scrapling.node';
export { ScraplingApi } from './src/credentials/ScraplingApi.credentials';

jest.config.js Normal file (+12)

@@ -0,0 +1,12 @@
/** @type {import('jest').Config} */
module.exports = {
preset: 'ts-jest',
testEnvironment: 'node',
roots: ['<rootDir>/src'],
testMatch: ['**/__tests__/**/*.test.ts'],
collectCoverageFrom: [
'src/**/*.ts',
'!src/**/*.d.ts',
],
moduleFileExtensions: ['ts', 'js', 'json'],
};

package-lock.json generated Normal file (+6543)

File diff suppressed because it is too large

package.json Normal file (+57)

@@ -0,0 +1,57 @@
{
"name": "n8n-nodes-scrapling",
"version": "0.1.0",
"description": "n8n community node for Scrapling — fast, adaptive web scraping with HTTP, stealth and dynamic (Playwright) fetchers",
"keywords": [
"n8n-community-node-package",
"scrapling",
"scraping",
"playwright",
"web-scraping"
],
"license": "MIT",
"homepage": "https://github.com/paramah/n8n-nodes-scrapling",
"author": {
"name": "paramah"
},
"main": "index.js",
"scripts": {
"build": "tsc && npm run copy-icons",
"copy-icons": "copyfiles -u 2 'src/nodes/**/*.svg' dist/nodes/",
"dev": "tsc --watch",
"format": "prettier src --write",
"lint": "eslint src --ext .ts",
"prepublishOnly": "npm run build && npm run lint",
"test": "jest",
"test:watch": "jest --watch",
"test:coverage": "jest --coverage",
"scrape": "ts-node scripts/scrapling-run.ts"
},
"files": [
"dist"
],
"n8n": {
"n8nNodesApiVersion": 1,
"credentials": [
"dist/credentials/ScraplingApi.credentials.js"
],
"nodes": [
"dist/nodes/Scrapling/Scrapling.node.js"
]
},
"devDependencies": {
"@types/jest": "^29.5.14",
"@typescript-eslint/parser": "^6.0.0",
"copyfiles": "^2.4.1",
"dotenv": "^17.4.2",
"jest": "^29.7.0",
"n8n-workflow": "*",
"prettier": "^3.0.0",
"ts-jest": "^29.4.9",
"ts-node": "^10.9.2",
"typescript": "^5.0.0"
},
"peerDependencies": {
"n8n-workflow": "*"
}
}

scripts/scrapling-run.ts Normal file (+213)

@@ -0,0 +1,213 @@
#!/usr/bin/env ts-node
/**
* Standalone Scrapling CLI runner — no n8n needed.
*
* Usage:
* npx ts-node scripts/scrapling-run.ts <url> [options]
*
* Options:
* --fetcher http|stealth|dynamic (default: http)
* --selector <name>:<css-selector> (can be repeated)
* --html include raw HTML in output
* --wait <css-selector> wait for selector (dynamic only)
* --timeout <ms> (default: 30000)
* --format pretty|json (default: pretty)
* --service <url> service URL override
*
* Credentials (from .env.test or environment):
* SCRAPLING_SERVICE_URL default: http://localhost:8765
* SCRAPLING_API_KEY optional
*
* Examples:
* npx ts-node scripts/scrapling-run.ts https://example.com
* npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com --fetcher stealth --selector "title:title"
* npx ts-node scripts/scrapling-run.ts https://spa.example.com --fetcher dynamic --wait "#app"
*/
import * as https from 'https';
import * as http from 'http';
import * as path from 'path';
import * as fs from 'fs';
// ── Load .env.test if present ─────────────────────────────────────────────────
const envFile = path.resolve(__dirname, '..', '.env.test');
if (fs.existsSync(envFile)) {
const lines = fs.readFileSync(envFile, 'utf-8').split('\n');
for (const line of lines) {
const trimmed = line.trim();
if (!trimmed || trimmed.startsWith('#')) continue;
const eq = trimmed.indexOf('=');
if (eq === -1) continue;
const key = trimmed.slice(0, eq).trim();
const value = trimmed.slice(eq + 1).trim().replace(/^["']|["']$/g, '');
if (!process.env[key]) process.env[key] = value;
}
}
// ── Parse CLI args ────────────────────────────────────────────────────────────
const args = process.argv.slice(2);
const url = args.find((a) => !a.startsWith('--'));
if (!url) {
console.error('Usage: scrapling-run.ts <url> [--fetcher http|stealth|dynamic] [--selector name:selector] ...');
process.exit(1);
}
function getArg(flag: string): string | undefined {
const idx = args.indexOf(flag);
return idx !== -1 ? args[idx + 1] : undefined;
}
function getFlag(flag: string): boolean {
return args.includes(flag);
}
function getArgs(flag: string): string[] {
const result: string[] = [];
for (let i = 0; i < args.length; i++) {
if (args[i] === flag && args[i + 1]) {
result.push(args[i + 1]);
i++;
}
}
return result;
}
const fetcherType = (getArg('--fetcher') ?? 'http') as 'http' | 'stealth' | 'dynamic';
const returnHtml = getFlag('--html');
const waitSelector = getArg('--wait');
const timeout = parseInt(getArg('--timeout') ?? '30000', 10);
const outputFormat = (getArg('--format') ?? 'pretty') as 'pretty' | 'json';
const serviceUrl = (getArg('--service') ?? process.env.SCRAPLING_SERVICE_URL ?? 'http://localhost:8765').replace(/\/$/, '');
const apiKey = process.env.SCRAPLING_API_KEY ?? '';
const rawSelectors = getArgs('--selector');
const selectors = rawSelectors.map((raw) => {
const colonIdx = raw.indexOf(':');
if (colonIdx === -1) {
console.error(`Invalid selector format: "${raw}". Expected "name:selector"`);
process.exit(1);
}
return {
name: raw.slice(0, colonIdx),
selector: raw.slice(colonIdx + 1),
selector_type: 'css' as const,
multiple: false,
};
});
// ── Minimal HTTP client ───────────────────────────────────────────────────────
function postJson(reqUrl: string, body: unknown, headers: Record<string, string>): Promise<unknown> {
return new Promise((resolve, reject) => {
const bodyStr = JSON.stringify(body);
const parsed = new URL(reqUrl);
const isHttps = parsed.protocol === 'https:';
const transport = isHttps ? https : http;
const req = transport.request(
{
hostname: parsed.hostname,
port: parsed.port || (isHttps ? 443 : 80),
path: parsed.pathname + parsed.search,
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Content-Length': Buffer.byteLength(bodyStr).toString(),
...headers,
},
},
(res) => {
const chunks: Buffer[] = [];
res.on('data', (c: Buffer) => chunks.push(c));
res.on('end', () => {
const text = Buffer.concat(chunks).toString('utf-8');
if (res.statusCode && res.statusCode >= 400) {
reject(Object.assign(new Error(`HTTP ${res.statusCode}: ${text}`), { statusCode: res.statusCode }));
} else {
try {
resolve(JSON.parse(text));
} catch {
resolve(text);
}
}
});
},
);
req.on('error', reject);
req.write(bodyStr);
req.end();
});
}
// ── Main ──────────────────────────────────────────────────────────────────────
interface ScrapeResponse {
url: string;
status_code: number;
html?: string;
data: Record<string, unknown>;
fetcher_used: string;
elapsed_ms: number;
error?: string;
}
async function main(): Promise<void> {
console.log(`Service: ${serviceUrl}`);
console.log(`URL: ${url}`);
console.log(`Fetcher: ${fetcherType}`);
if (selectors.length) console.log(`Selectors: ${selectors.map((s) => `${s.name}:${s.selector}`).join(', ')}`);
console.log();
const payload: Record<string, unknown> = {
url,
fetcher_type: fetcherType,
return_html: returnHtml,
timeout,
selectors,
};
if (waitSelector) payload.wait_selector = waitSelector;
const requestHeaders: Record<string, string> = {};
if (apiKey) requestHeaders['X-API-Key'] = apiKey;
const response = (await postJson(`${serviceUrl}/scrape`, payload, requestHeaders)) as ScrapeResponse;
if (outputFormat === 'json') {
console.log(JSON.stringify(response, null, 2));
return;
}
// Pretty output
console.log(`Status: ${response.status_code}`);
console.log(`Fetcher: ${response.fetcher_used}`);
console.log(`Elapsed: ${response.elapsed_ms}ms`);
if (response.error) {
console.error(`\nError: ${response.error}`);
return;
}
if (Object.keys(response.data).length > 0) {
console.log('\nExtracted data:');
console.log('─'.repeat(50));
for (const [key, val] of Object.entries(response.data)) {
const display = Array.isArray(val) ? `[${(val as unknown[]).length} items]` : String(val);
console.log(` ${key.padEnd(25)} ${display}`);
}
}
if (response.html) {
console.log('\nHTML preview (first 500 chars):');
console.log('─'.repeat(50));
console.log(response.html.slice(0, 500));
}
}
main().catch((err) => {
console.error('\nError:', (err as Error).message ?? err);
process.exit(1);
});

service/Dockerfile Normal file (+18)

@@ -0,0 +1,18 @@
FROM python:3.12-slim
WORKDIR /app
# System deps for Playwright
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
&& rm -rf /var/lib/apt/lists/*
# The build context is the repository root (see docker-compose.yml), so the
# path must include the service/ prefix.
COPY service/pyproject.toml .
RUN pip install --no-cache-dir -e . \
&& scrapling install
COPY . .
EXPOSE 8765
CMD ["uvicorn", "service.main:app", "--host", "0.0.0.0", "--port", "8765"]

service/__init__.py Normal file (+0)

service/main.py Normal file (+29)

@@ -0,0 +1,29 @@
from __future__ import annotations
import os
from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security.api_key import APIKeyHeader
from .routers import health_router, scrape_router
API_KEY = os.getenv("API_KEY", "")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)
async def verify_api_key(key: str | None = Security(api_key_header)) -> None:
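    # An empty or unset API_KEY disables authentication entirely.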
if API_KEY and key != API_KEY:
raise HTTPException(status_code=401, detail="Invalid or missing API key")
app = FastAPI(
title="Scrapling Service",
description="HTTP microservice exposing Scrapling web-scraping fetcherss to n8n",
version="0.1.0",
)
app.include_router(health_router)
app.include_router(
scrape_router,
dependencies=[Depends(verify_api_key)],
)

service/models/__init__.py Normal file (+4)

@@ -0,0 +1,4 @@
from .request import ScrapeRequest, SelectorDef
from .response import ScrapeResponse, HealthResponse
__all__ = ["ScrapeRequest", "SelectorDef", "ScrapeResponse", "HealthResponse"]

service/models/request.py Normal file (+41)

@@ -0,0 +1,41 @@
from __future__ import annotations
from typing import Any, Literal
from pydantic import BaseModel, field_validator
class SelectorDef(BaseModel):
name: str
selector: str
selector_type: Literal["css", "xpath"] = "css"
attribute: str | None = None # None = get text content
multiple: bool = False
class ScrapeRequest(BaseModel):
url: str
fetcher_type: Literal["http", "stealth", "dynamic"] = "http"
selectors: list[SelectorDef] = []
return_html: bool = False
timeout: int = 30000
proxy: str | None = None
headers: dict[str, str] = {}
# dynamic-fetcher specific
wait_selector: str | None = None
network_idle: bool = False
headless: bool = True
@field_validator("url")
@classmethod
def url_must_have_scheme(cls, v: str) -> str:
if not v.startswith(("http://", "https://")):
raise ValueError("URL must start with http:// or https://")
return v
@field_validator("timeout")
@classmethod
def timeout_range(cls, v: int) -> int:
if not (1000 <= v <= 120_000):
raise ValueError("timeout must be between 1000 and 120000 ms")
return v

service/models/response.py Normal file (+21)

@@ -0,0 +1,21 @@
from __future__ import annotations
from typing import Any
from pydantic import BaseModel
class ScrapeResponse(BaseModel):
url: str
status_code: int
html: str | None = None
data: dict[str, Any] = {}
fetcher_used: str
elapsed_ms: float
error: str | None = None
class HealthResponse(BaseModel):
status: str
version: str
dynamic_session_ready: bool

service/pyproject.toml Normal file (+25)

@@ -0,0 +1,25 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "scrapling-service"
version = "0.1.0"
description = "FastAPI microservice wrapping Scrapling for n8n integration"
requires-python = ">=3.11"
dependencies = [
"scrapling[fetchers]>=0.2",
"fastapi>=0.115",
"uvicorn[standard]>=0.30",
"pydantic>=2.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0",
"pytest-asyncio>=0.23",
"httpx>=0.27",
]
[tool.pytest.ini_options]
asyncio_mode = "auto"

service/routers/__init__.py Normal file (+4)

@@ -0,0 +1,4 @@
from .scrape import router as scrape_router
from .health import router as health_router
__all__ = ["scrape_router", "health_router"]

service/routers/health.py Normal file (+15)

@@ -0,0 +1,15 @@
from fastapi import APIRouter
from ..models.response import HealthResponse
router = APIRouter()
VERSION = "0.1.0"
@router.get("/health", response_model=HealthResponse)
async def health() -> HealthResponse:
return HealthResponse(
status="ok",
version=VERSION,
dynamic_session_ready=True,
)

service/routers/scrape.py Normal file (+35)

@@ -0,0 +1,35 @@
from __future__ import annotations
from fastapi import APIRouter, HTTPException
from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from ..scrapers import DynamicScraper, HttpScraper, StealthyScraper
router = APIRouter()
@router.post("/scrape", response_model=ScrapeResponse)
async def scrape(req: ScrapeRequest) -> ScrapeResponse:
try:
if req.fetcher_type == "http":
scraper = HttpScraper()
elif req.fetcher_type == "stealth":
scraper = StealthyScraper()
elif req.fetcher_type == "dynamic":
scraper = DynamicScraper()
else:
raise HTTPException(status_code=400, detail=f"Unknown fetcher_type: {req.fetcher_type}")
return await scraper.scrape(req)
except HTTPException:
raise
except Exception as exc:
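        # Failures are reported as a normal 200 response with `error` set,
        # so the n8n node can honor its Continue On Fail option.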
return ScrapeResponse(
url=req.url,
status_code=0,
fetcher_used=req.fetcher_type,
elapsed_ms=0,
error=str(exc),
)

service/scrapers/__init__.py Normal file (+5)

@@ -0,0 +1,5 @@
from .fetcher import HttpScraper
from .stealthy import StealthyScraper
from .dynamic import DynamicScraper
__all__ = ["HttpScraper", "StealthyScraper", "DynamicScraper"]

service/scrapers/base.py Normal file (+68)

@@ -0,0 +1,68 @@
from __future__ import annotations
import time
from abc import ABC, abstractmethod
from typing import Any
from ..models.request import ScrapeRequest, SelectorDef
from ..models.response import ScrapeResponse
def apply_selectors(page: Any, selectors: list[SelectorDef]) -> dict[str, Any]:
"""Extract data from a Scrapling page object using CSS/XPath selectors."""
result: dict[str, Any] = {}
for sel in selectors:
try:
if sel.selector_type == "css":
if sel.multiple:
elements = page.css(sel.selector)
else:
elements = [page.css_first(sel.selector)]
else:
if sel.multiple:
elements = page.xpath(sel.selector)
else:
elements = [page.xpath_first(sel.selector)]
def extract_value(el: Any) -> str | None:
if el is None:
return None
if sel.attribute:
return el.attrib.get(sel.attribute)
return el.text
if sel.multiple:
result[sel.name] = [extract_value(el) for el in (elements or [])]
else:
result[sel.name] = extract_value(elements[0] if elements else None)
except Exception as exc:
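            # Per-selector failure isolation: a bad selector nulls only its own
            # field and records a "<name>_error" message instead of failing the request.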
result[sel.name] = None
result[f"{sel.name}_error"] = str(exc)
return result
class BaseScraper(ABC):
@abstractmethod
async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
...
def _build_response(
self,
req: ScrapeRequest,
page: Any,
fetcher_name: str,
start: float,
) -> ScrapeResponse:
elapsed = (time.perf_counter() - start) * 1000
html = page.html if req.return_html else None
data = apply_selectors(page, req.selectors) if req.selectors else {}
return ScrapeResponse(
url=req.url,
status_code=page.status if hasattr(page, "status") else 200,
html=html,
data=data,
fetcher_used=fetcher_name,
elapsed_ms=round(elapsed, 2),
)

service/scrapers/dynamic.py Normal file (+31)

@@ -0,0 +1,31 @@
from __future__ import annotations
import time
from scrapling import PlayWrightFetcher
from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper
class DynamicScraper(BaseScraper):
"""Wraps Scrapling's PlayWrightFetcher — full browser via Playwright."""
async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
start = time.perf_counter()
kwargs: dict = {
"url": req.url,
"headless": req.headless,
"timeout": req.timeout,
"network_idle": req.network_idle,
}
if req.wait_selector:
kwargs["wait_selector"] = req.wait_selector
if req.proxy:
kwargs["proxy"] = req.proxy
fetcher = PlayWrightFetcher(auto_match=False)
page = await fetcher.async_fetch(**kwargs)
return self._build_response(req, page, "dynamic", start)

service/scrapers/fetcher.py Normal file (+30)

@@ -0,0 +1,30 @@
from __future__ import annotations
import asyncio
import time
from scrapling import Fetcher
from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper
class HttpScraper(BaseScraper):
"""Wraps Scrapling's Fetcher — plain HTTP, fastest option."""
async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
start = time.perf_counter()
fetcher = Fetcher(auto_match=False)
kwargs: dict = {
"url": req.url,
"timeout": req.timeout / 1000,
}
if req.headers:
kwargs["headers"] = req.headers
if req.proxy:
kwargs["proxy"] = req.proxy
page = await asyncio.to_thread(fetcher.get, **kwargs)
return self._build_response(req, page, "http", start)

service/scrapers/stealthy.py Normal file (+30)

@@ -0,0 +1,30 @@
from __future__ import annotations
import asyncio
import time
from scrapling import StealthyFetcher
from ..models.request import ScrapeRequest
from ..models.response import ScrapeResponse
from .base import BaseScraper
class StealthyScraper(BaseScraper):
"""Wraps Scrapling's StealthyFetcher — TLS fingerprint impersonation."""
async def scrape(self, req: ScrapeRequest) -> ScrapeResponse:
start = time.perf_counter()
fetcher = StealthyFetcher(auto_match=False)
kwargs: dict = {
"url": req.url,
"timeout": req.timeout / 1000,
}
if req.headers:
kwargs["extra_headers"] = req.headers
if req.proxy:
kwargs["proxy"] = req.proxy
page = await asyncio.to_thread(fetcher.fetch, **kwargs)
return self._build_response(req, page, "stealth", start)

src/credentials/ScraplingApi.credentials.ts Normal file (+28)

@@ -0,0 +1,28 @@
import { ICredentialType, INodeProperties } from 'n8n-workflow';
export class ScraplingApi implements ICredentialType {
name = 'scraplingApi';
displayName = 'Scrapling API';
documentationUrl = 'https://github.com/D4Vinci/Scrapling';
properties: INodeProperties[] = [
{
displayName: 'Service URL',
name: 'serviceUrl',
type: 'string',
default: 'http://localhost:8765',
placeholder: 'http://localhost:8765',
description: 'URL of the Scrapling microservice (without trailing slash)',
required: true,
},
{
displayName: 'API Key',
name: 'apiKey',
type: 'string',
typeOptions: { password: true },
default: '',
description: 'Optional API key for authenticating with the Scrapling service',
required: false,
},
];
}

src/nodes/Scrapling/Scrapling.node.ts Normal file (+372)

@@ -0,0 +1,372 @@
import {
IExecuteFunctions,
INodeExecutionData,
INodeType,
INodeTypeDescription,
NodeOperationError,
IDataObject,
} from 'n8n-workflow';
import { scraplingRequest, ScraplingRequestPayload } from './helpers';
export class Scrapling implements INodeType {
description: INodeTypeDescription = {
displayName: 'Scrapling',
name: 'scrapling',
icon: 'file:scrapling.svg',
group: ['input'],
version: 1,
subtitle: '={{$parameter["operation"] + ": " + $parameter["resource"]}}',
description: 'Scrape web pages using Scrapling — HTTP, stealth and Playwright fetchers',
defaults: {
name: 'Scrapling',
},
inputs: ['main'],
outputs: ['main'],
credentials: [
{
name: 'scraplingApi',
required: true,
},
],
properties: [
// ── Resource ──────────────────────────────────────────────────────
{
displayName: 'Resource',
name: 'resource',
type: 'options',
noDataExpression: true,
options: [
{ name: 'Page', value: 'page' },
{ name: 'Data', value: 'data' },
],
default: 'page',
},
// ── Page operations ───────────────────────────────────────────────
{
displayName: 'Operation',
name: 'operation',
type: 'options',
noDataExpression: true,
displayOptions: { show: { resource: ['page'] } },
options: [
{
name: 'Fetch',
value: 'fetch',
description: 'Fast HTTP fetch (Fetcher)',
action: 'Fetch a page via HTTP',
},
{
name: 'Fetch Stealth',
value: 'fetchStealth',
description: 'TLS fingerprint impersonation (StealthyFetcher)',
action: 'Fetch a page with stealth mode',
},
{
name: 'Fetch Dynamic',
value: 'fetchDynamic',
description: 'Full browser via Playwright (PlayWrightFetcher)',
action: 'Fetch a page with a real browser',
},
],
default: 'fetch',
},
// ── Data operations ───────────────────────────────────────────────
{
displayName: 'Operation',
name: 'operation',
type: 'options',
noDataExpression: true,
displayOptions: { show: { resource: ['data'] } },
options: [
{
name: 'Extract',
value: 'extract',
description: 'Fetch a page and extract data with CSS/XPath selectors',
action: 'Extract structured data from a page',
},
{
name: 'Extract Tables',
value: 'extractTables',
description: 'Fetch a page and extract all HTML tables as JSON',
action: 'Extract HTML tables from a page',
},
],
default: 'extract',
},
// ── URL (all operations) ──────────────────────────────────────────
{
displayName: 'URL',
name: 'url',
type: 'string',
default: '',
required: true,
placeholder: 'https://example.com',
description: 'URL of the page to scrape',
},
// ── Fetcher type (data resource) ──────────────────────────────────
{
displayName: 'Fetcher',
name: 'fetcherType',
type: 'options',
displayOptions: { show: { resource: ['data'] } },
options: [
{ name: 'HTTP (fastest)', value: 'http' },
{ name: 'Stealth (TLS impersonation)', value: 'stealth' },
{ name: 'Dynamic (Playwright browser)', value: 'dynamic' },
],
default: 'http',
description: 'Which Scrapling fetcher to use for loading the page',
},
// ── Selectors (extract operation) ─────────────────────────────────
{
displayName: 'Selectors',
name: 'selectors',
type: 'fixedCollection',
typeOptions: { multipleValues: true },
displayOptions: { show: { operation: ['extract'] } },
default: {},
options: [
{
name: 'selector',
displayName: 'Selector',
values: [
{
displayName: 'Field Name',
name: 'name',
type: 'string',
default: '',
required: true,
description: 'Name for this field in the output',
},
{
displayName: 'Selector',
name: 'selector',
type: 'string',
default: '',
required: true,
placeholder: 'h1.title',
description: 'CSS selector or XPath expression',
},
{
displayName: 'Type',
name: 'selectorType',
type: 'options',
options: [
{ name: 'CSS', value: 'css' },
{ name: 'XPath', value: 'xpath' },
],
default: 'css',
},
{
displayName: 'Attribute',
name: 'attribute',
type: 'string',
default: '',
placeholder: 'href',
description: 'HTML attribute to extract. Leave empty to get text content.',
},
{
displayName: 'Return Multiple',
name: 'multiple',
type: 'boolean',
default: false,
description: 'Whether to return all matching elements as an array',
},
],
},
],
},
// ── Return HTML ────────────────────────────────────────────────────
{
displayName: 'Return Raw HTML',
name: 'returnHtml',
type: 'boolean',
default: false,
description: 'Whether to include the raw HTML in the response',
},
// ── Timeout ────────────────────────────────────────────────────────
{
displayName: 'Timeout (ms)',
name: 'timeout',
type: 'number',
default: 30000,
typeOptions: { minValue: 1000, maxValue: 120000 },
description: 'Request timeout in milliseconds',
},
// ── Additional options (collapsible) ──────────────────────────────
{
displayName: 'Additional Options',
name: 'additionalOptions',
type: 'collection',
placeholder: 'Add Option',
default: {},
options: [
{
displayName: 'Proxy',
name: 'proxy',
type: 'string',
default: '',
placeholder: 'http://user:pass@proxy.example.com:8080',
description: 'Proxy URL to use for the request',
},
{
displayName: 'Extra Headers',
name: 'headers',
type: 'json',
default: '{}',
description: 'Additional HTTP headers as a JSON object',
},
{
displayName: 'Wait for Selector',
name: 'waitSelector',
type: 'string',
default: '',
placeholder: '#content',
description: 'CSS selector to wait for before extracting (dynamic fetcher only)',
displayOptions: { show: { '/operation': ['fetchDynamic'] } },
},
{
displayName: 'Wait for Network Idle',
name: 'networkIdle',
type: 'boolean',
default: false,
description: 'Whether to wait for network activity to cease (dynamic fetcher only)',
displayOptions: { show: { '/operation': ['fetchDynamic'] } },
},
{
displayName: 'Headless Browser',
name: 'headless',
type: 'boolean',
default: true,
description: 'Whether to run the browser in headless mode (dynamic fetcher only)',
displayOptions: { show: { '/operation': ['fetchDynamic'] } },
},
],
},
],
};
async execute(this: IExecuteFunctions): Promise<INodeExecutionData[][]> {
const items = this.getInputData();
const returnData: INodeExecutionData[] = [];
const credentials = await this.getCredentials('scraplingApi');
const serviceUrl = (credentials.serviceUrl as string).replace(/\/$/, '');
const apiKey = (credentials.apiKey as string) || undefined;
for (let i = 0; i < items.length; i++) {
const resource = this.getNodeParameter('resource', i) as string;
const operation = this.getNodeParameter('operation', i) as string;
const url = this.getNodeParameter('url', i) as string;
const returnHtml = this.getNodeParameter('returnHtml', i, false) as boolean;
const timeout = this.getNodeParameter('timeout', i, 30000) as number;
const additionalOptions = this.getNodeParameter('additionalOptions', i, {}) as IDataObject;
try {
const dataFetcherType = resource === 'data'
? (this.getNodeParameter('fetcherType', i, 'http') as 'http' | 'stealth' | 'dynamic')
: undefined;
const fetcherType = resolveFetcherType(resource, operation, dataFetcherType);
const payload: ScraplingRequestPayload = {
url,
fetcher_type: fetcherType,
return_html: returnHtml,
timeout,
};
if (additionalOptions.proxy) {
payload.proxy = additionalOptions.proxy as string;
}
if (additionalOptions.headers) {
const raw = additionalOptions.headers as string;
try {
payload.headers = JSON.parse(raw) as Record<string, string>;
} catch {
throw new NodeOperationError(this.getNode(), 'Extra Headers must be valid JSON', { itemIndex: i });
}
}
if (fetcherType === 'dynamic') {
if (additionalOptions.waitSelector) {
payload.wait_selector = additionalOptions.waitSelector as string;
}
payload.network_idle = (additionalOptions.networkIdle as boolean) ?? false;
payload.headless = (additionalOptions.headless as boolean) ?? true;
}
if (operation === 'extract') {
const rawSelectors = (this.getNodeParameter('selectors', i, { selector: [] }) as IDataObject)
.selector as IDataObject[];
if (rawSelectors && rawSelectors.length > 0) {
payload.selectors = rawSelectors.map((s) => ({
name: s.name as string,
selector: s.selector as string,
selector_type: (s.selectorType as 'css' | 'xpath') ?? 'css',
attribute: (s.attribute as string) || undefined,
multiple: (s.multiple as boolean) ?? false,
}));
}
}
if (operation === 'extractTables') {
// Inject built-in table selectors — Python side returns tables as data.tables[]
payload.selectors = [
{ name: '__tables__', selector: 'table', selector_type: 'css', multiple: true },
];
}
const result = await scraplingRequest(this, serviceUrl, apiKey, payload);
if (result.error) {
if (this.continueOnFail()) {
returnData.push({ json: { error: result.error }, pairedItem: { item: i } });
continue;
}
throw new NodeOperationError(this.getNode(), `Scrapling error: ${result.error}`, { itemIndex: i });
}
returnData.push({ json: result as unknown as IDataObject, pairedItem: { item: i } });
} catch (error) {
if (this.continueOnFail()) {
returnData.push({
json: { error: (error as Error).message },
pairedItem: { item: i },
});
continue;
}
throw error;
}
}
return [returnData];
}
}
// ── Helpers ───────────────────────────────────────────────────────────────────
function resolveFetcherType(
resource: string,
operation: string,
dataFetcherType?: 'http' | 'stealth' | 'dynamic',
): 'http' | 'stealth' | 'dynamic' {
if (resource === 'page') {
if (operation === 'fetchStealth') return 'stealth';
if (operation === 'fetchDynamic') return 'dynamic';
return 'http';
}
return dataFetcherType ?? 'http';
}

src/nodes/Scrapling/__tests__/Scrapling.node.test.ts Normal file (+107)

@@ -0,0 +1,107 @@
import { Scrapling } from '../Scrapling.node';
import * as helpers from '../helpers';
import { IExecuteFunctions, INodeExecutionData } from 'n8n-workflow';
jest.mock('../helpers');
const mockScraplingRequest = helpers.scraplingRequest as jest.MockedFunction<typeof helpers.scraplingRequest>;
function makeContext(overrides: Partial<Record<string, unknown>> = {}): IExecuteFunctions {
const params: Record<string, unknown> = {
resource: 'page',
operation: 'fetch',
url: 'https://example.com',
returnHtml: false,
timeout: 30000,
additionalOptions: {},
...overrides,
};
return {
getInputData: jest.fn().mockReturnValue([{ json: {} }]),
getNodeParameter: jest.fn().mockImplementation((name: string) => params[name]),
getCredentials: jest.fn().mockResolvedValue({
serviceUrl: 'http://localhost:8765',
apiKey: '',
}),
getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
continueOnFail: jest.fn().mockReturnValue(false),
} as unknown as IExecuteFunctions;
}
const successResponse = {
url: 'https://example.com',
status_code: 200,
data: {},
fetcher_used: 'http',
elapsed_ms: 50,
};
describe('Scrapling node', () => {
beforeEach(() => jest.clearAllMocks());
it('calls scraplingRequest with http fetcher for page:fetch', async () => {
mockScraplingRequest.mockResolvedValue(successResponse);
const node = new Scrapling();
const ctx = makeContext();
const result = await node.execute.call(ctx);
expect(mockScraplingRequest).toHaveBeenCalledTimes(1);
const payload = mockScraplingRequest.mock.calls[0][3];
expect(payload.fetcher_type).toBe('http');
expect(payload.url).toBe('https://example.com');
expect(result[0]).toHaveLength(1);
});
it('calls scraplingRequest with stealth fetcher for page:fetchStealth', async () => {
mockScraplingRequest.mockResolvedValue({ ...successResponse, fetcher_used: 'stealth' });
const node = new Scrapling();
const ctx = makeContext({ operation: 'fetchStealth' });
await node.execute.call(ctx);
const payload = mockScraplingRequest.mock.calls[0][3];
expect(payload.fetcher_type).toBe('stealth');
});
it('calls scraplingRequest with dynamic fetcher for page:fetchDynamic', async () => {
mockScraplingRequest.mockResolvedValue({ ...successResponse, fetcher_used: 'dynamic' });
const node = new Scrapling();
const ctx = makeContext({ operation: 'fetchDynamic' });
await node.execute.call(ctx);
const payload = mockScraplingRequest.mock.calls[0][3];
expect(payload.fetcher_type).toBe('dynamic');
});
it('returns error json when continueOnFail is true and service returns error', async () => {
mockScraplingRequest.mockResolvedValue({ ...successResponse, error: 'connection refused', status_code: 0 });
const node = new Scrapling();
const ctx = makeContext();
(ctx.continueOnFail as jest.Mock).mockReturnValue(true);
const result = await node.execute.call(ctx);
expect((result[0][0].json as Record<string, unknown>).error).toBe('connection refused');
});
it('throws when service returns error and continueOnFail is false', async () => {
mockScraplingRequest.mockResolvedValue({ ...successResponse, error: 'timeout', status_code: 0 });
const node = new Scrapling();
const ctx = makeContext();
await expect(node.execute.call(ctx)).rejects.toThrow('timeout');
});
it('passes selectors to payload for data:extract', async () => {
mockScraplingRequest.mockResolvedValue({ ...successResponse, data: { title: 'Hello' } });
const node = new Scrapling();
const ctx = makeContext({
resource: 'data',
operation: 'extract',
fetcherType: 'http',
selectors: {
selector: [
{ name: 'title', selector: 'h1', selectorType: 'css', attribute: '', multiple: false },
],
},
});
await node.execute.call(ctx);
const payload = mockScraplingRequest.mock.calls[0][3];
expect(payload.selectors).toHaveLength(1);
expect(payload.selectors![0].name).toBe('title');
});
});

src/nodes/Scrapling/__tests__/helpers.test.ts Normal file (+79)

@@ -0,0 +1,79 @@
import { scraplingRequest, ScraplingRequestPayload } from '../helpers';
import { IExecuteFunctions } from 'n8n-workflow';
function makeMockContext(responseBody: unknown): IExecuteFunctions {
return {
helpers: {
request: jest.fn().mockResolvedValue(responseBody),
},
getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
} as unknown as IExecuteFunctions;
}
function makeMockContextThrowing(statusCode: number): IExecuteFunctions {
const err = Object.assign(new Error(`HTTP ${statusCode}`), { statusCode });
return {
helpers: {
request: jest.fn().mockRejectedValue(err),
},
getNode: jest.fn().mockReturnValue({ name: 'Scrapling', type: 'scrapling' }),
} as unknown as IExecuteFunctions;
}
const basePayload: ScraplingRequestPayload = {
url: 'https://example.com',
fetcher_type: 'http',
};
describe('scraplingRequest', () => {
it('returns parsed response on success', async () => {
const mockResponse = {
url: 'https://example.com',
status_code: 200,
data: {},
fetcher_used: 'http',
elapsed_ms: 42,
};
const ctx = makeMockContext(mockResponse);
const result = await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
expect(result.url).toBe('https://example.com');
expect(result.fetcher_used).toBe('http');
});
it('calls correct URL with POST', async () => {
const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
const requestMock = (ctx.helpers.request as jest.Mock);
expect(requestMock).toHaveBeenCalledTimes(1);
const callArgs = requestMock.mock.calls[0][0];
expect(callArgs.url).toBe('http://localhost:8765/scrape');
expect(callArgs.method).toBe('POST');
});
it('includes X-API-Key header when apiKey is provided', async () => {
const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
await scraplingRequest(ctx, 'http://localhost:8765', 'secret123', basePayload);
const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
expect(callArgs.headers['X-API-Key']).toBe('secret123');
});
it('does not include X-API-Key header when apiKey is undefined', async () => {
const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
await scraplingRequest(ctx, 'http://localhost:8765', undefined, basePayload);
const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
expect(callArgs.headers['X-API-Key']).toBeUndefined();
});
it('throws NodeOperationError on 401', async () => {
const ctx = makeMockContextThrowing(401);
await expect(scraplingRequest(ctx, 'http://localhost:8765', 'bad-key', basePayload))
.rejects.toThrow('401');
});
it('strips trailing slash from serviceUrl', async () => {
const ctx = makeMockContext({ url: 'x', status_code: 200, data: {}, fetcher_used: 'http', elapsed_ms: 1 });
await scraplingRequest(ctx, 'http://localhost:8765/', undefined, basePayload);
const callArgs = (ctx.helpers.request as jest.Mock).mock.calls[0][0];
expect(callArgs.url).toBe('http://localhost:8765/scrape');
});
});

src/nodes/Scrapling/helpers.ts Normal file (+92)

@@ -0,0 +1,92 @@
import {
IExecuteFunctions,
IDataObject,
NodeOperationError,
JsonObject,
} from 'n8n-workflow';
export interface SelectorDef {
name: string;
selector: string;
selectorType: 'css' | 'xpath';
attribute?: string;
multiple?: boolean;
}
export interface ScraplingRequestPayload extends IDataObject {
url: string;
fetcher_type: 'http' | 'stealth' | 'dynamic';
selectors?: Array<{
name: string;
selector: string;
selector_type: 'css' | 'xpath';
attribute?: string;
multiple?: boolean;
}>;
return_html?: boolean;
timeout?: number;
proxy?: string;
headers?: Record<string, string>;
wait_selector?: string;
network_idle?: boolean;
headless?: boolean;
}
export interface ScraplingResponse extends IDataObject {
url: string;
status_code: number;
html?: string;
data: IDataObject;
fetcher_used: string;
elapsed_ms: number;
error?: string;
}
/**
* Make a POST /scrape request to the Scrapling microservice.
*/
export async function scraplingRequest(
context: IExecuteFunctions,
serviceUrl: string,
apiKey: string | undefined,
payload: ScraplingRequestPayload,
): Promise<ScraplingResponse> {
const url = `${serviceUrl.replace(/\/$/, '')}/scrape`;
const headers: Record<string, string> = {
'Content-Type': 'application/json',
};
if (apiKey) {
headers['X-API-Key'] = apiKey;
}
try {
const response = await context.helpers.request({
method: 'POST',
url,
headers,
      body: payload, // with `json: true` the helper serializes the object; pre-stringifying would double-encode it
json: true,
});
return response as ScraplingResponse;
} catch (error) {
const err = error as JsonObject;
const statusCode = (err.statusCode as number) ?? 0;
if (statusCode === 401) {
throw new NodeOperationError(
context.getNode(),
'Scrapling service returned 401 Unauthorized. Check your API key.',
);
}
if (statusCode === 422) {
throw new NodeOperationError(
context.getNode(),
`Scrapling service validation error: ${JSON.stringify((err.error as JsonObject)?.detail ?? err)}`,
);
}
throw error;
}
}

src/nodes/Scrapling/scrapling.svg Normal file (+22)

@@ -0,0 +1,22 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 60 60" fill="none">
<circle cx="30" cy="30" r="28" fill="#1a1a2e" stroke="#4a90d9" stroke-width="2"/>
<!-- Spider web lines -->
<line x1="30" y1="4" x2="30" y2="56" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
<line x1="4" y1="30" x2="56" y2="30" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
<line x1="10" y1="10" x2="50" y2="50" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
<line x1="50" y1="10" x2="10" y2="50" stroke="#4a90d9" stroke-width="0.8" opacity="0.5"/>
<!-- Concentric circles (web rings) -->
<circle cx="30" cy="30" r="8" stroke="#4a90d9" stroke-width="0.8" opacity="0.5" fill="none"/>
<circle cx="30" cy="30" r="16" stroke="#4a90d9" stroke-width="0.8" opacity="0.4" fill="none"/>
<circle cx="30" cy="30" r="24" stroke="#4a90d9" stroke-width="0.8" opacity="0.3" fill="none"/>
<!-- Spider body -->
<ellipse cx="30" cy="30" r="5" ry="6" fill="#4a90d9"/>
<ellipse cx="30" cy="24" r="3.5" ry="3" fill="#5ba3e8"/>
<!-- Legs -->
<path d="M25 28 Q18 24 14 20" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
<path d="M25 30 Q17 29 13 28" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
<path d="M25 32 Q18 34 14 38" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
<path d="M35 28 Q42 24 46 20" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
<path d="M35 30 Q43 29 47 28" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
<path d="M35 32 Q42 34 46 38" stroke="#4a90d9" stroke-width="1.2" fill="none" stroke-linecap="round"/>
</svg>


tsconfig.json Normal file (+19)

@@ -0,0 +1,19 @@
{
"compilerOptions": {
"target": "ES2019",
"module": "commonjs",
"lib": ["ES2019"],
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"outDir": "dist",
"rootDir": "src",
"declaration": true,
"declarationMap": true,
"sourceMap": true,
"resolveJsonModule": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}