feat: initial commit

2026-04-18 08:59:04 +02:00
commit 862c0d1703
32 changed files with 8492 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,399 @@
+# n8n-nodes-scrapling
+
+Community node for [n8n](https://n8n.io) that integrates the [Scrapling](https://github.com/D4Vinci/Scrapling) library, a fast, adaptive web scraper with support for plain HTTP, stealth mode (TLS fingerprint impersonation), and a full Playwright browser.
+
+---
+
+## Architecture
+
+```
+n8n (Node.js / TypeScript)
+        │
+        │  HTTP POST /scrape
+        ▼
+Scrapling Service (Python / FastAPI)
+        │
+        ├── Fetcher           - fast HTTP requests
+        ├── StealthyFetcher   - TLS impersonation (curl-impersonate)
+        └── PlayWrightFetcher - full Chromium browser
+```
+
+The n8n node talks to the Scrapling Service over HTTP. The Python service manages the scraper instances and returns structured JSON.
+
+---
+
+## Requirements
+
+| Component       | Version        |
+|-----------------|----------------|
+| n8n             | ≥ 1.0          |
+| Node.js         | ≥ 18           |
+| Python          | ≥ 3.11         |
+| Docker          | optional       |
+
+---
+
+## Installation
+
+### Option A: Docker Compose (recommended)
+
+Starts n8n and the Scrapling Service with a single command.
+
+```bash
+# Copy the example file and fill in the environment variables
+cp .env.test.example .env
+
+# Build and start
+docker compose up -d
+```
+
+- n8n: `http://localhost:5678`
+- Scrapling Service: `http://localhost:8765`
+
+### Option B: Manual installation
+
+**1. Install the n8n node dependencies:**
+
+```bash
+npm install
+npm run build
+```
+
+**2. Install the Python service:**
+
+```bash
+cd service
+python3 -m venv .venv
+.venv/bin/pip install -e ".[dev]"
+.venv/bin/scrapling install   # downloads the Playwright browsers
+```
+
+**3. Start the service:**
+
+```bash
+.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765
+```
+
+**4. Install the node in n8n:**
+
+Copy the `dist/` directory into the n8n community nodes directory:
+
+```bash
+# Default path for n8n installed via npm
+cp -r dist ~/.n8n/nodes/node_modules/n8n-nodes-scrapling
+```
+
+Then restart n8n.
+
+---
+
+## Credentials configuration
+
+In n8n, add a new credential of type **Scrapling API**:
+
+| Field       | Description                                       | Default                 |
+|-------------|---------------------------------------------------|-------------------------|
+| Service URL | URL of the Scrapling Service (no trailing slash)  | `http://localhost:8765` |
+| API Key     | Optional authorization key (`X-API-Key` header)   | _(empty)_               |
+
+If `API_KEY` is empty on the service side, authorization is disabled.
+
+---
+
+## Using the node
+
+### Resource: Page
+
+Fetches whole pages. Returns the URL, HTTP status, fetch time, and (optionally) the raw HTML.
+
+#### Operations
+
+| Operation      | Fetcher            | Description                                             |
+|----------------|--------------------|---------------------------------------------------------|
+| Fetch          | `Fetcher`          | Plain HTTP request. The fastest option for static pages. |
+| Fetch Stealth  | `StealthyFetcher`  | TLS impersonation (curl-impersonate). Bypasses basic anti-bot checks. |
+| Fetch Dynamic  | `PlayWrightFetcher`| Full Chromium browser (Playwright). For SPAs and pages that require JavaScript. |
+
+**Example output:**
+
+```json
+{
+  "url": "https://example.com",
+  "status_code": 200,
+  "html": "<html>...</html>",
+  "data": {},
+  "fetcher_used": "http",
+  "elapsed_ms": 312.5
+}
+```
+
+---
+
+### Resource: Data
+
+Extracts structured data from fetched pages.
+
+#### Operations
+
+| Operation       | Description                                                      |
+|-----------------|------------------------------------------------------------------|
+| Extract         | Fetches the page and extracts data using CSS or XPath selectors. |
+| Extract Tables  | Fetches the page and returns all HTML tables as JSON arrays.     |
+
+#### Selector configuration (Extract)
+
+Each selector defines one field in the output `data` object:
+
+| Field            | Description                                                       |
+|------------------|-------------------------------------------------------------------|
+| Field Name       | Key name in the output JSON                                       |
+| Selector         | CSS or XPath expression, e.g. `h1.title` or `//h1[@class="title"]` |
+| Type             | `css` or `xpath`                                                  |
+| Attribute        | HTML attribute to read (e.g. `href`, `src`). Empty = text content. |
+| Return Multiple  | Return an array of all matching elements                          |
+
+**Example output for Extract:**
+
+```json
+{
+  "url": "https://news.ycombinator.com",
+  "status_code": 200,
+  "data": {
+    "titles": ["Show HN: ...", "Ask HN: ...", "..."],
+    "top_link": "https://..."
+  },
+  "fetcher_used": "http",
+  "elapsed_ms": 187.2
+}
+```
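+
+The selector fields above appear to map onto the `selectors` array of the `POST /scrape` request body. A minimal Python sketch of building such entries, assuming the UI fields correspond one-to-one to the JSON keys shown in the API section; `make_selector` is a hypothetical helper and not part of this repository:
+
+```python
+def make_selector(name, selector, selector_type="css", attribute=None, multiple=False):
+    """Build one entry of the `selectors` array (hypothetical helper)."""
+    if selector_type not in ("css", "xpath"):
+        raise ValueError(f"unsupported selector type: {selector_type!r}")
+    return {
+        "name": name,                    # key in the output `data` object
+        "selector": selector,            # CSS or XPath expression
+        "selector_type": selector_type,
+        "attribute": attribute or None,  # empty means "extract the text"
+        "multiple": multiple,            # True returns every match as a list
+    }
+
+
+# Selectors that would produce the `titles` and `top_link` fields.
+selectors = [
+    make_selector("titles", ".titleline a", multiple=True),
+    make_selector("top_link", ".titleline a", attribute="href"),
+]
+```
+
+The resulting list can be placed under the `selectors` key of the request body.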
+
+---
+
+### Common options
+
+| Option                 | Description                                                      | Default   |
+|------------------------|------------------------------------------------------------------|-----------|
+| Return Raw HTML        | Include the raw HTML in the response                             | `false`   |
+| Timeout (ms)           | Maximum request time (1000–120000 ms)                            | `30000`   |
+| Proxy                  | Proxy URL, e.g. `http://user:pass@proxy.example.com:8080`        | _(empty)_ |
+| Extra Headers          | Additional HTTP headers as JSON, e.g. `{"Accept-Language": "pl"}` | `{}`     |
+| Wait for Selector      | CSS selector that Playwright waits for before extraction         | _(empty)_ |
+| Wait for Network Idle  | Wait until network activity has finished (Playwright)            | `false`   |
+| Headless Browser       | Run Playwright without a GUI                                     | `true`    |
+
+> The **Wait for Selector**, **Wait for Network Idle**, and **Headless Browser** options only take effect with the `dynamic` fetcher.
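+
+A client could drop the browser-only options before sending a request to a non-browser fetcher. The sketch below assumes the request keys `wait_selector`, `network_idle`, and `headless` from the API section and assumes `"dynamic"` is the `fetcher_type` value for Playwright; `drop_unused_options` is a hypothetical helper:
+
+```python
+# Request keys that only matter when a real browser is driven (assumed names).
+DYNAMIC_ONLY = ("wait_selector", "network_idle", "headless")
+
+
+def drop_unused_options(payload):
+    """Return a copy of `payload` without browser-only keys unless
+    `fetcher_type` is "dynamic"."""
+    if payload.get("fetcher_type") == "dynamic":
+        return dict(payload)
+    return {key: value for key, value in payload.items() if key not in DYNAMIC_ONLY}
+```
+
+Whether the service itself ignores or rejects these keys for other fetchers is not stated here, so this filtering is only a defensive convenience.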
+
+---
+
+## CLI Runner
+
+For testing without starting n8n:
+
+```bash
+# Basic usage
+npx ts-node scripts/scrapling-run.ts https://example.com
+
+# Stealth mode
+npx ts-node scripts/scrapling-run.ts https://example.com --fetcher stealth
+
+# Data extraction
+npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com \
+  --selector "titles:.titleline a" \
+  --fetcher http \
+  --format json
+
+# Playwright: wait for an element
+npx ts-node scripts/scrapling-run.ts https://spa.example.com \
+  --fetcher dynamic \
+  --wait "#content" \
+  --html
+
+# Or via the Taskfile
+task scrape URL=https://example.com FETCHER=stealth
+task scrape:dynamic URL=https://spa.example.com
+```
+
+The CLI reads its credentials from the `.env.test` file or from environment variables:
+
+```bash
+SCRAPLING_SERVICE_URL=http://localhost:8765
+SCRAPLING_API_KEY=                          # optional
+```
+
+---
+
+## Taskfile
+
+| Task                | Description                                       |
+|---------------------|---------------------------------------------------|
+| `task setup`        | npm install + copy `.env.test.example`            |
+| `task build`        | Compile TypeScript → `dist/`                      |
+| `task dev`          | Watch mode                                        |
+| `task test`         | Run the unit tests                                |
+| `task test:coverage`| Tests with a coverage report                      |
+| `task lint`         | ESLint                                            |
+| `task check`        | Lint + test (pre-push)                            |
+| `task service:install` | Install the Python service into `.venv`        |
+| `task service:start`   | Start the service on port 8765                 |
+| `task service:health`  | Check the health endpoint                      |
+| `task scrape`       | CLI scraper `[URL=...] [FETCHER=...] [FORMAT=...]`|
+| `task scrape:stealth`  | Scrape in stealth mode                         |
+| `task scrape:dynamic`  | Scrape with a browser                          |
+| `task docker:up`    | Start Docker Compose (n8n + service)              |
+| `task docker:down`  | Stop Docker Compose                               |
+| `task docker:logs`  | Tail the logs                                     |
+
+---
+
+## Scrapling Service API
+
+### `POST /scrape`
+
+Fetches a page and optionally extracts data.
+
+**Request body:**
+
+```json
+{
+  "url": "https://example.com",
+  "fetcher_type": "http",
+  "selectors": [
+    {
+      "name": "title",
+      "selector": "h1",
+      "selector_type": "css",
+      "attribute": null,
+      "multiple": false
+    }
+  ],
+  "return_html": false,
+  "timeout": 30000,
+  "proxy": null,
+  "headers": {},
+  "wait_selector": null,
+  "network_idle": false,
+  "headless": true
+}
+```
+
+**Response:**
+
+```json
+{
+  "url": "https://example.com",
+  "status_code": 200,
+  "html": null,
+  "data": {
+    "title": "Example Domain"
+  },
+  "fetcher_used": "http",
+  "elapsed_ms": 245.3,
+  "error": null
+}
+```
+
+On failure the service returns HTTP 200 with the `error` field populated (instead of raising an exception), which lets n8n handle the error through the **Continue On Fail** option.
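+
+Because a failed scrape still arrives as HTTP 200, a client has to inspect the body rather than the status code. A minimal Python sketch, assuming `error` is either `null` or a human-readable string; `ScrapeError` and `unwrap_scrape_response` are hypothetical names:
+
+```python
+class ScrapeError(RuntimeError):
+    """Raised when the response body carries a non-empty `error` field."""
+
+
+def unwrap_scrape_response(body):
+    """Return the extracted `data` dict, or raise if the service reported an error."""
+    if body.get("error"):
+        raise ScrapeError(body["error"])
+    return body.get("data") or {}
+```
+
+A caller that wants Continue On Fail semantics can catch `ScrapeError` and carry on with the next item.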
+
+### `GET /health`
+
+```json
+{
+  "status": "ok",
+  "version": "0.1.0",
+  "dynamic_session_ready": true
+}
+```
+
+Authorization: if the `API_KEY` variable is set on the service, every request to `/scrape` requires the `X-API-Key` header.
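+
+A standard-library Python sketch of preparing such a request. `build_scrape_request` is a hypothetical helper; it only attaches `X-API-Key` when a key is configured, mirroring the rule that an empty key means authorization is off:
+
+```python
+import json
+import urllib.request
+
+
+def build_scrape_request(service_url, payload, api_key=""):
+    """Prepare (but do not send) a POST /scrape request."""
+    headers = {"Content-Type": "application/json"}
+    if api_key:  # empty key: authorization is disabled, so send no header
+        headers["X-API-Key"] = api_key
+    return urllib.request.Request(
+        service_url.rstrip("/") + "/scrape",  # tolerate a trailing slash
+        data=json.dumps(payload).encode("utf-8"),
+        headers=headers,
+        method="POST",
+    )
+```
+
+Passing the returned object to `urllib.request.urlopen` sends it; the response body can then be decoded with `json.load`.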
+
+---
+
+## Environment variables
+
+### Scrapling Service
+
+| Variable     | Description                                       | Default   |
+|--------------|---------------------------------------------------|-----------|
+| `API_KEY`    | API key used to authorize requests                | _(empty, auth disabled)_ |
+
+### Docker Compose
+
+| Variable             | Description                       | Default     |
+|----------------------|-----------------------------------|-------------|
+| `SCRAPLING_API_KEY`  | Service API key (passed through)  | _(empty)_   |
+| `N8N_USER`           | Login for n8n basic auth          | `admin`     |
+| `N8N_PASSWORD`       | Password for n8n basic auth       | `changeme`  |
+
+---
+
+## Project structure
+
+```
+scrapling_n8n/
+├── src/
+│   ├── credentials/
+│   │   └── ScraplingApi.credentials.ts   # Credentials: serviceUrl + apiKey
+│   └── nodes/Scrapling/
+│       ├── Scrapling.node.ts             # Main n8n node
+│       ├── helpers.ts                    # HTTP client + types
+│       ├── scrapling.svg                 # Node icon
+│       └── __tests__/
+│           ├── helpers.test.ts           # Helper tests
+│           └── Scrapling.node.test.ts    # Node tests (mocked)
+├── service/                              # Python FastAPI microservice
+│   ├── main.py                           # Entry point + authorization
+│   ├── routers/
+│   │   ├── scrape.py                     # POST /scrape
+│   │   └── health.py                     # GET /health
+│   ├── scrapers/
+│   │   ├── base.py                       # Abstraction + apply_selectors()
+│   │   ├── fetcher.py                    # Wrapper Fetcher
+│   │   ├── stealthy.py                   # Wrapper StealthyFetcher
+│   │   └── dynamic.py                    # Wrapper PlayWrightFetcher
+│   ├── models/
+│   │   ├── request.py                    # Pydantic ScrapeRequest
+│   │   └── response.py                   # Pydantic ScrapeResponse
+│   ├── pyproject.toml                    # Python dependencies
+│   └── Dockerfile
+├── scripts/
+│   └── scrapling-run.ts                  # CLI runner (without n8n)
+├── dist/                                 # Compiled JS (after npm run build)
+├── docker-compose.yml
+├── package.json
+├── tsconfig.json
+├── jest.config.js
+├── Taskfile.yml
+└── .env.test.example
+```
+
+---
+
+## Development
+
+```bash
+# Setup
+task setup
+
+# Watch mode (TypeScript)
+task dev
+
+# Tests
+task test
+task test:coverage
+
+# Lint
+task lint
+
+# Run the service locally
+task service:install
+task service:start
+
+# Quick end-to-end test
+task scrape URL=https://httpbin.org/html
+```
+
+---
+
+## License
+
+MIT