# n8n-nodes-scrapling

Community node for [n8n](https://n8n.io) that integrates the [Scrapling](https://github.com/D4Vinci/Scrapling) library, a fast, adaptive web scraper that supports plain HTTP, stealth mode (TLS fingerprint impersonation), and a full Playwright browser.

---

## Architecture

```
n8n (Node.js / TypeScript)
        │
        │  HTTP POST /scrape
        ▼
Scrapling Service (Python / FastAPI)
        │
        ├── Fetcher           - fast HTTP requests
        ├── StealthyFetcher   - TLS impersonation (curl-impersonate)
        └── PlayWrightFetcher - full Chromium browser
```

The n8n node talks to the Scrapling Service over HTTP. The Python service manages the scraper instances and returns structured JSON.

---

## Requirements

| Component | Version  |
|-----------|----------|
| n8n       | ≥ 1.0    |
| Node.js   | ≥ 18     |
| Python    | ≥ 3.11   |
| Docker    | optional |

---

## Installation

### Option A: Docker Compose (recommended)

Starts n8n and the Scrapling Service with a single command.

```bash
# Copy the example file and fill in the environment variables
cp .env.test.example .env

# Build and start
docker compose up -d
```

n8n is available at `http://localhost:5678` and the Scrapling Service at `http://localhost:8765`.

### Option B: Manual installation

**1. Install the n8n node dependencies:**

```bash
npm install
npm run build
```

**2. Install the Python service:**

```bash
cd service
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/scrapling install   # downloads the Playwright browsers
```

**3. Start the service:**

```bash
.venv/bin/uvicorn service.main:app --host 0.0.0.0 --port 8765
```

**4. Install the node in n8n:**

Copy the `dist/` directory into n8n's community nodes directory:

```bash
# Default path for an n8n installed via npm
cp -r dist ~/.n8n/nodes/node_modules/n8n-nodes-scrapling
```

Then restart n8n.

---

## Credentials configuration

In n8n, add a new credential of type **Scrapling API**:

| Field       | Description                                      | Default                 |
|-------------|--------------------------------------------------|-------------------------|
| Service URL | URL of the Scrapling Service (no trailing slash) | `http://localhost:8765` |
| API Key     | Optional authorization key (`X-API-Key` header)  | _(empty)_               |

If `API_KEY` is empty on the service side, authorization is disabled.
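
The two credential fields translate directly into how a client addresses the service. The following is a minimal sketch using only the Python standard library; the function name `build_scrape_request` is hypothetical, and the real node (written in TypeScript) may build its requests differently.

```python
import json
import urllib.request


def build_scrape_request(service_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /scrape request for the service."""
    # The credential expects no trailing slash; strip one defensively anyway.
    base = service_url.rstrip("/")
    headers = {"Content-Type": "application/json"}
    # An empty key means authorization is disabled, so the header is omitted.
    if api_key:
        headers["X-API-Key"] = api_key
    return urllib.request.Request(
        url=f"{base}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without a live service.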

---

## Using the node

### Resource: Page

Fetches whole pages. Returns the URL, HTTP status, fetch time, and (optionally) the raw HTML.

#### Operations

| Operation     | Fetcher             | Description                                                                     |
|---------------|---------------------|---------------------------------------------------------------------------------|
| Fetch         | `Fetcher`           | Plain HTTP request. The fastest option for static pages.                        |
| Fetch Stealth | `StealthyFetcher`   | TLS impersonation (curl-impersonate). Bypasses basic anti-bot protection.       |
| Fetch Dynamic | `PlayWrightFetcher` | Full Chromium browser (Playwright). For SPAs and pages that require JavaScript. |
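
The table above can be read as a simple lookup from operation to the `fetcher_type` value sent to the service. The values `http`, `stealth`, and `dynamic` appear elsewhere in this README; the operation keys and the name `resolve_fetcher` below are hypothetical and chosen only for illustration.

```python
# Illustrative lookup from a node operation to the service's `fetcher_type`.
# The operation keys are assumptions, not the node's real internal identifiers.
OPERATION_TO_FETCHER = {
    "fetch": "http",            # Fetcher
    "fetchStealth": "stealth",  # StealthyFetcher
    "fetchDynamic": "dynamic",  # PlayWrightFetcher
}


def resolve_fetcher(operation: str) -> str:
    """Return the fetcher_type for an operation, failing loudly on unknown input."""
    try:
        return OPERATION_TO_FETCHER[operation]
    except KeyError:
        raise ValueError(f"Unknown operation: {operation!r}") from None
```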

**Example output:**

```json
{
  "url": "https://example.com",
  "status_code": 200,
  "html": "<html>...</html>",
  "data": {},
  "fetcher_used": "http",
  "elapsed_ms": 312.5
}
```

---

### Resource: Data

Extracts structured data from fetched pages.

#### Operations

| Operation      | Description                                                    |
|----------------|----------------------------------------------------------------|
| Extract        | Fetches a page and extracts data using CSS or XPath selectors. |
| Extract Tables | Fetches a page and returns all HTML tables as JSON arrays.     |
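
To make "tables as JSON arrays" concrete, here is a rough standard-library sketch that turns each `<table>` into a list of rows, with each row a list of cell strings. It is not the service's implementation (which builds on Scrapling), the exact output shape of the real operation may differ, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell texts."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: list[list[list[str]]] = []
        self._row: list[str] | None = None
        self._cell: list[str] | None = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html: str) -> list[list[list[str]]]:
    """Return all tables found in the given HTML string."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables
```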

#### Selector configuration (Extract)

Each selector defines one field in the output `data` object:

| Field           | Description                                                          |
|-----------------|----------------------------------------------------------------------|
| Field Name      | Key name in the output JSON                                          |
| Selector        | CSS or XPath expression, e.g. `h1.title` or `//h1[@class="title"]`   |
| Type            | `css` or `xpath`                                                     |
| Attribute       | HTML attribute to read (e.g. `href`, `src`). Empty = element text.   |
| Return Multiple | Return an array of all matching elements                             |

**Example output for Extract:**

```json
{
  "url": "https://news.ycombinator.com",
  "status_code": 200,
  "data": {
    "titles": ["Show HN: ...", "Ask HN: ...", "..."],
    "top_link": "https://..."
  },
  "fetcher_used": "http",
  "elapsed_ms": 187.2
}
```

---

### Common options

| Option                | Description                                                     | Default   |
|-----------------------|-----------------------------------------------------------------|-----------|
| Return Raw HTML       | Include the raw HTML in the response                            | `false`   |
| Timeout (ms)          | Maximum request time (1000–120000 ms)                           | `30000`   |
| Proxy                 | Proxy URL, e.g. `http://user:pass@proxy.example.com:8080`       | _(empty)_ |
| Extra Headers         | Additional HTTP headers as JSON, e.g. `{"Accept-Language": "pl"}` | `{}`    |
| Wait for Selector     | CSS selector that Playwright waits for before extraction        | _(empty)_ |
| Wait for Network Idle | Wait until network activity has finished (Playwright)           | `false`   |
| Headless Browser      | Run Playwright without a GUI                                    | `true`    |

> The **Wait for Selector**, **Wait for Network Idle**, and **Headless Browser** options only take effect with the `dynamic` fetcher.
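
The constraints above (the timeout range and the Playwright-only options) could be enforced before a request is sent. The sketch below shows one way to do it, using the request key names from the API section. Whether the real node rejects, clamps, or silently ignores such values is not stated in this README, so the behavior of the hypothetical `normalize_options` is an assumption.

```python
# Keys that only make sense for the `dynamic` (Playwright) fetcher.
PLAYWRIGHT_ONLY = ("wait_selector", "network_idle", "headless")


def normalize_options(fetcher_type: str, options: dict) -> dict:
    """Validate the timeout and drop Playwright-only keys for other fetchers."""
    result = dict(options)  # never mutate the caller's dict
    timeout = int(result.get("timeout", 30000))
    if not 1000 <= timeout <= 120000:
        raise ValueError("timeout must be between 1000 and 120000 ms")
    result["timeout"] = timeout
    if fetcher_type != "dynamic":
        for key in PLAYWRIGHT_ONLY:
            result.pop(key, None)
    return result
```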

---

## CLI Runner

For testing without starting n8n:

```bash
# Basic usage
npx ts-node scripts/scrapling-run.ts https://example.com

# Stealth mode
npx ts-node scripts/scrapling-run.ts https://example.com --fetcher stealth

# Data extraction
npx ts-node scripts/scrapling-run.ts https://news.ycombinator.com \
  --selector "titles:.titleline a" \
  --fetcher http \
  --format json

# Playwright: wait for an element
npx ts-node scripts/scrapling-run.ts https://spa.example.com \
  --fetcher dynamic \
  --wait "#content" \
  --html

# Or via the Taskfile
task scrape URL=https://example.com FETCHER=stealth
task scrape:dynamic URL=https://spa.example.com
```

The CLI reads its credentials from the `.env.test` file or from environment variables:

```bash
SCRAPLING_SERVICE_URL=http://localhost:8765
SCRAPLING_API_KEY=                          # optional
```
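
The lookup described above can be pictured roughly as follows. This is a Python sketch of the idea only: the actual CLI is a TypeScript script, and both the precedence (environment variables over the file) and the fallback URL are assumptions, not documented behavior. The function names are hypothetical.

```python
import os


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comment lines."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "   # optional".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def resolve_credentials(env_text: str, environ=None) -> tuple[str, str]:
    """Return (service_url, api_key); environment variables win over the file (assumed)."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_text)
    url = (
        environ.get("SCRAPLING_SERVICE_URL")
        or file_values.get("SCRAPLING_SERVICE_URL")
        or "http://localhost:8765"  # assumed fallback, taken from the credential default
    )
    key = environ.get("SCRAPLING_API_KEY") or file_values.get("SCRAPLING_API_KEY") or ""
    return url, key
```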

---

## Taskfile

| Task                   | Description                                           |
|------------------------|-------------------------------------------------------|
| `task setup`           | Install npm dependencies and copy `.env.test.example` |
| `task build`           | Compile TypeScript to `dist/`                         |
| `task dev`             | Watch mode                                            |
| `task test`            | Run the unit tests                                    |
| `task test:coverage`   | Run the tests with a coverage report                  |
| `task lint`            | ESLint                                                |
| `task check`           | Lint + test (pre-push)                                |
| `task service:install` | Install the Python service into `.venv`               |
| `task service:start`   | Start the service on port 8765                        |
| `task service:health`  | Check the health endpoint                             |
| `task scrape`          | CLI scraper `[URL=...] [FETCHER=...] [FORMAT=...]`    |
| `task scrape:stealth`  | Scrape in stealth mode                                |
| `task scrape:dynamic`  | Scrape with a browser                                 |
| `task docker:up`       | Start Docker Compose (n8n + service)                  |
| `task docker:down`     | Stop Docker Compose                                   |
| `task docker:logs`     | Tail the logs                                         |

---

## Scrapling Service API

### `POST /scrape`

Fetches a page and optionally extracts data.

**Request body:**

```json
{
  "url": "https://example.com",
  "fetcher_type": "http",
  "selectors": [
    {
      "name": "title",
      "selector": "h1",
      "selector_type": "css",
      "attribute": null,
      "multiple": false
    }
  ],
  "return_html": false,
  "timeout": 30000,
  "proxy": null,
  "headers": {},
  "wait_selector": null,
  "network_idle": false,
  "headless": true
}
```

**Response:**

```json
{
  "url": "https://example.com",
  "status_code": 200,
  "html": null,
  "data": {
    "title": "Example Domain"
  },
  "fetcher_used": "http",
  "elapsed_ms": 245.3,
  "error": null
}
```

On failure the service returns HTTP 200 with the `error` field populated (instead of raising an exception), which lets n8n handle the error through the **Continue On Fail** option.
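
Because failures arrive with HTTP 200, a client cannot rely on the status code alone and has to inspect the `error` field. A minimal sketch of that check follows; `ScrapeError`, `parse_scrape_response`, and the `continue_on_fail` flag are hypothetical names that only mimic the idea behind the n8n option.

```python
import json


class ScrapeError(Exception):
    """Raised when the service reports a failure in the `error` field."""


def parse_scrape_response(body: str, continue_on_fail: bool = False) -> dict:
    """Decode a /scrape response and surface service-side errors."""
    result = json.loads(body)
    error = result.get("error")
    if error and not continue_on_fail:
        # HTTP status was 200, but the scrape itself failed.
        raise ScrapeError(error)
    # With continue_on_fail the caller gets the item, including its `error` field.
    return result
```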

### `GET /health`

```json
{
  "status": "ok",
  "version": "0.1.0",
  "dynamic_session_ready": true
}
```

Authorization: if the `API_KEY` variable is set on the service, every request to `/scrape` must include the `X-API-Key` header.

---

## Environment variables

### Scrapling Service

| Variable  | Description                        | Default             |
|-----------|------------------------------------|---------------------|
| `API_KEY` | API key used to authorize requests | _(empty: disabled)_ |

### Docker Compose

| Variable            | Description                      | Default    |
|---------------------|----------------------------------|------------|
| `SCRAPLING_API_KEY` | Service API key (passed through) | _(empty)_  |
| `N8N_USER`          | n8n basic auth username          | `admin`    |
| `N8N_PASSWORD`      | n8n basic auth password          | `changeme` |

---

## Project structure

```
scrapling_n8n/
├── src/
│   ├── credentials/
│   │   └── ScraplingApi.credentials.ts   # Credentials: serviceUrl + apiKey
│   └── nodes/Scrapling/
│       ├── Scrapling.node.ts             # Main n8n node
│       ├── helpers.ts                    # HTTP client + types
│       ├── scrapling.svg                 # Node icon
│       └── __tests__/
│           ├── helpers.test.ts           # Helper tests
│           └── Scrapling.node.test.ts    # Node tests (mocked)
├── service/                              # Python FastAPI microservice
│   ├── main.py                           # Entry point + authorization
│   ├── routers/
│   │   ├── scrape.py                     # POST /scrape
│   │   └── health.py                     # GET /health
│   ├── scrapers/
│   │   ├── base.py                       # Abstraction + apply_selectors()
│   │   ├── fetcher.py                    # Wrapper Fetcher
│   │   ├── stealthy.py                   # Wrapper StealthyFetcher
│   │   └── dynamic.py                    # Wrapper PlayWrightFetcher
│   ├── models/
│   │   ├── request.py                    # Pydantic ScrapeRequest
│   │   └── response.py                   # Pydantic ScrapeResponse
│   ├── pyproject.toml                    # Python dependencies
│   └── Dockerfile
├── scripts/
│   └── scrapling-run.ts                  # CLI runner (without n8n)
├── dist/                                 # Compiled JS (after npm run build)
├── docker-compose.yml
├── package.json
├── tsconfig.json
├── jest.config.js
├── Taskfile.yml
└── .env.test.example
```

---

## Development

```bash
# Install
task setup

# Watch mode (TypeScript)
task dev

# Tests
task test
task test:coverage

# Lint
task lint

# Run the service locally
task service:install
task service:start

# Quick end-to-end test
task scrape URL=https://httpbin.org/html
```

---

## License

MIT