feat: add data-breach-blast-radius skill for pre-breach impact analysis (#1487)

* feat: add data-breach-blast-radius skill for pre-breach impact analysis

* fix: resolve codespell false positives (ZAR currency code, SME abbreviation)

* fix: remove ZAR abbreviation to pass codespell check
This commit is contained in:
Shubham Jiyani
2026-04-27 21:26:20 -07:00
committed by GitHub
parent 8d182ae78d
commit 8ca38ffb9e
8 changed files with 2023 additions and 0 deletions

View File

@@ -0,0 +1,250 @@
# Data Classification Taxonomy
A comprehensive taxonomy for identifying sensitive data in codebases. Every field, column, model property, or variable matching these patterns should be inventoried and assigned the appropriate sensitivity tier.
---
## Tier 1 — Catastrophic (Irreversible harm if exposed)
### Biometric Data
**Detection patterns (field names / column names):**
- `fingerprint`, `thumbprint`, `retina_scan`, `iris_scan`, `face_id`, `facial_recognition`
- `voice_print`, `voice_biometric`, `gait_analysis`, `dna_profile`, `genetic_data`
- `biometric_template`, `biometric_hash`, `faceEmbedding`, `face_vector`
**Detection patterns (data values / format):**
- Base64-encoded blobs > 512 bytes in biometric-named fields
- Binary columns in tables named `biometric_*`, `face_*`, `fingerprint_*`
### Government-Issued Identifiers
**Detection patterns:**
- `ssn`, `social_security_number`, `social_security`, `sin` (Canada), `nino` (UK), `tfn` (Australia)
- `passport_number`, `passport_no`, `passport_id`
- `drivers_license`, `drivers_licence`, `dl_number`, `license_number`
- `national_id`, `national_identification`, `id_number`, `id_card_number`
- `tax_id`, `tin`, `ein`, `itin`, `vat_number`, `fiscal_code`
- `aadhaar`, `pan_number` (India), `cpf`, `cnpj` (Brazil), `rut` (Chile/Colombia)
- `nric`, `fin` (Singapore), `my_kad` (Malaysia), `nik` (Indonesia)
**Regex patterns for values:**
```
SSN: \b\d{3}-\d{2}-\d{4}\b
UK NINO: \b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b
CPF (Brazil): \b\d{3}\.\d{3}\.\d{3}-\d{2}\b
Aadhaar: \b\d{4}\s\d{4}\s\d{4}\b
```
### Health & Medical Data (PHI under HIPAA)
**Detection patterns:**
- `diagnosis`, `icd_code`, `icd10`, `icd11`, `snomed`, `loinc_code`
- `medication`, `prescription`, `drug_name`, `dosage`, `treatment`
- `medical_record_number`, `mrn`, `patient_id`, `encounter_id`
- `lab_result`, `test_result`, `pathology`, `radiology`
- `mental_health`, `psychiatric`, `therapy_notes`, `counseling`
- `hiv_status`, `std_status`, `substance_abuse`, `addiction`
- `insurance_id`, `insurance_member_id`, `health_plan_id`, `claim_number`
- `fhir_resource`, `hl7_message`, `dicom_data`
- `disability`, `handicap`, `chronic_condition`
- `pregnancy`, `reproductive_health`, `fertility`
### Authentication Credentials
**Detection patterns:**
- `password`, `passwd`, `pwd`, `hashed_password`, `password_hash`, `password_digest`
- `private_key`, `secret_key`, `api_key`, `api_secret`, `api_token`
- `access_token`, `refresh_token`, `bearer_token`, `id_token`, `jwt_token`
- `oauth_token`, `oauth_secret`, `oauth_access_token`
- `mfa_secret`, `totp_secret`, `otp_secret`, `backup_codes`
- `session_token`, `session_id`, `auth_token`
- `client_secret`, `client_credential`
- `private_key_pem`, `rsa_private`, `ecdsa_private`
---
## Tier 2 — Critical (High regulatory exposure)
### Payment Card Data (PCI-DSS)
**Detection patterns:**
- `card_number`, `pan`, `primary_account_number`, `credit_card`, `debit_card`
- `cvv`, `cvc`, `cvv2`, `card_verification`, `security_code`
- `card_expiry`, `expiration_date`, `exp_date`, `expiry_month`, `expiry_year`
- `cardholder_name`, `card_holder`
- `iban`, `bic`, `swift_code`, `routing_number`, `account_number`, `sort_code`
- `bank_account`, `bank_details`, `wire_transfer`
**Regex patterns for values:**
```
Visa: \b4[0-9]{12}(?:[0-9]{3})?\b
Mastercard: \b5[1-5][0-9]{14}\b
Amex: \b3[47][0-9]{13}\b
Generic PAN: \b[0-9]{13,19}\b (in a PAN-named field)
CVV: \b[0-9]{3,4}\b (in a cvv-named field)
IBAN: \b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}\b
```
### Identity Combinations (High re-identification risk when combined)
**Combinations that together constitute Tier 2:**
- Full name + date of birth
- Full name + address (street level)
- Email + date of birth + gender
- Phone number + address
**Detection patterns:**
- `full_name`, `first_name` + `last_name` (as separate fields — note both present)
- `date_of_birth`, `dob`, `birth_date`, `birthdate`, `birthday`
- `home_address`, `street_address`, `address_line1`, `postal_address`
- `gender`, `sex`, `pronoun` (when combined with other identifiers)
---
## Tier 3 — High (Regulatory notification triggers)
### Contact Information
**Detection patterns:**
- `email`, `email_address`, `user_email`, `contact_email`, `primary_email`
- `phone`, `phone_number`, `mobile`, `mobile_number`, `cell_phone`, `telephone`
- `whatsapp_number`, `signal_number`
**Regex patterns:**
```
Email: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b
Phone: \+?[0-9\s\-\(\)]{7,20} (in a phone-named field)
```
### Precise Location Data
**Detection patterns:**
- `latitude`, `longitude`, `lat`, `lng`, `lat_lng`, `coordinates`, `geo_point`
- `gps_location`, `precise_location`, `real_time_location`
- `home_location`, `work_location`
**Note:** City-level location is Tier 4; street-level or GPS coordinates are Tier 3.
### Network Identifiers
**Detection patterns:**
- `ip_address`, `ip`, `client_ip`, `remote_addr`, `x_forwarded_for`
- `mac_address`, `device_mac`, `hardware_id`
- `imei`, `imsi`, `device_id`, `advertising_id`, `idfa`, `gaid`
### Authentication Artifacts
**Detection patterns:**
- `session_id`, `cookie_value`, `csrf_token` (if long-lived and user-identifying)
- `remember_me_token`, `persistent_session`
---
## Tier 4 — Elevated (Privacy relevant)
### Partial Personal Identifiers
**Detection patterns:**
- `first_name`, `last_name`, `display_name`, `username` (when alone)
- `profile_picture`, `avatar_url`
- `city`, `state`, `country`, `region`, `zip_code`, `postal_code`
- `time_zone`, `locale`, `language_preference`
### Behavioral & Analytics Data
**Detection patterns:**
- `user_agent`, `browser`, `device_type`, `os`
- `search_query`, `search_history`, `browsing_history`
- `purchase_history`, `order_history`, `transaction_history`
- `click_event`, `page_view`, `session_duration`
- `preferences`, `interests`, `tags`, `segments`
### Financial Context (non-card)
**Detection patterns:**
- `salary`, `income`, `net_worth`, `credit_score`, `credit_rating`
- `account_balance`, `wallet_balance`, `subscription_tier`
---
## Tier 5 — Standard (No direct privacy impact)
- System configuration values (non-secret)
- Public user-facing content (blog posts, public profiles)
- Anonymized aggregated statistics
- Non-personal reference data (product catalog, country codes)
- Internal system identifiers with no external exposure
---
## Detection Guidance for AI Analysis
### Framework-Specific Patterns
**Django / Python:**
```python
# Sensitive fields typically appear in models.py
class User(models.Model):
email = models.EmailField() # Tier 3
date_of_birth = models.DateField() # Tier 2 (combined with name)
ssn = models.CharField(max_length=11) # Tier 1
```
**TypeScript / Prisma:**
```prisma
model User {
email String // Tier 3
phoneNumber String? // Tier 3
dateOfBirth DateTime? // Tier 2 (when combined)
cardNumber String? // Tier 2 PCI-DSS
}
```
**Java / Spring / JPA:**
```java
@Entity
public class Patient {
@Column(name = "diagnosis") // Tier 1 PHI
private String diagnosis;
@Column(name = "ssn") // Tier 1
private String ssn;
}
```
**C# / EF Core:**
```csharp
public class UserProfile {
public string Email { get; set; } // Tier 3
public string PassportNumber { get; set; } // Tier 1
public DateTime DateOfBirth { get; set; } // Tier 2
}
```
### Log Statement Patterns (High Risk — often overlooked)
```python
# BAD — logs PII
logger.info(f"User {user.email} logged in from {request.remote_addr}")
logger.debug(f"Payment for card {card_number}")
# Look for these in logging calls:
# .info(), .debug(), .warn(), .error(), console.log(), System.out.println()
```
### API Response Leakage (Serializer/DTO patterns)
```typescript
// Check if these fields are included in response objects
// even if not requested — over-fetching is a common exposure vector
{
"id": "...",
"email": "...", // Tier 3
"phone": "...", // Tier 3
"dateOfBirth": "...", // Tier 2 — should this be returned?
"passwordHash": "...", // Tier 1 — should NEVER be returned
"ssn": "...", // Tier 1 — should NEVER be returned
}
```
---
## Aggregation Risk Assessment
Combination attacks — data that becomes more sensitive when combined:
| Alone | Combined With | Combined Tier | Risk |
|-------|--------------|---------------|------|
| Email (T3) | Password hash (T1) | T1 | Account takeover |
| Name (T4) | DOB (T2) + Address (T2) | T2 | Full identity reconstruction |
| IP address (T3) | Timestamps + User ID | T2 | Behavioral profiling |
| City (T4) | Purchase history (T4) | T3 | De-anonymization risk |
| Health category (T4) | Name + Email | T1 | HIPAA triggering |
**Rule:** Always assess fields in combination, not just in isolation.