# Data Classification Taxonomy

A taxonomy for identifying sensitive data in codebases. Inventory every field, column, model property, or variable that matches these patterns and assign it a sensitivity tier, from Tier 1 (most sensitive) to Tier 5 (least sensitive).

---

## Tier 1 — Catastrophic (Irreversible harm if exposed)

### Biometric Data
**Detection patterns (field names / column names):**
- `fingerprint`, `thumbprint`, `retina_scan`, `iris_scan`, `face_id`, `facial_recognition`
- `voice_print`, `voice_biometric`, `gait_analysis`, `dna_profile`, `genetic_data`
- `biometric_template`, `biometric_hash`, `faceEmbedding`, `face_vector`

**Detection patterns (data values / format):**
- Base64-encoded blobs > 512 bytes in biometric-named fields
- Binary columns in tables named `biometric_*`, `face_*`, `fingerprint_*`
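
The value check above can be sketched as follows. This is a minimal illustration, not a definitive detector: the name pattern is a hypothetical subset of the field names listed, and it assumes the 512-byte threshold applies to the decoded payload rather than the encoded string.

```python
import base64
import binascii
import re

# Hypothetical subset of the biometric field-name patterns listed above.
BIOMETRIC_NAME = re.compile(
    r"fingerprint|thumbprint|retina|iris|face|voice_print|voice_biometric"
    r"|gait|biometric|dna|genetic",
    re.IGNORECASE,
)
MIN_BLOB_BYTES = 512  # assumption: threshold is measured on the decoded bytes


def looks_like_biometric_blob(field_name: str, value: str) -> bool:
    """Return True if a biometric-named field holds a large Base64 payload."""
    if not BIOMETRIC_NAME.search(field_name):
        return False
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid Base64
    return len(decoded) > MIN_BLOB_BYTES
```

Gating on the field name first keeps large but harmless Base64 values (images, attachments) from being flagged as biometric templates.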

### Government-Issued Identifiers
**Detection patterns:**
- `ssn`, `social_security_number`, `social_security`, `sin` (Canada), `nino` (UK), `tfn` (Australia)
- `passport_number`, `passport_no`, `passport_id`
- `drivers_license`, `drivers_licence`, `dl_number`, `license_number`
- `national_id`, `national_identification`, `id_number`, `id_card_number`
- `tax_id`, `tin`, `ein`, `itin`, `vat_number`, `fiscal_code`
- `aadhaar`, `pan_number` (India), `cpf`, `cnpj` (Brazil), `rut` (Chile/Colombia)
- `nric`, `fin` (Singapore), `my_kad` (Malaysia), `nik` (Indonesia)

**Regex patterns for values:**
```
SSN:          \b\d{3}-\d{2}-\d{4}\b
UK NINO:      \b[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]\d{6}[A-D]\b
CPF (Brazil): \b\d{3}\.\d{3}\.\d{3}-\d{2}\b
Aadhaar:      \b\d{4}\s\d{4}\s\d{4}\b
```
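
One way to apply these value patterns is to scan free text (log lines, fixtures, seed data) and report every hit with a label. The sketch below is illustrative: the labels and the function name `find_government_ids` are hypothetical, and matches are candidates only, since any digits in the right shape will match.

```python
import re

# Patterns based on the list above; the dictionary keys are illustrative labels.
ID_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # The second prefix letter of a NINO is never "O", so it is excluded here.
    "uk_nino": re.compile(r"\b[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]\d{6}[A-D]\b"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "aadhaar": re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b"),
}


def find_government_ids(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in ID_PATTERNS.items():
        hits.extend((label, match.group()) for match in pattern.finditer(text))
    return hits
```

A hit is stronger evidence when it appears in a field whose name also matches the detection patterns for this category.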

### Health & Medical Data (PHI under HIPAA)
**Detection patterns:**
- `diagnosis`, `icd_code`, `icd10`, `icd11`, `snomed`, `loinc_code`
- `medication`, `prescription`, `drug_name`, `dosage`, `treatment`
- `medical_record_number`, `mrn`, `patient_id`, `encounter_id`
- `lab_result`, `test_result`, `pathology`, `radiology`
- `mental_health`, `psychiatric`, `therapy_notes`, `counseling`
- `hiv_status`, `std_status`, `substance_abuse`, `addiction`
- `insurance_id`, `insurance_member_id`, `health_plan_id`, `claim_number`
- `fhir_resource`, `hl7_message`, `dicom_data`
- `disability`, `handicap`, `chronic_condition`
- `pregnancy`, `reproductive_health`, `fertility`

### Authentication Credentials
**Detection patterns:**
- `password`, `passwd`, `pwd`, `hashed_password`, `password_hash`, `password_digest`
- `private_key`, `secret_key`, `api_key`, `api_secret`, `api_token`
- `access_token`, `refresh_token`, `bearer_token`, `id_token`, `jwt_token`
- `oauth_token`, `oauth_secret`, `oauth_access_token`
- `mfa_secret`, `totp_secret`, `otp_secret`, `backup_codes`
- `session_token`, `session_id`, `auth_token`
- `client_secret`, `client_credential`
- `private_key_pem`, `rsa_private`, `ecdsa_private`

---

## Tier 2 — Critical (High regulatory exposure)

### Payment Card Data (PCI-DSS)
**Detection patterns:**
- `card_number`, `pan`, `primary_account_number`, `credit_card`, `debit_card`
- `cvv`, `cvc`, `cvv2`, `card_verification`, `security_code`
- `card_expiry`, `expiration_date`, `exp_date`, `expiry_month`, `expiry_year`
- `cardholder_name`, `card_holder`
- `iban`, `bic`, `swift_code`, `routing_number`, `account_number`, `sort_code`
- `bank_account`, `bank_details`, `wire_transfer`

**Regex patterns for values:**
```
Visa:            \b4[0-9]{12}(?:[0-9]{3})?\b
Mastercard:      \b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b
Amex:            \b3[47][0-9]{13}\b
Generic PAN:     \b[0-9]{13,19}\b (in a PAN-named field)
CVV:             \b[0-9]{3,4}\b (in a cvv-named field)
IBAN:            \b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b
```
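
The generic PAN pattern matches any 13 to 19 digit run, so order numbers and timestamps will also match. A common way to cut false positives is the Luhn checksum that payment card numbers carry. The sketch below assumes this extra filter is acceptable for your scan; the function names are illustrative.

```python
import re

PAN_CANDIDATE = re.compile(r"\b[0-9]{13,19}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_probable_pans(text: str) -> list[str]:
    """Return 13-19 digit runs in text that also pass the Luhn check."""
    return [
        match.group()
        for match in PAN_CANDIDATE.finditer(text)
        if luhn_valid(match.group())
    ]
```

Roughly one in ten random digit strings still passes Luhn, so treat a match outside a PAN-named field as a lead to review rather than a confirmed finding.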

### Identity Combinations (High re-identification risk when combined)
**Combinations that together constitute Tier 2:**
- Full name + date of birth
- Full name + address (street level)
- Email + date of birth + gender
- Phone number + address

**Detection patterns:**
- `full_name`, or `first_name` and `last_name` stored as separate fields (flag when both are present)
- `date_of_birth`, `dob`, `birth_date`, `birthdate`, `birthday`
- `home_address`, `street_address`, `address_line1`, `postal_address`
- `gender`, `sex`, `pronoun` (when combined with other identifiers)
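
The combination rules can be checked mechanically once field names are mapped to categories. The following is a sketch under stated assumptions: `CATEGORY_BY_FIELD` is a hypothetical partial mapping, and the email and phone names are borrowed from the Tier 3 contact patterns.

```python
# Hypothetical partial mapping from concrete field names to categories.
CATEGORY_BY_FIELD = {
    "full_name": "name",
    "date_of_birth": "dob", "dob": "dob", "birth_date": "dob",
    "birthdate": "dob", "birthday": "dob",
    "home_address": "address", "street_address": "address",
    "address_line1": "address", "postal_address": "address",
    "email": "email", "phone": "phone", "phone_number": "phone",
    "gender": "gender", "sex": "gender",
}

# The four Tier 2 combinations listed above.
TIER2_COMBINATIONS = [
    frozenset({"name", "dob"}),
    frozenset({"name", "address"}),
    frozenset({"email", "dob", "gender"}),
    frozenset({"phone", "address"}),
]


def categories_present(field_names):
    """Map the field names of one model or table to identity categories."""
    fields = set(field_names)
    found = {CATEGORY_BY_FIELD[f] for f in fields if f in CATEGORY_BY_FIELD}
    # A split name only counts as a full name when both halves are present.
    if {"first_name", "last_name"} <= fields:
        found.add("name")
    return found


def tier2_combinations(field_names):
    """Return every Tier 2 combination fully covered by the given fields."""
    found = categories_present(field_names)
    return [combo for combo in TIER2_COMBINATIONS if combo <= found]
```

Running this per model or per table is a reasonable starting point; joins across tables would need the field sets merged first.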

---

## Tier 3 — High (Regulatory notification triggers)

### Contact Information
**Detection patterns:**
- `email`, `email_address`, `user_email`, `contact_email`, `primary_email`
- `phone`, `phone_number`, `mobile`, `mobile_number`, `cell_phone`, `telephone`
- `whatsapp_number`, `signal_number`

**Regex patterns:**
```
Email:  \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Phone:  \+?[0-9\s\-\(\)]{7,20}  (in a phone-named field)
```
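
The phone pattern is loose enough to match most digit runs, which is why it is qualified with "in a phone-named field". A minimal sketch of that gating follows; `PHONE_FIELD` and `contact_kind` are hypothetical names, and the email pattern uses `[A-Za-z]` for the top-level domain.

```python
import re
from typing import Optional

EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE = re.compile(r"\+?[0-9\s\-()]{7,20}")
# Hypothetical name gate built from the contact field names listed above.
PHONE_FIELD = re.compile(r"phone|mobile|telephone|whatsapp|signal", re.IGNORECASE)


def contact_kind(field_name: str, value: str) -> Optional[str]:
    """Return 'email', 'phone', or None for a single field/value pair."""
    # Email addresses are distinctive enough to accept in any field.
    if EMAIL.search(value):
        return "email"
    # Only trust the phone pattern when the field name also points to a phone.
    if PHONE_FIELD.search(field_name) and PHONE.fullmatch(value.strip()):
        return "phone"
    return None
```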

### Precise Location Data
**Detection patterns:**
- `latitude`, `longitude`, `lat`, `lng`, `lat_lng`, `coordinates`, `geo_point`
- `gps_location`, `precise_location`, `real_time_location`
- `home_location`, `work_location`

**Note:** City-level location is Tier 4; street-level or GPS coordinates are Tier 3.

### Network Identifiers
**Detection patterns:**
- `ip_address`, `ip`, `client_ip`, `remote_addr`, `x_forwarded_for`
- `mac_address`, `device_mac`, `hardware_id`
- `imei`, `imsi`, `device_id`, `advertising_id`, `idfa`, `gaid`

### Authentication Artifacts
**Detection patterns:**
- `session_id`, `cookie_value`, `csrf_token` (if long-lived and user-identifying; a `session_id` that can still authenticate a request is a Tier 1 credential)
- `remember_me_token`, `persistent_session`

---

## Tier 4 — Elevated (Privacy relevant)

### Partial Personal Identifiers
**Detection patterns:**
- `first_name`, `last_name`, `display_name`, `username` (when alone)
- `profile_picture`, `avatar_url`
- `city`, `state`, `country`, `region`, `zip_code`, `postal_code`
- `time_zone`, `locale`, `language_preference`

### Behavioral & Analytics Data
**Detection patterns:**
- `user_agent`, `browser`, `device_type`, `os`
- `search_query`, `search_history`, `browsing_history`
- `purchase_history`, `order_history`, `transaction_history`
- `click_event`, `page_view`, `session_duration`
- `preferences`, `interests`, `tags`, `segments`

### Financial Context (non-card)
**Detection patterns:**
- `salary`, `income`, `net_worth`, `credit_score`, `credit_rating`
- `account_balance`, `wallet_balance`, `subscription_tier`

---

## Tier 5 — Standard (No direct privacy impact)

- System configuration values (non-secret)
- Public user-facing content (blog posts, public profiles)
- Anonymized aggregated statistics
- Non-personal reference data (product catalog, country codes)
- Internal system identifiers with no external exposure

---

## Detection Guidance for AI Analysis

### Framework-Specific Patterns

**Django / Python:**
```python
# Sensitive fields typically appear in models.py
from django.db import models


class User(models.Model):
    email = models.EmailField()           # Tier 3
    date_of_birth = models.DateField()    # Tier 2 (combined with name)
    ssn = models.CharField(max_length=11) # Tier 1
```

**TypeScript / Prisma:**
```prisma
model User {
  email       String    // Tier 3
  phoneNumber String?   // Tier 3
  dateOfBirth DateTime? // Tier 2 (when combined)
  cardNumber  String?   // Tier 2 PCI-DSS
}
```

**Java / Spring / JPA:**
```java
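import jakarta.persistence.Column;  // javax.persistence on pre-Jakarta stacks
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

@Entity
public class Patient {
    @Id                          // illustrative primary key; JPA requires one
    private Long id;

    @Column(name = "diagnosis")  // Tier 1 PHI
    private String diagnosis;

    @Column(name = "ssn")        // Tier 1
    private String ssn;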
}
```

**C# / EF Core:**
```csharp
public class UserProfile {
    public string Email { get; set; }        // Tier 3
    public string PassportNumber { get; set; } // Tier 1
    public DateTime DateOfBirth { get; set; }  // Tier 2
}
```
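
The taxonomy lists names in `snake_case`, while the Prisma and C# examples use `camelCase` and `PascalCase`. Normalizing identifiers before lookup lets one pattern list cover all four frameworks. This is a sketch: `TIER_BY_NAME` is a hypothetical excerpt of the taxonomy, not a complete table.

```python
import re
from typing import Optional

# Hypothetical excerpt of the taxonomy: normalized field name -> tier number.
TIER_BY_NAME = {
    "ssn": 1, "passport_number": 1, "diagnosis": 1, "password_hash": 1,
    "card_number": 2, "date_of_birth": 2,
    "email": 3, "phone_number": 3,
}


def to_snake_case(identifier: str) -> str:
    """Convert camelCase or PascalCase identifiers to snake_case."""
    # Insert "_" wherever a lowercase letter or digit is followed by a capital.
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", identifier)
    return with_breaks.lower()


def classify_field(identifier: str) -> Optional[int]:
    """Return the tier for a field name, or None if it is not in the table."""
    return TIER_BY_NAME.get(to_snake_case(identifier))
```

For example, `classify_field("PassportNumber")` and `classify_field("passport_number")` resolve to the same entry.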

### Log Statement Patterns (High Risk — often overlooked)
```python
# BAD — logs PII
logger.info(f"User {user.email} logged in from {request.remote_addr}")
logger.debug(f"Payment for card {card_number}")

# Look for these in logging calls:
# .info(), .debug(), .warn(), .error(), console.log(), System.out.println()
```
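
A first-pass scan for this pattern can be done line by line: flag any line that contains both a logging call and a sensitive identifier. The sketch below is deliberately simple and will miss multi-line calls; `SENSITIVE_NAME` is a hypothetical short list that you would extend from the tiers above.

```python
import re

# Logging calls named in the comment above, plus Python's .warning().
LOG_CALL = re.compile(
    r"\.(?:info|debug|warn|warning|error)\(|console\.log\(|System\.out\.println\("
)
# Hypothetical short list of sensitive identifiers; extend from the taxonomy.
SENSITIVE_NAME = re.compile(
    r"email|remote_addr|card_number|ssn|password|phone", re.IGNORECASE
)


def find_risky_log_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for log calls that mention sensitive names."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if LOG_CALL.search(line) and SENSITIVE_NAME.search(line):
            findings.append((line_number, line.strip()))
    return findings
```

Each finding still needs a manual look, since a name such as `email_sent_count` would match without exposing any personal data.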

### API Response Leakage (Serializer/DTO patterns)
```typescript
// Check whether these fields are included in response objects
// even when not requested; over-fetching is a common exposure vector.
const userResponse = {
  "id": "...",
  "email": "...",          // Tier 3
  "phone": "...",          // Tier 3
  "dateOfBirth": "...",    // Tier 2: should this be returned?
  "passwordHash": "...",   // Tier 1: should NEVER be returned
  "ssn": "...",            // Tier 1: should NEVER be returned
};
```
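
When auditing serializers, it can help to run sample payloads through a check that strips and reports fields that must never be returned. The sketch below assumes a small deny-list (`NEVER_RETURN`, a hypothetical name) and walks nested objects; in production code an allow-list of returned fields is usually the safer design.

```python
from typing import Any

# Hypothetical deny-list: Tier 1 keys that should never leave the service.
NEVER_RETURN = frozenset({"passwordHash", "password_hash", "ssn"})


def strip_forbidden(payload: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a copy of payload without deny-listed keys, plus the removed paths."""
    cleaned: dict[str, Any] = {}
    removed: list[str] = []
    for key, value in payload.items():
        if key in NEVER_RETURN:
            removed.append(key)
        elif isinstance(value, dict):
            # Recurse into nested objects and report dotted paths.
            nested, nested_removed = strip_forbidden(value)
            cleaned[key] = nested
            removed.extend(f"{key}.{name}" for name in nested_removed)
        else:
            cleaned[key] = value
    return cleaned, removed
```

A non-empty `removed` list in a test run points at a serializer or DTO that is over-fetching.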

---

## Aggregation Risk Assessment

Some fields become more sensitive when they are stored or exposed together. Common combinations and the tier to assign to the combined set:

| Field (tier alone) | Combined with | Combined tier | Risk |
|--------------------|---------------|---------------|------|
| Email (T3) | Password hash (T1) | T1 | Account takeover |
| Name (T4) | DOB (T2) + Address (T2) | T2 | Full identity reconstruction |
| IP address (T3) | Timestamps + User ID | T2 | Behavioral profiling |
| City (T4) | Purchase history (T4) | T3 | De-anonymization |
| Health category (T4) | Name + Email | T1 | Triggers HIPAA obligations |

**Rule:** Always assess fields in combination, not just in isolation.
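
The rule can be sketched as a lookup over the rows of the table: start from the most sensitive individual tier, then lower the tier number further if any listed combination is fully present. The category keys below are hypothetical labels for the table's rows, and a lower tier number means higher sensitivity.

```python
# Rules transcribed from the table above: (required categories, combined tier).
AGGREGATION_RULES = [
    (frozenset({"email", "password_hash"}), 1),
    (frozenset({"name", "dob", "address"}), 2),
    (frozenset({"ip_address", "timestamp", "user_id"}), 2),
    (frozenset({"city", "purchase_history"}), 3),
    (frozenset({"health_category", "name", "email"}), 1),
]


def combined_tier(field_tiers: dict[str, int]) -> int:
    """Return the most sensitive tier (lowest number) for co-located fields."""
    # Start from the most sensitive single field; Tier 5 if there are none.
    tier = min(field_tiers.values(), default=5)
    for required, rule_tier in AGGREGATION_RULES:
        if required.issubset(field_tiers):
            tier = min(tier, rule_tier)
    return tier
```

For instance, `city` and `purchase_history` are each Tier 4 alone, but `combined_tier({"city": 4, "purchase_history": 4})` returns 3.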