awesome-copilot/skills/data-breach-blast-radius/references/data-classification.md
Shubham Jiyani 8ca38ffb9e feat: add data-breach-blast-radius skill for pre-breach impact analysis (#1487)
2026-04-28 14:26:20 +10:00

Data Classification Taxonomy

A comprehensive taxonomy for identifying sensitive data in codebases. Every field, column, model property, or variable matching these patterns should be inventoried and assigned the appropriate sensitivity tier.


Tier 1 — Catastrophic (Irreversible harm if exposed)

Biometric Data

Detection patterns (field names / column names):

  • fingerprint, thumbprint, retina_scan, iris_scan, face_id, facial_recognition
  • voice_print, voice_biometric, gait_analysis, dna_profile, genetic_data
  • biometric_template, biometric_hash, faceEmbedding, face_vector

Detection patterns (data values / format):

  • Base64-encoded blobs > 512 bytes in biometric-named fields
  • Binary columns in tables named biometric_*, face_*, fingerprint_*
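The value-format rule for Base64 blobs can be sketched as a small check. This is a minimal illustration, not the skill's implementation: the name pattern is a hypothetical subset of the field names listed above, and it assumes the 512-byte threshold applies to the decoded payload, which the rule does not specify.

```python
import base64
import binascii
import re

# Hypothetical subset of the biometric field-name patterns listed above.
# Substring matching is deliberately loose and will over-match (e.g. "interface").
BIOMETRIC_NAME = re.compile(
    r"fingerprint|thumbprint|retina|iris|face|voice_?print|biometric|dna|genetic",
    re.IGNORECASE,
)
MIN_BLOB_BYTES = 512  # assumption: threshold is measured on the decoded bytes


def looks_like_biometric_blob(field_name: str, value: str) -> bool:
    """Return True when a biometric-named field holds a large Base64 payload."""
    if not BIOMETRIC_NAME.search(field_name):
        return False
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid Base64
    return len(decoded) > MIN_BLOB_BYTES
```

A hit from a check like this is a prompt for manual review rather than proof that the column stores a biometric template.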

Government-Issued Identifiers

Detection patterns:

  • ssn, social_security_number, social_security, sin (Canada), nino (UK), tfn (Australia)
  • passport_number, passport_no, passport_id
  • drivers_license, drivers_licence, dl_number, license_number
  • national_id, national_identification, id_number, id_card_number
  • tax_id, tin, ein, itin, vat_number, fiscal_code
  • aadhaar, pan_number (India), cpf, cnpj (Brazil), rut (Chile/Colombia)
  • nric, fin (Singapore), my_kad (Malaysia), nik (Indonesia)

Regex patterns for values:

SSN:          \b\d{3}-\d{2}-\d{4}\b
UK NINO:      \b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b
CPF (Brazil): \b\d{3}\.\d{3}\.\d{3}-\d{2}\b
Aadhaar:      \b\d{4}\s\d{4}\s\d{4}\b
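As a rough sketch, the value patterns above could be applied to free text (log lines, fixtures, seed data) like this. The dictionary labels and the function name are illustrative assumptions; the regular expressions are copied unchanged from the list above.

```python
import re

# Patterns copied from the list above; the labels are illustrative.
ID_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "uk_nino": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "aadhaar": re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b"),
}


def find_government_ids(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in ID_PATTERNS.items():
        hits.extend((label, match.group()) for match in pattern.finditer(text))
    return hits
```

These patterns match on format only, so any hit should be confirmed against the surrounding field name or context before it is inventoried as Tier 1.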

Health & Medical Data (PHI under HIPAA)

Detection patterns:

  • diagnosis, icd_code, icd10, icd11, snomed, loinc_code
  • medication, prescription, drug_name, dosage, treatment
  • medical_record_number, mrn, patient_id, encounter_id
  • lab_result, test_result, pathology, radiology
  • mental_health, psychiatric, therapy_notes, counseling
  • hiv_status, std_status, substance_abuse, addiction
  • insurance_id, insurance_member_id, health_plan_id, claim_number
  • fhir_resource, hl7_message, dicom_data
  • disability, handicap, chronic_condition
  • pregnancy, reproductive_health, fertility

Authentication Credentials

Detection patterns:

  • password, passwd, pwd, hashed_password, password_hash, password_digest
  • private_key, secret_key, api_key, api_secret, api_token
  • access_token, refresh_token, bearer_token, id_token, jwt_token
  • oauth_token, oauth_secret, oauth_access_token
  • mfa_secret, totp_secret, otp_secret, backup_codes
  • session_token, session_id, auth_token
  • client_secret, client_credential
  • private_key_pem, rsa_private, ecdsa_private

Tier 2 — Critical (High regulatory exposure)

Payment Card & Bank Account Data (PCI-DSS applies to card data)

Detection patterns:

  • card_number, pan, primary_account_number, credit_card, debit_card
  • cvv, cvc, cvv2, card_verification, security_code
  • card_expiry, expiration_date, exp_date, expiry_month, expiry_year
  • cardholder_name, card_holder
  • iban, bic, swift_code, routing_number, account_number, sort_code
  • bank_account, bank_details, wire_transfer

Regex patterns for values:

Visa:            \b4[0-9]{12}(?:[0-9]{3})?\b
Mastercard:      \b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b
Amex:            \b3[47][0-9]{13}\b
Generic PAN:     \b[0-9]{13,19}\b (in a PAN-named field)
CVV:             \b[0-9]{3,4}\b (in a cvv-named field)
IBAN:            \b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}\b
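The generic PAN pattern matches any 13 to 19 digit run, so it produces many false positives outside PAN-named fields. One common way to narrow the results, not mentioned in the list above and shown here only as a possible refinement, is the Luhn checksum that payment card numbers satisfy. The function names are hypothetical.

```python
import re

PAN_CANDIDATE = re.compile(r"\b[0-9]{13,19}\b")  # the "Generic PAN" pattern above


def luhn_valid(number: str) -> bool:
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_probable_pans(text: str) -> list[str]:
    """Return digit runs that match the PAN pattern and pass the Luhn check."""
    return [m.group() for m in PAN_CANDIDATE.finditer(text) if luhn_valid(m.group())]
```

A Luhn pass does not prove a value is a live card number (test numbers pass too), but a Luhn failure is a strong signal that a digit run is something else, such as an order or tracking ID.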

Identity Combinations (High re-identification risk when combined)

Combinations that together constitute Tier 2:

  • Full name + date of birth
  • Full name + address (street level)
  • Email + date of birth + gender
  • Phone number + address

Detection patterns:

  • full_name, or first_name + last_name as separate fields (flag when both are present)
  • date_of_birth, dob, birth_date, birthdate, birthday
  • home_address, street_address, address_line1, postal_address
  • gender, sex, pronoun (when combined with other identifiers)
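The combination rules above could be checked against a model's field list roughly as follows. This is a sketch under stated assumptions: the category labels and the field-to-category mapping are hypothetical and cover only a subset of the detection patterns, and a full name is treated as present when either full_name or both first_name and last_name appear.

```python
# Hypothetical mapping from field names to coarse identifier categories.
CATEGORY_BY_FIELD = {
    "date_of_birth": "dob", "dob": "dob", "birth_date": "dob", "birthdate": "dob",
    "home_address": "address", "street_address": "address", "address_line1": "address",
    "email": "email", "email_address": "email",
    "phone": "phone", "phone_number": "phone", "mobile": "phone",
    "gender": "gender", "sex": "gender",
}

# The four Tier 2 combinations listed above.
TIER2_COMBINATIONS = [
    {"name", "dob"},
    {"name", "address"},
    {"email", "dob", "gender"},
    {"phone", "address"},
]


def tier2_combinations_present(field_names: list[str]) -> list[set[str]]:
    """Return every Tier 2 combination fully covered by the given fields."""
    fields = {name.lower() for name in field_names}
    present = {CATEGORY_BY_FIELD[f] for f in fields if f in CATEGORY_BY_FIELD}
    # A full name needs full_name, or first_name and last_name together.
    if "full_name" in fields or {"first_name", "last_name"} <= fields:
        present.add("name")
    return [combo for combo in TIER2_COMBINATIONS if combo <= present]
```

For example, a model with first_name, last_name, and dob would report the name plus date-of-birth combination, while first_name and dob alone would report nothing.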

Tier 3 — High (Regulatory notification triggers)

Contact Information

Detection patterns:

  • email, email_address, user_email, contact_email, primary_email
  • phone, phone_number, mobile, mobile_number, cell_phone, telephone
  • whatsapp_number, signal_number

Regex patterns:

Email:  \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Phone:  \+?[0-9\s\-\(\)]{7,20}  (in a phone-named field)

Precise Location Data

Detection patterns:

  • latitude, longitude, lat, lng, lat_lng, coordinates, geo_point
  • gps_location, precise_location, real_time_location
  • home_location, work_location

Note: City-level location is Tier 4; street-level or GPS coordinates are Tier 3.

Network Identifiers

Detection patterns:

  • ip_address, ip, client_ip, remote_addr, x_forwarded_for
  • mac_address, device_mac, hardware_id
  • imei, imsi, device_id, advertising_id, idfa, gaid

Authentication Artifacts

Detection patterns:

  • session_id, cookie_value, csrf_token (if long-lived and user-identifying; a value that by itself grants access belongs under Tier 1 Authentication Credentials)
  • remember_me_token, persistent_session

Tier 4 — Elevated (Privacy relevant)

Partial Personal Identifiers

Detection patterns:

  • first_name, last_name, display_name, username (when alone)
  • profile_picture, avatar_url
  • city, state, country, region, zip_code, postal_code
  • time_zone, locale, language_preference

Behavioral & Analytics Data

Detection patterns:

  • user_agent, browser, device_type, os
  • search_query, search_history, browsing_history
  • purchase_history, order_history, transaction_history
  • click_event, page_view, session_duration
  • preferences, interests, tags, segments

Financial Context (non-card)

Detection patterns:

  • salary, income, net_worth, credit_score, credit_rating
  • account_balance, wallet_balance, subscription_tier

Tier 5 — Standard (No direct privacy impact)

  • System configuration values (non-secret)
  • Public user-facing content (blog posts, public profiles)
  • Anonymized aggregated statistics
  • Non-personal reference data (product catalog, country codes)
  • Internal system identifiers with no external exposure
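To illustrate how the tier lists above might be applied mechanically, here is a minimal field-name classifier. It is a sketch, not the skill's implementation: each tier carries only a handful of the patterns listed above, the helper names are hypothetical, and names are normalized to snake_case so that camelCase and PascalCase properties hit the same patterns.

```python
import re


def _compile(*names: str) -> re.Pattern:
    """Match any of the names as a whole token (bounded by "_" or string edges)."""
    return re.compile(r"(?<![a-z0-9])(?:" + "|".join(names) + r")(?![a-z0-9])")


# Illustrative subset of each tier's patterns; the most sensitive tier is checked first.
TIER_PATTERNS = [
    (1, _compile("ssn", "passport", "password", "api_key", "diagnosis", "fingerprint")),
    (2, _compile("card_number", "cvv", "iban", "date_of_birth", "dob")),
    (3, _compile("email", "phone", "latitude", "longitude", "ip_address")),
    (4, _compile("first_name", "last_name", "city", "zip_code", "salary", "user_agent")),
]


def to_snake_case(name: str) -> str:
    """Normalize camelCase or PascalCase names to snake_case."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def classify_field(name: str) -> int:
    """Return the sensitivity tier (1-5) for a field name; 5 means no pattern matched."""
    normalized = to_snake_case(name)
    for tier, pattern in TIER_PATTERNS:
        if pattern.search(normalized):
            return tier
    return 5
```

Token-bounded matching is a deliberate choice here: a plain substring search for ssn would also flag a field called classname. A classifier like this only assigns per-field tiers and does not account for combinations of fields.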

Detection Guidance for AI Analysis

Framework-Specific Patterns

Django / Python:

# Sensitive fields typically appear in models.py
class User(models.Model):
    email = models.EmailField()           # Tier 3
    date_of_birth = models.DateField()    # Tier 2 (combined with name)
    ssn = models.CharField(max_length=11) # Tier 1

TypeScript / Prisma:

model User {
  email       String    // Tier 3
  phoneNumber String?   // Tier 3
  dateOfBirth DateTime? // Tier 2 (when combined)
  cardNumber  String?   // Tier 2 PCI-DSS
}

Java / Spring / JPA:

@Entity
public class Patient {
    @Column(name = "diagnosis")  // Tier 1 PHI
    private String diagnosis;
    
    @Column(name = "ssn")        // Tier 1
    private String ssn;
}

C# / EF Core:

public class UserProfile {
    public string Email { get; set; }        // Tier 3
    public string PassportNumber { get; set; } // Tier 1
    public DateTime DateOfBirth { get; set; }  // Tier 2
}

Log Statement Patterns (High Risk — often overlooked)

# BAD — logs PII
logger.info(f"User {user.email} logged in from {request.remote_addr}")
logger.debug(f"Payment for card {card_number}")

# Look for these in logging calls:
# .info(), .debug(), .warn(), .error(), console.log(), System.out.println()
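A line-based scan for the pattern above might look like the following sketch. The list of sensitive names is a hypothetical subset, .warning() is added alongside the calls listed above because it is the usual Python spelling, and the scan only sees logging calls that fit on a single line.

```python
import re

# Logging calls from the list above, plus .warning() as an added assumption.
LOG_CALL = re.compile(
    r"\.(?:info|debug|warn|warning|error)\s*\(|console\.log\s*\(|System\.out\.println\s*\("
)
# Hypothetical subset of sensitive identifiers to look for in the call's arguments.
SENSITIVE_NAME = re.compile(
    r"email|remote_addr|card_number|password|ssn|token", re.IGNORECASE
)


def find_risky_log_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for logging calls that mention sensitive names."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        call = LOG_CALL.search(line)
        # Only inspect the text after the opening of the logging call.
        if call and SENSITIVE_NAME.search(line, call.end()):
            findings.append((number, line.strip()))
    return findings
```

Run against the two BAD examples above, a scan like this would flag both lines; a call such as logger.info("startup complete") would pass.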

API Response Leakage (Serializer/DTO patterns)

// Check if these fields are included in response objects
// even if not requested — over-fetching is a common exposure vector
{
  "id": "...",
  "email": "...",          // Tier 3
  "phone": "...",          // Tier 3 
  "dateOfBirth": "...",    // Tier 2 — should this be returned?
  "passwordHash": "...",   // Tier 1 — should NEVER be returned
  "ssn": "...",            // Tier 1 — should NEVER be returned
}
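One way to guard against this kind of over-fetching is to build responses from an explicit allowlist rather than serializing the whole record. The sketch below assumes a plain dict record; the function name and the allowlist contents are hypothetical, and the forbidden set contains only the two fields the example above marks as never returnable.

```python
# Fields the example above marks as never returnable.
NEVER_RETURN = frozenset({"passwordHash", "ssn"})
# Hypothetical allowlist for a single endpoint.
PUBLIC_PROFILE_FIELDS = frozenset({"id", "email"})


def serialize_user(record: dict, allowed: frozenset = PUBLIC_PROFILE_FIELDS) -> dict:
    """Build a response from an explicit allowlist instead of the full record."""
    leaked = allowed & NEVER_RETURN
    if leaked:
        # Fail loudly if someone adds a forbidden field to the allowlist.
        raise ValueError(f"allowlist contains forbidden fields: {sorted(leaked)}")
    return {key: value for key, value in record.items() if key in allowed}
```

With an allowlist, a new sensitive column added to the model stays out of responses by default, whereas a denylist would expose it until someone remembers to block it.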

Aggregation Risk Assessment

Combination attacks — data that becomes more sensitive when combined:

| Alone | Combined With | Combined Tier | Risk |
|---|---|---|---|
| Email (T3) | Password hash (T1) | T1 | Account takeover |
| Name (T4) | DOB (T2) + Address (T2) | T2 | Full identity reconstruction |
| IP address (T3) | Timestamps + User ID | T2 | Behavioral profiling |
| City (T4) | Purchase history (T4) | T3 | De-anonymization risk |
| Health category (T4) | Name + Email | T1 | HIPAA triggering |

Rule: Always assess fields in combination, not just in isolation.
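The rule can be sketched as a lookup that starts from the most sensitive individual tier and then applies the combinations from the table above. The category labels are hypothetical, the rules are transcribed from the table as written, and unknown categories default to Tier 5.

```python
# Individual tiers and combination rules transcribed from the table above.
# Category names are hypothetical labels, not identifiers from a real schema.
BASE_TIER = {
    "email": 3, "password_hash": 1, "name": 4, "dob": 2, "address": 2,
    "ip_address": 3, "city": 4, "purchase_history": 4, "health_category": 4,
}
COMBINATION_RULES = [
    ({"email", "password_hash"}, 1, "Account takeover"),
    ({"name", "dob", "address"}, 2, "Full identity reconstruction"),
    ({"ip_address", "timestamp", "user_id"}, 2, "Behavioral profiling"),
    ({"city", "purchase_history"}, 3, "De-anonymization risk"),
    ({"health_category", "name", "email"}, 1, "HIPAA triggering"),
]


def effective_tier(categories: set[str]) -> tuple[int, list[str]]:
    """Return the most sensitive tier reached and the combination risks triggered."""
    # Lower tier numbers are more sensitive, so the minimum wins.
    tier = min((BASE_TIER.get(c, 5) for c in categories), default=5)
    risks = []
    for required, combined_tier, risk in COMBINATION_RULES:
        if required <= categories:
            tier = min(tier, combined_tier)
            risks.append(risk)
    return tier, risks
```

For instance, city alone evaluates to Tier 4, while city together with purchase_history evaluates to Tier 3 with the de-anonymization risk attached.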