Data Classification Taxonomy
A comprehensive taxonomy for identifying sensitive data in codebases. Every field, column, model property, or variable matching these patterns should be inventoried and assigned the appropriate sensitivity tier.
Tier 1 — Catastrophic (Irreversible harm if exposed)
Biometric Data
Detection patterns (field names / column names):
fingerprint, thumbprint, retina_scan, iris_scan, face_id, facial_recognition
voice_print, voice_biometric, gait_analysis, dna_profile, genetic_data
biometric_template, biometric_hash, faceEmbedding, face_vector
Detection patterns (data values / format):
- Base64-encoded blobs > 512 bytes in biometric-named fields
- Binary columns in tables named biometric_*, face_*, fingerprint_*
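The Base64 heuristic above can be sketched as a small check. This is a minimal illustration, not a definitive detector: the name pattern is a hypothetical subset of the field names listed earlier, and it assumes the 512-byte threshold applies to the decoded payload, which the guidance does not specify.

```python
import base64
import binascii
import re

# Hypothetical name pattern built from a subset of the biometric field names above.
BIOMETRIC_NAME = re.compile(
    r"(fingerprint|thumbprint|retina|iris|face|voice_print|biometric|dna|genetic)",
    re.IGNORECASE,
)
MIN_BLOB_BYTES = 512  # threshold from the guidance; assumed to mean decoded size


def looks_like_biometric_blob(field_name: str, value: str) -> bool:
    """Return True for a biometric-named field holding a large Base64 payload."""
    if not BIOMETRIC_NAME.search(field_name):
        return False
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid Base64, so not a candidate blob
    return len(decoded) > MIN_BLOB_BYTES
```

A field that passes this check is only a candidate for Tier 1; confirming it still requires looking at how the value is produced.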
Government-Issued Identifiers
Detection patterns:
ssn, social_security_number, social_security, sin (Canada), nino (UK), tfn (Australia)
passport_number, passport_no, passport_id
drivers_license, drivers_licence, dl_number, license_number
national_id, national_identification, id_number, id_card_number
tax_id, tin, ein, itin, vat_number, fiscal_code
aadhaar, pan_number (India), cpf, cnpj (Brazil), rut (Chile/Colombia)
nric, fin (Singapore), my_kad (Malaysia), nik (Indonesia)
Regex patterns for values:
SSN: \b\d{3}-\d{2}-\d{4}\b
UK NINO: \b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b
CPF (Brazil): \b\d{3}\.\d{3}\.\d{3}-\d{2}\b
Aadhaar: \b\d{4}\s\d{4}\s\d{4}\b
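As a rough sketch of how the value patterns above might be applied, the following scans free text and reports every hit. The dictionary labels and the function name are illustrative assumptions. The patterns match format only, so a hit is a candidate identifier rather than a validated one.

```python
import re

# Value patterns copied from the list above; the keys are illustrative labels.
ID_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "uk_nino": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "aadhaar": re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b"),
}


def find_government_ids(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits
```

Running this over seed data, fixtures, and test files is one way to spot real identifiers that were committed by accident.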
Health & Medical Data (PHI under HIPAA)
Detection patterns:
diagnosis, icd_code, icd10, icd11, snomed, loinc_code
medication, prescription, drug_name, dosage, treatment
medical_record_number, mrn, patient_id, encounter_id
lab_result, test_result, pathology, radiology
mental_health, psychiatric, therapy_notes, counseling
hiv_status, std_status, substance_abuse, addiction
insurance_id, insurance_member_id, health_plan_id, claim_number
fhir_resource, hl7_message, dicom_data
disability, handicap, chronic_condition
pregnancy, reproductive_health, fertility
Authentication Credentials
Detection patterns:
password, passwd, pwd, hashed_password, password_hash, password_digest
private_key, secret_key, api_key, api_secret, api_token
access_token, refresh_token, bearer_token, id_token, jwt_token
oauth_token, oauth_secret, oauth_access_token
mfa_secret, totp_secret, otp_secret, backup_codes
session_token, session_id, auth_token
client_secret, client_credential
private_key_pem, rsa_private, ecdsa_private
Tier 2 — Critical (High regulatory exposure)
Payment Card Data (PCI-DSS)
Detection patterns:
card_number, pan, primary_account_number, credit_card, debit_card
cvv, cvc, cvv2, card_verification, security_code
card_expiry, expiration_date, exp_date, expiry_month, expiry_year
cardholder_name, card_holder
iban, bic, swift_code, routing_number, account_number, sort_code
bank_account, bank_details, wire_transfer
Regex patterns for values:
Visa: \b4[0-9]{12}(?:[0-9]{3})?\b
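Mastercard: \b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b (covers both the 51-55 and 2221-2720 ranges)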
Amex: \b3[47][0-9]{13}\b
Generic PAN: \b[0-9]{13,19}\b (in a PAN-named field)
CVV: \b[0-9]{3,4}\b (in a cvv-named field)
IBAN: \b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}\b
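The generic PAN pattern matches any run of 13 to 19 digits, so it will also flag order numbers and timestamps. One common way to cut false positives is the Luhn checksum that payment card numbers satisfy. The sketch below combines the two; the function names are illustrative, and a Luhn pass still does not prove a value is a live card number.

```python
import re

GENERIC_PAN = re.compile(r"\b[0-9]{13,19}\b")  # pattern from the list above


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_candidate_pans(text: str) -> list[str]:
    """Return digit runs that match the generic PAN pattern and pass Luhn."""
    return [m.group() for m in GENERIC_PAN.finditer(text) if luhn_valid(m.group())]
```

For example, `find_candidate_pans("order 4111111111111111 ref 1234567890123456")` keeps the well-known Visa test number and drops the second digit run, which fails the checksum.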
Identity Combinations (High re-identification risk when combined)
Combinations that together constitute Tier 2:
- Full name + date of birth
- Full name + address (street level)
- Email + date of birth + gender
- Phone number + address
Detection patterns:
full_name, first_name + last_name (as separate fields; note when both are present)
date_of_birth, dob, birth_date, birthdate, birthday
home_address, street_address, address_line1, postal_address
gender, sex, pronoun (when combined with other identifiers)
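The four combinations listed above can be checked against a model's field names. The following is a minimal sketch under stated assumptions: the alias sets are a hypothetical subset of the detection patterns, matching is exact on lowercased names, and a full name is taken to mean either full_name or first_name together with last_name.

```python
# Hypothetical alias sets drawn from the detection patterns above.
DOB = {"date_of_birth", "dob", "birth_date", "birthdate", "birthday"}
ADDRESS = {"home_address", "street_address", "address_line1", "postal_address"}
EMAIL = {"email", "email_address", "user_email"}
GENDER = {"gender", "sex"}
PHONE = {"phone", "phone_number", "mobile", "mobile_number"}


def has_full_name(fields):
    """A full name is either one field or first and last name together."""
    return "full_name" in fields or {"first_name", "last_name"} <= fields


def any_of(aliases):
    """Build a predicate that is true when any alias is present."""
    return lambda fields: bool(fields & aliases)


# Each rule lists predicates that must all hold for the combination to apply.
TIER2_COMBINATIONS = [
    ("full name + date of birth", [has_full_name, any_of(DOB)]),
    ("full name + address", [has_full_name, any_of(ADDRESS)]),
    ("email + date of birth + gender", [any_of(EMAIL), any_of(DOB), any_of(GENDER)]),
    ("phone number + address", [any_of(PHONE), any_of(ADDRESS)]),
]


def tier2_combinations(field_names):
    """Return the labels of every Tier 2 combination present in field_names."""
    fields = {name.lower() for name in field_names}
    return [
        label
        for label, predicates in TIER2_COMBINATIONS
        if all(predicate(fields) for predicate in predicates)
    ]
```

Checking per model is the simplest starting point; fields joined across tables at query time would need the same check applied to the joined set.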
Tier 3 — High (Regulatory notification triggers)
Contact Information
Detection patterns:
email, email_address, user_email, contact_email, primary_email
phone, phone_number, mobile, mobile_number, cell_phone, telephone
whatsapp_number, signal_number
Regex patterns:
Email: \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Phone: \+?[0-9\s\-\(\)]{7,20} (in a phone-named field)
Precise Location Data
Detection patterns:
latitude, longitude, lat, lng, lat_lng, coordinates, geo_point
gps_location, precise_location, real_time_location
home_location, work_location
Note: City-level location is Tier 4; street-level or GPS coordinates are Tier 3.
Network Identifiers
Detection patterns:
ip_address, ip, client_ip, remote_addr, x_forwarded_for
mac_address, device_mac, hardware_id
imei, imsi, device_id, advertising_id, idfa, gaid
Authentication Artifacts
Detection patterns:
session_id, cookie_value, csrf_token (if long-lived and user-identifying)
remember_me_token, persistent_session
Tier 4 — Elevated (Privacy relevant)
Partial Personal Identifiers
Detection patterns:
first_name, last_name, display_name, username (when alone)
profile_picture, avatar_url
city, state, country, region, zip_code, postal_code
time_zone, locale, language_preference
Behavioral & Analytics Data
Detection patterns:
user_agent, browser, device_type, os
search_query, search_history, browsing_history
purchase_history, order_history, transaction_history
click_event, page_view, session_duration
preferences, interests, tags, segments
Financial Context (non-card)
Detection patterns:
salary, income, net_worth, credit_score, credit_rating
account_balance, wallet_balance, subscription_tier
Tier 5 — Standard (No direct privacy impact)
- System configuration values (non-secret)
- Public user-facing content (blog posts, public profiles)
- Anonymized aggregated statistics
- Non-personal reference data (product catalog, country codes)
- Internal system identifiers with no external exposure
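The tier definitions above lend themselves to a first-pass classifier over field names. The sketch below is one possible shape, not a complete implementation: TIER_KEYWORDS holds only a handful of the listed patterns, and the helper names are hypothetical. It normalizes camelCase to snake_case and matches keywords on underscore boundaries, so dateOfBirth matches date_of_birth while classname does not match ssn.

```python
import re

# Illustrative subset of the patterns above; a lower tier number is more sensitive.
TIER_KEYWORDS = {
    1: ["ssn", "passport", "fingerprint", "diagnosis", "password", "api_key"],
    2: ["card_number", "cvv", "iban", "date_of_birth", "dob"],
    3: ["email", "phone", "latitude", "longitude", "ip_address"],
    4: ["first_name", "last_name", "city", "user_agent", "salary"],
}


def normalize(name: str) -> str:
    """Convert camelCase or PascalCase to snake_case."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def classify_field(name: str) -> int:
    """Return the most sensitive tier whose keyword appears in the field name."""
    normalized = normalize(name)
    for tier in sorted(TIER_KEYWORDS):
        for keyword in TIER_KEYWORDS[tier]:
            # Match the keyword only as a whole underscore-delimited segment.
            if re.search(rf"(?:^|_){re.escape(keyword)}(?:_|$)", normalized):
                return tier
    return 5  # no pattern matched: Tier 5 (Standard)
```

Name matching assigns a tier to each field in isolation, so combination rules still have to be applied on top of the result.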
Detection Guidance for AI Analysis
Framework-Specific Patterns
Django / Python:
# Sensitive fields typically appear in models.py
from django.db import models

class User(models.Model):
    email = models.EmailField()  # Tier 3
    date_of_birth = models.DateField()  # Tier 2 (combined with name)
    ssn = models.CharField(max_length=11)  # Tier 1
TypeScript / Prisma:
model User {
  email       String    // Tier 3
  phoneNumber String?   // Tier 3
  dateOfBirth DateTime? // Tier 2 (when combined)
  cardNumber  String?   // Tier 2 PCI-DSS
}
Java / Spring / JPA:
@Entity
public class Patient {
    @Column(name = "diagnosis") // Tier 1 PHI
    private String diagnosis;

    @Column(name = "ssn") // Tier 1
    private String ssn;
}
C# / EF Core:
public class UserProfile {
    public string Email { get; set; }          // Tier 3
    public string PassportNumber { get; set; } // Tier 1
    public DateTime DateOfBirth { get; set; }  // Tier 2
}
Log Statement Patterns (High Risk — often overlooked)
# BAD — logs PII
logger.info(f"User {user.email} logged in from {request.remote_addr}")
logger.debug(f"Payment for card {card_number}")
# Look for these in logging calls:
# .info(), .debug(), .warn(), .error(), console.log(), System.out.println()
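A line-based scan is one rough way to surface such statements. In the sketch below, LOG_CALL covers the logging calls named above (plus .warning(), the usual Python spelling), while SENSITIVE_NAME is a hypothetical shortlist that would need extending from the taxonomy. It flags a line only when a sensitive name appears after the opening parenthesis of the call.

```python
import re

# Logging calls named in the guidance above, plus Python's .warning().
LOG_CALL = re.compile(
    r"(?:\.(?:info|debug|warn|warning|error)|console\.log|System\.out\.println)\s*\("
)
# Hypothetical shortlist of sensitive identifiers; extend it from the taxonomy.
SENSITIVE_NAME = re.compile(
    r"(email|remote_addr|card_number|password|ssn|token|phone)", re.IGNORECASE
)


def find_risky_log_lines(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for logging calls that mention a sensitive name."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        call = LOG_CALL.search(line)
        # Only look at the text after the opening parenthesis of the call.
        if call and SENSITIVE_NAME.search(line, call.end()):
            findings.append((number, line.strip()))
    return findings
```

A scan like this misses multi-line calls and values passed through intermediate variables, so it works best as a triage step before manual review.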
API Response Leakage (Serializer/DTO patterns)
// Check if these fields are included in response objects
// even if not requested — over-fetching is a common exposure vector
{
  "id": "...",
  "email": "...",        // Tier 3
  "phone": "...",        // Tier 3
  "dateOfBirth": "...",  // Tier 2: should this be returned?
  "passwordHash": "...", // Tier 1: should NEVER be returned
  "ssn": "..."           // Tier 1: should NEVER be returned
}
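One way to guard against this kind of over-fetching is to build responses from an explicit allowlist and to assert that no forbidden field slips through. The field sets and function names below are hypothetical; in practice they would be derived from the tier inventory.

```python
# Hypothetical field sets; derive the real ones from the tier inventory.
NEVER_RETURN = {"passwordHash", "ssn"}
PUBLIC_FIELDS = {"id", "email"}


def serialize_user(record: dict) -> dict:
    """Build a response from an explicit allowlist instead of dumping the record."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}


def leaked_fields(response: dict) -> set:
    """Return any forbidden field names present in a response object."""
    return NEVER_RETURN.intersection(response)
```

Calling `leaked_fields` on real serializer output inside a test suite turns the "should NEVER be returned" comments into an enforced check.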
Aggregation Risk Assessment
Combination attacks — data that becomes more sensitive when combined:
| Field (Tier Alone) | Combined With | Combined Tier | Risk |
|---|---|---|---|
| Email (T3) | Password hash (T1) | T1 | Account takeover |
| Name (T4) | DOB (T2) + Address (T2) | T2 | Full identity reconstruction |
| IP address (T3) | Timestamps + User ID | T2 | Behavioral profiling |
| City (T4) | Purchase history (T4) | T3 | De-anonymization risk |
| Health category (T4) | Name + Email | T1 | HIPAA triggering |
Rule: Always assess fields in combination, not just in isolation.
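The rule can be made concrete by encoding the table rows as data and taking the most severe tier across individual fields and matched combinations. This is a sketch only: the category labels such as password_hash and health_category are illustrative names for the table entries, and real code would first map field names onto these categories.

```python
# Rules transcribed from the table above: (required categories, combined tier, risk).
COMBINATION_RULES = [
    ({"email", "password_hash"}, 1, "Account takeover"),
    ({"name", "dob", "address"}, 2, "Full identity reconstruction"),
    ({"ip_address", "timestamp", "user_id"}, 2, "Behavioral profiling"),
    ({"city", "purchase_history"}, 3, "De-anonymization risk"),
    ({"health_category", "name", "email"}, 1, "HIPAA triggering"),
]


def combined_tier(categories, individual_tiers):
    """Return (most severe tier, matched risks); a lower number is more severe."""
    present = set(categories)
    # Start from each category's own tier, defaulting to Tier 5 when unknown.
    tiers = [individual_tiers.get(category, 5) for category in present]
    risks = []
    for required, tier, risk in COMBINATION_RULES:
        if required <= present:  # every required category is present
            tiers.append(tier)
            risks.append(risk)
    return min(tiers, default=5), risks
```

For instance, city and purchase history are each Tier 4 alone, but `combined_tier({"city", "purchase_history"}, {"city": 4, "purchase_history": 4})` yields Tier 3 with the de-anonymization risk attached.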