mirror of https://github.com/github/awesome-copilot.git synced 2026-04-30 12:15:56 +00:00

Files

Shubham Jiyani 8ca38ffb9e feat: add data-breach-blast-radius skill for pre-breach impact analysis (#1487 )

* feat: add data-breach-blast-radius skill for pre-breach impact analysis

* fix: resolve codespell false positives (ZAR currency code, SME abbreviation)

* fix: remove ZAR abbreviation to pass codespell check

2026-04-28 14:26:20 +10:00

8.7 KiB

Raw Blame History

Data Classification Taxonomy

A comprehensive taxonomy for identifying sensitive data in codebases. Every field, column, model property, or variable matching these patterns should be inventoried and assigned the appropriate sensitivity tier.

Tier 1 — Catastrophic (Irreversible harm if exposed)

Biometric Data

Detection patterns (field names / column names):

fingerprint, thumbprint, retina_scan, iris_scan, face_id, facial_recognition
voice_print, voice_biometric, gait_analysis, dna_profile, genetic_data
biometric_template, biometric_hash, faceEmbedding, face_vector

Detection patterns (data values / format):

Base64-encoded blobs > 512 bytes in biometric-named fields
Binary columns in tables named biometric_*, face_*, fingerprint_*

Government-Issued Identifiers

Detection patterns:

ssn, social_security_number, social_security, sin (Canada), nino (UK), tfn (Australia)
passport_number, passport_no, passport_id
drivers_license, drivers_licence, dl_number, license_number
national_id, national_identification, id_number, id_card_number
tax_id, tin, ein, itin, vat_number, fiscal_code
aadhaar, pan_number (India), cpf, cnpj (Brazil), rut (Chile/Colombia)
nric, fin (Singapore), my_kad (Malaysia), nik (Indonesia)

Regex patterns for values:

SSN:          \b\d{3}-\d{2}-\d{4}\b
UK NINO:      \b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b
CPF (Brazil): \b\d{3}\.\d{3}\.\d{3}-\d{2}\b
Aadhaar:      \b\d{4}\s\d{4}\s\d{4}\b

Health & Medical Data (PHI under HIPAA)

Detection patterns:

diagnosis, icd_code, icd10, icd11, snomed, loinc_code
medication, prescription, drug_name, dosage, treatment
medical_record_number, mrn, patient_id, encounter_id
lab_result, test_result, pathology, radiology
mental_health, psychiatric, therapy_notes, counseling
hiv_status, std_status, substance_abuse, addiction
insurance_id, insurance_member_id, health_plan_id, claim_number
fhir_resource, hl7_message, dicom_data
disability, handicap, chronic_condition
pregnancy, reproductive_health, fertility

Authentication Credentials

Detection patterns:

password, passwd, pwd, hashed_password, password_hash, password_digest
private_key, secret_key, api_key, api_secret, api_token
access_token, refresh_token, bearer_token, id_token, jwt_token
oauth_token, oauth_secret, oauth_access_token
mfa_secret, totp_secret, otp_secret, backup_codes
session_token, session_id, auth_token
client_secret, client_credential
private_key_pem, rsa_private, ecdsa_private

Tier 2 — Critical (High regulatory exposure)

Payment Card Data (PCI-DSS)

Detection patterns:

card_number, pan, primary_account_number, credit_card, debit_card
cvv, cvc, cvv2, card_verification, security_code
card_expiry, expiration_date, exp_date, expiry_month, expiry_year
cardholder_name, card_holder
iban, bic, swift_code, routing_number, account_number, sort_code
bank_account, bank_details, wire_transfer

Regex patterns for values:

Visa:            \b4[0-9]{12}(?:[0-9]{3})?\b
Mastercard:      \b5[1-5][0-9]{14}\b
Amex:            \b3[47][0-9]{13}\b
Generic PAN:     \b[0-9]{13,19}\b (in a PAN-named field)
CVV:             \b[0-9]{3,4}\b (in a cvv-named field)
IBAN:            \b[A-Z]{2}\d{2}[A-Z0-9]{4}\d{7}([A-Z0-9]?){0,16}\b

Identity Combinations (High re-identification risk when combined)

Combinations that together constitute Tier 2:

Full name + date of birth
Full name + address (street level)
Email + date of birth + gender
Phone number + address

Detection patterns:

full_name, first_name + last_name (as separate fields — note both present)
date_of_birth, dob, birth_date, birthdate, birthday
home_address, street_address, address_line1, postal_address
gender, sex, pronoun (when combined with other identifiers)

Tier 3 — High (Regulatory notification triggers)

Contact Information

Detection patterns:

email, email_address, user_email, contact_email, primary_email
phone, phone_number, mobile, mobile_number, cell_phone, telephone
whatsapp_number, signal_number

Regex patterns:

Email:  \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b
Phone:  \+?[0-9\s\-\(\)]{7,20}  (in a phone-named field)

Precise Location Data

Detection patterns:

latitude, longitude, lat, lng, lat_lng, coordinates, geo_point
gps_location, precise_location, real_time_location
home_location, work_location

Note: City-level location is Tier 4; street-level or GPS coordinates are Tier 3.

Network Identifiers

Detection patterns:

ip_address, ip, client_ip, remote_addr, x_forwarded_for
mac_address, device_mac, hardware_id
imei, imsi, device_id, advertising_id, idfa, gaid

Authentication Artifacts

Detection patterns:

session_id, cookie_value, csrf_token (if long-lived and user-identifying)
remember_me_token, persistent_session

Tier 4 — Elevated (Privacy relevant)

Partial Personal Identifiers

Detection patterns:

first_name, last_name, display_name, username (when alone)
profile_picture, avatar_url
city, state, country, region, zip_code, postal_code
time_zone, locale, language_preference

Behavioral & Analytics Data

Detection patterns:

user_agent, browser, device_type, os
search_query, search_history, browsing_history
purchase_history, order_history, transaction_history
click_event, page_view, session_duration
preferences, interests, tags, segments

Financial Context (non-card)

Detection patterns:

salary, income, net_worth, credit_score, credit_rating
account_balance, wallet_balance, subscription_tier

Tier 5 — Standard (No direct privacy impact)

System configuration values (non-secret)
Public user-facing content (blog posts, public profiles)
Anonymized aggregated statistics
Non-personal reference data (product catalog, country codes)
Internal system identifiers with no external exposure

Detection Guidance for AI Analysis

Framework-Specific Patterns

Django / Python:

# Sensitive fields typically appear in models.py
class User(models.Model):
    email = models.EmailField()           # Tier 3
    date_of_birth = models.DateField()    # Tier 2 (combined with name)
    ssn = models.CharField(max_length=11) # Tier 1

TypeScript / Prisma:

model User {
  email       String    // Tier 3
  phoneNumber String?   // Tier 3
  dateOfBirth DateTime? // Tier 2 (when combined)
  cardNumber  String?   // Tier 2 PCI-DSS
}

Java / Spring / JPA:

@Entity
public class Patient {
    @Column(name = "diagnosis")  // Tier 1 PHI
    private String diagnosis;
    
    @Column(name = "ssn")        // Tier 1
    private String ssn;
}

C# / EF Core:

public class UserProfile {
    public string Email { get; set; }        // Tier 3
    public string PassportNumber { get; set; } // Tier 1
    public DateTime DateOfBirth { get; set; }  // Tier 2
}

Log Statement Patterns (High Risk — often overlooked)

# BAD — logs PII
logger.info(f"User {user.email} logged in from {request.remote_addr}")
logger.debug(f"Payment for card {card_number}")

# Look for these in logging calls:
# .info(), .debug(), .warn(), .error(), console.log(), System.out.println()

API Response Leakage (Serializer/DTO patterns)

// Check if these fields are included in response objects
// even if not requested — over-fetching is a common exposure vector
{
  "id": "...",
  "email": "...",          // Tier 3
  "phone": "...",          // Tier 3 
  "dateOfBirth": "...",    // Tier 2 — should this be returned?
  "passwordHash": "...",   // Tier 1 — should NEVER be returned
  "ssn": "...",            // Tier 1 — should NEVER be returned
}

Aggregation Risk Assessment

Combination attacks — data that becomes more sensitive when combined:

Alone	Combined With	Combined Tier	Risk
Email (T3)	Password hash (T1)	T1	Account takeover
Name (T4)	DOB (T2) + Address (T2)	T2	Full identity reconstruction
IP address (T3)	Timestamps + User ID	T2	Behavioral profiling
City (T4)	Purchase history (T4)	T3	De-anonymization risk
Health category (T4)	Name + Email	T1	HIPAA triggering

Rule: Always assess fields in combination, not just in isolation.

8.7 KiB Raw Blame History

Data Classification Taxonomy

Tier 1 — Catastrophic (Irreversible harm if exposed)

Biometric Data

Government-Issued Identifiers

Health & Medical Data (PHI under HIPAA)

Authentication Credentials

Tier 2 — Critical (High regulatory exposure)

Payment Card Data (PCI-DSS)

Identity Combinations (High re-identification risk when combined)

Tier 3 — High (Regulatory notification triggers)

Contact Information

Precise Location Data

Network Identifiers

Authentication Artifacts

Tier 4 — Elevated (Privacy relevant)

Partial Personal Identifiers

Behavioral & Analytics Data

Financial Context (non-card)

Tier 5 — Standard (No direct privacy impact)

Detection Guidance for AI Analysis

Framework-Specific Patterns

Log Statement Patterns (High Risk — often overlooked)

API Response Leakage (Serializer/DTO patterns)

Aggregation Risk Assessment

8.7 KiB

Raw Blame History