Internationalized Domains and Punycode Analysis
Purpose and Scope
This document explains in detail how ZoneFeeds processes Internationalized Domain Names (IDNs), with a specific focus on Punycode-to-Unicode conversion, Unicode normalization, and security-aware interpretation of multilingual domain labels. General platform architecture, ingestion pipelines, and supported TLD coverage are intentionally excluded to avoid duplication with other documentation.
Why Punycode Exists
By convention, hostname labels in the Domain Name System are limited to a restricted ASCII character set (letters, digits, and hyphens). To support non-ASCII scripts, the IDNA standard uses Punycode (RFC 3492), an encoding that converts Unicode domain labels into an ASCII-compatible form.
Internationalized labels are stored in DNS as the prefix xn-- followed by the Punycode-encoded label:
xn--<encoded-label>
Example:
- Unicode label: 银行
- Punycode label: xn--jn2a72x
ZoneFeeds must decode this representation to correctly analyze domain intent and detect abuse.
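As a rough illustration of the relationship between a Unicode label and its ASCII-compatible form, the sketch below uses Python's built-in "punycode" codec. The helper names to_ace_label and to_unicode_label are hypothetical and are not part of ZoneFeeds; the sketch also skips the mapping and validation steps that full IDNA processing requires.

```python
ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded label


def to_ace_label(label: str) -> str:
    """Encode one Unicode label into its ASCII-compatible form."""
    if label.isascii():
        return label  # plain ASCII labels are stored unchanged
    # The built-in "punycode" codec implements the RFC 3492 encoding.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode one xn-- label back to Unicode; other labels pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


ace = to_ace_label("\u4e2d\u56fd")  # U+4E2D U+56FD
print(ace)                          # xn--fiqs8s
print(to_unicode_label(ace) == "\u4e2d\u56fd")
```

Labels that are already ASCII pass through unchanged, which is why only non-ASCII labels ever carry the prefix.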
Punycode to Unicode Conversion Process
ZoneFeeds follows the IDNA standard conversion flow and applies additional safety checks.
Step 1: Label Level Segmentation
Domains are split into individual labels at the dot boundary.
Example: xn--e1afmkfd.xn--p1ai
Labels processed independently:
- xn--e1afmkfd
- xn--p1ai
Step 2: Punycode Prefix Detection
Each label is inspected for the xn-- prefix.
- Labels without the prefix are treated as ASCII or Unicode literals
- Labels with the prefix are routed to the Punycode decoder
This ensures mixed domains are handled correctly.
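Steps 1 and 2 can be sketched as follows. This is a minimal illustration, assuming Python's built-in "punycode" codec stands in for the decoder; the function name split_and_route is hypothetical.

```python
from typing import List, Tuple

ACE_PREFIX = "xn--"


def split_and_route(domain: str) -> List[Tuple[str, str]]:
    """Return (original_label, decoded_label) pairs, one per label."""
    results = []
    # Step 1: split at dot boundaries (a trailing root dot is ignored).
    for label in domain.rstrip(".").split("."):
        # Step 2: only labels carrying the prefix go to the decoder.
        if label.lower().startswith(ACE_PREFIX):
            payload = label[len(ACE_PREFIX):]
            decoded = payload.encode("ascii").decode("punycode")
        else:
            decoded = label  # ASCII or Unicode literal, kept as-is
        results.append((label, decoded))
    return results


pairs = split_and_route("xn--e1afmkfd.xn--p1ai")
print([original for original, _ in pairs])
```

Because each label is routed on its own, a domain such as www.xn--p1ai decodes only the second label and leaves the first untouched.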
Step 3: Punycode Decoding Algorithm
ZoneFeeds implements the standard Punycode decoding algorithm defined in RFC 3492.
Key operations include:
- Base-36 decoding of encoded characters
- Bias adaptation for variable-length encoding
- Reconstruction of Unicode code points
- Validation of output code point ranges
The output is a Unicode string representing the original label.
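The operations listed above can be sketched as a compact decoder that follows the RFC 3492 procedure. This is an illustrative sketch, not the ZoneFeeds implementation; the function names are hypothetical, and the input is the part of the label that follows the xn-- prefix.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def _adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation after each decoded delta (RFC 3492, section 6.1)."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE * delta) // (delta + SKEW)


def _digit_value(ch: str) -> int:
    """Map a base-36 digit to its value: a-z are 0..25, 0-9 are 26..35."""
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"invalid Punycode digit: {ch!r}")


def punycode_decode(encoded: str) -> str:
    """Decode the Punycode payload of a label (without the xn-- prefix)."""
    # Basic (ASCII) code points come before the last '-' delimiter, if any.
    basic, sep, extended = encoded.rpartition("-")
    if sep and not basic:
        raise ValueError("delimiter without basic code points")
    if any(ord(c) >= 0x80 for c in basic):
        raise ValueError("non-ASCII character in basic section")
    output = list(basic)
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    pos = 0
    while pos < len(extended):
        old_i, w, k = i, 1, BASE
        while True:  # read one variable-length integer (the delta)
            if pos >= len(extended):
                raise ValueError("truncated Punycode sequence")
            digit = _digit_value(extended[pos])
            pos += 1
            i += digit * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if digit < t:
                break
            w *= BASE - t
            k += BASE
        bias = _adapt(i - old_i, len(output) + 1, old_i == 0)
        # The delta encodes both the code point increment and the position.
        n += i // (len(output) + 1)
        i %= len(output) + 1
        if n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
            raise ValueError("decoded code point out of range")
        output.insert(i, chr(n))
        i += 1
    return "".join(output)
```

Python integers do not overflow, so the overflow checks that RFC 3492 requires for fixed-width integers are omitted here; an implementation in a language with fixed-width integers would need them.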
Step 4: Unicode Validation
Decoded Unicode labels are validated to ensure:
- Characters belong to allowed Unicode blocks
- No prohibited or deprecated code points are used
- Labels comply with IDNA contextual rules
Invalid or malformed labels are flagged for further inspection.
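A heavily simplified sketch of such checks is shown below. It relies only on general categories from Python's unicodedata module; the category set and the function name validate_label are assumptions made for illustration, and a real validator would use the full IDNA derived-property tables and contextual rules rather than this subset.

```python
import unicodedata
from typing import List

# Assumed subset of disallowed general categories: unassigned, control,
# private use, surrogate, and space separator.
DISALLOWED_CATEGORIES = {"Cn", "Cc", "Co", "Cs", "Zs"}


def validate_label(label: str) -> List[str]:
    """Return a list of problems found in a decoded label (empty if none)."""
    if not label:
        return ["empty label"]
    problems = []
    if label.startswith("-") or label.endswith("-"):
        problems.append("leading or trailing hyphen")
    if unicodedata.category(label[0]).startswith("M"):
        problems.append("starts with a combining mark")
    for ch in label:
        category = unicodedata.category(ch)
        if category in DISALLOWED_CATEGORIES:
            problems.append(f"disallowed code point U+{ord(ch):04X} ({category})")
    return problems


print(validate_label("example"))   # []
print(validate_label("-bad"))
```

Returning a list of problems rather than a boolean keeps the reasons available for the "flagged for further inspection" path.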
Unicode Normalization and Canonical Form
Unicode allows multiple representations for visually identical characters. To prevent evasion through encoding tricks, ZoneFeeds applies canonical normalization.
Normalization Steps
- Convert to Unicode Normalization Form C (NFC)
- Apply lowercase transformation using Unicode case folding
- Strip or normalize compatibility characters where applicable
This ensures consistent comparison and hashing of domain labels.
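The normalization steps can be sketched with Python's standard unicodedata module. The function name normalize_label and the compat switch are hypothetical; whether compatibility characters are folded (NFKC) or rejected is a policy choice this sketch does not settle.

```python
import unicodedata


def normalize_label(label: str, compat: bool = False) -> str:
    """Normalize a label for comparison and hashing."""
    form = "NFKC" if compat else "NFC"
    composed = unicodedata.normalize(form, label)
    folded = composed.casefold()  # Unicode case folding
    # Case folding can produce sequences that are no longer normalized,
    # so normalize once more to obtain a stable result.
    return unicodedata.normalize(form, folded)


# A precomposed uppercase E-acute and a decomposed lowercase sequence
# end up with the same canonical form.
print(normalize_label("\u00c9") == normalize_label("e\u0301"))  # True
```

Note that str.casefold maps U+00DF (sharp s) to "ss", which may or may not match the behavior wanted for a given registry policy.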
Mixed Script Detection
After normalization, ZoneFeeds analyzes script usage within each label.
Detected patterns include:
- Single script labels
- Mixed scripts within a label
- Script mixing across label boundaries
Mixed script domains are inherently higher risk and are treated accordingly.
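The following sketch shows one way to classify scripts per label. Python's standard library does not expose the Unicode Script property, so the first word of each character name is used as a crude stand-in; that heuristic, and the names label_scripts and script_report, are assumptions made for this example only.

```python
import unicodedata
from typing import Dict, Set


def label_scripts(label: str) -> Set[str]:
    """Approximate the set of scripts used in one label."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are treated as script-neutral
        if unicodedata.category(ch).startswith("M"):
            continue  # combining marks inherit the script of their base
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def script_report(domain: str) -> Dict[str, object]:
    """Summarize script usage within and across the labels of a domain."""
    per_label = {label: label_scripts(label) for label in domain.split(".")}
    all_scripts = set().union(*per_label.values())
    return {
        "per_label": per_label,
        "mixed_within_label": any(len(s) > 1 for s in per_label.values()),
        "mixed_across_labels": len(all_scripts) > 1,
    }


# Latin letters with one Cyrillic letter (U+0430) in the first label.
print(script_report("p\u0430ypal.com")["mixed_within_label"])  # True
```

A production classifier would read the Script and Script_Extensions properties from the Unicode Character Database, since some writing systems legitimately combine several scripts.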
Unicode Confusable Mapping
ZoneFeeds maintains a confusable character map derived from Unicode standards.
This map identifies:
- Characters from different scripts with similar glyph shapes
- Characters frequently abused for impersonation
Examples:
- Latin a vs Cyrillic а
- Latin o vs Greek ο
These mappings are used to generate visual similarity fingerprints for each domain.
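A fingerprint of this kind can be sketched with a small lookup table. The table below is a tiny hand-picked sample for illustration only, not the map ZoneFeeds maintains, and the function name fingerprint is hypothetical.

```python
# Hand-picked sample: lookalike character -> Latin counterpart.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def fingerprint(label: str) -> str:
    """Replace known lookalikes so visually similar labels compare equal."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.casefold())


# Two labels that differ only by lookalike characters share a fingerprint.
print(fingerprint("p\u0430yp\u0430l") == fingerprint("paypal"))  # True
```

Characters that are not in the table pass through unchanged, so the fingerprint of a pure ASCII label is simply its case-folded form.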
Unicode to ASCII Re-Encoding
For comparison against existing ASCII brand assets, ZoneFeeds also performs reverse normalization.
- Unicode labels are mapped to ASCII lookalike equivalents
- Confusable substitutions are applied
- Simplified ASCII fingerprints are generated
This allows direct comparison between Unicode domains and ASCII brand dictionaries.
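Once a simplified ASCII fingerprint exists, the comparison against a brand dictionary can be sketched as below. The similarity measure (difflib's SequenceMatcher ratio), the 0.85 threshold, and the function name best_brand_match are all assumptions chosen for illustration; the source does not specify how ZoneFeeds scores matches.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional, Tuple


def best_brand_match(
    fingerprint: str,
    brands: Iterable[str],
    threshold: float = 0.85,  # assumed cut-off, tune per use case
) -> Optional[Tuple[str, float]]:
    """Return the closest brand and its score, or None below the threshold."""
    best = None
    for brand in brands:
        score = SequenceMatcher(None, fingerprint, brand).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (brand, score)
    return best


# The input is assumed to be an ASCII fingerprint produced earlier.
print(best_brand_match("paypal", ["google", "paypal"]))  # ('paypal', 1.0)
```

An exact fingerprint match scores 1.0; lowering the threshold also catches near misses at the cost of more false positives.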
Error Handling and Edge Cases
ZoneFeeds explicitly handles:
- Invalid Punycode sequences
- Overlong or malformed encodings
- Label collisions after normalization
- Homoglyph amplification attacks
Domains triggering these conditions receive elevated risk indicators.
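Two of these cases, invalid sequences and collisions after normalization, can be sketched as follows. The flag names, the function names, and the use of Python's built-in "punycode" codec are assumptions for illustration; the 63-octet limit is the standard DNS bound on a single label.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

MAX_LABEL_OCTETS = 63  # DNS limit on the length of one label


def safe_decode_label(label: str) -> Tuple[Optional[str], List[str]]:
    """Decode a label, returning (decoded_or_None, risk_flags)."""
    flags: List[str] = []
    if len(label.encode("utf-8")) > MAX_LABEL_OCTETS:
        flags.append("overlong_label")
        return None, flags
    if not label.lower().startswith("xn--"):
        return label, flags
    try:
        decoded = label[4:].encode("ascii").decode("punycode")
    except ValueError:  # UnicodeError is a subclass of ValueError
        flags.append("invalid_punycode")
        return None, flags
    return decoded, flags


def find_collisions(
    labels: Iterable[str], normalize: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group labels that become identical after normalization."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for label in labels:
        groups[normalize(label)].append(label)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(safe_decode_label("xn--!!!"))
print(find_collisions(["Example", "example", "other"], str.casefold))
```

Failures are reported as flags instead of raised exceptions so that a malformed label still yields a record that can carry an elevated risk indicator.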
Output Metadata Fields
Each processed IDN includes:
- Original domain string
- Punycode representation per label
- Decoded Unicode domain
- Normalized Unicode form
- Script classification per label
- Mixed script flag
- Confusable similarity score
These fields are exposed via APIs and data feeds.
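For orientation, a record carrying these fields might be modeled as shown below. The class name, field names, types, and the sample values are hypothetical; the actual API and feed schemas are not specified in this document.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class IdnRecord:
    """Hypothetical container for the metadata fields listed above."""
    original_domain: str
    punycode_labels: List[str]
    unicode_domain: str
    normalized_domain: str
    label_scripts: Dict[str, List[str]]
    mixed_script: bool
    confusable_score: float


# Placeholder values, for illustration only.
record = IdnRecord(
    original_domain="xn--e1afmkfd.xn--p1ai",
    punycode_labels=["xn--e1afmkfd", "xn--p1ai"],
    unicode_domain="\u043f\u0440\u0438\u043c\u0435\u0440.\u0440\u0444",
    normalized_domain="\u043f\u0440\u0438\u043c\u0435\u0440.\u0440\u0444",
    label_scripts={"xn--e1afmkfd": ["CYRILLIC"], "xn--p1ai": ["CYRILLIC"]},
    mixed_script=False,
    confusable_score=0.0,
)

# json.dumps escapes non-ASCII by default, so the output is ASCII-safe.
print(json.dumps(asdict(record), indent=2))
```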
Security Implications
Punycode conversion is not just a decoding step. Improper handling can hide malicious intent.
ZoneFeeds treats encoding as a security boundary and enforces strict, deterministic transformations to ensure:
- No loss of semantic meaning
- No bypass through alternate encodings
- Accurate multilingual threat detection
Summary
ZoneFeeds applies a rigorous and standards compliant approach to Punycode and Unicode proc