
Internationalized Domains and Punycode Analysis

Purpose and Scope

This document explains how ZoneFeeds processes Internationalized Domain Names (IDNs), focusing on Punycode-to-Unicode conversion, Unicode normalization, and security-aware interpretation of multilingual domain labels. General platform architecture, ingestion pipelines, and supported TLD coverage are covered in other documentation and are intentionally excluded here.


Why Punycode Exists

The Domain Name System conventionally limits hostname labels to a restricted ASCII character set: letters, digits, and the hyphen. To support non-ASCII scripts, the IDNA standard uses Punycode (RFC 3492), an encoding that converts Unicode domain labels into an ASCII-compatible format.

All internationalized labels are stored in DNS using the format:

xn--<Punycode-encoded label>

Example:

  • Unicode label: 银行
  • Punycode label: xn--t6qv86b

ZoneFeeds must decode this representation to correctly analyze domain intent and detect abuse.


Punycode to Unicode Conversion Process

ZoneFeeds follows the IDNA standard conversion flow and applies additional safety checks.

Step 1: Label Level Segmentation

Domains are split into individual labels at the dot boundary.

Example: xn--e1afmkfd.xn--p1ai

Labels processed independently:

  • xn--e1afmkfd
  • xn--p1ai


Step 2: Punycode Prefix Detection

Each label is inspected for the xn-- prefix.

  • Labels without the prefix are treated as ASCII or Unicode literals
  • Labels with the prefix are routed to the Punycode decoder

This ensures that domains mixing plain ASCII labels with Punycode-encoded labels are handled correctly.
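Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not ZoneFeeds code: the function names `split_labels` and `route_labels` and the route names "punycode" and "literal" are hypothetical. The prefix check is case-insensitive because DNS names are compared without regard to case.

```python
ACE_PREFIX = "xn--"  # the prefix that marks a Punycode-encoded label


def split_labels(domain: str) -> list[str]:
    """Split a domain into labels at the dot boundary (hypothetical helper)."""
    # Drop a single trailing root dot, if present, before splitting.
    if domain.endswith("."):
        domain = domain[:-1]
    return domain.split(".")


def route_labels(domain: str) -> list[tuple[str, str]]:
    """Pair each label with the path it would take: decoder or literal."""
    routed = []
    for label in split_labels(domain):
        if label.lower().startswith(ACE_PREFIX):
            routed.append((label, "punycode"))
        else:
            routed.append((label, "literal"))
    return routed


print(route_labels("shop.xn--p1ai"))
```

Because each label is classified on its own, a domain such as shop.xn--p1ai yields one literal label and one label for the decoder.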


Step 3: Punycode Decoding Algorithm

ZoneFeeds implements the standard Punycode decoding algorithm defined in RFC 3492.

Key operations include:

  • Base 36 decoding of encoded characters
  • Bias adaptation for variable-length encoding
  • Reconstruction of Unicode code points
  • Validation of output code point ranges

The output is a Unicode string representing the original label.
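As an illustration of the decoding step, the sketch below uses the `punycode` codec from Python's standard library, which implements RFC 3492, rather than ZoneFeeds' own decoder. The helper names `decode_label` and `decode_domain` are hypothetical.

```python
ACE_PREFIX = "xn--"


def decode_label(label: str) -> str:
    """Decode one label; labels without the xn-- prefix pass through unchanged."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    payload = label[len(ACE_PREFIX):]
    # The standard-library codec performs the base 36 decoding and
    # bias adaptation described in RFC 3492.
    return payload.encode("ascii").decode("punycode")


def decode_domain(domain: str) -> str:
    """Decode every label independently and rejoin with dots."""
    return ".".join(decode_label(label) for label in domain.split("."))


print(decode_domain("xn--e1afmkfd.xn--p1ai"))
```

The codec raises `UnicodeError` on malformed input, so production code would normally wrap the call in error handling.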


Step 4: Unicode Validation

Decoded Unicode labels are validated to ensure:

  • Characters belong to allowed Unicode blocks
  • No prohibited or deprecated code points are used
  • Labels comply with IDNA contextual rules

Invalid or malformed labels are flagged for further inspection.
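A coarse approximation of this validation can be written with the standard `unicodedata` module. The sketch below is an assumption-laden simplification: it rejects whole general categories (control, format, unassigned, private use, surrogate) and leading or trailing hyphens, whereas the actual IDNA rules use per-code-point tables and permit some format characters in specific contexts. The function name `validation_problems` is hypothetical.

```python
import unicodedata

# General categories rejected outright in this simplified check.
DISALLOWED_CATEGORIES = {"Cc", "Cf", "Cn", "Co", "Cs"}


def validation_problems(label: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no finding."""
    problems = []
    if not label:
        problems.append("empty label")
    if label.startswith("-") or label.endswith("-"):
        problems.append("leading or trailing hyphen")
    for ch in label:
        category = unicodedata.category(ch)
        if category in DISALLOWED_CATEGORIES:
            problems.append(f"disallowed code point U+{ord(ch):04X} ({category})")
    return problems


# A zero-width space hidden inside an otherwise ordinary label is reported.
print(validation_problems("pay\u200bpal"))
```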


Unicode Normalization and Canonical Form

Unicode allows multiple representations for visually identical characters. To prevent evasion through encoding tricks, ZoneFeeds applies canonical normalization.

Normalization Steps

  • Convert to Unicode Normalization Form C (NFC)
  • Apply lowercase transformation using Unicode case folding
  • Strip or normalize compatibility characters where applicable

This ensures consistent comparison and hashing of domain labels.
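The normalization steps above can be sketched with Python's standard library. This is a minimal example under stated assumptions: the function name `normalize_label` and the `compat` flag are hypothetical, and NFKC is used here as one possible way to normalize compatibility characters. NFC is applied a second time because case folding can produce a string that is no longer in NFC.

```python
import unicodedata


def normalize_label(label: str, compat: bool = False) -> str:
    """Bring a label into a canonical, case-folded form for comparison."""
    form = "NFKC" if compat else "NFC"
    composed = unicodedata.normalize(form, label)
    folded = composed.casefold()
    # Re-normalize, since case folding may break the normalization form.
    return unicodedata.normalize(form, folded)


# "e" followed by a combining acute accent and the precomposed letter
# compare equal after normalization.
print(normalize_label("Cafe\u0301") == normalize_label("CAF\u00c9"))
```

Note that `str.casefold()` is more aggressive than `str.lower()`; for example, it maps the German sharp s to "ss".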


Mixed Script Detection

After normalization, ZoneFeeds analyzes script usage within each label.

Detected patterns include:

  • Single script labels
  • Mixed scripts within a label
  • Script mixing across label boundaries

Mixed script domains are inherently higher risk and are treated accordingly.
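Script analysis of this kind can be approximated as follows. Python's standard library does not expose the Unicode Script property, so this sketch uses the first word of each character's Unicode name (for example LATIN or CYRILLIC) as a stand-in. That heuristic, and the function names, are assumptions for illustration only.

```python
import unicodedata


def label_scripts(label: str) -> set[str]:
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if not ch.isalpha():
            continue  # digits and hyphens are shared across scripts
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """True when letters from more than one script appear in one label."""
    return len(label_scripts(label)) > 1


def mixes_across_labels(domain: str) -> bool:
    """True when the labels of a domain, taken together, use several scripts."""
    combined = set()
    for label in domain.split("."):
        combined |= label_scripts(label)
    return len(combined) > 1


# "paypal" with a Cyrillic "а" in second position.
print(is_mixed_script("p\u0430ypal"))
```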


Unicode Confusable Mapping

ZoneFeeds maintains a confusable character map derived from Unicode standards.

This map identifies:

  • Characters from different scripts with similar glyph shapes
  • Characters frequently abused for impersonation

Examples:

  • Latin a vs Cyrillic а
  • Latin o vs Greek ο

These mappings are used to generate visual similarity fingerprints for each domain.


Unicode to ASCII Re-Encoding

For comparison against existing ASCII brand assets, ZoneFeeds also performs reverse normalization.

  • Unicode labels are mapped to ASCII lookalike equivalents
  • Confusable substitutions are applied
  • Simplified ASCII fingerprints are generated

This allows direct comparison between Unicode domains and ASCII brand dictionaries.


Error Handling and Edge Cases

ZoneFeeds explicitly handles:

  • Invalid Punycode sequences
  • Overlong or malformed encodings
  • Label collisions after normalization
  • Homoglyph amplification attacks

Domains triggering these conditions receive elevated risk indicators.


Output Metadata Fields

Each processed IDN includes:

  • Original domain string
  • Punycode representation per label
  • Decoded Unicode domain
  • Normalized Unicode form
  • Script classification per label
  • Mixed script flag
  • Confusable similarity score

These fields are exposed via APIs and data feeds.
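To make the field list concrete, the record below models it as a Python dataclass. The class name, attribute names, and types are assumptions chosen for illustration; the actual API and feed schemas may differ, and the score shown is a placeholder rather than real output.

```python
from dataclasses import asdict, dataclass


@dataclass
class IdnRecord:
    """Hypothetical container mirroring the metadata fields listed above."""
    original_domain: str
    punycode_labels: list[str]
    decoded_domain: str
    normalized_domain: str
    scripts_per_label: list[list[str]]
    mixed_script: bool
    confusable_score: float


record = IdnRecord(
    original_domain="xn--e1afmkfd.xn--p1ai",
    punycode_labels=["xn--e1afmkfd", "xn--p1ai"],
    decoded_domain="пример.рф",
    normalized_domain="пример.рф",
    scripts_per_label=[["Cyrillic"], ["Cyrillic"]],
    mixed_script=False,
    confusable_score=0.0,  # placeholder value, not a computed score
)

# asdict() gives a plain dictionary suitable for JSON serialization.
print(asdict(record))
```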


Security Implications

Punycode conversion is not just a decoding step. Improper handling can hide malicious intent.

ZoneFeeds treats encoding as a security boundary and enforces strict, deterministic transformations to ensure:

  • No loss of semantic meaning
  • No bypass through alternate encodings
  • Accurate multilingual threat detection


Summary

ZoneFeeds applies a rigorous and standards compliant approach to Punycode and Unicode proc