UTF-8 Validator — Check Encoding, Free Online
Validate UTF-8 encoding of text or hex bytes. Detect invalid sequences, BOM, mixed encodings. Browser-only.
About UTF-8 Validator
UTF-8 is the dominant text encoding on the modern web: it is backwards-compatible with ASCII, supports all Unicode characters, and is variable-width (1-4 bytes per code point). Bytes decoded with the wrong encoding show up as "mojibake" (garbled text), and invalid UTF-8 sequences cause outright errors in strict parsers (Python, Rust, JSON parsers). The ZTools UTF-8 Validator checks input bytes (paste hex or upload binary) for valid UTF-8 sequencing, flags invalid bytes, detects byte-order marks, and identifies likely alternative encodings (Latin-1, Windows-1252, GB2312).
Use cases
- Debug "mojibake" in scraped text. Text shows "Ã©" instead of "é". Validator reveals the bytes are valid UTF-8 but were decoded as Latin-1.
- Validate user uploads. CSV upload with mixed encodings. Validator detects which rows have invalid UTF-8.
- Check for BOM (Byte Order Mark). Some tools add a UTF-8 BOM (0xEF 0xBB 0xBF). Most parsers handle it; some choke. Validator flags it.
- Verify a binary file is text. Random binary data almost never parses as valid UTF-8, so a clean pass is strong evidence of text. Validator confirms or refutes.
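As a rough illustration of the CSV use case, the sketch below reports which rows of an upload fail strict UTF-8 decoding. The function name `rows_with_invalid_utf8` and the sample bytes are made up for this example, and this is not the tool's own implementation. Splitting on the newline byte is safe because 0x0A never occurs inside a multi-byte UTF-8 sequence.

```python
def rows_with_invalid_utf8(data: bytes) -> list[int]:
    """Return 1-based row numbers whose bytes are not valid UTF-8."""
    bad_rows = []
    for row_number, row in enumerate(data.split(b"\n"), start=1):
        try:
            row.decode("utf-8")  # Python's decoder is strict by default
        except UnicodeDecodeError:
            bad_rows.append(row_number)
    return bad_rows


# Hypothetical upload: row 2 is UTF-8, row 3 was saved as Latin-1.
upload = b"name,count\ncaf\xc3\xa9,1\ncaf\xe9,2\n"
print(rows_with_invalid_utf8(upload))  # [3]
```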
How it works
- Paste text or hex bytes. Auto-detect: if hex chars only, treat as bytes; otherwise treat as already-decoded text.
- Validate. Walk bytes. UTF-8 rules: ASCII (0xxxxxxx), 2-byte (110xxxxx 10xxxxxx), 3-byte (1110xxxx 10xxxxxx 10xxxxxx), 4-byte (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). Each continuation byte starts with 10.
- Flag issues. Invalid lead bytes, missing continuation bytes, overlong encodings, encoded surrogate code points (U+D800 to U+DFFF).
- Display. Valid: ✓ + decoded text. Invalid: byte position + suggested encoding (often Latin-1 / Windows-1252).
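The byte walk described in the steps above can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function name `find_utf8_error` and the reason strings are assumptions. The range limits on the first continuation byte follow the well-formed byte sequence table in the Unicode Standard, which is what rules out overlong forms, surrogates, and values above U+10FFFF.

```python
def find_utf8_error(data: bytes):
    """Return (offset, reason) for the first invalid sequence, or None if valid."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # 0xxxxxxx: ASCII
            i += 1
            continue
        if 0x80 <= lead <= 0xBF:             # 10xxxxxx with no lead byte before it
            return i, "orphan continuation byte"
        if 0xC2 <= lead <= 0xDF:             # 110xxxxx: 2-byte sequence
            extra, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:           # 1110xxxx: 3-byte sequence
            extra = 2
            lo = 0xA0 if lead == 0xE0 else 0x80   # below A0 would be overlong
            hi = 0x9F if lead == 0xED else 0xBF   # above 9F would be a surrogate
        elif 0xF0 <= lead <= 0xF4:           # 11110xxx: 4-byte sequence
            extra = 3
            lo = 0x90 if lead == 0xF0 else 0x80   # below 90 would be overlong
            hi = 0x8F if lead == 0xF4 else 0xBF   # above 8F would exceed U+10FFFF
        else:                                # C0, C1, F5..FF never appear in UTF-8
            return i, "invalid lead byte"
        for k in range(1, extra + 1):
            if i + k >= n:
                return i, "truncated sequence"
            low, high = (lo, hi) if k == 1 else (0x80, 0xBF)
            if not low <= data[i + k] <= high:
                return i, "invalid continuation byte"
        i += extra + 1
    return None


print(find_utf8_error(bytes.fromhex("C3 A9")))  # None
print(find_utf8_error(bytes.fromhex("E9")))     # (0, 'truncated sequence')
print(find_utf8_error(bytes.fromhex("C0 AF")))  # (0, 'invalid lead byte')
```

Returning the offset of the lead byte, rather than of the offending continuation byte, is a design choice made here so that the reported position always points at the start of the bad sequence.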
Examples
Input: Hex "C3 A9" (é in UTF-8)
Output: Valid. Decodes to "é".
Input: Hex "E9" (é in Latin-1)
Output: Invalid UTF-8 (0xE9 is a 3-byte lead byte with no continuation bytes after it). Likely Latin-1 / Windows-1252.
Input: Hex "EF BB BF E2 9C 93"
Output: Valid UTF-8 with BOM (EF BB BF) prefix. Decodes to "✓".
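The three examples can be reproduced with Python's built-in strict decoder. The helper `check_hex` and the shape of its result are assumptions made for this sketch, and it only approximates what the tool reports.

```python
import codecs


def check_hex(hex_string: str) -> dict:
    """Decode space-separated hex bytes as strict UTF-8 and summarize the result."""
    data = bytes.fromhex(hex_string)  # spaces between byte pairs are ignored
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Latin-1 maps every byte to a character, so this fallback never fails.
        return {"valid": False, "offset": exc.start, "as_latin1": data.decode("latin-1")}
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        text = text[1:]  # drop the decoded U+FEFF
    return {"valid": True, "bom": has_bom, "text": text}


print(check_hex("C3 A9"))              # {'valid': True, 'bom': False, 'text': 'é'}
print(check_hex("E9"))                 # {'valid': False, 'offset': 0, 'as_latin1': 'é'}
print(check_hex("EF BB BF E2 9C 93"))  # {'valid': True, 'bom': True, 'text': '✓'}
```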
Frequently asked questions
What's mojibake?
Garbled text caused by encoding/decoding mismatches. "Ã©" is "é" encoded as UTF-8 (C3 A9), decoded as Latin-1, and often re-encoded as UTF-8 (C3 83 C2 A9). Validator helps trace the chain.
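The chain is easy to reproduce and, in the simple single-step case, to reverse. The sketch below assumes the wrong decoder was Latin-1; the helper name `undo_mojibake` is hypothetical, and real-world text may have gone through more than one bad step.

```python
def undo_mojibake(garbled: str, wrong_encoding: str = "latin-1"):
    """Reverse one mis-decoding step; return None if the text does not fit the pattern."""
    try:
        return garbled.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


original = "é"
garbled = original.encode("utf-8").decode("latin-1")  # UTF-8 bytes read as Latin-1
print(garbled)                 # Ã©
print(undo_mojibake(garbled))  # é
print(undo_mojibake("é"))      # None: 0xE9 alone is not valid UTF-8
```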
BOM — should I keep it?
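For UTF-8, the BOM is optional and most modern parsers handle it. Some don't: Python's csv module, reading a file opened with encoding "utf-8", leaves U+FEFF at the start of the first field (opening the file with "utf-8-sig" avoids this). Strip if unsure.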
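A short sketch of the difference in Python, using made-up CSV bytes and a hypothetical helper `read_header`: the "utf-8" codec keeps the BOM as a U+FEFF character, while "utf-8-sig" drops it if present.

```python
import csv
import io

# Made-up CSV content that starts with a UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,city\r\nAda,London\r\n"


def read_header(raw_bytes: bytes, encoding: str) -> list[str]:
    """Decode the bytes with the given codec and return the first CSV row."""
    text = raw_bytes.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline="")))


print(read_header(raw, "utf-8"))      # ['\ufeffname', 'city']
print(read_header(raw, "utf-8-sig"))  # ['name', 'city']
```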
Overlong encoding?
Early UTF-8 decoders accepted ASCII characters encoded as multi-byte sequences (e.g. 0x2F as C0 AF). Modern strict UTF-8 (RFC 3629) forbids this because it is a security risk: an overlong "/" can slip past path-traversal filters. Validator flags it.
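To see why C0 AF is dangerous, compare a lenient decoder with a strict one. The function `naive_two_byte_decode` below is a deliberately unsafe illustration written for this page, not code from any real library; it applies the 2-byte bit layout without checking for the shortest form.

```python
def naive_two_byte_decode(b0: int, b1: int) -> str:
    """Unsafe: combine the payload bits of a 2-byte sequence with no validity checks."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def strict_rejects(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


overlong_slash = bytes([0xC0, 0xAF])
print(naive_two_byte_decode(*overlong_slash))  # /
print(strict_rejects(overlong_slash))          # True
```

A filter that scans raw bytes for 0x2F would miss the overlong form, while the lenient decoder would still turn it into "/" later in the pipeline.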
Privacy?
All in browser.
Pro tips
- For "mojibake" debugging, the chain is usually: source encoding → mis-decoded as X → re-encoded as Y. Validator helps identify each step.
- Always strip BOM when concatenating UTF-8 files — multiple BOMs in one file confuse parsers.
- For user uploads, validate before processing. Reject invalid UTF-8 at the boundary, not deep in the pipeline.
- For round-trip safety, use modern strict UTF-8 as defined by RFC 3629 (no overlong encodings, no surrogates).
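Two of the tips above, stripping BOMs when concatenating and rejecting invalid input at the boundary, can be combined in one small helper. This is a sketch under assumptions: the name `concat_utf8` and the choice to raise `ValueError` are illustrative, not part of the tool.

```python
import codecs


def concat_utf8(parts: list[bytes]) -> bytes:
    """Join byte chunks, dropping each leading BOM and rejecting invalid UTF-8."""
    cleaned = []
    for index, part in enumerate(parts):
        if part.startswith(codecs.BOM_UTF8):
            part = part[len(codecs.BOM_UTF8):]
        try:
            part.decode("utf-8")  # strict: rejects overlong forms and surrogates
        except UnicodeDecodeError as exc:
            raise ValueError(f"part {index} is not valid UTF-8") from exc
        cleaned.append(part)
    return b"".join(cleaned)


files = [b"\xef\xbb\xbfalpha\n", b"\xef\xbb\xbfbeta\n"]
print(concat_utf8(files))  # b'alpha\nbeta\n'
```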
Reviewed by Ahsan Mahmood · Last updated 2026-05-06 · Part of ZTools.