PHP Classes

PHP UTF-8 Validation: Validate and repair strings in UTF-8 encoding

Ratings: Not enough user ratings
Unique User Downloads: Total: 325
Download Rankings: All time: 7,181; This week: 107 (up)
Version: utf8validation 1.0.0
License: Public Domain
PHP version: 7
Categories: PHP 5, Text processing, Validation
Description 

This class can validate and repair strings in UTF-8 encoding.

It takes a text string and checks if the characters are valid in UTF-8.

The class can also attempt to repair an invalid string by removing some invalid UTF-8 character sequences and Byte-Order Marks.

The class can return an object instance of itself with the string, byte length, character count, and the position of any encoding errors.
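
The distinction between byte length and character count can be illustrated without the class itself. The following is a minimal sketch, assuming the mbstring extension is loaded; the function name `utf8_summary` and the array keys are hypothetical and are not the class's actual API.

```php
<?php
// Hypothetical helper (not part of class_UTF8): reports validity,
// byte length and character count for a string.
function utf8_summary($str)
{
    $valid = mb_check_encoding($str, 'UTF-8');

    return [
        'str'   => $str,
        'valid' => $valid,
        'bytes' => strlen($str),                              // raw byte count
        'chars' => $valid ? mb_strlen($str, 'UTF-8') : null,  // only meaningful for valid UTF-8
    ];
}

// "Größe" written as explicit UTF-8 bytes: 5 characters stored in 7 bytes
$info = utf8_summary("Gr\xC3\xB6\xC3\x9Fe");
echo $info['bytes'] . ' bytes, ' . $info['chars'] . ' characters' . PHP_EOL;
```

For a string that fails validation the sketch leaves the character count as `null` rather than guessing.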

Innovation Award
PHP Programming Innovation award nominee
February 2019
Number 2
Flawed applications sometimes generate UTF-8 text that may be malformed.

This class can check a given text string for problems in its UTF-8 encoding.

If the encoding is malformed, the class can also attempt to repair the string.

Manuel Lemos
Name: Ray Paseur (available for paid consulting)
Classes: 8 packages
Country: United States
Age: 73
All time rank: 2240311 in United States
Week rank: 312 Up38 in United States Up
Innovation award
Nominee: 5x

Winner: 1x

Example

<?php // classes/demo_UTF8.php
/**
 * This script uses class_UTF8 to determine if a string is UTF-8 compatible.
 *
 * The constructor receives a string and returns an object containing the
 * string and a validity indicator. If the string fails UTF-8 validation,
 * the offset location of the failures will be provided in an array in the
 * "error" property.
 *
 * The class can also attempt to repair damaged encodings, but the outcome
 * of repairs is less certain. PHP converts extended ASCII into UTF-8 by
 * putting hex C0 in front of the extended ASCII characters, thus
 *
 */
error_reporting(E_ALL);
require_once('class_UTF8.php');

echo '<meta charset="utf-8" />';
echo '<pre>';


// Some UTF-8 test data - both good and bad
$arr =
[ 'ABCDEF'
, '14°F is cold!'
, 'Größe'
, '©'
, chr(0xC3) . chr(0x86) // AE Ligature in UTF-8
, chr(0xE2) . chr(0x82) . chr(0xAC) // Euro in UTF-8

// These are examples of bad UTF-8 because they contain lone bytes in the range 127 < byte < 256
, chr(0xC6) . ' AE Ligature'
, 'Accented "a" ' . chr(0xE0) . ' in this string'
, 'Several ' . chr(0x80) . ' Euro ' . chr(0x80) . ' symbols ' . chr(0x80) . ' in ' . chr(0x80) . ' text'

// A UTF-8 nemesis from MSFT Notepad
, chr(0xEF) . chr(0xBB) . chr(0xBF) . 'Thanks for the BOM, Notepad'

// A Bogus character that should not be translated
, 'Bogus 0x81: ' . chr(0x81)

// Anthony Ferrara test data
, chr(0xC0) . chr(0x80) // Overlong encoding of code point 0
, chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) // Overlong encoding of 5 byte encoding
, chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) // Overlong encoding of 6 byte encoding
, chr(0xD0) . chr(0x01) // High code-point without trailing characters
, chr(0x01) . chr(0x01) . chr(0x01) // Actually valid ;-)

];


echo '<h3>Data Not Repaired</h3>';
foreach ($arr as $str) {
    hexdump($str);
    echo PHP_EOL;

    $obj = new UTF8($str);
    hexdump($obj->str);
    print_r($obj);
    echo PHP_EOL;
}


// Some Bad UTF-8 test data that we attempt to repair
$bad =
[ 'AE Ligature at end: ' . chr(0xC6)
, 'Pound at end: ' . chr(0xA3)
, 'The ' . chr(0x80) . ' Euro symbol'
, 'Several ' . chr(0x80) . ' Euro ' . chr(0x80) . ' symbols ' . chr(0x80) . ' in ' . chr(0x80) . ' text'

// A Bogus character that cannot be translated
, 'Bogus 0x81: ' . chr(0x81)
];

echo '<h3>Data Repair Attempted</h3>';

foreach ($bad as $str) {
    hexdump($str);
    echo PHP_EOL;

    $obj = new UTF8($str, TRUE);
    hexdump($obj->str);
    print_r($obj);
    echo PHP_EOL;
}



// Unrelated utility function to show us the hex byte values
function hexdump($str, $br=PHP_EOL)
{
    // Nothing to show for an empty string (note: the string '0' is still dumped)
    if ((string) $str === '') return FALSE;

    // Get the hex byte values in a string, one nibble per array element
    $hex = str_split(implode('', unpack('H*', $str)));

    // Allocate bytes into hi and lo nibbles
    $hi = '';
    $lo = '';
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod) {
            $hi .= $nib;
        }
        else {
            $lo .= $nib;
        }
    }

    // Show the scale, the string and the hex
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}
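
The demo relies on the class to report the offsets of encoding failures. As an independent illustration of how such offsets could be located, here is a minimal sketch of a byte scanner that follows the well-formed byte ranges of the UTF-8 definition. The function name `utf8_error_offsets` is hypothetical, and the offsets the real class reports may differ from what this sketch produces.

```php
<?php
// Hypothetical function (not the class's implementation): returns the byte
// offsets at which an ill-formed UTF-8 sequence begins.
function utf8_error_offsets($str)
{
    $errors = [];
    $len    = strlen($str);
    $i      = 0;

    while ($i < $len) {
        $b = ord($str[$i]);

        // Plain ASCII is always valid
        if ($b < 0x80) { $i++; continue; }

        // Pick the number of continuation bytes and the allowed range
        // of the FIRST continuation byte for this lead byte
        if     ($b >= 0xC2 && $b <= 0xDF) { $need = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xE0)              { $need = 2; $lo = 0xA0; $hi = 0xBF; } // blocks overlongs
        elseif ($b === 0xED)              { $need = 2; $lo = 0x80; $hi = 0x9F; } // blocks surrogates
        elseif ($b >= 0xE1 && $b <= 0xEF) { $need = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF0)              { $need = 3; $lo = 0x90; $hi = 0xBF; } // blocks overlongs
        elseif ($b >= 0xF1 && $b <= 0xF3) { $need = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF4)              { $need = 3; $lo = 0x80; $hi = 0x8F; } // caps at U+10FFFF
        else {
            // 0x80-0xC1 and 0xF5-0xFF can never start a sequence
            $errors[] = $i;
            $i++;
            continue;
        }

        // All continuation bytes must be present and in range
        $ok = ($i + $need < $len);
        for ($k = 1; $ok && $k <= $need; $k++) {
            $c   = ord($str[$i + $k]);
            $min = ($k === 1) ? $lo : 0x80;
            $max = ($k === 1) ? $hi : 0xBF;
            if ($c < $min || $c > $max) {
                $ok = false;
            }
        }

        if ($ok) {
            $i += $need + 1;
        } else {
            $errors[] = $i;   // record the lead byte and resume at the next byte
            $i++;
        }
    }

    return $errors;
}

print_r(utf8_error_offsets('Bogus 0x81: ' . chr(0x81)));
```

Resuming the scan one byte after a bad lead byte is a design choice: it reports every stray byte individually, which is why an overlong pair such as C0 80 yields two offsets.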


Details

Class UTF8 Readme

Discussion:

UTF-8 is a widely accepted character encoding scheme. Its genius lies in two special characteristics. First, it encompasses ASCII (7-bit) encoding without any changes, making it backward-compatible with the overwhelming majority of western data sets, both modern and ancient. Second, it is self-evident, requiring no special programming to use. UTF-8 is also amazingly expansive: it can encode every Unicode code point, which covers the characters of virtually every written language.

UTF-8 characters may "collide" with extended-ASCII (also called ANSI) because extended-ASCII uses one-byte characters above code point 7F, and the high-order bit of a byte is significant in the UTF-8 encoding scheme. UTF-8 therefore has a different (multi-byte) encoding for the ANSI characters in the range from 80 to FF (128 to 255). For example, the copyright symbol, a little letter "c" in a circle, sits at ANSI code point hexadecimal A9 (169). The same symbol is represented by a two-byte encoding in UTF-8: C2 A9.

The overwhelming majority of UTF-8 errors arise when extended-ASCII characters are passed to algorithms that expect UTF-8. Many European accented letters and common symbols are represented in ISO-8859-1 via the one-byte range from hex 80 to hex FF. As raw single bytes, these characters cannot be used in UTF-8 encoded XML or JSON. They must either be converted to entities or converted to UTF-8 multi-byte characters.

Native PHP functions exist to convert between extended-ASCII and UTF-8 (and other encoding schemes), but these functions do not detect the encoding scheme of their input. It is our obligation as programmers to know the encoding scheme of any data we receive, and to produce our data in a well-identified and predictable encoding scheme. The best and most widely accepted scheme is UTF-8. Since PHP 5.6, UTF-8 has been PHP's default character encoding.
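
The copyright-symbol collision described in the discussion can be reproduced in a few lines. This is a minimal sketch using only native PHP functions, assuming the mbstring and json extensions are available; the variable names are illustrative only.

```php
<?php
$ansi = "\xA9";      // copyright sign as a single ISO-8859-1 byte
$utf8 = "\xC2\xA9";  // the same character encoded as UTF-8

// The lone byte is not valid UTF-8; the two-byte sequence is
$ansiIsUtf8 = mb_check_encoding($ansi, 'UTF-8');   // false
$utf8IsUtf8 = mb_check_encoding($utf8, 'UTF-8');   // true

// Conversion only works because we state the source encoding ourselves
$converted = mb_convert_encoding($ansi, 'UTF-8', 'ISO-8859-1');

// json_encode() refuses the raw byte but accepts the converted string
$rawJson  = json_encode($ansi);        // false
$rawError = json_last_error();         // JSON_ERROR_UTF8
$goodJson = json_encode($converted);   // the string "\u00a9" in quotes

var_dump($ansiIsUtf8, $utf8IsUtf8, $converted === $utf8, $rawJson, $goodJson);
```

The conversion call is only correct when the input really is ISO-8859-1; applied to data that is already UTF-8 it would double-encode the text, which is the "know your encoding" obligation the discussion stresses.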
Operation:

The class constructor receives three arguments: (1) a string, (2) a boolean telling whether to attempt to decode ISO-8859-1 (default FALSE), (3) a boolean telling whether to remove any Byte-Order Mark (default TRUE).

The constructor returns an object containing the string and a validity indicator. If the string fails UTF-8 validation, the offset location of the failures may be provided in an array in the "error" property. The byte length and character count are also returned. If the "error" property is empty, the "str" property is valid UTF-8, and the byte length and character count are probably accurate.

However, if the class is given unpredictable data and is asked to decode ISO-8859-1, garbled output may occur. This is an unavoidable artifact of changing character set encoding without an understanding of the existing character set encoding.

UTF-8 does not require or benefit from a Byte-Order Mark, yet some programs (e.g. Microsoft Notepad) will still put a BOM into their files. This class will, by default, remove the unnecessary and unwanted BOM(s), if any, from the input strings.

A method of the class, "extended_ascii_to_utf8()", provides a conversion that is more accurate than the native PHP functions. See the "demo" script for examples.

References:

https://www.joelonsoftware.com/articles/Unicode.html (Old but wonderful)
https://iconoun.com/articles/collisions/ (My take on the issues)
https://stackoverflow.com/a/11709412 (Tony Ferrara did good work here)
https://www.unicode.org/versions/Unicode11.0.0/
https://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences
http://php.net/manual/en/book.mbstring.php
http://php.net/manual/en/function.utf8-encode.php
http://php.net/manual/en/function.chr.php
http://www.asciitable.com/
http://en.wikipedia.org/wiki/UTF-8
http://php.net/manual/en/function.mb-detect-encoding.php#112391
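
The Byte-Order Mark handling mentioned under Operation can be sketched independently of the class. The helper below is hypothetical (`strip_utf8_bom` is not a method of the class) and assumes that only BOMs at the start of the string should be removed; whether the class also removes BOMs found elsewhere in the string is not stated in the readme.

```php
<?php
// Hypothetical helper: removes one or more UTF-8 BOMs (bytes EF BB BF)
// from the START of a string and leaves the rest untouched.
function strip_utf8_bom($str)
{
    // No /u modifier: the pattern matches the raw bytes directly
    return preg_replace("/^(?:\xEF\xBB\xBF)+/", '', $str);
}

$clean = strip_utf8_bom(chr(0xEF) . chr(0xBB) . chr(0xBF) . 'Thanks for the BOM, Notepad');
echo $clean . PHP_EOL;
```

Restricting the removal to the start of the string is a deliberate choice in this sketch, because the same three bytes inside a string encode U+FEFF, which older text may use as a zero-width no-break space.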

Files

File             Role     Description
class_UTF8.php   Class    Class_UTF8 Source
demo_UTF8.php    Example  Demonstration Script
readme_UTF8.txt  Doc.     Readme text file

User Comments (1)
Thats a very good and useful class !
5 years ago (José Filipe Lopes Santos)