PHP UTF-8 Validation: Validate and repair strings in UTF-8 encoding

Version		License		PHP version		Categories
`utf8validation` 1.0.0		Public Domain		7		PHP 5, Text processing, Validation

Description

Author

Ray Paseur

This class can validate and repair strings in UTF-8 encoding.

It takes a text string and checks if the characters are valid in UTF-8.

The class can also repair an invalid string by removing some invalid UTF-8 character sequences and Byte-Order Marks.

The class can return an object instance of itself with the string, byte length, character count, and the position of any encoding errors.

Innovation Award

February 2019
Number 2

Sometimes flawed applications generate UTF-8 text that may be malformed.

This class can check a given text string for problems with its UTF-8 encoding.

If the encoding is malformed, the class can also attempt to repair it.

Manuel Lemos

Example


<?php // classes/demo_UTF8.php

/**
 * This script uses class_UTF8 to determine if a string is UTF-8 compatible.
 *
 * The constructor receives a string and returns an object containing the
 * string and a validity indicator.  If the string fails UTF-8 validation,
 * the offset location of the failures will be provided in an array in the
 * "error" property.
 *
 * The class can also attempt to repair damaged encodings, but the outcome
 * of repairs is less certain.  PHP's utf8_encode() converts ISO-8859-1
 * bytes above hex 7F into two-byte UTF-8 sequences that begin with hex C2
 * or C3.
 */
error_reporting(E_ALL);
require_once('class_UTF8.php');

echo '<meta charset="utf-8" />';
echo '<pre>';

// Some UTF-8 test data - both good and bad
$arr =
[ 'ABCDEF'
, '14�F is cold!'
, 'Gr��e'
, '�'
, chr(0xC3) . chr(0x86)               // AE Ligature in UTF-8
, chr(0xE2) . chr(0x82) . chr(0xAC)   // Euro in UTF-8

// These are examples of bad UTF-8: single bytes in the range 128 to 255
// that are not part of a valid multi-byte sequence
, chr(0xC6) . ' AE Ligature'
, 'Accented "a" ' . chr(0xE0) . ' in this string'
, 'Several ' . chr(0x80) . ' Euro ' . chr(0x80) . ' symbols ' . chr(0x80) . ' in ' . chr(0x80) . ' text'

// A UTF-8 nemesis from MSFT Notepad
, chr(0xEF) . chr(0xBB) . chr(0xBF) . 'Thanks for the BOM, Notepad'

// A Bogus character that should not be translated
, 'Bogus 0x81: ' . chr(0x81)

// Anthony Ferrara test data
, chr(0xC0) . chr(0x80)          // Overlong encoding of code point 0
, chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80)  // Overlong 5-byte encoding
, chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80)  // Overlong 6-byte encoding
, chr(0xD0) . chr(0x01)          // Lead byte without a valid continuation byte
, chr(0x01) . chr(0x01) . chr(0x01) // Actually valid ;-)
];

echo '<h3>Data Not Repaired</h3>';
foreach ($arr as $str) {
    hexdump($str);
    echo PHP_EOL;

    $obj = new UTF8($str);
    hexdump($obj->str);
    print_r($obj);
    echo PHP_EOL;
}

// Some Bad UTF-8 test data that we attempt to repair
$bad =
[ 'AE Ligature at end: ' . chr(0xC6)
, 'Pound at end: ' . chr(0xA3)
, 'The ' . chr(0x80) . ' Euro symbol'
, 'Several ' . chr(0x80) . ' Euro ' . chr(0x80) . ' symbols ' . chr(0x80) . ' in ' . chr(0x80) . ' text'

// A Bogus character that cannot be translated
, 'Bogus 0x81: ' . chr(0x81)
];

echo '<h3>Data Repair Attempted</h3>';

foreach ($bad as $str) {
    hexdump($str);
    echo PHP_EOL;

    $obj = new UTF8($str, TRUE);
    hexdump($obj->str);
    print_r($obj);
    echo PHP_EOL;
}

// Unrelated utility function to show us the hex byte values
function hexdump($str, $br=PHP_EOL)
{
    // Compare against the empty string; empty() would also reject '0'
    if ((string) $str === '') return FALSE;

    // Get the hex byte values as an array of single hex digits
    // (an empty-string separator avoids the deprecated NULL separator)
    $hex = str_split(implode('', unpack('H*', $str)));

    // Allocate bytes into hi and lo nibbles
    $hi  = '';
    $lo  = '';
    $mod = 0;
    foreach ($hex as $nib)
    {
        $mod++;
        $mod = $mod % 2;
        if ($mod) {
            $hi .= $nib;
        }
        else {
            $lo .= $nib;
        }
    }

    // Show the scale, the string and the hex
    $num = substr('1...5...10...15...20...25...30...35...40...45...50...55...60...65...70...75...80...85...90...95..100..105..110..115..120..125..130', 0, strlen($str));
    echo $br . $num;
    echo $br . $str;
    echo $br . $hi;
    echo $br . $lo;
    echo $br;
}

Details

Class UTF8 Readme

Discussion:

UTF-8 is a widely accepted character encoding scheme.  Its genius lies in two
special characteristics: It encompasses ASCII (7-bit) encoding without any
changes, thus making it backward-compatible with the overwhelming majority of
western data sets, both modern and ancient.  And it is self-evident, requiring
no special programming to use.  UTF-8 is amazingly expansive, offering so 
many character interpretations that it can represent any character in any 
human language.

UTF-8 characters may "collide" with extended-ASCII (also called ANSI) because
the extended-ASCII uses one-byte characters above code point 7F.  The high
order bit of a byte is of significance in the UTF-8 encoding scheme.  UTF-8,
therefore, has different (multi-byte) encoding for the ANSI characters in the
range from 80 to FF (128 to 255).  For example, the copyright symbol, a little
letter "c" in a circle, is produced at ANSI code point hexadecimal A9 (169).  
This same symbol is represented by a two-byte encoding in UTF-8: C2 A9.
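
The collision described above can be seen in a few lines of PHP.  This is
only an illustrative sketch: it assumes the mbstring extension is enabled,
and the variable names are made up for the example.

```php
<?php
// Copyright symbol as a single ISO-8859-1 (extended-ASCII) byte
$latin1 = "\xA9";

// Convert it to UTF-8, telling PHP explicitly what the source encoding is
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($latin1)), PHP_EOL;  // A9
echo strtoupper(bin2hex($utf8)), PHP_EOL;    // C2A9

// A lone A9 byte is not valid UTF-8; the two-byte form is
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)
```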

The overwhelming majority of UTF-8 errors arise when extended-ASCII characters
are passed to algorithms that expect UTF-8.  Many European accented letters
and common symbols are represented in ISO-8859-1 via the one-byte range from 
hex 80 to hex FF.  In that one-byte form, these characters are not valid in
JSON or in XML that is encoded as UTF-8.  They must either be converted to
entities or converted to UTF-8 multi-byte characters.
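
As a rough illustration of the JSON case, the sketch below feeds a one-byte
accented letter to json_encode() and then tries the two remedies named
above.  It assumes the input really is ISO-8859-1 and that the mbstring
extension is enabled; the sample string is invented for the example.

```php
<?php
// "cafe" with a one-byte ISO-8859-1 e-acute (hex E9) as the last letter
$latin1 = "caf\xE9";

// json_encode() expects UTF-8 and fails on the stray E9 byte
$encoded = json_encode($latin1);
$error   = json_last_error();
var_dump($encoded);                       // bool(false)
var_dump($error === JSON_ERROR_UTF8);     // bool(true)

// Remedy 1: convert to UTF-8 multi-byte characters
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
echo json_encode($utf8), PHP_EOL;         // "caf\u00e9"

// Remedy 2: convert to entities, naming the source encoding explicitly
$entities = htmlentities($latin1, ENT_QUOTES, 'ISO-8859-1');
echo $entities, PHP_EOL;                  // caf&eacute;
```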

PHP native functions exist to convert between extended-ASCII and UTF-8 (and
other encoding schemes), but these native functions cannot detect the
encoding scheme of their input.  It is our obligation as programmers
to know the encoding scheme of any data we receive.  It is our obligation as
programmers to produce our data in a well-identified and predictable encoding
scheme.  The best and most widely accepted scheme is UTF-8.
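
A short sketch of why the source encoding has to be known: if a string that
is already UTF-8 is converted again as though it were ISO-8859-1, the
result is still valid UTF-8 but the text is garbled.  The example assumes
the mbstring extension is enabled.

```php
<?php
// e-acute, already correctly encoded as UTF-8 (C3 A9)
$utf8 = "\xC3\xA9";

// Mistake: treating the UTF-8 bytes as two ISO-8859-1 characters
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;    // c3a9
echo bin2hex($double), PHP_EOL;  // c383c2a9

// Validation alone cannot catch this: the garbled result is valid UTF-8
var_dump(mb_check_encoding($double, 'UTF-8'));  // bool(true)
```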

Since PHP 5.6, the default_charset setting defaults to UTF-8, and many
string functions use it as their default character encoding.

Operation:

This class constructor receives three arguments: (1) a string, (2) a boolean
telling whether to attempt to decode ISO-8859-1 (default FALSE), (3) a 
boolean telling whether to remove any Byte-Order Mark (default TRUE).  The
constructor returns an object containing the string and a validity indicator.  
If the string fails UTF-8 validation, the offset location of the failures 
may be provided in an array in the "error" property.  The byte length and 
character count are also returned.  If the "error" property is empty, the 
"str" property is valid UTF-8, and the byte length and character count are 
probably accurate.  However, if the class is given unpredictable data and is
asked to decode ISO-8859-1, garbled output may occur.  This is an unavoidable
artifact of changing character set encoding without an understanding of the
existing character set encoding.
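
The idea of reporting error offsets can be sketched as follows.  This is
not the class's own code: utf8_error_offsets() is a hypothetical helper
written for illustration, it reports zero-based byte offsets (the class may
count differently), and it assumes the mbstring extension is enabled.

```php
<?php
/**
 * Hypothetical helper: return the zero-based byte offsets at which
 * a string stops being valid UTF-8.
 */
function utf8_error_offsets(string $str): array
{
    $errors = [];
    $len    = strlen($str);
    $i      = 0;

    while ($i < $len) {
        $byte = ord($str[$i]);

        // Decide how many continuation bytes the lead byte calls for
        if ($byte < 0x80) {
            $need = 0;                       // plain ASCII
        } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
            $need = 1;
        } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
            $need = 2;
        } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
            $need = 3;
        } else {
            // 80-C1 and F5-FF can never start a valid sequence
            $errors[] = $i;
            $i++;
            continue;
        }

        // Let mbstring judge the candidate sequence (overlongs, surrogates)
        $seq = substr($str, $i, $need + 1);
        if (strlen($seq) === $need + 1 && mb_check_encoding($seq, 'UTF-8')) {
            $i += $need + 1;
        } else {
            $errors[] = $i;
            $i++;
        }
    }

    return $errors;
}

// Usage: one stray byte at offset 12, after the 12-byte ASCII prefix
$sample = 'Bogus 0x81: ' . chr(0x81);
print_r(utf8_error_offsets($sample));
echo strlen($sample), PHP_EOL;   // byte length: 13
```

An empty array from this helper plays the same role as an empty "error"
property: the string passed validation.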

UTF-8 does not require or benefit from a Byte-Order Mark, yet some programs
(e.g., Microsoft Notepad) will still put a BOM into their files.  This class
will, by default, remove the unnecessary and unwanted BOM(s), if any, from 
the input strings.
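
A minimal sketch of BOM removal follows.  The function name
strip_utf8_bom() is hypothetical, and the sketch only strips BOMs at the
start of the string; whether the class also removes BOMs found elsewhere
is not stated here.

```php
<?php
/**
 * Hypothetical helper: remove one or more UTF-8 Byte-Order Marks
 * (hex EF BB BF) from the start of a string.
 */
function strip_utf8_bom(string $str): string
{
    $bom = "\xEF\xBB\xBF";

    // Loop so that repeated leading BOMs (e.g. from joined files) all go
    while (strncmp($str, $bom, 3) === 0) {
        $str = substr($str, 3);
    }

    return $str;
}

$input = chr(0xEF) . chr(0xBB) . chr(0xBF) . 'Thanks for the BOM, Notepad';
echo strip_utf8_bom($input), PHP_EOL;   // Thanks for the BOM, Notepad
```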

A method of the class, "extended_ascii_to_utf8()", provides a conversion
that is more accurate than the native PHP functions.
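
The method itself is not reproduced here.  One plausible reason a custom
conversion can do better, offered only as an assumption, is that bytes 80
to 9F are printable characters in Windows-1252 but control codes in
ISO-8859-1.  The sketch below shows the difference for the Euro sign,
assuming the mbstring extension is enabled.

```php
<?php
// Hex 80 is the Euro sign in Windows-1252
$byte = "\x80";

// Read as ISO-8859-1, the byte becomes the control code U+0080
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Read as Windows-1252, the byte becomes the Euro sign U+20AC
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), PHP_EOL;  // c280
echo bin2hex($asCp1252), PHP_EOL;  // e282ac
```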

See the "demo" script for examples.

References:

   https://www.joelonsoftware.com/articles/Unicode.html (Old but wonderful)
   https://iconoun.com/articles/collisions/ (My take on the issues)
   https://stackoverflow.com/a/11709412 (Tony Ferrara did good work here)
   https://www.unicode.org/versions/Unicode11.0.0/
   https://www.unicode.org/reports/tr36/#Ill-Formed_Subsequences
   http://php.net/manual/en/book.mbstring.php
   http://php.net/manual/en/function.utf8-encode.php
   http://php.net/manual/en/function.chr.php
   http://www.asciitable.com/
   http://en.wikipedia.org/wiki/UTF-8
   http://php.net/manual/en/function.mb-detect-encoding.php#112391

Files (3)

File	Role	Description
`class_UTF8.php`	Class	Class_UTF8 Source
`demo_UTF8.php`	Example	Demonstration Script
`readme_UTF8.txt`	Doc.	Readme text file
