CouchCMS • View topic - fixing error "undefined function utf8_decode()"

by lstandish » Thu May 02, 2019 4:44 am

Hi Kamran,

I tried out your clever replacement for the utf8_decode() function. I didn't know which of the characters I use take two bytes (instead of one) in UTF-8. My ignorance was deep enough that I decided to write a PHP test script and use it to compare different implementations of a UTF-8 strlen. Your algorithm passed all my tests. Those who are interested in the test can keep reading.

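I learned that PHP's strlen() returns the byte count of a string, not the character count. Couch used strlen(utf8_decode($str)) to find the character count. That works because utf8_decode() converts the string to ISO-8859-1, where every character occupies exactly one byte (characters that ISO-8859-1 cannot represent become "?"), so the byte count of the converted string equals the character count of the original. For UTF-8 encoded strings, the character count is always less than or equal to the byte count, because a single UTF-8 character takes 1, 2, 3 or 4 bytes.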
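To make the gap between bytes and characters concrete, here is a minimal sketch. It is an illustration only, not Couch code: the helper name utf8_byte_lengths() and the sample characters are my own picks, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of Couch): byte length of each string in a list.
function utf8_byte_lengths(array $chars)
{
    // strlen() counts bytes, so a single multi-byte character reports 2, 3 or 4.
    return array_map('strlen', $chars);
}

// One arbitrary sample character per UTF-8 sequence length.
$samples = array(
    "A",          // U+0041, US-ASCII
    "\u{00E9}",   // U+00E9, e with acute accent
    "\u{6211}",   // U+6211, a CJK character
    "\u{1F600}",  // U+1F600, an emoji
);

foreach (utf8_byte_lengths($samples) as $index => $bytes) {
    echo "sample " . $index . ": " . $bytes . " byte(s)\n";
}

// Four characters joined together occupy ten bytes.
echo "total bytes: " . strlen(implode('', $samples)) . "\n";
```

Each sample is one character, yet strlen() reports 1, 2, 3 and 4 bytes respectively.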

From http://en.wikipedia.org/wiki/UTF-8:

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean [CJK] characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
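The ranges quoted above can be written down as a small lookup. This is a sketch under assumptions: the function name utf8_bytes_for_codepoint() is hypothetical, and the input is assumed to be a valid Unicode code point (it does not reject surrogates or values above U+10FFFF).

```php
<?php
// Hypothetical helper: how many bytes UTF-8 needs to encode one code point.
function utf8_bytes_for_codepoint($cp)
{
    if ($cp < 0x80) {
        return 1;   // the first 128 code points (US-ASCII)
    }
    if ($cp < 0x800) {
        return 2;   // the next 1,920 code points (128 + 1,920 = 2,048 = 0x800)
    }
    if ($cp < 0x10000) {
        return 3;   // the rest of the Basic Multilingual Plane
    }
    return 4;       // the other planes of Unicode
}

echo utf8_bytes_for_codepoint(0x41), "\n";     // Latin capital A
echo utf8_bytes_for_codepoint(0x44F), "\n";    // a Cyrillic letter
echo utf8_bytes_for_codepoint(0x6211), "\n";   // a CJK character
echo utf8_bytes_for_codepoint(0x1F600), "\n";  // an emoji
```

The four calls print 1, 2, 3 and 4, matching the four groups in the quotation.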

To test your new algorithm, I made the following PHP test script:

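<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<?php
// Read the submitted string; fall back to an empty string if nothing was posted.
$str_utf = isset($_POST['str_utf']) ? $_POST['str_utf'] : '';
// Escape the echoed value so that submitted markup is not rendered by the browser.
echo "testing string: " . htmlspecialchars($str_utf, ENT_QUOTES, 'UTF-8') . "<br>";
// strlen() counts bytes, not characters.
echo "Bytes: " . strlen($str_utf) . "<br>";

if( function_exists('utf8_decode') ){
    echo "strlen (utf8_decode(str)): " . strlen( utf8_decode($str_utf) ) . "<br>";
}
if( function_exists('mb_strlen') ){
    echo "mb_strlen: " . mb_strlen($str_utf,'UTF-8') . "<br>";
}
// adapted from Symfony Polyfill (https://github.com/symfony/polyfill)
// Maps the high nibble of a lead byte to the byte length of its UTF-8 sequence.
$ulen_mask = array( "\xC0" => 2, "\xD0" => 2, "\xE0" => 3, "\xF0" => 4 );
$i = $j = 0;
$len = strlen( $str_utf );
while( $i < $len ){
    $u = $str_utf[$i] & "\xF0";                         // keep only the high nibble
    $i += isset($ulen_mask[$u]) ? $ulen_mask[$u] : 1;   // jump over the whole sequence
    ++$j;                                               // one more character counted
}
echo "Symfony Polyfill function string length: " . $j;
?>
</body>
</html>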

Note that this script also tests the mb_strlen() function. However, that function belongs to the mbstring extension, which is not enabled by default in PHP, and I didn't bother installing and enabling it on my computer (localhost).
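For reuse outside the test page, the counting loop from the script can be wrapped in a function. This is a minimal sketch: the name utf8_strlen_fallback() is hypothetical (it is neither Couch's nor Symfony's name), the loop itself is the one shown in the script, and the "\u{...}" escapes in the usage lines assume PHP 7 or later.

```php
<?php
// Hypothetical wrapper around the loop from the test script
// (adapted from Symfony Polyfill, https://github.com/symfony/polyfill).
function utf8_strlen_fallback($str)
{
    // High nibble of a lead byte => number of bytes in that UTF-8 sequence.
    $ulen_mask = array( "\xC0" => 2, "\xD0" => 2, "\xE0" => 3, "\xF0" => 4 );
    $i = $j = 0;
    $len = strlen($str);
    while ($i < $len) {
        $u = $str[$i] & "\xF0";                             // bitwise AND on one-byte strings
        $i += isset($ulen_mask[$u]) ? $ulen_mask[$u] : 1;   // anything else counts as one byte
        ++$j;
    }
    return $j;
}

echo utf8_strlen_fallback("abc"), "\n";                    // ASCII only
echo utf8_strlen_fallback("\u{00E9}t\u{00E9}"), "\n";      // 5 bytes
echo utf8_strlen_fallback("\u{6211}\u{1F600}123"), "\n";   // 10 bytes
```

The three calls print 3, 3 and 5. The loop only inspects lead bytes and does not validate its input, so a malformed UTF-8 string may give a count that differs from what mb_strlen() would report.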

Here's a simple HTML form to take test values:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<form action="/test.php" method="POST">
Test UTF-8 string: <input type="text" name="str_utf">
<button type="submit">Submit</button>
</form>
</body>
</html>

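And here's the result of a test run:

testing string: 我能吞下玻璃而不伤身体。123
Bytes: 39
strlen (utf8_decode(str)): 15
Symfony Polyfill function string length: 15

The numbers check out: the 12 Chinese characters (including the full stop 。) take 3 bytes each and the 3 digits take 1 byte each, so 12 × 3 + 3 = 39 bytes for 15 characters.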
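As an independent cross-check of that run, here is a sketch of a different counting technique: count every byte that is not a UTF-8 continuation byte (a byte of the form 10xxxxxx). This is not the algorithm tested above, the function name utf8_count_lead_bytes() is hypothetical, and the snippet assumes the source file itself is saved as UTF-8.

```php
<?php
// Hypothetical cross-check: every character contributes exactly one byte
// that is NOT a continuation byte, so counting those bytes counts characters.
function utf8_count_lead_bytes($str)
{
    $count = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        // Continuation bytes have the top two bits set to 10.
        if ((ord($str[$i]) & 0xC0) !== 0x80) {
            ++$count;
        }
    }
    return $count;
}

$test = "我能吞下玻璃而不伤身体。123";    // the string from the test run
echo strlen($test), "\n";                  // byte count
echo utf8_count_lead_bytes($test), "\n";   // character count
```

For the test string this prints 39 and 15, the same values as in the run.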
