Problems, need help? Have a tip or advice? Post it here.
5 posts Page 1 of 1
Hi Everyone,

When installing on a new server or on your localhost, opening the Couch admin section may fail with an error 500 "file not found" in your browser and the following error in your server logs:
Code:
PHP Fatal error:  Uncaught Error: Call to undefined function utf8_decode()


On Debian/Ubuntu servers, this is fixed by the following command:
Code:
apt-get install php7.1-xml

(Replace "7.1" with your PHP version. Note that PHP versions below 7.1 are no longer supported by the PHP developers.)
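
To confirm whether a given PHP install is affected, before or after installing the package, a quick check along these lines may help. This is only a minimal sketch: it reports what the PHP binary running the script has loaded, and the command-line PHP and the web server's PHP can use different configurations, so it is best run through the web server.

```php
<?php
// Report whether the XML extension and utf8_decode() are available
// to the PHP process that runs this script.
$checks = array(
    'xml extension loaded'    => extension_loaded('xml'),
    'utf8_decode() available' => function_exists('utf8_decode'),
);

foreach ($checks as $label => $ok) {
    echo $label . ': ' . ($ok ? 'yes' : 'no') . "\n";
}
```

If the second line reports "no", the fatal error above is the expected result.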
Thanks for sharing the solution, Lloyd.

It appears that from PHP 7.0, the XML extension has been split into a separate package and is not installed automatically.
We could see this issue crop up more and more often now, so I'll see if we can put in an alternative.
Lloyd, I have just pushed a commit to GitHub trying to solve this issue.
Would it be possible for you to test it and let me know if it works as expected?

Thanks.
Hi Kamran,

I tried out your clever replacement for the utf8_decode() function. I didn't know which of the characters I use take 2 bytes (instead of one) in UTF-8. My ignorance was so deep that I decided to write a test PHP script and use it to compare different implementations of a UTF-8 strlen. Your algorithm passed all my tests. Those who are interested in the test can keep reading.

I learned that PHP's strlen() gives the byte count of a string. Couch used strlen(utf8_decode(str)) to find the character count. For UTF-8 encoded strings, the character count is always equal to or less than the byte count, because UTF-8 characters take 1, 2, 3 or 4 bytes each.
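
A small sketch of why the strlen(utf8_decode(str)) trick works: utf8_decode() converts the string to ISO-8859-1, where every character is a single byte (characters that don't exist in ISO-8859-1 become "?"), so the byte count of the result equals the character count. The sample string is my own, and the call is guarded because the function may be missing, which is the whole problem in this thread.

```php
<?php
// "é" is U+00E9, which UTF-8 encodes as the two bytes C3 A9.
$str = "caf\xC3\xA9";

// strlen() counts bytes, so the two-byte "é" makes this 5, not 4.
$bytes = strlen($str);

// After utf8_decode() every character occupies one byte, so strlen()
// of the result is the character count. null means "function missing".
$chars = function_exists('utf8_decode') ? strlen(utf8_decode($str)) : null;

echo "Bytes: $bytes\n";
echo "Characters: " . ($chars === null ? "utf8_decode() unavailable" : $chars) . "\n";
```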

From http://en.wikipedia.org/wiki/UTF-8:

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean [CJK] characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
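
The four size classes quoted above can be seen directly with strlen(). In this sketch the sample characters are my own picks, one per class, written as raw byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
// One sample character per UTF-8 size class, given as its raw bytes.
$samples = array(
    'U+0041 (Latin A)'          => "A",
    'U+00E9 (e with acute)'     => "\xC3\xA9",
    'U+6211 (CJK character)'    => "\xE6\x88\x91",
    'U+1F600 (emoji)'           => "\xF0\x9F\x98\x80",
);

// strlen() counts bytes, so it shows how many bytes each character needs.
$lengths = array_map('strlen', $samples);

foreach ($lengths as $label => $n) {
    echo "$label: $n byte(s)\n";
}
```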


To test your new algorithm, I made the following test php script:
Code:
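<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<?php
// Read the submitted string; fall back to an empty string if nothing was posted.
$str_utf = isset($_POST['str_utf']) ? $_POST['str_utf'] : '';
// Escape the value before echoing it back into the page.
echo "testing string: " . htmlspecialchars($str_utf, ENT_QUOTES, 'UTF-8') . "<br>";

// strlen() counts bytes, not characters.
echo "Bytes: " . strlen($str_utf) . "<br>";

// The approach Couch used; only runs when utf8_decode() is available.
if( function_exists('utf8_decode') ){
    echo "strlen (utf8_decode(str)): " . strlen( utf8_decode($str_utf) ) . "<br>";
}
// Only runs when the mbstring extension is installed.
if( function_exists('mb_strlen') ){
    echo "mb_strlen: " . mb_strlen($str_utf,'UTF-8') . "<br>";
}
// adapted from Symfony Polyfill (https://github.com/symfony/polyfill)
// The high four bits of a lead byte give the length of its UTF-8 sequence;
// any other byte (ASCII) advances by one.
$ulen_mask = array( "\xC0" => 2, "\xD0" => 2, "\xE0" => 3, "\xF0" => 4 );
$i = $j = 0;
$len = strlen( $str_utf );
while( $i < $len ){
    $u = $str_utf[$i] & "\xF0";
    $i += isset($ulen_mask[$u]) ? $ulen_mask[$u] : 1;
    ++$j;
}
echo "Symfony Polyfill function string length: " . $j;
?>
</body>
</html>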


Note that this script also tests the mb_strlen() function. However, that function comes from the mbstring extension, which is not enabled by default in PHP, and I didn't bother installing and enabling it on my computer (localhost).
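
For anyone who wants to try the counting loop outside a web page, here is the same logic wrapped in a function. The name utf8_char_count() is just something I made up for this sketch, not a Couch or Symfony function, and it assumes the input is valid UTF-8 (for broken input the count may be off).

```php
<?php
// Hypothetical helper: counts characters in a valid UTF-8 string
// without needing utf8_decode() or mb_strlen().
function utf8_char_count($s)
{
    // High nibble of a lead byte => length of its sequence in bytes:
    //   C0-DF (110xxxxx) => 2, E0-EF (1110xxxx) => 3, F0-F7 (11110xxx) => 4.
    // Any other byte (ASCII, 00-7F) is a one-byte character.
    static $ulen_mask = array("\xC0" => 2, "\xD0" => 2, "\xE0" => 3, "\xF0" => 4);

    $i = $count = 0;
    $len = strlen($s);
    while ($i < $len) {
        $u = $s[$i] & "\xF0";                              // keep the high nibble only
        $i += isset($ulen_mask[$u]) ? $ulen_mask[$u] : 1;  // jump to the next lead byte
        ++$count;
    }
    return $count;
}

$test = "我能吞下玻璃而不伤身体。123";
echo "Bytes: " . strlen($test) . "\n";               // 39
echo "Characters: " . utf8_char_count($test) . "\n"; // 15
```

Because the loop jumps over the continuation bytes of each character, it only ever inspects lead bytes, which is why the one-byte fallback is enough for valid input.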

Here's a simple html form to take test values:
Code:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<form action="/test.php" method="POST">
Test UTF-8 string: <input type="text" name="str_utf">
<button type="submit">Submit</button>
</form>
</body>
</html>


And here's the result of a test run:
testing string: 我能吞下玻璃而不伤身体。123
Bytes: 39
strlen (utf8_decode(str)): 15
Symfony Polyfill function string length: 15
Thank you Lloyd :)

Several years back, when I was trying to understand Unicode, I found the following article quite helpful:
https://www.joelonsoftware.com/2003/10/ ... o-excuses/