Wednesday, 1 February 2017

Testing a non-UTF-8 string

I have read some other threads on this subject but I cannot understand what I am doing wrong.

I have a function:

public function test($item)
{
    if (! mb_detect_encoding($item, 'utf-8', true)) {
        $item = utf8_encode($item);
    }

    return $item;
}

I am writing a test for this. I want to test a string that is not UTF-8 to see if this statement is hit. I am having trouble creating the test string.

$contents = file_get_contents('CyrillicKOI8REncoded.txt');
var_dump(mb_detect_encoding($contents));

$sanitized = $this->test($contents);
var_dump(mb_detect_encoding($sanitized));

Initially I used file_get_contents on files I had saved in Sublime Text as Cyrillic (KOI8-R), HEX and DOS (CP 437), since I had read that file_get_contents ignores the encoding. This seems to be true, as the characters returned are a jumbled mess.

That said, every time I use mb_detect_encoding on these variables, I always get ASCII or UTF-8. The statement is never triggered because ASCII is a subset of UTF-8.
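One possible explanation, offered as a sketch rather than a confirmed diagnosis: when mb_detect_encoding is called with only the string, it tries the encodings returned by mb_detect_order(), which on a default configuration are typically just ASCII and UTF-8, so no other name can come back. The example below uses a hand-written KOI8-R byte string (my own test value, not the contents of the file from the post) and checks it strictly against UTF-8.

```php
<?php
// Encodings tried when no candidate list is passed; configuration dependent.
var_dump(mb_detect_order());

// The bytes of "привет" in KOI8-R, written as escapes so that the
// encoding of this source file does not matter.
$koi8 = "\xD0\xD2\xC9\xD7\xC5\xD4";

// Validate the bytes instead of guessing an encoding.
$isUtf8 = mb_check_encoding($koi8, 'UTF-8');          // false

// Strict detection against UTF-8 only, as in the function from the post.
$detected = mb_detect_encoding($koi8, 'UTF-8', true); // false

// Converting from the encoding we know the bytes are in gives valid UTF-8.
$asUtf8 = mb_convert_encoding($koi8, 'UTF-8', 'KOI8-R');

var_dump($isUtf8, $detected, bin2hex($asUtf8));
```

A byte string carries no label saying which encoding it is in, so a validity check such as mb_check_encoding($s, 'UTF-8') is usually a more dependable condition than a detected name.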

So I tried mb_convert_encoding and iconv to convert a basic string to UTF-16 and UTF-32, and I also tried base64 and hex encoding, but every time mb_detect_encoding returns ASCII or UTF-8.
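A likely reason the converted strings still look like ASCII or UTF-8, shown here as a small sketch with my own sample value 'abc': a basic ASCII string converted to UTF-16 or UTF-32 consists only of NUL bytes and the original ASCII bytes, all below 0x80, so the result is still a valid UTF-8 byte sequence. The output of base64_encode and bin2hex is plain ASCII text as well.

```php
<?php
$basic = 'abc';

// UTF-16BE pads each ASCII character with a leading NUL byte.
$utf16 = mb_convert_encoding($basic, 'UTF-16BE', 'UTF-8');
$utf16Hex = bin2hex($utf16);                       // "006100620063"

// Every byte is below 0x80, so the bytes also pass as valid UTF-8.
$utf16IsUtf8 = mb_check_encoding($utf16, 'UTF-8'); // true

// base64 output only uses ASCII letters, digits, "+", "/" and "=".
$b64 = base64_encode($basic);                      // "YWJj"
$b64IsUtf8 = mb_check_encoding($b64, 'UTF-8');     // true

var_dump($utf16Hex, $utf16IsUtf8, $b64, $b64IsUtf8);
```

To get bytes that fail a UTF-8 check, the input needs at least one non-ASCII character in a legacy single-byte encoding, for example "\xE9" ("é" in ISO-8859-1).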

In my tests I want to assert the encoding type before and after this function is called.

$sanitized = $this->test($contents);

$this->assertEquals('UTF-32', mb_detect_encoding($contents));
$this->assertEquals('UTF-8', mb_detect_encoding($sanitized));
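As an alternative to asserting detected encoding names, the assertions above could check UTF-8 validity before and after the call. This is a minimal sketch under several assumptions: sanitize() is a hypothetical standalone stand-in for the test() method, it uses mb_check_encoding for the condition, and it uses mb_convert_encoding from ISO-8859-1 in place of utf8_encode (which makes the same ISO-8859-1 assumption and is deprecated as of PHP 8.2).

```php
<?php
// Hypothetical standalone version of the method from the post.
function sanitize(string $item): string
{
    // Check whether the bytes are valid UTF-8 rather than detecting a name.
    if (!mb_check_encoding($item, 'UTF-8')) {
        // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        $item = mb_convert_encoding($item, 'UTF-8', 'ISO-8859-1');
    }

    return $item;
}

// "café" in ISO-8859-1; the lone 0xE9 byte is not valid UTF-8.
$contents  = "caf\xE9";
$sanitized = sanitize($contents);

$before = mb_check_encoding($contents, 'UTF-8');  // false
$after  = mb_check_encoding($sanitized, 'UTF-8'); // true

var_dump($before, $after, bin2hex($sanitized));   // last value: "636166c3a9"
```

In a PHPUnit test the two checks would map to assertFalse($before) and assertTrue($after), which exercises the conversion branch without relying on encoding detection.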

I cannot understand what basic mistake I am making that causes mb_detect_encoding to constantly return ASCII or UTF-8.
