UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 7: invalid continuation byte

Working with basic Latin characters rarely causes problems. With special characters such as accented letters, however, you may see the error "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte".

Why does the "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte" appear, and how can you solve it?

Encoding and decoding with two different character sets

The error appears when an object is encoded with one character set and then decoded with a different one. In the example below, Latin-1 encodes 'ä' as the single byte 0xe4. UTF-8 treats 0xe4 as the first byte of a three-byte sequence and expects a continuation byte next, but it finds 'r' instead.

    encoding = 'LearnShäreIT'.encode('latin-1')
    decoding = encoding.decode('utf-8')
    print(decoding)  # UnicodeDecodeError

Error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 7: invalid continuation byte
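To see why the decoder stops at position 7, it can help to look at the raw bytes. The sketch below compares how the two codecs encode 'ä' and then inspects the attributes of the exception; the variable names are only illustrative.

```python
text = 'LearnShäreIT'

# Latin-1 stores 'ä' as one byte; UTF-8 needs two bytes for it.
latin1_bytes = text.encode('latin-1')
utf8_bytes = text.encode('utf-8')
print(latin1_bytes)  # b'LearnSh\xe4reIT'
print(utf8_bytes)    # b'LearnSh\xc3\xa4reIT'

# In UTF-8, 0xe4 announces a three-byte sequence, so the next byte must be
# in the range 0x80-0xbf. Here the next byte is 'r' (0x72), hence the error.
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block
    print(exc.start, exc.reason)  # 7 invalid continuation byte
```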

To solve this error, decode the bytes with the same character set that was used to encode them, as in the code sample below.

    encoding = 'LearnShäreIT'.encode('utf-8')  # Using the same character set
    decoding = encoding.decode('utf-8')
    print(decoding)  # "LearnShäreIT"

Output:

          LearnShäreIT        
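Re-encoding is only an option when you control the original string. If the bytes arrive already encoded as Latin-1, one possible approach (assuming you know or can reasonably guess the source charset) is to decode with that charset. When losing the undecodable byte is acceptable, passing errors='replace' is another option. A minimal sketch:

```python
raw = 'LearnShäreIT'.encode('latin-1')  # bytes as they might arrive from elsewhere

# Option 1: decode with the charset that produced the bytes.
decoded = raw.decode('latin-1')
print(decoded)  # LearnShäreIT

# Option 2: keep decoding as UTF-8 but substitute U+FFFD for invalid bytes.
# This never raises, but the original 'ä' is lost.
lossy = raw.decode('utf-8', errors='replace')
print(ascii(lossy))  # 'LearnSh\ufffdreIT'
```

Option 2 hides the problem rather than fixing it, so it is best kept for logs or previews where exact text does not matter.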

Inconsistent charset between saving and reading a file

Suppose we create a CSV file and save it with the UTF-16 BE charset.

But when reading the file with pandas.read_csv(), we rely on the default encoding of read_csv(), which is utf-8. See the code below.

    import pandas as pd

    # read_csv() defaults to encoding='utf-8', but data.csv is saved as UTF-16
    data = pd.read_csv('data.csv')
    print(data)

Error:

          UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0        
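The byte 0xfe at position 0 is consistent with a UTF-16 BE byte order mark (BOM), the two bytes FE FF that editors commonly write at the start of such files. The sketch below builds this kind of data in memory, on the assumption that the file starts with a BOM, and shows that the UTF-8 decoder rejects the very first byte while the 'utf-16' codec uses the BOM to pick the byte order.

```python
import codecs

csv_text = 'Name,Website\nLearnShareIT,learnshareit.com\n'

# Simulate the file contents: BOM (b'\xfe\xff') followed by big-endian UTF-16.
raw = codecs.BOM_UTF16_BE + csv_text.encode('utf-16-be')

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block
    print(hex(raw[exc.start]), exc.start)  # 0xfe 0

# The generic 'utf-16' codec reads the BOM, picks big-endian, and strips it.
decoded = raw.decode('utf-16')
print(decoded == csv_text)  # True
```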

We have to pass encoding='utf-16' so that reading uses the same encoding the file was saved with:

    import pandas as pd

    # Using encoding='utf-16'
    data = pd.read_csv('data.csv', encoding='utf-16')
    print(data)

Output:

               Name           Website
    0  LearnShareIT  learnshareit.com
    1      Facebook      facebook.com
    2        Google        google.com
    3         Udemy         udemy.com
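The same rule can be checked without pandas. The following sketch uses only the standard library (the csv module here is a stand-in for read_csv(), and the temporary file path is hypothetical) to write a CSV file as UTF-16 and read it back with a matching encoding argument.

```python
import csv
import os
import tempfile

rows = [['Name', 'Website'], ['LearnShareIT', 'learnshareit.com']]

# Write the CSV as UTF-16 into a temporary directory (illustrative path).
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(path, 'w', encoding='utf-16', newline='') as f:
    csv.writer(f).writerows(rows)

# Reading with the same encoding recovers the rows unchanged.
with open(path, encoding='utf-16', newline='') as f:
    read_back = list(csv.reader(f))
print(read_back)

# Reading with the wrong encoding fails, as in the pandas example.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True
```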

Using the detect() function in the chardet package

You can use chardet to detect the character encoding of a file. This library is handy when working with a large pile of text, and also when working with downloaded data whose charset you don't know.

Syntax:

chardet.detect(data)

Parameter:

  • data: the raw bytes of the file whose charset you want to detect.

The detect() function guesses which charset a byte string is using. It returns a dictionary containing the detected charset and a confidence level.

Before using the detect() function, we need to install chardet with the following command:

          pip install chardet        

Then we import chardet at the top of the Python file. Next, we read the file in binary mode and pass its bytes to detect(). Finally, we pass the detected charset to read_csv():

    import chardet
    import pandas as pd

    # Detect character encoding of data.csv
    with open('data.csv', 'rb') as f:
        enc = chardet.detect(f.read())
    print(enc['encoding'])  # UTF-16

    # Use pandas to read data.csv
    data = pd.read_csv('data.csv', encoding=enc['encoding'])
    print(data)

Output:

    UTF-16
               Name           Website
    0  LearnShareIT  learnshareit.com
    1      Facebook      facebook.com
    2        Google        google.com
    3         Udemy         udemy.com
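Because detect() returns a guess with a confidence level, it can be paired with cheaper deterministic checks. As a rough sketch (guess_encoding is a hypothetical helper, not part of chardet or pandas), the function below looks for a byte order mark first, then tries strict UTF-8, and only then falls back to Latin-1, which accepts any byte sequence.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Hypothetical helper: pick a codec name from a BOM or a UTF-8 trial."""
    # Note: UTF-32 BOMs are not handled in this sketch.
    boms = [
        (codecs.BOM_UTF8, 'utf-8-sig'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    try:
        raw.decode('utf-8')  # strict decoding raises on invalid bytes
        return 'utf-8'
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so it never fails,
        # but the result may still be the wrong text.
        return 'latin-1'


utf16_sample = codecs.BOM_UTF16_BE + 'Name'.encode('utf-16-be')
print(guess_encoding(utf16_sample))                       # utf-16
print(guess_encoding('LearnShäreIT'.encode('utf-8')))     # utf-8
print(guess_encoding('LearnShäreIT'.encode('latin-1')))   # latin-1
```

The returned name is an ordinary Python codec name, so it could be passed to the encoding argument of read_csv() in the same way as enc['encoding'].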

Change character encoding manually

This approach is very simple. Open the file you need to read in Notepad++, select Encoding -> Convert to UTF-8 on the menu bar, and save the file. After that, read_csv() can read the file with its default encoding.

Code:

    import pandas as pd

    # Using pandas to read data.csv with charset = UTF-8
    data = pd.read_csv('data.csv')
    print(data)

Output:

               Name           Website
    0  LearnShareIT  learnshareit.com
    1      Facebook      facebook.com
    2        Google        google.com
    3         Udemy         udemy.com
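If Notepad++ is not available, the same conversion can be scripted. Here is a minimal sketch, assuming the source charset is already known (convert_to_utf8 and the file names are hypothetical):

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8, keeping line endings untouched."""
    # newline='' disables newline translation on both read and write.
    with open(src_path, encoding=src_encoding, newline='') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)


# Demo with temporary files standing in for data.csv.
folder = tempfile.mkdtemp()
src = os.path.join(folder, 'data_utf16.csv')
dst = os.path.join(folder, 'data_utf8.csv')
with open(src, 'w', encoding='utf-16', newline='') as f:
    f.write('Name,Website\nLearnShareIT,learnshareit.com\n')

convert_to_utf8(src, dst, 'utf-16')
with open(dst, encoding='utf-8', newline='') as f:
    converted = f.read()
print(converted)
```

The sketch reads the whole file into memory, which is fine for small CSV files; a very large file would need to be copied in chunks instead.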

Summary

In short, the error "UnicodeDecodeError: 'utf-8' codec can't decode byte in position: invalid continuation byte" comes from an inconsistency between the encoding and decoding steps. As long as you use the same character set for both encoding and decoding (such as UTF-8), you won't get this error again.

Have a lucky day!



Source: https://learnshareit.com/how-to-solve-the-error-unicodedecodeerror-utf-8-codec-cant-decode-byte-in-position-invalid-continuation-byte/
