50. Working with Unicode and Encodings
1. Encoding and Decoding Unicode Strings
Encoding a Unicode string to bytes and then decoding it back.
# Unicode string
unicode_string = "Hello, world! 👋"
# Encoding the Unicode string to bytes using UTF-8
encoded_bytes = unicode_string.encode('utf-8')
print(encoded_bytes)
# Decoding the bytes back to Unicode string
decoded_string = encoded_bytes.decode('utf-8')
print(decoded_string)
This snippet demonstrates encoding a Unicode string to bytes and decoding it back, preserving the characters.
2. Handling Errors in Encoding and Decoding
Specifying an error-handling strategy (e.g., 'ignore', 'replace') for characters that the target encoding cannot represent.
# Unicode string containing a character (the emoji) that ASCII cannot encode
unicode_string = "Hello, world! 👋"
# Encoding with 'ignore' strategy (ignores unencodable characters)
encoded_bytes = unicode_string.encode('ascii', 'ignore')
print(encoded_bytes)
# Encoding with 'replace' strategy (replaces unencodable characters with '?')
encoded_bytes = unicode_string.encode('ascii', 'replace')
print(encoded_bytes)
This snippet shows how to handle encoding errors by either ignoring or replacing problematic characters.
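The same error-handling argument applies when decoding. A minimal sketch, using a deliberately truncated UTF-8 byte sequence (the sample bytes are an illustration, not taken from the original):

```python
# UTF-8 bytes for "café" with the last byte cut off,
# so the final character is incomplete
broken_bytes = "café".encode("utf-8")[:-1]

# The default strategy ('strict') raises UnicodeDecodeError
try:
    broken_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"strict decoding failed: {exc.reason}")

# 'ignore' drops the bytes that cannot be decoded
ignored = broken_bytes.decode("utf-8", errors="ignore")
print(ignored)

# 'replace' substitutes U+FFFD, the Unicode replacement character
replaced = broken_bytes.decode("utf-8", errors="replace")
print(replaced)
```

Both 'ignore' and 'replace' lose information silently, so 'strict' is usually the safer choice unless some data loss is acceptable.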
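3. Checking if a String Contains Non-ASCII Characters
Checking whether a string is pure ASCII or contains non-ASCII characters. In Python 3 every str is already Unicode, so the useful question is whether all of its characters fall within the ASCII range.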
This checks if the string contains characters beyond the ASCII range (values above 127).
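A minimal sketch of such a check. The helper name is_ascii is hypothetical and chosen for illustration:

```python
def is_ascii(text: str) -> bool:
    """Return True if every character is in the ASCII range (code points 0-127)."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Hello, world!"))     # True
print(is_ascii("Hello, world! 👋"))  # False
```

On Python 3.7 and later, the built-in str.isascii() method performs the same check.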
4. Writing and Reading Files with Different Encodings
Writing a Unicode string to a file with UTF-8 encoding and reading it back.
This snippet writes and reads a Unicode string to a file using the UTF-8 encoding.
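A minimal sketch of the write and read round trip. The file name example.txt and the temporary directory are assumptions made so the example cleans up after itself:

```python
import os
import tempfile

text = "Hello, world! 👋"

# Hypothetical file location inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")

    # Write the string, stating the encoding explicitly
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read it back with the same encoding
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

print(read_back)
```

Passing encoding explicitly avoids relying on the platform default, which can differ between systems.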
5. Detecting the Encoding of a File
Automatically detecting the encoding of a text file using chardet.
The chardet library guesses the encoding of a file from its raw bytes, which is useful when the encoding is unknown. The result is a statistical estimate that comes with a confidence score, not a guarantee.
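A minimal sketch of detection with chardet.detect, assuming the third-party chardet package is installed. The helper name detect_file_encoding and the sample file are hypothetical, and the import is guarded so the script still runs without the package:

```python
import os
import tempfile

try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def detect_file_encoding(path):
    """Return chardet's guess for the file at `path` (a dict), or None if chardet is missing."""
    if chardet is None:
        return None
    # Read raw bytes, not text: detection works on undecoded data
    with open(path, "rb") as f:
        raw = f.read()
    return chardet.detect(raw)


# Hypothetical sample file, written here only so there is something to detect
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Grüße aus München. Ça va très bien, merci!")
    guess = detect_file_encoding(path)

# When chardet is available, the dict includes 'encoding' and 'confidence' keys
print(guess)
```

Short inputs give the detector little evidence, so the guess is more reliable on larger files.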
6. Normalizing Unicode Text
Normalizing Unicode strings to a canonical form (NFC, NFD) for consistent comparisons.
Normalization ensures that Unicode strings are in a consistent form, which is crucial for equality comparisons.
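A minimal sketch with the standard library's unicodedata.normalize. The two spellings of é are chosen for illustration:

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings render the same but compare unequal
print(composed == decomposed)  # False

# NFC composes characters, NFD decomposes them
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)     # True
print(nfd == decomposed)   # True
print(len(nfc), len(nfd))  # 1 2
```

Normalizing both sides to the same form before comparing makes visually identical strings compare equal.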
7. Converting Unicode to ASCII Using unidecode
Removing accents and converting Unicode characters to their ASCII equivalents.
The unidecode library converts accented characters in Unicode strings to their closest ASCII equivalents.
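A minimal sketch, assuming the third-party Unidecode package is installed. The helper name to_ascii is hypothetical, and the standard-library fallback is an addition for illustration that only strips accents; it is not part of unidecode:

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party: pip install Unidecode
except ImportError:
    unidecode = None


def to_ascii(text):
    """Convert accented characters to plain ASCII letters."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback (an assumption, not unidecode itself): decompose characters
    # with NFKD, then drop everything that is not ASCII, such as combining accents
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Café déjà vu"))  # Cafe deja vu
```

The fallback simply discards characters that have no decomposition to ASCII, whereas unidecode also transliterates many non-Latin scripts.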
8. Working with UTF-16 Encoding
Encoding and decoding a string in UTF-16 encoding.
This snippet demonstrates encoding and decoding with UTF-16, another popular Unicode encoding.
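A minimal sketch of the UTF-16 round trip. The comparison with utf-16-le is added to illustrate the byte order mark (BOM) and is an assumption about what is worth showing:

```python
text = "Hello, world! 👋"

# The plain 'utf-16' codec prepends a 2-byte BOM that records the byte order
utf16_bytes = text.encode("utf-16")
print(utf16_bytes[:2])

# Decoding with 'utf-16' reads the BOM and removes it from the result
decoded = utf16_bytes.decode("utf-16")
print(decoded)

# Naming the byte order explicitly ('utf-16-le' or 'utf-16-be') writes no BOM
utf16_le = text.encode("utf-16-le")
print(len(utf16_bytes) - len(utf16_le))  # 2, the size of the BOM
```

Characters outside the Basic Multilingual Plane, such as the emoji here, take four bytes in UTF-16 (a surrogate pair) instead of two.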
9. Unicode Escape Sequences in Python Strings
Using escape sequences to represent Unicode characters in Python strings.
Unicode escape sequences such as \uXXXX (four hex digits), \UXXXXXXXX (eight hex digits), and \N{NAME} represent Unicode characters in string literals using only ASCII source text.
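A minimal sketch of the three escape forms. The characters are chosen for illustration:

```python
# \uXXXX takes exactly four hex digits
e_acute = "\u00e9"

# \UXXXXXXXX takes exactly eight hex digits, needed for code points above U+FFFF
wave = "\U0001F44B"

# \N{NAME} looks the character up by its official Unicode name
euro = "\N{EURO SIGN}"

print(e_acute, wave, euro)

# An escape and the literal character produce identical strings
print(e_acute == "é", wave == "👋", euro == "€")  # True True True

# ascii() goes the other way and shows a string using escapes only
print(ascii(wave))  # '\U0001f44b'
```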
10. Handling Unicode in Web Scraping (with Requests)
Scraping a website with different encoding and handling Unicode content properly.
The requests library lets you set response.encoding before reading response.text, so that web content is decoded to Unicode with the correct encoding.
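A minimal sketch, assuming the third-party requests package is installed. The helper name fetch_text, its parameters, and the URL in the usage comment are hypothetical:

```python
try:
    import requests  # third-party: pip install requests
except ImportError:
    requests = None


def fetch_text(url, encoding=None, timeout=10):
    """Download `url` and return its body decoded to a Unicode str."""
    if requests is None:
        raise RuntimeError("the requests package is not installed")

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()

    if encoding is not None:
        # Override whatever the HTTP headers declared
        response.encoding = encoding
    elif response.encoding is None:
        # No charset in the headers: fall back to the guess requests
        # makes from the response body
        response.encoding = response.apparent_encoding

    # .text decodes the raw bytes using response.encoding
    return response.text


# Example usage (placeholder URL, requires network access):
# html = fetch_text("https://example.com", encoding="utf-8")
```

The raw bytes remain available as response.content if you prefer to decode them yourself.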
These snippets cover common ways of handling Unicode and encodings in Python: converting between text and bytes, reading and writing files in a given encoding, and processing international text consistently across systems and platforms.