50. Working with Unicode and Encodings

1. Encoding and Decoding Unicode Strings

Encoding a Unicode string to bytes and then decoding it back.

# Unicode string
unicode_string = "Hello, world! 👋"

# Encoding the Unicode string to bytes using UTF-8
encoded_bytes = unicode_string.encode('utf-8')
print(encoded_bytes)

# Decoding the bytes back to Unicode string
decoded_string = encoded_bytes.decode('utf-8')
print(decoded_string)

This snippet demonstrates encoding a Unicode string to bytes and decoding it back, preserving the characters.
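
One consequence worth seeing directly is that the number of characters and the number of bytes usually differ. The following is a minimal sketch (the variable names are illustrative) comparing the two for the same string and showing how many UTF-8 bytes a few sample characters take:

```python
# Compare the code-point count with the UTF-8 byte count.
text = "Hello, world! \U0001F44B"  # same string as above, emoji written as an escape
data = text.encode("utf-8")

char_count = len(text)  # len() on str counts code points
byte_count = len(data)  # len() on bytes counts bytes

# UTF-8 uses between 1 and 4 bytes per code point.
samples = "a\u00e9\u20ac\U0001F44B"  # a, é, €, 👋
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

print(char_count, byte_count)  # 15 18
print(widths)
```

The emoji alone accounts for four of the 18 bytes, which is why byte offsets should not be used to index into text.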

2. Handling Errors in Encoding and Decoding

Specifying error handling strategy when encoding/decoding (e.g., 'ignore', 'replace').

# Unicode string with a character (the emoji) that ASCII cannot represent
unicode_string = "Hello, world! 👋"

# Encoding with 'ignore' strategy (ignores unencodable characters)
encoded_bytes = unicode_string.encode('ascii', 'ignore')
print(encoded_bytes)

# Encoding with 'replace' strategy (replaces unencodable characters with '?')
encoded_bytes = unicode_string.encode('ascii', 'replace')
print(encoded_bytes)

This snippet shows how to handle encoding errors by either ignoring or replacing problematic characters.
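
The same `errors` argument also applies when decoding, and Python ships a few more handlers than the two shown. As a hedged sketch (the sample strings are made up for illustration), this compares 'xmlcharrefreplace' and 'backslashreplace' on encoding, and the default 'strict' versus 'replace' on decoding:

```python
text = "caf\u00e9 \U0001F44B"  # "café 👋"

# Encoding: keep the information instead of dropping it.
as_xml = text.encode("ascii", "xmlcharrefreplace")
as_escapes = text.encode("ascii", "backslashreplace")
print(as_xml)      # b'caf&#233; &#128075;'
print(as_escapes)  # b'caf\\xe9 \\U0001f44b'

# Decoding: these Latin-1 bytes are not valid UTF-8.
bad_bytes = b"caf\xe9"
try:
    bad_bytes.decode("utf-8")  # 'strict' is the default and raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'replace' substitutes U+FFFD (the replacement character) for bad bytes.
replaced = bad_bytes.decode("utf-8", "replace")
print(strict_failed, repr(replaced))
```

'ignore' and 'replace' both lose data silently, so they are best reserved for cases such as logging where an approximate result is acceptable.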

3. Checking Whether a String Contains Non-ASCII Characters

Checking whether a string contains any non-ASCII characters or only ASCII characters.

# String with a non-ASCII character
unicode_string = "Hello, world! 👋"

# Checking if the string contains non-ASCII characters
if any(ord(char) > 127 for char in unicode_string):
    print("The string contains non-ASCII characters.")
else:
    print("The string contains only ASCII characters.")

This checks whether any character has a code point above 127, the upper limit of the ASCII range. In Python 3 every str is already a Unicode string, so the real question is whether it contains non-ASCII characters.
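
On Python 3.7 and later, the built-in `str.isascii()` method performs the same test. A short sketch, with a hypothetical helper `non_ascii_chars` added only to show which characters trigger the result:

```python
def non_ascii_chars(text):
    """Return the characters in text whose code point is above 127."""
    return [ch for ch in text if ord(ch) > 127]

for sample in ["Hello", "H\u00e9llo", "Hello \U0001F44B"]:
    # str.isascii() is True only when every character is in the ASCII range.
    print(repr(sample), sample.isascii(), non_ascii_chars(sample))
```

Listing the offending characters is often more useful than a yes/no answer when cleaning input data.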

4. Writing and Reading Files with Different Encodings

Writing a Unicode string to a file with UTF-8 encoding and reading it back.

# Unicode string
unicode_string = "This is a Unicode text with special characters: ä, é, ü"

# Writing to a file with UTF-8 encoding
with open('unicode_text.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_string)

# Reading the content back with UTF-8 decoding
with open('unicode_text.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(content)

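This snippet writes a Unicode string to a file and reads it back, using UTF-8 both times. Passing encoding explicitly matters because, when it is omitted, Python falls back to a platform-dependent default.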
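
To see why the encoding used for reading must match the one used for writing, here is a small sketch that reads the same file twice, once correctly and once as Latin-1. It uses a temporary directory (an assumption made here so the example leaves no files behind):

```python
import os
import tempfile

text = "\u00e4, \u00e9, \u00fc"  # "ä, é, ü"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "unicode_text.txt")

    with open(path, "w", encoding="utf-8") as file:
        file.write(text)

    # Matching encoding: the text round-trips unchanged.
    with open(path, "r", encoding="utf-8") as file:
        right = file.read()

    # Mismatched encoding: no error is raised, but every two-byte
    # UTF-8 sequence is shown as two Latin-1 characters (mojibake).
    with open(path, "r", encoding="latin-1") as file:
        wrong = file.read()

print(right)
print(wrong)
```

The mismatch fails silently because every byte is a valid Latin-1 character, which is what makes this class of bug easy to miss.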

5. Detecting the Encoding of a File

Guessing the encoding of a text file with the third-party chardet library (installed separately, e.g. with pip install chardet).

import chardet

# Detecting the encoding of a file
with open('unicode_text.txt', 'rb') as file:
    raw_data = file.read()
    result = chardet.detect(raw_data)
    print(f"Detected encoding: {result['encoding']}")

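The chardet library guesses the encoding of a file from its bytes, which is useful when the encoding is unknown. Detection is heuristic: the result also carries a confidence score in result['confidence'], and guesses on short inputs can be wrong.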
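
When a third-party dependency is not an option, a much cruder check is possible with the standard library alone. The sketch below is not how chardet works; it is a simplified stand-in, and the function name `sniff_encoding` and the Latin-1 fallback are assumptions made for illustration. It looks for a byte order mark, then tests whether the bytes are valid UTF-8:

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Return a rough encoding guess based on BOMs and UTF-8 validity."""
    # UTF-32 BOMs must be checked before UTF-16: the UTF-32-LE BOM
    # begins with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 decodes any byte sequence,
        # so this is a last resort rather than a real detection.
        return "latin-1"

print(sniff_encoding("h\u00e9llo".encode("utf-8")))
print(sniff_encoding("h\u00e9llo".encode("utf-16")))
print(sniff_encoding("h\u00e9llo".encode("latin-1")))
```

This only distinguishes a handful of cases; it cannot tell Latin-1 from cp1252 or any other single-byte encoding, which is where a statistical detector earns its keep.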

6. Normalizing Unicode Text

Normalizing Unicode strings to a canonical form (NFC, NFD) for consistent comparisons.

import unicodedata

# Unicode string with combining characters
unicode_string = "e\u0301"  # 'e' followed by an acute accent (combining character)

# Normalize to NFC (Canonical Composition)
normalized_string = unicodedata.normalize('NFC', unicode_string)
print(f"NFC Normalized: {normalized_string}")

# Normalize to NFD (Canonical Decomposition)
normalized_string = unicodedata.normalize('NFD', unicode_string)
print(f"NFD Normalized: {normalized_string}")

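Normalization ensures that Unicode strings are in a consistent form, which is crucial for equality comparisons. The two results above look identical when printed but are different strings: NFC yields a single code point (U+00E9), while NFD yields two (U+0065 followed by U+0301).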
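
The effect on comparisons can be checked directly. This minimal sketch (the helper name `same_text` is hypothetical) compares a precomposed and a decomposed "é" before and after normalization:

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The raw strings render the same but do not compare equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))  # True
```

Normalizing both sides to the same form, whichever one is chosen, is what makes the comparison reliable.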

7. Converting Unicode to ASCII Using unidecode

Transliterating Unicode text to ASCII with the third-party unidecode library, which strips accents and replaces other characters with ASCII approximations.

from unidecode import unidecode

# Unicode string with accents
unicode_string = "El Niño está aquí!"

# Transliterate to ASCII (accented letters lose their accents)
ascii_string = unidecode(unicode_string)
print(ascii_string)

The unidecode library converts the characters in a Unicode string to their closest ASCII approximations. The conversion is lossy and cannot be reversed.
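
For the narrower job of removing accents, the standard library can do a similar thing without unidecode. This is a sketch of a different technique, not of unidecode's behavior: it decomposes the text with NFD and drops the combining marks. The function name `strip_accents` is illustrative:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("El Ni\u00f1o est\u00e1 aqu\u00ed!"))  # El Nino esta aqui!
print(strip_accents("Stra\u00dfe"))  # unchanged: ß has no decomposition
```

Unlike transliteration, this leaves characters such as ß or non-Latin scripts untouched, so the result is not guaranteed to be ASCII.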

8. Working with UTF-16 Encoding

Encoding and decoding a string in UTF-16 encoding.

# Unicode string
unicode_string = "Hello, world! 👋"

# Encoding the string to bytes using UTF-16
encoded_bytes = unicode_string.encode('utf-16')
print(f"Encoded in UTF-16: {encoded_bytes}")

# Decoding the bytes back to string using UTF-16
decoded_string = encoded_bytes.decode('utf-16')
print(f"Decoded from UTF-16: {decoded_string}")

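This snippet demonstrates encoding and decoding with UTF-16, another widely used Unicode encoding. With the plain 'utf-16' codec, Python prepends a byte order mark (BOM) to the encoded bytes and uses it when decoding to determine the byte order.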
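
The byte order mark and the explicit-endianness variants are easiest to understand side by side. A small sketch, using a two-letter string so the bytes stay readable:

```python
import codecs

text = "Hi"

with_bom = text.encode("utf-16")          # BOM + native byte order
little_endian = text.encode("utf-16-le")  # no BOM, low byte first
big_endian = text.encode("utf-16-be")     # no BOM, high byte first

print(little_endian)  # b'H\x00i\x00'
print(big_endian)     # b'\x00H\x00i'

# The BOM is two extra bytes at the front of the plain 'utf-16' output.
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom, len(with_bom))

# Characters outside the Basic Multilingual Plane take a surrogate pair.
emoji_bytes = "\U0001F44B".encode("utf-16-le")
print(len(emoji_bytes))  # 4
```

When exchanging data with a system that expects a fixed byte order and no BOM, the '-le' or '-be' codec is the one to pick.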

9. Unicode Escape Sequences in Python Strings

Using escape sequences to represent Unicode characters in Python strings.

# Unicode escape sequences
unicode_string = "\u0048\u0065\u006C\u006C\u006F"  # Code points for 'Hello'

print(unicode_string)  # Output: Hello

The \u escape takes exactly four hexadecimal digits and represents a single Unicode code point in a string literal. Code points above U+FFFF need the eight-digit \U form instead.
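
Python supports two further escape forms besides \u: the eight-digit \U for code points beyond U+FFFF, and \N{...} for referring to a character by its Unicode name. A brief sketch showing that all forms produce the same strings:

```python
import unicodedata

wave_hex = "\U0001F44B"             # eight hex digits
wave_name = "\N{WAVING HAND SIGN}"  # official Unicode character name

e_hex = "\u00e9"
e_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"

print(wave_hex == wave_name, e_hex == e_name)  # True True

# unicodedata.name() goes the other way, from character to name.
print(unicodedata.name(wave_hex))  # WAVING HAND SIGN

# In a raw string the escape is not processed and stays as six characters.
raw = r"\u0048"
print(len(raw))  # 6
```

Named escapes are longer to type but make the intent clear to readers, which helps with invisible or look-alike characters.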

10. Handling Unicode in Web Scraping (with Requests)

Fetching a web page and decoding its content to Unicode with the correct encoding.

import requests

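# Fetching a web page (the timeout avoids hanging on an unresponsive server)
response = requests.get("https://www.example.com", timeout=10)

# requests infers the encoding from the HTTP headers; override it only when you know better
response.encoding = 'utf-8'  # Force UTF-8 decoding
print(response.text)  # Print the decoded HTML content

The requests library decodes response.text using response.encoding, which it infers from the HTTP headers. Setting response.encoding explicitly is useful when the header is missing or wrong.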
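
Hard-coding one encoding can misdecode pages that use another. One possible approach, sketched here without any network access, is to decode the raw bytes yourself with a fallback chain. The helper `decode_body` and its fallback order (declared charset, then UTF-8, then cp1252) are assumptions for illustration, not part of requests; with requests, the inputs would be response.content and response.encoding:

```python
from typing import Optional

def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw bytes, preferring the declared charset when it works."""
    candidates = [declared, "utf-8"] if declared else ["utf-8"]
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    # Assumed last resort: cp1252 with replacement, so decoding never raises.
    return raw.decode("cp1252", errors="replace")

page_utf8 = "h\u00e9llo".encode("utf-8")
page_cp1252 = "h\u00e9llo".encode("cp1252")

print(decode_body(page_utf8))                     # no declared charset
print(decode_body(page_cp1252, "cp1252"))         # correct declaration
print(decode_body(page_cp1252, "no-such-codec"))  # bogus declaration
```

All three calls return the same text, which shows the fallback chain absorbing both a missing and an invalid charset declaration.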

These snippets cover the common Unicode tasks in Python: encoding and decoding, error handling, file I/O with explicit encodings, encoding detection, normalization, transliteration to ASCII, and decoding web content. Handling these consistently keeps international text intact across systems and platforms.

