Encoding a Unicode string to bytes and then decoding it back.
# Unicode string
unicode_string = "Hello, world! 👋"
# Encoding the Unicode string to bytes using UTF-8
encoded_bytes = unicode_string.encode('utf-8')
print(encoded_bytes)
# Decoding the bytes back to Unicode string
decoded_string = encoded_bytes.decode('utf-8')
print(decoded_string)
This snippet demonstrates encoding a Unicode string to bytes and decoding it back, preserving the characters.
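As a small follow-up sketch, the round trip can be checked explicitly, and comparing lengths shows that one character is not always one byte. The variable names mirror the snippet above; the counts in the comments assume the exact string shown.

```python
unicode_string = "Hello, world! 👋"
encoded_bytes = unicode_string.encode("utf-8")

# The string has 15 characters, but the emoji takes 4 bytes in UTF-8
char_count = len(unicode_string)
byte_count = len(encoded_bytes)
print(char_count)  # 15
print(byte_count)  # 18

# Decoding with the same encoding restores the original string exactly
round_trip_ok = encoded_bytes.decode("utf-8") == unicode_string
print(round_trip_ok)  # True
```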
2. Handling Errors in Encoding and Decoding
Specifying an error-handling strategy when encoding or decoding (e.g., 'ignore', 'replace').
# Unicode string containing a character that ASCII cannot represent
unicode_string = "Hello, world! 👋"
# Encoding with 'ignore' strategy (ignores unencodable characters)
encoded_bytes = unicode_string.encode('ascii', 'ignore')
print(encoded_bytes)
# Encoding with 'replace' strategy (replaces unencodable characters with '?')
encoded_bytes = unicode_string.encode('ascii', 'replace')
print(encoded_bytes)
This snippet shows how to handle encoding errors by either ignoring or replacing problematic characters.
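The same strategies apply on the decoding side. The following is a minimal sketch, assuming a hand-made byte string (`raw_bytes`, chosen for illustration) that is not valid UTF-8; it shows the default strict behaviour next to 'replace' and 'ignore'.

```python
# Illustrative input: the byte 0xff can never appear in valid UTF-8
raw_bytes = b"caf\xff"

# The default strategy ('strict') raises UnicodeDecodeError
try:
    raw_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"Strict decoding failed: {exc.reason}")

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte
replaced = raw_bytes.decode("utf-8", errors="replace")
print(replaced)

# 'ignore' silently drops the bad byte
ignored = raw_bytes.decode("utf-8", errors="ignore")
print(ignored)
```

Both 'ignore' and 'replace' lose information, so they fit best where approximate text is acceptable, such as logging or display.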
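3. Checking if a String Contains Non-ASCII Characters
Checking whether a string contains characters outside the ASCII range. In Python 3 every str is already Unicode, so the useful question is whether any character is non-ASCII.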
This checks if the string contains characters beyond the ASCII range (values above 127).
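On Python 3.7 and later, the built-in str.isascii() method performs the same check. A short sketch follows; the helper `non_ascii_chars` is a hypothetical name introduced here only to show which characters fall outside the ASCII range.

```python
ascii_only = "Hello, world!"
with_emoji = "Hello, world! 👋"

# str.isascii() is True only when every character is in the range 0-127
print(ascii_only.isascii())  # True
print(with_emoji.isascii())  # False


def non_ascii_chars(text):
    """Hypothetical helper: list the characters with code points above 127."""
    return [char for char in text if ord(char) > 127]


print(non_ascii_chars(with_emoji))
```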
4. Writing and Reading Files with Different Encodings
Writing a Unicode string to a file with UTF-8 encoding and reading it back.
This snippet writes and reads a Unicode string to a file using the UTF-8 encoding.
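To illustrate why the encoding argument matters, here is a sketch of what happens when a file written as UTF-8 is read back as Latin-1. The temporary directory and the file name are assumptions made for the example so that it leaves no files behind.

```python
import tempfile
from pathlib import Path

text = "ä, é, ü"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "unicode_text.txt"  # illustrative file name
    path.write_text(text, encoding="utf-8")

    # Reading with the encoding used for writing restores the text
    same_encoding = path.read_text(encoding="utf-8")

    # Reading with a different encoding does not fail here, but each
    # two-byte UTF-8 sequence is misread as two Latin-1 characters
    wrong_encoding = path.read_text(encoding="latin-1")

print(same_encoding)
print(wrong_encoding)
```

The second result is the familiar garbled text ("mojibake"), which is why passing encoding explicitly on both the write and the read is the safer habit.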
5. Detecting the Encoding of a File
Automatically detecting the encoding of a text file using chardet.
The chardet library guesses the encoding of a file from its raw bytes, which is useful when the encoding is unknown. The result is a statistical estimate, so the returned dictionary also carries a confidence value.
6. Normalizing Unicode Text
Normalizing Unicode strings to a canonical form (NFC, NFD) for consistent comparisons.
Normalization ensures that Unicode strings are in a consistent form, which is crucial for equality comparisons.
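A brief sketch of the comparison problem that normalization solves: the two strings below render identically but are built from different code points. The variable names are chosen for illustration.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ even though they look the same on screen
raw_equal = composed == decomposed
print(raw_equal)  # False

# After normalizing both sides to the same form, they compare equal
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)  # True

# NFC composes to one code point; NFD decomposes to two
nfc_length = len(unicodedata.normalize("NFC", decomposed))
nfd_length = len(unicodedata.normalize("NFD", composed))
print(nfc_length, nfd_length)  # 1 2
```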
7. Converting Unicode to ASCII Using unidecode
Removing accents and converting Unicode characters to their ASCII equivalents.
The unidecode library converts accented characters in Unicode strings to their closest ASCII equivalents.
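Where installing unidecode is not an option, accents alone can be stripped with the standard library. This is a different and narrower technique than unidecode: the sketch below (with a hypothetical helper named `strip_accents`) only removes combining marks and does not transliterate characters such as 'ß' or non-Latin scripts.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("El Niño está aquí!"))  # El Nino esta aqui!

# Characters without a canonical decomposition pass through unchanged
print(strip_accents("Straße"))
```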
8. Working with UTF-16 Encoding
Encoding and decoding a string in UTF-16 encoding.
This snippet demonstrates encoding and decoding with UTF-16, another popular Unicode encoding.
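One detail worth a sketch: the plain 'utf-16' codec prepends a byte order mark (BOM), while 'utf-16-le' and 'utf-16-be' fix the byte order and write no BOM. The BOM bytes produced by 'utf-16' depend on the platform's native byte order, so the example checks for either form.

```python
text = "Hi"

# 'utf-16' writes a 2-byte BOM first, then 2 bytes per character here
with_bom = text.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True
print(len(with_bom))  # 6

# Explicit byte orders produce no BOM
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")
print(little_endian)  # b'H\x00i\x00'
print(big_endian)     # b'\x00H\x00i'

# Characters outside the Basic Multilingual Plane need a surrogate pair
emoji_bytes = "👋".encode("utf-16-le")
print(len(emoji_bytes))  # 4
```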
9. Unicode Escape Sequences in Python Strings
Using escape sequences to represent Unicode characters in Python strings.
Unicode escape sequences like \u allow the representation of Unicode characters in string literals.
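A minimal sketch of the escape forms Python string literals accept: \u takes exactly four hex digits, \U takes exactly eight, and \N{...} takes an official Unicode character name. The variable names are illustrative.

```python
# \u followed by 4 hex digits
e_acute = "\u00e9"
print(e_acute)  # é

# \U followed by 8 hex digits, needed for code points above U+FFFF
waving_hand = "\U0001F44B"
print(waving_hand)  # 👋

# \N{...} looks a character up by its Unicode name
by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"
print(by_name == e_acute)  # True

# In a raw string the escape is not interpreted: six literal characters
raw_literal = r"\u00e9"
print(len(raw_literal))  # 6
```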
10. Handling Unicode in Web Scraping (with Requests)
Scraping a website that uses a different encoding and handling its Unicode content properly.
Setting response.encoding tells requests which encoding to use when it decodes the response body into response.text.
These snippets demonstrate different ways of handling Unicode and encodings in Python, making it easier to work with international characters, different text encodings, and file handling while ensuring compatibility across different systems and platforms.
# Snippet 3: checking for non-ASCII characters
unicode_string = "Hello, world! 👋"
# Checking if the string contains non-ASCII characters
if any(ord(char) > 127 for char in unicode_string):
    print("The string contains non-ASCII characters.")
else:
    print("The string contains only ASCII characters.")
# Snippet 4: writing and reading a UTF-8 file
unicode_string = "This is a Unicode text with special characters: ä, é, ü"
# Writing to a file with UTF-8 encoding
with open('unicode_text.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_string)
# Reading the content back with UTF-8 decoding
with open('unicode_text.txt', 'r', encoding='utf-8') as file:
    content = file.read()
print(content)
# Snippet 5: detecting the encoding of a file
import chardet
# Read the raw bytes; detection works on bytes, not on decoded text
with open('unicode_text.txt', 'rb') as file:
    raw_data = file.read()
result = chardet.detect(raw_data)
print(f"Detected encoding: {result['encoding']}")
# Snippet 6: normalizing Unicode text
import unicodedata
# Unicode string with combining characters
unicode_string = "e\u0301"  # 'e' followed by an acute accent (combining character)
# Normalize to NFC (Canonical Composition)
normalized_string = unicodedata.normalize('NFC', unicode_string)
print(f"NFC Normalized: {normalized_string}")
# Normalize to NFD (Canonical Decomposition)
normalized_string = unicodedata.normalize('NFD', unicode_string)
print(f"NFD Normalized: {normalized_string}")
# Snippet 7: converting Unicode to ASCII with unidecode
from unidecode import unidecode
# Unicode string with accents
unicode_string = "El Niño está aquí!"
# Convert to ASCII by removing accents
ascii_string = unidecode(unicode_string)
print(ascii_string)
# Snippet 8: UTF-16 encoding and decoding
# Unicode string
unicode_string = "Hello, world! 👋"
# Encoding the string to bytes using UTF-16
encoded_bytes = unicode_string.encode('utf-16')
print(f"Encoded in UTF-16: {encoded_bytes}")
# Decoding the bytes back to string using UTF-16
decoded_string = encoded_bytes.decode('utf-16')
print(f"Decoded from UTF-16: {decoded_string}")
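# Snippet 10: handling Unicode with requests
import requests
# Fetching a web page (the timeout keeps the call from hanging indefinitely)
response = requests.get("https://www.example.com", timeout=10)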
# Getting the content of the page in the proper encoding
response.encoding = 'utf-8' # Specify the encoding
print(response.text) # Print the decoded HTML content