2025, Dec 03 09:00
Fix garbled Polish characters in YouTube uploads: normalize Python CLI input from Windows CP852 to UTF-8
Fix garbled Polish characters in YouTube uploads. Detect Windows CP852 and convert Python CLI text to UTF-8 before sending metadata safely via the API.
When uploading a YouTube video via the official Python sample, non‑ASCII characters in title and description may arrive garbled. Polish diacritics, for example, end up as strange glyphs. The payload is supposed to be UTF‑8, but the text you pass from the terminal can be in a different code page, and that mismatch corrupts characters on the way in.
Reproducing the issue
A typical invocation looks like this:
python upload_video.py --file="DU.mp4" --title="Rozporządzenie" --description="Zapraszamy do odsłuchania nowej publikacji dziennika ustaw opublikowanego na stronie sejmu.
Tytuł: DU/2025/695 - Rozporządzenie Ministra Rolnictwa i Rozwoju Wsi" --keywords="" --category="25" --privacyStatus="private" --noauth_local_webserver
On systems where the terminal uses a legacy code page, characters like ą, ę, ł, ź, ż do not arrive as UTF‑8. You can observe the transformation explicitly:
>>> "ą".encode('utf-8').decode('852')
'─ů'
This shows exactly the kind of corruption visible after upload: the UTF‑8 bytes of the original character are interpreted as if they were code page 852 text, so one character turns into two unrelated glyphs. YouTube’s API expects UTF‑8 for title and description, so the wrong input encoding leads to mangled text.
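The same mismatch can be reproduced and reversed on a whole word. The following is a minimal sketch, assuming the corruption is exactly "UTF‑8 bytes read as code page 852"; the sample word comes from the command above, and the variable names are illustrative only.

```python
# Sample text taken from the command line shown earlier.
original = "Rozporządzenie"

# Simulate the corruption: UTF-8 bytes misread as code page 852.
garbled = original.encode("utf-8").decode("cp852")

# Reverse it: recover the original bytes, then decode them as UTF-8.
restored = garbled.encode("cp852").decode("utf-8")

# ascii() keeps the output printable on any console encoding.
print(ascii(garbled))
print(restored == original)
```

Because every byte value has a mapping in code page 852, the misreading step never fails, which is why the corruption goes unnoticed until the text is displayed.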
What’s really happening
The YouTube Data API accepts UTF‑8 for text fields by default. If your terminal uses a different encoding, command‑line arguments are decoded accordingly before your script sees them. Passing those strings directly to the API means you are effectively sending the wrong bytes. On Linux the environment commonly uses UTF‑8 everywhere, so this mismatch is less likely. On Windows, seeing an active code page like 852 is a red flag that your CLI text is not in UTF‑8. Even switching the console to a different code page such as 65001 may not fix already processed text if the script still reads arguments using the legacy encoding; the robust approach is to normalize inside the script.
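To see what the script's environment reports, the values below can be printed at startup. This is a rough diagnostic sketch, not part of the official sample: the function names are hypothetical, and which value reflects the console code page depends on the platform and Python version (on recent Python versions on Windows, `sys.stdout.encoding` may report utf-8 even while the console code page is 852). The Windows-only branch calls the Win32 `GetConsoleCP` function through `ctypes`.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }


def console_code_page():
    """Return the Windows console input code page, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; windll exists only on Windows

    # GetConsoleCP returns 0 when the process has no console attached.
    return ctypes.windll.kernel32.GetConsoleCP()


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
    print(f"console code page: {console_code_page()}")
```

On Windows, running `chcp` in the same terminal shows the active code page as well and is a quick cross-check.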
The fix: normalize to UTF‑8 inside the uploader
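Convert the incoming text from the terminal’s code page to UTF‑8 before building the request. The transformation that undoes the corruption shown earlier re‑encodes the text with code page 852 to recover the original bytes, then decodes those bytes as UTF‑8: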
restored = raw_text.encode('852').decode('utf-8')
Here is a minimal, self‑contained way to apply it to the fields you send to YouTube. The names are illustrative and can be adapted to where you assemble the upload payload:
def to_utf8_from_cp852(s):
    return s.encode('852').decode('utf-8')
# assume cli_opts.title and cli_opts.description come from your argument parser
fixed_title = to_utf8_from_cp852(cli_opts.title)
fixed_desc = to_utf8_from_cp852(cli_opts.description)
# later when constructing the body for the API request
payload = {
    'snippet': {
        'title': fixed_title,
        'description': fixed_desc,
    }
}
If you prefer to transform inline at the point of assignment, keep the same logic:
desc_text = cli_opts.description.encode('852').decode('utf-8')
To validate what you actually receive, you can also inspect individual characters numerically. This helps confirm whether your script is seeing 852‑decoded text or proper UTF‑8:
# pick one character from the incoming string
ch = cli_opts.description[0]
code_point = ord(ch)
print(code_point)
For a portable workflow, it can be useful to detect the terminal’s encoding and convert accordingly. There are established ways to obtain the terminal encoding in Python; the discussion here can point you in the right direction: https://stackoverflow.com/questions/6396659/how-do-you-get-the-encoding-of-the-terminal-from-within-a-python-script
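One way to make the conversion safer to run everywhere is to attempt it and fall back to the unchanged text when it does not apply. The sketch below is an assumption about how that could look, not part of the official uploader: `repair_mojibake` and `code_points` are hypothetical helper names, and the `wrong_codec` argument is where a detected terminal encoding would be passed in.

```python
def repair_mojibake(text, wrong_codec="cp852"):
    """Undo UTF-8 bytes misread as wrong_codec; return text unchanged on failure."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up; leave it alone.
        return text


def code_points(text):
    """List the code points of text, e.g. to compare garbled and repaired input."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    garbled = "Rozporz\u2500\u016fdzenie"  # what the script might receive
    fixed = repair_mojibake(garbled)
    print(code_points(garbled[7:9]))  # the two glyphs that replaced one letter
    print(code_points(fixed[7:8]))    # the single repaired letter
```

Text that is already correct usually fails the round trip and is returned as is, and pure ASCII passes through unchanged. The check is a heuristic, though: a correct string whose code page 852 bytes happen to form valid UTF‑8 would still be altered, so it is worth logging the before and after values during the first few uploads.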
Why this matters
Titles and descriptions with broken diacritics are not just cosmetic glitches. They affect readability, searchability and trust. Ensuring your upload path consistently produces UTF‑8 keeps your metadata usable across the entire YouTube pipeline and avoids manual fixes post‑upload.
Wrapping up
If your YouTube uploads show corrupted Polish characters, the root cause is a mismatch between your terminal’s code page and the UTF‑8 that the API expects. Reverse the mismatch inside the script by converting from the console code page, such as 852, to UTF‑8 using text.encode('852').decode('utf-8'), and pass the normalized strings to the API. If needed, verify the terminal encoding first and only then apply the conversion. This small adjustment makes your upload process resilient regardless of where the script runs.