2025, Oct 24 01:00
Tesseract OCR chi_sim Returns Empty Text on Chinese Photos: Real Causes, Image Preprocessing Steps, and Reliable Python Calls
Fix Tesseract OCR chi_sim returning empty Chinese text: boost contrast, apply thresholding and rectification, and set pytesseract tessdata for reliable results.
When Tesseract OCR returns an empty string for Chinese text even though chi_sim is installed and visible to your tooling, the immediate instinct is to suspect a misconfiguration. In reality, the bottleneck here is often the image itself. A noisy, low-contrast photo with skew and a textured background can drop below the threshold where Tesseract can isolate glyphs at all.
Reproducing the issue
The following snippet reads a photo, converts it to a PIL image, and runs pytesseract with chi_sim. The language pack is present, yet the output is empty:
import cv2 as cv
from PIL import Image as PilImage
import pytesseract as ptes
from pyocr import tesseract as pyocr_tess  # imported alongside pytesseract; not used below
img_file = 'photo.jpeg'
cv_mat = cv.imread(img_file)  # OpenCV returns a BGR array, or None if the path is wrong
pil_frame = PilImage.fromarray(cv_mat)  # the BGR channels are handed to PIL as-is
result_text = ptes.image_to_string(pil_frame, lang='chi_sim')
print(result_text)  # prints an empty string
Verifying that the language is discoverable also succeeds:
import pytesseract as ptes
print(ptes.get_languages(config=''))  # ['chi_sim', 'eng', 'osd']
What is actually going on
The underlying image is simply too poor for Tesseract to isolate and recognise characters. It needs several kinds of cleanup before recognition: rectification to address perspective or rotation, contrast improvement to push foreground above the noise floor, and colour thresholding to strip away background texture. Even after cleanup, if the source is low fidelity, the result can still be below par for ordinary OCR.
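As a rough sketch of that cleanup, the lines below (assuming OpenCV is available, with an illustrative file name and parameter values) convert the photo to grayscale, boost local contrast with CLAHE, and apply adaptive thresholding so the text ends up black on a white background. Proper rectification of perspective distortion would additionally require locating the page corners and using cv.getPerspectiveTransform, which is omitted here:
import cv2 as cv
cv_mat = cv.imread('photo.jpeg')
gray = cv.cvtColor(cv_mat, cv.COLOR_BGR2GRAY)
# Local contrast boost so faint strokes rise above the noise floor.
clahe = cv.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
boosted = clahe.apply(gray)
# Adaptive threshold: black text on a white page, background texture suppressed.
binary = cv.adaptiveThreshold(boosted, 255, cv.ADAPTIVE_THRESH_GAUSSIAN_C,
                              cv.THRESH_BINARY, 31, 15)
cv.imwrite('photo_clean.png', binary)
Values such as the threshold block size and the CLAHE clip limit usually need tuning per document type; treat them as starting points rather than fixed settings.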
This limitation also explains why some services appear to “magically” extract clean text from such photos. Systems that aggregate many similar images can infer a much cleaner average and effectively read what a single-pass OCR engine cannot. For this specific kind of document, outputs like the following are achievable via such aggregation:
中华人民共和国
居民身份证
签发机关
有效期限
2007.05.14-2027.05 14
But even with a well-cleaned single image, Tesseract will often fall short of an artificially improved interpretation derived from multiple samples.
What actually helps
Start with the image. Improve contrast, rectify the document area, and apply colour thresholding to suppress the background. Tesseract prefers black text on a white background, and it struggles when text is too small or too large, so sufficient resolution (at least 72 dpi) matters. If you rely on a single noisy photograph with a textured background and low contrast, an empty or near-empty result is expected.
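Beyond the contrast and thresholding pass sketched earlier, the scale and polarity checks take only a few lines of PIL. The 1000-pixel cut-off and the 300 dpi hint below are illustrative choices rather than Tesseract constants, and the file names simply continue the earlier sketch:
from PIL import Image as PilImage, ImageOps
img = PilImage.open('photo_clean.png').convert('L')
# Upscale small pages: tiny glyphs are a common cause of empty output.
if min(img.size) < 1000:
    scale = 1000 / min(img.size)
    img = img.resize((int(img.width * scale), int(img.height * scale)),
                     PilImage.LANCZOS)
# Keep dark text on a light page: invert if the image is mostly dark.
hist = img.histogram()
if sum(hist[:128]) > sum(hist[128:]):
    img = ImageOps.invert(img)
img.save('photo_ready.png', dpi=(300, 300))  # embed a resolution hint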
On the integration side, keep your I/O and environment predictable. Loading directly into PIL avoids colour channel surprises, since OpenCV reads BGR while PIL expects RGB. If you read via OpenCV, confirm the image actually loaded by checking the result of imread, because a wrong path silently returns None rather than raising an error. You can also run the OCR from the command line to compare behaviour with your Python stack. When the layout is unusual, tuning psm can help. And if your tessdata directory isn’t on the default search path, passing it explicitly in config may be required even if the language appears in the list of available packs.
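Those checks fit in a short script. The psm values below are just examples worth comparing, and the equivalent shell check is tesseract photo.jpeg stdout -l chi_sim --psm 6:
import cv2 as cv
import pytesseract as ptes
from PIL import Image as PilImage
cv_mat = cv.imread('photo.jpeg')
assert cv_mat is not None, 'imread returned None - check the path'
print(cv_mat.shape)  # (height, width, channels) confirms the file really loaded
# OpenCV hands back BGR; convert before wrapping the array in a PIL image.
pil_frame = PilImage.fromarray(cv.cvtColor(cv_mat, cv.COLOR_BGR2RGB))
# Try a few page segmentation modes; unusual layouts often respond to one of them.
for psm in (3, 6, 11):
    text = ptes.image_to_string(pil_frame, lang='chi_sim', config=f'--psm {psm}')
    print(psm, repr(text[:60]))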
A safer call pattern in Python
The logic is the same—load the image and call image_to_string—but this variant removes BGR/RGB ambiguity and shows how to pass a tessdata path when needed:
from PIL import Image as PilImage
import pytesseract as ptes
src_path = 'photo.jpeg'
lang_id = 'chi_sim'
# Point this at your tessdata directory if it isn't on the default search path;
# set it to '' to rely on the default. The path below is only an example.
extra_cfg = '--tessdata-dir "/usr/share/tesseract-ocr/5/tessdata"'
img_obj = PilImage.open(src_path)
ocr_text = ptes.image_to_string(img_obj, lang=lang_id, config=extra_cfg)
print(ocr_text)
This won’t conjure text out of a fundamentally unreadable photo, but it reduces false negatives linked to I/O or configuration oddities.
Why this matters
OCR accuracy isn’t just about the model or the language pack. It is critically dependent on the visual quality of the input and the surrounding pipeline. Knowing the limits avoids chasing non-issues in your code when the real constraint is the data. It also clarifies why some end-to-end systems outperform a single Tesseract call on a single frame: they denoise by aggregation and enforce stricter preprocessing.
Practical takeaways
If Tesseract with chi_sim returns a blank line on a noisy document photo, treat the image, not just the code. Rectify the page, boost contrast, and threshold the background so the text becomes black on white. Ensure the resolution is adequate and consider tuning psm for atypical layouts. Load images in a way that preserves expected colour channels, verify that your input actually loaded, and, when necessary, pass the tessdata directory explicitly. With those in place, you can separate configuration issues from the more common reality: the image is simply below the quality threshold for reliable OCR.