2026, Jan 10 15:00
How to Reliably Read OpenFoodFacts Tab-Delimited Data with pandas by Disabling CSV Quoting
Fix pandas CSV/TSV imports from OpenFoodFacts: disable quoting (csv.QUOTE_NONE) to prevent field count drift and type errors, and load large files reliably.
Parsing OpenFoodFacts’ “CSV” export can be unexpectedly brittle. Although the file is tab-delimited, it contains the double-quote character used as an inch mark inside free text. With the default CSV quoting rules, that tiny detail is enough to derail tokenization, inflate field counts on random lines, and cascade into type-conversion errors where URLs suddenly appear inside numeric columns. Here is what is happening and how to make pandas reliably ingest the data.
Reproducing the issue
The dataset is large and tab-separated. A straightforward ingestion with pandas looks reasonable, including explicit dtypes inferred from column suffixes and a preliminary pass to “fix” suspicious sequences. Yet this setup triggers two classes of errors: “expected N fields, saw M” and “Unable to parse string …” during numeric conversion.
import os.path
import pandas as pd
import linecache
# dataset: https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv.gz
# delimiter is TAB
base_dir = os.getcwd()
raw_name = 'en.openfoodfacts.org.products.csv'
raw_path = os.path.join(base_dir, raw_name)
fixed_name = 'en.off.corrected.csv'
fixed_path = os.path.join(base_dir, fixed_name)
# pre-pass replacing "\n\t" with "\t"
# (note: iterating in text mode, '\n' only ever appears at the very end of
# each ln, so this replacement never actually fires; the pre-pass is part of
# the original, flawed attempt and is dropped in the fix below)
if not os.path.isfile(fixed_path):
    with open(raw_path, 'r', encoding='utf-8') as src, \
         open(fixed_path, 'w', encoding='utf-8') as dst:
        for ln in src:
            dst.write(ln.replace('\n\t', '\t'))
# infer column names and types by suffix
cols = pd.read_csv(fixed_path, sep='\t', encoding='utf-8', nrows=0).columns.values
dtypes_map = {c: 'Int64' for c in cols if c.endswith(('_t', '_n'))}
dtypes_map |= {c: float for c in cols if c.endswith(('_100g', '_serving'))}
dtypes_map |= {c: str for c in cols if not c.endswith(('_t', '_n', '_100g', '_serving', '_tags'))}
# main load
df = pd.read_csv(
    fixed_path,
    sep='\t',
    encoding='utf-8',
    dtype=dtypes_map,
    parse_dates=[c for c in cols if c.endswith('_datetime')],
    on_bad_lines='warn'
)
When the parser emits “Skipping line … expected 209 fields, saw 239,” a manual check that splits the raw line by TAB reports the expected 209 fields, which is confusing at first glance.
target_line = linecache.getline(raw_path, 1715281)
pieces = target_line.split('\t')
print(len(pieces)) # reports 209 while the parser saw 239
Root cause
The double-quote character appears in text fields as an inch symbol, for example: fluted shell round sweet 2.5". The CSV reader, however, treats a double quote as a quoting delimiter: whenever a field begins with one, i.e. a quote immediately follows a tab, the parser opens a quoted region, and until the next quote neither tabs nor newlines are recognized as separators, so physical lines can merge into a single oversized record. The immediate symptoms are field-count drift on otherwise valid lines and, later, type mismatches when columns shift and a value such as a URL lands in a numeric column, triggering “Unable to parse string …”.
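A minimal synthetic sketch (toy data, not actual OpenFoodFacts rows) reproduces both failure modes: with default quoting the two data lines merge into one oversized record, and with quoting disabled they parse cleanly.
import csv
import io
import pandas as pd
# a field that *starts* with a double quote (an inch mark in free text):
# default quoting rules open a quoted region that swallows the tab and the
# newline until the next quote, merging two physical lines into one record
sample = 'a\tb\tc\n1\t"x\t2\n3"\ty\t4\n'
try:
    pd.read_csv(io.StringIO(sample), sep='\t')
except pd.errors.ParserError as exc:
    print(exc)  # C error: Expected 3 fields in line 2, saw 4
# with quoting disabled, every quote is a literal character
df = pd.read_csv(io.StringIO(sample), sep='\t', quoting=csv.QUOTE_NONE)
print(df.shape)  # (2, 3)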
The fix
Instruct pandas to ignore quotes entirely for this file. That aligns the tokenizer with the actual data semantics where double quotes are plain characters, not field boundaries. The change is minimal and robust.
import os.path
import pandas as pd
import csv
base_dir = os.getcwd()
raw_name = 'en.openfoodfacts.org.products.csv'
raw_path = os.path.join(base_dir, raw_name)
# read header with quoting disabled
cols = pd.read_csv(
    raw_path,
    sep='\t',
    encoding='utf-8',
    nrows=0,
    quoting=csv.QUOTE_NONE
).columns.values
# same dtype inference by suffix
dtypes_map = {c: 'Int64' for c in cols if c.endswith(('_t', '_n'))}
dtypes_map |= {c: float for c in cols if c.endswith(('_100g', '_serving'))}
dtypes_map |= {c: str for c in cols if not c.endswith(('_t', '_n', '_100g', '_serving', '_tags'))}
# main load with quoting disabled
df = pd.read_csv(
    raw_path,
    sep='\t',
    encoding='utf-8',
    dtype=dtypes_map,
    parse_dates=[c for c in cols if c.endswith('_datetime')],
    quoting=csv.QUOTE_NONE
)
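After the load, a quick sanity check helps confirm the dtype mapping took effect; a sketch, relying only on the suffix conventions used above:
# all *_100g / *_serving columns should now be float64
numeric_cols = [c for c in df.columns if c.endswith(('_100g', '_serving'))]
print(df[numeric_cols].dtypes.unique())  # expect only float64
print(len(df))  # compare against the raw line count minus the header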
If memory is a constraint, read the file in chunks. This approach was validated on the dataset: disabling quoting allows chunked ingestion to proceed without the earlier parser errors.
import csv
import pandas as pd
for block in pd.read_csv(
    raw_path,
    sep='\t',
    encoding='utf-8',
    dtype=dtypes_map,
    parse_dates=[c for c in cols if c.endswith('_datetime')],
    quoting=csv.QUOTE_NONE,
    chunksize=100_000
):
    # process each block here
    pass
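One common pattern, sketched here with assumed column names ('code' and 'product_name' are standard in the export; 'energy_100g' is used illustratively): keep only the columns of interest per block and concatenate the slimmed pieces.
wanted = ['code', 'product_name', 'energy_100g']  # assumed column names
parts = []
for block in pd.read_csv(
    raw_path,
    sep='\t',
    encoding='utf-8',
    usecols=wanted,
    dtype={c: dtypes_map[c] for c in wanted if c in dtypes_map},
    quoting=csv.QUOTE_NONE,
    chunksize=100_000
):
    # keep only rows with a parsed numeric value in the column of interest
    parts.append(block.dropna(subset=['energy_100g']))
slim = pd.concat(parts, ignore_index=True)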
Why this matters
Large exports from real-world systems frequently blur the line between “CSV” and “TSV”, and their free text embeds characters with domain meanings of their own, like the double quote as an inch mark. Relying on default CSV semantics in these cases silently corrupts tokenization and column alignment, which then surfaces as intermittent parser warnings and opaque conversion errors deep in the pipeline. Telling the parser exactly how to treat quotes is the difference between a clean load and hours of chasing ghosts.
Takeaways
When a tab-separated export uses the double quote inside free text, disable quoting with quoting=csv.QUOTE_NONE so the character is treated literally. Avoid ad-hoc pre-processing such as rewriting sequences in the raw file; with the proper quoting setting, pandas reads the export as-is. If the dataset is too large to fit in memory, iterate in chunks with the same quoting rule. For deeper inspection of edge cases, a line-by-line pass with Python’s csv module and try/except can isolate irregular lines, as sketched below. Finally, if you control the upstream or can report issues, it is worth requesting that exported files use consistent, proper quoting to prevent downstream parsing surprises.
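A minimal diagnostic sketch (it assumes raw_path from above and a tab delimiter): walk the file with csv.reader, quoting disabled, and flag every record whose field count deviates from the header’s.
import csv
with open(raw_path, 'r', encoding='utf-8', newline='') as fh:
    reader = csv.reader(fh, delimiter='\t', quoting=csv.QUOTE_NONE)
    expected = len(next(reader))  # header width
    while True:
        try:
            fields = next(reader)
        except StopIteration:
            break
        except csv.Error as exc:
            print(f'line {reader.line_num}: parse error: {exc}')
            continue
        if len(fields) != expected:
            print(f'line {reader.line_num}: expected {expected} fields, '
                  f'saw {len(fields)}')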
With the quoting behavior aligned to the dataset, the warnings about mismatched field counts and the “Unable to parse string” errors disappear, and ingestion becomes deterministic.