2025, Dec 27 03:00
Parse commented headers with NumPy genfromtxt: avoid dtype=str empty strings using names=True and dtype=None
Learn why NumPy genfromtxt with names=True and a commented header returns empty strings with dtype=str, and how dtype=None or dtype=object fixes parsing while the header stays available in dtype.names.
Parsing delimited text with NumPy can be deceptively simple until a small detail in the API changes how the data is interpreted. A common case: a file has a commented header line and you want NumPy to treat it as column names while reading the rest as strings. On Windows 11 with Python 3.13.2 and NumPy 2.1.3, this can surprise you if you pass dtype=str and expect the header to appear as a regular row.
Reproducible setup
Input file with a commented header and pipe-separated values:
#firstName|LastName
Anthony|Quinn
Harry|POTTER
George|WASHINGTON
Code that looks reasonable at first glance, but returns empty strings:
import numpy as np

with open("C:/tmp/data.txt", "r", encoding="UTF-8") as fhandle:
    parsed = np.genfromtxt(
        fhandle,
        comments="#",
        delimiter="|",
        dtype=str,
        names=True,
        skip_header=0,
    )

print(f"parsed = {parsed}")
Observed output:
parsed = [('', '') ('', '') ('', '')]
What is really happening
NumPy’s documentation for genfromtxt is clear about how comments and headers interact:
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored.
Note: There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.
That “exception” is the key. When names=True, the first commented line is not returned as a data row. Instead, genfromtxt extracts the column names from that line and stores them in the structured dtype of the resulting array. The header does not sit alongside the data; it lives in result.dtype.names.
There is also a second observable detail: with dtype=str in this setup, every data field comes back as an empty string. Switching to dtype=None lets genfromtxt infer a type for each column from the data, and the non-header rows then parse correctly. Using dtype=object also yields data. In both cases the header is still exposed via dtype.names.
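One plausible reason for the empty strings, offered here as an assumption rather than something the quoted documentation states, is that str used as a NumPy dtype carries no character width. The minimal sketch below only inspects dtypes. The field names are copied from the sample file, and the sketch makes no claim about genfromtxt's internals.

```python
import numpy as np

# str as a NumPy dtype is a Unicode type with no length attached.
plain = np.dtype(str)

# A structured dtype whose fields are declared as plain str
# (field names taken from the sample file's header).
record = np.dtype([("firstName", str), ("LastName", str)])

print(f"plain itemsize  = {plain.itemsize}")
print(f"field itemsize  = {record['firstName'].itemsize}")
print(f"record itemsize = {record.itemsize}")
```

An item size of zero bytes leaves no room for characters, which is consistent with the ('', '') rows in the observed output.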
Solution
Let genfromtxt infer the column types and read the header as field names. Access the header through the array’s dtype metadata.
import numpy as np

with open("C:/tmp/data.txt", "r", encoding="UTF-8") as src:
    table = np.genfromtxt(
        src,
        delimiter="|",
        comments="#",
        dtype=None,      # let NumPy infer types
        names=True,      # use the first commented line as field names
        skip_header=0,
        autostrip=True,  # trim surrounding spaces if present
    )

print(f"table = {table}\n")
print(f"Detected header (from dtype): {table.dtype.names}")
Expected output:
table = [('Anthony', 'Quinn') ('Harry', 'POTTER') ('George', 'WASHINGTON')]
Detected header (from dtype): ('firstName', 'LastName')
This demonstrates two important points. First, names=True with comments="#" does treat the commented header line specially, promoting it to field names. Second, the header is not part of the row data; it’s stored in the structured array’s dtype.
Why this matters
Structured arrays in NumPy carry their schema in the dtype. When genfromtxt builds such an array, the field names come from the header, and the header is no longer a regular row in the dataset. If later code assumes the header is still the first record, for example by skipping row 0, it silently drops or misreads real data. Knowing where the header is stored prevents incorrect indexing and avoids fragile workarounds.
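To make the point about schema concrete, here is a small sketch of field access on the parsed array. It reads the sample data from an in-memory io.StringIO buffer instead of C:/tmp/data.txt so it runs anywhere. The variable names and the explicit encoding argument are choices made for this illustration only.

```python
import io

import numpy as np

# In-memory copy of the sample file so the sketch runs anywhere.
sample = io.StringIO(
    "#firstName|LastName\n"
    "Anthony|Quinn\n"
    "Harry|POTTER\n"
    "George|WASHINGTON\n"
)

table = np.genfromtxt(
    sample,
    delimiter="|",
    comments="#",
    dtype=None,
    names=True,
    encoding="utf-8",  # explicit so text columns load as str, not bytes
)

print(len(table))             # number of data rows; the header is not counted
print(table.dtype.names)      # field names taken from the commented line
print(table["LastName"])      # one column, selected by field name
print(table[0]["firstName"])  # one value: first record, firstName field
```

Row 0 here is the Anthony Quinn record, not the header, so there is nothing to skip before processing the data.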
Another practical takeaway is that dtype selection affects parsing outcomes. In this scenario, keeping dtype flexible with dtype=None or using dtype=object allows the parser to populate fields as expected, while the header remains accessible via dtype.names.
Takeaways
If you load delimited text with genfromtxt where the header is commented and names=True is set, don’t search for that header in the array body. It won’t be there. Read it from array.dtype.names. If you see empty strings with dtype=str, allow NumPy to infer types by switching to dtype=None, or use dtype=object, and then proceed with structured field access. This keeps the data layout explicit, the header discoverable, and the parsing reliable across similar files.
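If downstream code really does expect the header as the first row, one option is to rebuild that layout explicitly from dtype.names and the records. This is only a sketch under that assumption, not a method the NumPy documentation prescribes. The names records and rows_with_header are hypothetical, and the data again comes from an in-memory buffer rather than the file on disk.

```python
import io

import numpy as np

# In-memory copy of the sample file so the sketch runs anywhere.
sample = io.StringIO(
    "#firstName|LastName\n"
    "Anthony|Quinn\n"
    "Harry|POTTER\n"
    "George|WASHINGTON\n"
)

table = np.genfromtxt(
    sample,
    delimiter="|",
    comments="#",
    dtype=None,
    names=True,
    encoding="utf-8",  # explicit so text columns load as str, not bytes
)

# atleast_1d keeps this working if a file has only one data row,
# in which case genfromtxt returns a 0-dimensional array.
records = np.atleast_1d(table).tolist()  # list of plain Python tuples

# Put the header back in front of the data as an ordinary tuple.
rows_with_header = [table.dtype.names, *records]

for row in rows_with_header:
    print(row)
```

Building the list this way keeps the header's origin explicit: it is read from the dtype, then placed ahead of the data on purpose.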