2025, Nov 30 17:00

Resolve Google Cloud Discovery Engine import failures: use the document schema to ingest PDFs and metadata from Cloud Storage

Importing JSONL with PDFs to Google Cloud Discovery Engine fails under CONTENT_REQUIRED? Switch to data_schema="document" to ingest both content and metadata.

Uploading unstructured documents with metadata into a Google Cloud Discovery Engine data store sounds straightforward until the import pipeline pushes back with cryptic errors. If you’re pointing a JSONL manifest at PDFs in Cloud Storage and seeing content-related failures, the root cause is almost certainly a mismatch between your data schema and the data store’s content configuration.

Problem overview

The flow seems simple: create a data store, place the PDFs and a JSONL manifest in a Cloud Storage bucket, then run an import with the Python SDK. Each JSONL line carries per-document metadata plus a content.uri pointing to its PDF. The data store is configured with CONTENT_REQUIRED. During import, the operation fails with messages like "To create document without content, content config of data store must be NO_CONTENT," repeated once per JSONL line.

Problematic code example

The following Python snippet triggers the error when the JSONL includes unstructured documents with metadata and the data store is CONTENT_REQUIRED:

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

REGION = "your-region"
PROJECT = "your-project-id"
STORE_ID = "your-store-id"
META_URI = "gs://your-bucket/metadata.jsonl"

client_opts = (
    ClientOptions(api_endpoint=f"{REGION}-discoveryengine.googleapis.com")
    if REGION != "global"
    else None
)

svc = discoveryengine.DocumentServiceClient(client_options=client_opts)

parent_branch = svc.branch_path(
    project=PROJECT,
    location=REGION,
    data_store=STORE_ID,
    branch="default_branch",
)

req = discoveryengine.ImportDocumentsRequest(
    parent=parent_branch,
    gcs_source=discoveryengine.GcsSource(
        input_uris=[META_URI],
        # "custom" imports metadata only, so content.uri is ignored.
        data_schema="custom",
    ),
    id_field="id",
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
)

op = svc.import_documents(request=req)
print(f"Waiting for operation to complete: {op.operation.name}")
res = op.result()
print(res)

The JSONL manifest looks like the following, where each line declares metadata and a content.uri pointing to the PDF in GCS:

{"id": "1", "structData": {"title": "Coldsmokesubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/ColdSmokeSubmittal.pdf"}}
{"id": "2", "structData": {"title": "Defssubmittal", "category": "212027"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/DEFSSubmittal.pdf"}}
{"id": "3", "structData": {"title": "Cmu Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/CMU_Submittal.pdf"}}
{"id": "4", "structData": {"title": "Concrete Mix Submittal", "category": "222039"}, "content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/Concrete_Mix_Submittal.pdf"}}
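Before running the import, a quick local sanity check of the manifest can save a failed operation round trip. The following is a minimal sketch (standard library only; the field checks mirror the manifest shape above and are not an official validator):

```python
import json

def validate_manifest_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL manifest line."""
    problems = []
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    # Top-level fields used by this manifest.
    for key in ("id", "structData", "content"):
        if key not in record:
            problems.append(f"missing '{key}'")
    content = record.get("content", {})
    if not str(content.get("uri", "")).startswith("gs://"):
        problems.append("content.uri must be a gs:// URI")
    if content.get("mimeType") != "application/pdf":
        problems.append("unexpected mimeType for a PDF import")
    return problems

line = ('{"id": "1", "structData": {"title": "Coldsmokesubmittal", "category": "212027"}, '
        '"content": {"mimeType": "application/pdf", "uri": "gs://meta-data-testing/ColdSmokeSubmittal.pdf"}}')
print(validate_manifest_line(line) or "ok")  # prints "ok"
```

Catching a malformed line locally is much faster than decoding per-line failure messages from the import operation.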

What’s really happening

The error message is accurate. Using data_schema="custom" declares that you are importing metadata only; content.uri is ignored. In that mode, Discovery Engine expects the data store to accept documents without content, which corresponds to the NO_CONTENT content configuration. If the data store requires content (CONTENT_REQUIRED), the system rejects each record as incomplete. That is why flipping the store to NO_CONTENT lets the import pass but leaves you with non-searchable documents: no content was ever ingested to index.
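One way to internalize the rule (purely an illustrative model of the documented behavior, not the service's actual implementation):

```python
def accepts_import(data_schema: str, content_config: str) -> bool:
    """Model of the check: a store that requires content rejects the
    metadata-only records produced by the "custom" schema."""
    record_has_content = data_schema == "document"  # content.uri honored
    if content_config == "CONTENT_REQUIRED" and not record_has_content:
        return False  # "content config of data store must be NO_CONTENT"
    return True

print(accepts_import("custom", "CONTENT_REQUIRED"))    # False: the failing import
print(accepts_import("document", "CONTENT_REQUIRED"))  # True: the fix
print(accepts_import("custom", "NO_CONTENT"))          # True, but nothing is searchable
```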

The fix

Switch the import to data_schema="document" so the pipeline ingests both the metadata and the referenced unstructured content. Also omit the id_field parameter: with the document schema, each JSONL line must already carry a valid Document.id, so id_field (which applies to the custom and csv schemas) is unnecessary. You still point the import at the JSONL manifest, whose content.uri links reference the PDFs in Cloud Storage; the document schema interprets each line accordingly and imports the PDFs alongside the metadata. The supported data_schema values are defined under GcsSource in the official API reference.

Corrected code example

Here is the adjusted Python snippet that imports the PDFs together with their metadata into a CONTENT_REQUIRED data store:

from google.api_core.client_options import ClientOptions
from google.cloud import discoveryengine

REGION = "your-region"
PROJECT = "your-project-id"
STORE_ID = "your-store-id"
META_URI = "gs://your-bucket/metadata.jsonl"

client_opts = (
    ClientOptions(api_endpoint=f"{REGION}-discoveryengine.googleapis.com")
    if REGION != "global"
    else None
)

svc = discoveryengine.DocumentServiceClient(client_options=client_opts)

parent_branch = svc.branch_path(
    project=PROJECT,
    location=REGION,
    data_store=STORE_ID,
    branch="default_branch",
)

req = discoveryengine.ImportDocumentsRequest(
    parent=parent_branch,
    gcs_source=discoveryengine.GcsSource(
        input_uris=[META_URI],
        # "document" ingests the metadata and the PDFs referenced by content.uri.
        data_schema="document",
    ),
    # No id_field: each JSONL line already carries its Document.id.
    reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
)

op = svc.import_documents(request=req)
print(f"Waiting for operation to complete: {op.operation.name}")
res = op.result()
print(res)

Why this matters

Choosing the correct data schema is the difference between uploading metadata-only records and actually ingesting unstructured content for search. With the document schema, Discovery Engine processes each content.uri, fetches the PDF, and indexes its text while preserving your custom structData. That lets you enrich retrieval with fields like category and filter search results through the API based on that metadata.
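For instance, metadata filters in the search API are comparison expressions such as ANY(...) over string fields. A small helper can build them (a sketch; the field name category comes from the manifest above, and the resulting string would be passed as the filter field of a search request):

```python
def any_filter(field: str, *values: str) -> str:
    """Build a filter expression matching documents whose `field`
    equals any of the given values."""
    quoted = ", ".join(f'"{v}"' for v in values)
    return f"{field}: ANY({quoted})"

print(any_filter("category", "222039"))            # category: ANY("222039")
print(any_filter("category", "212027", "222039"))  # category: ANY("212027", "222039")
```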

Outcome

After switching to data_schema="document" and removing id_field, the import completes successfully. The custom structData fields in the manifest (such as category) are parsed, and search results can be filtered on them via the API.

Conclusion

If your JSONL manifest includes content.uri entries for unstructured assets and your data store expects actual content, do not import with data_schema="custom". Use data_schema="document" and keep the data store at CONTENT_REQUIRED. This path imports both the file content and your metadata, making the documents searchable while retaining the fields you need for filtering and relevance control.