I am seeking insights from the community regarding a persistent technical issue I have encountered while using the LLM assistant (Gemini) for a quantitative stock screening workflow. The fundamental problem is the model’s recurring failure to comply with an explicit instruction to use only the data provided in an uploaded CSV file, which results in the injection of external data and incorrect company tickers.
1. Background and Goal
My objective was to generate a “short-list” of 5 to 10 companies for the U.S. Consumer Staples sector, based purely on financial Key Performance Indicators (KPIs) provided in a file I uploaded.
The primary and repeated constraint I gave the LLM was:
“Read only the file I send here. Do not add any company that is not on the list. Do not invent data, use only the available data in the table.” (And, later: “DO NOT LOOK FOR ANY EXTERNAL DATA.”)
2. Core Failure: External Data Injection
In my initial attempts to filter the data, the LLM returned a ranked list that included tickers (MO, K, TSN, GIS, etc.) that were not present in my uploaded file.
- Problem: The LLM bypassed the strict constraint and drew on an internal or external knowledge base associated with “U.S. Consumer Staples,” injecting tickers and data from a source other than my CSV.
- Impact: The initial analysis was invalid, as it was based on an entirely different set of companies. I had to repeatedly demand that the model show me the exact data it was processing to force it to use my actual file content (which contained PG, HSY, LEVI, CHD, INGR, etc.).
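To catch this kind of injection before trusting any ranking, I now check the model’s output programmatically against the uploaded file. A minimal sketch of that check (the file name and the `Ticker` column name are placeholders for my actual data):

```python
import csv

def find_injected_tickers(csv_path, returned_tickers, ticker_column="Ticker"):
    """Return any tickers in the model's output that do NOT appear in the CSV."""
    with open(csv_path, newline="") as f:
        allowed = {row[ticker_column].strip().upper() for row in csv.DictReader(f)}
    return sorted(t for t in (s.strip().upper() for s in returned_tickers)
                  if t not in allowed)

# If the file contains only PG and HSY but the model returns MO and K,
# this function flags MO and K as injected:
# find_injected_tickers("consumer_staples.csv", ["PG", "MO", "K"])
```

A non-empty result is an immediate signal that the model pulled companies from outside the file.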
3. Systemic Data Ingestion Reliability Issues
The issue of data constraint violation is amplified by critical failures in the data ingestion process itself, which have seriously limited my ability to use the platform for quantitative analysis.
- XLSX/CSV Conversion Failure: I have faced multiple failures when uploading Excel (.xlsx) files. The file-processing tool often fails during the conversion to CSV, typically by appending the sheet name and .csv extension onto the original filename (e.g., converting a file to List Of Companies - US Consumer Staples.xlsx - Sheet1.csv), which leaves the file inaccessible or unusable for the model. This has happened repeatedly and has caused me to give up on using Gemini for quantitative tasks entirely.
- Google Sheets Access Failure: I also attempted to bypass the file-upload issue by putting my data in a Google Sheet and granting the LLM direct access. Even with this explicit data connection, the model was incapable of reliably reading the sheet and adhering to the instruction to use only the provided data, repeating the same external data injection errors observed with direct file uploads.
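A workaround that has sidestepped the platform’s converter for me: convert the workbook to CSV locally before uploading, so the broken filename handling never runs. A sketch using pandas (the function name is mine; sheet selection defaults to the first sheet):

```python
import pandas as pd
from pathlib import Path

def xlsx_to_csv(xlsx_path, sheet_name=0):
    """Convert one sheet of an Excel workbook to a sanely named CSV."""
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    # Replace the .xlsx suffix instead of appending to it,
    # e.g. "companies.xlsx" -> "companies.csv"
    out_path = Path(xlsx_path).with_suffix(".csv")
    df.to_csv(out_path, index=False)
    return out_path
```

Uploading the resulting CSV directly avoids the in-platform conversion step entirely, though it obviously does not fix the data-injection behavior.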
4. Technical Questions for the Community
Given the LLM’s failure to restrict its data source and the systemic issues with data ingestion reliability, my questions for architectural or behavioral analysis are:
- Why does the model default to external/internal knowledge bases when processing uploaded data, even when my instructions explicitly forbid it? Is this a limitation in the file-processing architecture or a prioritization error in the model’s instruction-following mechanism?
- What is the root cause of the recurring failure to correctly process and convert uploaded XLSX files (specifically, the naming convention error during conversion)?
- Why is the model unable to reliably read and exclusively use data from connected services like Google Sheets, mirroring the exact constraint violations observed with direct file uploads?
- Is there a specific phrasing or a robust instruction formatting technique that is demonstrably more effective at enforcing absolute data source exclusivity for quantitative screening tasks?
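On the last question, one formatting technique I plan to test: paste the CSV content verbatim into the prompt between explicit delimiters and enumerate the allowed tickers up front, rather than relying on a file attachment. A hedged sketch of how such a prompt could be assembled (I make no claim this reliably prevents injection; the `Ticker` column name is an assumption about the table):

```python
import csv
import io

def build_constrained_prompt(csv_text, task):
    """Embed the table verbatim and restate the ticker whitelist in the prompt."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    tickers = ", ".join(sorted(r["Ticker"] for r in rows))
    return (
        "Use ONLY the table between <data> and </data>. "
        f"The only valid tickers are: {tickers}. "
        "If a company is not in this list, it does not exist for this task.\n"
        f"<data>\n{csv_text}\n</data>\n"
        f"Task: {task}"
    )
```

The idea is to make the whitelist explicit and machine-checkable, so any ticker outside it is unambiguously a violation.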
Any insights into the causes of this data leakage and ingestion failure would be greatly appreciated.