Accuracy with large documents: chunking strategy to reduce hallucinations

I do a test with every LLM I come across: “Summarize this screenplay into a one-page synopsis.” They all fail.

I know how to summarise because of my training. Chunking requires context. If you take the last four sentences of every chunk, you’ll know where something big has changed, and you’ll know the overall logic of the document: this section is mostly about one thing, the next section is mostly about another, and so forth. Based on that overall logic, core concepts can be identified in each chunk and carried over as context for the next chunk, which helps preserve the fundamental meaning. The overall logic can then be refined against the core concepts of each chunk. Next, using the overall logic and the concepts as context, the chunks can be scanned for important details that relate directly to them. Those details can be put in order as a seriously rough composition of the document, and then the LLM can just rewrite it.
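
Roughly like this, as a minimal sketch in Python against a local Ollama model. The model name, the prompts, the fixed-size chunker, and the crude sentence splitter are all placeholder assumptions, not a tested pipeline:

```python
import ollama

MODEL = "llama3"  # placeholder model name

def llm(prompt: str) -> str:
    # Thin wrapper around a local Ollama model.
    return ollama.generate(model=MODEL, prompt=prompt)["response"]

def chunk_text(text: str, size: int = 4000) -> list[str]:
    # Naive fixed-size chunking; section-aware chunking comes up later in the thread.
    return [text[i:i + size] for i in range(0, len(text), size)]

def last_sentences(chunk: str, n: int = 4) -> str:
    # Crude sentence split, good enough for a sketch.
    sentences = [s.strip() for s in chunk.replace("\n", " ").split(". ") if s.strip()]
    return ". ".join(sentences[-n:])

def summarise(document: str) -> str:
    chunks = chunk_text(document)

    # Pass 1: the tail of each chunk gives a cheap map of the document's logic.
    tails = "\n---\n".join(last_sentences(c) for c in chunks)
    logic = llm(
        "These are the closing sentences of consecutive chunks of a document. "
        "Describe the overall logic: what each part is mostly about.\n\n" + tails
    )

    # Pass 2: core concepts per chunk, each carried forward as context for the next.
    carried, concepts = "", []
    for chunk in chunks:
        concepts.append(llm(
            f"Overall logic:\n{logic}\n\nContext carried from the previous "
            f"chunk:\n{carried}\n\nList the core concepts in this chunk:\n{chunk}"
        ))
        carried = concepts[-1]

    # Pass 3: refine the overall logic against the per-chunk concepts.
    logic = llm(
        f"Refine this outline using the concepts below.\n\nOutline:\n{logic}\n\n"
        "Concepts:\n" + "\n---\n".join(concepts)
    )

    # Pass 4: keep only details tied to the logic and concepts,
    # then rewrite the rough composition as the final synopsis.
    details = [
        llm(
            f"Logic:\n{logic}\n\nConcepts:\n{concept}\n\nExtract only the "
            f"details in this chunk that relate directly to both:\n{chunk}"
        )
        for chunk, concept in zip(chunks, concepts)
    ]
    return llm(
        "Rewrite this rough, in-order composition as a one-page synopsis:\n\n"
        + "\n\n".join(details)
    )
```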

I’ve done heaps of agentic work in Python with Ollama and embeddings, so I have a pretty good idea about how things can be improved.

It’s just a shame that these proven concepts about how the brain learns and comprehends most efficiently aren’t being used to make LLMs more proficient.

Google “4 Layers of Learning Every Student MUST Master” “Justin Sung”


Correct. I was about to reply to another post on this subject. Breaking the task down into subtasks reduces cognitive load and decreases hallucinations.


Yeah, if we can actually use chunking and sequencing (a teaching principle) and build it into the processing, then each chunk can hold one specific unit of information, like scenes in a screenplay, or the story between scene turning points: an introduction to the situation, the situation playing out, and a conclusion to that situation. Then the next. If an LLM can identify these as chunks, it will do much better, because instead of reading the last half of one scene and the first half of a completely different scene, and never having enough of either to get the full context, it has all the relevant information it needs to understand the chunk as a whole.

The same goes for, say, research papers: identifying the sections first is key. Each section then becomes a chunk that is processed and understood. If instead we don’t look at one section at a time, and only read half of a section plus half of the next, we can confuse the information.
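
For screenplays this part is actually fairly mechanical, since scene headings conventionally start with INT. or EXT. A minimal sketch (a real parser would also need to handle transitions, title pages, and nonstandard headings):

```python
import re

# Scene headings conventionally start with INT. or EXT. at the left margin.
SCENE_HEADING = re.compile(r"^(?:INT\.|EXT\.)", re.MULTILINE)

def split_by_scenes(screenplay: str) -> list[str]:
    # Each chunk is a whole scene, never an arbitrary window across two scenes.
    starts = [m.start() for m in SCENE_HEADING.finditer(screenplay)]
    if not starts:
        return [screenplay]
    starts.append(len(screenplay))  # sentinel end; text before the first heading is dropped
    return [screenplay[a:b].strip() for a, b in zip(starts, starts[1:])]
```

Research-paper sections could be split the same way with a heading regex instead.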

Say, for instance, the first section is all about price losses and contains a table, and the next section has a new table for price gains. If you end up processing the second half of the first table together with the first half of the second, it doesn’t really make sense. Even overlapping chunks can still miss parts.
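
One way around that is to treat tables as atomic blocks when chunking, so a table can never be cut in half. A crude sketch, assuming markdown-style tables where every row contains a “|”:

```python
def chunk_keeping_tables(lines: list[str], max_lines: int = 80) -> list[list[str]]:
    # Group lines into atomic blocks: runs of table rows stay together,
    # everything else is a single-line block.
    blocks, i = [], 0
    while i < len(lines):
        if "|" in lines[i]:
            j = i
            while j < len(lines) and "|" in lines[j]:
                j += 1
            blocks.append(lines[i:j])  # whole table as one block
            i = j
        else:
            blocks.append([lines[i]])
            i += 1

    # Greedy packing: start a new chunk rather than splitting a block.
    chunks, current = [], []
    for block in blocks:
        if current and len(current) + len(block) > max_lines:
            chunks.append(current)
            current = []
        current = current + block
    if current:
        chunks.append(current)
    return chunks
```

A table longer than the line budget just becomes one oversized chunk, which is still better than half a table.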

All the chunks will be different sizes, but if documents are broken up the same way humans break up the learning process in order to understand context, then the AI can avoid a lot of the issues with large data.

If each chunk can be summarised more accurately (“specific subject-matter chunk + context carried over from previous steps”), then the output will be about 10-15% of the whole yet still very accurate. Working with just that 15% or so is then roughly the size of one chunk, so you can do the same again at the next level up: from scenes to act structure (act 1, act 2, act 3, act 4) and summarise those. A research paper might be split into established literature, method, discussion of results, and conclusions and ramifications.
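
The second level might look something like this, reusing the placeholder llm() wrapper from the earlier sketch. The act boundaries here are assumed to be known, or found by an earlier pass:

```python
import ollama

def llm(prompt: str) -> str:
    # Same placeholder wrapper as in the earlier sketch.
    return ollama.generate(model="llama3", prompt=prompt)["response"]

def summarise_hierarchy(scene_summaries: list[str], acts: list[slice]) -> str:
    # `acts` marks which scene summaries belong to each act, e.g.
    # [slice(0, 12), slice(12, 30), slice(30, 45), slice(45, 52)].
    act_summaries = [
        llm("Summarise this act from its scene summaries:\n\n"
            + "\n\n".join(scene_summaries[act]))
        for act in acts
    ]
    return llm(
        "Combine these act summaries into a one-page synopsis:\n\n"
        + "\n\n".join(act_summaries)
    )
```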

It’s about how to sort the data into chunks that can be processed in the most efficient way, and humans have had millions of years to perfect that chunking and processing system. I’m not a complete expert on the psychology of it all, but I do think that by emulating how the brain works, not just in neural networks but also in how it processes information, we can get way bigger jumps in LLM technology.

Maybe a bunch of places are already working on this. But yeah, the moment we can get the 128k context window to actually be useful, that will be HUGE!

I’m currently doing this manually with my code. I take a specific subject, like a complicated function, then put into the prompt the context of what the function does in the whole, what calls it, where its outputs are sent and why, and then paste in the function itself. The model can then work with just that chunk plus its context and go through it without hallucinating. Giving it the entire project without context just confuses it. It works well.
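
In case it’s useful to anyone, the prompt I assemble looks roughly like this. All the parameter names are illustrative, not from any particular library:

```python
def build_function_prompt(
    function_source: str,
    purpose: str,          # what the function does in the whole
    callers: list[str],    # what calls it
    outputs_to: str,       # where its outputs are sent, and why
    task: str,
) -> str:
    # Chunk + context: the function itself plus just enough surrounding
    # information for the model to reason about it without the whole project.
    return (
        f"Context: this function {purpose}.\n"
        f"It is called by: {', '.join(callers)}.\n"
        f"Its outputs are sent to: {outputs_to}.\n\n"
        f"Task: {task}\n\n"
        f"Function source:\n{function_source}"
    )
```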