Why structured data offers LLMs tremendous benefits -- and a major challenge [Q&A]
ChatGPT and other LLMs are designed to train and learn from unstructured data -- namely, text. This has enabled them to support a variety of powerful use cases.
However, these models struggle to analyze structured data, such as numerical and statistical information organized in databases, limiting their potential.
Mike Finley, CTO and co-founder of AnswerRocket, believes there is a way around this, by treating structured data as unstructured. We spoke to him to learn more.
BN: Why is it difficult for LLMs to analyze structured data?
MF: ChatGPT and other large language models (LLMs) are designed to train and learn from unstructured data (namely, text). This has enabled them to support a variety of powerful use cases. However, these models struggle to analyze structured data, such as numerical and statistical information organized in databases, limiting their potential.
LLMs are great at learning from unstructured data -- figuring out the meaning of text, its broader context, and delivering powerful insights from their learnings. But they struggle to do the same with structured data. When LLMs analyze columns of tabular data, they simply aren’t sure what they're looking at. For example, consider a model combing through GPS-based weather data -- specifically, daily temperatures. It might realize that this information is temperatures, but it could also think that it's a list of longitude coordinates or humidity measurements or even something completely unrelated to weather, like a list of people's ages.
The problem is that LLMs are built to learn for themselves and are resistant to being ‘told’ things. Data scientists are discovering they can't effectively explain what structured data is to these models and how they should treat it. You can attempt to tell an LLM that a column is zip codes, but it’s always going to say to itself, "Well, maybe it's zip codes, or maybe it's the first five digits of a phone number." That's just how these models were designed -- and it makes their insights from structured data unreliable.
BN: What value does structured data offer organizations leveraging LLMs?
MF: All sorts of critical statistics and records are stored in structured databases. This includes information like sales and revenue figures, customer engagement stats, web data, tax data, CRM data, patient records, etc. Every company collects tons of structured data like this -- and in verticals like science, finance and healthcare, organizations collect even more of it. If LLMs can effectively analyze and learn from that data, they can deliver all sorts of useful insights pertaining to that information. These insights can be used to guide almost any business decision, such as product strategy, marketing, hiring, and expansion.
Solving this problem is critical to enabling organizations to use LLMs on their own private data. LLMs like GPT train on publicly available web data and from their interactions with users. They're great for uncovering things about the world or an industry more generally. But for a company to get insights about its own business -- like the examples mentioned above -- it needs an LLM that can analyze and contextualize its structured data.
BN: Why is context so important?
MF: LLMs wouldn't provide value if all they did was memorize data. They have to understand the context or meaning of data to learn about it and provide deeper insights from it. As noted above, LLMs understand unstructured data, but they don’t effectively grasp structured data.
Imagine a restaurant chain using an LLM to dive deeper into its structured data to uncover how breakfast menu sales are doing at one of its locations. On the surface, it sounds like a straightforward Natural Language Processing (NLP) question you’d ask an LLM. But with existing capabilities, at best the LLM will only be able to summarize the tabular data in the simplest way. It would tell you, "Customer three ordered a biscuit at 10:22am, customer seven had a burger at 10:48am, customer 12 had pancakes at 11:35am, etc." The model wouldn't understand which items count as breakfast food, unless those items were categorized as such in the database. But if that model understood the context of structured data just as well as unstructured data, it would easily learn which items are breakfast foods and could answer that business's question about sales. Furthermore, it would deliver many deeper insights: low-carb breakfast items are doing particularly well, most customers want cheese on their breakfast sandwich, no one is ordering hash browns anymore.
BN: What does AnswerRocket do differently and how do you address this challenge?
MF: AnswerRocket's augmented analytics technology leverages the power of LLMs to allow organizations to analyze and learn from their raw data. Our new Max platform combines GPT with AnswerRocket's own analytics engine. It’s easy and requires no expertise: users simply ask natural language questions to the Max conversational AI assistant and get back answers immediately. Essentially, we allow organizations to uncover the same type of business insights outlined above.
Our solution to the structured data problem is straightforward: We convert structured data to unstructured data. As a result, GPT can understand and learn from structured data just as effectively as it can from unstructured data. There’s no work required of the user. AnswerRocket automatically converts the structured data as it’s collected, so users can analyze it in real time.
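The interview does not say how AnswerRocket performs this conversion. Purely as an illustration of the general idea, here is a minimal Python sketch that turns table rows into plain-English sentences an LLM could read as text. The function names, the sample order records and the column descriptions are all hypothetical and are not drawn from AnswerRocket's product.

```python
from typing import Mapping, Sequence


def row_to_sentence(row: Mapping[str, object], descriptions: Mapping[str, str]) -> str:
    """Render one table row as a plain-English sentence.

    `descriptions` maps each column name to a human-readable label, so the
    meaning of a value travels with the value itself. Columns without a
    label fall back to the raw column name.
    """
    parts = [f"{descriptions.get(col, col)} is {value}" for col, value in row.items()]
    return "; ".join(parts) + "."


def table_to_text(rows: Sequence[Mapping[str, object]], descriptions: Mapping[str, str]) -> str:
    """Render a whole table as one sentence per row, one row per line."""
    return "\n".join(row_to_sentence(row, descriptions) for row in rows)


# Hypothetical order records, loosely modeled on the restaurant example.
orders = [
    {"customer": 3, "item": "biscuit", "time": "10:22am"},
    {"customer": 7, "item": "burger", "time": "10:48am"},
]

# Hypothetical column labels supplied by whoever owns the data.
descriptions = {
    "customer": "the customer number",
    "item": "the menu item ordered",
    "time": "the order time",
}

print(table_to_text(orders, descriptions))
```

In a sketch like this, each value carries a description of what it is, so a bare column of numbers is less likely to be mistaken for something else. Whether a production system works this way is an assumption, not something the interview confirms.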
Image credit: nevarpp/depositphotos.com