As businesses scramble to take advantage of artificial intelligence, they’re finding data makes all the difference in using AI effectively.
A recent report from Amazon Web Services found small and medium-sized businesses that have already integrated data analysis into their operations are significantly more likely to be using AI—and more likely to outperform their peers in the market.
“Probably the biggest takeaway is just how much better the small and medium businesses that are leveraging data are doing financially,” says Ben Schreiner, AWS’ head of business innovation for U.S.-based small and medium businesses.
Many traditional business tools let users analyze and query numeric business records like sales or expense spreadsheets with help from large language model assistants. But at the same time, AI’s increasing fluency with other content like text, images, and audio recordings is suddenly making such materials very valuable, provided companies can effectively organize and prepare them for use with the technology.
Turning old files into AI gold
AI vendors have, sometimes controversially, made deals with organizations like news publishers, social media companies, and photo banks to license data for building general-purpose AI models. But businesses can also benefit from using their own data to train and enhance AI to assist employees and customers. Examples of source material can include sales email threads, historical financial reports, geographic data, product images, legal documents, company web forum posts, and recordings of customer service calls.
“The amount of knowledge—actionable information and content—that those sources contain, and the applications you can build on top of them, is really just mind-boggling,” says Edo Liberty, founder and CEO of Pinecone, which builds vector database software.
Vector databases store documents or other files as numeric representations that can be readily compared to one another mathematically. Those comparisons are used to quickly surface relevant material in searches, group similar files together, and feed recommendations of content or products based on past interests. Vector databases are also used with AI to provide language models with relevant content to respond to user requests, like providing a chatbot with on-point material from a corporate knowledge base to answer a tech support question.
That process, called retrieval augmented generation, is often what enables generative AI to answer questions beyond what’s in its general-purpose training data. And it, like other uses of machine learning to address specific business needs, relies on accurate, well-organized data.
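To make that concrete, here is a minimal sketch of the retrieval step behind retrieval augmented generation. It is illustrative only and does not use Pinecone’s actual API: the embed() function is a hypothetical stand-in for a real embedding model, and a plain Python list plays the role of the vector database.

```python
# Toy sketch of retrieval augmented generation (RAG), not any vendor's API.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hashes words into a small vector, for illustration only."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Tiny "knowledge base": each document is stored alongside its vector.
docs = [
    "To reset your password, open Settings and choose Security.",
    "Refunds are processed within five business days.",
    "Our support line is open weekdays from 9 a.m. to 5 p.m.",
]
doc_vectors = np.stack([embed(d) for d in docs])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents whose vectors are most similar to the question."""
    q = embed(question)
    scores = doc_vectors @ q              # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

# The retrieved passages are prepended to the user's question in the prompt
# that would be sent to a language model.
context = "\n".join(retrieve("How do I change my password?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How do I change my password?"
print(prompt)
```

In a production system, the in-memory list and placeholder embedding would be replaced by a real embedding model and a vector database, but the retrieve-then-prompt pattern is the same.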
“AI is only as good as the data it gets,” says Eilon Reshef, cofounder and chief product officer at revenue intelligence provider Gong. “If you don’t have any data, it’s not going to tell you anything.”
Gong helps its clients use a slew of data including sales emails, call transcripts, and online interactions to understand and manage their sales practices. Its tools can make overall forecasts and predict when particular deals might close, determine what selling techniques are working and offer coaching to salespeople, and help draft emails to prospects. Turning all of those disparate records into machine-parsable data—managed, in Gong’s case, with Pinecone—enables analysis and automation that wouldn’t otherwise be possible. It can also avoid some of the issues that would arise with a boss directly listening to sales calls and offering advice.
Keeping it clean
In general, having clean, trustworthy data is necessary for building trustworthy AI. Making sure systems comply with laws and internal rules around data use is also critical.
At Walmart, a machine learning platform called Element helps the company quickly build reliable AI solutions that work across multiple cloud providers, says Anil Madan, SVP of global tech platforms at the retailer. The software helps ensure data use complies with relevant rules and that AI built with it is tested for bias and inappropriate output. The technology also monitors models for shifts in behavior and continued accuracy, even as Walmart’s huge number of products, transactions, customers, and employees ensures a deluge of data to process.
“As new data is coming in, our goal is to basically detect that drift and constantly evolve the model, so that we don’t have unnecessary biases introduced,” says Madan. “The key here is making sure that the data which we have is constantly helping us get better with our responsible use of AI.”
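The drift detection Madan describes can be approximated with a simple statistical check: compare the distribution of a feature in newly arriving data against the data the model was trained on, and flag large shifts. The sketch below uses a generic population stability index; it is an assumption about how such a check might look, not Walmart’s Element platform.

```python
# Generic data-drift check using the population stability index (PSI).
# This is a common industry heuristic, not a description of Walmart's tooling.
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Higher values mean the new data looks less like the training data."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)   # avoid log/division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
training_prices = rng.normal(20, 5, 10_000)   # feature values seen at training time
incoming_prices = rng.normal(26, 5, 2_000)    # feature values arriving this week

psi = population_stability_index(training_prices, incoming_prices)
if psi > 0.2:   # a common rule-of-thumb threshold
    print(f"Drift detected (PSI={psi:.2f}); consider retraining the model.")
```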
Legal and contractual restrictions can also shape how businesses can use customer and employee data to feed AI.
Startup Altana uses AI to process all sorts of data on the production and shipment of goods, tariffs, sanctions, and other factors that can affect risks and rewards for businesses throughout what it calls the entire value chain. The company takes pains to keep individual customers’ data confidential, while sharing appropriate derived information with the network as a whole.
Some of Altana’s data comes from documents like customs declarations, corporate registry filings, and other records that can be in numerous different languages or refer to the same product or facility in slightly different ways, says Peter Swartz, cofounder and chief science officer of Altana. AI helps parse and clean up those documents, understand the network graph of relationships they specify, and generate recommendations for customers as they look to understand risks and opportunities, he says.
“We have models and systems that learn from the entire graph and then identify where the errors are, and then go through and correct them,” he says.
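As a rough illustration of the matching problem Swartz describes, the snippet below flags records whose facility names are similar enough to plausibly refer to the same place. Altana’s production systems rely on learned models over a global graph; this toy version, with made-up records, only uses a character-level similarity score from Python’s standard library.

```python
# Simplified record-matching sketch with invented example records;
# real entity resolution uses learned models, not just string similarity.
from difflib import SequenceMatcher

records = [
    {"id": 1, "facility": "Acme Plastics Factory No. 3, Shenzhen"},
    {"id": 2, "facility": "ACME Plastics Fabrik Nr. 3 (Shenzhen)"},
    {"id": 3, "facility": "Bright Star Textiles, Da Nang"},
]

def similarity(a: str, b: str) -> float:
    """Character-level similarity between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # rule-of-thumb cutoff for "probably the same facility"
for i, rec in enumerate(records):
    for other in records[i + 1:]:
        score = similarity(rec["facility"], other["facility"])
        if score > THRESHOLD:
            print(f"Records {rec['id']} and {other['id']} likely match (score {score:.2f})")
```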
Learning from customer data
For business-to-business companies in general, years of data from multiple customers can power analysis and AI-driven recommendations for any particular customer, with those recommendations ideally growing more accurate over time as that customer’s own data accumulates.
Intuit’s Mailchimp has close to two decades of data representing how its clients’ own customers have engaged with hundreds of millions of email campaigns and other marketing material, says Shivang Shah, the company’s chief architect. That means its increasingly sophisticated AI tools can start making marketing recommendations essentially as soon as businesses sign up for the service, based on factors like what industry they’re in. Those recommendations then get more powerful as a business uses the service (and, if it chooses to, integrates data from Intuit’s QuickBooks) and Mailchimp gathers more data about that particular company, what sort of marketing campaigns it finds successful, and how its customers behave.
“Understanding the small businesses better, our recommendations get extremely personalized,” says Shah. “And then at the same time, autonomous to a point where the revenue intelligence engine can predict with very high confidence what marketing campaigns they need to run.”
Slightly more than a year ago, Intuit unveiled a platform called GenOS designed to help its developers quickly build AI tools. GenOS includes a development environment, tools for integrating AI with other software, standard components for users to communicate with generative AI, and AI models optimized for its business and personal finance products. In the past tax season, for instance, AI was able to help TurboTax users with tax questions and connect those who wanted human help with tax pros who matched their needs, says Ashok Srivastava, Intuit’s senior vice president and chief data officer.
Beyond building out the AI infrastructure, Intuit also built out an underlying data infrastructure, effectively treating datasets as products in themselves, shepherded by newly appointed “data stewards” dedicated to keeping them accurate and up to date. The company’s processes are designed to make its roughly 60 petabytes of data clean, accurate, and usable—with appropriate privacy and other safeguards—by developers for generative AI and traditional machine learning.
“Our data is cleaner now than it ever has been before,” says Srivastava. “We are using AI to actually monitor the quality of the data and other aspects of the data, and let data producers know if there’s a problem.”