Data silos -- why they’re flawed and what to do about it [Q&A]


Every application, database, filesystem and SaaS service inevitably creates another data silo. From Hadoop-based data lakes to modern data warehouses and lakehouses, enterprises have invested millions in the promise of a single source of truth. But these grand visions invariably fall short.
We talked to Saket Saurabh, CEO and co-founder of Nexla, to discuss a more practical approach that embraces the existence of data silos while ensuring seamless access and usability.
BN: Why do organizations spend time trying to eliminate data silos and why does it never work?
SS: Organizations spend time trying to eliminate data silos because they see them as barriers to efficiency, collaboration, and data-driven decision-making. The common belief is that silos create redundancy, inefficiencies, and inconsistency in data, making it harder to get a unified view of business operations.
However, the attempt to unify 100 percent of all data silos often turns out to be too ambitious, with diminishing returns. Yet companies try to do that, and the underlying reason lies in basic human psychology. As humans, we like to keep things neatly organized, like books in a library. It gives a sense of control and simplifies future discovery and use. We naturally try the same with data, until we realize that the scale, complexity, and fragmentation of data make 'one silo' an impossible goal. Ultimately, the reason it never works is that new data silos are created faster than we can centralize them. By the time we get the bulk of the data into a single pattern, the underlying technology changes; case in point: Hadoop, Snowflake, Databricks, and now Iceberg.
BN: How can you determine which silos are essential?
SS: Determining which data silos are essential involves evaluating their purpose, usability, and necessity within an organization. But one key point not to be missed is to obey Data Gravity. Data Gravity means that data will naturally cluster by its type, function, or domain. It is wiser to understand and work with that tendency than to push against the grain.
A good analogy is teams in a company. You might ideally want all employees to be in your headquarters in New York, for the sake of better communication and efficiency. But now you start a new division and realize that most good hiring for that division is happening in a particular location, say North Carolina. It might then make sense to create a satellite office (a silo) in that location rather than requiring everyone to move to your HQ in New York.
The key points for identifying essential silos include:
- Business Functionality -- If a data silo supports a specific team’s efficiency and decision-making without blocking collaboration, it may be necessary.
- Security & Compliance Needs -- Some silos exist for regulatory reasons, such as GDPR or HIPAA compliance, where restricted data access is required.
- Performance Considerations -- Certain data silos improve system performance by keeping data localized rather than centralizing it in a way that could slow down processes.
- Integration Potential -- Essential silos should allow for seamless integration or data exchange with other systems, rather than being completely isolated. This is where the right tools can reduce friction in data flow across silos.
BN: Which types of data should be prioritized for decentralization?
SS: The types of data that should be prioritized for decentralization are those that benefit from being accessible, agile, and adaptable to different teams and use cases. This is data that is relevant to a large number of diverse users and applications in your enterprise. The key categories include:
- Real-Time and Operational Data -- Data that needs to be quickly accessed and acted upon by different departments (e.g., customer interactions, supply chain updates, or IoT sensor data).
- Domain-Specific Data -- Data that is highly relevant to a particular team or function, such as marketing analytics, product telemetry, or sales performance metrics. These should remain close to the teams that use them most.
- Collaborative and Dynamic Data -- Data that frequently changes and is used by multiple teams in different ways, such as experimental datasets, machine learning training data, or product development insights.
- Multi-Format or Unstructured Data -- Data that exists in different formats (text, video, images, logs, etc.), which may not fit well into a centralized system and needs to remain in specialized tools.
BN: What’s the best way to ensure seamless access to essential data?
SS: Instead of force-fitting all data into a central system, leverage Data Products to simplify discoverability and accessibility. Forrester Research reports that organizations implementing virtual data products achieve 47 percent faster time-to-insight and a 35 percent reduction in data integration costs compared to traditional centralization approaches. Virtual Data Products support two approaches:
- For low to medium volume data, keep the data in place and enable real-time access across silos. This is done via virtual Data Products that act as a gateway to any system.
- For high-volume data requiring central computation, connect Data Products to a destination store and trigger an efficient ETL or ELT pipeline that brings data together from across silos into a cloud warehouse or lakehouse.
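The two approaches above can be sketched in a few lines of Python. This is only an illustrative sketch, not Nexla's API: the `DataProduct` class, the `HIGH_VOLUME_THRESHOLD` cutoff, and the function names are all hypothetical, and a plain list stands in for the warehouse or lakehouse.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Record = dict

# Assumed cutoff between "low to medium" and "high" volume; tune per workload.
HIGH_VOLUME_THRESHOLD = 1_000_000


@dataclass
class DataProduct:
    """Hypothetical wrapper that exposes one silo behind a uniform interface."""
    name: str
    fetch: Callable[[], Iterable[Record]]  # reads records from the silo in place
    daily_rows: int                        # rough volume estimate


def choose_access(product: DataProduct) -> str:
    """Pick 'virtual' (query in place) or 'pipeline' (copy to a central store)."""
    return "pipeline" if product.daily_rows >= HIGH_VOLUME_THRESHOLD else "virtual"


def query_in_place(product: DataProduct,
                   predicate: Callable[[Record], bool]) -> list[Record]:
    """Virtual access: read from the silo on demand, without copying it anywhere."""
    return [r for r in product.fetch() if predicate(r)]


def load_to_warehouse(products: Iterable[DataProduct],
                      warehouse: list[Record]) -> int:
    """Pipeline access: copy records from each silo into one destination store."""
    loaded = 0
    for product in products:
        for record in product.fetch():
            # Tag each row with its source silo so lineage survives the copy.
            warehouse.append({**record, "_source": product.name})
            loaded += 1
    return loaded


# Example silos with made-up data and volumes.
crm = DataProduct(
    "crm",
    lambda: [{"id": 1, "region": "EU"}, {"id": 2, "region": "US"}],
    daily_rows=5_000,
)
clicks = DataProduct(
    "clickstream",
    lambda: [{"id": 10, "event": "view"}],
    daily_rows=50_000_000,
)
```

Here `choose_access(crm)` selects the virtual route and `choose_access(clicks)` selects the pipeline route. In practice the decision would likely weigh latency, cost, and governance as well as volume.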
BN: Where does AI fit into this approach?
SS: We are in an era of AI where powerful, general-purpose models are now available to everyone, everywhere. Gone are the days when every AI idea could take months or years to go from model design to training to production. This means the true challenge for enabling AI is now connecting it to the right data.
Breaking down data silos and making data accessible to AI is key. Silos can make that challenging, but before we set about breaking all silos, it is very important to remember that data has to be tightly governed before it feeds into AI. Now, breaking silos doesn’t mean putting all data from across silos into one silo. Instead, it means making the data behind a silo seamlessly accessible. Ultimately, the data user or application shouldn’t see silos as a source of friction.
Image credit: Khakimullin/depositphotos.com