How open source is helping remove data silos in the enterprise [Q&A]
Historically, data has been stored in silos in order to deliver a quick solution. But in the longer term, silos can slow down decision making, make modifying systems harder, and hinder compliance with regulations.
One of the ways to break down barriers between silos is to allow data to be freely shared between them, and open source has a big part to play in this. We spoke to Mandy Chessell, distinguished engineer at IBM Cognitive Applications, and recently elected leader of the Technical Steering Committee of the ODPi, to learn more.
BN: The current mindset seems to be that all data silos are bad. Is this really the case?
MC: Data silos are not necessarily all bad as they provide separation of concerns. Silos emerge as an enterprise evolves, and data ends up distributed among the many systems that support its operations. The data in a silo is typically focused on the goals of the organization that owns the silo.
For example, the kinds of queries that a silo supports well are those that its owning organization needs most. More effort is needed to optimize a data silo for the needs of multiple groups.
It can be quicker to develop a solution by creating new data stores, and therefore new silos, rather than integrating the data the application produces into existing data stores -- deferring the data integration to a later time, via extract, transform and load (ETL) tools.
However, each new silo increases the overall complexity of the IT landscape, making it harder to drive strategic change in a business and to demonstrate compliance with regulations.
BN: Given the move towards big data, aren't silos inevitable?
MC: Yes, it is difficult to avoid silos. People are just trying to get a job done in a timely way and cannot always take the time to study the bigger picture and how the data they are manipulating or producing fits in with the data that the larger entity manages. Yet organizations want to provide high quality data to their employees and use advanced analytics and AI to improve decision making. High quality content needs consistent and coherent data to be derived from the silos of data located in the organization's systems, and beyond.
BN: How can businesses improve the flow of data?
MC: Organizations need help to understand, manage, rationalize and evolve their systems and data. This needs an integrated and flexible knowledge base describing the systems, how they are linked together and the data that flows between them. This knowledge base must be self-managing and available to a wide range of tools and technologies.
Regulations are emerging within and across industries to ensure consistency and quality of data used within enterprises. Metadata is at the heart of these regulations. It encompasses database schemas, formats, semantic information, business rules, ownership, lineage, movement, and usage. Having a sound metadata strategy across an institution is key to improving the flow of data -- ensuring as much of the data as possible in the data silos is well described within the tools that manage the data, using shared glossaries and agreed vocabularies. Metadata that describes the format and content of data allows developers to determine which dataset to use in a new project. Metadata enables data to be used outside the applications and organizations that created it. As a bonus, metadata that describes the business context for data allows automated governance processes to be applied, making it easier to demonstrate compliance with initiatives such as GDPR.
BN: How does the ODPi Egeria project help?
MC: Products from a variety of vendors support metadata; however, they do not interoperate easily today. ODPi Egeria helps by defining a way to share metadata across diverse and heterogeneous tools.
For background: in August 2018, IBM, ING, Hortonworks (now Cloudera), SAS, and more created the open source ODPi Egeria project to support the free flow of metadata between different technologies and vendor offerings. Egeria enables organizations to locate, manage, govern and use their data more effectively across silos.
In January 2019, the ODPi Egeria conformance suite became available. It ensures that vendors who ship ODPi Egeria in their product offerings deliver a consistent set of APIs and capabilities, so that data governance professionals can build an enterprise-wide metadata catalog that all their data tools can use.
IBM is a founder and a leader of the ODPi Egeria project and is incorporating the tech in its tools such as InfoSphere Information Governance Catalog.
Egeria is helping enterprises close the data silo gap by making it possible for diverse tools used in multiple silos to interoperate and share their metadata. Organized metadata sharing through Egeria simplifies the access and integration of data in different silos. It also enables unified data governance and makes it easier to apply data regulations and demonstrate compliance with them.
BN: What benefits does integrating silo data deliver for the business?
MC: Businesses need to analyze data across organizations for many reasons, including showing compliance with GDPR. Tools that can integrate with each other via open metadata interfaces include business intelligence and data visualization tools -- they can utilize the metadata to locate suitable assets to produce reports and visualizations, and incorporate the metadata and lineage information in their outputs. Data science tools can find out about datasets that are available and highlight their availability to data scientists trying to complete an assigned task. API tools enable developers to create appropriate interfaces for use by applications. Curation and glossary tools benefit greatly from access to metadata across silos -- as do stewardship tools, which manage discrepancies in data handling by examining metadata and then requesting approval.
IBM's leadership in the ODPi Egeria open source project is driving the definition of industry-wide interfaces to integrate metadata across diverse tools, and is creating a community to drive the adoption of these interfaces. The open source activities take into consideration the needs of companies to create visualizations across silos, to locate datasets across an enterprise to train AI systems, and so on, as described above. The activities also incorporate the needs of tool suppliers to show that their software supports GDPR and other compliance initiatives in a unified rather than piecemeal way, helping to grow the data industry. Leaders in the ODPi Egeria community include ING and SAS as well as IBM.
Individuals and institutions can join the Egeria project to shape its future by providing requirements, sharing expertise, coding, testing and incorporating the tech in solutions. You can join the group and sign up for a mailing list on the ODPi site.