Key Elements of Your Data Strategy

In our previous blog post on Data-driven Business Value we made the point that, to build on top of the Artificial Intelligence (AI) and machine learning paradigm, you need to be in control of your own data. Becoming truly data-driven will unlock new business as well as technical opportunities. We also stressed that your data strategy needs to be aligned with your business goals, and that you need to know your current state before starting your transformation.

This post dives into three key elements of a data strategy:

  • Data Operating Model: Which organizational setup should you go for and who will be responsible for what? We’ll explain the three main operating models and discuss how to achieve scalability, quality and speed of innovation when working with data across an organization.

  • Security and Compliance: An increase in cyber security threats has put a much stronger focus on keeping systems and information secure. On top of this, depending on which industry and part of the world your business operates in, there are likely multiple regulations you need to comply with and even more on the horizon. Implemented wisely, your data strategy can help you stay both secure and compliant.

  • Technology Stack and Architecture: Before deciding on a tech stack and architecture you need to have a good understanding of both your current technology landscape and what capabilities a modern data platform should include. We’ll dig into how to choose a tech stack and set the architecture for a data platform that will help you achieve your business goals. 

Data Operating Model   

Possibly the most central part of your data strategy is what organizational setup you will go for and who will be responsible for what. This is often referred to as your data operating model. Basically, you need to decide: 

  • Who will be in charge of making sure that data is accurate and of high quality? 

  • Who will be responsible for the technical platform and infrastructure used to store and manage the data? 

  • Who will set guidelines and policies for how data is used and protected, and make sure they are followed? 

Without clear ownership for each of these areas, it's almost impossible to get good data quality or ensure data is secure. Even more fundamentally, to be successful you need to make sure your data strategy and platform are adopted. If no one is adding or using data, no value is gained, so you should aim for a model that has the potential to maximize adoption in your organization.

There are three main variants when it comes to operating model: a centralized approach, a decentralized approach or, the middle ground: a hybrid/federated approach.  

Many organizations start out by having a central data/Business Intelligence (BI) team, where you have all the skills needed to capture, transform and analyze data. Over time this almost always becomes a bottleneck. The team can’t handle all the analytical questions of the rest of the organization fast enough and other teams aren’t able to build new digital solutions and products based on data, as the data isn’t easily available outside the central team.    

The pendulum then sometimes swings towards a completely decentralized approach. The downside is that data often ends up in silos, as interoperability and visibility between different departments or business units are lacking. Costs often increase since there is no reuse of the underlying technical capabilities and infrastructure. Also, without any shared governance, data quality can suffer.

Data Mesh – Striking the Right Balance?   

Following the rise of microservice architectures and the organizational shift in many software companies towards stream-aligned teams (sometimes called product teams), the data mesh concept was introduced by Zhamak Dehghani in 2019.    

Stream-aligned teams own and know their domain, including the information needs of the business. They take full ownership over designing, building and running their applications and APIs. With a data mesh setup, they also fully own their data products.    

Data governance is done through a distributed and collaborative approach, where the stream-aligned teams have responsibility for the data quality within their domain while a central team or function establishes overall guidelines and standards that everyone follows. This is combined with a shared, self-service, technical platform, owned and maintained by a platform team.   

There are four principles of a data mesh:    

  • Decentralized Ownership and Architecture: Data is owned by the business domains and the stream-aligned teams that generate and use the data.    

  • Data as a Product: Data is treated as a product, with well-defined APIs and service-level agreements (SLAs) to ensure consistent and reliable access for consumers.   

  • Self-Service Data Platform/Infrastructure: There is a self-service platform that allows stream-aligned teams (and other users) to discover, access, and use data for analysis, without waiting for a central team to do it for them.     

  • Federated Governance: Stream-aligned teams have responsibility for data quality within their domain, while a central data governance team or function establishes overall guidelines and standards.   
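To make the "Data as a Product" principle more concrete, here is a minimal sketch of what a data product descriptor with a freshness SLA could look like. The class name, field names and the 24-hour window are assumptions made for illustration, not part of any data mesh standard or specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a stream-aligned team publishes for a data product."""
    name: str
    owner_team: str            # the domain team accountable for data quality
    schema: dict[str, str]     # column name -> type: the product's public interface
    max_staleness: timedelta   # freshness SLA promised to consumers


def meets_freshness_sla(product: DataProduct, last_updated: datetime, now: datetime) -> bool:
    """Return True if the product was refreshed within its promised window."""
    return now - last_updated <= product.max_staleness


# Illustrative values only.
orders = DataProduct(
    name="orders_daily",
    owner_team="checkout",
    schema={"order_id": "string", "amount": "decimal", "created_at": "timestamp"},
    max_staleness=timedelta(hours=24),
)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = meets_freshness_sla(orders, datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc), now)
stale = meets_freshness_sla(orders, datetime(2023, 12, 31, 6, 0, tzinfo=timezone.utc), now)
print(fresh, stale)  # True False
```

In a federated setup, the central governance function would typically decide which fields such a descriptor must contain, while each owning team fills in and maintains the values for its own products.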

Adopting a data mesh is usually a good idea if the organization already has stream-aligned teams and has reached a certain maturity level and size. The model is probably overkill if you’re a small startup with only one or a few teams.

The most important benefits are:   

  • Scalability: The distributed architecture and ownership allow the data mesh to scale efficiently as data volume and user demands grow.   

  • Speed of Innovation: Stream-aligned teams can adapt their data solutions quickly to meet evolving business needs.   

  • Improved Data Quality: Ownership fosters accountability for data quality within each domain/stream-aligned team. However, this requires that the owning team is given the right pre-conditions, such as technical training, to allow them to fully take that accountability.    

The biggest challenges when applying a data mesh are:   

  • Complexity: Implementing and managing a data mesh can be complex, requiring a cultural shift and changes in organizational structure.   

  • Maturity and Technical Expertise: A certain level of maturity and technical expertise is needed across the organization to manage data effectively.   

  • Data Consistency: Ensuring consistent data definitions and quality across different stream-aligned teams requires collaboration and well-designed governance.    

If your goal is to become data-driven, building the skills necessary to manage data effectively across the organization is both a worthwhile investment and a necessity. This transition can happen iteratively, allowing the organization to take on these new skills over time. During the transition many teams need training and support, for example from an enabling team.

Security and Compliance   

An increase in cyber security threats has put a much stronger focus on protecting software systems as well as data. Your data platform and governance need to implement and follow robust security measures to protect sensitive information. Depending on which industry and part of the world your business operates in, there are likely multiple regulations and standards you need to comply with and even more on the horizon.

The good news is that, with modern solutions, security and compliance can be built in from the start. For many organizations, complying with necessary regulations, especially around reporting, is the starting point for taking control over their data. From a business perspective, this is merely a box you need to tick. Using your data strategically, for decision making and to innovate and unlock entirely new value streams, is where your data platform really starts to pay off.  

To list just a few of the areas where security and compliance often come into play regarding data:   

  • User Privacy: For example, GDPR in the EU and CCPA/CPRA in US/California  

  • Information and Cyber Security: Standards such as ISO 27001 and legislation, including NIS and the upcoming new EU legislation on cyber security: NIS2  

  • Financial and Sustainability Reporting: Financial reporting, such as SOX, and the upcoming Corporate Sustainability Reporting Directive (CSRD), which will put new requirements on data capturing and reporting for almost all businesses operating in the EU.  

  • AI Legislation: In the EU, the upcoming AI Act will take effect next year.  

To get you started, these are a few of the capabilities that you should make sure your platform supports you with:  

  • Personally Identifiable Information (PII): Keeping track of PII and having capabilities for anonymization, pseudonymization and/or tokenization.   

  • Separation of Data: Keeping production data and test data separate. 

  • Access Control: Robust access control mechanisms, combined with best-practice security measures such as least-privilege access to sensitive data.  

  • Data Lineage: Being able to capture how data flows and changes from the time it is captured until it is used.  

  • Location of Data: Being in control of where in the world your data is stored and who has legal rights to it. As mentioned in our modern cloud blog post, there are some interesting alternatives emerging to the big cloud providers, such as evroc, who are focusing on becoming a sovereign and sustainable cloud provider for Europe.
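As an illustration of the PII capability above, the following is a minimal sketch of pseudonymization using a keyed hash (HMAC-SHA256) from the Python standard library. The field names, the `PII_FIELDS` set and the hard-coded key are assumptions for the example; in practice the key would live in a secret store and the list of PII fields would come from your data catalog.

```python
import hashlib
import hmac

# Hypothetical: fields that a data catalog has tagged as PII.
PII_FIELDS = {"email", "phone"}


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace a PII value with a stable keyed hash (HMAC-SHA256)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


def pseudonymize_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with every PII field pseudonymized."""
    return {
        field: pseudonymize(str(value), secret_key) if field in PII_FIELDS else value
        for field, value in record.items()
    }


# Illustrative key only; never hard-code a real key.
key = b"example-key-load-from-a-secret-store"
row = {"customer_id": 42, "email": "Ada@Example.com", "country": "SE"}
safe = pseudonymize_record(row, key)
print(safe["customer_id"], safe["country"], len(safe["email"]))  # 42 SE 64
```

Because the same input and key always give the same output, pseudonymized columns can still be joined across datasets. Whoever holds the key can recompute the mapping, so this is pseudonymization rather than anonymization, and access to the key needs the same least-privilege treatment as the data itself.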

Implemented wisely, your data strategy and platform can support you by adding the flexibility you need, allowing you to adapt to stay compliant, as both regulatory and security requirements will change over time and are likely to continue to increase in importance going forward.

Technology Stack and Architecture  

A well-built data platform, with lots of technical capabilities, shouldn’t be the end goal in itself; it should be a tool to drive business outcomes and value, as we’ve discussed before. We strongly recommend choosing a technology stack and defining your architecture after you have your business goals and the two already mentioned pieces of your data strategy outlined.

Picking a technology stack requires careful consideration and a good understanding of your current tech landscape. The good news is that there are multiple solutions on the market that fulfill most common requirements; the bad news is that there are so many options it is easy to get lost.

A modern data platform normally consists of the following capabilities:   

  • Data Ingestion: Solutions for importing data from various sources, in real-time and in batches.  

  • Data Storage: Solutions for storing structured as well as unstructured data.  

  • Data Processing: Tools and services for cleaning, transforming, and preparing data for analysis.  

  • Data Orchestration and Workflow Management: Tools for managing data pipelines and workflows.

  • Data Analysis and Reporting: Tools that enable the analysis of data to generate insights, including BI tools and analytics platforms.  

  • Data Governance and Security: Policies and mechanisms to ensure access control, data quality, compliance, and protection against breaches, security threats and vulnerabilities.  

  • Machine Learning and AI: Integration of AI and machine learning models and the capabilities to use them in production. This includes training your own models, using existing pre-trained models, and combining your own data with pre-trained models, e.g. through Retrieval-Augmented Generation (RAG).  
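To show how a few of these capabilities fit together, here is a toy sketch of ingestion, processing, storage and reporting steps run in dependency order, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The step functions and the in-memory `ctx` dictionary are stand-ins invented for the example; a real platform would use a dedicated orchestration tool and real storage.

```python
from graphlib import TopologicalSorter


def ingest(ctx):
    # Stand-in for reading a batch from a source system.
    ctx["raw"] = [{"amount": "10"}, {"amount": "n/a"}, {"amount": "5"}]


def process(ctx):
    # Clean and transform: keep only rows with a numeric amount.
    ctx["clean"] = [int(r["amount"]) for r in ctx["raw"] if r["amount"].isdigit()]


def store(ctx):
    # Stand-in for writing the cleaned data to storage.
    ctx["table"] = list(ctx["clean"])


def report(ctx):
    # Stand-in for an analysis or BI query.
    ctx["total"] = sum(ctx["table"])


STEPS = {"ingest": ingest, "process": process, "store": store, "report": report}

# Each step maps to the set of steps that must finish before it can run.
DEPENDS_ON = {"process": {"ingest"}, "store": {"process"}, "report": {"store"}}


def run_pipeline():
    """Run every step once, in an order that respects the dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        STEPS[name](ctx)
    return order, ctx


order, result = run_pipeline()
print(order, result["total"])  # ['ingest', 'process', 'store', 'report'] 15
```

Declaring the dependencies as data, rather than hard-coding the call order, is the core idea that workflow management tools build on: the same description can drive scheduling, retries and lineage tracking.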

You can either go for a single supplier’s solution or pick the best of breed from different suppliers and integrate them. Regardless of which you choose, make sure you put your platform together using a product mindset, since ease of use is key to ensuring wide adoption. Here are a few points to weigh when choosing the architecture and implementation of your data platform:

  • Existing Technology and Infrastructure: What does your current data landscape look like? What data sources do you have? How is data currently stored and managed? What other tech stacks do you already have in place, e.g. for application development? Knowing this helps identify potential challenges and integration points. Making it as easy as possible for many teams to use the platform, both as data producers and consumers, is important to increase adoption.   

  • Cloud vs. On-prem: Cloud-based data platforms are becoming increasingly popular because they are easy to use and easy to scale up and down as your needs change over time. If you choose to go with one of the big cloud providers, most of them have a plethora of tools available for each of the capabilities listed above. There are also some interesting independent Software as a Service (SaaS) alternatives (sometimes referred to as Data-as-a-Service) to consider, such as Databricks and Snowflake.   

  • Cost Aspects: Total cost of ownership needs to be taken into account, not only license costs but also maintenance and operations costs. If you are in a mixed cloud/on-prem setup, moving big amounts of data from one to the other can be costly. With guardrails you can make sure costs aren’t spiraling out of control.  
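A cost guardrail can be as simple as projecting month-end spend and raising a flag before the budget is exceeded. The sketch below assumes a linear projection and an 80% warning threshold; the function name, parameters and figures are illustrative and not tied to any cloud provider's billing API.

```python
def check_budget(spend_to_date, monthly_budget, day_of_month, days_in_month, warn_ratio=0.8):
    """Project month-end spend linearly and classify it against the budget."""
    projected = spend_to_date / day_of_month * days_in_month
    if projected > monthly_budget:
        return "over_budget", projected
    if projected > warn_ratio * monthly_budget:
        return "warning", projected
    return "ok", projected


# Illustrative figures: 4000 spent after 10 days of a 30-day month, budget 10000.
status, projected = check_budget(4000, 10000, day_of_month=10, days_in_month=30)
print(status, projected)  # over_budget 12000.0
```

In practice you would feed such a check with spend figures from your provider's cost reports and route the result to the team that owns the workload, so that the people who can reduce the cost are the ones who see the warning.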

Whether you go for a cloud-based or an on-prem solution, you can get far by using “boring” technology. You usually don’t need a lot of specialized tools and services to start with, even if some of them are very good for specific use cases. Adapting to the integration patterns and storage solutions you already have in place is usually better than going with the latest fancy tool.

Our most important advice is to build up your platform iteratively: start small and aim for a minimum viable platform that you can iterate on, while adding the first data products that deliver value to the business.

Once you have your data strategy in place and have started the transition to a data-driven organization, further capabilities around AI and machine learning can be added. It isn’t difficult to get started. However, it usually takes a few iterations to get something production-ready, and you should be mindful of the common challenges, e.g. hallucinations and biases in data. We’ll dig into the latest trends in AI and machine learning, such as generative AI, RAG, machine learning operations (ML ops) and much more, in a future blog post.
