By Vivek Shitole
With nearly half a billion terabytes of data created daily and a significant portion of it transacted, the need for effective ways to collect, store, and analyse data is immense. With this gigantic growth in data comes an equal, if not greater, need to secure it from breaches and cybersecurity threats. This article outlines potential security threats and risks around data lakes and some practical ways to tackle them.
Let us first understand why the data lake has become an essential aspect of modern data architecture.
Structured data has long been managed and used effectively, both for day-to-day operations and for accurate decision-making. Unstructured data has only recently made its presence felt in data analytics, and it has made analytics more complicated. Repositories such as the data lake have therefore become critical, as they centralise both structured and unstructured data. A data lake stores this data in open file formats to enable direct analytics. Using big data via tools such as the data lake has propelled technological advancement, which has, in turn, given organisations the ability to uncover insights about their customers' needs and fulfil them while growing revenue.
However, this advancement in data analytics and data science comes with challenges: it complicates information security and data security in multiple ways. The sections below focus on these complications around data lake security and their potential remediations.
Many companies have now shifted their data lakes to cloud platforms, having discovered the core advantages of cloud computing and storage. Lower infrastructure and maintenance costs, customisability, and accessibility have allowed them to manage and store vast volumes of data effectively while saving on infrastructure spend. But as the promise of cloud technology allures companies, many still do not understand the vulnerabilities and challenges associated with the migration and integration process – especially the security risks these entail. All this means more data vulnerabilities, which creates a need for data security policies. From data loss to defending against cyberattacks when migrating and operating, there are inherent security vulnerabilities that should be understood.
Like many standard information security programs, data lake security comprises a set of processes and procedures to protect data from cyberattacks. Depending on the industry or the organisation deploying the data lake, a lot of sensitive information, such as credit card numbers, medical test results, and customer data, may be at stake, creating many potential cybersecurity risks.
Given below are some of the best practices to remediate critical risks emerging from cyberattacks:
Data Governance and compliance: Data governance is about managing an organisation’s data assets and ensuring that data is accurate, reliable, and secure. It is crucial for making informed decisions, complying with regulatory requirements, and driving business success. By establishing a robust and comprehensive data governance framework, an organisation can manage its data effectively and efficiently, protect sensitive information, and meet regulatory requirements, enhancing its overall performance and reputation. Policies, procedures, and standards are the three critical components of any such framework. In the current digital era, data is both a valuable asset and a liability, so effective governance is critical for managing, utilising, and protecting it with precision and purpose: maintaining the authenticity of data, complying with regulations, fostering trust, and enabling informed decision-making that ultimately guides organisations toward success. Throughout all of this, the increasing interconnectedness of operational and information technology systems presents critical security risks that governance must account for.
Data Administration: Data administration manages data from a conceptual, database-independent perspective. It coordinates information and metadata management strategies by controlling the requirements-gathering and modelling functions. Data modelling supports individual application development with tools, methodologies, naming standards, and internal modelling consulting, and it provides the upward integration of disparate application and software package models into the overall data architecture. This overall architecture is the enterprise data model, and it is critical to the organisation’s ability to assess business risk and the impact of business change. Each data tool may need a unique approach to administration, but data administration allows the organisation to maintain consistent security standards throughout the data lake. Another aspect of data lake administration is auditing data lake usage, which helps in understanding the importance of data assets and in defining best security practices.
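The auditing idea above can be sketched in a few lines. This is a minimal illustration only, assuming audit events are appended as JSON lines; the field names (`user`, `dataset`, `action`) are hypothetical, and a real platform would write to durable, tamper-evident storage rather than an in-memory list.

```python
import json
import time

def record_access(log: list, user: str, dataset: str, action: str) -> None:
    """Append one audit event; a real system would persist to durable storage."""
    log.append(json.dumps({
        "ts": time.time(),    # when the access happened
        "user": user,         # who accessed the data
        "dataset": dataset,   # which data lake asset was touched
        "action": action,     # read / write / delete
    }))

def usage_by_dataset(log: list) -> dict:
    """Summarise how often each asset is accessed, to gauge its importance."""
    counts: dict = {}
    for line in log:
        event = json.loads(line)
        counts[event["dataset"]] = counts.get(event["dataset"], 0) + 1
    return counts
```

Summaries like `usage_by_dataset` are one way an administrator can see which assets matter most and where security controls deserve the most attention.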
Data Access and Control: Generally, an organisation defines data access and controls through authentication and authorisation. Authentication verifies the user’s identity; multifactor authentication is now the norm. Authorisation determines each user’s level of access to the data, and the actions the user can take on it, based on specified policies. Security-principal-based authorisation, where the system evaluates permissions in a policy-defined order, is one effective approach.
Authentication and authorisation must be implemented consistently across the organisation to achieve effective data access controls for the data lake. In addition, no single approach to managing data lake access suits everyone. In practice, different organisations want different levels of governance and control over the data in their data lakes. Each organisation must choose the approach that meets its required level of governance, without introducing undue delay or friction in gaining access to data in a specific data lake.
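Security-principal-based authorisation can be sketched as a small policy evaluator. The principals, path prefixes, and deny-overrides-allow ordering below are illustrative assumptions, not any particular vendor's model; real data lake platforms express the same idea through IAM policies or access control lists.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    principal: str        # user or group the rule applies to
    resource_prefix: str  # data lake path the rule covers
    actions: frozenset    # actions permitted, e.g. {"read", "write"}
    effect: str           # "allow" or "deny"

# Hypothetical policies: analysts may read sales data but never the PII zone.
POLICIES = [
    Policy("analysts", "/lake/sales/", frozenset({"read"}), "allow"),
    Policy("analysts", "/lake/sales/pii/", frozenset({"read"}), "deny"),
    Policy("engineers", "/lake/", frozenset({"read", "write"}), "allow"),
]

def is_authorised(principal: str, resource: str, action: str) -> bool:
    """Evaluate policies in a fixed order: an explicit deny wins over any allow."""
    allowed = False
    for p in POLICIES:
        if p.principal == principal and resource.startswith(p.resource_prefix):
            if action in p.actions:
                if p.effect == "deny":
                    return False  # deny overrides
                allowed = True
    return allowed
```

The deny-overrides ordering is the design choice that makes nested sensitive zones (such as a PII folder inside an otherwise readable dataset) safe by default.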
Data Protection: Most information security standards require data encryption at rest, traditionally implemented through third-party database encryption products. For enterprises using cloud data lake vendors, however, encryption at rest is typically offered as a bundled service at no extra cost.
Encryption is desired and often required for data lake security, but it is not a complete solution, especially for analytics and machine learning applications.
Security must be a primary focus of data operations, and the data lake is no exception. This can be achieved through a simple ‘always on’ security posture that makes security an integrated and prescriptive approach to protecting your most sensitive assets by default. Following industry best practice, encrypting your data in transit and at rest is essential. With encryption comes an encryption key that must also be protected and secured. Both on-premises and cloud-native operations should provide secure storage and management of your encryption keys, not only for internal applications but also for third parties.
Encryption brings two challenges. First, the changed format of encrypted data fields may cause many applications to break. Second, encryption is only as secure as the key used to encrypt and decrypt, which becomes a single point of failure.
Unlike encryption, tokenisation keeps the data format intact, and because tokens are substitutes for, rather than transformations of, the original values, an attacker who steals the tokenised data still cannot reach the underlying data without access to the token vault.
The best practice is to use the cloud provider’s built-in encryption and add further protection from a third party. Such a vendor would decrypt the data, tokenise it, and provide custom views depending on the user’s access rights, all dynamically at run time.
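The contrast drawn above can be made concrete with a toy vault-based tokeniser: the token preserves the original format (digits stay digits, separators stay in place), so applications keep working, and the mapping back to real values lives only in the vault. This is a sketch only; the class and its behaviour are assumptions for illustration, and real tokenisation services add HSM-backed vaults, auditing, and collision guarantees.

```python
import random
import string

class TokenVault:
    """Toy format-preserving tokeniser; NOT production-grade protection."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._to_token: dict = {}  # original value -> token
        self._to_value: dict = {}  # token -> original value

    def tokenise(self, value: str) -> str:
        """Replace each character with a random one of the same class."""
        if value in self._to_token:          # stable: same value, same token
            return self._to_token[value]
        while True:
            token = "".join(
                self._rng.choice(string.digits) if ch.isdigit()
                else self._rng.choice(string.ascii_letters) if ch.isalpha()
                else ch                       # keep separators like '-'
                for ch in value
            )
            if token not in self._to_value and token != value:
                break
        self._to_token[value] = token
        self._to_value[token] = value
        return token

    def detokenise(self, token: str) -> str:
        """Only callers with vault access can recover the original value."""
        return self._to_value[token]
```

Because a 16-digit card number tokenises to another 16-digit string, downstream applications that validate field formats continue to work, which addresses the first encryption challenge noted above.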
Importance of Metadata:
A lot about data lake security and data security in general can be achieved via data governance, and metadata is an integral part of data governance. Let us understand a bit more about metadata and how it enables effective data governance.
Metadata means data about data. It helps in understanding the characteristics of the data in question: the format of the data, the length and number of data fields, and the type of data are all part of metadata. It describes data assets along several dimensions, such as technical metadata (data structure, data schema, technical field characteristics, data transfer protocol, etc.) and business metadata (data owner, data processor, types of data access roles, etc.).
If used strategically, metadata can help understand the key attributes of data such as data ownership, data accuracy, data classification, practical approaches towards data governance, data sources and destinations, and reliability of data.
Metadata management is critical for effective data governance. It is enabled via various metadata management processes, such as access controls, data control processes, data schema management, data field edit management, data classification, data quality management, data inflow and outflow, data search features, and required data compliance controls.
Effective metadata management leads to a solid data governance structure, which ultimately leads to high-quality data, data accuracy, data integrity, and intelligent usability. These are critical building blocks for users to exploit data assets to their full potential.
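The technical and business dimensions described above can be sketched as a single catalogue record. The field names here are illustrative assumptions rather than any real catalogue schema (production catalogues such as Hive Metastore or AWS Glue are far richer), but the sketch shows how business metadata, such as a classification, can drive a governance rule like the tokenisation requirement discussed earlier.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Technical metadata
    name: str
    file_format: str   # e.g. "parquet"
    schema: dict       # column name -> type
    # Business metadata
    owner: str             # accountable data owner
    classification: str    # e.g. "public", "internal", "pii"
    access_roles: list = field(default_factory=list)

def requires_tokenisation(meta: DatasetMetadata) -> bool:
    """Hypothetical governance rule: PII-classified assets need extra protection."""
    return meta.classification == "pii"
```

Keeping such records in the catalogue lets access controls, data quality checks, and compliance rules all key off the same metadata instead of each team maintaining its own copy.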
Conclusion
Due to technical advancements and innovative data management techniques, data lake security has become a dynamic and challenging topic. The right combination of processes, tools, integrations, and skill sets must be adopted to ensure appropriate controls around data lake security, leading to a better security posture and a reduction in data security risks and vulnerabilities for the organisation as a whole.