Seamless Hadoop Data Migration to Azure HDInsight

Azure HDInsight is a cloud distribution of Hadoop components that makes it easy, fast, and cost-effective to process massive amounts of data. In this article we discuss best practices for migrating on-premises Apache Hadoop ecosystem deployments to Azure HDInsight.

What Celebal can do for you?

We help enterprises migrate traditional big data workloads from on-premises clusters to cloud-native Azure HDInsight, following best practices and a systematic methodology.

Here are some of the ways we add value:

A clear impact assessment and migration path for all workloads that must move to a serverless data lake
High-speed data migration, even over secured clusters
Highly scalable live incremental and full migration without downtime
A comprehensive, detailed assessment report that can be used both before and after migration
Automated migration of workloads and orchestration to the target technology on Azure
A backup and disaster recovery solution, established as required

Hive Metastore

Migrating the metastore is critical because it holds all the metadata for tables, views, and their data.

On Azure HDInsight, we recommend using a custom external metastore. An external metastore offers the following benefits:

Multiple Spark applications (sessions) can access it concurrently
A single Spark application can use table statistics without running “ANALYZE TABLE” on every execution
Compute resources and metadata are kept separate
Upgrades and integration with new releases of big data frameworks are easier
Backups of the custom metastore can be automated to run periodically

There are two options for migrating the metastore:

Custom Scripts
DB Replication Tool

Custom scripts are hard to manage, and implementing incremental changes with them is complex. Our methodology therefore uses a DB replication tool: we set up database replication between the on-premises Hive Metastore DB and the HDInsight Metastore DB.

Storage Migration

Data migration can be time-consuming, so the transfer from on-premises HDFS to Azure should be massively parallel to reduce the overall data movement time.

Azure HDInsight decouples storage from compute. Storage doesn’t need to be colocated with compute and can reside in Azure Storage, Azure Data Lake Storage, or both. The advantages of decoupling storage from compute are:

Storage and compute scale separately
Reduced cost
Data sharing across clusters
Improved data protection and security

Between Azure Blob Storage and Azure Data Lake Storage, we recommend Azure Data Lake Storage Gen2 (ADLS Gen2) for storing the data. ADLS Gen2 becomes the central repository for all related big data workloads, and it is designed to do more than just store data.

Benefits of ADLS Gen2

Hadoop compatible, so as-is migration of data is seamless
POSIX permissions (for managing data-level security)
Hadoop/Spark-optimized driver for big data analytics
Storage decoupled from compute

Because ADLS Gen2 is Hadoop compatible, we recommend using DistCp for as-is, scalable migration of the data.
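As a rough illustration of the DistCp recommendation, the Python sketch below assembles a `hadoop distcp` command line that copies an HDFS directory to ADLS Gen2. The helper name `build_distcp_command`, the NameNode address, the container `data`, and the storage account `examplelake` are all hypothetical placeholders, and the number of map tasks is an arbitrary value that would need tuning for a real cluster and network link.

```python
import shlex


def build_distcp_command(source, container, account, target_path,
                         num_maps=20, incremental=True):
    """Assemble a `hadoop distcp` command that copies an HDFS path to ADLS Gen2.

    All argument values are supplied by the caller; nothing here is specific
    to any one environment.
    """
    # ADLS Gen2 is addressed through the ABFS driver using an abfss:// URI.
    target = (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{target_path.lstrip('/')}"
    )
    cmd = ["hadoop", "distcp"]
    if incremental:
        # -update copies only files that are missing or differ at the target,
        # which suits repeated incremental runs.
        cmd.append("-update")
    # -m caps the number of simultaneous map tasks, i.e. the copy parallelism.
    cmd += ["-m", str(num_maps)]
    cmd += [source, target]
    return cmd


# Hypothetical source and target, for illustration only.
cmd = build_distcp_command(
    "hdfs://namenode:8020/warehouse/sales",
    container="data",
    account="examplelake",
    target_path="/warehouse/sales",
    num_maps=50,
)
print(shlex.join(cmd))
```

The command is returned as a list so it can be passed directly to `subprocess.run` on a cluster edge node without shell quoting issues. Authentication to the storage account is assumed to be configured separately in the cluster's Hadoop configuration.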

Workloads Migration

Azure offers a choice of fully managed relational, NoSQL, and in-memory databases, spanning proprietary and open-source engines, to fit the needs of modern app developers. HDInsight complements the different Azure Data Services to fit the workload needs.

For example, orchestrating workloads is simpler with Azure Data Factory (ADF) than with cron jobs or other open-source workflow schedulers.

We recommend the following service mappings:

LLAP cluster for interactive Hive queries with improved response times
LLAP queries in place of Impala-based queries
ADF for orchestration
ADLS/WASBS for data storage
Ranger for RBAC and access policies

Security

Enterprise Security Package (ESP) provides multi-user access on Azure HDInsight clusters. HDInsight clusters with ESP are connected to a domain. This connection allows domain users to use their domain credentials to authenticate with the clusters and run big data jobs.

The security package helps map existing Ranger access policies to domain credentials. And since the domain will be part of Azure Active Directory, managing and monitoring user access becomes simpler.

We have designed an automated way of migrating Ranger policies to Azure HDInsight. Our utility follows three simple steps:

Export the on-premises Ranger policies to XML files
Transform the on-premises-specific HDFS-based paths to WASB/ADLS paths using a tool such as XSLT
Import the policies into Ranger running on HDInsight
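The path transformation in the second step can be sketched in Python. This is only an illustration of the mapping itself, not our utility and not an XSLT stylesheet: the function name `hdfs_to_abfss`, the container `data`, and the storage account `examplelake` are hypothetical, and how the resource paths are extracted from and written back to the exported policy files is left out because it depends on the export format.

```python
from urllib.parse import urlparse


def hdfs_to_abfss(path, container, account):
    """Map an on-premises HDFS path or URI to an ADLS Gen2 (abfss://) URI.

    Accepts both full URIs such as "hdfs://namenode:8020/data/raw" and bare
    resource paths such as "/data/raw/*". The NameNode authority is dropped
    and the path portion is kept unchanged, including any wildcard.
    """
    parsed = urlparse(path)
    if parsed.scheme not in ("", "hdfs"):
        # Anything that is not an HDFS path is left for manual review.
        raise ValueError(f"not an HDFS path: {path!r}")
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{parsed.path.lstrip('/')}"
    )


# Hypothetical policy resource paths, for illustration only.
on_prem_paths = ["/warehouse/sales/*", "hdfs://namenode:8020/data/raw"]
migrated = [hdfs_to_abfss(p, "data", "examplelake") for p in on_prem_paths]
for old, new in zip(on_prem_paths, migrated):
    print(f"{old} -> {new}")
```

Rejecting unexpected schemes rather than rewriting them blindly keeps paths that already point at cloud storage, or at another file system, from being silently altered during the transform.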
