Azure HDInsight is a cloud distribution of Hadoop components that makes it easy, fast, and cost-effective to process massive amounts of data. In this article we discuss best practices for migrating on-premises Apache Hadoop ecosystem deployments to Azure HDInsight.

What Celebal can do for you?

We help enterprises migrate traditional big data workloads from on-premises clusters to cloud-native Azure HDInsight, following best practices and a systematic methodology.

Here are some of the ways we add value:

  • A clear impact assessment and migration path for all the workloads that must migrate to a serverless data lake
  • High-speed data migration, even across secured clusters
  • Highly scalable live incremental and full migration without any downtime
  • A comprehensive, detailed assessment report that can be used during pre-migration and post-migration
  • Automated migration of workloads and orchestration to the target technology on Azure
  • A backup and disaster recovery solution, established as required

Hive Metastore

Migrating the metastore is critical because it holds all the metadata for the tables and views, including where their data is stored.

On Azure HDInsight, we recommend using a custom external metastore. An external metastore has several advantages:

  • Multiple Spark applications (sessions) can access it concurrently
  • A single Spark application can use table statistics without running “ANALYZE TABLE” on every execution
  • Compute resources and metadata are kept separate
  • Upgrades and integration with new releases of big data frameworks are easier
  • Backups of the custom metastore can be automated and run periodically
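
As a rough illustration, the connection settings behind an external metastore can be sketched as follows. The property names are standard Hive metastore settings. Hosting the metastore in Azure SQL Database, and the server, database, and user names, are assumptions made for this example. On HDInsight the custom metastore is typically selected when the cluster is created rather than by editing these values by hand.

```python
from typing import Dict


def metastore_properties(server: str, database: str, user: str) -> Dict[str, str]:
    """Build Hive connection properties for a metastore database.

    Assumes the metastore lives in Azure SQL Database; `server`, `database`,
    and `user` are placeholders supplied by the caller.
    """
    jdbc_url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;trustServerCertificate=false;"
        "hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
    )
    return {
        "javax.jdo.option.ConnectionURL": jdbc_url,
        "javax.jdo.option.ConnectionDriverName":
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "javax.jdo.option.ConnectionUserName": user,
        # The password is deliberately left out; keep it in a secret store.
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for key, value in metastore_properties(
        "example-metastore", "hivemetastore", "hiveadmin"
    ).items():
        print(f"{key} = {value}")
```

Because every cluster that is given the same settings reads the same database, the metadata outlives any single cluster, which is what makes separate scaling and upgrades of compute possible.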

There are two options for migrating the metastore:

  • Custom Scripts
  • DB Replication Tool

Custom scripts are hard to manage, and implementing incremental changes with them is complex. Our methodology therefore uses a DB replication tool: we set up database replication between the on-premises Hive metastore DB and the HDInsight metastore DB.

Storage Migration

Data migration can be time-consuming, so moving data from on-premises HDFS to Azure should be massively parallel to reduce the overall transfer time.

Azure HDInsight decouples storage from compute. Storage doesn’t need to be colocated with compute and can reside in Azure Storage, Azure Data Lake Storage, or both. The advantages of decoupling storage from compute are:

  • Scaling storage and compute separately
  • Reduced cost
  • Data sharing across clusters
  • Improved data protection and security

Between Azure Blob Storage and Azure Data Lake Storage, we recommend Azure Data Lake Storage Gen2 (ADLS Gen2) for storing the data. ADLS Gen2 becomes the central repository of the data for all the related big data workloads, and it offers more than plain storage.

Benefits of ADLS Gen2

  • Hadoop compatible, so as-is migration of data is seamless
  • POSIX permissions (for managing data-level security)
  • Driver optimized for Hadoop/Spark big data analytics
  • Decoupled storage from compute

Because ADLS Gen2 is Hadoop compatible, we recommend using DistCp for as-is, scalable migration of the data.
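
A minimal sketch of how such a copy could be scripted is shown below. It only assembles the command line. The `-m` (maximum number of parallel map tasks) and `-update` (copy only files that are missing or differ at the target) options are standard DistCp options. The namenode address, storage account, container, and the value of 50 maps are hypothetical, and the sketch assumes the cluster is already configured with credentials for the storage account.

```python
import shlex
from typing import List


def abfs_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build an ADLS Gen2 URI of the form abfs[s]://container@account.dfs.core.windows.net/path."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def build_distcp_command(
    source: str, target: str, max_maps: int = 20, incremental: bool = True
) -> List[str]:
    """Assemble a `hadoop distcp` command line as an argument list."""
    if max_maps < 1:
        raise ValueError("max_maps must be at least 1")
    command = ["hadoop", "distcp", "-m", str(max_maps)]
    if incremental:
        # Re-runs then transfer only new or changed files.
        command.append("-update")
    command += [source, target]
    return command


if __name__ == "__main__":
    # Hypothetical source and target, for illustration only.
    cmd = build_distcp_command(
        "hdfs://namenode:8020/data/sales",
        abfs_uri("raw", "examplestore", "/data/sales"),
        max_maps=50,
    )
    print(shlex.join(cmd))
```

The number of map tasks is the main lever for parallelism: each map copies a share of the files, so raising `-m` increases throughput until the network link or the source cluster becomes the bottleneck.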

Workloads Migration

Azure offers a choice of fully managed relational, NoSQL, and in-memory databases, spanning proprietary and open-source engines, to fit the needs of modern app developers. HDInsight complements the different Azure Data Services to fit the workload needs.

For example, orchestrating workloads with Azure Data Factory is simpler than using cron jobs or other open-source workflow schedulers.

Our recommended service mappings are:

  • LLAP cluster for interactive Hive queries with improved response time
  • LLAP queries as a replacement for Impala-based queries
  • Azure Data Factory (ADF) for orchestration
  • ADLS/WASBS for data storage
  • Ranger for RBAC and access policies

Security

Enterprise Security Package (ESP) provides multi-user access on Azure HDInsight clusters. HDInsight clusters with ESP are connected to a domain. This connection allows domain users to use their domain credentials to authenticate with the clusters and run big data jobs.

ESP helps map the existing Ranger access policies to the domain credentials. And because the domain is part of Azure Active Directory, managing and monitoring user access becomes simpler.

We have designed an automated way of migrating Ranger policies to Azure HDInsight. Our utility follows three simple steps:

  • Export the on-premises Ranger policies to XML files
  • Transform the on-premises-specific HDFS-based paths to WASB/ADLS paths using a tool like XSLT
  • Import the policies into Ranger running on HDInsight
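
The transformation step can be sketched in a few lines. This example uses Python's standard XML library rather than XSLT, and it is not our utility itself. The layout of the sample policy file, the `value` element name, and the storage account and container names are all hypothetical; a real export would need the element name adjusted to match its actual structure.

```python
import re
import xml.etree.ElementTree as ET

# Matches an optional "hdfs://host:port" authority at the start of a path.
_HDFS_AUTHORITY = re.compile(r"^hdfs://[^/]*")


def to_cloud_path(path: str, target_root: str) -> str:
    """Map an HDFS path (with or without hdfs://authority) onto target_root."""
    relative = _HDFS_AUTHORITY.sub("", path.strip())
    return target_root.rstrip("/") + "/" + relative.lstrip("/")


def rewrite_policy_paths(xml_text: str, target_root: str, path_tag: str = "value") -> str:
    """Rewrite the text of every <path_tag> element to point at target_root.

    `path_tag` is a hypothetical element name; change it to whatever element
    holds resource paths in the exported file.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter(path_tag):
        text = (element.text or "").strip()
        if text:
            element.text = to_cloud_path(text, target_root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical export layout, for illustration only.
SAMPLE_POLICY_XML = (
    "<policies><policy><name>sales-read</name><resources><path>"
    "<value>/data/sales</value>"
    "<value>hdfs://namenode:8020/data/hr</value>"
    "</path></resources></policy></policies>"
)

if __name__ == "__main__":
    print(rewrite_policy_paths(
        SAMPLE_POLICY_XML, "abfs://raw@examplestore.dfs.core.windows.net"
    ))
```

Only the path values change. Policy names, users, and permissions pass through untouched, so the rewritten file can go straight to the import step.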
