Migrating big data workloads to Azure HDInsight – Smoothing the path to the cloud with a plan

  • August 30, 2019

Migrating big data workloads to the cloud remains both a key priority and a challenge for business leaders. Many are looking to AI and predictive analytics to increase performance and throughput and to reduce application, data, and processing costs, as a way out of the complexities of the big data operations landscape.

Planning is key, and there are some sensible questions to ask to ensure the planning phase runs smoothly and sets the project up for success. The organisation must understand its current environment, determine which high-priority applications to migrate, and set a performance baseline so that it can measure and compare on-premises clusters against Azure HDInsight clusters.

  • What does my current on-premises cluster look like, and how does it perform?
  • How much disk, compute, and memory am I using today?
  • Which of my workloads are best suited for migration to the cloud?
  • What are my HDInsight resource requirements?
  • Should I use manual scaling or auto-scaling HDInsight clusters, and with what VM sizes?

Organisations that understand the path to the cloud isn’t paved with rainbows recognise the need to reduce the complexity of delivering reliable application performance when migrating data from on-premises systems or a different cloud platform onto HDInsight. Application Performance Management (APM) solutions play a vital role here, providing unified visibility and operational intelligence to plan and optimise the migration process. Using such a solution is strongly recommended to avoid the common challenges that crop up time and again.

An APM solution can automate and optimise several major areas to simplify the overall project:

  • Identify the current big data landscape and platforms for baselining performance and usage
  • Make use of AI and predictive analytics to increase performance and throughput and to reduce application, data, and processing costs in an elastic cloud environment
  • Automatically size cluster nodes and tune configurations for the best throughput for big data workloads
  • Find, tier, and optimise storage choices in HDInsight for hot, warm, and cold data

In more detail, the planning questions fall into two groups, one for each environment.

In the on-premises environment

  • What does my current on-premises cluster look like, and how does it perform?
  • How much disk, compute, and memory am I using today?
  • Who is using it, and what apps are they running?
  • Which of my workloads are best suited for migration to the cloud?
  • Which big data services (Spark, Hadoop, Kafka, etc.) are installed?
  • Which datasets should I migrate?

Azure HDInsight environment

  • What are my HDInsight resource requirements?
  • How do my on-premises resource requirements map to HDInsight?
  • How much and what type of storage would I need on HDInsight, and how will my storage requirements evolve with time?
  • Would I be able to meet my current SLAs or better them once I’ve migrated to HDInsight?
  • Should I use manual scaling or auto-scaling HDInsight clusters, and with what VM sizes?

Baselining on-premises performance and resource usage

To migrate big data pipelines from physical to virtual data centres effectively, one needs to understand the dynamics of on-premises workloads, usage patterns, resource consumption, dependencies, and a host of other factors.

It’s vital to get detailed reports on the on-premises clusters, including total memory, disk, number of hosts, and number of cores used. A cluster discovery report also delivers insights on cluster topology, running services, operating system version, and more. Resource usage heatmaps can be used to determine any unique needs for Azure.
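As a rough illustration of what such a baseline contains, the sketch below rolls a per-host inventory up into cluster-wide totals. The `Host` record, the `summarise_cluster` helper, and the sample figures are all hypothetical; in practice the inventory would come from the cluster manager or an APM discovery report.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One on-premises node (hypothetical record layout)."""
    name: str
    cores: int
    memory_gb: float
    disk_tb: float


def summarise_cluster(hosts):
    """Aggregate per-host figures into cluster-wide totals."""
    return {
        "hosts": len(hosts),
        "cores": sum(h.cores for h in hosts),
        "memory_gb": sum(h.memory_gb for h in hosts),
        "disk_tb": sum(h.disk_tb for h in hosts),
    }


# Made-up inventory, for illustration only.
inventory = [
    Host("worker-01", cores=16, memory_gb=128, disk_tb=24),
    Host("worker-02", cores=16, memory_gb=128, disk_tb=24),
    Host("master-01", cores=8, memory_gb=64, disk_tb=4),
]

baseline = summarise_cluster(inventory)
print(baseline)
```

Totals like these only describe capacity; actual utilisation over time still has to come from monitoring data.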

It’s also key to gain app usage insights from cluster workload analytics and data insights. When the business can highlight application workload seasonality by user, department, application type, etc., it helps calibrate and make the best use of Azure resources. This type of reporting can greatly aid in HDInsight cluster design choices (size, scale, storage, scalability options, etc.) to maximise the ROI on Azure expenses.

Don’t neglect the storage strategy either: look at specific metrics on the usage patterns of tables and partitions in the on-premises cluster to find the best approach for storage in the cloud.

Next, consider identifying unused or ‘cold’ data. Once it is identified, one can decide on the appropriate layout for the data in the cloud, distribute datasets effectively across HDInsight storage options, and make the best use of the Azure budget.
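One simple way to think about hot, warm, and cold data is to classify each table by how long it has gone without being read. The sketch below does that with two idle-time thresholds. The `classify_tier` function, the 30-day and 180-day cut-offs, and the table names are assumptions for illustration, not values recommended by the article or by Azure.

```python
from datetime import date


def classify_tier(last_access, today, warm_after_days=30, cold_after_days=180):
    """Label a dataset hot, warm, or cold from the days since its last access.

    The default thresholds are illustrative assumptions only.
    """
    idle_days = (today - last_access).days
    if idle_days >= cold_after_days:
        return "cold"
    if idle_days >= warm_after_days:
        return "warm"
    return "hot"


# Hypothetical tables with their last access dates.
today = date(2019, 8, 30)
tables = {
    "sales.orders": date(2019, 8, 25),
    "finance.q1_report": date(2019, 6, 15),
    "logs.clicks_2018": date(2018, 12, 1),
}

tiers = {name: classify_tier(seen, today) for name, seen in tables.items()}
print(tiers)
```

A real assessment would likely weigh access frequency and data size as well, not just the most recent read.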

Data migration

Migrate on-premises data to Azure

There are two main options to migrate data from on-premises to Azure.

  1. Transferring data over the network with TLS
  2. Shipping data offline
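A back-of-the-envelope estimate of network transfer time can help choose between the two options. The sketch below is plain arithmetic under stated assumptions: the `network_transfer_days` function, the 70% link utilisation default, and the sample volume and bandwidth are all hypothetical.

```python
def network_transfer_days(data_tb, bandwidth_mbps, utilisation=0.7):
    """Estimate the days needed to move data_tb terabytes over a link.

    Assumes decimal units (1 TB = 10**12 bytes) and that only a fraction
    of the nominal bandwidth (utilisation) is usable for the transfer.
    """
    if bandwidth_mbps <= 0 or not 0 < utilisation <= 1:
        raise ValueError("bandwidth must be positive and utilisation in (0, 1]")
    data_bits = data_tb * 1e12 * 8
    effective_bits_per_second = bandwidth_mbps * 1e6 * utilisation
    seconds = data_bits / effective_bits_per_second
    return seconds / 86_400  # seconds per day


# Hypothetical scenario: 100 TB over a 1 Gbps link at 70% utilisation.
days = network_transfer_days(data_tb=100, bandwidth_mbps=1000)
print(f"Estimated transfer time: {days:.1f} days")
```

If the estimate exceeds the available migration window, shipping the data offline may be the more practical option.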

Once you’ve identified which workloads to migrate, the planning gets a little more involved, requiring a proper APM tool to get the rest right. For everything to work properly in the cloud, you need to map out workload dependencies as they currently exist on-premises. This may be challenging when done manually, as these workloads rely on many different complex components.

Since cloud adoption is an ongoing and iterative process, customers might want to look ahead and think about how resource needs will evolve throughout the year as business needs change. Predictive analytics, which uses previous trends to estimate future resource requirements in the cloud, can offer that view ahead.
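The simplest form of trend-based forecasting is a straight-line fit through past usage. The sketch below implements ordinary least squares by hand and projects it forward. The `forecast_linear` function and the monthly figures are hypothetical, and a real predictive tool would be expected to model seasonality and other effects that a straight line ignores.

```python
def forecast_linear(history, periods_ahead):
    """Fit a least-squares line to past usage and extend it forward.

    history holds one usage value per past period, oldest first.
    Returns the projected values for the next periods_ahead periods.
    """
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n - 1 + k) for k in range(1, periods_ahead + 1)]


# Hypothetical storage usage in TB for the last four months.
usage_tb = [100, 110, 120, 130]
projection = forecast_linear(usage_tb, periods_ahead=2)
print(projection)
```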

Once migrated, some apps require certain datasets to be in Azure in order to work properly, while other datasets can remain on-premises without issue. As with app dependency mapping, it’s difficult to determine which datasets an app needs to run properly.
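Once the app-to-dataset dependencies are known, working out which datasets have to move is a matter of set arithmetic. The sketch below assumes a hypothetical mapping from app names to the datasets they read; the `split_datasets` function and all names in it are made up for illustration, and building the mapping itself is the hard part described above.

```python
def split_datasets(app_datasets, apps_to_migrate):
    """Return (datasets that must move, datasets that can stay on-premises).

    app_datasets maps each app name to the datasets it depends on.
    """
    must_move = set()
    for app in apps_to_migrate:
        must_move |= set(app_datasets[app])
    all_datasets = set()
    for datasets in app_datasets.values():
        all_datasets |= set(datasets)
    return must_move, all_datasets - must_move


# Hypothetical dependency map.
dependencies = {
    "fraud-scoring": ["transactions", "customers"],
    "weekly-report": ["transactions", "ledger"],
    "archive-job": ["audit_logs"],
}

to_azure, on_prem = split_datasets(dependencies, ["fraud-scoring", "weekly-report"])
print(sorted(to_azure), sorted(on_prem))
```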

Cluster sizing and instance mapping

As the final part of planning, one needs to decide on the scale, VM sizes, and type of Azure HDInsight clusters that fit the workload type. This depends on the business use case and the priority of the given workload.
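A first-pass node count can be derived from the on-premises baseline by dividing the required cores and memory by what one VM provides, then taking the larger result. The sketch below shows that arithmetic. The `worker_nodes_needed` function, the 25% headroom, and the VM shape of 8 cores and 56 GB are assumptions for illustration and do not refer to any particular Azure VM size.

```python
import math


def worker_nodes_needed(required_cores, required_memory_gb,
                        vm_cores, vm_memory_gb, headroom=0.25):
    """Estimate worker nodes so both cores and memory fit, with headroom.

    headroom is the extra capacity fraction to reserve (an assumption).
    """
    if vm_cores <= 0 or vm_memory_gb <= 0:
        raise ValueError("VM cores and memory must be positive")
    by_cores = math.ceil(required_cores * (1 + headroom) / vm_cores)
    by_memory = math.ceil(required_memory_gb * (1 + headroom) / vm_memory_gb)
    # The tighter of the two constraints determines the cluster size.
    return max(by_cores, by_memory)


# Hypothetical baseline of 40 cores and 320 GB, mapped onto an
# illustrative VM shape with 8 cores and 56 GB of memory.
nodes = worker_nodes_needed(40, 320, vm_cores=8, vm_memory_gb=56)
print(f"Estimated worker nodes: {nodes}")
```

An estimate like this is only a starting point; workload type, storage throughput, and whether auto-scaling is used will all shift the final choice.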

The planning phase is the critical first step towards any workload migration to HDInsight. Many organisations lack effective quantitative and qualitative guidance during this phase, and may face challenges downstream in workload execution and cost optimisation. A robust APM can help navigate this complexity by providing tools for mapping workload dependencies, forecasting resource usage, and guiding decisions on which datasets to move. This in turn can make the migration process much more efficient, data-driven, and successful.

The post Migrating big data workloads to Azure HDInsight – Smoothing the path to the cloud with a plan appeared first on JAXenter.
