Cloud-based Big Data Processing (Buhler, Erl, Khattak)

How can large amounts of data be processed without investing in any Big Data processing infrastructure and only paying for the amount of time the processing resources are actually used?

Problem

Building a cluster large enough to process high-volume data requires an upfront investment, and the cluster is often underutilized, which results in waste.

Solution

Instead of creating an in-house cluster, cloud processing resources are utilized for processing large datasets as a cost-saving measure.

Application

A data processing engine deployed on a cluster of machines in the cloud is used to process data on a pay-per-use basis.

Instead of running on an in-house cluster, the processing engine runs on a cloud-provided cluster. Apart from requiring the IT team to have cloud skills, applying this pattern also requires the datasets to be available from cloud-based storage device(s). Hence, the Cloud-based Big Data Processing pattern is applied together with the Cloud-based Big Data Storage pattern.

Cloud processing resources are used to process large amounts of data, and the enterprise pays only for the time during which those resources are in use. The elastic nature of the cloud also allows the cluster to scale out or scale in on demand as the processing load changes. Because the cluster is separate from the in-house systems, Big Data projects such as ad-hoc data analysis or a proof-of-concept Big Data solution environment can run independently of them.
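The scale-out and scale-in behavior can be illustrated with a minimal sketch. It is not tied to any cloud provider's API: the function name `nodes_needed`, the per-node capacity, and the node limits are all hypothetical values chosen for illustration.

```python
import math


def nodes_needed(pending_gb: float, gb_per_node: float,
                 min_nodes: int = 1, max_nodes: int = 20) -> int:
    """Return a cluster size for the current load (hypothetical sizing rule).

    pending_gb  -- amount of data waiting to be processed (assumed metric)
    gb_per_node -- data one node is assumed to handle in the time window
    """
    # Round up so a partial node's worth of data still gets a node.
    wanted = math.ceil(pending_gb / gb_per_node)
    # Clamp to the allowed range: scale out up to max_nodes,
    # scale in down to min_nodes.
    return max(min_nodes, min(max_nodes, wanted))


# Scale out under load, scale in when idle, cap at the upper limit.
print(nodes_needed(450, 100))   # 5
print(nodes_needed(0, 100))     # 1
print(nodes_needed(5000, 100))  # 20
```

In practice the sizing decision would be made by the cloud platform's own scaling facility; the sketch only shows the shape of the rule.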

  1. A large dataset needs to be processed towards the end of the day using a cloud-based cluster.
  2. The cluster remains in use for thirty minutes.
  3. Once the processing is complete, the processing resources are returned to the pool of resources.
  4. The enterprise only incurs a thirty-minute usage charge each day.
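The arithmetic behind the four steps above can be sketched as follows. Only the thirty-minute daily run comes from the scenario; the node count and the hourly rate are hypothetical figures, and the always-on comparison assumes the same cluster billed around the clock rather than an actual in-house cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterJob:
    nodes: int                 # hypothetical cluster size
    rate_per_node_hour: float  # hypothetical price per node per hour
    minutes_per_day: float     # thirty minutes, as in the steps above


def daily_charge(job: ClusterJob) -> float:
    """Pay-per-use: billed only for the minutes the cluster is in use."""
    return job.nodes * job.rate_per_node_hour * job.minutes_per_day / 60.0


def always_on_daily_charge(job: ClusterJob) -> float:
    """Same cluster kept running for all 24 hours of the day."""
    return job.nodes * job.rate_per_node_hour * 24.0


job = ClusterJob(nodes=10, rate_per_node_hour=0.50, minutes_per_day=30)
print(daily_charge(job))            # 2.5
print(always_on_daily_charge(job))  # 120.0
```

With these assumed figures, the thirty-minute run costs a small fraction of keeping the same capacity available all day, which is the underutilization the pattern is meant to avoid.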