File-based Source (Buhler, Erl, Khattak)

How can large amounts of unstructured data be imported into a Big Data platform from a variety of different sources in a reliable manner?

Problem

Importing large amounts of unstructured data from different sources in an ad-hoc manner involves manual copying of files, which is not only time-consuming but also inefficient.

Solution

Unstructured data is imported as files using a system that automatically looks for files at the configured source location(s).

Application

An agent-based system is used that collects files from the source location and forwards them to the Big Data platform.

A file data transfer engine mechanism is used that internally employs an agent-based system. The engine is configured with the location(s) of the data source(s) and the target location(s). The agents scan the configured source locations for new files, either by polling or by using filesystem capabilities such as a file watcher component. When files appear in those locations, the agents forward them to the target location(s) in the Big Data platform.
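The polling variant of this scan can be sketched as follows. This is a minimal illustration in Python, not the implementation of any particular file data transfer engine: the function name `poll_once`, the `seen` set used to remember forwarded files, and the source-name prefix on copied files are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def poll_once(source_dirs, target_dir, seen):
    """Scan each source directory once and copy files not yet seen.

    `seen` is a set of source paths that were already forwarded
    (a hypothetical bookkeeping choice for this sketch).
    Returns the list of newly written target paths.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    forwarded = []
    for source in source_dirs:
        source = Path(source)
        for path in sorted(source.iterdir()):
            if not path.is_file() or path in seen:
                continue
            # Prefix with the source directory name so files with the same
            # name from different sources do not overwrite each other.
            destination = target / f"{source.name}_{path.name}"
            shutil.copy2(path, destination)
            seen.add(path)
            forwarded.append(destination)
    return forwarded
```

An agent would call `poll_once` repeatedly, sleeping for a configured interval between calls. A file watcher component would replace the periodic scan with filesystem notifications.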

This pattern can also be used to ingest semi-structured data, such as webserver log files. Whether the data is semi-structured or unstructured, the File-based Source pattern is only applicable for batch ingress of data. Furthermore, this pattern is normally applied together with the Data Size Reduction pattern in order to reduce the data's size footprint before it is persisted to the storage device.
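As a rough illustration of reducing the footprint before persisting, the sketch below compresses an ingested file on its way into storage. It assumes that compression is the form of size reduction applied and uses gzip from the Python standard library; the function name `compress_to_storage` and the `.gz` naming convention are hypothetical choices for this example.

```python
import gzip
import shutil
from pathlib import Path


def compress_to_storage(source_file, storage_dir):
    """Write a gzip-compressed copy of source_file into storage_dir.

    Returns the path of the compressed file.
    """
    source = Path(source_file)
    storage = Path(storage_dir)
    storage.mkdir(parents=True, exist_ok=True)
    destination = storage / (source.name + ".gz")
    # Stream the bytes so large files are not loaded into memory at once.
    with source.open("rb") as raw, gzip.open(destination, "wb") as packed:
        shutil.copyfileobj(raw, packed)
    return destination
```

Repetitive text such as webserver logs usually compresses well, whereas formats that are already compressed, such as most video files, gain little from this step.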

File-based Source: The manual copying of files is automated through the introduction of a system into the Big Data platform that can be configured in a centralized manner to look for files at more than one location. Such a system removes the inefficiencies linked with the ad-hoc copying of files and provides a central interface for configuring multiple data sources.

  1. User configures the file data transfer engine mechanism to import data from Data Sources A and B.
  2. Files containing textual data are automatically copied from Data Source A by the file data transfer engine.
  3. The file data transfer engine then automatically inserts the textual data into the configured storage device.
  4. Files containing videos are automatically copied from Data Source B by the file data transfer engine.
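The centralized configuration described in these steps can be sketched as a list of source definitions handed to a single import routine. This is a hedged illustration only: `SourceConfig`, its fields, the glob patterns, and `import_from_sources` are hypothetical names, and a local directory stands in for the storage device.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceConfig:
    name: str       # label for the data source, e.g. "A"
    location: Path  # directory the engine looks in
    pattern: str    # glob for the files to import, e.g. "*.txt"


def import_from_sources(sources, storage_dir):
    """Copy matching files from every configured source into storage.

    Returns a dict mapping each source name to the file names imported.
    """
    storage = Path(storage_dir)
    imported = {}
    for source in sources:
        # Each source gets its own subdirectory in the storage location.
        destination = storage / source.name
        destination.mkdir(parents=True, exist_ok=True)
        names = []
        for path in sorted(source.location.glob(source.pattern)):
            if path.is_file():
                shutil.copy2(path, destination / path.name)
                names.append(path.name)
        imported[source.name] = names
    return imported
```

With one `SourceConfig` for textual files at Data Source A and another for video files at Data Source B, a single call to `import_from_sources` covers both sources, so adding a further source only means adding another entry to the list.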