Parallel File Systems for 'Extreme' Enterprise Applications

High-performance parallel file systems are becoming commercially available just in time to support today's "extreme applications."

By Mike Matchett, Sr. Analyst and Consultant, Taneja Group

With the advent of big data and cloud-scale delivery, companies are racing to deploy cutting-edge services. “Extreme” applications, like massive voice and image processing or complex financial analysis modeling, can push storage systems to their limits. Examples of high-visibility solutions include large-scale image pattern recognition applications and financial risk management based on high-speed decision-making.

These ground-breaking solutions, made up of very different activities but with similar data storage challenges, create incredible new lines of business representing significant revenue potential.

Every day here at Taneja Group we see more and more mainstream enterprises exploring similar “extreme service” opportunities. But when enterprise IT data centers take stock of what is required to host and deliver these new services, it quickly becomes apparent that traditional clustered and even scale-out file systems—the kind that most enterprise data centers (or cloud providers) have racks and racks of—simply can’t handle the performance requirements.

There are already great enterprise storage solutions for applications that need raw throughput, high capacity, parallel access, low latency or high availability—maybe even two or three of those at a time. But when an “extreme” application needs all of those capabilities at the same time, only supercomputing-type storage in the form of parallel file systems provides a functional solution.

The problem is that most commercial enterprises simply can’t afford or risk basing a line of business on an expensive research project.

The good news is that some storage vendors have been industrializing former supercomputing storage technologies, hardening massively parallel file systems into commercially viable solutions. This opens the door for revolutionary services creation, enabling mainstream enterprise data centers to support the exploitation of new extreme applications.

High Performance Computing in the Enterprise Data Center

Organizations are creating more data every day, and that data growth challenges storage infrastructure that is already creaking and groaning under existing loads. On top of that, we are starting to see mainstream enterprises roll out exciting heavy-duty applications as they compete to extract value out of all that new data, creating new forms of storage system “stress.” In production, these extreme applications can require systems that perform more like high-performance computing (HPC) research projects than like traditional business operations or user productivity solutions.

These new applications include “big data” analytics, sensor and signals processing, machine learning, genomics, social media trending and behavior modeling. Many of these have evolved around capabilities originally developed in supercomputing environments, but are now being exploited in more mainstream commercial solutions.

We have all heard about big data analytics and the commoditization of scale-out map-reduce style computing for data that can be processed in “embarrassingly parallel” ways, but there are now also extreme applications emerging that require high throughput shared data access. Examples of these include some especially interesting business opportunities in areas like image processing, video transcoding and financial risk analysis.

Finding Nemo on a Big Planet

A good extreme application example would be image pattern recognition at scale. Imagine the business opportunity in knowing where customers were located, what kind of buildings they lived in, how they related geographically to each other and/or how much energy they use. Some of the more prominent examples of image-based geographic applications we have heard about include prioritizing the marketing of green energy solutions, improving development and traffic planning, route optimization and retail/wholesale targeting.

For example, starting with detailed “overhead” imagery (of the kind you find on Google Maps' satellite view), it is now commercially possible to analyze that imagery computationally to identify buildings and estimate their shape, siting (facing), parking provisions, landscaping, envelope, roof construction and pitch, and construction details. That intelligence can be combined with publicly available data from utilities, records of assessments, occupancy, building permits and taxes, and then again with phone numbers, IP, mail and email addresses (and fanning out to any data those link to) in order to feed a “big data” analysis. At scale this entails processing hundreds of millions of imagery and data objects over multiple stages of high performance workflow.

A World of Devices Hungry for Content

As another example, the demand and use cases for rapid transcoding of video are growing every day thanks to the exploding creation and consumption of media on mobile devices. In today’s world of Internet-connected devices, each piece of video that is created gets converted via “transcoding” into potentially 20 or more different formats for consumption.

Transcoding starts with the highest resolution files and is usually done in parallel on a distributed set of servers. Performance is often paramount, as many video applications are related to sports or news and have a very short time window of value. Competitive commercial transcoding solutions require fast storage optimized for both rapid reads and massive writes.
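As a rough illustration of the fan-out pattern described above, the sketch below renders one high-resolution source into many target formats concurrently. The format list and the transcode stub are hypothetical; a real pipeline would invoke an encoder such as ffmpeg for each target.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target profiles; real deployments often produce 20 or more
TARGET_FORMATS = ["1080p_h264", "720p_h264", "480p_h264", "audio_only"]

def transcode(source_path, target_format):
    # Stub: a real implementation would shell out to an encoder here.
    # Every worker reads the same high-resolution source at once --
    # exactly the shared-read load that parallel file systems serve well.
    return f"{source_path}.{target_format}.mp4"

def fan_out(source_path):
    # One read-heavy source, many write-heavy outputs, run concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(transcode, source_path, fmt)
                   for fmt in TARGET_FORMATS]
        return [f.result() for f in futures]
```

The storage implication is in the read pattern: because every worker hits the same master file simultaneously, the source must be served in parallel rather than from a single spindle or node.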

Money, Money, Money…

In the financial sector, revenue is all about numbers, speed and making the best decision at the right time while controlling risk.

We are seeing that in financial services firms, data capture, algorithm development, testing and risk management projects are all pushing the performance boundaries of traditional storage. Hedge funds and trading firms are starting to take advantage of parallelism in order to analyze more positions faster and deploy competitive trading strategies. Using scalable systems that support massively parallel data access, researchers can analyze larger data sets and test more scenarios delivering faster, more effective models. Similarly, risk managers are increasing their ability to assess total market exposure from only once or twice a day to much shorter intervals.
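A hedged sketch of what “analyzing more scenarios in parallel” can mean in practice: a toy Monte Carlo value-at-risk estimate, with scenario draws fanned out across workers and pooled for the final quantile. The portfolio return parameters here are invented for illustration, not drawn from any real model.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_batch(seed, n_scenarios, mu=0.0005, sigma=0.02):
    # One worker's batch: draw daily P&L outcomes for a toy portfolio.
    rng = random.Random(seed)  # seeded per worker for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n_scenarios)]

def value_at_risk(n_workers=4, scenarios_per_worker=25_000, confidence=0.99):
    # Fan the scenario draws out across workers, then pool the results.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        batches = list(pool.map(simulate_batch,
                                range(n_workers),
                                [scenarios_per_worker] * n_workers))
    outcomes = sorted(x for batch in batches for x in batch)
    # VaR at 99%: the loss not exceeded in 99% of simulated scenarios
    idx = int((1 - confidence) * len(outcomes))
    return -outcomes[idx]
```

More workers mean more scenarios per run, which is the mechanism behind moving exposure assessment from twice a day toward much shorter intervals.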

All of this goes straight to the bottom line and provides competitive advantage.

Extremely Cloudy Applications

If there is such a thing as “normal” cloud storage today, it is considered to be slower than “Web speed.” But it makes sense that businesses considering extreme applications will seek the agility and elasticity of cloud hosting rather than building internal infrastructure, especially where the main source of data is a Web 2.0 application.

As cloud providers like Amazon Web Services overcome data IO and storage challenges to provide cloud hosting for IO-intensive big data and video transcoding, we expect to see many service providers vying to support even more extreme applications.

Parallel File Systems to the Rescue/Rescue/Rescue/…

Extreme applications provide several interesting storage system challenges that can be answered by parallel file systems.

Parallel file systems are based on scale-out storage nodes, with an ability to spread and then serve huge files from many nodes and spindles at once. Unlike scale-out clustered NAS, which is designed for serving many files independently to different clients at the same time (e.g. hosting home directories in a large enterprise or fully partitioned/shared big data blocks), fully parallel file systems are great for serving huge shared files to many inter-related processing nodes at once.
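The core idea of spreading one huge file across many nodes can be shown with a toy mapping function: given a byte offset, which storage node and local block hold it under round-robin striping. The stripe size and node count below are illustrative, not any particular file system's defaults.

```python
STRIPE_SIZE = 1 << 20   # 1 MiB stripe units (illustrative)
NUM_NODES = 8           # storage nodes the file is spread across

def locate(offset):
    # Map a byte offset in one huge shared file to (node, local block).
    stripe_index = offset // STRIPE_SIZE
    node = stripe_index % NUM_NODES          # round-robin across nodes
    local_block = stripe_index // NUM_NODES  # position on that node
    return node, local_block
```

Because adjacent stripes land on different nodes, a large sequential read (or many clients reading the same shared file) is served by all nodes and spindles simultaneously rather than queuing on one.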

Big data solutions based on Apache Hadoop (with HDFS) are also designed around scale-out storage. But these essentially carve up data into distributed chunks. They are aimed at analytics that can be performed by isolated “mapped” jobs on each node’s assigned local data chunk. This batch style approach enables a commodity-hardware architecture because localized failures are simply reprocessed asynchronously before cluster-wide results are collected and “reduced” to an answer.
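In code, the batch pattern just described looks roughly like the sketch below: each “mapped” job sees only its local chunk, and partial results are merged in a reduce step. This is a toy word count in plain Python, not the actual Hadoop APIs.

```python
from collections import Counter
from functools import reduce

def map_job(local_chunk):
    # Runs on one node against that node's chunk only; if the node
    # fails, the job is simply rerun against a replica of the chunk.
    return Counter(local_chunk.split())

def merge(counts_a, counts_b):
    # "Reduce" step: combine partial results gathered from the cluster.
    return counts_a + counts_b

def word_count(chunks):
    partials = [map_job(c) for c in chunks]   # "embarrassingly parallel"
    return reduce(merge, partials, Counter())
```

Note that nothing in the map phase touches another node's data, which is why commodity hardware and asynchronous retry suffice here but not for the tightly coupled workloads discussed next.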

However, extreme apps, including many machine-learning and simulation algorithms, rely on high levels of inter-node communication and sharing globally accessed files. This synchronized cluster processing requires high parallel access throughput, low latency to shared data, and enterprise-class data protection and availability—far different characteristics than HDFS provides.

Industrialization of Extreme Performance

Robust supercomputer parallel file systems are emerging from academia and research and are ready to deploy in commercial enterprise data centers. There are now a number of commercialized parallel file systems based on open source Lustre (e.g. from DDN, Terascala, et al.) for Linux-based cluster computing. And for enterprise IT adoption of extreme applications that must support multiple operating systems with enterprise data protection, we see GPFS (General Parallel File System from IBM) setting the gold standard.

Parallel file systems can be procured and deployed on many kinds of storage nodes, from homegrown clusters to complete appliances. For example, DDN has industrialized a number of parallel file systems to host extreme applications in the enterprise market. Their GRIDScaler solution integrates and leverages parallel file services on their specialized HPC-class storage hardware. This kind of integrated “appliance” solution can provide a lower TCO for enterprises due to baked-in management, optimized performance, reduced complexity, and full system support.

Extremely Compelling

New data-intensive solutions are enabling the exploitation of huge amounts of data to extract new forms of knowledge and insight. These new extreme applications can ultimately create new revenue streams that could disrupt and change whole markets.

Big data analysis is one type of extreme application, but it is only the tip of the iceberg when it comes to processing large amounts of new data in new ways. New applications that demand parallel file access, high throughput, low latency, and high availability are also on the rise, and more and more enterprises (and service providers) will be tasked to deploy and support them.

Luckily, IT can support these challenging extreme applications by leveraging the vendor trends in industrializing technologies like parallel file systems. Technical excuses are diminishing, and the competition is heating up—it is definitely time for all enterprises to move forward with their own extreme applications.

If you are in IT and haven’t been asked to support an extreme application yet, you should expect to very soon.

This article was originally published on Tuesday Apr 30th 2013