
The Power of Distributed Object Storage

By Jeff Boles

Distributed Object Storage provides scalability, availability and flexibility that are extremely important in cloud environments.

Several years ago, Taneja Group predicted the inevitable emergence of what we called "cloud-based storage." Today, this technology is behind most cloud offerings for unstructured data, whether public or private. We defined cloud-based storage as the highly scalable, RESTful API accessible, object-based technology that is no longer just an Amazon S3 offering, but is served up by all manner of product vendors and providers.

Adoption for these technologies has taken off in a number of different verticals and use cases. Currently, we see an increasing amount of diversity among solutions in the market, especially when it comes to vendors offering products for customers to build their own clouds for unstructured data. The vendors of such products promise to give customers unprecedented flexibility and power in storing and accessing data, and in scaling the storage system for that data. But the reality is that truly delivering on the promises of cloud, especially when it comes to scale, requires a unique architecture that is focused on the ideals and complexity of data “distribution.”

As storage moves into the age of the cloud, we believe practitioners should become intimately familiar with this term, distribution, in an entirely new context from that of age-old distributed computing.

Distribution is so important, we’ve taken to recognizing a particular type of object storage as “Distributed Object Storage.” That means it is designed for distribution and will in turn unlock unique efficiencies and capabilities when distributed across locations and grown to high levels of scale. Since those are fundamental long-term goals driving customer cloud initiatives, understanding distribution—and the architecture for achieving it—should guide product and architectural decisions for every cloud data storage initiative.

Object Storage—an Architectural Shift

In a nutshell, object storage was first envisioned a number of years ago to tackle the complexities in programmatically storing and accessing data. For many application models, traditional file system semantics simply created enormous overhead for accessing data. Many hundreds of lines of code and interactions could be required to negotiate file system attachment, security, file location and name resolution. When locating files and data many times over, especially amidst thousands or millions of potentially unrelated files stored in the same location, these interactions excessively encumbered the use of data in applications.

Object storage, prior to the popularity of cloud, came about to provide more versatile access. Using a single unique identifier per object, objects in an object storage system could be stored, retrieved and manipulated without concern over file system name spaces and semantics.
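The access model just described can be made concrete with a toy sketch in Python: each stored blob receives a single opaque identifier, and retrieval requires no mount points, directory paths, or name resolution. The class and method names here are illustrative only, not any vendor's API.

```python
# Toy illustration of object storage semantics: store bytes, get back an
# opaque object ID, and retrieve by that ID alone -- no file system namespace.
import uuid


class ObjectStore:
    def __init__(self):
        self._objects = {}  # object ID -> bytes

    def put(self, data: bytes) -> str:
        object_id = uuid.uuid4().hex  # the one handle a client ever needs
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]
```

Contrast this with file access, where a client must first resolve a server, a share or mount, a directory path, and a file name before a single byte moves.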

The idea of cloud computing has taken shape over the past several years in answer to demands for further abstraction and automation so that IT can operate with greater efficiency and at greater scale. Object storage has accomplished this in several ways:

  • Scale—object storage, in dispensing with much of the complexity of traditional file systems, is inherently more scalable. It allows vendors to design systems that can easily store millions or billions of data objects with very low complexity and little management overhead. Moreover, data can be easily cast across multiple storage repositories with little complexity since storage and access is performed and resolved by object ID. Single administrators are capable of managing petabytes of space.
  • Flexibility in protection and data management—treating pieces of data as standalone digital objects has allowed vendors to develop innovative redundancy, protection and management tools that can be applied to objects and arbitrary groups of objects. This allows users to meet a wide variety of service levels from a single object pool. It is also much more versatile and cost effective than RAID or other technologies that must be applied to entire file systems/volumes.
  • Accessibility—the simplicity of object addressability is inherently compatible and extensible for use with access protocols like HTTP. It has been built into SOAP and RESTful implementations that have enabled very lightweight and easy-to-develop programmatic interactions. Today, object storage supports all manner of applications across all types of business and consumer devices.
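The accessibility point above can be sketched in runnable form: objects addressed by ID over plain HTTP verbs, the essence of the RESTful pattern. The in-process server and the `/objects/` path are illustrative assumptions, not any specific product's API; real services such as Amazon S3 layer authentication and metadata on top of this same verb-per-operation model.

```python
# Minimal sketch of RESTful object access: PUT stores a blob under an ID,
# GET retrieves it. The server here is an in-process stand-in for a cloud store.
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

_store = {}  # request path -> bytes


class ObjectHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        _store[self.path] = self.rfile.read(length)
        self.send_response(201)
        self.end_headers()

    def do_GET(self):
        body = _store.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def start_server():
    server = HTTPServer(("127.0.0.1", 0), ObjectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def put_object(port, object_id, data):
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("PUT", f"/objects/{object_id}", body=data)
    return conn.getresponse().status


def get_object(port, object_id):
    conn = http.client.HTTPConnection("127.0.0.1", port)
    conn.request("GET", f"/objects/{object_id}")
    resp = conn.getresponse()
    return resp.read() if resp.status == 200 else None
```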

Storage has gradually and continuously evolved over the past several decades in the interest of enabling better accessibility to larger and more varied types of data. Object-based storage, commonly referred to as cloud-based storage, has become the latest incarnation of this evolution.

Today, object storage with these characteristics is enabling distributed, more easily accessible, and highly scaled storage systems that have fulfilled critical demands for large active archives, medical image systems, digital media sites, content depots and much more.

Figure 1: The Evolution of Cloud-Based Storage. Object storage is an ongoing advancement in access to data, bringing further simplification to the way we interact with data.

Object Doesn’t Always Go The Distance

But as object storage has become the prevalent technology behind cloud and high-scale unstructured storage infrastructures, differences among products on the market have rapidly emerged. While nearly every object vendor positions their storage as a technology for high scale and multiple locations, architectures vary greatly in how well they enable both without sacrificing efficiency.

In reality, limited sophistication in features that manage the “distribution” of data means that many object storage systems are better suited to localized private clouds than to the globally dispersed, high-scale usages that many customers initially pursue. Such localized architectures are inherently limiting in the age of the cloud. In a storage system where an object is the unit of storage, this is an artificially imposed constraint, and it may mean that customers run into physical limits on scaling long before they’ve realized the full value of their cloud initiative. Worse yet, some customers are bitten by such limitations only when they eventually grow from an initial deployment into larger deployments and multiple sites.

Superior architectures can deliver superior resiliency, while also enabling other key high-scale services, such as security and response optimization.

Distributed Object Storage

For high-scale systems, there is tremendous value to be derived from the ability to distribute data. Distributing data can make storage scalable beyond the boundaries of a single site or location.

Moreover, distributing data with the right architecture can counter-intuitively enhance data availability. Scattering data across the Internet intuitively seems as if it would subject data access to more risks of outage, and with the wrong architecture, it can. But with the right architecture, data distribution can make data more available through increased redundancy and more paths for accessibility. It thereby offsets the increased risks and complexity that arise with extreme scale and multiple location dependencies.

We’ve taken to calling storage systems that handle distribution the right way “Distributed Object Storage.”

Currently, we generally see two approaches among object-based storage systems: full object replication and an approach rooted in “Information Dispersal Algorithms” (IDA).

IDA is a theoretical approach to storing information first proposed in 1989. It builds on erasure coding technology that far predates information dispersal. Whether discussed as information dispersal or erasure coding, the term implies that data can be distributed in sub-file chunks across many systems or system components, and then reassembled from a subset of those chunks: perhaps 7 out of 8, 11 out of 15, or 9 out of 12. The number of chunks and the level of protection can be varied by application.
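The k-out-of-n property can be illustrated with a small, runnable sketch. The toy Python below implements a Rabin-style dispersal over the finite field GF(257): each group of k data bytes becomes the coefficients of a polynomial, evaluating it at n distinct points yields n chunks, and any k chunks recover the data by Lagrange interpolation. Production systems use hardened, highly optimized erasure codes; this only demonstrates the principle.

```python
# Rabin-style information dispersal sketch over GF(257). Any k of the n chunks
# produced by disperse() are enough for reassemble() to rebuild the data.
P = 257  # smallest prime above 255, so every byte is a field element


def _poly_mul_linear(poly, a):
    """Multiply polynomial `poly` (coefficients, lowest degree first) by (x - a) mod P."""
    res = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        res[i] = (res[i] - a * c) % P
        res[i + 1] = (res[i + 1] + c) % P
    return res


def disperse(data, k, n):
    """Encode `data` into n chunks such that any k of them suffice to rebuild it."""
    padded = list(data) + [0] * (-len(data) % k)  # real systems record true length
    chunks = [(x, []) for x in range(1, n + 1)]   # chunk = (evaluation point, values)
    for i in range(0, len(padded), k):
        coeffs = padded[i:i + k]  # k data bytes = one degree-(k-1) polynomial
        for x, values in chunks:
            acc = 0
            for c in reversed(coeffs):  # Horner's rule: evaluate polynomial at x
                acc = (acc * x + c) % P
            values.append(acc)
    return chunks


def reassemble(chunks, k, length):
    """Rebuild the original bytes from any k chunks via Lagrange interpolation."""
    chunks = chunks[:k]
    xs = [x for x, _ in chunks]
    out = []
    for idx in range(len(chunks[0][1])):
        coeffs = [0] * k
        for j, (xj, values) in enumerate(chunks):
            basis, denom = [1], 1  # build the j-th Lagrange basis polynomial
            for m, xm in enumerate(xs):
                if m != j:
                    basis = _poly_mul_linear(basis, xm)
                    denom = denom * (xj - xm) % P
            scale = values[idx] * pow(denom, P - 2, P) % P  # modular inverse
            coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
        out.extend(coeffs)
    return bytes(out[:length])
```

With k=3 and n=5, any two chunks can be lost (two failed repositories, two unreachable sites) and the data still reassembles from the remaining three.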

Erasure coding has been used in many other ways in the data storage industry, from DVDs to BitTorrent to RAID. In high-performance, low-latency environments, standardized approaches have suffered from computational overhead and remain a focal point for continued research; but in Internet and cloud environments, where network round-trips dominate the cost of access, these approaches excel.

Today there are several product vendors using information dispersal approaches with fairly standard algorithms, including Amplidata, Caringo, Cleversafe, EMC, HDS, Scality, and Symform.

For the purposes of distribution, erasure coding has tremendous efficiency implications in both storing and retrieving data. With erasure coding, single transactions can fully store data in a protected manner, without the extra transactional steps typically involved in replication-based models. Once stored, data can be retrieved with optimal performance—by selecting the bits from repositories with the lowest latency—and is simultaneously protected against service outage from individual storage repositories or locations.

Taneja Group has formally labeled information dispersal approaches as "Distributed Object Storage technologies." And we have labeled products that work with full replicas of objects as "Replicated Object Storage technologies."

The key capabilities of Distributed Object Storage solutions based on dispersal and erasure coding stand to have high impact in cloud application and storage systems, and they merit a closer look. Specifically, Distributed Object Storage architectures (versus Replicated Object Storage) stand to create significant differences in the most fundamental capabilities that drive customers to cloudy object storage solutions in the first place: enhanced data access, availability and data distribution.

Let’s take a look:

Efficient distribution of data. Distributed Object Storage delivers more efficient distribution of data. It can scatter sub-file chunks across multiple systems and create the parity required for reassembly from a subset of those chunks, without the overhead of the full object or file system replicas required for replicated object storage. Dispersal may require only 30% to 60% overhead, versus the 200% or 300% consumed by completely replicating objects. Moreover, this protection is established at the time of object creation, usually by a storage system or gateway driven operation, rather than by a slower and more vulnerable post-write replication job.
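The overhead comparison above reduces to simple arithmetic: a k-of-n dispersal stores n chunks for every k chunks' worth of data, so the extra raw capacity is (n - k) / k, while a scheme keeping full replicas pays one full copy of overhead per extra replica. A quick sketch:

```python
# Storage-overhead arithmetic for dispersal versus full replication.
def dispersal_overhead(k, n):
    """Fraction of extra raw capacity beyond the logical data for k-of-n dispersal."""
    return (n - k) / k


def replication_overhead(copies):
    """Fraction of extra capacity when keeping `copies` total full copies."""
    return copies - 1
```

A 9-of-12 dispersal costs about 33% extra capacity and a 10-of-16 dispersal 60%, squarely in the article's 30% to 60% range, while keeping three full copies of every object costs 200%.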

Enhanced availability. Distributed Object Storage protects availability by serving up data from a subset of chunks. As a result, organizations require only a portion of the original storage systems to be available at any given time. Across the Internet, an organization with many locations can get global access to data with availability that may surpass even Tier 1 mission-critical storage arrays within any single site.

In addition, a single site or region failure may be inconsequential to the other 90% of global users with access to other data centers, reducing single points of risk. For many customers, this heightened availability justifies their pursuit of Distributed Object Storage behind key applications.

Replicated approaches can deliver similar data availability, but with much greater transmission overhead, greater latency before reaching a protected state, and more dire consequences for accessibility during a single system failure. Typically, all customers of a single system experience an outage and have few means of automated redirection.

Enhanced and efficient access. Distributed object architectures also supply a superb answer to the challenges of accessing data across the latency-prone global Internet. By requiring only partial data fragments for complete data reassembly, Distributed Object Storage solutions can pick and choose storage locations with the best response time. This can approximate some of the capabilities of Content Delivery Networks by making sure data access is provided as close as possible to the end node requesting it.
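The location-selection idea above can be sketched in a few lines: because only k chunks are needed for reassembly, a reader can fetch from whichever k repositories currently respond fastest. The site names and latency figures below are hypothetical.

```python
# Pick the k chunk repositories with the lowest measured round-trip latency;
# with k-of-n dispersal, those k chunks are all that reassembly requires.
def fastest_chunk_sources(latencies_ms, k):
    """Return the k repository names with the lowest latency, fastest first."""
    return sorted(latencies_ms, key=latencies_ms.get)[:k]


# Hypothetical measurements from a client in the eastern United States.
sites = {"us-east": 12, "us-west": 70, "eu-west": 95, "ap-south": 210}
```

For a 2-of-4 dispersal across these hypothetical sites, the client would read its two chunks from `us-east` and `us-west` and never wait on the transoceanic links, which is the Content Delivery Network-like behavior the paragraph describes.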

The best such architectures also accelerate data access by intelligently placing data chunking and reassembly functions in optimal places for the best performance—often gateway type devices. They may also streamline chunking and reassembly functionality so that it can be used natively in software code and on lightweight devices. Once again, replicated approaches create dependencies on single data locales where system failures or connectivity outages can easily take data offline for sets of customers.

Finally, as an added benefit, erasure coding can enhance security, and thereby make platforms even more suitable for multi-tenancy use cases behind the cloud.

Evaluating Vendors

Among the solutions on the market best suited for distribution, there is still much variation. In our view, a few key areas stand out as the most critical dimensions to assess for any high scale and/or cloudy storage architecture that will cross sites and offer a long lifespan. Look to your vendors to provide sufficient depth and validation in each of these areas:

  1. Reliability—How reliable is the vendor’s solution for petabyte-plus data volumes? Does the vendor allow flexibility in setting reliability levels? What features does the vendor offer to ensure reliability (integrity checks, error correction, etc.)?
  2. Scalability—How large can the vendor’s solution grow—is there a limit? How easy is it to scale capacity and performance?
  3. Security—Security features vary by vendor. What type of encryption is offered? Is key management required?
  4. TCO—How do vendor costs compare not only for hardware and software, but also for power, cooling, floor space, admin staffing, etc.?

Finally, many of these solutions have been in the market long enough to have a track record. Efficient Distributed Object Storage requires architectural underpinnings that best show up at scale and are hard if not impossible to evaluate in a test lab. Track records are more important than ever.

  This article was originally published on Tuesday Jan 8th 2013