 

Data Storage: Metadata Everywhere (And Not a Drop to Drink)

By Henry Newman

Because they can't handle the many different types of metadata in use today, POSIX systems are likely on the way out.

This article's catchy title comes from The Rime of the Ancient Mariner by Samuel Taylor Coleridge, but a co-worker suggested I might want to title it "Metadata everywhere causing everyone to drink" instead.

The word "metadata" is used to describe lots of different things. As a storage guy, my introduction to metadata came from discussing file systems and their metadata requirements. Ask a librarian, and metadata is something totally different. Ask a database person, and again you will get a different answer. Look at a JPEG header and you have yet another kind of metadata, describing the file rather than the file system. "Metadata" might be one of the most overused and confusing terms in use today. It might as well be a pronoun like "that," which can mean just about anything at any time.

I want to talk about a few types of metadata. I'll share some of my thoughts on what needs to happen in the future and what I think will happen and why.

High-level storage systems metadata

"High-level storage systems metadata" is a term I just made up. I used the words "high-level" because I view low-level storage systems metadata as information about how the controller sees the storage under its control: disks in RAID groups, LUN information, data on virtualization of the drives, replication and lots of other stuff.

To me, "high-level" implies that users are going to access information via an interface, file system, protocol or REST. We used to just call it file system metadata. That term had specific meanings, connotations and implications for about 25 years. For example, file system metadata had to support the POSIX file system standards. It included things like user id (UID), group id (GID), access time (atime), inode change time (ctime) and other categories that are part of standard POSIX, plus additions for things like NFS and access control lists (ACLs). All of this information was stored in the inode, whether on a local file system or a NAS device. The minimal set was defined by the POSIX standards; NFS added things like ACLs, and some vendors added extensions using POSIX extended attributes for things like hierarchical storage management. The inode also stored the file's location within the file system on disk. Different vendors implemented things a bit differently, but all of this lived in the inode or in an extended attribute that was viewed as part of the inode.
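The fixed set of inode fields described above is easy to see from any POSIX system. A small sketch using Python's standard library (the throwaway file is created just for illustration):

```python
import os
import stat
import tempfile

# Create a throwaway file, then read back its POSIX inode metadata.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

info = os.stat(path)  # every field below lives in the file's inode
print("uid:  ", info.st_uid)                  # owning user id (UID)
print("gid:  ", info.st_gid)                  # owning group id (GID)
print("atime:", info.st_atime)                # last access time
print("ctime:", info.st_ctime)                # inode change time
print("mode: ", stat.filemode(info.st_mode))  # permission bits

os.unlink(path)
```

Note how short and closed-ended this list is: there is no standard slot here for application metadata, which is exactly the limitation the rest of this article is about.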

Well, that was then; this is now. With new interfaces such as REST, POSIX-compliant inodes and the like are just one of the metadata components that users might see.

The problem with POSIX inodes and the standard is that they are not very extensible. Of course, extensibility was theoretically built in via POSIX extended attributes. But as we all know, theory and reality are often different. Issues arose because each vendor might, and often did, use extended attributes, but the attribute names were never agreed upon across vendors and file systems; each set was defined only for a specific file system. About a decade ago, a group of people in the US government and the HPC community tried to extend POSIX and were rejected by The Open Group, which controls the standard. POSIX, therefore, is in my opinion pretty limited for adding modern metadata.
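The portability gap is easy to illustrate: POSIX standardizes the extended-attribute mechanism but not the attribute names, so two vendors can record the same HSM state under different keys. A toy sketch, with both vendors' attribute names invented for illustration:

```python
# Two hypothetical vendors tag the same HSM state under different,
# never-standardized extended-attribute names.
vendor_a = {"user.hsm.state": "migrated", "user.hsm.tape_id": "T00042"}
vendor_b = {"user.vendorb_archstatus": "OFFLINE", "user.vendorb_volume": "T00042"}

def hsm_state(xattrs):
    # An application written against vendor A's naming silently
    # fails to see vendor B's equivalent attribute.
    return xattrs.get("user.hsm.state")

print(hsm_state(vendor_a))  # 'migrated'
print(hsm_state(vendor_b))  # None: same information, different key
```

The mechanism works on each file system in isolation; what never materialized was the agreement on names that would make the metadata portable.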

REST interfaces, on the other hand, are easy to extend compared to file systems. Many REST implementations are built on databases, and adding a new bit of information is pretty easy. The mapping of objects to their storage locations is often similar to what standard file systems do today, but developers have learned a great deal over the last 25 to 30 years. There are now lots of new things users can provide per object, such as checksums and custom fields. All of this gives REST a big advantage. From what I have seen, though, there is not a great deal of coordination to ensure that everyone gets the same per-object metadata so that you can easily move from one object store to another.
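As one concrete example of this extensibility, Amazon's S3 REST API lets a client attach arbitrary per-object metadata at upload time simply by adding `x-amz-meta-*` headers to the PUT request. A sketch of the headers such an upload might carry (the field names and values below are invented for illustration):

```python
import base64
import hashlib

body = b"scan data ..."

# Custom per-object metadata rides along as ordinary HTTP headers;
# the object store records it without any schema change.
headers = {
    "Content-Length": str(len(body)),
    # Content-MD5 is the base64-encoded MD5 digest, a per-object checksum.
    "Content-MD5": base64.b64encode(hashlib.md5(body).digest()).decode(),
    "x-amz-meta-patient-id": "anon-0042",   # invented example values
    "x-amz-meta-modality": "CT",
    "x-amz-meta-acquired": "2014-03-01T12:00:00Z",
}

# Adding a new attribute later is just one more header --
# contrast that with extending a fixed POSIX inode layout.
headers["x-amz-meta-radiologist"] = "dr-smith"
print(sorted(k for k in headers if k.startswith("x-amz-meta-")))
```

The catch, as noted above, is that each store (and each user) can pick its own metadata keys, so nothing guarantees that two object stores interpret the same attributes the same way.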

Per file/object metadata

Lots of file types (soon they might be called "object types") have their own metadata describing the file or object. For example, a JPEG has lots of information describing the file. For medical images, there are DICOM files, whose format is controlled by the National Electrical Manufacturers Association (NEMA), many of whose members are vendors that make scanning equipment such as MRI, CT and X-ray machines. In weather, there is GRIB (Gridded Binary, or General Regularly-distributed Information in Binary form), which has been agreed upon by the World Meteorological Organization. Oil and gas exploration companies also have standard formats, as does the electric power grid community.

Needless to say, there are lots and lots of file formats, more than I know about and likely more than anyone knows about. Each of these formats is very specific to the data type and function of the file. Additionally, there are self-describing formats such as HDF-5 (Hierarchical Data Format), which is used a great deal in high performance computing. HDF-5 allows users to define the format so other users can use the data, and so that after twenty years the users themselves can actually figure out and remember what they were doing.
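Reading real HDF-5 requires a library such as h5py, but the self-describing idea can be sketched with nothing beyond the standard library: the data travels with an attribute block that explains what it is, so a reader twenty years later needs no external documentation. The attribute names below are invented for illustration:

```python
import json

# A self-describing dataset in the spirit of HDF-5 attributes:
# the payload carries its own description of units and provenance.
dataset = {
    "attributes": {
        "title": "surface temperature",          # invented example metadata
        "units": "kelvin",
        "grid": "0.5 degree lat/lon",
        "created_by": "forecast run 2014-03-20T00Z",
    },
    "values": [271.3, 272.8, 274.1],
}

blob = json.dumps(dataset)    # write the data out ...
recovered = json.loads(blob)  # ... and any reader can interpret it unaided
print(recovered["attributes"]["units"], len(recovered["values"]))
```

The point is not the serialization format but the pairing: values never travel without the attributes that make them intelligible.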

None of this embedded metadata is searchable from within a POSIX file system. There are, of course, applications that can index the metadata of almost every file type, but these applications are not interchangeable and generally work only for the specific file types of a given industry.

What if an object store could ingest every file type because it had a framework that allowed the type-specific metadata of every known file type to be indexed? That would be a pretty killer application, and it would likely drive out of business a number of software vendors that work on specific file types in specific application fields. I am sure this is not far from reality, and it gives object storage a very, very significant advantage over POSIX file systems.
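Such a framework is essentially an extractor registry plus a shared inverted index: one plug-in per file type pulls out that type's metadata, and a single index makes it all searchable across types. A minimal sketch, with the extractors and field names invented for illustration:

```python
# Sketch of a cross-format metadata indexer: per-type extractor
# plug-ins feed one shared inverted index.
extractors = {
    ".jpg":  lambda obj: {"camera": obj.get("camera")},  # invented fields
    ".grib": lambda obj: {"model": obj.get("model")},
}

index = {}  # (field, value) -> set of object names

def ingest(name, ext, raw_metadata):
    # Run the type-specific extractor, then index every field it yields.
    for field, value in extractors[ext](raw_metadata).items():
        if value is not None:
            index.setdefault((field, value), set()).add(name)

ingest("scan1.jpg", ".jpg", {"camera": "nikon"})
ingest("gfs.grib", ".grib", {"model": "GFS"})
ingest("scan2.jpg", ".jpg", {"camera": "nikon"})

# One query interface spans every ingested file type.
print(sorted(index[("camera", "nikon")]))
```

Adding support for a new format is just registering one more extractor; nothing else in the store changes, which is precisely what a fixed inode layout cannot offer.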

What about unstructured data?

A great number of people are discussing and writing about unstructured data and how it can be made searchable. We have open source applications such as Hadoop, along with commercial competitors and implementations from a variety of vendors. Groups like OASIS are trying to tackle putting some structure around unstructured data. When you do this, you change the information from unstructured to structured.

The problem is getting every stakeholder to agree on a common structure for a data type. There is a great deal of work being done in this area for all kinds of new data types and interchange formats. The unstructured data of yesteryear might become the structured data of the future, once file formats are agreed upon and someone makes the old unstructured data useful as new structured data. It will likely have to be read in, with the structured data areas created and populated, and then written out in the new structured format. The same method is also used to update old structured formats to new ones.
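The read-in, populate, write-out loop described above can be sketched in a few lines: a pattern gives old free-text records a structure, and the result is written back out in the new structured form. The record layout here is invented for illustration:

```python
import json
import re

# Yesterday's unstructured records ...
unstructured = [
    "2014-03-20 sensor7 temp=271.3",
    "2014-03-21 sensor7 temp=272.8",
]

# ... are read in, and the structured data areas created and populated ...
pattern = re.compile(r"(\S+) (\S+) temp=(\S+)")

def restructure(line):
    date, sensor, temp = pattern.match(line).groups()
    return {"date": date, "sensor": sensor, "temp_kelvin": float(temp)}

# ... then written out in the new structured format.
structured = [restructure(line) for line in unstructured]
print(json.dumps(structured[0]))
```

The hard part in practice is not this loop but the step before it: getting everyone to agree on what the structured record should contain.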

None of this is fast and none of this is easy, but it is happening all the time.

The Future

Object storage with REST interfaces has very significant advantages in the area of metadata over the meager amount of information available through the POSIX file system interface. You could argue that thirty years ago, when the POSIX interface was first being discussed, this type of metadata was not important: given the cost and availability of bandwidth and storage space, storing and accessing rich metadata was far too costly. This argument is, in my opinion, correct.

The problem is that POSIX did not change, and the world did. The way I see it, in the long term, POSIX file systems and the limited information they expose are going to be a thing of the past.

You might ask how long it will take. My guess would be about ten years, for the following reasons:

  1. Though REST frameworks and object storage are coming on strong, there still isn't broad industry adoption, and scalability and interoperability issues have not been fully worked out.
  2. Some of the standards need to be fleshed out, especially for access control and other security functionality.
  3. Some POSIX file systems scale to thousands of clients and over 1 terabyte per second. There are no object storage systems that I am aware of that can do both of those things today. This is required by a number of high-end application domains like weather forecasting.
  4. As far as I am aware, REST interfaces do not allow reading a file/object into an application while it is still being transferred. This is important for applications that need to process data as soon as possible; security cameras with image recognition come to mind. Supporting this will require restructuring many applications.
  5. There are lots and lots of old applications that will require having the I/O sections rewritten to support new interfaces.
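Point 4 above is worth a sketch: a whole-object GET forces a download-then-process pattern, whereas a streaming interface lets the application consume each chunk the moment it arrives. A toy comparison with a simulated transfer:

```python
def arriving_chunks():
    # Simulated network transfer, delivering one chunk at a time.
    for chunk in (b"frame1|", b"frame2|", b"frame3|"):
        yield chunk

# Whole-object model: nothing can run until the transfer completes.
whole = b"".join(arriving_chunks())

# Streaming model: each chunk is processed as it lands, which is
# what latency-sensitive applications (e.g. camera image
# recognition) need from an object interface.
processed = []
for chunk in arriving_chunks():
    processed.append(chunk.rstrip(b"|").decode())

print(processed)
```

Until object interfaces expose something like the second pattern, applications built around incremental reads will need their I/O paths restructured, which is the same obstacle as point 5.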

There are many good reasons that REST will likely dominate in the next decade, but the biggest is that you cannot just dig in your heels and say, "I am not going to do X, Y or Z," when the world around you is changing. That might work for a while, but in the long term it always ends badly, in this area as in so many others.

Photo courtesy of Shutterstock.

This article was originally published on Thursday, Mar 20th 2014.