The Evolution of Data Processing
Data processing has always been a rapidly changing field, but the exponential growth of streaming and social media is introducing new challenges for those who manage data. Enterprises are looking for solutions that help them make sense of these new sources, analyze them in real time, and act quickly on opportunities.
We hope this post will give you some insight into how and why digital information becomes overwhelming, what we're doing about it here at Forrester, and where we think the future is taking us.
We'll start by looking at the basics of data—what is it, and what do we need it for? We'll then address the different types of data and data formats, and why they are important. Finally, we'll step outside of Forrester's office to see how these factors play out across other organizations.
The Basics: What Is Data? And Why Do We Need It?
"Data" can be any kind of recorded information: something as simple as a few lines in a spreadsheet or a text file of measurements, or something as complex as streaming video and audio feeds from a web cam. The most commonly used data (e.g., those in a spreadsheet or text file) are usually referred to as "XML," and we'll cover XML in more detail below. But regardless of what type of data is involved, the essential feature is that it has a defined format that allows for easier access and manipulation. For example, one could build an application to search for all orders with a certain product ID using the built-in XML capability of Microsoft Access.
Data is stored and used for a variety of purposes. Historically, data has been used to record events and activities that have already happened in the past. For example, sales data is used to track previous sales (which may be useful for future business decisions), as well as to prepare reports or complete tax forms. In contrast, streaming data on a social media site like Twitter allows us to observe what's happening right now and act in real time (or at least "near-real time") if necessary.
Getting Started with a Data Management Framework
We think of data in terms of the following:
Data sources (XML, CSV, database, Excel spreadsheet, etc.) are where data lives. Data can be stored in multiple sources.
Data movement (ETL, short for extract, transform, load) refers to how the data gets from one source to another.
Transformation occurs when the data is moved between sources. For example, if you have a master database that holds several years' worth of sales figures and you want to send them to an Excel spreadsheet for your own analysis, this would be considered a transformation from database to spreadsheet. Transformation also refers to changing the structure or format of data for easier access or use. An example would be converting the data from a database to XML.
Data usage is the utilization of data for reporting and analysis, where business decisions are made based on the information in the data (e.g., in spreadsheets or through queries).
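As a rough illustration of the transformation step, here is a minimal Python sketch that converts database-style records to XML. The rows, the column names, and the "Sales" and "Sale" tag names are all made up for this example; a real pipeline would read the rows from an actual data source.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, shaped like the result of a database query.
rows = [
    {"product_id": "P-100", "year": "2022", "units_sold": "140"},
    {"product_id": "P-100", "year": "2023", "units_sold": "185"},
]


def rows_to_xml(rows, root_tag="Sales", row_tag="Sale"):
    """Build an XML tree with one element per row and one child
    element per column. Tag names are assumptions for illustration."""
    root = ET.Element(root_tag)
    for row in rows:
        row_el = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(row_el, column).text = str(value)
    return root


root = rows_to_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The content of the data does not change here; only its structure does, which is what makes this a transformation rather than an analysis step.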
The graphic below shows how these processes fit together. Each of these processes is explained in more detail below.
Data Sources
XML: Extensible Markup Language (XML) data cannot be understood without knowing something about how it is structured and what tags are used to represent various elements within XML documents. An XML document is composed of elements, attributes, and text. Elements are used to represent two types of information: information entities that have a specific identity and attributes that are associated with the identity.
For example, an item in a store's inventory could be represented by the following XML:
<Product>
  <name>XR-5 Hoverboard</name>
  <description>Real hover power!</description>
  <price>9999.99</price>
</Product>
Here, "name," "description," and "price" are attributes of the specified entity (the product). In XML terms, each one is written as a child element of the "Product" element. The "name" element is the name assigned to this particular product, "description" provides more information (e.g., what the item does), and "price" is the price associated with that particular product.
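To see how a program might read the product example above, here is a small sketch using Python's standard xml.etree.ElementTree module. The variable names are our own, and the choice of Decimal for the price is an assumption made to avoid floating-point rounding of currency values.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The product document from the example above.
product_xml = """
<Product>
  <name>XR-5 Hoverboard</name>
  <description>Real hover power!</description>
  <price>9999.99</price>
</Product>
"""

# Parse the text into an element tree; the root element is <Product>.
product = ET.fromstring(product_xml)

# Each attribute of the entity is the text of a child element.
name = product.findtext("name")
description = product.findtext("description")
price = Decimal(product.findtext("price"))

print(f"{name}: {description} ({price})")
```

Note that the program has to know the tag names ahead of time, which is exactly why the structure of an XML document needs to be understood before its data can be used.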
Attributes associated with an entity (e.g., a specific product) can be used to represent additional information about that entity. When a person fills out a form, for example, attributes can record where he or she lives, and so on. Attributes are also often used for security purposes, for example to confirm that data has been properly verified.
Categories: Categorization is a basic principle of information management. Categories allow you to identify entities on the basis of attributes (e.g., in an inventory system, items would be assigned to a category, like "toys," and then organized into subcategories, like "girls," "boys," and so on). The advantage of using categories is that they provide a way of organizing data according to common attributes. For example, if your data contains one product priced at $979.99 and another priced at $10,000.00, without categories these would be stored simply as two different products that differ in the value of the price attribute (979.99 and 10,000.00).
Using categories allows you to find both products as well as their respective prices (assuming the IT department has organized the data in such a way that every value of "price" is stored in the same category), which is a much simpler way to work with the different values of the price attribute. We'll talk more about categorization in the next section.
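A minimal sketch of this idea in Python follows. The inventory records, the item names, and the "games" category are invented for illustration; only the two prices come from the example above.

```python
from collections import defaultdict

# Hypothetical inventory records; names and categories are made up.
inventory = [
    {"name": "Item A", "category": "toys", "price": 979.99},
    {"name": "Item B", "category": "toys", "price": 10000.00},
    {"name": "Item C", "category": "games", "price": 24.50},
]


def group_by_category(items):
    """Return a dict mapping each category to the list of its items."""
    groups = defaultdict(list)
    for item in items:
        groups[item["category"]].append(item)
    return dict(groups)


by_category = group_by_category(inventory)

# Both products and their prices can now be found through one category.
toy_prices = [item["price"] for item in by_category["toys"]]
print(toy_prices)
```

Once the items share a category, finding every price in that category is a single lookup instead of a search across unrelated products.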
Data formats: XML itself does not dictate how data is represented, for example whether information should be structured as lines or paragraphs of text, whether it should be broken up into sections within one XML document or stored in separate files, and so on. These choices are up to the data producer. There are some basic standards that are useful to everyone, but they do not cover every possible format a data file could be in. Thus, it always helps to know which standard is being used as a starting point for understanding your data.
Common formats: CSV (comma-separated values) and XML are two of the most common formats for storing tabular data. In a CSV file, each line holds one record, with its attributes separated by commas (e.g., item-name,price).
Conclusion: XML and CSV are two of the most common formats for moving tabular data into and out of database management systems (DBMS).
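As a quick sketch of what working with CSV looks like, the following Python example reads and rewrites a small file using the standard csv module. The sample records are made up; the second one is there to show that a value containing a comma has to be quoted so it is not split into two attributes.

```python
import csv
import io

# Hypothetical CSV content: a header line, then one record per line.
csv_text = (
    "item-name,price\n"
    "XR-5 Hoverboard,9999.99\n"
    '"Blocks, wooden",12.50\n'
)

# DictReader uses the header line to name each attribute.
records = list(csv.DictReader(io.StringIO(csv_text)))
for record in records:
    print(record["item-name"], record["price"])

# Writing the records back out; the csv module adds quotes where needed.
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["item-name", "price"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(records)
```

Every value comes back as text, so it is up to the consumer of the file to decide that "price" should be treated as a number.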
Data Movement (ETL)
Data is stored in many different locations, and often there are multiple copies of the same data with varying attributes. This is referred to as "data fragmentation." For example, if you have a database containing several years' worth of information, you may want to analyze this data in order to see trends that would be useful for making decisions about new products or marketing approaches. The simplest way to do this would be to use a specialized program that can move the data from one place (where it currently resides) to another location (e.g., an Excel spreadsheet).
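A simple version of such a program might look like the Python sketch below. It uses an in-memory SQLite database as a stand-in for the master database and writes CSV, which Excel can open, as the destination format. The table name, columns, and sales figures are all hypothetical.

```python
import csv
import io
import sqlite3

# An in-memory database stands in for the real source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2022, "XR-5 Hoverboard", 140), (2023, "XR-5 Hoverboard", 185)],
)


def export_to_csv(conn, query, out_file):
    """Run a query and write its rows, with a header line, as CSV.
    Returns the number of data rows written."""
    cursor = conn.execute(query)
    rows = cursor.fetchall()
    writer = csv.writer(out_file, lineterminator="\n")
    # cursor.description holds one entry per column; item 0 is the name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()
row_count = export_to_csv(
    conn, "SELECT year, product, units FROM sales ORDER BY year", buffer
)
print(buffer.getvalue())
conn.close()
```

In practice the output would go to a file on disk rather than an in-memory buffer, but the movement step is the same: query the source, then write the rows in a format the destination understands.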