Big Data: Power Is Nothing Without Control

Access to large, diverse quantities of data has the potential to revolutionize our world, but only if we can harness the tools and technology to manage, store and analyze it.

“There hasn’t been a murder in six years. There is nothing wrong with the system, it is perfect,” says Chief John Anderton, played by Tom Cruise, in the 2002 sci-fi thriller Minority Report. The system Anderton is referring to consists of human oracles who can predict the future with great accuracy, allowing the police force to prevent a crime before it happens.

Although the concept explored in Minority Report is purely fictional, New York University School of Law professor Anne Milgram believes it might not be too far from reality. Delivering an October 2013 TED talk in San Francisco, Milgram described her experience from 2007 to 2010 as attorney general for the state of New Jersey, where she introduced smart data and rigorous statistical analysis to all stages of the criminal justice process, starting with police departments. As a result, Camden, which long had been among the most dangerous cities in the U.S., saw its murder rate drop by 41 percent and its overall crime rate fall by 26 percent during Milgram’s time in office. In her TED talk Milgram also discussed a universal risk assessment tool for judges that she had developed at the Laura and John Arnold Foundation. By analyzing the pretrial records of 1.5 million criminal cases in some 300 U.S. jurisdictions, the tool identifies highly relevant risk factors among defendants, which judges can use to conclude how likely someone is to commit a new crime after being released.

Crime prevention is just one of many applications of Big Data, which is transforming fields from self-driving cars to health care. But before Big Data can be successfully employed, a few issues need to be addressed. The massive, fast-changing and diverse quantities of information that define Big Data are extremely difficult to manage with conventional infrastructure, technology and skills. Furthermore, an automated approach is needed to extract value from data whose volume and complexity are increasing at an exponential rate. This can be done through machine learning, in which computer algorithms learn from the past to estimate the future: forecasting, not foretelling.

Understanding the Dimensions of Data

In the past two decades, our world has changed substantially as the Internet has established itself as the medium propelling the wave of globalization. The Internet contains an immense volume of information, and every contact with it creates more data. From the moment we log on, our every move leaves a footprint that often contains detailed information about who we are and what we do, and that footprint is only a small fraction of the ways data is being created. Every day, 2 billion active Facebook users share photos and messages, and Twitter users post 500 million tweets.

Data creation, storage, retrieval and analysis are challenging in a Big Data world and cannot be handled by conventional tools. We can solve this problem by focusing our attention on three important dimensions of Big Data: volume, velocity and variety. (There is also a fourth characteristic, veracity, which refers to data quality and suggests that analytic outputs could deteriorate unless the underlying data is handled with care and honesty.)

The amount of data stored or processed is referred to as volume. A typical PC might have had ten gigabytes of storage a decade ago; today some desktop users find one terabyte to be insufficient. Although Facebook doesn’t officially report how much data the social network produces, a 2014 company blog post revealed that its data warehouse stores more than 300 petabytes of information and processes more than 600 terabytes of new data every day. The proliferation of smartphones, the sensors embedded into everyday objects — the so-called Internet of Things (IoT) — and the ongoing smart automation of our daily lives result in billions of new, constantly updated data feeds of environmental, locational and other information.

Data velocity refers to the speed at which data arrives, and it is often the case that data simply streams from the source. Clickstreams and ad impressions, for example, capture user behavior at millions of events per second, allowing retailers to track web clicks to identify trends that improve storage, pricing and campaigns. Online gaming platforms support millions of concurrent users, each producing multiple inputs per second.

Big Data isn’t just numbers, dates or sequences of characters; it also consists of geospatial data, 3-D data, audio, video and unstructured text, which includes log files and social media. Traditional Structured Query Language (SQL) database systems were designed to operate on a single server, making increased capacity expensive and finite. They were not designed to address a large variety of unstructured data, frequent updates and typically less predictable data characteristics. As applications have evolved to serve a large number of users and as application development practices have become agile, the traditional relational database model has become a liability for many companies.

Applications and Existing Solutions

For organizations of all sizes, data management has shifted from an important competency to a critical differentiator that can determine market winners and losers. Fortune 1000 companies and government bodies already are employing the innovations of tech sector leaders such as Amazon.com, Apple, Facebook, Google, IBM Corp., Microsoft Corp. and Tesla. They have learned that Big Data is not a single technology, technique or initiative but rather a major driver of the modern digital world. Google returns a page relevant to your query by ranking pages based on your click history. Apple’s natural language assistant Siri and Microsoft’s Cortana handle voice commands. Facebook tracks users to ease tagging activity through facial recognition and suggests pages and products that match user interests. Tesla generates highly accurate road maps for driverless cars. Amazon finds users similar to you and recommends products related to their purchase histories.

The big tech companies are defining new directions and reevaluating existing strategies to transform their businesses. For less-pioneering companies looking to extract value from Big Data, it’s necessary to have the right hardware and software to tackle its three major dimensions.

Volume. Managing a company’s data on a single machine has always been challenging. One solution is to distribute data among multiple machines, usually called nodes. Although this is a reasonable approach, it presents challenges. How fast is the system? How accessible is it? How scalable is it? How fault tolerant is it? How secure is it?

Hadoop is an open-source Big Data software framework with its own distributed file system, HDFS, which enables users to tie together multiple machines to create large file systems. An Apache Software Foundation project, Hadoop is a manifestation of the MapReduce paradigm, an abstraction model for processing and generating Big Data. The paradigm consists of two steps: a Map step that transforms input data into tuples (key/value pairs) and a Reduce step that aggregates those pairs.
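
To make the two steps concrete, here is a minimal, pure-Python sketch of a MapReduce-style word count. The map_step, shuffle and reduce_step functions are hypothetical stand-ins for logic that Hadoop would distribute across nodes; this is not Hadoop's actual API.

```python
from collections import defaultdict

# Map step: emit (key, value) tuples; here, (word, 1) for every word in a line.
def map_step(line):
    for word in line.lower().split():
        yield word, 1

# Shuffle: group values by key (in Hadoop this happens between Map and Reduce).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce step: aggregate the values for each key; here, summing the counts.
def reduce_step(key, values):
    return key, sum(values)

lines = ["big data needs big tools", "big data needs control"]
pairs = [pair for line in lines for pair in map_step(line)]
counts = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'needs': 2, 'tools': 1, 'control': 1}
```

In a real Hadoop job, the shuffle and the distribution of work across machines are handled by the framework; the user writes only the Map and Reduce logic.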

The Hadoop ecosystem supports a cluster management tool called YARN, as well as a number of open-source and commercial tools and solutions. Examples include Hive, which supports SQL-like queries, and Pig, a high-level platform that reduces the complexity of writing MapReduce tasks directly.

Velocity. Efficiency in processing data is crucial for dealing with large volumes. Latency when reading data directly from hard disks has become a serious issue — and the reason in-memory technologies have been developed. These not only deal with the distribution of data across machines, but they also manage access to the data, through dynamic random-access memory (DRAM) or flash solid-state drives (SSD). The memory-access problem is even more pronounced for streaming data.

One solution is MemSQL, a memory-optimized, distributed relational database management system. It provides the option of reading and writing directly from memory to facilitate real-time analytics. There is also software for high-velocity data in the Hadoop ecosystem. Apache Spark is a distributed, in-memory data-processing platform; because it has no distributed storage system of its own, it typically runs on top of Hadoop, using resilient distributed datasets (RDDs) as its distributed-memory abstraction. Apache Storm is similar to Spark but focuses on streaming data.
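
To illustrate the RDD abstraction, the short PySpark sketch below runs the same kind of word count and lets Spark distribute the work. The local master setting and the tiny in-memory input are assumptions made to keep the example self-contained; a production job would typically read from HDFS or another distributed store.

```python
from pyspark import SparkContext

# Run locally for illustration; in production the master would point at a cluster.
sc = SparkContext(master="local[*]", appName="word-count-sketch")

lines = sc.parallelize(["big data needs big tools", "big data needs control"])

counts = (
    lines.flatMap(lambda line: line.lower().split())  # split lines into words
         .map(lambda word: (word, 1))                 # Map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)             # Reduce: sum counts per word
)

print(counts.collect())  # e.g., [('big', 3), ('data', 2), ...]
sc.stop()
```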

Variety. For decades relational databases queried with SQL were the industry standard. Structuring data in predefined tables of rows and columns was sufficient for traditional business applications, but with the advent of Big Data, the demand for non-SQL, or NoSQL, databases has grown. Widely used NoSQL databases include Apache Cassandra, MongoDB, Neo4j and Redis. They can handle a variety of unstructured data, from text to social media posts and email.
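
As a small illustration of the document-oriented flavor of NoSQL, the sketch below uses MongoDB's Python driver to store records whose fields vary from document to document, something a fixed relational schema handles poorly. The connection string and the database and collection names are placeholders chosen for the example.

```python
from pymongo import MongoClient

# Placeholder connection string; a real deployment would point at a cluster.
client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["social_posts"]

# Documents in the same collection need not share a schema.
collection.insert_many([
    {"user": "alice", "text": "Loving the new phone!", "likes": 12},
    {"user": "bob", "video_url": "https://example.com/clip.mp4",
     "tags": ["travel", "food"], "geo": {"lat": 40.7, "lon": -74.0}},
])

# Query on a field that only some documents contain.
for post in collection.find({"tags": "travel"}):
    print(post["user"], post.get("geo"))
```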

Setting up a scalable and fully functional Big Data management system is time-consuming and expensive. Not surprisingly, many small- to medium-size businesses are turning to cloud-computing providers such as Amazon Web Services, Google App Engine and Microsoft Azure for their data management needs. Handling Big Data is only the first step, however — what really matters is the ability to extract the information assets it contains. Data itself is a raw resource, and if you do not have a solid analytical approach, it can be difficult to use.

Business intelligence solutions such as Microsoft Power BI, Tableau and Qlik are good platforms for data retrieval, visualization, transformation and reporting, but they have limited analytical depth. To get the most value from the raw data, companies must be able to interpret it. That is, people are needed to perform the analysis. However, the current supply of qualified analysts is insufficient to meet the rising demand for information processing.

An increasingly popular option is to use machine learning to automate the extraction of useful information from data. Machine learning algorithms build knowledge from historical data and then forecast the outcome of an event by applying the acquired knowledge to unseen data. The process resembles how a child learns to talk or walk: Initially, the child doesn’t know anything but gradually improves through feedback from the environment. In the case of machine learning, we use different kinds of feedback to optimize an objective function, just as the child’s objective is to talk or walk.

These algorithms are categorized into three major types: supervised, unsupervised and semisupervised. In supervised learning the algorithm learns from labeled data; unsupervised learning finds structure in the data without the need for labeled examples. Semisupervised learning, under which reinforcement learning is sometimes grouped, uses sparse feedback from the environment to improve performance on a particular task. Supervised learning, and deep learning in particular, currently dominates the direction of modern research because it has proved able to match human performance without relying on hand-coded expert knowledge.
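
The minimal scikit-learn sketch below, run on synthetic data, illustrates the difference between the first two categories: a supervised classifier learns from labeled examples and forecasts labels for unseen data, while an unsupervised algorithm finds structure without using labels at all. The dataset and model choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic "historical" data: 1,000 labeled examples with 10 features each.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: learn from labeled history, then forecast labels for unseen data.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Unsupervised: find structure (here, two clusters) without looking at the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in (0, 1)])
```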

So how do deep learning models actually work? The core element of deep learning is the neural network, a nonlinear learning model built from artificial neurons, layers of these neurons and activation functions. It is an end-to-end model: The data is fed in at the beginning and the desired outcome is stated at the end (e.g., the correct class when classifying an image, or the target value when predicting a stock price). The raw data passes through layers of neurons. The earlier layers detect simpler features; deeper layers find complex combinations of those simpler structures. In the end, the network produces a prediction. Researchers compare the prediction with ground truth information collected in the field by computing the discrepancy, then adjust the weights of the network’s connections to compensate for the error. They repeat this until the model converges to a network that produces small errors on the training set. In general, the more data the model uses, the better the results. This is where Big Data really shines.
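
The minimal Keras sketch below, trained on synthetic data, mirrors that loop: stacked layers with nonlinear activations produce a prediction, a loss function measures the discrepancy with the ground truth, and the optimizer repeatedly adjusts the connection weights until the training error is small. The architecture and the data are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras

# Synthetic labeled data standing in for ground truth collected in the field.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, :5].sum(axis=1) > 0).astype("float32")  # a simple hidden rule to learn

# Layers of neurons; deeper layers combine the simpler features found earlier.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # the network's prediction
])

# The loss measures the discrepancy between prediction and ground truth;
# the optimizer repeatedly adjusts the connection weights to shrink it.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, verbose=0)

print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```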

The main advantage of these algorithms is that features do not have to be selected manually. However, some work is still required before training, such as data wrangling and labeling, to achieve valuable results. Obtaining clean, manually labeled data can be expensive, but this is not a problem for companies with a critical mass of clients, as their client databases are gold mines for such a resource.
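
As a hypothetical illustration of that wrangling step, the pandas sketch below cleans a toy client table before it would be handed to a model; the column names and cleaning rules are assumptions invented for the example.

```python
import pandas as pd

# Toy "raw" client records with typical problems: duplicates, missing values, mixed types.
raw = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 4],
    "age": [34, 34, None, 51, 29],
    "spend": ["120.5", "120.5", "87", "nan", "300"],
    "churned": [0, 0, 1, None, 0],  # the label a model would be trained on
})

clean = (
    raw.drop_duplicates(subset="client_id")                                 # remove repeated records
       .assign(spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"))  # fix mixed types
       .dropna(subset=["churned"])                                          # keep only labeled rows
       .assign(age=lambda d: d["age"].fillna(d["age"].median()))            # impute missing ages
)

print(clean)
```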

Limitations and Potential Pitfalls

Big Data surely looks like the philosopher’s stone, doesn’t it? Researchers amass a lot of data related to the problem at hand, use the appropriate software and obtain their solution. By feeding enough labeled data into a state-of-the-art deep learning model, we should get a solution at human performance levels.

Not quite. The problem lies in the fact that even the best experts are unable to tell what the models are learning; a recent article in MIT Technology Review describes this as “the dark secret at the heart of AI.” Researchers can offer ideas about what might be happening, but when it comes to accountability, no one is sure what to do. Would you allow your business strategy to be run by something you don’t understand? If we knew what was being learned, it would be easier to determine whether any biases are present.

Of course, there is no way to design a flawless system from the get-go. No matter how good people are at anticipating problems, the world has simply become too complex. Only by understanding a system completely can its failures be diagnosed and corrected, and that correction is a never-ending cycle of improvement: A problem occurs, you localize and fix it, then you rinse and repeat.

In the case of deep learning, what do you do when something goes wrong and you are unable to localize the issue? There is no guarantee that the same mistake won’t happen again. The model can be retrained to potentially eliminate the problem, but that doesn’t prevent new issues from showing up. Not knowing what is going on under the hood is a huge disadvantage.

Data is unquestionably one of the most valuable resources in the modern digital world. Its applications are exciting and already omnipresent in our lives; and while Big Data is a powerful tool, it is far from perfect. It is just a means to an end and definitely not the ultimate solution. As Chief Anderton learns in Minority Report, how we decide to use data is what matters most.

 

Thought Leadership articles are prepared by and are the property of WorldQuant, LLC, and are circulated for informational and educational purposes only. This article is not intended to relate specifically to any investment strategy or product that WorldQuant offers, nor does this article constitute investment advice or convey an offer to sell, or the solicitation of an offer to buy, any securities or other financial products. In addition, the above information is not intended to provide, and should not be relied upon for, investment, accounting, legal or tax advice. Past performance should not be considered indicative of future performance. WorldQuant makes no representations, express or implied, regarding the accuracy or adequacy of this information, and you accept all risks in relying on the above information for any purposes whatsoever. The views expressed herein are solely those of WorldQuant as of the date of this article and are subject to change without notice. No assurances can be given that any aims, assumptions, expectations and/or goals described in this article will be realized or that the activities described in the article did or will continue at all or in the same manner as they were conducted during the period covered by this article. WorldQuant does not undertake to advise you of any changes in the views expressed herein. WorldQuant may have a significant financial interest in one or more of any positions and/or securities or derivatives discussed.