By Tim Chartier, PhD, Davidson College
Big data is often defined as having three v’s: Volume, velocity, and variety. We stand in a data deluge that is showering large volumes of data at high velocities with a lot of variety. With all this data comes information, and with that information comes innovation potential. Let’s take a closer look at these “three v’s” of big data and how they help us understand this highly complex field.
Which would you say is bigger: The complete works of Shakespeare or an ordinary DVD? The complete works of Shakespeare fit in a big book, of roughly 10 million bytes. But any DVD, or any digital camera, for that matter, will hold upwards of four gigabytes, which is 4 billion bytes. A DVD is 400 times bigger. All the printed words in the Library of Congress would be 10 trillion bytes, 10 terabytes. That’s one very large wall full of DVDs, but it’s also about the size of a single high-end personal hard drive. That is, you might carry all the books in the Library of Congress on a single device the size of just one book.
Data is not merely being stored: We access a lot of data over and over. Google alone returns to the web each day to process another 20 petabytes. What’s that? It’s 20,000 terabytes, 20 million gigabytes, 20 quadrillion bytes. How big do you want to go? Google’s daily processing gets us to one exabyte every 50 days. And 250 days of Google processing may be equivalent to all the words ever spoken by humankind to date, which have been estimated at five exabytes. Bigger still is the entire content of the World Wide Web, estimated at upwards of one zettabyte, which is 1,000 exabytes, or 1 trillion gigabytes. That’s 100 million times larger than the Library of Congress. Of course, there is a great deal more that is not on the web.
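The size comparisons above are easy to check with a few lines of arithmetic. The sketch below assumes decimal units (a gigabyte as exactly 1 billion bytes, and so on), which matches the round numbers quoted in the text. The variable names are only illustrative.

```python
# Decimal (SI) byte units, matching the round numbers quoted in the text
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18
ZB = 10**21

# Sizes as estimated in the text
shakespeare = 10 * MB          # complete works of Shakespeare
dvd = 4 * GB                   # an ordinary DVD
library_of_congress = 10 * TB  # all printed words in the Library of Congress
google_daily = 20 * PB         # data Google processes per day
all_words_spoken = 5 * EB      # all words ever spoken by humankind
web = 1 * ZB                   # content of the World Wide Web

dvd_vs_shakespeare = dvd // shakespeare
days_per_exabyte = EB // google_daily
days_for_all_words = all_words_spoken // google_daily
web_vs_library = web // library_of_congress
web_in_gigabytes = web // GB

print(f"A DVD is {dvd_vs_shakespeare} times bigger than Shakespeare")
print(f"Google processes one exabyte every {days_per_exabyte} days")
print(f"Five exabytes take {days_for_all_words} days")
print(f"The web is {web_vs_library:,} times the Library of Congress")
print(f"One zettabyte is {web_in_gigabytes:,} gigabytes")
```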
This is a transcript from the video series Big Data: How Data Analytics Is Transforming the World. Watch it now, on Wondrium.
But let’s turn to the velocity of data. Let’s start a clock, to see what this feels like. Not only is there a lot of data, but it’s coming at very high rates. High-speed Internet connections offer speeds 1,000 times faster than dial-up modems connected by ordinary phone lines. Here are some things that are happening every minute of the day.
- YouTube users upload 72 hours of new video content.
- In the United States alone, there are 100,000 credit card transactions.
- Google receives over 2 million search queries.
- 200 million email messages are sent.
It can be hard to wrap one’s mind around such numbers. How much data is being generated? Let’s turn to Facebook: In only 15 minutes, the number of photos uploaded to Facebook is greater than the number of photographs stored in the New York public photo archives. That’s every 15 minutes! Now think about the data over a day, a week, or a month.
Finally, there is variety. One reason for this can stem from the need to look at historical data, but data today may be more complete than data of yesterday. The cost of a gigabyte in the 1980s was about a million dollars. So, a smartphone with 16 gigabytes of memory would be a $16 million device. Today, someone might comment that 16 gigabytes isn’t much memory. This is why yesterday’s data may not have been stored at all, or may not have been stored in a suitable format, compared to what can be stored today. Now, consider satellite imagery. The images come in a large variety of aspect ratios. While we know that a satellite image will contain pixels, we don’t necessarily know what is in the picture, or not in the picture. I don’t necessarily know where to look and I may not even know what to look for.
The Three V’s
So, we stand in a data deluge that is showering large volumes of data at high velocities with a lot of variety. With all this data comes information, and with that information comes the potential for innovation. Steve Jobs, the charismatic co-founder of Apple, was diagnosed with pancreatic cancer in 2003. He became one of the first people in the world to have his entire DNA sequenced, as well as that of his tumor. It cost him a six-figure sum but now he had his entire DNA. Why? When doctors pick medication, they hope the patient’s DNA is sufficiently similar to that of the patients in the drug trial. Steve Jobs’s doctors knew his genetic makeup and could carefully pick treatments. When one treatment became ineffective, they could move to another. While Jobs eventually died from his illness, having all the data and all that information added years to his life.
We all have immense amounts of data available to us every day. Search engines almost instantly return information on what can seem like a boundless array of topics. For millennia, humans have relied on each other to recall information. The Internet is changing that, and how we perceive and recall details in the world. Human beings tend to distribute information through what is called a transactive memory system, and we used to do this by asking each other. We also have lots of transactions with smartphones and other computers. They can even talk to us. In a study covered in Scientific American, Daniel Wegner and Adrian Ward discuss how the Internet can deliver information quicker than our memories can. Have you tried to remember something and meanwhile a friend types it into a smartphone, gets the answer, and if it is a place, already has directions? In a sense, the Internet is an external hard drive for our memories.
Commercial Applications of Big Data
Accordingly, we have a lot of data, with more coming. We aren’t just interested in the data; we are looking at data analysis, and we want to learn something valuable we didn’t already know. For example, UPS must decide on a delivery route for packages to save time and gas. Consider 20 drop-off points: Which route is the best? Seems simple enough, but checking all possible routes isn’t that easy. You have 20 choices for the first stop, 19 for the second, and so forth. In all, there are about 2×10 to the 18th power possible routes. How big is that number? That’s about five times the estimated age of the universe, measured in seconds. Clearly, we aren’t checking that number of combinations on a computer each time a driver needs a route. Keep in mind, that’s only 20 stops.
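The route count above is 20 × 19 × … × 1, or 20 factorial. Here is a minimal sketch of that arithmetic. It assumes an age of the universe of about 13.8 billion years and reads the comparison as a count of seconds; both are assumptions made for illustration, not figures from the lecture.

```python
import math

stops = 20

# 20 choices for the first stop, 19 for the second, and so on: 20!
routes = math.factorial(stops)

# Assumption: the universe is roughly 13.8 billion years old
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
universe_age_seconds = 13.8e9 * SECONDS_PER_YEAR

ratio = routes / universe_age_seconds

print(f"{routes:,} possible routes")
print(f"roughly {ratio:.1f} times the age of the universe in seconds")
```

Adding a single stop multiplies the count by 21, which is why checking every route stops being practical so quickly.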
UPS has about 55,000 drivers on the road every day. Until recently, UPS drivers had a general route to follow. It allowed for decisions on the part of the driver. UPS now has a program called ORION, or On-Road Integrated Optimization and Navigation, to help. It uses math to decide on routes. They can be counterintuitive but save time in the end. It doesn’t find the best route, but a lot of research has been done to find good solutions to this problem. Keep in mind, UPS has a harder problem than simply finding a route to save time. They also must consider other variables like promised delivery times. How much can this save? Consider these two numbers: Thirty million dollars, the cost to UPS per year if each driver drives just one more mile each day than necessary. Or eighty-five million: The number of miles the analytics tools of UPS save per year.
Data analysis doesn’t always involve exploring a data set that is given. Sometimes, questions arise and data hasn’t even been gathered. Then, the key is knowing what question to ask, and what data to collect.
As an example, let’s join Oren Etzioni on a flight from Seattle to Los Angeles for his younger brother’s wedding. Wanting to save money, Oren bought his ticket months before the “I dos” were said. During the flight, Oren asked neighboring passengers about their ticket price. Most had paid less, even though many had bought their tickets later. For some of us, this might simply tell us not to worry so much about buying a ticket close to the date of a flight. But Oren was Harvard’s first undergraduate to major in computer science. He graduated in 1986. To him, this was a problem for a computer to solve. He’d seen the world this way before. He helped build MetaCrawler, which was one of the first search engines, bought by InfoSpace. He made a comparison-shopping website, also snatched up. Another startup was bought by Reuters.
So, Oren gathered 12,000 price observations grabbed by his computer programs from a travel website over 41 days. He ended up with something that could save customers money, and not just by comparing current prices. It didn’t know why airlines were pricing the way they did, but it could help predict whether fares were more likely to go up or down in the near future. When it became a venture capital-backed startup called Farecast, it began crunching 200 billion flight-price records. Then? Microsoft bought it in 2008, for $110 million, and integrated it into the Bing search engine. What made it possible to predict future fares? Data, and lots of it. How big and what’s big enough depends, in part, on what you are asking and how much data you can handle. Then, you must consider how you can approach the question.
UPS can’t look for the optimal answer, but they can save millions of dollars finding much better answers. They can do this by asking questions only answerable with the data that is streaming in and available in today’s data explosion.
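To illustrate the difference between the best route and a good one, here is a minimal sketch of a simple rule of thumb: from each stop, always drive to the nearest stop not yet visited. This nearest-neighbor heuristic is a textbook technique chosen purely for illustration. It is not a description of how ORION works, and the stop coordinates are random, made-up points rather than real delivery data.

```python
import math
import random


def route_length(points, order):
    """Length of a round trip that visits points in the given order."""
    legs = zip(order, order[1:] + order[:1])  # includes the return leg
    return sum(math.dist(points[a], points[b]) for a, b in legs)


def nearest_neighbor_route(points):
    """Start at point 0 (the depot) and always go to the closest unvisited stop."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        here = points[order[-1]]
        nearest = min(unvisited, key=lambda i: math.dist(here, points[i]))
        unvisited.remove(nearest)
        order.append(nearest)
    return order


# Hypothetical data: a depot plus 20 stops scattered at random
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(21)]

arbitrary = list(range(len(points)))  # visit stops in the order listed
greedy = nearest_neighbor_route(points)

print(f"arbitrary order: {route_length(points, arbitrary):.2f}")
print(f"nearest neighbor: {route_length(points, greedy):.2f}")
```

The heuristic looks at only a few hundred distances rather than quintillions of routes. Its answer is typically much shorter than an arbitrary ordering, but it carries no guarantee of being the shortest route possible.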
Common Questions About Big Data
An example of Big Data is an aggregation of millions of personal records, amounting to petabytes of data, with each record containing multiple pieces of information about a person’s identity.
In commerce, Big Data is largely used to understand people’s behavior in an effort to sell to them more effectively.
Big Data is sometimes described by four V’s rather than three. The four V’s are velocity, veracity, variety, and volume.
Big Data comes in many types; however, at the most basic level, it’s either structured, unstructured, or semi-structured. This categorization determines how much work must go into understanding and using it.