INTRODUCTION
A new phenomenon is turning the world of technology upside down: “Big Data.”[1] In this digital age, our everyday electronic necessities use an increasing amount of data. Multiply this by all of our spouses, colleagues, and neighbors, and the result is a really big stack of data. Multiply it further by all of our colleagues’ spouses and neighbors, and you get “Big Data.” This paper looks at the emerging Big Data technology that is taking hold in the information technology industry. Part II briefly defines Big Data. Part III discusses the importance of Big Data. Part IV quickly reviews the various tools and technologies being used to harness Big Data resources. Part V looks at the major players and users of Big Data. Finally, Part VI concludes that Big Data represents an exciting new world for information technology.
WHAT IS BIG DATA?
New phenomena tend to have trouble defining themselves, and Big Data is no different. There are a number of definitions for Big Data. One commentator defines “big data” as datasets that are too large for the average database to process.[2] Another defines it as an enormous amount of “bits and bytes strung across constellations of databases.”[3] Taking just these two definitions, one can extrapolate that Big Data is a nebulous, esoteric body of data, one that is constantly growing, and one that offers a terrain of potential wealth as well as efficiency in business operations and higher-quality data management. This is the brave new world of Big Data.
WHY DO WE NEED BIG DATA?
The nominal answer to why we need Big Data would be a rather perfunctory “because it exists.” Data in such huge quantities becomes more and more necessary as our world becomes more and more digitally dependent. A better question, however, might be: “Why does Big Data matter?”
For starters, the amount of data involved is huge. Some commentators speak in terms of gigabytes and terabytes.[4] Others use petabytes and exabytes.[5] Still others prefer zettabytes.[6] To put this into perspective, a gigabyte is 1,000,000,000 bytes; a terabyte is 1,000,000,000,000 bytes. A petabyte is 1,000,000,000,000,000 bytes; an exabyte is 1,000,000,000,000,000,000 bytes. And a zettabyte is 1,000,000,000,000,000,000,000 bytes. These are staggering numbers, and the amount of data grows each year.
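The unit sizes above follow a simple pattern: each step up multiplies by 1,000. A minimal Python sketch, assuming the decimal (SI) definitions used in the text, makes the scale easier to compare. The names `UNITS` and `how_many` are illustrative only.

```python
# Decimal (SI) byte units, matching the figures quoted in the text.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def how_many(smaller, larger):
    """Return how many `smaller` units fit in one `larger` unit."""
    return UNITS[larger] // UNITS[smaller]

# Print each unit with thousands separators, as written in the text.
for name, size in UNITS.items():
    print(f"1 {name} = {size:,} bytes")

# One zettabyte expressed in gigabytes.
print(f"{how_many('gigabyte', 'zettabyte'):,} gigabytes per zettabyte")
```

Python integers have arbitrary precision, so the larger values are computed exactly rather than rounded.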
Management of this data tsunami is important on a number of levels. First, there is the wealth-generation aspect of Big Data and its corollary, expense reduction. Companies that are able to manage the large influx of data will be positioned to grow substantially.[7] For example, through better and more efficient use of Big Data, the healthcare industry could realize $300 billion in annual value, mostly as reductions in health-related costs.[8] Other private sector firms could increase profit margins by as much as 60 percent by analyzing the customer usage and preference data that passes through their day-to-day operations. Additionally, public sector agencies (i.e., governments) could realize billions of dollars in operational savings through more efficient data management.[9]
Another consideration is the job growth that will follow as more companies become aware of the need to manage their own Big Data systems in-house.[10] By conservative estimate, analyzing Big Data will require between 140,000 and 190,000 new positions in the next five years.[11] It will take a number of years to properly educate and train this many new data scientists, and consequently there could be a shortage of experienced professionals equipped to handle Big Data analytics.[12] This figure does not include the 1.5 million support specialists and managers who will be needed to supplement the data scientists.[13]
BIG DATA TOOLS & TECHNOLOGIES
A Big Data cottage industry has begun to take shape within the last couple of years. Below are some of the tools and technologies that are currently being used or developed to corral the burgeoning volume of Big Data.
MapReduce
MapReduce originated at Google.[14] The basic algorithm processes huge amounts of data in two steps.[15] First, in the map step, the input is processed and converted into intermediate sets of values, typically key-value tuples.[16] Second, in the reduce step, the outputs from the map step are combined to form a smaller set of tuples.[17]
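The two steps can be illustrated with the classic word-count example. The following is a single-machine Python sketch of the idea only, not Google's or Hadoop's implementation. The function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are hypothetical, and the grouping step stands in for the work a real framework does between map and reduce across many machines.

```python
from collections import defaultdict

def map_step(document):
    """Map: emit one (word, 1) tuple for every word in the input."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return (key, sum(values))

def word_count(documents):
    """Run map over every document, then reduce each group of values."""
    intermediate = []
    for document in documents:
        intermediate.extend(map_step(document))
    grouped = shuffle(intermediate)
    return dict(reduce_step(key, values) for key, values in grouped.items())

counts = word_count(["big data is big", "data is growing"])
print(counts)
```

Because each call to `map_step` depends only on its own document, and each call to `reduce_step` only on its own key, both steps can in principle be spread across many machines, which is what makes the approach suitable for very large datasets.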
Hadoop
Apache Hadoop is an open-source framework for handling Big Data and is the most popular implementation of MapReduce.[18] It can process data from multiple sources and sort it to surface user tendencies and patterns, thereby facilitating better business decisions.[19] It is multi-purpose software with an emphasis on real-time data, including location-based data sources for weather or traffic reporting, web-based or social media data, and transactional data.[20] Hadoop is currently the largest and most popular data analytics platform on the market, owing in part to its open-source foundation.[21] In line with general Big Data revenue growth, Hadoop has a projected annual growth rate of 60%.[22]
NoSQL
NoSQL refers to a class of non-relational databases that focus on unstructured or semi-structured data.[23] These databases trade the strict read-write consistency found in traditional databases for scalability and distributed processing.[24] Many NoSQL systems are open source, and various database types fall under the NoSQL classification, including key-value stores, document stores, column stores, and graph stores.[25]
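The four store types differ mainly in how a record is shaped and looked up. The sketch below models each shape with plain Python data structures. It is an illustration under simplifying assumptions, not the API of any particular NoSQL product, and all of the sample data and the `neighbors` helper are hypothetical.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"session:42": "logged_in"}

# Document store: each record is a self-describing, possibly nested document;
# records are not required to share the same fields.
documents = [
    {"_id": 1, "name": "Ada", "tags": ["admin"]},
    {"_id": 2, "name": "Grace", "email": "grace@example.com"},
]

# Column store (wide-column style): row key -> column name -> value;
# a row stores only the columns it actually has.
columns = {
    "user:1": {"name": "Ada", "city": "London"},
    "user:2": {"name": "Grace"},
}

# Graph store: nodes connected by labeled edges.
edges = [("Ada", "follows", "Grace"), ("Grace", "follows", "Ada")]

def neighbors(node, label):
    """Return the nodes reachable from `node` over edges with this label."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]

print(key_value["session:42"])
print([doc["name"] for doc in documents])
print(columns["user:1"]["city"])
print(neighbors("Ada", "follows"))
```

None of these shapes requires a fixed schema declared in advance, which is what lets such systems accommodate semi-structured data.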
Accumulo
Accumulo is a column-oriented database from the Apache family of Big Data software. While traditional row-oriented databases can fare poorly on query performance, column-oriented databases allow heavy data compression and, thus, faster queries.[26]
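The general argument for column-oriented layouts can be sketched in a few lines of Python. This illustrates the idea in the paragraph above and is not Accumulo's storage format or API. The sample rows, `to_columns`, and `run_length_encode` are assumptions made for the example, and run-length encoding is just one simple compression scheme chosen for illustration.

```python
# Hypothetical sales records in a row-oriented layout.
rows = [
    {"id": 1, "country": "US", "amount": 10},
    {"id": 2, "country": "US", "amount": 25},
    {"id": 3, "country": "US", "amount": 5},
    {"id": 4, "country": "JP", "amount": 40},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# An aggregate query reads only the one column it needs.
total = sum(columns["amount"])

# Values within a column are similar, so they compress well.
compressed = run_length_encode(columns["country"])

print(total, compressed)
```

In the row layout, summing `amount` means reading every field of every record. In the column layout, the same query reads a single list, and repetitive columns such as `country` shrink considerably once encoded.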
Hive
Apache Hive is a Java-based bridge application that works in conjunction with a Hadoop cluster.[27] Originally developed as a proprietary tool at Facebook, it is now open source and supports Hadoop in analyzing large chunks of data.[28]
PIG
Apache PIG uses its own scripting language, Pig Latin, which has been described as Perl-like, as opposed to the more common SQL, and works in conjunction with a Hadoop cluster.[29] It was originally developed by Yahoo! and is now also open source.[30]
WibiData
WibiData is web analytics software that lets websites make the most of user data by analyzing user behavior in real time, thereby enabling a website to provide more personalized content.[31]
PLATFORA
PLATFORA is software that automatically turns users’ queries into Hadoop jobs.[32] This, in turn, allows the end user to simplify and organize Hadoop datasets more effectively.[33]
SkyTree
A machine learning and data analytics platform, SkyTree makes machine learning a viable alternative to traditional manual review or exploration of data, which, given the sheer volume of Big Data, is uneconomical or infeasible.[34]
WHO USES BIG DATA & WHY?
Judging by the customers for the above-mentioned tools and technologies, the major players in Big Data include many Silicon Valley companies. Cloudera is a major developer of Hadoop software.[35] Hadoop is used for intensive data analysis at Yahoo!, Facebook, LinkedIn, and eBay.[36] Twitter, Hulu, and IBM are also using open-source Big Data analytics frameworks to assist with user data analysis and production.[37] New job postings for Hadoop engineers from AT&T Interactive, Sears, PayPal, AOL, and Deloitte indicate a human resource investment in Big Data analytics.[38] These are only a few of the major companies, but the list is expected to grow as the need to harness Big Data for commercial purposes becomes more apparent.
CONCLUSION
Big Data represents an exciting new world for the IT industry. Specifically, Big Data will have a lasting impact on database professionals.[39] With the emergence of Big Data, it will no longer be enough to just buy a database and hook it up to a hard drive. Rather, distributed databases running across multiple servers and hard drives will be the norm.[40] Big Data will also affect other industries. Issues left unsettled include data policies, data security, privacy, and intellectual property.[41] Nevertheless, the Big Data phenomenon will remain a fascinating aspect of IT in the near future.
By Brent Yonehara. Originally published at http://yonaxis.blogspot.com/2014/02/big-data-review-of-emerging-phenomenon.html (Feb. 15, 2014).
[1] According to one commentator, the concept of Big Data originated in 1944 with Fremont Rider, the librarian of Wesleyan University, see Gil Press, A Very Short History of Big Data, Forbes, retrieved February 13, 2014 from http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/ (May 9, 2013, 9:45 AM). Another commentator, perhaps more precisely, traces the term Big Data to lunch conversations at Silicon Graphics, Inc. in the 1990s, see Francis X. Diebold, A Personal Perspective on the Origin(s) and Development of “Big Data”: The Phenomenon, the Term, and the Discipline, retrieved February 13, 2014 from http://www.ssc.upenn.edu/~fdiebold/papers/paper112/Diebold_Big_Data.pdf (Nov. 26, 2012).
[2] See James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh & Angela Hung Byers. Big data: the next frontier for innovation, competition, and productivity. McKinsey Global Institute White Paper: Executive Summary. Retrieved Nov. 18, 2013 from http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation. (May 2011), at p. 1.
[3] See Joe McKendrick, Data Integration Evolves Into a Science as Much as an Art, Database Trends & Appls. (Sept. 2011), 4-8, at p. 5.
[4] Id.
[5] See Manyika et al., supra note 2, at 4.
[6] See News. Computer Weekly. Computer Database. Document URL: http://go.galegroup.com/ps/i.do?id=GALE%7CA252643305&v=2.1&u=regis&it=r&p=CDB&sw=w&asid=94762bb8773544aefdd21d97334d96be, GDN: GALE|A252643305, (Mar. 11, 2011).
[7] See McKendrick, supra note 3, at 5; Manyika et al., supra note 2, at 2.
[8] See Manyika et al., supra note 2, at 2.
[9] Id.
[10] See Carolyn Duffy Marsan. ‘Big data’ creating big career opportunities for IT pros; new jobs expected for developers, administrators, as well as emerging ‘data scientist’ role. Network World. Retrieved November 18, 2013 from http://www.networkworld.com/news/2012/030612-big-data-careers-256939.html, (Mar. 6, 2012, 8:07 AM EST).
[11] See Manyika et al., supra note 2, at 11.
[12] Id.
[13] Id.
[14] See Big data. Wikipedia. Retrieved November 18, 2013 from http://en.wikipedia.org/wiki/Big_data. (Nov. 17, 2013).
[15] See Thoran Rodrigues, 10 emerging technologies for Big Data. TechRepublic. Retrieved from http://www.techrepublic.com/blog/big-data-analytics/10-emerging-technologies-for-big-data/. (Dec. 4, 2012) (interviewing Dr. Satwant Kaur on the 10 emerging technologies that will be the biggest Big Data technologies in the near future).
[16] Id.
[17] Id.
[18] See Nicholas Kolakowski, Microsoft Windows Azure server will leverage Apache Hadoop. eWeek. Retrieved November 18, 2013 from http://www.eweek.com/c/a/Cloud-Computing/Microsofts-Windows-Azure-Server-Will-Leverage-Apache-Hadoop-207862/ (Oct. 11, 2011).
[19] Id.
[20] See Rodrigues, supra note 15.
[21] See Matt Asay, Why proprietary Big Data technologies have no hope of competing with Hadoop. ReadWrite. Retrieved November 18, 2013 from http://readwrite.com/2013/10/28/why-proprietary-big-data-technologies-have-no-hope-of-competing-with-hadoop#awesm=~onG7TeTMleP9NP. (Oct. 28, 2013).
[22] Id.
[23] See Rodrigues, supra note 15.
[24] Id.
[25] Id.
[26] Id.
[27] Id.
[28] Id.
[29] Id.
[30] Id.
[31] Id.
[32] Id.
[33] Id.
[34] Id.
[35] See Cloudera, Inc., About Us. Retrieved November 18, 2013 from http://www.cloudera.com/content/cloudera/en/about.html. (2013).
[36] See Marsan, supra note 10.
[37] See Kolakowski, supra note 18.
[38] See Marsan, supra note 10.
[39] See McKendrick, supra note 3.
[40] See Marsan, supra note 10.
[41] See Manyika et al., supra note 2, at 11.