Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop

In the second installment of our San Francisco series of engineer-to-engineer lectures, Jeff Hammerbacher described the challenges of building data-intensive, distributed applications and how using Hadoop saved the day at Facebook.  Speaking to an audience of approximately thirty Hadoop experts and enthusiasts hailing from all around the Bay Area, the Valley, and even Seattle, he also discussed what’s wrong with today’s analytical platforms and what will shape the platform of the future.

And Jeff should know.  After studying mathematics at Harvard and wearing a suit as a quantitative analyst on Wall Street, he conceived, built, and led the data team at Facebook.  He then went on to start Cloudera, the leader in commercializing Apache Hadoop, where he currently works as Chief Scientist and VP of Products.  Jeff also served as contributing editor of the book Beautiful Data: The Stories Behind Elegant Data Solutions, the proceeds of which are split between Creative Commons and Sunlight Labs.

The Scoop on Hadoop

Hadoop is an open source framework that enables data-intensive distributed applications to efficiently process gigantic amounts of data.  It is an open source implementation of MapReduce, the data-processing approach invented at Google to handle the massive quantities of data required to index the web.  The system has two main components: the Hadoop Distributed File System (HDFS), which stores and maintains data across many machines, and the MapReduce engine, which processes that data.
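To make the model concrete, here is a toy, single-process sketch of the classic MapReduce word count, written in plain Python rather than against Hadoop's actual Java API.  The map phase emits (key, value) pairs, the framework sorts and groups them by key (the "shuffle"), and the reduce phase aggregates each group:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce step: group the pairs by key, then sum each group.
    In real Hadoop the sort/group happens across the cluster, not in memory."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The appeal of the model is that map and reduce are both embarrassingly parallel: the framework can run the map step on whichever machines hold the input blocks and the reduce step on any machine that receives a key's group.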

But the talk didn’t really go into Hadoop internals — as Jeff pointed out, the documentation is readily available online.  Rather, the talk was about how and why Hadoop will provide the foundation on which the next generation platform for analytics will be built.  Making bold predictions about technology is hard.  Jeff quoted Larry Ellison’s quip that “the computer industry is the only industry that is more fashion-driven than women’s fashion.”  And yet, using real-world examples from his experience at Facebook, Jeff made a compelling case.

Bottlenecks, Costs, the Black Box, and the Kitchen Sink

A typical architecture for large-scale data analysis includes a data source, a data warehouse, ETL (“extract, transform, load”: the step that moves data out of and into RDBMSs and converts source data into the warehouse’s format), and business intelligence and analytics systems – all of which are usually centered around relational databases.  However, Jeff stressed that a relational database is a specialty and not a foundation, arguing that the abstractions relational databases provide are no longer sufficient on their own for analytical data management.

One reason is that over the past few years, there has been an explosion in data volume primarily originating from machine-generated logs.  By simply tweaking an Apache log, you can grow your data volume and complexity by several orders of magnitude.  As we’ll see in Facebook’s case, their relational database approach simply didn’t scale and they soon needed new tools to handle the load.
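A quick back-of-envelope calculation shows how a small logging change can swamp a warehouse.  The numbers below are purely illustrative (not from the talk), but the mechanism matches what happened at Facebook when impression logging was turned on: switching from one log line per page view to one line per item shown multiplies volume by the number of items on a page:

```python
# Illustrative, made-up numbers: the point is the multiplier, not the totals.
pages_per_day = 100_000_000   # hypothetical page views per day
bytes_per_line = 200          # a typical enriched Apache log line
items_per_page = 100          # ads/stories whose impressions now get logged

page_log_gb = pages_per_day * bytes_per_line / 1e9
item_log_gb = pages_per_day * items_per_page * bytes_per_line / 1e9
print(page_log_gb, item_log_gb)  # 20.0 2000.0
```

One configuration change turns a 20 GB/day feed into a 2 TB/day feed — two orders of magnitude, before anyone has written a single new feature.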

Another point Jeff made was that the percentage of data that actually gets stored in a relational database is shrinking.  What do you do with all the unstructured data (accounting for 95% of the digital universe) that doesn’t necessarily make sense to persist relationally?  Do you still need expensive relational data warehouses and proprietary boutique servers?  Jeff’s team at Facebook made a bet on commodity hardware which turned out to be a good move, ultimately pushing the complexity out of hardware and into the software layer.

They also bet on open source data stores built by consumer web firms, arguing that web properties have the most representative problems: scalability and unstructured data management.  Jeff stated that most production-quality data stores came from enterprise software firms in the mid-1990s, but now a growing percentage of the world’s data is persisted in open source data stores.  He also mentioned that a nice side-effect of adopting open source solutions is that it’s much better to have a modular collection of open tools rather than an opaque abstraction.  Why?  Because there’s great benefit in being able to pick and choose solutions and understand what’s going on under the hood.

Jeff noted that another problem is that, in many cases, enterprise software does not serve developers well.  Many relational data warehouses expose only SQL; but to get real traction and adoption from developers, you need more than that — you need open applications for analysis, not just a SQL interface.  He feels that “in addition, these data stores often expose a proprietary interface for application programming (e.g. PL/SQL or T-SQL), but not the full power of procedural programming.  More programmer-friendly parallel dataflow languages await discovery, I think.  MapReduce is one (small) step in that direction.”

Where is this new platform going to come from?  Any new platform must be centered around addressing these new user needs, which is hard to achieve by re-implementing an old spec in a new, clever way.  He cautioned that implementing a new, successful cut of the ANSI SQL spec would be a real undertaking.  Not only would it take ages before you had anything to show, but it would likely suffer the same scalability problems of previous implementations.

Facebook and Hadoop are Now Friends

Using Facebook as a real-world example, Jeff described the challenge of measuring how changes to the site improved or impaired user experience.  Their original data analysis system featured source data living on a horizontally partitioned MySQL tier and a cron job running Python scripts that pinged stats back to a central MySQL database.  The main problem with this setup was that it made intensive historical analysis difficult since the source data was spread over many machines and aggregating the data to the analytics database was a slow, inefficient process.  Plus, when it barfed, it took three days to replay the edit logs in order to diagnose the problem.
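The original pipeline can be sketched as follows.  This is a hypothetical reconstruction, not Facebook's actual code: the MySQL shards are simulated as plain dicts, and the nightly cron job is a single sequential pass that sums per-shard stats into central totals:

```python
# Hypothetical sketch of the original pipeline: a cron-driven script walks
# every MySQL shard, computes stats, and writes totals to a central stats DB.
# Shards are simulated here as in-memory dicts; metric names are made up.
shards = [
    {"daily_active_users": 120, "photo_uploads": 40},
    {"daily_active_users": 95,  "photo_uploads": 22},
    {"daily_active_users": 210, "photo_uploads": 88},
]

def aggregate(shards):
    """One sequential pass over every shard per run -- the serial collection
    step that made intensive historical analysis slow as shard counts grew."""
    totals = {}
    for shard in shards:
        for metric, value in shard.items():
            totals[metric] = totals.get(metric, 0) + value
    return totals

totals = aggregate(shards)
print(totals)  # {'daily_active_users': 425, 'photo_uploads': 150}
```

The structural weakness is visible even in the toy: all the work funnels through one collector, so throughput is bounded by a single machine no matter how many shards exist.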

So Facebook hired a data warehouse engineer to build a 10TB Oracle warehouse.  This worked for a bit and would’ve been fine for small and medium-sized businesses, but ultimately didn’t scale — particularly when they turned on impression logging which generated over 400GB of data on the first day!  This quickly grew to 1TB of data per day in 2007.

You might suggest that since disks are cheap, why not throw more storage at the warehouse?  It turns out that, in addition to the problem of data volume, there was also a CPU bottleneck: the ETL process ended up taking more than a day to aggregate, import, and load the necessary data for analysis.  Jeff went on to explain that proprietary ETL vendors have lots of downsides and generally don’t scale well to large sets of databases (on the order of thousands, in Facebook’s case).  In addition, when “warts” start to show up in proprietary products, the closed nature of the software prevents developers from tinkering with the source to diagnose and resolve problems.

Meanwhile, his team started to play with Hadoop on the side as an open source alternative, and eventually brought up a Hadoop cluster to replace the data collection and processing tiers.  The new architecture still has multiple data sources (log files, MySQL), but they now feed into HDFS instead.  Work is done via MapReduce, and the resulting artifacts are published to Oracle RAC servers for consumption by business intelligence and analytics tools; results are simultaneously published back to the MySQL tier.
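The revised dataflow can be sketched as a simple fan-out pipeline.  Again this is an illustrative reconstruction, with the storage systems simulated as in-memory objects and the MapReduce stage reduced to a count-by-key:

```python
# Hypothetical sketch of the revised dataflow: raw events land in HDFS,
# a MapReduce job summarizes them, and the summaries fan out to both the
# Oracle RAC warehouse and the serving MySQL tier. All names/structures
# here are stand-ins, not real Facebook components.
hdfs = []          # stands in for the HDFS landing zone
oracle_rac = {}    # stands in for the BI/analytics warehouse
mysql_tier = {}    # stands in for the serving databases

def ingest(events):
    """Log files and MySQL extracts are copied into HDFS unmodified;
    transformation happens later, after persistence."""
    hdfs.extend(events)

def summarize():
    """The MapReduce stage, reduced here to a count of events by type."""
    counts = {}
    for event in hdfs:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts

def publish(summary):
    oracle_rac.update(summary)   # artifacts for BI and analysis
    mysql_tier.update(summary)   # results fed back to the serving tier

ingest([{"type": "impression"}, {"type": "click"}, {"type": "impression"}])
publish(summarize())
print(oracle_rac["impression"])  # 2
```

Note the ordering: raw data is persisted first and transformed afterward, which is the "ETL after persistence" property credited below with the latency win.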

Data Flow Architecture at Facebook

From “Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop”, slide 21

Initially, this shift was met with a lot of resistance mostly because Hadoop is Java-based and, since the majority of Facebook’s services were written in C++, the developers there weren’t comfortable in Java. But it wasn’t long before the new platform showed its strengths:

  • Switching to this system greatly reduced latency because the ETL process is no longer done in flight – it’s done after persistence in Hadoop.
  • Hadoop enabled Facebook to efficiently crunch extremely large data sets on the order of multi-petabytes, previously impractical under the old system.
  • The Hadoop data warehouse became easily accessible to developers which turned out to be a real bonus.  Developers previously found SQL to be an unfriendly environment because they couldn’t predict the impact of running SQL (it was easy for them to hose themselves and others) and because the dev environment for SQL was crude.  After switching, however, they found that a lot of Facebook’s developers started freely playing with the data set which fostered innovation and led to new features.

Shaping a New Platform

Jeff emphasized that while Hadoop provides a great foundation for data analysis, it’s not the whole story.  Today, there are many technologies built on top of Hadoop that need to be considered for your system — Hive, a system for offline analysis, and HBase, an open source implementation of Google’s BigTable, to name a couple.  He remarked that the abstraction layer needs to be redrawn to include the functionality provided by ETL, master data management (MDM), stream management, reporting, online analytical processing (OLAP), and search tools, all with a unified UI.

Jeff explained that SQL Server 2008 R2 is a good model.  SQL Server is no longer just a database – there are a bunch of associated products in the box offering a full suite of features.  You still have the old features like SQL Server Integration Services (ETL), SQL Server (data warehouse), SQL Server Reporting Services, SQL Server Analysis Services, and full-text search.  But now you also get a bunch of new features such as stream management (StreamInsight) providing real-time analytics, OLAP (PowerPivot) enabling rapid navigation of subsets of data, collaboration via SharePoint, MDM for integrating disparate data sources and entity resolution, and features that aid in scaling your servers out to a many-node SQL solution.  Jeff remarked that it’s “kind of scary that Microsoft has started to do a lot right within the last 5 years.”

Providing a full suite of features is also what Cloudera does well, but for Hadoop.  They’re not the primary developers of this stuff (currently only 3 out of 17 contributors on HDFS), but they do an excellent job at packaging and polishing Hadoop and make their money in training, services, and support.  And, like Microsoft, they eat their own dogfood: using the tools they build to solve their own business problems.  Jeff joked that it’s “interesting being a vendor now – I can see what we put these other vendors through [while at Facebook].”

Many thanks to Jeff for the great talk, Greylock for helping with the logistics (and providing the delicious pizza and beer), and to everybody that came out!  Be sure to check out the next talk on June 10th when our own Sasha Aickin, Redfin’s head of user experience, will weigh HTML 5 vs. Native Apps.
