Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop - Redfin Real Estate News

Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop

by
Updated on October 5th, 2020

In the second installment of our San Francisco series of engineer-to-engineer lectures, Jeff Hammerbacher described the challenges of building data-intensive, distributed applications and how using Hadoop saved the day at Facebook.  Speaking to an audience of approximately thirty Hadoop experts and enthusiasts hailing from all around the Bay Area, the Valley, and even Seattle, he also discussed what’s wrong with today’s analytical platforms and what will shape the platform of the future.

And Jeff should know.  After studying Mathematics at Harvard and wearing a suit as a quantitative analyst on Wall Street, he conceived, built, and led the data team at Facebook.  He then went on to start Cloudera, the leader in commercializing Apache Hadoop, where he currently works as Chief Scientist and VP of Products.  Jeff also served as Contributing Editor for a book: Beautiful Data: The Stories Behind Elegant Data Solutions, the proceeds of which are split between Creative Commons and Sunlight Labs.

The Scoop on Hadoop

Hadoop is an open source framework that enables data-intensive distributed applications to efficiently process gigantic amounts of data.  It’s an open source implementation of the MapReduce approach to processing data.  MapReduce was invented at Google to deal with the massive quantities of data necessary to index the web.  There are two main components to the system: the Hadoop Distributed File System (HDFS) which stores and maintains data across many machines, and the MapReduce engine which processes the data.

But the talk didn’t really go into Hadoop internals — as Jeff pointed out, the documentation is readily available online.  Rather, the talk was about how and why Hadoop will provide the foundation on which the next generation platform for analytics will be built.  Making bold predictions about technology is hard.  Jeff quoted Larry Ellison’s quip that “the computer industry is the only industry that is more fashion-driven than women’s fashion.”  And yet, using real-world examples from his experience at Facebook, Jeff makes a compelling sell.

Bottlenecks, Costs, the Black Box, and the Kitchen Sink

A typical architecture for large-scale data analysis includes a data source, a data warehouse, ETL (aka: “extract-transform-load”; the step that gets data out of and into RDBMSs and converts source data to the data warehouse’s format), and business intelligence and analytics systems – all of which are usually centered around relational databases.  However, Jeff stressed that a relational database is a specialty and not a foundation, arguing that the abstractions provided by them are no longer useful on their own for analytical data management.

One reason is that over the past few years, there has been an explosion in data volume primarily originating from machine-generated logs.  By simply tweaking an Apache log, you can grow your data volume and complexity by several orders of magnitude.  As we’ll see in Facebook’s case, their relational database approach simply didn’t scale and they soon needed new tools to handle the load.

Another point Jeff made was that the percentage of data that actually gets stored in a relational database is shrinking.  What do you do with all the unstructured data (accounting for 95% of the digital universe) that doesn’t necessarily make sense to persist relationally?  Do you still need expensive relational data warehouses and proprietary boutique servers?  Jeff’s team at Facebook made a bet on commodity hardware which turned out to be a good move, ultimately pushing the complexity out of hardware and into the software layer.

They also bet on open source data stores built by consumer web firms, arguing that web properties have the most representative problems: scalability and unstructured data management.  Jeff stated that most production-quality data stores came from enterprise software firms in the mid-1990s, but now a growing percentage of the world’s data is persisted in open source data stores.  He also mentioned that a nice side-effect of adopting open source solutions is that it’s much better to have a modular collection of open tools rather than an opaque abstraction.  Why?  Because there’s great benefit in being able to pick and choose solutions and understand what’s going on under the hood.

Jeff noted that another problem is that, in many cases, enterprise software does not service developers well. Many relational data warehouses simply just expose SQL; but to get real traction/adoption from developers, you need more than that…  You need open applications for analysis, not just a SQL interface.  He feels that “in addition, these data stores often expose a proprietary interface for application programming (e.g. PL/SQL or TSQL), but not the full power of procedural programming.  More programmer-friendly parallel dataflow languages await discovery, I think.  MapReduce is one (small) step in that direction.”

Where is this new platform going to come from?  Any new platform must be centered around addressing these new user needs, which is hard to achieve by re-implementing an old spec in a new, clever way.  He cautioned that implementing a new, successful cut of the ANSI SQL spec would be a real undertaking.  Not only would it take ages before you had anything to show, but it would likely suffer the same scalability problems of previous implementations.

Facebook and Hadoop are Now Friends

Using Facebook as a real-world example, Jeff described the challenge of measuring how changes to the site improved or impaired user experience.  Their original data analysis system featured source data living on a horizontally partitioned MySQL tier and a cron job running Python scripts that pinged stats back to a central MySQL database.  The main problem with this setup was that it made intensive historical analysis difficult since the source data was spread over many machines and aggregating the data to the analytics database was a slow, inefficient process.  Plus, when it barfed, it took three days to replay the edit logs in order to diagnose the problem.

So Facebook hired a data warehouse engineer to build a 10TB Oracle warehouse.  This worked for a bit and would’ve been fine for small and medium-sized businesses, but ultimately didn’t scale — particularly when they turned on impression logging which generated over 400GB of data on the first day!  This quickly grew to 1TB of data per day in 2007.

You might suggest that since disks are cheap, why not throw more storage at the warehouse?  It turns out that, in addition to the problem of data volume, there was also a bottlenecking CPU utilization problem. The ETL process ended up taking more than a day to aggregate, import, and load the necessary data for analysis.  Jeff went on to explain that proprietary ETL vendors have lots of downsides and generally don’t scale well for large sets of databases (on the order of thousands, in Facebook’s case).  In addition, when “warts” start to show up in proprietary vendors, the closed nature of the software prohibits developers from tinkering with the source to diagnose and resolve problems.

Meanwhile, his team started to play with Hadoop on the side as an open source alternative.  They got a Hadoop cluster to replace the data collection and processing tiers.  So the new architecture still has multiple data sources (log files, MySQL) but is now fed into HDFS instead.  Work is done via MapReduce and the artifacts are then published to Oracle RAC servers for consumption by business intelligence and analysis.  It also simultaneously publishes results back to the MySQL tier.

Data Flow Architecture at Facebook

From “Facebook’s Petabyte Scale Data Warehouse using Hive and Hadoop“, slide 21

Initially, this shift was met with a lot of resistance mostly because Hadoop is Java-based and, since the majority of Facebook’s services were written in C++, the developers there weren’t comfortable in Java. But it wasn’t long before the new platform showed its strengths:

  • Switching to this system greatly reduced latency because the ETL process is no longer done in flight – it’s done after persistence in Hadoop.
  • Hadoop enabled Facebook to efficiently crunch extremely large data sets on the order of multi-petabytes, previously impractical under the old system.
  • The Hadoop data warehouse became easily accessible to developers which turned out to be a real bonus.  Developers previously found SQL to be an unfriendly environment because they couldn’t predict the impact of running SQL (it was easy for them to hose themselves and others) and because the dev environment for SQL was crude.  After switching, however, they found that a lot of Facebook’s developers started freely playing with the data set which fostered innovation and led to new features.

Shaping a New Platform

Jeff emphasized that while Hadoop provides a great foundation for data analysis, it’s not the whole story.  Today, there are many technologies built on top of Hadoop that need to be considered for your system.  For example: there is Hive, a system for offline analysis; there is HBase, an open source implementation of Google’s BigTable to name a couple.  He remarked that the abstraction layer needs to be redrawn to include the functionality provided by ETL, master data management (MDM), stream management, reporting, online analytical processing (OLAP), and search tools; all with a unified UI.

Jeff explained that SQL Server 2008 R2 is a good model.  SQL Server is no longer just a database – there are a bunch of associated products in the box offering a full suite of features.  You still have the old features like SQL Server Integration Services (ETL), SQL Server (data warehouse), SQL Server Reporting Services, SQL Server Analysis Services, and full-text search.  But now you also get a bunch of new features such as stream management (StreamInsight) providing real-time analytics, OLAP (PowerPivot) enabling rapid navigation of subsets of data, collaboration via SharePoint, MDM for integrating disparate data sources and entity resolution, and features that aid in scaling your servers out to a many-node SQL solution.  Jeff remarked that it’s “kind of scary that Microsoft has started to do a lot right within the last 5 years.”

Providing a full suite of features is also what Cloudera does well, but for Hadoop.  They’re not the primary developers of this stuff (currently only 3 out of 17 contributors on HDFS), but they do an excellent job at packaging and polishing Hadoop and make their money in training, services, and support.  And, like Microsoft, they eat their own dogfood: using the tools they build to solve their own business problems.  Jeff joked that it’s “interesting being a vendor now – I can see what we put these other vendors through [while at Facebook].”

Many thanks to Jeff for the great talk, Greylock for helping with the logistics (and providing the delicious pizza and beer), and to everybody that came out!  Be sure to check out the next talk on June 10th when our own Sasha Aickin, Redfin’s head of user experience, will weigh HTML 5 vs. Native Apps.

115 thoughts on “Engineer-to-Engineer: Evolving a New Analytical Platform with Hadoop”

  1. Pingback: how to buy viagra online in australia

  2. Pingback: i need help writing an essay

  3. Pingback: business school essay writing service

  4. Pingback: essay writing service best

  5. Pingback: sat essay writing help

  6. Pingback: pay for essay writing

  7. Pingback: college essay writing service reviews

  8. Pingback: buy essay papers online

  9. Pingback: best essays writing service

  10. Pingback: help me write my college essay

  11. Pingback: custom essay writing online

  12. Pingback: best mba essay writing service

  13. Pingback: optum rx pharmacy help desk

  14. Pingback: how much does viagra cost in a pharmacy

  15. Pingback: european pharmacy org buy strattera online

  16. Pingback: differin gel online pharmacy

  17. Pingback: hydrocodone pharmacy online no prescription

  18. Pingback: cialis manufacturer coupon lilly

  19. Pingback: generic viagra india pharmacy

  20. Pingback: tadalafil cialis bestprice

  21. Pingback: cialis super active

  22. Pingback: cialis manufacturer

  23. Pingback: selling cialis in us

  24. Pingback: pharmacy online india

  25. Pingback: wedgewood pharmacy prednisolone

  26. Pingback: canadian pharmacy no prescription generic cialis

  27. Pingback: viagra online without prescription usa

  28. Pingback: long term effects of cialis

  29. Pingback: can you buy generic viagra over the counter in canada

  30. Pingback: legitimate online pharmacy usa

  31. Pingback: what is tadalafil used for

  32. Pingback: purchase generic viagra in canada

  33. Pingback: cheap generic viagra in canada

  34. Pingback: buy generic viagra canada online

  35. Pingback: how to get viagra for women

  36. Pingback: cheapest viagra in united states

  37. Pingback: buy viagra tablet online india

  38. Pingback: cialis leg pain

  39. Pingback: how quickly does cialis work

  40. Pingback: tadalafil 20mg side effects

  41. Pingback: cialis profesional vs. super active vs. cialis

  42. Pingback: sulfamethoxazole-tmp da 800-160

  43. Pingback: gabapentin mirtazapin

  44. Pingback: metronidazole germany

  45. Pingback: tamoxifen apotek

  46. Pingback: valtrex paris

  47. Pingback: pregabalin是什么药

  48. Pingback: lasix efeitos

  49. Pingback: lisinopril sneezing

  50. Pingback: metformin recommendations

  51. Pingback: qu'est-ce que rybelsus

  52. Pingback: semaglutide weight loss

  53. Pingback: flagyl faq

  54. Pingback: zoloft headaches go away

  55. Pingback: lexapro increased anxiety

  56. Pingback: belgium drivers license

  57. Pingback: keflex what is it used for

  58. Pingback: cat fluoxetine dose

  59. Pingback: woodland hills personal injury attorney

  60. Pingback: buy sildenafil 20 mg tablets

  61. Pingback: สล็อตวอเลท

  62. Pingback: ciprofloxacin coupon

  63. Pingback: glue sniffer strain

  64. Pingback: what is cephalexin for

  65. Pingback: bactrim generic

  66. Pingback: bactrim dosage 800/160 for uti

  67. Pingback: สมัคร สล็อตออนไลน์

  68. Pingback: คอมพิวเตอร์

  69. Pingback: bacon999

  70. Pingback: 티비위키

  71. Pingback: https://mostbetcasino.pt/jogo-aviator/

  72. Pingback: พูลวิลล่าพัทยา

  73. Pingback: พอตไฟฟ้า

  74. Pingback: amoxicillin for strep

  75. Pingback: cozaar uses

  76. Pingback: switching from citalopram to lexapro

  77. Pingback: [augmentin].

  78. Pingback: when to take contrave

  79. Pingback: diclofenac otc

  80. Pingback: flexeril 10mg high

  81. Pingback: aripiprazole reviews

  82. Pingback: another name for amitriptyline

  83. Pingback: allopurinol alternatives

  84. Pingback: faceless youtube niches

  85. Pingback: celexa nausea

  86. Pingback: what is bupropion

  87. Pingback: best ashwagandha brand reddit

  88. Pingback: does augmentin treat strep throat

  89. Pingback: how long does it take for celebrex to take effect

  90. Pingback: how long does it take for baclofen to kick in

  91. Pingback: ruger bolt action rifle

  92. Pingback: ติดตั้งโซลาเซลล์

  93. Pingback: adhd virginia

  94. Pingback: abilify long-term side effects

  95. Pingback: remeron and anxiety

  96. Pingback: repaglinide chemical name

  97. Pingback: semaglutide for prediabetes

  98. Pingback: protonix ingredients

  99. Pingback: diltiazem side effects hair loss

  100. Pingback: effexor vs cymbalta

  101. Pingback: acarbose cirrhosis

  102. Pingback: actos mecanicos

  103. Pingback: ezetimibe stroke

  104. Pingback: robaxin interactions

  105. Pingback: ข่าวบอล

  106. Pingback: Hdtoday

  107. Pingback: ivermectin online

  108. Pingback: venlafaxine side effects sexually

  109. Pingback: spironolactone breast cancer

  110. Pingback: ks quik

  111. Pingback: warnings for voltaren

  112. Pingback: synthroid migraine

  113. Pingback: tizanidine vs robaxin

  114. Pingback: xelevia 100mg sitagliptin

  115. Pingback: tamsulosin for bladder

Leave a Comment

Your email address will not be published. Required fields are marked *

Be the first to see the latest real estate news:

  • This field is for validation purposes and should be left unchanged.

By submitting your email you agree to Redfin’s Terms of Use and Privacy Policy

Scroll to Top