Mastering Apache Spark Paperback – Import, 30 Sep 2015
About the Author
Mike Frampton is an IT contractor, blogger, and IT author with a keen interest in new technology and big data. He has worked in the IT industry since 1990 in a range of roles (tester, developer, support, and author). He has also worked in many other sectors (energy, banking, telecoms, and insurance). He now lives by the beach in Paraparaumu, New Zealand, with his wife and teenage son. Being married to a Thai national, he divides his time between Paraparaumu and their house in Roi Et, Thailand, between writing and IT consulting. He is always keen to hear about new ideas and technologies in the areas of big data, AI, IT and hardware, so look him up on LinkedIn (http://linkedin.com/profile/view?id=73219349) or his website (http://www.semtech-solutions.co.nz/#!/pageHome) to ask questions or just to say hi.
Most helpful customer reviews on Amazon.com
1) Spark is huge
2) This book is not
3) Most pages are covered w/ Screen Shots, code listings and output. Diagrams, though potentially more useful, abound as well. How many times do you need to println the same output message, and show them all?
Basically, I'll be very surprised if I get half my money's worth here....
"Spark Cookbook" by Yadav, Packt, £30
"Apache Spark 2 for Beginners" by Thottuvaikkatumana, Packt, £28
"Advanced Analytics with Spark" by Ryza et al., O'Reilly, £23
"Spark in Action" by Zecevic and Bonaci, Manning, £25
"Spark for Python Developers" by Nandi, Packt, £26
"Mastering Apache Spark" by Frampton, Packt, £35
(Before that, I installed Spark on my Windows PC, following an extremely useful walk-through from Shantanu Sharma - google "Installing Spark on Windows 10").
There are two main findings: 1. None of the books is great, or sufficient. 2. The two non-Packt books are good; the four Packt books can be discarded. The weak books all suffer from the same problem: being hastily and poorly written, and unable to teach anything, they go for breadth when they cannot provide depth, and run around the different corners of Spark and (this book especially) related software products/projects. You see a listing here and a listing there, but realize that you are missing the foundations, and those basic examples are anyway not difficult to google. It's a pity; there may well be original and valuable content between the covers, but I am not going to stick around to find out. "Advanced Analytics with Spark", supplemented/followed by "Spark in Action", looks like the best available course of action.
I have written a detailed chapter-by-chapter review of this book on www DOT i-programmer DOT info, the first and last parts of this review are given here. For my review of all chapters, search i-programmer DOT info for STIRK together with the book's title.
This book aims to provide a practical discussion of Spark and its major components. How does it fare?
Spark is an increasingly popular Big Data technology, generally performing processing much faster than traditional MapReduce jobs.
This book is for anyone who wants to know more about Spark. In particular, the basic Spark components are discussed, and then Spark is extended with some of the more experimental components.
The book assumes a basic knowledge of Linux, Hadoop, Spark, SBT, and a reasonable knowledge of Scala. The author suggests using the internet to fill any gaps in your prerequisite knowledge.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Apache Spark
The chapter opens with an overview of Spark as a distributed, scalable, in-memory, parallel processing data analytics system. Spark can be programmed in various languages, including Java, Python, and Scala. The examples in this book use Scala.
The chapter outlines the 4 major Spark components (i.e. Machine Learning, Streaming, SQL, and Graph processing), cloud integration, and the future of Spark. Cluster design is briefly examined. It’s noted that Spark doesn’t have its own storage system, so Hadoop is often used; this has the advantage that Spark can become another component in the Hadoop toolset.
The chapter continues with a look at cluster management, and configuring the Spark cluster. Useful discussions and diagrams explaining the Spark master, worker nodes, client nodes and Spark context are provided. This is followed by a section that examines cluster management running as: local, standalone, using YARN, using Mesos, and using Amazon’s Elastic Compute Cloud (EC2).
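As a rough illustration of the cluster manager choices listed above, the sketch below builds the master URL string that Spark accepts for each mode. It is plain Python with no Spark dependency and is not taken from the book. The function name `master_url` and the example host name are hypothetical, the ports are the usual defaults, and the YARN value follows the `yarn` form used since Spark 2.0 (Spark 1.x used `yarn-client` and `yarn-cluster`). EC2 is left out on the assumption that it is a place to host a cluster rather than a master URL scheme of its own.

```python
from typing import Optional

# Usual default ports; both are assumptions that a real cluster may override.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(mode: str, host: Optional[str] = None, cores: str = "*") -> str:
    """Return a Spark master URL string for the given cluster manager mode.

    Hypothetical helper for illustration only; it does not talk to Spark.
    """
    mode = mode.lower()
    if mode == "local":
        # local[N] runs Spark in one JVM with N worker threads; * means all cores.
        return f"local[{cores}]"
    if mode == "yarn":
        # The resource manager address comes from the Hadoop configuration, not the URL.
        return "yarn"
    if mode in DEFAULT_PORTS:
        if host is None:
            raise ValueError(f"{mode} mode needs a master host")
        scheme = "spark" if mode == "standalone" else "mesos"
        return f"{scheme}://{host}:{DEFAULT_PORTS[mode]}"
    raise ValueError(f"unknown cluster manager mode: {mode}")
```

For example, `master_url("standalone", host="master.example.com")` returns `spark://master.example.com:7077`, which is the kind of value normally passed to `spark-submit --master`.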
Next, performance is briefly examined. Topics include: cluster structure (cloud or shared boxes are often slower), putting applications on their own separate nodes, allocating sufficient memory, and filtering data early in the ETL process.
The chapter ends with a look at the cloud. It’s suggested this is the future direction of technology, with Spark as a service. Various providers are briefly discussed (e.g. Databricks and Google Cloud).
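To make the "filter early" advice concrete, here is a minimal sketch in plain Python. It is not Spark code and not taken from the book; the function names and the sample records are invented for illustration. It only shows the principle: a costly per-record step does less work when unwanted records are dropped before it runs, and the final result is the same.

```python
def normalise(record, stats):
    """Stand-in for a costly per-record transformation; counts its own calls."""
    stats["calls"] += 1
    return {**record, "city": record["city"].strip().title()}


def filter_late(records, country, stats):
    # Transform every record, then discard the ones that are not wanted.
    transformed = (normalise(r, stats) for r in records)
    return [r for r in transformed if r["country"] == country]


def filter_early(records, country, stats):
    # Discard unwanted records first, so the costly step sees less data.
    return [normalise(r, stats) for r in records if r["country"] == country]


# Invented sample data.
records = [
    {"country": "NZ", "city": " auckland "},
    {"country": "TH", "city": "bangkok"},
    {"country": "NZ", "city": "wellington"},
    {"country": "UK", "city": "london"},
]

late_stats, early_stats = {"calls": 0}, {"calls": 0}
late = filter_late(records, "NZ", late_stats)
early = filter_early(records, "NZ", early_stats)

print(late == early)                              # True: same result either way
print(late_stats["calls"], early_stats["calls"])  # 4 2: transformation calls, late vs early
```

In a real Spark job the saving would mostly come from moving less data between nodes, which this single-process sketch does not model.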
This chapter provides a helpful overview of what Spark is, its major components, its various cluster managers, Spark architecture, and its future. Subsequent chapters expand on the major Spark components, and discuss its promising future in the cloud.
Useful discussions, diagrams, configuration settings, practical example code, website links, inter-chapter links are given throughout. These traits apply to the whole of the book.
This book has well-written discussions, helpful examples, diagrams, website links, inter-chapter links, and useful chapter summaries. It contains plenty of step-by-step code walkthroughs, to help you understand the subject matter.
The book describes Spark’s major components (i.e. Machine Learning, Streaming, SQL, and Graph processing), each with practical code examples. Some of the template code could form the basis of your own application code.
Several of the core Spark components are extended using less well-known components, many of which are still works in progress. I’m not sure how many readers will find these chapters/sections useful, since they often involve workarounds, and the components might not exist or might be superseded later. They can also distract from the book’s core. That said, if you enjoy working at the bleeding edge of technology, you’ll enjoy what these extensions add.
Although the book assumes some knowledge of Spark, for completeness, it might have been useful to have some introduction to it (e.g. explain RDDs, introduce the spark-shell etc). Developers coming from a Windows environment might struggle initially understanding Linux, SBT, JARs etc.
Despite these concerns, I enjoyed this book; it contains plenty of useful detail. Spark is a rapidly changing technology, so check the Spark website for the latest changes. The book is highly recommended.
A huge positive for this book is that it not only talks about Spark itself, but also covers using Spark with other big data technologies like Hadoop, Kafka, Titan, Neo4j, HBase, Cassandra, H2O, etc. More on this below.
True to its name, the book covers more than simple introductory Spark topics, but it concentrates on breadth rather than depth. There is decent coverage and enough code examples for each topic, but what it lacks is depth. There are no "best practices", "performance" or "watch out for" type discussions, nor any advanced code.
The MLlib chapter covers Naive Bayes, K-Means and Artificial Neural Networks (ANN). For each algorithm, the theory is very briefly introduced and then the chapter jumps right into detailed code walkthroughs.
The Spark Streaming chapter introduces Streaming briefly and jumps straight into using different streaming sources and code walkthroughs of how to use them. This chapter covers TCP streams, File streams, Flume and Kafka sources.
By now the pattern of the chapters should be evident. The next chapter, on Spark SQL, follows the same format, covering different data sources such as text, JSON, Parquet and Hive, along with DataFrame/SQL code examples.
GraphX is covered in the next two chapters. Integration of GraphX with Neo4j and Titan (both HBase and Cassandra backed store) is covered extensively.
Finally, H2O integration and the Databricks hosted Spark offering are discussed.
I would definitely recommend this as the second Spark book after any Introductory Spark book (or Spark Documentation).