Spark Foreach

We all know Apache Spark is an open-source and widely used cluster computing framework, which comes with built-in features like in-memory computation, streaming APIs, machine learning libraries and graph processing algorithms. RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple nodes of the cluster. The foreach method takes a function as a parameter and applies it to every element in the collection: under the covers, all that foreach is doing is calling the iterator's foreach using the provided function, so it executes the input function on each element of an RDD. Unlike other actions, foreach does not return any value. The Spark RDD API also exposes asynchronous versions of some actions, like foreachAsync for foreach, which immediately return a FutureAction to the caller instead of blocking on completion of the action; the FutureAction can be used to manage or wait for the asynchronous execution. In Scala itself, a for loop with a yield expression is translated to a map method call on the collection, and foreach also shows up in streaming sinks, for example writing to Cassandra using foreachBatch() in Scala, which we return to later. To start a PySpark shell, run the bin\pyspark utility.

Data in Spark is distributed among its clusters and hence needs to be brought to a local session first, for example before it can be plotted. A common question illustrates this: "I read data from a distributed table system into rdd1, then call map to extract a field; when I print a member of it with foreach(println), the result differs depending on whether I call collect() first." The difference is simply where the println runs: rdd.foreach(println) prints on the executors, while rdd.collect().foreach(println) prints on the driver. To write the contents to disk instead, you can use one of the saveAs functions (still actions) from the RDD API.
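A minimal sketch, assuming the sc provided by spark-shell, of the blocking and asynchronous variants and of where the printing happens:

```scala
import scala.concurrent.Await
import scala.concurrent.duration._

val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Runs on the executors; the output lands in the executors' stdout.
rdd.foreach(x => println(s"executor saw $x"))

// collect() brings the data back first, so this println runs on the driver.
rdd.collect().foreach(x => println(s"driver saw $x"))

// Asynchronous variant: returns a FutureAction immediately instead of blocking.
val action = rdd.foreachAsync(x => println(s"async saw $x"))
Await.result(action, 60.seconds)   // wait for completion only when you need to
```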
Spark provides key capabilities in the form of Spark SQL, Spark Streaming, Spark ML and GraphX, all accessible via Java, Scala, Python and R, and sharing data in memory is 10 to 100 times faster than going through the network and disk. With the addition of lambda expressions in Java 8, Spark's API transparently supports these expressions while staying compatible with old versions of Java, and broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. We will set up Scala in IntelliJ IDEA, create a Spark application using the Scala language, and run it with local data.

The foreach operation is used to iterate over every element in a Spark RDD, and using it is best understood through examples. For instance, you can calculate an average without Spark DataFrames by using simple RDD transformations and actions: read the input with textFile(inputPath), turn each line into a number, combine the values with an aggregating action, and use foreach only for its side effects, such as printing.
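As a sketch of that RDD-only average (the input path and the one-number-per-line format are assumptions made for the example):

```scala
// Assumes the sc from spark-shell and a text file with one number per line.
val inputPath = "data/numbers.txt"                // hypothetical path
val numbers = sc.textFile(inputPath).map(_.trim.toDouble)

// aggregate() carries a (sum, count) pair across partitions.
val (sum, count) = numbers.aggregate((0.0, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1),           // fold one value into the pair
  (a, b) => (a._1 + b._1, a._2 + b._2)            // merge pairs from two partitions
)
val average = if (count > 0) sum / count else 0.0
println(s"average = $average")

// foreach is used purely for its side effect, here printing on the executors.
numbers.foreach(x => println(s"value: $x"))
```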
foreach() is an action: Spark RDD operations come in two types, transformations and actions, and the definition of foreach is that it runs a function func on each element of the dataset, purely for its side effects. Unlike most Spark functions, however, a print() inside foreach runs inside each executor, so the output goes to the executors' stdout instead of the driver stdout, which can be accessed under the Executors tab in the Spark Web UI; in cluster mode you won't see the results on the driver at all, because foreach() performs the given function inside the executors and does not return any data to the driver. Laziness matters as well: in a chain like rdd.filter(...).foreach(...), foreach only executes after filter has already run, so mutating a local found flag from inside the closure to influence the filter is meaningless.

Similar to foreach(), foreachPartition invokes the function once for each partition instead of once for each element, so the function should be able to accept an iterator. It is good for writing to a database or publishing to web services, because expensive setup can be done once per partition.

Streaming Datasets have their own rules. You cannot call ds.foreach() or ds.show() directly on a streaming Dataset; instead use ds.writeStream.foreach(...) or the console sink, and you cannot use streaming aggregations before joins, although some of these operations may be supported in future releases of Spark. ForeachWriter is the contract for a foreach writer in Spark Structured Streaming, and foreachBatch() in Scala lets you reuse batch writers, for example to write each micro-batch to Cassandra; to run such an example you need to install the appropriate Cassandra Spark connector for your Spark version as a Maven library. As of Spark 2.4 you can use joins only when the query is in Append output mode, and the Kafka 0.10 integration is similar in design to the 0.8 Direct Stream approach. An end-to-end integration with Kafka in Structured Streaming typically consumes messages, performs simple to complex windowing ETL, and pushes the desired output to various sinks such as memory, console, file, databases, or back to Kafka itself.
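A sketch of the foreachPartition pattern for database writes; the JDBC URL, credentials, and insert statement below are placeholders, not something defined earlier in this post:

```scala
import java.sql.DriverManager

// Assumes wordCounts is an RDD[(String, Int)].
wordCounts.foreachPartition { partition =>
  // One connection and one prepared statement per partition, not per element.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://localhost:5432/demo", "user", "secret")   // placeholder endpoint
  val stmt = conn.prepareStatement("INSERT INTO counts (word, n) VALUES (?, ?)")
  try {
    partition.foreach { case (word, n) =>
      stmt.setString(1, word)
      stmt.setInt(2, n)
      stmt.executeUpdate()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}
```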
The rest of this article assumes basic knowledge of Apache Spark. A transformation is a function that produces a new RDD from existing RDDs, but when we want to work with the actual dataset, an action is performed; a good example is inserting the elements of an RDD into a database with foreach. DataFrames natively support writing straight to JDBC, but if the target table has an auto-increment column (such as id), the DataFrame cannot be written directly, which is one reason to fall back to writing the rows yourself from foreachPartition. In a streaming job, any interruption of such writes introduces substantial processing delays and could lead to data loss or duplicates. Note also that with foreachBatch, Spark doesn't guarantee the same output for a given (partitionId, epochId), for example because the source provides a different number of partitions or a Spark optimization changes the number of partitions, so deduplication cannot be achieved with (partitionId, epochId), and other output modes are not yet supported; refer to SPARK-28650 for more details.

If you would like to parallelize work on multiple RDDs from the driver, you can use one of Scala's parallel collection classes, or convert an existing collection to a parallel collection, and submit the jobs from its threads. Starting with Spark 0.8 it is also possible to configure fair sharing between jobs by setting the spark.scheduler.mode property to FAIR when configuring a SparkContext.
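A small illustration of the parallel-collection approach (the RDDs are invented for the example; .par is built into Scala 2.12, while Scala 2.13 needs the separate scala-parallel-collections module):

```scala
// Assumes the sc from spark-shell.
val rdds = Seq(
  sc.parallelize(1 to 100),
  sc.parallelize(101 to 200),
  sc.parallelize(201 to 300)
)

// .par turns the Seq into a parallel collection, so the count() jobs are
// submitted to the SparkContext concurrently from different threads.
val counts = rdds.par.map(rdd => rdd.count()).toList
counts.foreach(c => println(s"count = $c"))
```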
In plain Scala, you can use the foreach method to loop through all elements in a collection: as foreach executes, it passes one element at a time from the collection to your function until it reaches the last element, and the foreach function is applicable to both Scala's mutable and immutable collection data structures. Spark follows the same idea at cluster scale. Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark; they can be created with parallelize(), from a text file, from another RDD, or from a DataFrame or Dataset, and Spark provides special operations for RDDs containing key/value pairs. To read many small files as (filename, content) pairs you could use the wholeTextFiles() method provided on the SparkContext.

When per-element work is expensive, mapPartitions() can be used as an alternative to map() and foreach(); working on a whole partition at a time also helps you use vectorisation. When you submit such an application, --deploy-mode controls whether to deploy your driver on the worker nodes (cluster) or locally as an external client (the default is client).
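A brief sketch contrasting map with mapPartitions; the CSV layout and the Record class are invented for the example:

```scala
// Assumes lines is an RDD[String] of rows like "42,3.14".
case class Record(id: Long, value: Double)

// map: the parsing closure is applied element by element.
val viaMap = lines.map { line =>
  val Array(id, value) = line.split(",")
  Record(id.toLong, value.toDouble)
}

// mapPartitions: the function receives the whole partition as an Iterator,
// so any per-partition setup happens only once instead of once per element.
val viaPartitions = lines.mapPartitions { iter =>
  val separator = ","                              // stand-in for expensive setup
  iter.map { line =>
    val Array(id, value) = line.split(separator)
    Record(id.toLong, value.toDouble)
  }
}
```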
Spark came into the picture because of how Apache Hadoop MapReduce was performing on iterative and interactive workloads. The foreach action in Spark is designed like a forced map (so the "map" work occurs on the executors), which makes it the natural answer to the common question of how to iterate over every row of a DataFrame without using collect(): the rows are visited on the executors, and an inner loop over the fields covers the "for each item in the row" case. On applying groupByKey() to a dataset of (K, V) pairs, the data is shuffled according to the key value K into another RDD, and Spark spills data to disk when there is more data shuffled onto a single executor machine than can fit in memory, an idea loosely analogous to virtual memory.

Structured Streaming is a stream processing engine built on the Spark SQL engine, and neither YARN nor Apache Spark has been designed for executing long-running services. In Structured Streaming, if you enable checkpointing for a streaming query, you can restart the query after a failure and the restarted query will continue where the failed one left off, while ensuring fault tolerance and data consistency guarantees. The foreach and foreachBatch sinks fit this model: a ForeachWriter (for example, a Mongo upsert writer) processes rows one at a time, while foreachBatch hands you each micro-batch as a regular DataFrame, so in one example we create a table and then start a Structured Streaming query to write to that table.
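A hedged sketch of both streaming variants, assuming the spark session from spark-shell; the rate source, the checkpoint paths, and the Cassandra keyspace/table options are placeholders, and the foreachBatch branch additionally assumes the DataStax Cassandra connector is on the classpath:

```scala
import org.apache.spark.sql.{DataFrame, ForeachWriter, Row}

val stream = spark.readStream.format("rate").load()   // toy source with timestamp, value

// Variant 1: ForeachWriter handles rows one at a time, per partition and epoch.
val rowWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = true   // e.g. open a connection here
  def process(row: Row): Unit = println(s"row: $row")          // e.g. upsert into MongoDB
  def close(errorOrNull: Throwable): Unit = ()                 // e.g. close the connection
}
val q1 = stream.writeStream
  .foreach(rowWriter)
  .option("checkpointLocation", "/tmp/checkpoints/q1")         // placeholder path
  .start()

// Variant 2: foreachBatch hands each micro-batch over as a normal DataFrame,
// so an existing batch writer (JDBC, Cassandra connector, ...) can be reused.
// An explicitly typed val avoids the Scala 2.12 overload ambiguity of foreachBatch.
val writeBatchToCassandra: (DataFrame, Long) => Unit = (batch, epochId) => {
  batch.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "demo", "table" -> "events"))   // placeholder keyspace/table
    .mode("append")
    .save()
}
val q2 = stream.writeStream
  .foreachBatch(writeBatchToCassandra)
  .option("checkpointLocation", "/tmp/checkpoints/q2")
  .start()

q1.awaitTermination()
```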
To recap the motivation for Spark: the difficulty of programming directly in Hadoop MapReduce, performance bottlenecks, batch processing not fitting every use case, and better support for the iterative jobs typical of machine learning. Start the Spark shell in the Spark base directory, ensuring that you provide enough memory via the --driver-memory option: ./bin/spark-shell --driver-memory 4g. A common Scala FAQ is how to iterate over a List (or more generally, a sequence) using the foreach method or a for loop (see "Scala: How to loop over a collection with 'for' and 'foreach'" on alvinalexander.com); with Spark the extra wrinkle is distribution, so it's best to call collect() on the RDD to get a sequential array for orderly printing on the driver.

Often we might want to store a Spark DataFrame as a table and query it. To convert a DataFrame into a temporary view that is available only for that Spark session, we use registerTempTable or createOrReplaceTempView (Spark >= 2.0); in both cases (Spark with or without Hive support), the createOrReplaceTempView method registers a temporary table. A managed table, by contrast, is a Spark SQL table for which Spark manages both the data and the metadata; on Databricks the metadata and data are stored in DBFS in your account, and since Spark SQL manages the table, a DROP TABLE example_data deletes both the metadata and the data. For Spark without Hive support, the table catalog is implemented as a simple in-memory map, which means that table information lives in the driver's memory and disappears with the Spark session.
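A quick sketch of the temporary-view flow, with column names invented for the example and the spark session taken from spark-shell:

```scala
import spark.implicits._

val df = Seq(("alice", 3), ("bob", 5)).toDF("name", "score")

// Session-scoped view: it disappears when this SparkSession ends.
df.createOrReplaceTempView("example_data")

spark.sql("SELECT name FROM example_data WHERE score > 4")
  .collect()            // bring the result rows to the driver
  .foreach(println)     // orderly printing on the driver
```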
To spell out the difference between foreach and foreachPartition: in foreach you pass in a function whose argument is each element of the RDD (for a pair RDD, each key/value instance), while foreachPartition iterates over each partition's iterator and processes its contents through the user-provided function. Either way, data and execution code are spread from the driver to many worker machines for parallel processing, which is why the frequent question of how to pass a SparkSession from the driver to the executors has no good answer: the session lives on the driver and should not be used inside these closures. Accumulators are the supported way to get side-effect results back, and performing the updates inside an action such as foreach is required if you want to guarantee an accumulator's value to be correct. The updates do not really have to be thread-safe on the driver, since the DAGScheduler.updateAccumulators method that the driver uses to update the values of accumulators after a task completes (successfully or with a failure) is only executed on the single thread that runs the scheduling loop.

Wikipedia has a great description of the project: Apache Spark is an open source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, but later donated to the Apache Software Foundation, where it remains today. SparkContext (aka the Spark context) is the entry point to the services of Apache Spark (the execution engine) and so the heart of a Spark application; in fact, you can consider an application a Spark application only when it uses a SparkContext, directly or indirectly. Arbitrary Spark configuration properties are set in key=value format; any parameters you set directly on the SparkConf object take priority over system properties, and if you use spark-shell or spark-submit you can pass these properties with --conf. Spark's map itself is a transformation function which accepts a function as an argument, RDDs can contain any type of Python, Java, or Scala objects, and once Spark Streaming is "connected" to an external data source via input DStreams, any subsequent DStream transformations will create "normal" DStreams.

In Structured Streaming, the sink types supported by default include the File sink, Foreach sink, Console sink, and Memory sink. One caveat with foreachBatch: if you are running multiple Spark jobs on the batchDF, the input data rate of the streaming query (reported through StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at which data is generated at the source.
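A small sketch of the accumulator pattern inside foreach, assuming the sc from spark-shell and an invented log dataset:

```scala
val errorCount = sc.longAccumulator("errorCount")

val logs = sc.parallelize(Seq("ok", "ERROR disk full", "ok", "ERROR network"))

// The update happens inside an action, so it is guaranteed to be applied once per element.
logs.foreach { line =>
  if (line.startsWith("ERROR")) errorCount.add(1)
}

// Accumulator values are read back on the driver.
println(s"errors seen: ${errorCount.value}")   // 2
```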
Spark SQL is tightly integrated with the various Spark programming languages, so a natural way to follow along is to launch the Spark shell and run the examples interactively. Returning to the JDBC case: DataFrame.write.jdbc() requires the DataFrame's schema to match the target table's structure exactly (even the column order must match), otherwise it throws an exception; and if you choose SaveMode.Overwrite, Spark drops your existing table and recreates it from the DataFrame's schema. Here, to print all the elements in the RDD, we call a print function in foreach. In the groupBy example, charAt(0) gets the first character of the word in upper case, which is used as the group key; foreach(println) then iterates through the grouped elements, and each group is printed as a CompactBuffer. Grouping can buffer a lot of data on a single executor, but fortunately there are techniques that can be used to avoid buffering large amounts of data in memory and avoid out-of-memory exceptions.

The older DStream API raises a related question: a streaming job whose foreach body is empty ("no code here, but it still takes a lot of time; there used to be code, removed to see if it is any faster without it") is a common puzzle, and the poster's own understanding is the key part of the answer: without an output operation on a DStream, a Spark Streaming application will not execute at all, so the foreach is kept precisely to keep the pipeline running. What Spark adds to existing frameworks like Hadoop is the ability to combine multiple map and reduce tasks into a single workflow. For further information, see the Spark SQL, DataFrames, and Datasets Guide.
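A compact sketch of that grouping example, with the word list invented for illustration:

```scala
// Assumes the sc from spark-shell.
val words = sc.parallelize(Seq("spark", "scala", "kafka", "streaming"))

// Group words by their first character, upper-cased.
val grouped = words.groupBy(word => word.charAt(0).toUpper)

// Each element is a (Char, Iterable[String]) pair; the Iterable prints as a CompactBuffer.
grouped.collect().foreach(println)
// e.g. (S,CompactBuffer(spark, scala, streaming))
//      (K,CompactBuffer(kafka))
```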