APACHE-SPARK QUESTIONS

Hive automatically filtering NULL in NOT IN condition
This is how all RDBMS systems treat NULL: it has a special meaning, something like "not defined". Comparing a value to NULL yields UNKNOWN, so a NOT IN list that contains a NULL can never evaluate to true.
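A minimal PySpark sketch of that behaviour (the table and values are made up): once the NOT IN list contains a NULL, no row can satisfy the predicate.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table with three ids.
    spark.createDataFrame([(1,), (2,), (3,)], ["id"]).createOrReplaceTempView("t")

    # id NOT IN (1, NULL) can never be TRUE: comparing id to NULL yields UNKNOWN,
    # so even id = 2 and id = 3 are filtered out and the result is empty.
    spark.sql("SELECT * FROM t WHERE id NOT IN (1, NULL)").show()

    # The usual workaround is to exclude NULLs from the list or subquery,
    # e.g. ... WHERE id NOT IN (SELECT x FROM other WHERE x IS NOT NULL)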
TAG : apache-spark
Date : January 12 2021, 08:33 AM , By : fayoh
Unexpected behavior during Join (only works if rename column 'year' as 'year') otherwise fails with "package.TreeNodeException"
This is a bug; I simply rename the column, painful as that is. See "How to resolve the AnalysisException: resolved attribute(s) in Spark", which covers other scenarios as well.
TAG : apache-spark
Date : January 12 2021, 01:40 AM , By : Florian D.
Databricks notebook time out error when calling other notebooks: com.databricks.WorkflowException: java.net.SocketTimeoutException
I fixed the problem by tuning the default Spark configuration, increasing the executor heartbeat interval and the network timeout (spark.executor.heartbeatInterval 60s, spark.network.timeout 720s):
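A sketch of how those values could be applied when building the session (the app name is a placeholder; the full key for the heartbeat setting is spark.executor.heartbeatInterval):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("notebook-job")                             # placeholder name
        .config("spark.executor.heartbeatInterval", "60s")
        .config("spark.network.timeout", "720s")             # keep this larger than the heartbeat interval
        .getOrCreate()
    )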
TAG : apache-spark
Date : January 11 2021, 02:18 PM , By : Mr. Tacos
How to refresh loaded dataframe contents in spark streaming?
I may have misunderstood the question, but refreshing the metadata DataFrame should be supported out of the box; you simply don't have to do anything.
TAG : apache-spark
Date : January 07 2021, 03:08 PM , By : Menno
How to make GraphFrame from Edge DataFrame only
The GraphFrames Scala API has a function called fromEdges which generates a GraphFrame from an edge DataFrame. As far as I can tell this function isn't available in PySpark, but you can do something like the following:
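A sketch assuming an edges DataFrame with the usual GraphFrames src/dst columns: build the vertices from the distinct endpoints and pass both to the GraphFrame constructor.

    from graphframes import GraphFrame
    from pyspark.sql import functions as F

    # edges is assumed to have "src" and "dst" columns plus any edge attributes.
    vertices = (
        edges.select(F.col("src").alias("id"))
        .union(edges.select(F.col("dst").alias("id")))
        .distinct()
    )

    g = GraphFrame(vertices, edges)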
TAG : apache-spark
Date : January 02 2021, 10:54 PM , By : user112141
Get field values from a structtype in pyspark dataframe
IIUC, you can loop over df2.schema.fields and read each field's name and dataType:
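For instance (df2 is the DataFrame from the question):

    # Each entry in schema.fields is a StructField exposing name and dataType.
    for field in df2.schema.fields:
        print(field.name, field.dataType)

    # e.g. collect only the string columns
    string_cols = [f.name for f in df2.schema.fields
                   if f.dataType.simpleString() == "string"]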
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : TRobison
Find number of partitions computed per machine in Apache Spark
I am not sure about the Spark UI, but here is how you can achieve it programmatically:
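One way to sketch it (df is assumed to be any DataFrame): tag each partition with the hostname of the executor that computes it, then count per host on the driver.

    import socket

    per_host = (
        df.rdd
        .mapPartitions(lambda it: [socket.gethostname()])  # one hostname per partition
        .countByValue()
    )
    print(dict(per_host))  # e.g. {'worker-1': 4, 'worker-2': 4}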
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : user183825
Invalid status code '400' from .. error payload: "requirement failed: Session isn't active"
Judging by the output, if your application is not finishing with a FAILED status, this sounds like a Livy timeout: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h).
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : George Handlin
Required executor memory is above the max threshold of this cluster
Executor memory is only the heap portion of the memory. You still have to run a JVM plus allocate the non-heap portion of memory inside a container, and have that fit in YARN. Refer to the image from "How-to: Tune Your Apache Spark Jobs".
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : dyarborough
Use GCS staging directory for Spark jobs (on Dataproc)
First, it's important to realize that the staging directory is primarily used for staging artifacts for executors (primarily jars and other archives), rather than for storing intermediate data as a job executes.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : lewing
Does Hive preserve file order when selecting data
Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers); after splits are calculated, each process starts reading some piece of a file, or a few files, depending on the splits.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Eric
How to set optimal config values - trigger time, maxOffsetsPerTrigger - for Spark Structured Streaming while reading messages from Kafka
You can run a Spark Structured Streaming application either in fixed-interval micro-batches or in continuous mode. Here are some of the options you can use for tuning streaming applications, starting with the Kafka configuration:
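A sketch of where those knobs go (broker, topic and paths are placeholders): maxOffsetsPerTrigger caps how many records each micro-batch reads, and the trigger sets the batch interval.

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events")
        .option("maxOffsetsPerTrigger", 20000)
        .load()
    )

    query = (
        stream.writeStream
        .format("parquet")
        .option("path", "/data/events")
        .option("checkpointLocation", "/chk/events")
        .trigger(processingTime="1 minute")
        .start()
    )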
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : user186012
Spark policy for handling multiple watermarks
As far as I understand, you would like to know how multiple watermarks behave for join operations. If so, I did some digging into the implementation to find the answer: the multipleWatermarkPolicy configuration is used globally.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Brianna
Is it required to install spark on all the nodes of cluster
If you use YARN as the manager on a cluster with multiple nodes, you do not need to install Spark on each node; YARN will distribute the Spark binaries to the nodes when a job is submitted. See https://spark.apache.org/docs/latest/running-on-yarn.html
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Francesco
SparkSQL Get all prefixes of a word
Technically it is possible, but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap):
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : ffmmjj
How to save spark dataframe to parquet without using INT96 format for timestamp columns?
Reading the Spark code, I found the spark.sql.parquet.outputTimestampType property:
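For example (the property is available from Spark 2.3 onwards; the output path is a placeholder):

    # Valid values: INT96 (the default), TIMESTAMP_MICROS, TIMESTAMP_MILLIS.
    spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    df.write.parquet("/tmp/events_no_int96")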
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : iyogee
How to pass an external resource yml/property file while running a Spark job on a cluster?
You will have to use --files <path to your file> in the spark-submit command to be able to pass any files. The syntax is:
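A sketch (the file name is a placeholder): pass the file with --files at submit time, then resolve the shipped copy by its base name inside the job.

    # spark-submit --master yarn --files /local/path/app-config.yml my_job.py
    from pyspark import SparkFiles

    config_path = SparkFiles.get("app-config.yml")  # path of the distributed copy
    with open(config_path) as fh:
        raw = fh.read()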
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Bas
How spark structured streaming consumers initiated and invoked while reading multi-partitioned kafka topics?
If Kafka has more than one partition, consumers can benefit from that by doing a certain task in parallel. In particular, Spark Streaming internally can speed up a job by increasing the num-executors parameter.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Habikki
How to handle small file problem in spark structured streaming?
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Paulh
Killing spark streaming job when no activity
Use a NoSQL table like Cassandra or HBase to keep the counter. You cannot handle stream polling inside a loop. Implement the same logic using NoSQL or MariaDB, and perform a graceful shutdown of your streaming job if no activity is detected.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : user98832
Changing bucket class (Regional/Multi-Regional) in the Google Cloud Storage connector in Spark
From the documentation: Cloud Dataproc staging bucket
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : LinnheCreative
What is a fast way to generate parquet data files with Spark for testing Hive/Presto/Drill/etc?
I guess the main goal is to generate data, not to write it in a certain format. Let's start with a very simple example.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : lifchicker
java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper while writing delta-lake into
What is your Spark version? org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper came about in 2.4.0; if you are using an older version, you will have this issue. In 2.4.0: https://github.com/apache/spark/tree/v2.4.0/sql
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : yatil
Any clue how to join this spark-structured stream joins?
AFAIK Spark Structured Streaming can't do joins after aggregations (or other non-map-like operations). See https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Yolanda N. Ceron
Spark SQL nested JSON error "no viable alternative at input "
It's because SQL column names are expected to start with a letter or certain other characters like _ or @, but not a digit. Let's consider this simple example:
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Josh Tegart
Cassandra partition size vs partitions count while processing a large part of the table
As you suspect, planning to have just 31 partitions is a really bad idea for performance. The primary problem is that the database cannot scale: with RF=3, there would be at most (under unlikely optimal conditions) 93
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Kenny
How to correctly configure maxResultSize?
The following should do the trick. Also note that you have misspelled ("spark.executor.memories", "10g"); the correct configuration key is 'spark.executor.memory'.
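A sketch of both settings together (the sizes are examples only):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.driver.maxResultSize", "4g")  # cap on serialized results collected to the driver
        .config("spark.executor.memory", "10g")      # note: 'memory', not 'memories'
        .getOrCreate()
    )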
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Der Ketzer
Optimal file size and parquet block size
Before talking about the Parquet side of the equation, one thing to consider is how the data will be used after you save it to Parquet. If it's going to be read or processed often, you may want to consider what the access patterns are.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : JulianCT
Are the data nodes in an HDFS the same as the executor nodes in a spark cluster?
I always think about those concepts from a standalone perspective first, then from a cluster perspective. Considering a single machine (where you would also run Spark in local mode), DataNode and NameNode are just pieces of software.
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Vasiliy
Design: can a Kafka producer be written as a Spark job?
Spark provides a connector for Kafka through which you can connect to any Kafka topic available in your cluster. Once connected to the topic you can read or write data. Example code:
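A sketch of the write side (broker, topic and column names are placeholders; Kafka expects key/value columns, hence the casts):

    (df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
       .write
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("topic", "my-topic")
       .save())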
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : AnToni00
PySpark/Glue: When using a date column as a partition key, it's always converted into String?
This is a known behavior of Parquet. You can add the following line before reading the Parquet file to avoid this behavior:
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Fred Morrison
How to count the null, NA and NaN values in each column of a pyspark dataframe
The DataFrame has NA, NaN and null values, with schema (Name: String, Rol.No: Integer, Dept: String). Use when(), for example:
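A sketch of the pattern (df is the DataFrame from the question; isnan() only applies to numeric columns, so it is guarded by data type):

    from pyspark.sql import functions as F

    numeric = {f.name for f in df.schema.fields
               if f.dataType.simpleString() in ("double", "float")}

    exprs = []
    for c in df.columns:
        cond = F.col(c).isNull() | (F.col(c) == "NA")
        if c in numeric:
            cond = cond | F.isnan(F.col(c))
        exprs.append(F.count(F.when(cond, c)).alias(c))

    df.select(exprs).show()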
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : Feroz
Elasticsearch Spark parse issue - cannot parse value [X] for field [Y]
I've created a sample document based on your data in ES 6.4/Spark 2.1 and used the code below in order to read the GenerateTime field as text instead of a date type in Spark. Mapping in ES:
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : user90210
Hive query to find the count for the weeks in middle
Although this answer is in Scala, the Python version will look almost the same and can easily be converted. Step 1:
TAG : apache-spark
Date : January 02 2021, 06:48 AM , By : mckasty
How to match/extract multi-line pattern from file in pyspark
If you split on \n into a list y, we can find rank and quantityUnit by checking y[1] and y[2], quantityAmount by checking y[1], and Item_id by checking y[0].
TAG : apache-spark
Date : January 01 2021, 05:01 PM , By : DonMac
Apache Spark, NameError: name 'flatMap' is not defined
OK, here is a Scala example with a tokenizer that leads me to think you are looking at it wrongly.
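The original example is Scala; the same point as a PySpark sketch: flatMap is not a free function, it is a method on an RDD.

    rdd = spark.sparkContext.parallelize(["a small example", "of flat map"])

    # The NameError in the question: there is no built-in flatMap(...) function.
    words = rdd.flatMap(lambda line: line.split(" "))
    print(words.collect())  # ['a', 'small', 'example', 'of', 'flat', 'map']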
TAG : apache-spark
Date : January 01 2021, 04:56 PM , By : Janne Laine
How to Merge DataFrames in Apache Spark/Hive and then increment version
We receive daily files from an external system and store them in Hive, and we want to enable versioning on the data. The existing main Hive table:
TAG : apache-spark
Date : January 01 2021, 06:46 AM , By : Chris Hanley
Who executes the Python code in pyspark
Both print commands and datetime.now() are executed in the Spark driver. The current_time will be passed to executors on the next action to actually add it to the DataFrame; at the time of print("new column added"), only the df's schema has been updated.
TAG : apache-spark
Date : January 01 2021, 06:46 AM , By : Thx1138.6
Comparing data across executors in Spark
Just write an SQL query with lag windowing: check adjacent rows for date and date minus 1, with the major key qualification being Name, and sort within Name as well. You need not worry about executors; Spark will hash the data for you.
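A PySpark sketch of the same idea (the Name and date column names follow the question; df is assumed):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("Name").orderBy("date")

    result = (
        df.withColumn("prev_date", F.lag("date").over(w))
          .withColumn("is_consecutive", F.datediff("date", "prev_date") == 1)
    )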
TAG : apache-spark
Date : January 01 2021, 06:35 AM , By : phil
Spark structured streaming with Apache Hudi
I know of at least one user using the Structured Streaming sink in Hudi. https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/test/scala/DataSourceTest.scala#L190 could help.
TAG : apache-spark
Date : January 01 2021, 06:35 AM , By : jedameron
Scheduling Spark for a nightly batch run - run every night like an ETL
There is no built-in scheduling mechanism in Spark; a cron job seems reasonable. Other options are https://azkaban.github.io/ and https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spa
TAG : apache-spark
Date : December 28 2020, 02:12 PM , By : CHeMoTaCTiC
Spark session/context lifecycle
Let's understand SparkSession and SparkContext. SparkContext is a channel to access all Spark functionality; the Spark driver program uses it to connect to the cluster manager, to communicate, to submit Spark jobs, and to know what resources are available.
TAG : apache-spark
Date : December 28 2020, 06:11 AM , By : Nicholas Hunter
How to use wildcards for directories with leading zeros in their names in Spark SQL?
From the Hadoop glob pattern documentation: [abc] matches a single character from the character set {a,b,c}; [a-b] matches a single character from the character range {a...b}; {ab,cd} matches a string from the string set {ab, cd}. For example:
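A sketch against a hypothetical layout with month=01 ... month=12 directories:

    df_q1  = spark.read.parquet("/data/year=2020/month={01,02,03}")  # explicit set
    df_sep = spark.read.parquet("/data/year=2020/month=0[1-9]")      # months 01..09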
TAG : apache-spark
Date : December 28 2020, 05:45 AM , By : LUK
pyspark DF.show() error after converting RDD to DF after zipWithIndex
I seem to be following the documented ways of showing a DataFrame converted from an RDD with a schema, but clearly there is some minor yet significant point I am missing. The problem becomes clearer if you look at the rdd:
TAG : apache-spark
Date : December 27 2020, 03:57 PM , By : gorbiz
Spark with Mesos : java.lang.UnsatisfiedLinkError: libsvn_delta-1.so.0: cannot open shared object file: No such file or directory
If the library is already loaded by your application and the application tries to load it again, the UnsatisfiedLinkError will be thrown by the JVM.
TAG : apache-spark
Date : December 27 2020, 03:57 PM , By : Piotr Balas
Difference between writing to partition path directly and using partitionBy
Are there any differences between the two? Both forms are sketched below.
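A sketch of the two approaches (paths and the event_date column are placeholders):

    from pyspark.sql import functions as F

    # A: let Spark lay out the partition directories.
    df.write.mode("overwrite").partitionBy("event_date").parquet("/warehouse/events")

    # B: filter one partition yourself and write straight to its directory.
    (df.where(F.col("event_date") == "2020-12-01")
       .drop("event_date")                      # the value is already encoded in the path
       .write.mode("overwrite")
       .parquet("/warehouse/events/event_date=2020-12-01"))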
TAG : apache-spark
Date : December 27 2020, 03:55 PM , By : user157654
Incomplete HDFS URI, no host, although the file does exist
The error message says that you have not specified the host in the HDFS URI. Try changing the URI to:
TAG : apache-spark
Date : December 27 2020, 03:14 PM , By : kdietz
Sharing a spark session
I feel there are two questions here. Q1: How can you reuse an already created Spark session in a Scala file?
TAG : apache-spark
Date : December 27 2020, 02:58 PM , By : Mr. Tacos
Pyspark ignoring filtering of dataframe inside pyspark-sql-functions
The easiest way to achieve what you require is to use when() instead of df.where(). Taking variables from your example:
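A sketch of that pattern (group_id, status and amount are placeholder column names): the condition moves into the aggregate instead of a separate where().

    from pyspark.sql import functions as F

    result = df.groupBy("group_id").agg(
        F.sum(F.when(F.col("status") == "ok", F.col("amount"))).alias("ok_amount")
    )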
TAG : apache-spark
Date : December 27 2020, 02:42 PM , By : Rob
How to register a SQL function in Spark Databricks
I have written SQL code that I'm going to use in Spark. The code works fine when run as T-SQL on MS SQL Server; however, when I apply the code on the Spark platform I get the error: Undefined function: 'EOMONTH'.
TAG : apache-spark
Date : December 26 2020, 12:01 AM , By : Ernie
Accessing already present table in Hive
You have a call to create the database, but you never use it in the create table call. I'd suggest changing the first three lines of the script to something like:
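Something along these lines (the database and table names are placeholders):

    spark.sql("CREATE DATABASE IF NOT EXISTS mydb")

    # either switch the current database...
    spark.sql("USE mydb")
    spark.sql("CREATE TABLE IF NOT EXISTS events (id INT, name STRING)")

    # ...or qualify the table name directly
    spark.sql("CREATE TABLE IF NOT EXISTS mydb.other_table (id INT, name STRING)")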
TAG : apache-spark
Date : December 25 2020, 11:30 PM , By : user121350
Should the number of executor cores for Apache Spark be set to 1 in YARN mode?
When we run a Spark application using a cluster manager like YARN, there will be several daemons running in the background, such as the NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. So, while specifying num-executors, these daemons need to be taken into account.
TAG : apache-spark
Date : December 25 2020, 07:30 PM , By : Brownell
Spark create new spark session/context and pick up from failure
Checkpointing in non-streaming Spark is used to sever lineage. It is not designed for sharing data between different applications or different SparkContexts; what you would like is, in fact, not possible.
TAG : apache-spark
Date : December 25 2020, 04:01 PM , By : Cadu
Push Spark Dataframe to Janusgraph for Spark running in EMR
I spent around two weeks finding the answer, so I'm posting it here in case it helps someone. For writing the DataFrame from a remote machine you can use Gremlin, but for reading efficiently (in case you want to add edges) you may need Spark.
TAG : apache-spark
Date : December 25 2020, 01:30 PM , By : Joshua Johnson
How to compute the percentile in a pyspark DataFrame?
I have a pyspark DataFrame consisting of three columns x, y, z. Try groupBy + F.expr:
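For example (x as the grouping column and y as the value column, following the question's x/y/z layout):

    from pyspark.sql import functions as F

    result = df.groupBy("x").agg(
        # percentile() is exact; percentile_approx() is cheaper on large data
        F.expr("percentile(y, array(0.25, 0.5, 0.75))").alias("quantiles")
    )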
TAG : apache-spark
Date : December 25 2020, 12:09 PM , By : mediafarm
How to find the mode of a few columns in pyspark
If you're looking for how to calculate the row-wise mode in Spark, refer to "Mode of row as a new column in PySpark DataFrame". However, you can get your desired result without computing the mode, since this is a binary classification problem.
TAG : apache-spark
Date : December 25 2020, 11:01 AM , By : PPD
How to convert an RDD of pandas DataFrames to a Spark DataFrame
I create an RDD of pandas DataFrames as an intermediate result, and I want to convert it to a Spark DataFrame and eventually save it to a parquet file.
TAG : apache-spark
Date : December 25 2020, 09:33 AM , By : Singularity
How to fix 'ClassCastException: cannot assign instance of' - Works local but not in standalone on cluster
It seems there was a conflict between dependencies, although I do not know exactly which. This is what I did: see Spring - Spark: Conflicts between jars/dependencies.
TAG : apache-spark
Date : December 25 2020, 08:30 AM , By : Manik
Is it better to partition by timestamp or by year, month, day, hour
Aiming for 1 GB every 10 minutes means you'll very quickly build up a pretty massive amount of data (1000 files and 1 TB a week, give or take). Your choices have to take into account at least:
TAG : apache-spark
Date : December 25 2020, 07:01 AM , By : Shitic
Where to find errors when writing to BigQuery from Dataproc?
I reproduced the problem: the errors returned by the BigQuery API were discarded by the BigQuery connector. I filed an issue for the BQ connector; we'll fix it in the next release.
TAG : apache-spark
Date : December 25 2020, 07:01 AM , By : Matt Croydon
