2,488
questions
0
votes
1
answer
16
views
Read CSV with "§" as delimiter using Databricks autoloader
I'm very new to spark streaming and autoloader and had a query on how we might be able to get autoloader to read a text file with "§" as the delimiter. Below I tried reading the file as a ...
1
vote
0
answers
27
views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline where a NiFi on-premise flow writes JSON files in streaming to a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
1
vote
1
answer
32
views
Spark 3.0 Cannot write non-nullable data to iceberg
I have an avro file which has a field called timeStamp which is a mandatory field without any default value. Which means there is no chance to get this field as null. The schema defined as below
{&...
-1
votes
0
answers
24
views
State variable in Pyspark
I am using pyspark to resolve the following case,
Detect the operations of a signal lamp, means capturing the start time (signal ON) and the end time (signal OFF) using some logic on a function ...
0
votes
0
answers
14
views
As known, spark structured streaming support that stream join static, but how to refresh the static data when the static data changed
In my spark structured streaming application, the static data is updated per day. But after the application started, the static data in the spark application memory is never updated.
It caused that ...
0
votes
0
answers
75
views
How To Evaluate different Spark Physical Plan
I have been recently diving in my knowledge in Spark query plans in order to start tuning the process that we have developed so far.
So I started to test some stuff. I will share the scenario as much ...
0
votes
1
answer
29
views
Unexpected Behavior using WHEN | OTHERWISE
We have developed one streaming process which use many other delta tables to enrich the final data product.
Lets call it FinalDataProduct the delta table where the data is inserted, semiLayout a ...
0
votes
1
answer
84
views
Feed the result of one query to another in the same Spark Structured Streaming app
I have just started working on Spark Structured Streaming and came up with an implementation question.
So I am using Apache Pulsar to stream data, and wanted to know if it is possible to run different ...
2
votes
1
answer
22
views
Structured stream writer using foreachBatch does not respect shuffle.partitions parameter
We are running a data deduplication operation on a structured stream using the foreachBatch functionality.
However, the write operation does not appear to respect the number of shuffle partitions that ...
0
votes
0
answers
32
views
Enrich Kafka with CDC data coming with debezium in Spark Streaming
I have a kafka stream which I am reading in spark streaming. Some metadata is present in Mysql which can be changed. I have setup a debezium which is sending CDC events to another Kafka topic. I need ...
0
votes
0
answers
13
views
Spark Arbitrary Stateful Operations - how does Spark manage the timeout?
I'm looking at this example and also tried to apply it for my use case, but I couldn't understand how Spark uses the hasTimedOut to manage the data in memory. For example, what happen if I handle a ...
0
votes
0
answers
23
views
Get rid of shuffle/CartesianRDD from the execution plan - Spark Structured Streaming
I have the following problem:
There is a Spark Structured Streaming query that runs forEachBatch and executes custom Python code as arrowOptimized Spark UDFs. The code is relatively complex. The ...
1
vote
1
answer
33
views
Using pyspark structured streaming to parse Kafka but getting null
I try to parse Kafka using the code bellow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession \
.builder \
....
1
vote
0
answers
65
views
Join after groupby in Spark structured streaming
When I run the following code:
Dataset<Row> aggStreamA = df
.withWatermark("dateTime", "2 days")
.groupBy(
window(col("dateTime"), ...
0
votes
0
answers
17
views
Spark Streaming: Periodic Latency Spike w/ ElasticSearch/OpenSearch Connector using Spark DataSource V2
I developed Spark Structured Streaming on Spark 3.1.2.
It reads streaming data and join with static DataFrame, which refreshes with a period using Delta foramt.
It sinks to ElasticSearch(or OpenSearch)...