0 votes
1 answer
16 views

Read CSV with "§" as delimiter using Databricks autoloader

I'm very new to Spark Streaming and Auto Loader and had a query about how to get Auto Loader to read a text file with "§" as the delimiter. Below I tried reading the file as a ...
beingmanny
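A configuration sketch of what this usually looks like (not the asker's code; the path, schema, and column names are illustrative): Auto Loader passes CSV reader options through, so a custom single-character delimiter such as "§" can be supplied via the `sep` option.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Explicit schema, since streaming sources cannot infer it per batch.
schema = StructType([
    StructField("col_a", StringType()),
    StructField("col_b", StringType()),
])

df = (
    spark.readStream
    .format("cloudFiles")                 # Databricks Auto Loader source
    .option("cloudFiles.format", "csv")   # treat incoming files as CSV
    .option("sep", "§")                   # custom delimiter, passed to the CSV reader
    .option("header", "true")
    .schema(schema)
    .load("/mnt/landing/input/")          # illustrative input path
)
```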
1 vote
0 answers
27 views

How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?

I'm running a data pipeline where an on-premise NiFi flow continuously writes JSON files to a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
Puredepatata
1 vote
1 answer
32 views

Spark 3.0: cannot write non-nullable data to Iceberg

I have an Avro file with a field called timeStamp, which is mandatory and has no default value, meaning there is no chance of this field being null. The schema is defined as below { ...
Abhi • 71
-1 votes
0 answers
24 views

State variable in PySpark

I am using PySpark to resolve the following case: detect the operations of a signal lamp, i.e. capture the start time (signal ON) and the end time (signal OFF) using some logic in a function ...
Anandu Balan
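The core of this question is an ON/OFF interval state machine. A plain-Python sketch of that logic (the function name `detect_intervals` and the event format are illustrative, not from the question; in PySpark the same idea is typically expressed with `lag`/window functions over the ordered events):

```python
def detect_intervals(events):
    """Given [(ts, state), ...] ordered by ts with state in {"ON", "OFF"},
    return a list of (start_ts, end_ts) pairs, one per ON..OFF cycle."""
    intervals = []
    start = None
    for ts, state in events:
        if state == "ON" and start is None:
            start = ts                       # lamp switched on: remember start time
        elif state == "OFF" and start is not None:
            intervals.append((start, ts))    # lamp switched off: close the interval
            start = None
    return intervals                         # an unfinished ON (no OFF yet) is dropped
```

Repeated ON events while the lamp is already on are ignored, and a trailing ON with no matching OFF yields no interval — in a streaming job that open interval would instead be carried as state across micro-batches.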
0 votes
0 answers
14 views

Spark Structured Streaming supports joining a stream with static data, but how do you refresh the static data when it changes?

In my Spark Structured Streaming application, the static data is updated once per day. But after the application starts, the static data held in the application's memory is never updated. This caused ...
Angle Tom • 1,120
0 votes
0 answers
75 views

How to evaluate different Spark physical plans

I have recently been deepening my knowledge of Spark query plans in order to start tuning the process we have developed so far, so I started testing some things. I will share the scenario as much ...
Alex • 25
0 votes
1 answer
29 views

Unexpected Behavior using WHEN | OTHERWISE

We have developed a streaming process which uses many other Delta tables to enrich the final data product. Let's call the Delta table where the data is inserted FinalDataProduct, and semiLayout a ...
Alex • 25
0 votes
1 answer
84 views

Feed the result of one query to another in the same Spark Structured Streaming app

I have just started working on Spark Structured Streaming and came up with an implementation question. So I am using Apache Pulsar to stream data, and wanted to know if it is possible to run different ...
ghost • 405
2 votes
1 answer
22 views

Structured stream writer using foreachBatch does not respect shuffle.partitions parameter

We are running a data deduplication operation on a structured stream using the foreachBatch functionality. However, the write operation does not appear to respect the number of shuffle partitions that ...
MRTN • 211
0 votes
0 answers
32 views

Enrich Kafka with CDC data coming with debezium in Spark Streaming

I have a Kafka stream which I am reading in Spark Streaming. Some metadata, which can change, is present in MySQL. I have set up Debezium to send CDC events to another Kafka topic. I need ...
Professor
0 votes
0 answers
13 views

Spark Arbitrary Stateful Operations - how does Spark manage the timeout?

I'm looking at this example and have also tried to apply it to my use case, but I couldn't understand how Spark uses hasTimedOut to manage the data in memory. For example, what happens if I handle a ...
nirkov • 769
0 votes
0 answers
23 views

Get rid of shuffle/CartesianRDD from the execution plan - Spark Structured Streaming

I have the following problem: there is a Spark Structured Streaming query that uses foreachBatch and executes custom Python code as Arrow-optimized Spark UDFs. The code is relatively complex. The ...
Łukasz Kastelik
1 vote
1 answer
33 views

Using pyspark structured streaming to parse Kafka but getting null

I am trying to parse Kafka messages using the code below: from pyspark.sql import SparkSession from pyspark.sql.functions import * from pyspark.sql.types import * spark = SparkSession \ .builder \ ....
rind-tran
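A hedged sketch of the usual shape of this job (broker, topic, and field names are illustrative): `from_json` returns null whenever the string does not match the supplied schema, so casting `value` to string and double-checking the schema against an actual message is the standard first step when every row comes back null.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

# Schema must match the JSON payload exactly, or from_json yields null.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "topic")
    .load()
)

parsed = raw.select(
    # Kafka's value column is binary; cast to string before parsing.
    from_json(col("value").cast("string"), schema).alias("data")
).select("data.*")
```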
1 vote
0 answers
65 views

Join after groupby in Spark structured streaming

When I run the following code: Dataset<Row> aggStreamA = df .withWatermark("dateTime", "2 days") .groupBy( window(col("dateTime"), ...
Mohsen R
0 votes
0 answers
17 views

Spark Streaming: Periodic Latency Spike w/ ElasticSearch/OpenSearch Connector using Spark DataSource V2

I developed a Spark Structured Streaming job on Spark 3.1.2. It reads streaming data and joins it with a static DataFrame, which is refreshed periodically using the Delta format. It sinks to ElasticSearch (or OpenSearch) ...
InJung Hwang
