Newest 'spark-structured-streaming' Questions

0 votes

1 answer

16 views

Read CSV with "§" as delimiter using Databricks autoloader

I'm very new to spark streaming and autoloader and had a query on how we might be able to get autoloader to read a text file with "§" as the delimiter. Below I tried reading the file as a ...

beingmanny

1

asked 13 hours ago

1 vote

0 answers

27 views

How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?

I'm running a data pipeline where a NiFi on-premise flow writes JSON files in streaming to a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...

Puredepatata

23

asked Sep 16 at 18:50

1 vote

1 answer

32 views

Spark 3.0 Cannot write non-nullable data to iceberg

I have an Avro file with a field called timeStamp, which is mandatory and has no default value, so there is no chance of this field being null. The schema is defined as below {...

Abhi

71

asked Sep 12 at 14:23

-1 votes

0 answers

24 views

State variable in Pyspark

I am using PySpark to solve the following case: detect the operations of a signal lamp, which means capturing the start time (signal ON) and the end time (signal OFF) using some logic on a function ...

Anandu Balan

11

asked Sep 10 at 11:23

0 votes

0 answers

14 views

Spark Structured Streaming supports stream-static joins, but how to refresh the static data when it changes

In my Spark Structured Streaming application, the static data is updated daily. But after the application has started, the static data held in the application's memory is never updated. It caused that ...

Angle Tom

1,120

asked Sep 9 at 8:07

0 votes

0 answers

75 views

How To Evaluate different Spark Physical Plan

I have recently been deepening my knowledge of Spark query plans in order to start tuning the process that we have developed so far. So I started to test some things. I will share the scenario as much ...

Alex

25

asked Sep 6 at 23:39

0 votes

1 answer

29 views

Unexpected Behavior using WHEN | OTHERWISE

We have developed a streaming process which uses many other Delta tables to enrich the final data product. Let's call the Delta table where the data is inserted FinalDataProduct, semiLayout a ...

Alex

25

asked Sep 5 at 12:13

0 votes

1 answer

84 views

Feed the result of one query to another in the same Spark Structured Streaming app

I have just started working with Spark Structured Streaming and have an implementation question. I am using Apache Pulsar to stream data, and wanted to know if it is possible to run different ...

ghost

405

asked Sep 2 at 10:13

2 votes

1 answer

22 views

Structured stream writer using foreachBatch does not respect shuffle.partitions parameter

We are running a data deduplication operation on a structured stream using the foreachBatch functionality. However, the write operation does not appear to respect the number of shuffle partitions that ...

MRTN

211

asked Aug 29 at 10:39

0 votes

0 answers

32 views

Enrich Kafka with CDC data coming with debezium in Spark Streaming

I have a Kafka stream which I am reading in Spark Streaming. Some metadata is stored in MySQL and can change. I have set up Debezium, which is sending CDC events to another Kafka topic. I need ...

Professor

1

asked Aug 16 at 17:47

0 votes

0 answers

13 views

Spark Arbitrary Stateful Operations - how does Spark manage the timeout?

I'm looking at this example and also tried to apply it to my use case, but I couldn't understand how Spark uses hasTimedOut to manage the data in memory. For example, what happens if I handle a ...

nirkov

769

asked Aug 12 at 8:28

0 votes

0 answers

23 views

Get rid of shuffle/CartesianRDD from the execution plan - Spark Structured Streaming

I have the following problem: there is a Spark Structured Streaming query that runs foreachBatch and executes custom Python code as arrowOptimized Spark UDFs. The code is relatively complex. The ...

Łukasz Kastelik

629

asked Jul 30 at 9:55

1 vote

1 answer

33 views

Using pyspark structured streaming to parse Kafka but getting null

I am trying to parse Kafka using the code below: from pyspark.sql import SparkSession from pyspark.sql.functions import * from pyspark.sql.types import * spark = SparkSession \ .builder \ ....

rind-tran

11

asked Jul 23 at 11:47

1 vote

0 answers

65 views

Join after groupby in Spark structured streaming

When I run the following code: Dataset<Row> aggStreamA = df .withWatermark("dateTime", "2 days") .groupBy( window(col("dateTime"), ...

Mohsen R

11

asked Jul 13 at 16:14

0 votes

0 answers

17 views

Spark Streaming: Periodic Latency Spike w/ ElasticSearch/OpenSearch Connector using Spark DataSource V2

I developed a Spark Structured Streaming application on Spark 3.1.2. It reads streaming data and joins it with a static DataFrame, which is refreshed periodically using the Delta format. It sinks to ElasticSearch (or OpenSearch)...

InJung Hwang

31

asked Jul 12 at 2:50
