All Questions
Tagged with apache-spark, azure-databricks
1,002 questions
1 vote · 1 answer · 56 views
DataFrame write to Azure-SQL row-by-row performance
We are using Azure Databricks (Spark) to write data to an Azure SQL database. Last week we switched from runtime 9.1 (Spark 3.1) to the newer 14.3 (Spark 3.5) using the Spark native JDBC driver. However, when we ...
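A common cause of row-by-row JDBC write performance is leaving the writer's batching at a small or default value. A minimal sketch of the Spark JDBC writer's `batchsize` and `numPartitions` options — the values are illustrative, and the URL/table in the usage comment are hypothetical placeholders:

```python
def jdbc_write_options(batch_size=10_000, num_partitions=8):
    """Options for Spark's JDBC writer that batch many rows per round trip
    instead of issuing one INSERT per row."""
    return {
        "batchsize": str(batch_size),         # rows sent per JDBC batch
        "numPartitions": str(num_partitions), # parallel writer connections
    }

# Usage inside a notebook that already has `df` and `jdbc_url` (hypothetical):
# (df.write.format("jdbc")
#    .option("url", jdbc_url)
#    .option("dbtable", "dbo.target_table")
#    .options(**jdbc_write_options())
#    .mode("append")
#    .save())
```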
1 vote · 1 answer · 43 views
PySpark Java whitelisted class issues
I am trying to migrate the Hive metastore into Unity Catalog, so I had to enable Unity Catalog on my existing cluster, but the code below, used in one of my notebooks, is no longer supported, and ...
-1 votes · 1 answer · 73 views
How to avoid small files in Databricks when writing data
I am performing two write operations, each in a different notebook. The first operation involves writing approximately 22 million records with 90 columns, and the second involves writing about 10 ...
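Small output files usually come from writing with too many partitions relative to the data volume. One hedged approach is to size the partition count from an estimated byte count before writing (128 MB per file is a common Delta target; the helper below is a sketch, not Databricks' own compaction logic):

```python
def target_partitions(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Partition count so each output file lands near target_file_bytes.
    Uses ceiling division so any remainder still gets a partition."""
    return max(1, -(-total_bytes // target_file_bytes))

# Usage sketch: df.repartition(target_partitions(estimated_size)).write...
# On Databricks, the optimized-write cluster/table settings are an alternative
# that sizes files automatically at write time.
```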
1 vote · 1 answer · 100 views
Spark: Persist not working as expected
I was using a PySpark DataFrame on which I called a UDF. The UDF makes an API call and stores the response back into the DataFrame. My goal is to store the DataFrame and reuse it ...
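A frequent source of this surprise is that `persist()` is lazy: without a subsequent action nothing is cached, and the UDF (with its API calls) re-runs on every later action. A minimal sketch — the `materialize` helper is hypothetical, not a Spark API:

```python
def materialize(df):
    """Persist a DataFrame and force computation once, so later actions
    reuse the cached rows instead of re-running the UDF."""
    cached = df.persist()  # MEMORY_AND_DISK is the DataFrame-API default
    cached.count()         # an action; persist alone computes nothing
    return cached
```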
1 vote · 2 answers · 71 views
Scala UDF with long running initialization
I have a Scala UDF which works functionally, but is slower than it should be.
It is a function that looks up a location from an IP address. This uses a relatively large database (200+ MB), which I ...
0 votes · 1 answer · 108 views
DLT - how to get pipeline_id and update_id?
I need to insert pipeline_id and update_id in my Delta Live Table (DLT), the point being to know which pipeline created which row. How can I obtain this information?
I know you can get job_id and ...
0 votes · 1 answer · 54 views
How to ensure UDF containing api call is being run across multiple worker nodes in Databricks
Working in an Azure Databricks environment.
I have a Spark DataFrame containing 200 rows, each of which represents a container in ADLS. For each row I need to sum the sizes of the blobs in that ...
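With only 200 rows, Spark may pack everything into one or two partitions, so the UDF effectively runs on a single worker. One hedged fix is to repartition the small DataFrame before applying the UDF so each row can land in its own task; the helper and its cap are illustrative:

```python
def spread(df, num_rows, max_tasks=64):
    """Repartition a tiny DataFrame so its rows fan out across tasks;
    caps the partition count so we don't create thousands of empty tasks."""
    return df.repartition(min(num_rows, max_tasks))

# Usage sketch: spread(df, 200).withColumn("size", sum_blob_sizes_udf("container"))
# where sum_blob_sizes_udf is the (hypothetical) API-calling UDF.
```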
0 votes · 0 answers · 39 views
Can we create multiple Spark executors within a single driver node on a Databricks cluster?
I have a power user compute with a single driver node and I'm trying to parallelize forecasting across multiple series by aggregating the data and doing a groupBy and then an apply on the groupBy.
The ...
0 votes · 1 answer · 262 views
Error: No parent external location found while creating a dataframe and saving it as a table in ADLS on an Azure Databricks free trial
Working on my free-trial Azure account, I am trying to copy CSV files to ADLS Gen2 and save the dataframe as a table in the ADLS silver layer.
code:
DForderItems = spark.read.csv("abfss://bronze@...
1 vote · 2 answers · 128 views
Unable to InferSchema with Databricks SQL
When I attempt to create a table with Databricks SQL I get the error:
AnalysisException: Unable to infer schema for CSV. It must be specified manually.
%sql
CREATE TABLE IF NOT EXISTS newtabletable ...
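That AnalysisException typically means the CSV source gave Spark nothing to infer a schema from, and the message's documented fix is to state the schema manually. A sketch that builds the CREATE TABLE statement with an explicit column list — the table name, path, and columns are placeholders, not the asker's actual values:

```python
def create_table_sql(table, path, columns):
    """Build a CREATE TABLE statement with a manual schema so Spark never
    needs to infer one. `columns` is a list of (name, sql_type) pairs."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"USING CSV OPTIONS (path '{path}', header 'true')"
    )

# Usage sketch (column names hypothetical):
# spark.sql(create_table_sql("newtabletable", "/mnt/raw/data.csv",
#                            [("id", "INT"), ("name", "STRING")]))
```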
1 vote · 1 answer · 90 views
Databricks Performance Tuning for Joins Across 15 Tables with Around 200 Million Rows
As part of our Databricks notebook, we are trying to run SQL joining around 15 Delta tables: 1 fact table and around 14 dimension tables. The data coming out of the joins is around 200 million records.
...
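For a fact table joined to many small dimensions, broadcasting the dimensions avoids shuffling the 200-million-row fact table for those joins. A sketch of picking broadcast candidates by size — the 10 MB default mirrors Spark's `spark.sql.autoBroadcastJoinThreshold`, and the size figures in the usage comment are hypothetical:

```python
def dims_to_broadcast(dim_sizes, threshold_bytes=10 * 1024 * 1024):
    """Return the dimension tables small enough to broadcast to every
    executor, so their joins skip the shuffle of the large fact table."""
    return [name for name, size in dim_sizes.items() if size <= threshold_bytes]

# Usage sketch (sizes in bytes, values hypothetical):
# small = dims_to_broadcast({"dim_date": 2_000_000, "dim_geo": 50_000_000})
# then, with `from pyspark.sql.functions import broadcast`:
# fact.join(broadcast(dim_date_df), "date_sk")
```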
2 votes · 2 answers · 126 views
Spark reading CSV with bad records
I am trying to read a CSV file in Spark using a pre-defined schema. For this I use:
df = (spark.read.format("csv")
      .schema(schema)
      .option("sep", ";")
      ...
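For bad records with a pre-defined schema, the CSV reader's PERMISSIVE mode can keep unparseable lines in a dedicated column instead of failing the load or silently nulling them. A sketch of the relevant reader options (remember to also add a StringType `_corrupt_record` field to the schema itself):

```python
def permissive_csv_options(sep=";", corrupt_col="_corrupt_record"):
    """Reader options that route unparseable CSV lines into a dedicated
    column rather than failing the whole read."""
    return {
        "sep": sep,
        "mode": "PERMISSIVE",
        "columnNameOfCorruptRecord": corrupt_col,
    }

# Usage sketch:
# df = (spark.read.format("csv").schema(schema)
#       .options(**permissive_csv_options()).load(path))
# bad = df.filter("_corrupt_record IS NOT NULL")  # inspect the rejects
```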
0 votes · 1 answer · 74 views
Is the performance of inserting data into an Azure SQL database from Databricks affected by the sizing of the database?
I am working on a use case that needs to ingest a large volume of data (~10M rows) from an Azure Databricks materialized view into an Azure SQL database. The database uses the elastic standard tier (eDTU 50)...
0 votes · 0 answers · 61 views
No PYTHON_UID found for session (random uuid)
I'm facing a strange error when writing a stream from a DataFrame in Azure Databricks to a Postgres table:
First I use databricks-connect==14.2.1 to create a session for our ...
0 votes · 1 answer · 102 views
Databricks: create table using delta table path giving AnalysisException: The specified schema does not match existing schema at dbfs:/mnt/datalake/
I have a Delta table path. Based on it, I am trying to create a table using
create table if not exists dbo.DimCustomer
(CustomerSK BIGINT GENERATED BY DEFAULT AS IDENTITY,
first_name varchar(128),...