Newest 'google-cloud-dataproc' Questions

0 votes

1 answer

68 views

ModuleNotFoundError in GCP after trying to submit a job

I am new to GCP. I am trying to submit a job in Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...

SofiaNiki

  • 1

2 votes

1 answer

190 views

Spark memory error in thread spark-listener-group-eventLog

I have a PySpark application that uses GraphFrames to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...

Jesus Diaz Rivero

  • 332

1 vote

0 answers

74 views

Out of memory for a smaller dataset

I have a PySpark job that reads only ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM of type n2-highmem-...

user16798185

  • 387

1 vote

2 answers

90 views

How do you run Python Hadoop Jobs on Dataproc?

I am trying to run my Python code as a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...

The Beast

  • 163

1 vote

0 answers

42 views

Partial records being read in PySpark through Dataproc

I have a Google Dataproc job that reads a CSV file from Google Cloud Storage, containing the following headers: Content-type: application/octet-stream, Content-encoding: gzip, FileName: gs://...

Bob

  • 383

1 vote

2 answers

163 views

How to pass arguments from GCP Workflows into Dataproc

I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters, which are provided through the workflow API. I defined a GCP ...

54m

  • 777