Newest 'google-cloud-dataproc' Questions
0 votes
1 answer
68 views
ModuleNotFoundError in GCP after trying to submit a job
I am new to GCP. I am trying to submit a job to Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
2 votes
1 answer
190 views
Spark memory error in thread spark-listener-group-eventLog
I have a pyspark application which is using Graphframes to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...
1 vote
0 answers
74 views
Out of memory for a smaller dataset
I have a PySpark job that reads just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM of type n2-highmem-...
1 vote
2 answers
90 views
How do you run Python Hadoop Jobs on Dataproc?
I am trying to run my Python code as a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...
1 vote
0 answers
42 views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage, containing the following headers: Content-type: application/octet-stream, Content-encoding: gzip, FileName: gs://...
1 vote
2 answers
163 views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters, which are provided through the workflow API. I defined a GCP ...