I am creating a Dataproc cluster:
```
gcloud dataproc clusters create gputestsingleworker1 \
  --service-account svceuclid@wmt-euclid-dev.iam.gserviceaccount.com \
  --region us-central1 \
  --single-node \
  --enable-component-gateway \
  --subnet projects/shared-vpc-admin/regions/us-central1/subnetworks/prod-us-central1-02 \
  --no-address \
  --num-masters 1 \
  --master-accelerator type=nvidia-tesla-t4,count=1 \
  --num-master-local-ssds=1 \
  --master-machine-type n1-standard-48 \
  --scopes cloud-platform \
  --project wmt-euclid-dev \
  --optional-components=JUPYTER \
  --initialization-actions gs://my-bucket-1322/requirements_geodemand.sh \
  --image onedemand-base-20230421 \
  --properties="^#^dataproc:dataproc.logging.stackdriver.job.driver.enable=true#dataproc:dataproc.logging.stackdriver.enable=true#dataproc:jobs.file-backed-output.enable=true#dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true#dataproc:startup.component.service-binding-timeout.hive-server2=15000#hive:hive.metastore.schema.verification=false#hive:javax.jdo.option.ConnectionURL=jdbc:mysql://bfd-mysql.gcp-prod.glb.us.walmart.net:3306/metastore#hive:javax.jdo.option.ConnectionUserName=gensmrtfrcst#hive:javax.jdo.option.ConnectionPassword=mdpKJZJ5jY350CYG"
```
@prakhar_agrawal Welcome to the TensorFlow Forum!
Here are some strategies to address the CPU out-of-memory issue when increasing the batch size:
- Reduce the batch size: if possible, stick with a batch size that fits within your memory constraints. Experiment with smaller batch sizes to find the best balance between memory usage and training throughput.
- Optimize data types: use lower-precision types (e.g., float16 instead of float32) to shrink the memory footprint, and consider mixed-precision training where possible (see the mixed-precision sketch after this list).
- Employ gradient accumulation: split a large batch into smaller micro-batches and accumulate gradients across multiple iterations, simulating a larger effective batch size without increasing per-step memory usage (see the accumulation sketch below).
- Utilize gradient checkpointing: store only a subset of activations during the forward pass and recompute the rest during backpropagation, trading extra compute for lower memory consumption. This technique can be particularly helpful for large models (see the checkpointing sketch below).
- Free up memory: close unnecessary applications and processes to leave more memory for your TensorFlow model. Monitor memory usage with tools like `nvidia-smi` (for GPUs) or system-level utilities to identify potential bottlenecks.
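Here is a minimal mixed-precision sketch in Keras. The layer sizes, input shape, and optimizer are placeholders, not from your setup; the key pieces are the global policy and keeping the final layer in float32 so the logits stay numerically stable:

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Compute in float16 while keeping variables in float32.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),          # placeholder input shape
    layers.Dense(256, activation="relu"),
    layers.Dense(10, dtype="float32"),     # keep logits in float32 for stability
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

Note that `mixed_float16` gives the largest speedups on GPUs with Tensor Cores (like the T4); on CPU the benefit is mainly the smaller activation memory.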
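And a sketch of gradient accumulation with a custom training loop, assuming `model`, `optimizer`, `loss_fn`, and `dataset` are defined elsewhere; `accum_steps` micro-batches are combined into a single weight update:

```python
import tensorflow as tf

accum_steps = 4  # assumed number of micro-batches per update

def train_epoch(model, optimizer, loss_fn, dataset):
    # Running sums of gradients, one per trainable variable.
    accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # Divide by accum_steps so the summed gradients average out.
            loss = loss_fn(y, model(x, training=True)) / accum_steps
        grads = tape.gradient(loss, model.trainable_variables)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]
        if (step + 1) % accum_steps == 0:
            # Apply the accumulated gradients, then reset the sums.
            optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
            accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]
```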
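For gradient checkpointing, one option in TF 2.x is `tf.recompute_grad`, which re-runs the wrapped function during backpropagation instead of storing its activations. This is only a sketch: `CheckpointedBlock` is a hypothetical wrapper, the wrapped block is a placeholder, and behavior can vary across TF versions:

```python
import tensorflow as tf

class CheckpointedBlock(tf.keras.layers.Layer):
    """Hypothetical wrapper: recomputes the wrapped block's activations
    during the backward pass instead of storing them (compute for memory)."""

    def __init__(self, block, **kwargs):
        super().__init__(**kwargs)
        self.block = block

    def call(self, inputs, training=False):
        fn = lambda x: self.block(x, training=training)
        return tf.recompute_grad(fn)(inputs)

# Usage sketch with placeholder shapes and layers:
heavy = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dense(4096, activation="relu"),
])
block = CheckpointedBlock(heavy)
x = tf.random.normal([32, 512])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(block(x, training=True))
grads = tape.gradient(loss, block.trainable_variables)
```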
Let us know if this helps!