Why Pandas DataFrame.memory_usage Differs from psutil Process.memory_info: Understanding Memory Discrepancies in Python CSV Processing

For data scientists, analysts, and Python developers working with large datasets, monitoring memory usage is critical to avoid crashes, optimize performance, and ensure efficient resource allocation. When processing CSV files with Pandas, two common tools for memory measurement are often used:

  • pandas.DataFrame.memory_usage(): A Pandas method to estimate the memory consumed by a DataFrame.
  • psutil.Process.memory_info(): A method from the cross-platform psutil library that reports memory usage of the entire Python process.

However, these tools rarely report the same number. A DataFrame might show 100MB via memory_usage(), but psutil could report 500MB of process memory. This discrepancy can be confusing, leading to questions like: “Where is the extra memory going?” or “Is my DataFrame really using that much RAM?”

In this blog, we’ll demystify why these measurements differ, explore the underlying causes, and provide practical guidance to interpret both tools effectively.

Table of Contents#

  1. Understanding the Tools: What Do They Measure?
  2. Key Reasons for Memory Discrepancies
  3. Practical Example: Seeing Is Believing
  4. How to Reconcile the Two Measurements
  5. Conclusion
  6. References

1. Understanding the Tools: What Do They Measure?#

Before diving into discrepancies, let’s clarify what each tool actually measures.

Pandas memory_usage()#

The pandas.DataFrame.memory_usage() method estimates the memory consumed by a DataFrame’s data and metadata. Its behavior depends on the deep parameter:

  • deep=False (default): Measures “shallow” memory, including only the memory occupied by the DataFrame’s columns (e.g., the size of NumPy arrays for numeric columns) but not the data inside Python objects (e.g., strings in object dtype columns).
  • deep=True: Measures “deep” memory, recursively calculating the memory of objects within the DataFrame (e.g., the actual characters in string columns).

Example:
For a DataFrame with a float64 column (8 bytes per element) and 1M rows, memory_usage(deep=False) for that column would return 1e6 * 8 = 8,000,000 bytes (8MB).
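
A minimal sketch of the difference between the two modes (the column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "price": [1.5] * 1_000_000,        # float64: 8 bytes per element
    "label": ["item"] * 1_000_000,     # object dtype: 8-byte pointers to Python strings
})

print(df.memory_usage(deep=False))  # strings counted only as 8-byte pointers
print(df.memory_usage(deep=True))   # adds sys.getsizeof() of every string object
print(f"deep total: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")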

Limitations: It only measures the DataFrame itself, ignoring overhead from Pandas, Python, or other parts of the process.

psutil memory_info()#

The psutil.Process.memory_info() method reports memory usage for the entire Python process (by contrast, psutil.virtual_memory() reports system-wide memory). The most commonly referenced metric is rss (Resident Set Size), the amount of physical RAM (not swapped to disk) occupied by the process.

Example:
If your Python script loads a Pandas DataFrame, imports libraries like numpy and matplotlib, and has other variables, rss includes all of this: the DataFrame, Python interpreter, libraries, and temporary objects.

Key Point: psutil measures the process-wide memory footprint, not just the DataFrame.
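
For reference, a minimal way to read rss for the current process (the numbers you see depend entirely on what else your script has loaded):

import psutil

process = psutil.Process()              # handle to the current Python process
mem = process.memory_info()
print(f"RSS: {mem.rss / 1e6:.1f} MB")   # resident physical memory of the whole process
print(f"VMS: {mem.vms / 1e6:.1f} MB")   # total virtual address space reserved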

2. Key Reasons for Memory Discrepancies#

Now that we understand the tools, let’s explore why their results differ.

1. Scope of Measurement: Data vs. Entire Process#

The most fundamental difference is what they measure:

  • memory_usage(): Focused only on the DataFrame’s data (columns, dtypes, and optionally nested objects).
  • psutil: Measures everything the Python process is using, including:
    • The DataFrame (counted by memory_usage()).
    • Other variables (e.g., lists, dictionaries, or even other DataFrames).
    • Python interpreter overhead (e.g., object headers, reference counts).
    • Loaded libraries (e.g., NumPy, Pandas, csv parser, or matplotlib).
    • Temporary objects (e.g., during CSV parsing with pd.read_csv).

Example: If you load a 100MB DataFrame but also have a 200MB list of strings in your script, memory_usage() will report ~100MB, while psutil will report ~300MB+ (plus overhead).
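
A small sketch of the same idea (the sizes here are chosen purely for illustration and will vary by machine):

import pandas as pd
import psutil

df = pd.DataFrame({"x": range(10_000_000)})   # ~80 MB of int64 data
extra = ["padding"] * 20_000_000              # a large list the DataFrame knows nothing about

print(f"DataFrame: {df.memory_usage(deep=True).sum() / 1e6:.0f} MB")       # ~80 MB: the DataFrame only
print(f"Process RSS: {psutil.Process().memory_info().rss / 1e6:.0f} MB")   # DataFrame + list + interpreter + libraries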

2. Overhead: Python, Pandas, and Libraries#

Even if your script has only a single DataFrame, psutil will report more memory than memory_usage() due to overhead from:

Python Overhead#

Every Python object (e.g., integers, strings, DataFrames) has overhead:

  • Per-object overhead (type pointer, reference count, and other header fields): even an empty string costs ~49 bytes and a small integer ~28 bytes on 64-bit CPython (see the sys.getsizeof sketch below).
  • Memory alignment/padding (OSes and CPUs require data to be aligned in memory, leading to “wasted” space between objects).
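
A quick way to see this per-object cost is sys.getsizeof; the exact byte counts below are for 64-bit CPython and shift slightly between versions:

import sys

print(sys.getsizeof(0))        # ~28 bytes for a small int
print(sys.getsizeof(3.14))     # ~24 bytes for a float
print(sys.getsizeof(""))       # ~49 bytes of pure overhead for an empty str
print(sys.getsizeof("hello"))  # ~49-byte header + 5 characters = ~54 bytes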

Pandas Overhead#

Pandas adds its own layers:

  • Index objects: The DataFrame’s index (even a default integer index) consumes memory (e.g., a RangeIndex is tiny, but a MultiIndex can be large).
  • Metadata: Column names, dtypes, and internal bookkeeping (e.g., BlockManager or ArrayManager structures that organize data into blocks for efficient access).
  • NumPy dependency: Pandas stores columns in NumPy arrays under the hood, and each array carries its own metadata (shape, strides, dtype): roughly 100 bytes of overhead per array (see the sketch below).
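
A short sketch of the array and index overhead (byte counts are approximate and version-dependent):

import sys

import numpy as np
import pandas as pd

arr = np.zeros(1_000_000, dtype="int32")
print(arr.nbytes)           # 4,000,000 bytes of raw data
print(sys.getsizeof(arr))   # a little more: the array's own header and metadata

df = pd.DataFrame({"x": arr})
print(df.index)             # RangeIndex(start=0, stop=1000000, step=1)
print(df.memory_usage())    # the Index row is ~128 bytes for a RangeIndex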

Library Overhead#

Libraries like numpy, pandas, and psutil themselves occupy memory once imported. Just importing Pandas typically adds several tens of MB to the process footprint, before a single DataFrame is created; the exact figure depends on the Pandas/NumPy versions.
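
You can measure this yourself by sampling rss before and after the import; the delta you see depends on your Pandas/NumPy versions:

import psutil

process = psutil.Process()
before = process.memory_info().rss

import pandas  # deliberately imported mid-script so the added cost is visible

after = process.memory_info().rss
print(f"RSS added by importing pandas: {(after - before) / 1e6:.1f} MB")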

3. Data Representation: How Memory Is Actually Stored#

Pandas memory_usage() provides an estimate of data size, but the actual memory layout in RAM can differ:

Numeric Columns#

For numeric dtypes (e.g., int64, float32), memory_usage() calculates n_rows * dtype_size (e.g., 1M rows × 4 bytes for int32 = 4MB). This is accurate for the raw data, but the NumPy array storing it has ~100 bytes of overhead (metadata), which memory_usage() ignores but psutil includes.
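
A quick check of the rows × dtype-size arithmetic:

import pandas as pd

s = pd.Series(range(1_000_000), dtype="int32")
print(s.memory_usage(index=False))  # 4,000,000 bytes: 1,000,000 rows x 4 bytes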

Object Columns#

For object dtype columns (e.g., strings, Python lists), memory_usage(deep=True) does inspect the nested objects, but the result is still an estimate:

  • With deep=False, a string column with 1M rows is reported as just the 8-byte pointers: 1e6 * 8 = 8MB, regardless of how long the strings are.
  • With deep=True, Pandas adds sys.getsizeof() for every element. The string "hello" occupies 54 bytes in CPython (a ~49-byte object header plus 5 characters), so the same column is reported as roughly 8MB + 54MB ≈ 62MB.
  • Neither figure has to match psutil: the allocator rounds allocations up and keeps freed blocks for reuse, and deep=True counts a shared (e.g., interned) string once per row even though it exists only once in RAM (see the sketch below).
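
A minimal sketch of both effects, using a column where every row holds the same five-character string:

import sys

import pandas as pd

s = pd.Series(["hello"] * 1_000_000)              # object dtype; all rows share ONE str object
print(sys.getsizeof("hello"))                     # ~54 bytes per string object
print(s.memory_usage(index=False, deep=False))    # ~8 MB: just the 8-byte pointers
print(s.memory_usage(index=False, deep=True))     # ~62 MB: pointers + getsizeof() per row,
                                                  # even though only ~54 bytes of string exist in RAM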

Categorical Data#

Pandas category dtype reduces memory by storing each unique value only once and keeping a small integer code per row. memory_usage() reflects this efficiency (the codes plus one copy of the categories), while psutil additionally captures the temporary objects and allocator overhead involved in building the categorical.
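
A short sketch of the effect, using a hypothetical low-cardinality column:

import pandas as pd

s = pd.Series(["US", "DE", "FR", "JP"] * 250_000)                  # 1M rows, 4 unique values
print(s.memory_usage(deep=True) / 1e6, "MB")                       # roughly 59 MB as plain strings
print(s.astype("category").memory_usage(deep=True) / 1e6, "MB")    # roughly 1 MB: int8 codes + 4 categories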

4. Garbage Collection: Uncollected “Ghost” Objects#

Python frees most objects as soon as their reference count drops to zero, and a cyclic garbage collector (GC) periodically cleans up objects trapped in reference cycles. Two things can still keep memory looking "used" after a del:

  • Objects involved in reference cycles are only reclaimed when the GC runs (periodically, or when you call gc.collect()).
  • Even after an object is freed, CPython's allocator and the system malloc often keep the pages for reuse instead of returning them to the OS.

Impact: If you delete a DataFrame and immediately check psutil, rss may remain high. memory_usage(), on the other hand, can no longer be consulted at all: the df name is gone, so calling it raises a NameError.

Example:

import gc

import pandas as pd
import psutil

df = pd.DataFrame({'data': [1.0] * 1_000_000})
print("DataFrame memory:", df.memory_usage(deep=True).sum() / 1e6, "MB")           # ~8 MB
print("Process RSS before del:", psutil.Process().memory_info().rss / 1e6, "MB")   # much higher: interpreter + libraries

del df
# Calling df.memory_usage() here would raise NameError: the name df no longer exists,
# yet the process may still hold on to the memory the DataFrame occupied.
print("Process RSS after del:", psutil.Process().memory_info().rss / 1e6, "MB")    # may stay high: freed memory is not always returned to the OS

gc.collect()  # force a collection; mainly helps when objects sit in reference cycles
print("Process RSS after GC:", psutil.Process().memory_info().rss / 1e6, "MB")     # often lower, rarely back to the pre-DataFrame level

5. Process-Wide Factors: Fragmentation and External Libraries#

Other process-level factors inflate psutil measurements:

  • Memory Fragmentation: The Python heap becomes fragmented over time as small objects are allocated and freed. Partially used pages cannot be handed back to the OS, so psutil keeps counting them as resident even though much of the space inside them is free.
  • External Libraries: Tools like matplotlib (plots), scikit-learn (models), or dask (parallel processing) add their own memory footprints, which psutil includes but memory_usage() ignores.
  • CSV Parsing Overhead: When using pd.read_csv, temporary objects (e.g., file buffers, parser state) are created during parsing. These are deleted after the DataFrame is built, but if psutil is checked mid-parsing, rss will spike temporarily.

3. Practical Example: Seeing Is Believing#

Let’s walk through a concrete example to demonstrate discrepancies. We’ll:

  1. Create a sample CSV.
  2. Load it into a Pandas DataFrame.
  3. Compare memory_usage() and psutil measurements.

Step 1: Create a Sample CSV#

Generate a CSV with 1M rows, 2 columns: a numeric column (int64) and a string column (object dtype):

import csv  
 
with open("sample_data.csv", "w", newline="") as f:  
    writer = csv.writer(f)  
    writer.writerow(["id", "name"])  # Header  
    for i in range(1_000_000):  
        writer.writerow([i, f"user_{i}"])  # 1M rows  

Step 2: Load Data and Measure Memory#

import pandas as pd  
import psutil  
 
# psutil handle to the current Python process
process = psutil.Process()  
 
# Load CSV  
df = pd.read_csv("sample_data.csv")  
 
# Measure DataFrame memory (deep=True to include strings)  
df_memory = df.memory_usage(deep=True).sum() / 1e6  # Convert to MB  
print(f"DataFrame memory (deep=True): {df_memory:.2f} MB")  
 
# Measure process RSS (physical memory used)  
process_memory = process.memory_info().rss / 1e6  
print(f"Process RSS (psutil): {process_memory:.2f} MB")  

Expected Output#

DataFrame memory (deep=True): ~75 MB   (exact value depends on the Python/Pandas version)
Process RSS (psutil): considerably higher, typically well above 100 MB (varies by platform and library versions)

Why the Difference?#

  • DataFrame Memory (the sketch below prints the per-column breakdown):
    • id column: 1M rows × 8 bytes (int64) = 8MB.
    • name column: 8MB of object pointers plus one Python string object per row; a value like "user_123456" costs ~60 bytes under sys.getsizeof (a ~49-byte header plus ~11 characters), so roughly 60MB more. Total ≈ 75MB.
  • Process RSS:
    • The DataFrame itself: ~75MB.
    • The Python interpreter plus imported libraries (pandas, NumPy, psutil): typically several tens of MB before any data is loaded.
    • Pandas/NumPy metadata (indexes, BlockManager blocks), allocator overhead, and heap fragmentation.
    • Residual buffers and temporary objects left over from pd.read_csv.
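
To see where the DataFrame portion comes from, you can print the per-column breakdown (this continues the Step 2 script and re-reads the CSV created in Step 1):

import pandas as pd
import psutil

df = pd.read_csv("sample_data.csv")
print(df.memory_usage(deep=True))      # bytes for the index, "id", and "name" columns
df.info(memory_usage="deep")           # the same total in summary form
print(f"Process RSS: {psutil.Process().memory_info().rss / 1e6:.1f} MB")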

4. How to Reconcile the Two Measurements#

While discrepancies are normal, you can use both tools to optimize memory usage:

Use memory_usage() to Optimize DataFrames#

  • Downcast Numeric Columns: Convert int64 → int32 or float64 → float32 where the value range allows it (e.g., pd.to_numeric(df["col"], downcast="integer")); see the sketch after this list.
  • Use category Dtype: For string columns with few unique values (e.g., a “country” column with ~100 distinct values), the category dtype can cut memory dramatically, often by 90% or more.
  • Drop Unused Columns: Remove columns not needed for analysis with df.drop(columns=[...]).
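
A minimal sketch of the first two techniques on a hypothetical frame (column names and sizes are illustrative):

import pandas as pd

df = pd.DataFrame({
    "clicks": list(range(1_000_000)),                 # int64 by default
    "country": ["US", "DE", "FR", "JP"] * 250_000,    # low-cardinality strings
})
print(f"before: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")

df["clicks"] = pd.to_numeric(df["clicks"], downcast="integer")  # int64 -> int32 here
df["country"] = df["country"].astype("category")                # 4 unique values -> int8 codes

print(f"after:  {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")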

Use psutil to Monitor Process Health#

  • Check for Leaks: If rss grows indefinitely (even after deleting DataFrames), suspect uncollected garbage or memory leaks (e.g., global variables, circular references).
  • Profile Peak Usage: Track rss with psutil during CSV parsing or other processing steps to identify bottlenecks (e.g., temporary objects created inside read_csv); a sketch follows this list.
  • Force Garbage Collection: Use gc.collect() to free memory before critical operations (e.g., loading a second large DataFrame).
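
A sketch that ties these together: sample rss around a read_csv call and again after deleting the frame (it assumes the sample_data.csv file from the practical example exists):

import gc

import pandas as pd
import psutil

process = psutil.Process()

def rss_mb() -> float:
    """Resident memory of the current process, in MB."""
    return process.memory_info().rss / 1e6

print(f"baseline RSS:    {rss_mb():.1f} MB")
df = pd.read_csv("sample_data.csv")
print(f"after read_csv:  {rss_mb():.1f} MB")    # DataFrame plus any parsing leftovers

del df
gc.collect()                                    # reclaim what the collector can
print(f"after del + gc:  {rss_mb():.1f} MB")    # usually lower, rarely back to baseline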

5. Conclusion#

The discrepancy between pandas.DataFrame.memory_usage() and psutil.Process.memory_info() is normal and expected. To summarize:

  • memory_usage() measures only the DataFrame’s data and nested objects (with deep=True).
  • psutil measures the entire Python process, including the DataFrame, overhead, libraries, and other variables.

By understanding their differences, you can:

  • Use memory_usage() to optimize DataFrame storage.
  • Use psutil to ensure your process doesn’t exceed system memory limits.

Both tools are essential for robust memory management in Python data processing!

6. References#