Utility functions » load data properly

This notebook addresses one common issue: "The training data is huge; simply reading the full csv at once may be too much". The main cause is not using the correct data type for each feature in your dataset. Pandas loads every column that contains numpy.nan values as numpy.float64, and most of the time we don't need numpy.float64. Assume there is a binary feature that can only take the values 0 and 1. The data type numpy.int8 is more than enough to store this data. However, if there is even a single numpy.nan element in the column, Pandas (and NumPy) stores all elements of that column as numpy.float64. Refer to the following example:

In [28]:
import numpy as np
import pandas as pd

a = np.zeros(10).astype(np.int8)
a[1], a[5] = 1, 1

print('A numpy array of binary feature:')
print(a)
print('data type:', a.dtype)

print('\nA numpy array of binary feature having a numpy.nan element:')
c = np.zeros(10)
c[1], c[2], c[5] = 1, np.nan, 1
print(c)
print('data type:', c.dtype)

print('\nThe same numpy array stored as numpy.int8 dtype:')
c = c.astype(np.int8)
print(c)
print('data type:', c.dtype)
print('Notice how the numpy.nan was replaced by zero.')

A numpy array of binary feature:
[0 1 0 0 0 1 0 0 0 0]
data type: int8

A numpy array of binary feature having a numpy.nan element:
[ 0.  1. nan  0.  0.  1.  0.  0.  0.  0.]
data type: float64

The same numpy array stored as numpy.int8 dtype:
[0 1 0 0 0 1 0 0 0 0]
data type: int8
Notice how the numpy.nan was replaced by zero.


On Kaggle, most competitions (if not all) present the data as .csv files. Further, almost all of the datasets contain numpy.nan values. Hence, when you load the data, most of the columns will be loaded with the numpy.float64 data type and will take a good chunk of your available memory.
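If you already know your columns, you can sidestep the float64 default at load time by passing an explicit dtype mapping to pandas.read_csv(). A minimal sketch (the column names and values below are made up for illustration):

```python
import io

import numpy as np
import pandas as pd

# A tiny CSV standing in for a large competition file (hypothetical data).
csv_data = "user_id,clicks,is_fraud\n1,3,0\n2,7,1\n3,2,0\n"

# Default load: pandas infers a wide integer type for every numeric column.
df_default = pd.read_csv(io.StringIO(csv_data))
print(df_default.dtypes)

# Explicit dtypes: tell pandas the smallest type each column needs.
df_typed = pd.read_csv(
    io.StringIO(csv_data),
    dtype={"user_id": np.int32, "clicks": np.int16, "is_fraud": np.int8},
)
print(df_typed.dtypes)
print("default memory:", df_default.memory_usage(deep=True).sum(), "bytes")
print("typed memory:  ", df_typed.memory_usage(deep=True).sum(), "bytes")
```

Note that this only works for columns without numpy.nan values; a column containing missing data still cannot be read directly into an integer dtype, which is what the approaches below try to work around.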

reduce_memory_usage() function

Several approaches have been introduced over time to overcome this issue. One is the famous reduce_memory_usage() function, originally introduced in https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65. Below is the function:

In [30]:
def reduce_memory_usage(df):
    start_mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings

            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()

            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col].fillna(mn - 1, inplace=True)

            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)

    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return df, NAlist


First of all, this function automatically fills in your null values for you! That is not exactly what you asked for, and it is actually a big deal, since the strategy used to fill the missing values can have a direct impact on the performance of your model. To avoid this, you can fill the missing values yourself first and then call the function on your dataframe. Another issue with this function is that there are some hidden pitfalls in using it, as described in https://www.kaggle.com/c/champs-scalar-coupling/discussion/96655. Some people have tried to improve the function to enhance its performance, but I prefer - and recommend - not to use it at all.
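To make the "fill first, then downcast" advice concrete, here is a small sketch with a hypothetical dataframe. The median and zero fills are just example strategies chosen for illustration, not recommendations:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing values in numeric columns.
df = pd.DataFrame({"amount": [10.5, np.nan, 3.2], "count": [1.0, 2.0, np.nan]})

# Choose your own imputation strategy first (median and zero here, as examples)...
df["amount"] = df["amount"].fillna(df["amount"].median())
df["count"] = df["count"].fillna(0)

# ...and only then downcast, so no function ever picks a fill value for you.
df["count"] = df["count"].astype(np.int8)
print(df.dtypes)
```

This way the imputation is an explicit modelling decision in your notebook instead of a side effect of a memory utility.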

Convert float64 to float32

An alternative to the above-mentioned function is to simply convert all numpy.float64 columns to numpy.float32. This one is very straightforward and easy to implement (see below). However, care should be taken when using it. In some instances you don't care much about the decimal places: for example, if the feature is a transaction amount, there should be no sensible difference between 10.9991 and 10.9990912312312 (assume the amount was converted from another currency to US dollars). On the other hand, when it comes to latitude and longitude coordinates, every single digit might be important. Long story short, you will need to pay close attention to this issue before using it on your dataset.

In [31]:
def float64_to_float32(df):
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
    return df
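To see why coordinates are risky, it helps to measure how much precision a single value loses in the cast. A small sketch with a made-up latitude (numpy.float32 keeps roughly 7 significant decimal digits):

```python
import numpy as np

# A latitude with many significant digits (made-up coordinate).
lat64 = np.float64(40.712776012345678)
lat32 = np.float32(lat64)  # round to the nearest representable float32

print("float64:", repr(lat64))
print("float32:", repr(lat32))
print("absolute error:", abs(float(lat32) - float(lat64)))
```

For a transaction amount an error in the sixth decimal place is harmless, but for coordinates an error of that size can correspond to a shift of tens of centimetres on the ground, which may or may not matter for your model.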


Best solution

The best solution I have seen so far is to use numpy.savez_compressed() to save your dataset in the compressed .npz format and then load it with numpy.load().

This is very straightforward. As an example, https://www.kaggle.com/friedchips/how-to-reduce-the-training-data-to-400mb shows that using this approach reduced the load time from 2min 37s with pandas.read_csv() to 7.88 s with numpy.load(). Further, the file size was reduced significantly without any precision loss.
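A minimal sketch of the round trip (the array shapes, random stand-in data, and file name are made up; your own feature matrix and target would go here):

```python
import os
import tempfile

import numpy as np

# Stand-in training data; downcast once before saving.
X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(0, 2, size=1000).astype(np.int8)

# Save both arrays into a single compressed .npz archive...
path = os.path.join(tempfile.gettempdir(), "train_data.npz")
np.savez_compressed(path, X=X, y=y)

# ...and load them back by name; dtypes and values round-trip exactly.
data = np.load(path)
X_loaded, y_loaded = data["X"], data["y"]
print(X_loaded.dtype, y_loaded.dtype)
print("compressed size:", os.path.getsize(path), "bytes")
```

Since the dtypes are stored in the archive, you also only pay the downcasting cost once: every later numpy.load() gives you the compact arrays directly, with no csv parsing and no dtype inference.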