This notebook addresses a common issue: "**The training data is huge, simply reading the full csv at once may be too much**". The main cause is not using the correct data type for each feature in your dataset. Pandas loads every numeric column that contains `numpy.nan` values as `numpy.float64`, and most of the time we don't need `numpy.float64`. Assume there is a binary feature that can take only the values 0 and 1. The data type `numpy.int8` is more than enough to store this data. However, if there is even a single `numpy.nan` element in the column, Pandas (and NumPy) stores all elements of that column as `numpy.float64`. Refer to the following example:

```
import numpy as np
import pandas as pd
a = np.zeros(10).astype(np.int8)
a[1], a[5] = 1, 1
print('A numpy array of binary feature:')
print(a)
print('data type:', a.dtype)
print('\nA numpy array of binary feature having a numpy.nan element:')
c = np.zeros(10)
c[1], c[2], c[5] = 1, np.nan, 1
print(c)
print('data type:', c.dtype)
print('\nThe same numpy array stored as numpy.int8 dtype:')
c = c.astype(np.int8)
print(c)
print('data type:', c.dtype)
print('Notice how the numpy.nan was silently replaced by zero (the result of casting NaN to an integer is platform-dependent).')
```

Most (if not all) Kaggle competitions provide the data as `.csv` files. Further, almost all of the datasets contain missing values, which are read as `numpy.nan`. Hence, when you load the data, most of the columns will be loaded as `numpy.float64` and will take up a good chunk of your available memory.
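
To make this concrete, here is a minimal sketch with a tiny, made-up CSV (the column names `id` and `flag` are hypothetical). It only illustrates how a single empty cell changes the dtype that `pandas.read_csv()` infers:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV: "flag" is binary, but one row is missing its value.
csv_text = "id,flag\n1,0\n2,1\n3,\n4,1\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # "id" is read as an integer, "flag" as float64 because of the gap

# Bytes per element: the missing value forces 8 bytes where 1 would do.
print(df["flag"].dtype.itemsize, np.dtype(np.int8).itemsize)
```

On a column with millions of rows, that factor of eight per element is where the memory goes.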

Several ways to overcome this issue have been introduced over time.
One way is to use the well-known `reduce_memory_usage()` function originally introduced in https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65. The function is shown below:

```
def reduce_memory_usage(df):
    start_mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in.
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings
            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col].fillna(mn-1,inplace=True)
            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return df, NAlist
```

First of all, this function automatically fills in your null values for you. That is not what you asked for, and it is a big deal, since the strategy used to fill missing values can have a direct impact on the performance of your model. To avoid this, you can fill the missing values yourself first and then call the function on your dataframe. Another issue is that the function has some hidden pitfalls, as described in https://www.kaggle.com/c/champs-scalar-coupling/discussion/96655. Some people have tried to improve the function, but I prefer, and recommend, not using it at all.
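
As a rough illustration of the "fill first, shrink afterwards" order described above, here is a minimal sketch. The dataframe, the column names, and the imputation choices (zero and the median) are all assumptions made for the example, and pandas' own `pd.to_numeric(..., downcast=...)` is used as a stand-in for `reduce_memory_usage()`, not as a reimplementation of it:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a binary feature and an amount, each with one gap.
df = pd.DataFrame({
    "is_fraud": [0.0, 1.0, np.nan, 1.0],
    "amount": [10.5, np.nan, 3.25, 8.0],
})

# Choose the imputation strategy yourself (these choices are assumptions).
df["is_fraud"] = df["is_fraud"].fillna(0)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Downcast only after no missing values remain.
df["is_fraud"] = pd.to_numeric(df["is_fraud"], downcast="integer")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")

print(df.dtypes)
```

The point of this order is that the fill values are an explicit modelling decision rather than a side effect of a memory optimisation.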

An alternative to the above-mentioned function is to simply convert all `numpy.float64` columns to `numpy.float32`. This is very straightforward and easy to implement (see below). However, care should be taken when using it. In some instances you don't care much about the decimal places: for example, if the feature is a transaction amount, there should be no meaningful difference between 10.9991 and 10.9990912312312 (assume the amount was converted from another currency to US dollars). On the other hand, when it comes to latitude and longitude coordinates, every single digit might be important. Long story short, you need to check how much precision each feature requires before applying this conversion to your dataset.

```
def float64_to_float32(df):
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
    return df
```
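
To get a feel for the precision caveat, the sketch below round-trips two made-up values through `numpy.float32` and prints the error each one picks up. The longitude is a hypothetical coordinate chosen for illustration; the amount is the one from the example above.

```python
import numpy as np

# Hypothetical values: a converted transaction amount and a longitude.
amount = 10.9990912312312
longitude = -122.41941550

for name, value in [("amount", amount), ("longitude", longitude)]:
    as32 = float(np.float32(value))  # round-trip through float32
    print(name, value, "->", as32, "abs error:", abs(as32 - value))

# Spacing between adjacent float32 values near each magnitude.
print(np.spacing(np.float32(10.0)), np.spacing(np.float32(122.0)))
```

Near 122, adjacent `float32` values are roughly 8e-6 apart, so a coordinate in degrees is snapped to a grid of a little under a metre on the ground. Whether that matters depends on the feature, which is why the conversion should be checked column by column.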

The best solution I have seen so far is to use `numpy.savez_compressed()` to save your dataset in the compressed `.npz` format and then load it using `numpy.load()`.

This is very straightforward. As an example, https://www.kaggle.com/friedchips/how-to-reduce-the-training-data-to-400mb shows that this approach reduced the load time from 2min 37s with `pandas.read_csv()` to 7.88 s with `numpy.load()`. Further, the file size was reduced significantly without any loss of precision.
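
A minimal sketch of the save and load round trip follows. It is not the code from the linked notebook: the dataframe, the column names, and the file name `train.npz` are hypothetical, and it assumes all columns are numeric (object or string columns would need pickling, which `numpy.load()` does not allow by default).

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical numeric dataframe standing in for the training data.
df = pd.DataFrame({
    "feature_a": np.arange(5, dtype=np.int8),
    "feature_b": np.array([0.5, np.nan, 2.0, 3.5, 4.0]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.npz")  # hypothetical file name

    # One array per column; each array keeps its own dtype.
    np.savez_compressed(path, **{col: df[col].to_numpy() for col in df.columns})

    # np.load returns a lazy NpzFile; read the arrays before closing it.
    with np.load(path) as data:
        restored = pd.DataFrame({name: data[name] for name in data.files})

print(restored.dtypes)
```

Because each column is stored as a binary array with its dtype, nothing has to be parsed from text or re-inferred on load, and the values (including `numpy.nan`) come back unchanged.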