This notebook addresses one common issue: "The training data is huge; simply reading the full csv at once may be too much". The main culprit is not using the correct data type for each feature in your dataset. Pandas loads every column that contains numpy.nan values as numpy.float64, and most of the time we don't need numpy.float64. Assume there is a binary feature that can take only 0 and 1. Data type numpy.int8 is more than enough to store this data. However, if there is only one numpy.nan element in the data set, Pandas (and NumPy) stores all elements of that column as numpy.float64. Refer to the following example:
```python
import numpy as np
import pandas as pd

a = np.zeros(10).astype(np.int8)
a[1], a[5] = 1, 1
print('A numpy array of binary feature:')
print(a)
print('data type:', a.dtype)

print('\nA numpy array of binary feature having a numpy.nan element:')
c = np.zeros(10)
c[1], c[2], c[5] = 1, np.nan, 1
print(c)
print('data type:', c.dtype)

print('\nThe same numpy array stored as numpy.int8 dtype:')
c = c.astype(np.int8)
print(c)
print('data type:', c.dtype)
print('Notice how the numpy.nan was replaced by zero.')
```
```
A numpy array of binary feature:
[0 1 0 0 0 1 0 0 0 0]
data type: int8

A numpy array of binary feature having a numpy.nan element:
[ 0.  1. nan  0.  0.  1.  0.  0.  0.  0.]
data type: float64

The same numpy array stored as numpy.int8 dtype:
[0 1 0 0 0 1 0 0 0 0]
data type: int8
Notice how the numpy.nan was replaced by zero.
```
On Kaggle, most competitions (if not all) present the data as .csv files. Further, almost all of the datasets contain numpy.nan values. Hence, when you load the data, most of the columns will be loaded as the numpy.float64 data type and will take a good chunk of your available memory.
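As a small illustration (my own sketch with a made-up single-column CSV, not from the original notebook), compare how pandas infers the dtype with and without a missing value:

```python
import io
import numpy as np
import pandas as pd

# A hypothetical binary feature with a single missing value ("NA").
csv = "flag\n0\n1\nNA\n0\n1\n"
df = pd.read_csv(io.StringIO(csv))

# The one missing value forces the whole column to numpy.float64:
print(df["flag"].dtype)  # float64

# Without the missing value, pandas infers an integer dtype instead.
df_clean = pd.read_csv(io.StringIO("flag\n0\n1\n0\n1\n"))
print(df_clean["flag"].dtype)  # int64
```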
Some approaches have been introduced over time to overcome this issue. One way is to use the famous reduce_memory_usage() function originally introduced in https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65. Below is the function:
```python
def reduce_memory_usage(df):
    start_mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in df.columns:
        if df[col].dtype != object:  # Exclude strings
            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all():
                NAlist.append(col)
                df[col].fillna(mn - 1, inplace=True)
            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return df, NAlist
```
First of all, this function automatically fills in your null values for you. That is not exactly what you asked for, and it is actually a big deal, since the strategy used to fill the missing values can have a direct impact on the performance of your model. To avoid this, you can first fill the missing values yourself and then call the function on your dataframe. Another issue with this function is that there are some hidden pitfalls in using it, as described in https://www.kaggle.com/c/champs-scalar-coupling/discussion/96655. Some people have tried to improve the function to enhance its performance, but I prefer - and recommend - not to use it at all.
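One safer pattern (my own sketch, not part of the original kernel) is to choose the fill value explicitly yourself and then let pandas downcast the column with pd.to_numeric():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"binary": [0.0, 1.0, np.nan, 1.0, 0.0]})

# Pick the fill strategy yourself (here: 0), since it can affect the model.
df["binary"] = df["binary"].fillna(0)

# Let pandas choose the smallest integer dtype that fits the values.
df["binary"] = pd.to_numeric(df["binary"], downcast="integer")
print(df["binary"].dtype)  # int8
```

This way the conversion is explicit and the imputation strategy stays under your control.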
An alternative to the above function is to simply convert all numpy.float64 columns to numpy.float32. This one is very straightforward and easy to implement (see below). However, care should be taken when using it. In some instances you don't care much about the decimal places: for example, if the feature is a transaction amount, there should not be a sensible difference between 10.9991 and 10.9990912312312 (assume the amount was converted from another currency to US dollars). On the other hand, when it comes to latitude and longitude coordinates, every single digit might be important. Long story short, you will need to pay close attention to this issue before using it on your dataset.
```python
def float64_to_float32(df):
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
    return df
```
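To see why coordinates are risky, here is a quick check (my own illustration, with a hypothetical latitude value) of how many digits survive the cast:

```python
import numpy as np

# A latitude with many significant digits (hypothetical value).
lat64 = np.float64(40.712776543210987)
lat32 = np.float32(lat64)

# float32 keeps only about 7 significant decimal digits,
# so casting back to float64 does not recover the original value.
print(lat64)
print(np.float64(lat32))
print(np.float64(lat32) == lat64)  # False: precision was lost
```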
The best solution I have seen so far is to use numpy.savez_compressed() to save your dataset in the compressed .npz format and then load it with numpy.load(). This is very straightforward; as an example, https://www.kaggle.com/friedchips/how-to-reduce-the-training-data-to-400mb shows that this approach reduced the load time from 2min 37s with pandas.read_csv() to 7.88 s with numpy.load(). Further, the file size was reduced significantly without any precision loss.
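A minimal round trip might look like this (a sketch with toy data; saving each column as its own array keeps the per-column dtypes intact, since stacking mixed dtypes into one matrix would upcast them):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the training data, already downcast.
df = pd.DataFrame({
    "f1": np.random.rand(1000).astype(np.float32),
    "f2": np.random.randint(0, 2, 1000).astype(np.int8),
})

path = os.path.join(tempfile.mkdtemp(), "train.npz")

# One compressed array per column preserves each column's dtype.
np.savez_compressed(path, **{col: df[col].to_numpy() for col in df.columns})

# numpy.load returns a dict-like NpzFile; rebuild the dataframe from it.
loaded = np.load(path)
df2 = pd.DataFrame({col: loaded[col] for col in loaded.files})
print(df2.dtypes)  # f1 stays float32, f2 stays int8
```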