Helper functions to 'wrangle' data once it is in a pandas DataFrame.

drop_low_uniqueness_cols

drop_low_uniqueness_cols(df:DataFrame, nunique_thold=0.05)

Drop columns with a low number of unique values.

Parameters:
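  • df (pd.DataFrame): A pandas dataframe.
  • nunique_thold (float or int): If a float, columns whose uniqueness rate (unique values divided by number of rows) is below nunique_thold are dropped. If an int, the count of unique values is compared against nunique_thold instead.
Returns:
  • df (pd.DataFrame): The dataframe with the low-uniqueness columns removed.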
# tests

import pandas as pd

df = pd.DataFrame([
    [1,2,3,4],
    [1,20,30,4],
    [1,200,300,40],
    [1,2000,3000,40],
    [1,20000,30000,400],
    [10,200000,300000,400],
], columns=['col0','col1','col2','col3'])

# check that col0 is removed as it has only 2 unique values (1 and 10)
assert 'col0' not in drop_low_uniqueness_cols(df, nunique_thold=2).columns
# check that col3 is removed as only 50% of its values are unique (3 of 6 rows)
assert 'col3' not in drop_low_uniqueness_cols(df, nunique_thold=0.5).columns
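
The function body is not shown on this page, so the following is only a minimal sketch of the count-versus-rate logic described above, not the library's source. The name drop_low_uniqueness_cols_sketch is hypothetical, and the inclusive comparison (a column exactly at the threshold is dropped) is an assumption inferred from the tests, where col0 with 2 unique values is removed at nunique_thold=2.

```python
import pandas as pd


def drop_low_uniqueness_cols_sketch(df: pd.DataFrame, nunique_thold=0.05) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration only."""
    nunique = df.nunique()  # number of distinct non-null values per column
    if isinstance(nunique_thold, float):
        # Float threshold: score each column by unique values / number of rows.
        score = nunique / len(df)
    else:
        # Int threshold: score each column by its raw unique-value count.
        score = nunique
    # Assumption: inclusive comparison, inferred from the tests above.
    low_cols = score[score <= nunique_thold].index
    return df.drop(columns=low_cols)


df = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 20, 30, 4],
    [1, 200, 300, 40],
    [1, 2000, 3000, 40],
    [1, 20000, 30000, 400],
    [10, 200000, 300000, 400],
], columns=['col0', 'col1', 'col2', 'col3'])

print(list(drop_low_uniqueness_cols_sketch(df, nunique_thold=2).columns))    # ['col1', 'col2', 'col3']
print(list(drop_low_uniqueness_cols_sketch(df, nunique_thold=0.5).columns))  # ['col1', 'col2']
```

Passing 2 and 2.0 select different modes in this sketch (count versus rate), so the type of the threshold matters as much as its value.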

drop_low_std_cols

drop_low_std_cols(df:DataFrame, std_thold=0.05)

Drop columns with a low standard deviation value.

Parameters:
  • df (pd.DataFrame): A pandas dataframe.
  • std_thold (float): Standard deviation threshold. Columns whose standard deviation is below this value are dropped.
Returns:
  • df (pd.DataFrame): The dataframe with the low-standard-deviation columns removed.
# tests

df = pd.DataFrame([
    [1,2,3,4],
    [1,20,30,4],
    [1,200,300,40],
    [1,2000,3000,40],
    [1,20000,30000,400],
    [1.1,200000,300000,400],
], columns=['col0','col1','col2','col3'])

# check that col0 is removed as it only has 2 unique values and a low std value (0.040825)
assert 'col0' not in drop_low_std_cols(df, std_thold=0.05).columns
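
As with the previous function, the implementation is not shown on this page. A minimal sketch of the idea, under the assumption that the standard deviation is the pandas default (sample standard deviation, ddof=1) and that the comparison is strictly "below", might look like the following. The name drop_low_std_cols_sketch is hypothetical.

```python
import pandas as pd


def drop_low_std_cols_sketch(df: pd.DataFrame, std_thold: float = 0.05) -> pd.DataFrame:
    """Hypothetical re-implementation for illustration only."""
    # Per-column sample standard deviation (ddof=1 is the pandas default).
    # Assumption: non-numeric columns are skipped rather than dropped.
    stds = df.std(numeric_only=True)
    low_cols = stds[stds < std_thold].index
    return df.drop(columns=low_cols)


df = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 20, 30, 4],
    [1, 200, 300, 40],
    [1, 2000, 3000, 40],
    [1, 20000, 30000, 400],
    [1.1, 200000, 300000, 400],
], columns=['col0', 'col1', 'col2', 'col3'])

print(round(df['col0'].std(), 6))  # 0.040825
print(list(drop_low_std_cols_sketch(df, std_thold=0.05).columns))  # ['col1', 'col2', 'col3']
```

The threshold is in the same units as each column, so a single std_thold is most meaningful when the columns share a comparable scale.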