Pre-processing

These functions perform pre-processing steps on the raw data.

rename_columns

Rename the columns so that they have computer names as defined in the given data dictionary.

Tip

Renaming the columns should always be the first pre-processing step.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas DataFrame | The pandas DataFrame. | required |
| df_data_dictionary | pandas DataFrame | DataFrame with data dictionary information. It should have at least the following columns: Name (the original name of the column in the raw data) and Computer name (the computer name defined in the dictionary). | required |
| verbose | bool | Define if verbose output will be printed (True) or not (False). | True |

Returns:

| Name | Type | Description |
|---|---|---|
| df_renamed | pandas DataFrame | Same as input df, but with the columns renamed according to the data dictionary mapping. |
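
As a rough illustration of the mapping described above, the following sketch builds a small, made-up data dictionary and applies the same kind of rename with plain pandas. The column names (`WBC count`, `wbc`, and so on) are invented for the example, and the snippet mirrors the documented behavior rather than calling `rename_columns` itself.

```python
import pandas as pd

# Hypothetical raw data; these column names are invented for illustration.
df_raw = pd.DataFrame({
    "WBC count": [5.1, 6.3],
    "Sample ID": ["a1", "a2"],
    "Extra": [0, 1],
})

# Hypothetical data dictionary with the two required columns.
df_dict = pd.DataFrame({
    "Name": ["WBC count", "Sample ID"],
    "Computer name": ["wbc", "sample_id"],
})

# Build the {original name: computer name} mapping and apply it.
mapping = dict(zip(df_dict["Name"], df_dict["Computer name"]))
df_renamed = df_raw.rename(columns=mapping)

print(list(df_renamed.columns))
```

Columns without an entry in the mapping (here `Extra`) keep their original name, since `DataFrame.rename` leaves unmapped labels untouched.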

Source code in pycelldyn/preprocessing.py
def rename_columns(df, df_data_dictionary, verbose=True):
    """`rename_columns`

    Rename the columns so that they have computer names as defined
    in the given data dictionary.

    !!! tip
        Renaming the columns should always be the first pre-processing step.

    Parameters
    ----------
    df : pandas DataFrame
        The pandas DataFrame.

    df_data_dictionary : pandas DataFrame
        DataFrame with data dictionary information. It should have
        at least the following columns:

        * `Name` - The original name of the column in the raw data
        * `Computer name` - The computer name defined in the dictionary

    verbose : bool
        Define if verbose output will be printed (`True`) or not (`False`).

    Returns
    -------
    df_renamed : pandas DataFrame
        Same as input `df`, but with the columns renamed according
        to the data dictionary mapping.
    """
    if verbose:
        print("Renaming columns...", flush=True, end='')

    # Check that columns of interest are present in the data dictionary.
    for col in ['Name', 'Computer name']:
        if col not in df_data_dictionary.columns:
            raise Exception(f"Column '{col}' not present in df_data_dictionary")

    # Select the data dictionary's columns of interest.
    df_data_dictionary = df_data_dictionary[['Name', 'Computer name']]

    # Convert DataFrame to dictionary.
    df_data_dictionary_dict = dict(zip(df_data_dictionary['Name'].values,
                                       df_data_dictionary['Computer name'].values))

    # Perform the renaming.
    df_renamed = df.rename(columns=df_data_dictionary_dict)

    if verbose:
        print("\tDONE!")


    return df_renamed

clean_dataframe

Clean categorical and numerical columns of a Sapphire or Alinity DataFrame.

Info

To identify what type a column is, this function uses information from the given data dictionary:

  • Numerical columns are those that have a Type of int, float, or int (scientific notation).
  • Categorical columns are those that have a Type of str or string.
  • Columns that fall outside of these types remain unchanged.
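
The classification rule above can be sketched as a small helper. The function name `classify_type` is hypothetical and not part of the package. The type lists follow the source listing on this page, which also accepts `string` as a categorical type and lower-cases the `Type` entry before comparing.

```python
# Type lists as they appear in the source listing of clean_dataframe.
TYPES_NUMERICAL = ["int", "float", "int (scientific notation)"]
TYPES_CATEGORICAL = ["str", "string"]


def classify_type(type_entry):
    """Return 'numerical', 'categorical', or 'unchanged' for a Type entry.

    Hypothetical helper for illustration only.
    """
    col_type = str(type_entry).lower()
    if col_type in TYPES_NUMERICAL:
        return "numerical"
    if col_type in TYPES_CATEGORICAL:
        return "categorical"
    return "unchanged"


for entry in ["int", "Float", "str", "datetime"]:
    print(entry, "->", classify_type(entry))
```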

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas DataFrame | The pandas DataFrame. | required |
| df_data_dictionary | pandas DataFrame | DataFrame with data dictionary information. It should have at least the following columns: Computer name (the computer name of each parameter) and Type (the variable type). | required |
| cols | list of str | List with the columns to be cleaned. If None, the function attempts to clean all columns. | None |
| verbose | bool | Define if verbose output will be printed (True) or not (False). | True |

Returns:

| Name | Type | Description |
|---|---|---|
| df_clean | pandas DataFrame | Clean DataFrame. |

Source code in pycelldyn/preprocessing.py
def clean_dataframe(df, df_data_dictionary, cols=None, verbose=True):
    """`clean_dataframe`

    Clean categorical and numerical columns of a Sapphire or Alinity
    DataFrame.

    !!! info
        To identify what type a column is, this function uses information
        from the given data dictionary:

        * Numerical columns are those that have a `Type` of `int`,
        `float`, or `int (scientific notation)`.
        * Categorical columns are those that have a `Type` of `str` or `string`.
        * Columns that fall outside of these types remain unchanged.


    Parameters
    ----------
    df : pandas DataFrame
        The pandas DataFrame.

    df_data_dictionary : pandas DataFrame
        DataFrame with data dictionary information. It should have
        at least the following columns:

        * `Computer name` - The computer name of each parameter.
        * `Type` - Variable type

    cols : list of str
        List with the columns to be cleaned. If `None`,
        the function attempts to clean all columns.

    verbose : bool
        Define if verbose output will be printed (`True`) or not (`False`).

    Returns
    -------
    df_clean : pandas DataFrame
        Clean DataFrame.
    """

    if verbose:
        print("Cleaning columns...")

    # Check that columns of interest are present in the data dictionary.
    for col in ['Computer name', 'Type']:
        if col not in df_data_dictionary.columns:
            raise Exception(f"Column '{col}' not present in df_data_dictionary")

    # Select the data dictionary's columns of interest.
    df_data_dictionary = df_data_dictionary[['Computer name', 'Type']]
    df_data_dictionary = df_data_dictionary.set_index('Computer name')


    # Define which columns will be cleaned.
    if cols is None:
        cols = df.columns

    # Perform cleaning of columns.
    # This is done one by one and depending on the column type.
    types_numerical = ['int', 'float', 'int (scientific notation)']
    types_categorical = ['str', 'string']

    df_clean = df.copy()
    for col in cols:
        # In case we ever need to use unit information, we can do so by
        # uncommenting this line (and keeping the 'Unit' column when
        # selecting the data dictionary's columns above):
        # col_unit = str(df_data_dictionary.loc[col, 'Unit']).lower()

        # If a column name starts with an underscore (_), it means that
        # it is meant to be ignored (for example, in cases when columns
        # are duplicated).
        if col[0] == '_':
            if verbose:
                print(f"- Column {col} is to be ignored.")
            continue

        col_type = str(df_data_dictionary.loc[col, 'Type']).lower()

        if col_type in types_numerical:
            if verbose:
                print(f"+ Cleaning numerical ({col_type}) column {col}...", end='', flush=True)
            df_clean[col] = clean_column_numerical(df, col)

            if verbose:
                print("\t DONE!")

        elif col_type in types_categorical:
            if verbose:
                print(f"+ Cleaning categorical ({col_type}) column {col}...", end='', flush=True)
            df_clean[col] = clean_column_categorical(df, col)
            if verbose:
                print("\t DONE!")
        else:
            if verbose:
                print(f". Column {col} will not be cleaned and left as is.")


    if verbose:
        print("\tDONE!")


    return df_clean

clean_column_numerical

Clean a numerical column. It applies the following steps:

  • Convert entries consisting of a single space (' ') to NaN.
  • Convert entries consisting of a non-breaking space (\xa0) to NaN.
  • Convert entries with the string value 'nan' to NaN.
  • Cast to float to ensure that all values are numbers.
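
The steps above can be sketched with plain pandas. This is a minimal illustration on an invented column (`wbc`), not the package's own implementation: it replaces the three placeholder strings in one pass with `Series.mask` and then casts to float.

```python
import numpy as np
import pandas as pd

# Hypothetical raw column as it might arrive from a text export.
raw = pd.Series(["4.2", " ", "\xa0", "nan", "7"], name="wbc", dtype=object)

# Replace the three placeholder strings with NaN, then cast to float.
placeholders = [" ", "\xa0", "nan"]
clean = raw.mask(raw.isin(placeholders), np.nan).astype(float)

print(clean.tolist())
```

Only exact matches are replaced in this sketch. An entry such as two consecutive spaces is not treated as a placeholder, so the float cast would raise an error for it.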

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas DataFrame | The pandas DataFrame. | required |
| col | string | Name of the numerical column to be cleaned. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| col_clean | pandas Series | The clean (numerical) column. |

Source code in pycelldyn/preprocessing.py
def clean_column_numerical(df, col):
    """`clean_column_numerical`

    Clean a numerical column. It applies the following steps:

    * Convert entries consisting of a single space (`' '`) to `NaN`.
    * Convert entries consisting of a non-breaking space (`\\xa0`) to `NaN`.
    * Convert entries with the string value `'nan'` to `NaN`.
    * Cast to float to ensure that all values are numbers.

    Parameters
    ----------
    df : pandas DataFrame
        The pandas DataFrame.

    col : string
        Name of the numerical column to be cleaned.

    Returns
    -------
    col_clean: pandas Series
        The clean (numerical) column.
    """
    df_clean = df.copy(deep=True)

    # Convert placeholder strings (space, non-breaking space, 'nan') to NaN.
    df_clean.loc[df[col]==' ', col] = np.nan
    df_clean.loc[df[col]=='\xa0', col] = np.nan
    df_clean.loc[df[col]=='nan', col] = np.nan

    # Cast to float (i.e., ensure that it will be a number).
    df_clean[col] = df_clean[col].astype(float)

    return df_clean[col]

clean_column_categorical

Clean a categorical column. It applies the following steps:

  • Make strings lower case.
  • Remove leading and trailing whitespace.
  • Convert typographic apostrophes (’) to straight apostrophes (').
  • Convert entries consisting of a non-breaking space (\xa0) to NaN.
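
The string-level part of these steps can be sketched in pure Python. The function `clean_string` below is a standalone stand-in modeled on the nested helper in the source listing on this page. It does not cover the final NaN conversion, and the sample values are invented.

```python
def clean_string(value):
    """Lower-case and trim a string; leave non-string values untouched."""
    if not isinstance(value, str):
        return value
    cleaned = value.lower().strip()
    # Normalize the typographic apostrophe (U+2019) to a straight one.
    return cleaned.replace("\u2019", "'")


samples = ["  Normal ", "PATIENT\u2019S", 3.5, None]
print([clean_string(s) for s in samples])
```

Because `str.strip()` treats the non-breaking space as whitespace, an entry that is only `\xa0` becomes an empty string here. The source listing handles such entries separately by comparing against the original column.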

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | pandas DataFrame | The pandas DataFrame. | required |
| col | str | Name of the categorical column to be cleaned. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| col_clean | pandas Series | The clean (categorical) column. |

Source code in pycelldyn/preprocessing.py
def clean_column_categorical(df, col):
    """`clean_column_categorical`

    Clean a categorical column. It applies the following steps:

    * Make strings lower case.
    * Remove leading and trailing whitespace.
    * Convert typographic apostrophes (`’`) to straight apostrophes (`'`).
    * Convert entries consisting of a non-breaking space (`\\xa0`) to `NaN`.

    Parameters
    ----------
    df : pandas DataFrame
        The pandas DataFrame.

    col : str
        Name of the categorical column to be cleaned.

    Returns
    -------
    col_clean: pandas Series
        The clean (categorical) column.
    """

    df_clean = df.copy(deep=True)

    # Make lower case, remove leading/trailing spaces, and convert
    # apostrophes to proper format (’ --> ').
    def _clean_string(string):

        if isinstance(string, str):
            clean_string = string.lower().strip()
            clean_string = clean_string.replace("’", "'")

        else:
            clean_string = string

        return clean_string
    df_clean[col] = df_clean[col].apply(_clean_string)

    # Convert non-breaking-space entries (matched on the original column) to NaN.
    df_clean.loc[df[col]=='\xa0', col] = np.nan

    return df_clean[col]