Removed z-score calculations and related outlier flags.
Implement is_outlier function to detect anomalies using IQR method.
| @@ -0,0 +1,40 @@ | |||
| def is_outlier(df, groupby, filter_column=None, window='7D',z_score_sensitivity=2) -> pd.DataFrame: #takes as input: dataframe, groupby - for eg. spaces_left, filter_columns-used to filter by parking_id, window- for eg. uses last 30D for calculating anomalies, returns updated dataframe | |||
- Add type annotations, e.g. `df: pd.DataFrame`.
- Rename `groupby` to something more intuitive like `group_cols` (a list of strings, so it can take more than one column).
- Add a Google-style docstring.
- Add meaningful comments throughout the function.
- We are not using `z_score_sensitivity` or any of `c=2`, `min_t`, `max_t`: delete the unused code or add an n_sigma method.
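To make the points above concrete, here is a rough sketch of what the signature and docstring could look like. `value_col` is a hypothetical parameter I am assuming we need once `group_cols` only carries the grouping keys, and the detection logic is deliberately left out, so treat this as a shape to discuss rather than the final API.

```python
from typing import List

import pandas as pd


def is_outlier(
    df: pd.DataFrame,
    group_cols: List[str],
    value_col: str,  # hypothetical name, not in the current code
    window: str = "7D",
) -> pd.DataFrame:
    """Flag rolling-IQR outliers within each group.

    Args:
        df: Input frame containing the grouping and value columns.
        group_cols: Columns that define a group, e.g. ["parking_id"].
        value_col: Numeric column to test, e.g. "spaces_left".
        window: Time-based rolling window, e.g. "7D".

    Returns:
        The frame with a boolean "is_outlier_iqr" column added.

    Raises:
        KeyError: If any of group_cols or value_col is missing from df.
    """
    # Fail early with a readable message instead of a deep pandas error.
    missing = [c for c in [*group_cols, value_col] if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in df: {missing}")
    # Detection logic intentionally omitted from this signature sketch.
    raise NotImplementedError
```

Note that `z_score_sensitivity` is simply gone here; if we want an n_sigma method later, it can come back as its own parameter.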
| @@ -0,0 +1,40 @@ | |||
| def is_outlier(df, groupby, filter_column=None, window='7D',z_score_sensitivity=2) -> pd.DataFrame: #takes as input: dataframe, groupby - for eg. spaces_left, filter_columns-used to filter by parking_id, window- for eg. uses last 30D for calculating anomalies, returns updated dataframe | |||
| df=df.copy() | |||
Add a bool parameter to control whether we create a copy (copying costs memory).
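Something along these lines would do; the helper name `add_outlier_flag` and the `copy` parameter name are only placeholders for illustration, and the flag column is filled with a dummy value.

```python
import pandas as pd


def add_outlier_flag(df: pd.DataFrame, copy: bool = True) -> pd.DataFrame:
    """Add a placeholder flag column, optionally without copying the input."""
    # copy=True protects the caller's frame; copy=False saves memory but
    # mutates the frame that was passed in.
    out = df.copy() if copy else df
    out["is_outlier_iqr"] = False  # dummy value, stands in for the real flag
    return out
```

Defaulting to `copy=True` keeps the current behaviour, so existing callers are not surprised by in-place changes.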
| if filter_column is None: | ||
| temp_dataframe=df[groupby] | ||
The `group_cols` fix should sort this out.
| Q1 = temp_dataframe.transform(lambda x: x.rolling(window, min_periods=1).quantile(0.25)) | ||
| Q3 = temp_dataframe.transform(lambda x: x.rolling(window, min_periods=1).quantile(0.75)) | ||
`groupby.transform(lambda x: x.rolling(...).quantile())` is inefficient: it rebuilds the rolling window for each quantile (so twice here), broadcasts the results awkwardly, and scales poorly (O(n log n) per group). Please propose a more scalable solution.
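One possible direction, as a sketch only: build a single grouped rolling object and take both quantiles from it, with no Python-level lambda. The function name and the `value_col` / `time_col` parameters are hypothetical. It assumes the group keys contain no NaN (pandas drops NaN groups by default, which would break the positional alignment used below), and whether it is actually faster on our data should be measured.

```python
from typing import List

import pandas as pd


def rolling_quartiles(
    df: pd.DataFrame,
    group_cols: List[str],
    value_col: str,
    time_col: str,
    window: str = "7D",
) -> pd.DataFrame:
    """Return df sorted by group and time, with rolling q1/q3 columns added."""
    # Sort so that the frame's row order matches the order of the
    # groupby-rolling output (groups by key, rows by time within a group).
    out = df.sort_values([*group_cols, time_col]).reset_index(drop=True)
    rolling = (
        out.set_index(time_col)
        .groupby(group_cols, sort=True)[value_col]
        .rolling(window, min_periods=1)
    )
    # Both quantiles reuse the same rolling object; values are assigned
    # by position, which is valid only because of the sort above.
    out["q1"] = rolling.quantile(0.25).to_numpy()
    out["q3"] = rolling.quantile(0.75).to_numpy()
    return out
```

The returned frame is re-sorted, so callers that rely on the original row order would need to carry an id column through.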
| lower_bound = Q1 - 1.5 * IQR | ||
| upper_bound = Q3 + 1.5 * IQR | ||
Replace the hard-coded 1.5 with a multiplier parameter so we can tune the IQR method.
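For example, roughly like this, where `iqr_bounds` and `multiplier` are suggested names rather than anything that exists in the PR:

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(
    q1: pd.Series, q3: pd.Series, multiplier: float = 1.5
) -> Tuple[pd.Series, pd.Series]:
    """Return (lower_bound, upper_bound) for the IQR rule."""
    iqr = q3 - q1
    # A larger multiplier widens the band, so fewer points are flagged.
    return q1 - multiplier * iqr, q3 + multiplier * iqr
```

Keeping 1.5 as the default preserves today's results while letting callers loosen or tighten the rule.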
| df['is_event_iqr_outlier']=df['is_outlier_iqr'] & df['is_event'] | ||
I assume the whole function takes an already processed dataframe. We need to be able to run the feature_engineering functions in order, so either:
- make this function take the raw dataframes it needs and transform them inside, or
- add a new function that transforms the dataframes before they are passed to this function.
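A minimal sketch of the second option, assuming the raw data carries a timestamp column; `prepare_raw`, `run_steps`, and the `timestamp` column name are all hypothetical and only illustrate the shape of the idea:

```python
from typing import Callable, Iterable

import pandas as pd

Step = Callable[[pd.DataFrame], pd.DataFrame]


def prepare_raw(raw: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Parse timestamps and sort, so later steps can use time-based windows."""
    out = raw.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    return out.sort_values(time_col).reset_index(drop=True)


def run_steps(raw: pd.DataFrame, steps: Iterable[Step]) -> pd.DataFrame:
    """Apply feature-engineering steps in the given order."""
    df = raw
    for step in steps:
        df = step(df)
    return df
```

With this shape, the outlier function would just be one more entry in `steps`, placed after `prepare_raw`, and the order of feature_engineering functions becomes explicit in one place.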