The Power of Feature Engineering: Unveiling Hidden Insights

Shunya Vichaar
10 min read · May 18, 2023


Welcome to the world of feature engineering, where data scientists transform raw data into useful insights. Feature engineering involves creating new variables or modifying existing ones to better understand the data.

Here, we’ll explore different ways to transform data, such as using ratios, averages, cumulative values, and standard deviations, to uncover hidden patterns and trends.

In the hands of a skilled data scientist, even a simple feature can unlock the secrets of a very complex problem.

Let’s go through each transformation that we can use on any given data:

  1. _CRLB: This transformation counts the number of times the variable crossed the lower bound of the confidence interval in the last 3 months. For example, it could be applied to track how many times a stock price dropped below the lower bound of its confidence interval over the past 3 months.
  2. _R0309: This transformation calculates the ratio of the average of the last 3 months to the average of the last 9 months. For example, if we’re analyzing monthly sales data, a ratio above 1 indicates that sales in the most recent 3 months are running above the 9-month average, and a ratio below 1 indicates the opposite.
  3. _D0306: This transformation measures the difference between the average of the last 3 months and the average of the last 6 months. In a scenario where we’re examining monthly website traffic, this difference could reflect whether there has been a significant change in traffic between the two time periods.
  4. RMA0306: This transformation calculates the ratio of the maximum value in the last 3 months to the maximum value in the last 6 months. It could be useful in analyzing metrics such as monthly revenue, where the ratio indicates the relative growth or decline in revenue between these two time spans.
  5. RMI0306: This transformation computes the ratio of the minimum value in the last 3 months to the minimum value in the last 6 months. Applied to a dataset of daily temperatures, this ratio would provide insights into whether the recent minimum temperatures are lower or higher compared to the previous 6 months.
  6. RAM0306: This transformation calculates the ratio of the maximum value in the last 3 months to the average value in the last 6 months. If we consider a dataset of monthly expenses, this ratio would indicate whether there have been any unusual spikes or fluctuations in expenses compared to the average.
  7. _CRUB: Similar to _CRLB, this transformation counts the number of times the variable crossed the upper bound of the confidence interval in the last 3 months. For example, if we’re analyzing stock prices, this feature would capture how many times the price exceeded the upper bound of its confidence interval.
  8. _L1: This transformation represents the lagged value of the variable from 1 month ago. It could be applied to various time-dependent datasets. For instance, in analyzing monthly sales, _L1 would provide the previous month’s sales figure.
  9. _L2: Similar to _L1, this transformation represents the lagged value of the variable from 2 months ago. Continuing with the sales example, _L2 would provide the sales figure from two months prior.
  10. _Rc3, _Rc6, _Rc9, _Rc12: These transformations calculate the ratio of the variable’s current value to its value at a lagged time point (3, 6, 9, or 12 months ago). These ratios could be useful for examining the growth or decline in various time-dependent datasets such as stock prices, customer counts, or website traffic.
  11. _Rs3s6, _Rs3s9, _Rs3s12, _Rs6s12, _Rs6s18: These transformations compute the sum of values over a recent time period divided by the sum of values over a prior time period. For instance, _Rs3s6 would calculate the sum of values in the last 3 months divided by the sum of values in the 6 months prior to that. These ratios could reveal patterns or trends in cumulative data, such as monthly revenue or total website visits.
  12. _avg3: This transformation calculates the average of the variable over the last 3 months. It provides a measure of the central tendency of the variable’s recent values. For example, in analyzing monthly temperatures, _avg3 would give the average temperature over the past 3 months.
  13. _avg3_CRLB: This transformation acts as an indicator of whether the average of the last 3 months crossed the lower bound of the confidence interval. It helps identify instances where the average value falls below the lower bound. This indicator can be useful in detecting unusual or significant deviations from the expected range.
  14. _avg3_CRUB: Similar to _avg3_CRLB, this transformation acts as an indicator of whether the average of the last 3 months crossed the upper bound of the confidence interval. It helps identify instances where the average value exceeds the upper bound.
  15. _ci_osh: This transformation calculates a value by subtracting the mean and standard deviation of the confidence interval from _avg3 and setting any negative result to zero. The result is therefore non-negative: it measures how far the recent 3-month average overshoots the mean plus one standard deviation, and is zero when there is no overshoot.
  16. _ci_ush: This transformation calculates a value by subtracting _avg3 from the mean and standard deviation of the confidence interval and setting any negative result to zero. Similar to _ci_osh, it ensures a non-negative value. It is the counterpart of _ci_osh: it measures how far the recent 3-month average falls short of the expected range, and is zero when there is no shortfall.
  17. _cumi_t<j>: This transformation computes a cumulative value by dividing the current value plus a small epsilon by the minimum value plus the same epsilon, using the last j-month time span. Epsilon is added to avoid division by zero. It captures the relative growth of the variable compared to its minimum value over the specified time span. This transformation can be useful for tracking variables that have a lower limit or baseline value.
  18. _cumx_t<j>: This transformation calculates a cumulative value by dividing the current value plus a small epsilon by the maximum value plus the same epsilon, using the last j-month time span. Epsilon is added to avoid division by zero. It captures the relative growth of the variable compared to its maximum value over the specified time span. This transformation is valuable for monitoring variables that have an upper limit or threshold.
  19. _curr: This transformation represents the current value of the variable at time t=0. It provides the starting point or baseline value for further calculations or comparisons.
  20. _cv_t<j>: This transformation calculates the coefficient of variation using the standard deviation divided by the mean of the variable over the last j-month time span. The coefficient of variation measures the relative variability or dispersion of the variable’s values. It can help assess the stability or volatility of the data.
  21. _div3: This transformation computes a value by subtracting the average of the last 3 months from the average of the prior 21 months and then dividing it by the standard deviation of the prior 21 months. It quantifies the difference in the variable’s recent average compared to the average of a longer time span relative to the standard deviation. This transformation can identify periods of significant deviation from the long-term average.
  22. _div6: This transformation calculates a value by subtracting the average of the last 6 months from the average of the prior 18 months and then dividing it by the standard deviation of the prior 18 months. It provides a measure of the difference in the variable’s average over a shorter time span compared to the average of a longer time span relative to the standard deviation. This transformation can highlight shorter-term fluctuations or trends.
  23. _m2_t<j>: This transformation computes a value by dividing the maximum value over the last j-month time span by the minimum value over the same time span, with the addition of a small epsilon to avoid division by zero. It captures the ratio between the maximum and minimum values of the variable over the specified time period. This transformation can be useful for understanding the range or variability of the variable’s values.
  24. _max_t<j>: This transformation calculates the maximum value of the variable over the last j-month time span. It provides information about the highest value reached by the variable within the specified time period. This transformation can be valuable for identifying peak values or extreme observations.
  25. _mean_CI: This transformation represents the mean of the variable’s values between months -23 and -3, that is, the 21 months preceding the most recent 3. It is used to define the confidence interval of an account or dataset. _mean_CI provides a reference point for comparing individual values and assessing their deviation from the expected range.
  26. _min_t<j>: This transformation computes the minimum value of the variable over the last j-month time span. It indicates the lowest value observed within the specified time period. This transformation can be helpful for identifying troughs or the lowest points in the variable’s values.
  27. _mm_t<j>: This transformation calculates a value by dividing the maximum value over the last j-month time span by the minimum value over the same time span, with the addition of a small epsilon to avoid division by zero. It captures the ratio between the maximum and minimum values of the variable over the specified time period. As described here, it is the same max-to-min ratio as _m2_t<j>.
  28. _mx_t<j>: This transformation computes a value by dividing the maximum value over the last j-month time span by the sum of values over the same time span, with the addition of a small epsilon to avoid division by zero. It captures the proportion of the maximum value relative to the sum of values within the specified time period. This transformation can provide insights into the contribution or significance of the maximum value to the overall sum.
  29. _mxme_t<j>: This transformation calculates a value by dividing the maximum value over the last j-month time span by the mean of values over the same time span, with the addition of a small epsilon to avoid division by zero. It captures the ratio between the maximum value and the mean value within the specified time period. This transformation can help identify the relative distance of the maximum value from the average.
  30. _s2_t<j>: This transformation computes a value by subtracting the minimum value over the last j-month time span from the sum of values over the same time span, and then dividing it by the difference between the maximum and minimum values, with the addition of a small epsilon to avoid division by zero. It provides a measure of the proportion of the sum of values relative to the range between the minimum and maximum values. This transformation can be useful for understanding the distribution or concentration of values within the specified time period.
  31. _slop_b1_12m: This transformation represents the slope or coefficient of the account-level regression with the last 12 months as the time span. It captures the trend or rate of change in the variable over the specified time period. This transformation can provide insights into the direction and magnitude of the variable’s long-term trend.
  32. _slop_b1_24m: This transformation represents the slope or coefficient of the account-level regression with the last 24 months as the time span. It captures the trend or rate of change in the variable over a longer time period compared to _slop_b1_12m. This transformation can help identify longer-term trends or patterns.
  33. _slop_b1_24m_12m: This transformation represents the slope or coefficient of the account-level regression with the oldest 12 months as the time span. It captures the trend or rate of change in the variable using the oldest data points available. This transformation can provide insights into the long-term trend or pattern that may differ from the more recent trends.
  34. _slop_b1_6m: This transformation represents the slope or coefficient of the account-level regression with the last 6 months as the time span. It captures the trend or rate of change in the variable over a shorter time period compared to _slop_b1_12m and _slop_b1_24m. This transformation can help identify shorter-term trends or patterns.
  35. _std_CI: This transformation represents the standard deviation of the variable’s values between months -23 and -3, the same window used for _mean_CI. It is used for defining the confidence interval of an account or dataset. _std_CI provides a measure of the dispersion or variability of the variable’s values around the mean.
  36. _std_t<j>: This transformation calculates the standard deviation of the variable over the last j-month time span. It measures the spread or dispersion of the variable’s values within the specified time period. This transformation can help assess the volatility or variability of the variable.
  37. _sto_t<j>: This transformation calculates the stochastic oscillator by subtracting the minimum value over the last j-month time span from the current value and dividing it by the difference between the maximum and minimum values, with the addition of a small epsilon to avoid division by zero. It provides a normalized measure of the current value’s position within the range between the minimum and maximum values. This transformation is useful for assessing the relative position or momentum of the variable.
  38. _sum<j>: This transformation computes the sum of the variable’s values over the last j-month time span. It provides information about the total accumulation or aggregation of the variable’s values within the specified time period. This transformation can be helpful for understanding the overall magnitude or quantity represented by the variable.
  39. _yt_t<j>: This transformation calculates a value by subtracting both the minimum and the maximum value over the last j-month time span from twice the current value, that is, 2 × current − min − max. This equals twice the distance of the current value from the midpoint of the range, so it is positive when the current value sits in the upper half of the range and negative when it sits in the lower half.
  40. _dev: This transformation represents the deviation of the variable from its mean. It measures the difference between each value and the average value of the variable. This transformation can be useful for understanding the distance or divergence of individual values from the central tendency.
  41. _cluster: This transformation utilizes groups or clusters from a clustering algorithm. It preserves the grouping or categorization of the data and ensures that no transformations are applied that would change the groupings. This transformation is useful when maintaining the original cluster structure is important for subsequent analysis or interpretation.
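To make a few of the window-based definitions above concrete, here is a minimal Python sketch using only the standard library. It rests on assumptions the list does not spell out: the series is a plain list of monthly values ordered oldest to newest, the "last 3 months" include the current month, and all denominators are nonzero. The function name `window_features` is hypothetical and chosen for illustration.

```python
from statistics import mean


def window_features(values):
    """Compute a few window features for one monthly series.

    `values` is ordered oldest to newest; the last element is the
    current month (t=0). Assumes at least 9 observations and
    nonzero denominators.
    """
    last3, last6, last9 = values[-3:], values[-6:], values[-9:]
    curr = values[-1]
    return {
        "_curr": curr,
        "_L1": values[-2],                        # value 1 month ago
        "_L2": values[-3],                        # value 2 months ago
        "_avg3": mean(last3),
        "_R0309": mean(last3) / mean(last9),      # ratio of averages
        "_D0306": mean(last3) - mean(last6),      # difference of averages
        "RMA0306": max(last3) / max(last6),       # ratio of maxima
        "RMI0306": min(last3) / min(last6),       # ratio of minima
        "RAM0306": max(last3) / mean(last6),      # recent max vs. average
        "_Rc3": curr / values[-4],                # current vs. 3 months ago
    }


# Made-up monthly sales figures, oldest first.
sales = [10, 12, 11, 13, 14, 15, 16, 18, 20]
features = window_features(sales)
for name, value in features.items():
    print(name, round(value, 4))
```

In practice the same logic would typically be applied per account or per entity, and a small epsilon could be added to the denominators, as several of the definitions above do.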

In summary, feature engineering creates new variables, or transforms existing ones, to improve the predictive power or interpretability of a dataset. The transformations above capture averages, ratios, cumulative values, deviations, trends, standard deviations, and more, giving a richer view of the characteristics and dynamics of the data and enabling more effective analysis and modeling.
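As a rough illustration of the confidence-interval, range, and trend transformations in the list, the following Python sketch computes a handful of them with the standard library. Several choices here are assumptions rather than definitions from the list: the interval bounds are taken as the mean plus or minus one sample standard deviation, a "crossing" is counted as a month whose value lies outside a bound, and the slope is an ordinary least-squares fit of value against month index. The function names are hypothetical.

```python
from statistics import mean, stdev

EPS = 1e-9  # small epsilon guarding against division by zero


def ci_features(values):
    """Confidence-interval features for 24+ monthly values, oldest first."""
    baseline = values[-24:-3]  # months -23 .. -3 (21 values)
    last3 = values[-3:]        # months -2 .. 0
    mean_ci, std_ci = mean(baseline), stdev(baseline)
    # Assumption: the interval is mean +/- one sample standard deviation.
    lower, upper = mean_ci - std_ci, mean_ci + std_ci
    avg3 = mean(last3)
    return {
        "_mean_CI": mean_ci,
        "_std_CI": std_ci,
        "_CRLB": sum(v < lower for v in last3),   # months below lower bound
        "_CRUB": sum(v > upper for v in last3),   # months above upper bound
        "_avg3_CRLB": int(avg3 < lower),
        "_avg3_CRUB": int(avg3 > upper),
        "_ci_osh": max(0.0, avg3 - upper),        # overshoot, floored at 0
        "_div3": (mean_ci - avg3) / std_ci,       # assumes std_ci != 0
    }


def range_and_trend_features(values, j):
    """Range and trend features over the last j months (j >= 2)."""
    window = values[-j:]
    n = len(window)
    curr, lo, hi = window[-1], min(window), max(window)
    # Ordinary least-squares slope of value against month index 0..n-1.
    x_mean, y_mean = (n - 1) / 2, mean(window)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(window))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return {
        f"_sto_t{j}": (curr - lo) / (hi - lo + EPS),
        f"_cv_t{j}": stdev(window) / mean(window),  # assumes nonzero mean
        f"_mxme_t{j}": hi / (mean(window) + EPS),
        f"_slop_b1_{j}m": sxy / sxx,
    }


# 24 made-up monthly values: a stable history followed by a late spike.
history = [10, 12] * 10 + [11, 9, 8, 22]
ci = ci_features(history)
trend = range_and_trend_features(history, 6)
print(ci)
print(trend)
```

With this made-up series, the last three months contain two values below the assumed lower bound and one above the upper bound, which is the kind of pattern the crossing counts are meant to surface.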
