Evaluating Sample Datasets: 'IBM HR Analytics Employee Attrition & Performance'

garywalton05
Jul 24, 2023

Dataset 2: IBM HR Analytics Employee Attrition & Performance | Kaggle

This synthetic dataset was created by IBM data scientists for training machine learning models to identify leading indicators of attrition and performance.

Key Evaluation Criteria:

Data Volume 5/10: The dataset includes 1,470 workers, of whom 1,233 are currently active and 237 have terminated. The headcount is decent, but the records carry no dates, so the data cannot be analyzed over time (for example, by hire date, termination date, or reporting period).
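As a quick check on the volume figures, here is a minimal sketch that counts active and terminated workers. It assumes the file has an `Attrition` column holding `Yes`/`No`, and it uses a tiny inline sample instead of the real download, so the sample rows are illustrative only.

```python
import csv
import io
from collections import Counter

# Tiny inline sample standing in for the real CSV. The column name
# "Attrition" and its "Yes"/"No" values are assumptions about the file.
SAMPLE_CSV = """EmployeeNumber,Attrition
1,Yes
2,No
4,Yes
5,No
7,No
"""


def attrition_counts(csv_file):
    """Count rows per Attrition value in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Attrition"] for row in reader)


def attrition_rate(counts):
    """Share of workers flagged as terminated (Attrition == "Yes")."""
    total = sum(counts.values())
    return counts["Yes"] / total if total else 0.0


counts = attrition_counts(io.StringIO(SAMPLE_CSV))
print(counts["No"], "active;", counts["Yes"], "terminated")
print(f"attrition rate: {attrition_rate(counts):.1%}")
```

With the figures quoted above (237 terminated out of 1,470), the same calculation gives an attrition rate of roughly 16.1%.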

Data Completeness 3/10: The dataset lacks crucial information required for comprehensive workforce analytics. There is no worker history, which prevents drawing deeper insights from employee trajectories. Fundamental HR dimensions such as manager assignments, cost allocations, business structure, and geographical representation are also missing, and the absence of race or ethnicity information restricts its usability for diversity, equity, and inclusion (DE&I) initiatives.

Data Quality 9/10: The dataset has undergone an efficiency initiative to reduce data transmission and storage requirements. Long-form text is represented as shortened codes, so some effort is needed to translate the codes back into human-readable labels.
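Making the codes readable is mostly a lookup exercise. The sketch below assumes two coded columns, `Education` and `JobSatisfaction`, with the label lists shown; both the column names and the labels are assumptions that should be verified against the dataset's own documentation before use.

```python
# Hypothetical lookup tables: verify every code and label against the
# dataset's documentation before relying on them.
CODE_LABELS = {
    "Education": {
        1: "Below College",
        2: "College",
        3: "Bachelor",
        4: "Master",
        5: "Doctor",
    },
    "JobSatisfaction": {1: "Low", 2: "Medium", 3: "High", 4: "Very High"},
}


def decode_row(row, code_labels=CODE_LABELS):
    """Return a copy of row with coded columns replaced by their labels.

    Unknown codes, non-numeric values, and unmapped columns pass
    through unchanged.
    """
    decoded = dict(row)
    for column, labels in code_labels.items():
        if column in decoded:
            code = decoded[column]
            try:
                decoded[column] = labels.get(int(code), code)
            except (TypeError, ValueError):
                pass  # leave values that are not integer codes as they are
    return decoded


row = {"EmployeeNumber": "1", "Education": "2", "JobSatisfaction": "4"}
print(decode_row(row))
```

Returning a copy keeps the coded values available, which is useful because the numeric codes are still the more convenient form for model training.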

Realism and Complexity 2/10: While the coded values mirror an efficiency practice found in real-world systems, the dataset lacks realism and complexity because fundamental HR data is missing and worker history is limited.

Data Availability 10/10: The dataset is legally and ethically available for use.

Conclusion:

While this dataset may be useful for comparing commonalities between active and terminated workers, its limited worker history and missing fundamental HR dimensions make it unsuitable for comprehensive workforce analytics. It is, however, well suited to testing hypotheses and training machine learning models.
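A comparison between active and terminated workers can be as simple as averaging one numeric column per group. This is a minimal sketch under stated assumptions: the `Attrition` and `MonthlyIncome` column names are assumed, and the sample rows are made-up values for illustration, not figures from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_by_attrition(rows, column):
    """Average a numeric column separately for each Attrition value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Attrition"]].append(float(row[column]))
    return {status: mean(values) for status, values in groups.items()}


# Made-up rows for illustration only; not taken from the dataset.
sample_rows = [
    {"Attrition": "Yes", "MonthlyIncome": "2000"},
    {"Attrition": "Yes", "MonthlyIncome": "3000"},
    {"Attrition": "No", "MonthlyIncome": "5000"},
    {"Attrition": "No", "MonthlyIncome": "7000"},
]

averages = mean_by_attrition(sample_rows, "MonthlyIncome")
print(averages)
```

Because the dataset is a single snapshot without dates, a comparison like this shows how the two groups differ, not when or why workers left.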