An HR dataset would be nothing without people, right? Here I'll talk about generating fictional individuals and how I assign them to positions.
Generating Fictitious Workers
The basic worker information required consists of:
First Name
Last Name
Gender
Military Statuses
Marital Status
Race/Ethnicity
Date of Birth
To generate this information en masse, I employed ChatGPT. Using the following prompt and its variations, the web interface supplied around 40 workers per response (you could probably use the API and specify the required number of responses to get them all at once).
Prompts:
Create a table of fictional Americans with the following headers: Worker Name, First Name, Last Name, Gender, Military Statuses, Marital Status, Race/Ethnicity, Date of Birth
Variations:
...Americans* of suitable disposition to work in a manufacturing plant, *with...
...Americans* of suitable disposition to work in scientific laboratories, *with...
...Americans*, males, over 18, *with...
...Americans*, females, over 40, *with...
...Americans*, between the ages of 18 and 27, *with...
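The variations above all slot a qualifier between "Americans" and "with", so they can be generated from one template. This is only a minimal sketch: the names HEADERS, QUALIFIERS and build_prompt are my own and not part of any tool, and the resulting strings would still need to be pasted into ChatGPT (or sent through the API) by hand.

```python
# Column headers taken from the base prompt.
HEADERS = [
    "Worker Name", "First Name", "Last Name", "Gender",
    "Military Statuses", "Marital Status", "Race/Ethnicity", "Date of Birth",
]

# Qualifiers inserted directly after "Americans"; the empty string
# reproduces the base prompt unchanged.
QUALIFIERS = [
    "",
    " of suitable disposition to work in a manufacturing plant,",
    " of suitable disposition to work in scientific laboratories,",
    ", males, over 18,",
    ", females, over 40,",
    ", between the ages of 18 and 27,",
]


def build_prompt(qualifier: str = "") -> str:
    """Return the base prompt with an optional qualifier slotted in."""
    return (
        f"Create a table of fictional Americans{qualifier} "
        f"with the following headers: {', '.join(HEADERS)}"
    )


prompts = [build_prompt(q) for q in QUALIFIERS]
for prompt in prompts:
    print(prompt)
```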
The outputs should look something like this:
As can be seen, ChatGPT will attempt to output an 'equal distribution' of results (unless otherwise dictated in the prompt). This is not a problem, as we are currently looking for a diverse pool to pull our workers from.
Assigning Workers to Positions
Now that we have a pool to pull from, we can use past experience and generalised knowledge of the industry to select individuals to fill positions in certain job categories, for instance:
HR will be primarily female, with workers present in all age groupings. There is little difference in age distribution, and the race/ethnicity distribution is usually even.
Production plant workers are predominantly male, with lower skilled roles likely to be held by those under 30. The race/ethnicity distribution is also likely to be skewed towards Black/African American persons.
Higher skilled production positions are likely to be held by those above the age of 30; gender and race/ethnicity distributions are more even.
Scientific roles are skewed towards Asian males in their 30s.
Leadership roles are skewed towards White males over the age of 40.
Now it is a simple matter of filtering our pool and assigning the appropriate people to each position, so that job groups, specialities and levels of responsibility reasonably align with our real-world experience.
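The filtering step could be sketched roughly as follows. Everything here is illustrative rather than the exact process I used: the pool rows, position titles, AS_OF date and the PREFERENCES rules are hypothetical, and only gender and age are modelled to keep the example short. Because the rules describe skews rather than hard requirements, the sketch falls back to any remaining worker when nobody matches the preferred profile.

```python
from datetime import date

AS_OF = date(2024, 1, 1)  # hypothetical reporting date


def age(dob, as_of=AS_OF):
    """Whole years between dob and as_of."""
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))


# Hypothetical pool rows; a real pool would come from the generated tables.
pool = [
    {"name": "Worker A", "gender": "Female", "dob": date(1985, 3, 14)},
    {"name": "Worker B", "gender": "Male", "dob": date(1999, 7, 2)},
    {"name": "Worker C", "gender": "Male", "dob": date(1975, 11, 30)},
    {"name": "Worker D", "gender": "Male", "dob": date(1970, 1, 20)},
]

# Preferred profile per job category (assumed simplification of the rules above).
PREFERENCES = {
    "HR": lambda w: w["gender"] == "Female",
    "Production (lower skilled)": lambda w: w["gender"] == "Male" and age(w["dob"]) < 30,
    "Production (higher skilled)": lambda w: age(w["dob"]) > 30,
    "Leadership": lambda w: w["gender"] == "Male" and age(w["dob"]) > 40,
}


def assign(pool, open_positions):
    """Fill each (position, category) pair with the first preferred, unassigned worker."""
    available = list(pool)
    assignments = {}
    for position, category in open_positions:
        prefers = PREFERENCES[category]
        match = next((w for w in available if prefers(w)), None)
        if match is None and available:
            match = available[0]  # nobody fits the skew, so take anyone left
        if match is not None:
            available.remove(match)
            assignments[position] = match["name"]
    return assignments


open_positions = [
    ("HR Partner", "HR"),
    ("Line Operator", "Production (lower skilled)"),
    ("Plant Director", "Leadership"),
]
assignments = assign(pool, open_positions)
print(assignments)
```

Taking the first match keeps the sketch deterministic; sampling at random from the matching workers would give a less rigid distribution.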
Putting Faces to Names
Now that we have people in positions, we can make the dataset a little more human by generating profile pictures with Midjourney, an AI image generator. As with ChatGPT, we supply a prompt describing what we need, and it returns 4 samples that can be upscaled if desired.
Prompts:
realistic, high quality, photograph, mid-shot, isolated on white background, a picture of a professional White Male in his 60s
realistic, high quality, photograph, mid-shot, isolated on white background, a picture of a Black Male in his 20s
realistic photograph, mid-shot, isolated on white background, a worker profile picture for a Asian Female of 50 years of age who holds a Biologics Scientist position
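Since these prompts share a common shape, they can be built from each worker's attributes. The following is a minimal sketch, assuming the first two prompts as the template; build_image_prompt, BASE_STYLE and PRONOUNS are names I have made up for illustration, and the output is only text to paste into Midjourney.

```python
BASE_STYLE = "realistic, high quality, photograph, mid-shot, isolated on white background"
PRONOUNS = {"Male": "his", "Female": "her"}


def build_image_prompt(race, gender, age, professional=False):
    """Build a profile picture prompt from a worker's attributes."""
    decade = (age // 10) * 10  # e.g. 64 -> 60, rendered as "60s"
    pronoun = PRONOUNS.get(gender, "their")
    description = f"{race} {gender}"
    if professional:
        description = f"professional {description}"
    # Pick "a" or "an" from the first letter of the description.
    article = "an" if description[0].lower() in "aeiou" else "a"
    return f"{BASE_STYLE}, a picture of {article} {description} in {pronoun} {decade}s"


print(build_image_prompt("White", "Male", 64, professional=True))
print(build_image_prompt("Black", "Male", 23))
```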
Results:
Adding Compensation Data
People don't work for free, so we'll be returning to ChatGPT and incorporating some market data to find out how much each position could pay.
Prompt:
For the following positions, I need you to provide, in a table, minimum acceptable salary in $, maximum acceptable salary in $, assume that all positions are located in Baltimore, MD. The positions are: {List positions here}
Result:
Using these values as a baseline (with some positions adjusted to reflect the market more closely), I then calculate salaries by treating the minimum and maximum as a range and applying a randomised modifier, so that each worker receives an individual amount.
Modifiers:
0.5 to 1.3 for leadership positions
0.1 to 0.8 for mid-level positions
-0.25 to 0.5 for lower level positions
Bonus percentages are also applied based on the level of the worker: 15% to 20% for leadership, 10% to 15% for mid-level and 7.5% for lower-level.
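One way to read the modifiers is as a position within the salary range, so that 0 lands on the minimum, 1 on the maximum, and values outside that interval fall below or above the range. That interpretation is an assumption, as are the uniform random draw and the example salary figures; the sketch below shows the idea rather than the exact calculation.

```python
import random

# Modifier and bonus ranges per level, taken from the lists above.
MODIFIER_RANGES = {"leadership": (0.5, 1.3), "mid": (0.1, 0.8), "lower": (-0.25, 0.5)}
BONUS_RANGES = {"leadership": (15.0, 20.0), "mid": (10.0, 15.0), "lower": (7.5, 7.5)}


def generate_compensation(min_salary, max_salary, level, rng):
    """Return (salary, bonus_pct) for one worker.

    Assumed formula: salary = min + modifier * (max - min).
    """
    mod_low, mod_high = MODIFIER_RANGES[level]
    modifier = rng.uniform(mod_low, mod_high)
    salary = min_salary + modifier * (max_salary - min_salary)

    bonus_low, bonus_high = BONUS_RANGES[level]
    bonus_pct = rng.uniform(bonus_low, bonus_high)
    return round(salary, 2), round(bonus_pct, 2)


rng = random.Random(42)  # seeded so the generated dataset is reproducible
# Hypothetical range for a lower-level position.
salary, bonus_pct = generate_compensation(40_000, 60_000, "lower", rng)
print(salary, bonus_pct)
```

Under this reading, a lower-level position with a range of $40,000 to $60,000 would pay between $35,000 and $50,000, while a leadership role could exceed its stated maximum.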
Usable Initial Load
With this, we now have the minimum required information to create an HR dataset. The scope is very limited and can only present 'as of now' information. I'll be taking a look at some reporting we can do against this data before beginning to emulate Globocom's expansion with data generated over time.