Author: Dongbo Shi, Sherry T. Tong
Abstract: Identifying an individual's gender from their name is crucial for many scientific studies on gender issues, but it is often complicated by diverse naming conventions globally. Predicting gender from Chinese names poses particular challenges due to unique naming conventions and limited representation in existing datasets. In this study, we introduce a novel dataset comprising 1,051,891 Chinese names in Chinese characters and 96,797 corresponding names in Pinyin from over thirty million Chinese individuals. This dataset includes the frequency of each name’s usage by men and women. We validate our dataset using two additional datasets for predicting gender and find that it offers broader name coverage and higher predictive precision than existing methods. Overall, this dataset serves as an essential resource for advancing gender-related research in China.
Fig. 1 The distribution of names in the Chinese Gender dataset. (a) presents the distribution of names in Chinese characters and (b) presents that in Pinyin format. Names are divided into ten groups according to their female ratio (ranging from 0 to 1), where a higher ratio signifies a name predominantly used by women. The x-axis indicates the female ratio, while the y-axis indicates the population percentage within each group.
Fig. 2 The concentration of names in Chinese characters (a) and Pinyin format (b) in the Chinese Gender dataset. The horizontal axis represents the top percent of names, and the vertical axis represents the percentage of the total population.
Fig. 3 Name coverage in the Grantees (a) and Teenagers (b) over various frequency settings. The x-axis is the name frequency, and the y-axis is the proportion of names under a given name-frequency limit. The red dotted line marks frequency = 10. Name coverage in the Grantees (c) and Teenagers (d) over various threshold settings, with a frequency of 10. The x-axis is the gender threshold (female = 1, male = 0), and the y-axis denotes the proportion of males (<= threshold) and females (>= threshold). Adjusting the threshold determines the confidence level for categorizing a name as male or female, which affects the number of classified individuals. At a threshold of 0.5, all predictable individuals are covered.
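The threshold and frequency settings described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a symmetric confidence threshold t in [0.5, 1], under which a name is labeled female when its female ratio is at least t and male when the ratio is at most 1 - t. The names, the counts, and the functions `classify` and `coverage` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy counts: name -> (uses by women, uses by men).
# These values are illustrative only and are not taken from the dataset.
COUNTS: Dict[str, Tuple[int, int]] = {
    "name_a": (95, 5),
    "name_b": (2, 48),
    "name_c": (6, 4),
    "name_d": (3, 2),
}


def classify(
    name: str,
    counts: Dict[str, Tuple[int, int]],
    threshold: float = 0.5,
    min_frequency: int = 10,
) -> Optional[str]:
    """Return "female", "male", or None when the name cannot be classified.

    Assumption: the threshold is a symmetric confidence level in [0.5, 1].
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1]")
    if name not in counts:
        return None  # name is absent from the reference data
    female, male = counts[name]
    total = female + male
    if total < min_frequency:
        return None  # name is too rare to be trusted
    ratio = female / total  # female ratio, between 0 and 1
    if ratio >= threshold:
        return "female"
    if ratio <= 1.0 - threshold:
        return "male"
    return None  # ratio is too ambiguous at this confidence level


def coverage(
    names: List[str],
    counts: Dict[str, Tuple[int, int]],
    threshold: float = 0.5,
    min_frequency: int = 10,
) -> float:
    """Share of individuals whose name receives a gender label."""
    if not names:
        return 0.0
    labeled = sum(
        classify(n, counts, threshold, min_frequency) is not None for n in names
    )
    return labeled / len(names)


if __name__ == "__main__":
    people = ["name_a", "name_b", "name_c", "name_d", "name_e"]
    for t in (0.5, 0.9):
        print(t, coverage(people, COUNTS, threshold=t))
```

Raising the threshold in this sketch leaves ambiguous names such as the hypothetical `name_c` unlabeled, so coverage can only stay the same or fall as the threshold increases.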
Fig. 4 The relation between precision and threshold. Names are restricted to a frequency of 10 or more, and the analysis is performed separately for the male (a, c) and female (b, d) groups. The upper and lower rows show the results for the Grantees (a, b) and Teenagers (c, d), respectively. The threshold ranges from 0.5 to 1 in increments of 0.01.
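A precision-versus-threshold sweep of the kind shown in this figure might be computed as in the sketch below. It is a hedged illustration, not the authors' code. It assumes that precision for a group is the share of individuals predicted to be in that group whose recorded gender matches, that a name is predicted female when its female ratio is at least t and male when it is at most 1 - t, and that the records have already been restricted to names with a frequency of 10 or more. The records and the function `precision_by_threshold` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical validation records: (female ratio of the name, recorded gender).
# Assumed to be pre-filtered to names with frequency >= 10.
RECORDS: List[Tuple[float, str]] = [
    (0.95, "female"),
    (0.80, "female"),
    (0.60, "male"),
    (0.55, "female"),
    (0.30, "female"),
    (0.10, "male"),
    (0.02, "male"),
]


def precision_by_threshold(
    records: List[Tuple[float, str]], target: str
) -> Dict[float, Optional[float]]:
    """Precision for `target` at thresholds 0.50, 0.51, ..., 1.00.

    A value of None means no individual was predicted as `target`
    at that threshold, so precision is undefined.
    """
    if target not in ("female", "male"):
        raise ValueError("target must be 'female' or 'male'")
    curve: Dict[float, Optional[float]] = {}
    for step in range(50, 101):  # integer steps avoid float accumulation
        t = step / 100
        if target == "female":
            predicted = [g for r, g in records if r >= t]
        else:
            predicted = [g for r, g in records if r <= 1.0 - t]
        if predicted:
            correct = sum(g == target for g in predicted)
            curve[t] = correct / len(predicted)
        else:
            curve[t] = None
    return curve


if __name__ == "__main__":
    female_curve = precision_by_threshold(RECORDS, "female")
    male_curve = precision_by_threshold(RECORDS, "male")
    for t in (0.5, 0.7, 0.8, 1.0):
        print(t, female_curve[t], male_curve[t])
```

In this toy data, precision for both groups rises as the threshold increases, and it becomes undefined once no name is confident enough to be labeled.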