Data Critique

Where Our Data Came From

We found the dataset on Kaggle; it is a scrape of the official Academy Awards database. It documents all nominees across all categories of the Academy Awards, recording the film's year, the nomination category, race, gender, film name, and so on. The data runs from the very first ceremony in 1928 through the 92nd in 2020, so no post-pandemic ceremony is recorded, and we cannot account for any change in racial or gender trends across the past five Oscar ceremonies. The underlying data was compiled by Raphael Fontes, and the author of the Kaggle dataset, Dharmik Donga, built on it to include the gender and race of the nominees and winners.

Issues with Our Dataset

The dataset itself has some missing cells, as well as different names for the same award: “Actress in a supporting role,” for example, also appears as “Best actress in a supporting role.” Additionally, names containing accented characters were not imported properly. Because some awards, such as those for music, honor multiple people, some cells list several names rather than a single person. This matters because each such cell is tagged with a single gender and race, which discards information about the individuals and can understate representation in the dataset. We also see that for awards such as “Best Picture,” the winner is credited as the producer, which complicates data cleaning. Finally, some entries are “Honorary Awards” recognizing specific people for their influence on the industry, and thus list no associated film.
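The cleaning steps these issues call for can be sketched in pandas. This is only an illustrative sketch on toy rows: the column names ("category", "name") and the exact label variants are assumptions, not the dataset's actual schema.

```python
import unicodedata

import pandas as pd

# Toy rows mimicking the issues described above; the schema here is
# an assumption, not the real Kaggle dataset's columns.
df = pd.DataFrame({
    "category": [
        "Actress in a supporting role",
        "Best actress in a supporting role",
        "Music (Original Score)",
    ],
    "name": ["Penélope Cruz", "Lupita Nyong'o", "Alan Menken, Howard Ashman"],
})

# 1. Collapse duplicate category labels: strip a leading "Best " and
#    lowercase, so the two supporting-actress variants merge into one.
df["category_clean"] = (
    df["category"].str.replace(r"^Best\s+", "", regex=True).str.lower()
)

# 2. Normalize accented characters to one Unicode form (NFC), so that
#    differently encoded accents compare as equal strings.
df["name"] = df["name"].map(lambda s: unicodedata.normalize("NFC", s))

# 3. Flag cells that list multiple people, since a single race/gender
#    label cannot describe them all and they need separate handling.
df["multi_person"] = df["name"].str.contains(",")
```

After these steps the two supporting-actress variants share one cleaned label, and multi-person cells are flagged for row-splitting rather than silently treated as one individual.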


The dataset can illuminate long-term patterns in recognition at the Oscars, especially related to gender and race. It allows us to examine who has historically been celebrated by the Academy and how representation among winners has changed over time. It can also reveal moments when “diversity” increased or stagnated, and how those shifts may relate to broader social or cultural changes.

The dataset cannot explain why certain nominees won or lost, since voting data and industry context are not included. Nor does it represent the full film industry or all acting performances, only those recognized by the Oscars. It also does not tabulate films’ popularity or gross, so the most widely viewed films or performances may be absent, and it includes no score or rating from a credible site. Ratings would let us check whether less critically acclaimed films won over more acclaimed ones; such a gap in reception could reveal whether voters had a particular bias toward any one race or gender.

The dataset also simplifies complex identities into limited racial categories, which may overlook nuance, and some bias may enter when the dataset’s owner inputs the data. One ideological bias that leaks through is the classification of countries as races. In the category “Foreign Language Film,” the dataset creator assigns each nominee a race, for example classifying the country of “Japan” as “Asian” and “Mexico” as “White.” This choice reflects the creator’s personal generalizations about race and oversimplifies the complex racial identities that make up each country. Additionally, the dataset uses only “Male” and “Female” as gender categories and does not account for the diversity of gender identities in the film industry; this limitation may erase the contributions of any non-binary or gender-nonconforming nominees or winners.

Overall, despite these structural and ideological limitations, the dataset remains a valuable resource for identifying significant trends and patterns within the history of the Academy Awards.