Generation of Census Tabular Data with Generative Adversarial Networks

Mayank Agarwal
May 1, 2021

Growing demand for data, together with concern for its privacy, encourages the generation of synthetic data that is as realistic and useful as real data and that enables important applications such as data disclosure, data accessibility, and privacy preservation. However, tabular data usually contains a mix of discrete and continuous columns, so building such a model is a non-trivial task. Deep generative models such as the Generative Adversarial Network (vanilla GAN) and its variants, the Wasserstein GAN (WGAN) and the Conditional GAN (CGAN), are promising approaches for synthesizing new data: they can produce synthetic data that is usable for research while maintaining consumer privacy. WGAN and CGAN are designed to preserve the modes of the data distribution and to mitigate problems such as mode collapse and vanishing gradients, which occur in vanilla GANs. Previous GAN research has yielded quite promising results for image datasets. This work focuses on the feasibility of using GANs to generate tabular U.S. census tract data, and develops a comprehensive evaluation framework for comparing the synthetic data produced by different GAN models with the real data.

In this blog, I generate United States census data with these GANs and evaluate their performance on 2017 data for every census tract in the U.S., including DC and Puerto Rico. The questions I focus on are:

1. Can a GAN generate synthetic U.S. census tract tabular data in a manner that preserves the statistical properties of the underlying real data? To answer this, I consider a few variables, namely Income, Poverty, and Unemployment, and compare the distributions of the real and generated data.

2. Does the generated data show the same correlations as the real data? To answer this, I cross-check the correlation between Income and race, between Income and Poverty, between the Black or Hispanic race features and Poverty, and finally between Income and Unemployment.

A short introduction to GANs

Generative adversarial networks are an emerging technique for both semi-supervised and unsupervised learning. They are trained as a pair of competing networks. A common analogy: the generative model can be thought of as a team of counterfeiters trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles. The counterfeiters, known in the GAN literature as the generator, G, produce fake data with the aim of making it look realistic. The police, known as the discriminator, D, receive both fake and real data and aim to tell them apart. Both are trained concurrently, in competition with each other.

Generative Adversarial Nets. Source: https://www.toptal.com/machine-learning/generative-adversarial-networks.

Put more formally, a GAN has a generator G(.) and a discriminator D(.). The generator is expected to map a multivariate Gaussian distribution to the data distribution. The discriminator is intended to judge whether the distribution parameterized by the generator matches the distribution of the real training data. The generator and discriminator arm-wrestle, and when both are optimal they reach a Nash equilibrium.
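
For reference, this two-player game is usually written as the minimax value function from the original GAN paper (reference 1). The notation is an assumption of this sketch: p_data is the real data distribution and p_z is the noise prior fed to the generator, which matches the Gaussian input described above.

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
$$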

The GAN variants used to improve performance:

CGAN

In a vanilla GAN, we can’t direct the data generation process because there is no control over the modes of the synthetic data. The generation can be steered, however, if we supply additional information. A GAN becomes a conditional model, called CGAN, when both the generator and the discriminator are conditioned on additional information y, where y can be a class label or any other auxiliary information. The conditioning is achieved by feeding y into both the generator and the discriminator as an extra input layer.

Mathematical Equation for Conditional GAN
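
The original equation image is not reproduced here. As a reconstruction, the objective given in the CGAN paper (reference 3) has the following form, where y is the conditioning information described above, p_data is the real data distribution, and p_z is the noise prior (the last two symbols are notation assumed for this sketch):

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]
$$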

WGAN

Traditionally, binary cross-entropy loss is used to train GANs. With it, one often encounters problems such as vanishing gradients and mode collapse on multi-modal data distributions. To mitigate these problems, which stem from the underlying cost function of the architecture, WGAN uses the Earth-Mover (EM) distance, also called Wasserstein-1, and constructs its value function using the Kantorovich-Rubinstein duality.

The mathematical equation for Wasserstein GAN
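
The original equation image is not reproduced here. As a reconstruction, the WGAN value function obtained through the Kantorovich-Rubinstein duality is usually written as below (reference 2). The notation is assumed for this sketch: P_r is the real data distribution, P_g is the distribution of generated samples, and the maximum runs over the set of 1-Lipschitz critics D.

$$
\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]
$$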

New objective loss function with gradient penalty

Improved loss function with gradient penalty in WGAN
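
The original equation image is not reproduced here. As a reconstruction, the critic loss with gradient penalty is commonly written in the WGAN-GP literature as below. The notation is assumed for this sketch: P_r and P_g are the real and generated distributions, and the points x-hat are sampled uniformly along straight lines between pairs of real and generated samples.

$$
L = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1\right)^2\right]
$$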

Here lambda is the penalty coefficient. In this way, WGAN with gradient penalty provides well-behaved gradients and improves the stability of the optimization process.

Data preprocessing

To compare the performance of the different GANs, I used U.S. census tract data from Kaggle for the year 2017, including DC and Puerto Rico. The dataset has 37 features and 74,001 observations.

Each record belongs to a particular county of a state. There are 52 state-level entries (the 50 states plus DC and Puerto Rico) and 1955 counties. The dataset contains both categorical and numerical columns, which have different distributions. Rows that have values only for State and County and no values for the other features are excluded from further analysis. The State and County columns are categorical. To process them in later stages, I used frequency encoding to replace each categorical value with a numerical value, then took the base-2 logarithm of the result to dampen the frequencies, and renamed the column to state frequency (18). Missing values in the Income and Poverty columns were imputed with the median of the respective column. Highly correlated features, and features such as Drive and Carpool that don’t contribute to my goals, were removed. The final dataset has standardized values for 13 columns and 73189 rows.
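
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the exact pipeline behind the results in this post: the toy rows, the helper names (frequency_encode_log2, impute_median, standardize), and the choice to encode with raw counts rather than relative frequencies are all assumptions.

```python
import math
import statistics
from collections import Counter

# Toy rows standing in for census tracts; the values are made up for illustration.
rows = [
    {"State": "Alabama", "Income": 40000.0, "Poverty": 20.0},
    {"State": "Alabama", "Income": None, "Poverty": 18.0},
    {"State": "Alaska", "Income": 70000.0, "Poverty": None},
    {"State": "Alabama", "Income": 52000.0, "Poverty": 12.0},
]


def frequency_encode_log2(values):
    """Replace each category with log2 of its count (assumed form of the encoding)."""
    counts = Counter(values)
    return [math.log2(counts[v]) for v in values]


def impute_median(values):
    """Fill missing entries (None) with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


state_freq = frequency_encode_log2([r["State"] for r in rows])
income = impute_median([r["Income"] for r in rows])
poverty = impute_median([r["Poverty"] for r in rows])
income_std = standardize(income)
poverty_std = standardize(poverty)

print(state_freq)
print(income_std)
```

On real data the same steps are usually done with a dataframe library; the standard library is used here only to keep the sketch self-contained.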

Because CGAN makes use of class labels, I first settled on the number of clusters with the Elbow method. Then, using the unsupervised K-means clustering algorithm, I divided the dataset into two classes based on the Income feature.
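
The labelling step can be sketched as below. This is a hedged illustration rather than the code used for this post: it runs a plain Lloyd-style K-means on a single toy Income column, prints the within-cluster sum of squares for several k (the quantity inspected in the Elbow method), and then derives two class labels. The function name kmeans_1d and the toy incomes are hypothetical.

```python
import random


def kmeans_1d(values, k, iters=100, seed=0):
    """Lloyd's algorithm on one feature; returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    # Start from k distinct data values so every center begins on a real point.
    centers = rng.sample(sorted(set(values)), k)

    def assign(points, current):
        # Index of the nearest center for every point.
        return [min(range(k), key=lambda j: (p - current[j]) ** 2) for p in points]

    for _ in range(iters):
        labels = assign(values, centers)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers

    labels = assign(values, centers)
    inertia = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, centers, inertia


# Toy incomes (made up): a low-income group and a high-income group.
incomes = [30000, 32000, 35000, 38000, 90000, 95000, 100000, 110000]

# Elbow method: look for the k after which inertia stops dropping sharply.
inertias = {k: kmeans_1d(incomes, k)[2] for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value))

# Two classes for the CGAN condition; relabel so that 0 is the lower-income cluster.
labels, centers, _ = kmeans_1d(incomes, 2)
low = centers.index(min(centers))
class_labels = [0 if lab == low else 1 for lab in labels]
print(class_labels)
```

The cluster indices returned by K-means are arbitrary, which is why the sketch relabels them by the cluster mean before using them as the condition y.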

The figure below shows the real median household income by race and Hispanic origin in the U.S. from 1967 to 2017. Black and Hispanic households had lower real median income than other groups.

Line Chart showing the median Income by different Races in U.S. Source: U.S. Census Bureau, Current Population Survey, 1967 to 2017, Annual Social and Economic Supplements

Features of Interest and Correlation between them

I chose to compare the point estimates of Income, Poverty, and Unemployment in the real data with those in the generated data.

Table showing the point estimators in Real data
Box plots for the features selected

I would also like to cross-check whether the generated data exhibits the same correlations as the real data:

1. The Hispanic and Black features are negatively correlated with Income.

2. Black and Hispanic races are positively correlated with Poverty.

3. Poverty and Unemployment are negatively correlated with Income.
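
One way to run this cross-check is to compute the Pearson correlation for each feature pair in both datasets and compare the sign and size. The sketch below is illustrative only: the two small tables are made-up stand-ins for the real and generated data, and compare_correlations is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def compare_correlations(real, synthetic, pairs):
    """For each feature pair, return (r_real, r_synthetic, same_sign)."""
    report = {}
    for a, b in pairs:
        r_real = pearson(real[a], real[b])
        r_syn = pearson(synthetic[a], synthetic[b])
        report[(a, b)] = (r_real, r_syn, (r_real > 0) == (r_syn > 0))
    return report


# Made-up toy columns; each list is one feature across five tracts.
real = {
    "Income": [30, 45, 60, 80, 100],
    "Poverty": [30, 22, 15, 9, 5],
    "Unemployment": [12, 9, 7, 5, 3],
    "Hispanic": [40, 30, 25, 10, 5],
}
synthetic = {
    "Income": [35, 50, 58, 75, 90],
    "Poverty": [27, 20, 17, 11, 8],
    "Unemployment": [11, 9, 8, 5, 4],
    "Hispanic": [35, 33, 20, 14, 8],
}
pairs = [
    ("Income", "Poverty"),
    ("Income", "Unemployment"),
    ("Hispanic", "Income"),
    ("Hispanic", "Poverty"),
]

report = compare_correlations(real, synthetic, pairs)
for pair, (r_real, r_syn, same_sign) in report.items():
    print(pair, round(r_real, 3), round(r_syn, 3), same_sign)
```

A matching sign is the weakest requirement; the gap between the two coefficients shows how faithfully the strength of the association is reproduced.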

Statistical Evaluations of Results

Our focus here is to show how well the GAN models preserve the statistical properties of the underlying data. For this, we compare box plots and pairwise associations between the features of interest in the real and generated datasets. To compare the generated dataset with the underlying dataset, we also visually inspect plots of the density functions and cumulative distributions. Plots are shown for 2000 generated standardized samples, which are mapped back with the inverse transform so that they are on the same scale as the columns of the real data.
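
The visual comparison can be complemented with a number. The sketch below is an assumption-laden illustration, not part of the original evaluation: it undoes the standardization of generated samples and then measures the largest gap between the empirical cumulative distributions of real and generated values (the two-sample Kolmogorov-Smirnov statistic). The helper names and the toy values are hypothetical, and the generated z-scores are deliberately bunched together to mimic mode collapse.

```python
import bisect
import statistics


def inverse_standardize(z_values, mean, std):
    """Map standardized samples back to the original scale of the column."""
    return [z * std + mean for z in z_values]


def ks_distance(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for v in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy real Income column (made up) and the statistics used to standardize it.
real_income = [30000.0, 40000.0, 50000.0, 60000.0, 70000.0]
mean = statistics.fmean(real_income)
std = statistics.pstdev(real_income)

# Hypothetical generator output in standardized units, clustered near the mean.
generated_z = [-0.1, 0.0, 0.1, 0.05, -0.05]
generated_income = inverse_standardize(generated_z, mean, std)

ks = ks_distance(real_income, generated_income)
spread_ratio = statistics.pstdev(generated_income) / std

print(round(ks, 3))
print(round(spread_ratio, 3))
```

A large distance together with a spread ratio far below 1 is the numeric counterpart of the narrow density peak that signals mode collapse in the plots.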

1. The density plots and cumulative distributions clearly indicate mode collapse in the vanilla GAN. In the tails, CGAN and WGAN do not follow the real data distribution.
Density and Cumulative plots showing real and fake data for Income Feature

2. The figure below shows the mean as the point estimate for the features of interest in the real data and in the data generated by the different GAN models. The means are not exactly the same as in the real data, but the CGAN and WGAN results are comparatively better than those of the vanilla GAN. More samples and further fine-tuning of the models might improve the results.

Point estimates for the real and generated data for the selected features Income, Poverty, and Unemployment, respectively, by different GAN methods

3. The correlations between the features of interest show the following outcomes in all GAN models:

• The Black and Hispanic features are negatively correlated with Income.

• The Hispanic and Black features are more strongly associated with Poverty and Unemployment than those of other races.

• Poverty and Unemployment are negatively correlated with Income.

Pairwise association (correlation) between features: comparing different GAN methods

The goal of this case study is to examine the capabilities of GAN-based models for generating synthetic U.S. census tabular data with specific categorical and continuous features. The emphasis is on preserving the statistical properties of the real data in the synthesized data, so that subsequent analysis can benefit from more available data without compromising the privacy of the underlying data. To this end, three GAN models are evaluated: vanilla GAN, WGAN, and CGAN. The limitations of the vanilla GAN, such as mode collapse and vanishing gradients, are addressed by WGAN with a gradient penalty and by CGAN, which conditions on a label constructed from the Income feature. The most important step in generating synthetic data with GANs is the preprocessing of the real data, which was done here to handle categorical and numerical features carefully. Optimizing hyperparameters, such as the choice of non-linear activation function, the choice of critic in WGAN, and the learning rate of the optimizer used to learn the weights by back-propagation in the convolutional layers of the deep neural networks in the GAN architecture, also improves the results. The resulting distributions and correlation charts show that WGAN and CGAN capture the pairwise associations between the features of interest better than the vanilla GAN. Still, there are flaws in the generated data: its values are confined to a narrower range than those of the real data.

Sample of Extracted generated data in .CSV file format

Real Data
Generated data by GAN
Generated data by WGAN
Generated data by CGAN

These are the results of generating tabular U.S. census data with GANs. I hope this blog gives you some insight into the generation of tabular data, how to compare results, and how to check whether the generated data shows the same statistical properties as the real data. I will upload the GitHub link after some modifications and comments.

If you find this useful, have suggestions for better ways to compare the results generated by GANs, or have a question, write in the comments below. Also, do let me know about any improvements required. Thanks.


References:

  1. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
  2. Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  3. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  4. Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. arXiv preprint arXiv:1806.03384, 2018.
