Translated Stata variable and value labels for the RAIS Dataset.
This repository provides translated variable and value labels for the RAIS Dataset. The Python code takes a Stata .dta file, pulls all variable and value labels, and uses the Google Translate API to translate them to a language of your choice.
I have already generated labels for 5 languages, available in the `labels-{lang}` folders.
## How do I use these labels?
The output is two .do files for each .dta: `<dtaname>-varlabs.do` and `<dtaname>-vallabs.do`, which one can run within Stata after opening the .dta file using `do <dtaname>-varlabs.do` and `do <dtaname>-vallabs.do`.
## Translating a Single File
Use `01_translate_local.ipynb` if you only wish to translate one file. In the 4th code block, you will see the code that tests the `doLabs()` function, which handles a single file. Just replace all the strings with your file paths, and run!
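For orientation, the label-pulling step can be sketched with pandas alone. This is not the notebook's `doLabs()` function; `pull_labels` is a hypothetical helper that only shows how pandas' `StataReader` exposes the labels stored in a .dta file.

```python
import pandas as pd


def pull_labels(dta_path: str):
    """Return (variable_labels, value_labels) read from a Stata .dta file."""
    # iterator=True returns a StataReader instead of a DataFrame
    with pd.read_stata(dta_path, iterator=True) as reader:
        var_labels = reader.variable_labels()  # {variable name: label text}
        val_labels = reader.value_labels()     # {label set name: {code: label text}}
    return var_labels, val_labels
```

Calling `pull_labels("path/to/file.dta")` gives two dictionaries whose text values can then be passed to a translation step.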
## Disclaimer

I am using the cleaned RAIS dataset courtesy of Ricardo Dahis (https://github.com/rdahis/clean_RAIS). Ricardo also provides some English labels under RAIS/extra/Variables_RAIS_1985-2018.xlsx. The workflow I use is more general: it covers all variable and value labels, can translate to multiple languages, and should provide a simpler method for attaching them to a dataset you are using (simply run the .do file). However, there is a risk of Google Translate messing up the translation. Use at your own discretion!
## Requirements
You will need pandas (`conda install -c conda-forge pandas`), Google Cloud Core (`conda install -c conda-forge google-cloud-core`) and Google Cloud Translate (`conda install -c conda-forge google-cloud-translate`) installed. (Optionally use pip.)

## Files
If you are just trying to learn how to translate one .dta file, `01_translate_local.ipynb` will suffice. The other two are meant for looping over multiple files and translating them all.
- `01_translate_local.ipynb`: A Jupyter notebook explaining the full process of connecting to Google Cloud, translating text, and creating programs to pull and translate the labels. Use this if you are just doing one or a couple of datasets locally.
- `02_translate_cloud.py`: A Python file meant for cloud computing. It can also be run locally with `python 02_translate_cloud.py` if preferred. It lets you specify a language in the call, e.g. `python 02_translate_cloud.py 'en'`.
- `03_array_job.sh`: A SLURM batch script for running `02_translate_cloud.py` on a SLURM-based high-performance computing cluster in the cloud. The RAIS data I was using took up 750GB, so I ran everything in the cloud. This is likely unnecessary if you are only trying to translate one file at a time.

## Replication Instructions

For those who just want the labels, feel free to take the labels from this repository: no need to replicate.
For those interested in replication, or for those interested in how to translate Stata dataset variable and value labels more generally, read on!
### Setting Up Google Cloud
These setup instructions are repeated in `01_translate_local.ipynb`.
Before you begin, you will need to set up an account and project in Google Cloud. You can use a personal Google account for this step.
#### 1: Create a Google Cloud account and project
On LDE:

- Run `gcloud init`.
- Run `gcloud auth application-default set-quota-project <PROJECT ID>`.
- Make sure Google Cloud Core (`conda install -c conda-forge google-cloud-core`) and Google Cloud Translate (`conda install -c conda-forge google-cloud-translate`) are installed in your anaconda environment. (Optionally use pip.)

On SLURM:

- Run `conda activate ENV_NAME`.
- Run `module load gcloud/379.0.0`.
- Run `gcloud init`.
- If needed, run `gcloud auth application-default set-quota-project <PROJECT ID>` to try again. You will know it worked when the last line in the terminal starts with “Quota project”.
- Make sure Google Cloud Core (`conda install -c conda-forge google-cloud-core`) and Google Cloud Translate (`conda install -c conda-forge google-cloud-translate`) are installed in your anaconda environment. (Optionally use pip.)

That’s it! To quickly check whether it has worked (on LDE or SLURM), open a quick IPython environment (or notebook) and try the following:
```python
from google.cloud import translate_v2 as translate

# Uses the application-default credentials configured above
translate_client = translate.Client()
translate_client.translate("Hello!", target_language='fr', source_language='en')
# OUT: {'translatedText': 'Bonjour!', 'input': 'Hello!'}
```
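Building on the check above, here is a minimal sketch of translating a whole dictionary of labels with the same client. It is not the repository's implementation: `translate_labels` is a hypothetical helper, the default source language `"pt"` assumes the labels are in Portuguese, and `batch_size=100` is a conservative assumption rather than a documented API limit. The client is passed in as an argument so the function can be tried with any object exposing a compatible `translate` method.

```python
from typing import Dict


def translate_labels(client, labels: Dict[str, str], target: str,
                     source: str = "pt", batch_size: int = 100) -> Dict[str, str]:
    """Translate the values of `labels`, keeping the keys (e.g. variable names) unchanged."""
    # Empty labels are copied through untranslated
    translated = {name: text for name, text in labels.items() if not text.strip()}
    names = [name for name in labels if labels[name].strip()]
    for start in range(0, len(names), batch_size):
        chunk = names[start:start + batch_size]
        # Passing a list of strings returns a list of result dicts in the same order
        results = client.translate(
            [labels[name] for name in chunk],
            target_language=target,
            source_language=source,
            format_="text",  # request plain text rather than HTML-formatted output
        )
        for name, result in zip(chunk, results):
            translated[name] = result["translatedText"]
    return translated
```

With the client from the check above, a call would look like `translate_labels(translate_client, var_labels, "en")`, where `var_labels` is a dictionary mapping variable names to their original labels.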
Feel free to reach out if anything is not working. You can email me at aaron [dot] wolf [at] u [dot] northwestern [dot] edu.