Using the OpenAI API to quickly code data in R – Dr. Julian Barg, climate disinformation research

Every time I want to quickly bash out a command to code some data with ChatGPT in R, I stumble. It is not very difficult, just a very different workflow from the stuff I usually do, so I have to look it up every time. As a result, I am no longer in the flow.

The easiest way to interact with the OpenAI API, hands down, is to use the official Python package. Yes, you could use curl or the R equivalent, but in practice many of the best features, such as structured outputs, are much easier to use from the Python package, which also tends to be the most feature-complete client.

Since reticulate provides an interface between R and Python, we can take full advantage of the native Python package when working in R. It plugs right into your regular workflow. First we load reticulate and grab some example data.

library(reticulate)
library(tidyverse)
library(janeaustenr)
data("emma")
glimpse(emma)

 chr [1:16235] "EMMA" "" "By Jane Austen" "" "" "" "" "VOLUME I" "" "" "" ...
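One thing worth checking before the Python part: when `OpenAI()` is called without arguments, it looks for the API key in the `OPENAI_API_KEY` environment variable. Below is a minimal sketch of an up-front check. The helper name `api_key_available` is made up for illustration and is not part of the openai package.

```python
import os

def api_key_available(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the environment."""
    # Hypothetical helper; `env` can be any dict-like object, which makes testing easy.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not api_key_available():
    print("OPENAI_API_KEY is not set; OpenAI() will not be able to authenticate.")
```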

Next we switch over to Python and set up the function for coding our text data. We could get fancy and accept role and task etc. as function arguments, but that should generally not be necessary.

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal
import json

client = OpenAI()
def code_male_female(line, model_, seed_ = 321):
  role = """
  You are a graduate research assistant assisting us with our research project 
  on Jane Austen. Count the number of female and male characters per line.
  """
  system_role = {"role": "system", "content": role}

  task = f"""
  Next, we will send you one line to analyze.
  
  {line}
  """
  prompt_json = {
    "role": "user", 
    "content": [
      {
        "type": "text", 
        "text": task
      }
    ]
  }
  messages_ = [
    system_role, 
    prompt_json
  ]
  Certainty = Literal[
    "very certain", "certain", "neutral", "uncertain", "very uncertain"
    ]
  class CodingResponse(BaseModel):
    male: int = Field()
    female: int = Field()
    # other: int = Field()
    certainty: Certainty
    
  response = client.beta.chat.completions.parse(
    model = model_,
    messages = messages_,
    response_format = CodingResponse,
    # pass the seed through; it was accepted as an argument but never used
    seed = seed_
  )
  return json.loads(response.choices[0].message.content)
  
test = code_male_female("This is just a text", "gpt-4o-mini")
print(test)

{'male': 0, 'female': 0, 'certainty': 'very certain'}
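For the "fancy" variant mentioned earlier, here is a rough sketch of how the role and task could become arguments. It only builds the message list and makes no API call. The names `build_messages`, `DEFAULT_ROLE`, and `DEFAULT_TASK` are illustrative assumptions and not part of the openai package. As a side effect, `dedent` strips the indentation that the triple-quoted strings would otherwise carry into the prompt.

```python
from textwrap import dedent

# Illustrative defaults, taken from the prompt used in code_male_female.
DEFAULT_ROLE = """
    You are a graduate research assistant assisting us with our research
    project on Jane Austen. Count the number of female and male characters
    per line.
"""

DEFAULT_TASK = """
    Next, we will send you one line to analyze.

    {line}
"""

def build_messages(line, role=DEFAULT_ROLE, task=DEFAULT_TASK):
    """Assemble the system and user messages for one line of text."""
    system_msg = {"role": "system", "content": dedent(role).strip()}
    # Dedent the template first, then fill in the line, so the line's own
    # whitespace does not affect the dedenting.
    text = dedent(task).strip().format(line=line)
    user_msg = {"role": "user", "content": [{"type": "text", "text": text}]}
    return [system_msg, user_msg]

messages = build_messages("This is just a text")
print(messages[1]["content"][0]["text"])
```

The returned list has the same shape as `messages_` above, so it could be handed to the `messages` argument unchanged.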

Now we can conveniently apply this function to our Jane Austen sample within R.

code_partial <- partial(py$code_male_female, model_ = "gpt-4o-mini")
results <- map(emma[1:20], code_partial, .progress = TRUE)

bind_rows(results)

# A tibble: 20 × 3
    male female certainty     
   <int>  <int> <chr>         
 1     0      1 very certain  
 2     0      0 very uncertain
 3     0      0 very certain  
 4     0      0 very uncertain
 5     0      0 very uncertain
 6     0      0 neutral       
 7     0      0 very certain  
 8     0      0 very certain  
 9     0      0 neutral       
10     0      0 very certain  
11     0      0 neutral       
12     0      0 very certain  
13     0      0 very uncertain
14     0      0 very certain  
15     0      1 very certain  
16     0      0 very certain  
17     0      0 very certain  
18     0      0 very certain  
19     0      0 very certain  
20     0      1 very certain
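The sample contains many empty strings (see the `glimpse()` output above), and each of them still costs an API call. One way to avoid that is a thin wrapper that skips blank lines, sketched below. `code_nonblank` and `fake_coder` are hypothetical names, and returning `None` for a blank line is a design choice rather than anything the API requires.

```python
def code_nonblank(line, coder):
    """Call `coder` only for lines with visible text; return None otherwise."""
    if not line.strip():
        return None
    return coder(line)

# Stand-in for code_male_female so the sketch runs without an API key.
calls = []

def fake_coder(line):
    calls.append(line)
    return {"male": 0, "female": 0, "certainty": "very certain"}

lines = ["EMMA", "", "By Jane Austen", "", ""]
results = [code_nonblank(line, fake_coder) for line in lines]
print(len(calls))  # 2: only the two non-empty lines reach the coder
```

In the real pipeline, `coder` would be `code_male_female` with the model fixed, for example via `functools.partial`.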