library(reticulate)
library(tidyverse)
library(janeaustenr)
data("emma")
glimpse(emma)
 chr [1:16235] "EMMA" "" "By Jane Austen" "" "" "" "" "VOLUME I" "" "" "" ...
Julian Barg
May 11, 2025
Every time I want to quickly bash out a command to code some data with ChatGPT in R, I stumble. It is not very difficult; it is just a very different workflow from what I usually do, so I have to look it up every time. As a result, I fall out of the flow.
The easiest way to interact with the OpenAI API, hands down, is to use the official Python package. Yes, you could use curl or the R equivalent, but in practice many of the best features, such as structured outputs, are either only available in the Python package, or the Python implementation is more feature-complete or easier to use.
Since reticulate provides an interface between R and Python, we can take full advantage of the native Python package when working in R. It plugs right into your regular workflow. First we load reticulate and grab some example data.
Next we switch over to Python and set up the function for coding our text data. We could get fancy and accept the role, task, etc. as function arguments, but that should generally not be necessary.
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal
import json

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

def code_male_female(line, model_, seed_ = 321):
    # System prompt: the coder's role and the coding task
    role = """
    You are a graduate research assistant assisting us with our research project
    on Jane Austen. Count the number of female and male characters per line.
    """
    system_role = {"role": "system", "content": role}

    # User prompt: the single line to be coded
    task = f"""
    Next, we will send you one line to analyze.
    {line}
    """
    prompt_json = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": task
            }
        ]
    }
    messages_ = [
        system_role,
        prompt_json
    ]

    # Schema for the structured output
    Certainty = Literal[
        "very certain", "certain", "neutral", "uncertain", "very uncertain"
    ]

    class CodingResponse(BaseModel):
        male: int = Field()
        female: int = Field()
        # other: int = Field()
        certainty: Certainty

    response = client.beta.chat.completions.parse(
        model = model_,
        messages = messages_,
        response_format = CodingResponse,
        seed = seed_  # pass the seed on for more reproducible output
    )
    # The message content is a JSON string matching CodingResponse
    return json.loads(response.choices[0].message.content)
test = code_male_female("This is just a text", "gpt-4o-mini")
print(test)
{'male': 0, 'female': 0, 'certainty': 'very certain'}
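If you do want the fancier variant, with the role and task passed in as arguments, the message-building part can be pulled out into a small helper. The sketch below is only an illustration: `build_messages` and its parameters are hypothetical names, not part of the function above or of the OpenAI package, and it only assembles the message list without calling the API.

```python
def build_messages(role, task_template, line):
    """Assemble the chat messages for one line of text.

    Hypothetical helper: `role` is the system prompt and `task_template`
    is a string containing a `{line}` placeholder.
    """
    system_role = {"role": "system", "content": role}
    prompt_json = {
        "role": "user",
        "content": [
            {"type": "text", "text": task_template.format(line=line)}
        ],
    }
    return [system_role, prompt_json]


messages = build_messages(
    role="You are a graduate research assistant. Count the characters per line.",
    task_template="Next, we will send you one line to analyze.\n{line}",
    line="This is just a text",
)
print(messages[1]["content"][0]["text"])
```

The returned list has the same shape as `messages_` in the function above, so it could be passed as the `messages` argument in its place.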
Now we can conveniently apply this function to our Jane Austen sample within R.
code_partial <- partial(py$code_male_female, model_ = "gpt-4o-mini")
results <- map(emma[1:20], code_partial, .progress = TRUE)
# A tibble: 20 × 3
    male female certainty
   <int>  <int> <chr>
 1     0      1 very certain
 2     0      0 very uncertain
 3     0      0 very certain
 4     0      0 very uncertain
 5     0      0 very uncertain
 6     0      0 neutral
 7     0      0 very certain
 8     0      0 very certain
 9     0      0 neutral
10     0      0 very certain
11     0      0 neutral
12     0      0 very certain
13     0      0 very uncertain
14     0      0 very certain
15     0      1 very certain
16     0      0 very certain
17     0      0 very certain
18     0      0 very certain
19     0      0 very certain
20     0      1 very certain
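The call to `map()` returns one result per line, while the table has one column per field. As a rough sketch of that reshaping on the Python side, the block below pre-fills the model argument with `functools.partial` (the counterpart of purrr's `partial()`), applies the coder to a few lines, and turns the list of dictionaries into columns. `fake_coder` is a made-up stand-in so the example runs without an API key, and `to_columns` is a hypothetical helper, not something this post defines.

```python
from functools import partial


def fake_coder(line, model_, seed_=321):
    # Stand-in for code_male_female: returns a fixed answer, no API call
    return {"male": 0, "female": 0, "certainty": "very certain"}


def to_columns(rows):
    # Turn a list of dicts with identical keys into a dict of lists
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


# Pre-fill the model so the coder only needs the line itself
code_partial = partial(fake_coder, model_="gpt-4o-mini")

lines = ["EMMA", "", "By Jane Austen"]
results = [code_partial(line) for line in lines]
columns = to_columns(results)
print(columns["certainty"])
```

Swapping `fake_coder` for the real `code_male_female` would send one request per line, so it is worth trying this on a handful of lines first.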