Do Grammatical Errors in Social Media Posts Affect the Perception of the Authors' Intelligence?

Posted by Cris Benge, Stone Jiang, and Andrew Fogarty on Sat 10 April 2021

View us on GitHub


Study Overview

Command of language is one of the most significant cognitive abilities we possess and is often the most pervasive signal we encounter in a social media setting. When we notice overt and unintentional grammatical errors in social media posts, do we make unconscious assumptions about the authors’ general intelligence? Do we attribute difficulty with written language to other indicators such as lower verbal acuity or lower overall intelligence? Further, are some categories of grammatical errors more injurious than others, or do we take all of these trespasses in stride?

General intelligence, sometimes referred to as cognitive ability, includes our capacity to "reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience" (Plomin, 1999). Others' assessments of our cognitive abilities often inform critical judgments that affect our educational, occupational, and relationship opportunities (B. M. Newman & P. R. Newman, 2020). Though social media channels are often used to identify potentially qualified job candidates, their use in screening candidates for suitability and in background investigations is also on the rise (Driver, 2020). In the CareerBuilder survey that noted an increase in social media screening, 57% of employers reported rejecting candidates based on negative findings in applicant social media posts. Among employers who rejected candidates, 27% cited "poor communication skills" as the primary factor for the rejection.

There is evidence that we ought to take this question seriously. In (Borkenau & Liebler, 1993), college students rated the perceived intelligence of strangers after watching them read aloud a pre-written weather report. The study found a significant correlation between the strangers' perceived and measured IQ scores, suggesting that verbal communication carries some information about individual intelligence. (Kreiner et al., 2002) showed that, while a small percentage of typographical errors did not significantly affect perceived intelligence ratings, a larger number of typographical errors or the presence of phonological errors did significantly influence the perception of cognitive writing abilities. The participants in these studies were composed entirely of college students, and it may be the case that other populations would arrive at a different outcome. In (Silverman, 1990), college professors gave equally high perceived intelligence ratings to hypothetical students with and without verbal language difficulties, such as stuttering. More work is necessary to fully understand these questions, particularly in the context of contemporary social media channels, where abbreviation, skipped punctuation, and slang are frequently employed to accommodate restrictive post-length limits on popular platforms.

Building on these previous studies, our experiment seeks to better understand the effect of two types of spelling errors on the perceived intelligence of the writers of social media posts. Our hypothesis is that both typographical and phonological spelling errors, compared to no spelling errors at all, will lead to a decreased level of perceived intelligence, irrespective of the nature, content, and platform of the social media post. In addition, we expect the effect of phonological errors to be greater than that of typographical errors, based both on (Kreiner et al., 2002) and on the fact that typographical errors are common in mainstream social media and do not necessarily reflect the writer's inability to spell the word.


Experiment Design

The purpose of this study is to assess the impact, if any, of spelling errors in social media posts on the perceived intelligence of their authors. If the perception of author intelligence is significantly perturbed by such errors, it will be useful to know whether particular categories of error are more or less deleterious. We leave outside the scope of this experiment the question of whether a correlation between spelling errors and measured intelligence exists, as we are interested only in the potential causal relationship between written errors and the perception of intelligence.

Our potential outcomes are framed as follows: we compare the average perceived level of intelligence of writers of social media posts when either typographical or phonological spelling errors are made in the post to what would have happened had the post contained no errors. Since we can only measure one potential outcome per participant (no error, typographical, or phonological), we adopt a pre-test/post-test control group study design (ROXO). In this design, we first randomize participants at the start of the study, giving each participant an equal chance of being assigned to Control, Typographical, or Phonological. Next, all participants are measured on one post (identical for all groups and containing no errors; more details on measurement below) in order to establish a pretreatment baseline across all individuals. Each group is then subjected to its particular treatment for 5 more posts (Control receives posts with no errors, Typographical receives posts with only what we deem typographical errors, and Phonological receives posts with only what we deem phonological errors). We then measure participants' responses after each post. Our null and alternative hypotheses are stated as follows:

H0: In a comparison of individuals, those who are exposed to either treatment (typographical or phonological errors) will not view the post authors as less intelligent than will those exposed to control (no errors).

H1: In a comparison of individuals, those who are exposed to either treatment (typographical or phonological errors) will view the post authors as less intelligent than will those exposed to control (no errors).

In order to establish that our randomization process worked correctly, we perform a covariate balance check using the R package "cobalt". We check all covariates not related to the pretreatment measure (that will be handled later with a placebo test), including demographic information and how often individuals in each group read and write social media posts. This check measures raw differences in proportion for categorical variables across the control and treatment groups. For example, how often an individual reads social media has 5 potential levels ("Weekly", "Less than Weekly", "Daily", "More than once a day", and "Prefer not to say"), and the difference in the proportion of individuals belonging to each level is calculated across all three groups. There is evidence that this raw difference is a strong predictor of potential bias, and thresholds of 0.1 to 0.25 have been proposed as satisfactory. Our check (see our analysis notebook for more details) indicates that 36 of the 39 levels of our covariates pass the balance check at a threshold of 0.1, while the remaining 3 pass at a threshold of 0.15. Two of these three are levels of how often one reads social media ("Daily" at 0.1575, "More than once a day" at 0.1022), and one is a level of how often one writes social media ("Less than Weekly" at 0.1476). All other levels of these variables pass the 0.1 threshold. Given such small differences in only a few levels of two variables, all of which fall below or well below the more permissive threshold of 0.25, we consider the covariate balance check to have passed. Bias should not be an issue when including these covariates in our analysis.
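For reference, a minimal sketch of how such a check can be run with cobalt's bal.tab is shown below; the data frame and covariate column names (d, group, reads_social, writes_social, and the demographic fields) are assumptions for illustration, not the exact names used in our analysis notebook.

  # Sketch of the covariate balance check, assuming a data frame `d` with a
  # three-level assignment column `group` (Control / Typographical / Phonological)
  # and hypothetical covariate column names.
  if (!require("pacman")) install.packages("pacman")
  pacman::p_load(cobalt)

  bal.tab(
    group ~ gender + race + english + degree + student + reads_social + writes_social,
    data = d,
    binary = "raw",            # report raw differences in proportion for categorical levels
    thresholds = c(m = 0.1)    # flag any level whose difference exceeds 0.1
  )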

Participants

For the pilot study, we recruited volunteer survey participants through a combination of social media posts on Facebook and Slack. Following the pilot, the formal experiment recruited through a combination of Amazon Mechanical Turk (Mar 31, 2021 - Apr 1, 2021) and the University of California, Berkeley XLab (Apr 5, 2021 - Apr 9, 2021). Data collected after this window (about 40 additional responses) was intentionally not reported, as we had already started our analysis and did not want our knowledge of the majority of the dataset to influence decisions made on this remaining data. As a disclaimer, however, including this additional data did not change any of our reported conclusions. The survey ran as part of an omnibus alongside other experiments on a variety of topics being researched during the Spring 2021 semester of the UC Berkeley MIDS program; all demographic data was shared as a common resource for each sub-survey.

Pilot Study

Our pilot was conducted on Mar 10, 2021 - Mar 11, 2021; open solicitations were posted on public walls and message boards accessible to the study organizers, including Facebook and UC Berkeley's Slack. Randomization into one of three branches {Control, Treatment A (typographical errors), Treatment B (phonological errors)} occurred at the point of accessing a static web app that redirected to the appropriate survey form. Surveys were built with the Microsoft Office Forms product, and data for each branch was collected on March 11, 2021 at the close of the pilot.

Our Pilot Study Architecture

Pilot randomization

To establish randomization as close to the point of survey as possible, all participants were redirected just prior to survey initiation via a static web app using the following JavaScript:

  <!-- Randomization Code for Study Participants -->
  <script>
    var urls = ["<control group url>", "<treatment group A url>", "<treatment group B url>"];
    window.location.href = urls[Math.floor(Math.random() * urls.length)];
  </script>


Pilot Participation

We received participation from 31 volunteers for our pilot study between Mar 10, 2021 - Mar 11, 2021. Through randomization, assignment to control and treatment groups was as follows:

Despite having only a small pilot dataset to work with, we observed a very notable result in our regression analysis. We employed a simple linear model of treatment against the intelligence outcome of interest and controlled for each question to allow for separate means. The results showed high statistical significance for both treatment groups relative to control, with phonological errors having more than double the effect of typographical errors.

NOTE: We removed Q4 (i.e., 'Nature') from the live study because the length of the post was universally regarded as too long by the experiment organizers and pilot study participants. Accurately measuring an effect requires that participants read the entire post, and Q4 was discouraging that behavior. Therefore, the results here do not include Q4; however, if we did include Q4, the results remain the same (highly statistically significant) with even slightly more negative coefficient estimates.

Dependent variable: Intelligence

                        Coefficient   (Std. Error)
TypeP                   -1.567***     (0.283)
TypeT                   -0.726***     (0.226)
factor(q_num)2          -0.677**      (0.321)
factor(q_num)3          -1.065***     (0.321)
factor(q_num)5          -0.516        (0.321)
factor(q_num)6          -0.387        (0.321)
Constant                 4.962***     (0.261)

Observations             155
R2                       0.227
Adjusted R2              0.196
Residual Std. Error      1.265 (df = 148)
F Statistic              7.243*** (df = 6; 148)

Note: *p<0.1; **p<0.05; ***p<0.01
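For reference, a minimal sketch of the model behind the table above is shown below. The data frame `pilot` and its column names (Intelligence, Type with levels C/T/P, q_num) are assumptions meant to mirror the variable labels in the table, not the exact code from our analysis notebook.

  if (!require("pacman")) install.packages("pacman")
  pacman::p_load(data.table, stargazer)

  # Sketch only: `pilot` is assumed to hold one row per (participant, post) with
  # the 1-7 Intelligence rating, assignment Type (C = control, T = typographical,
  # P = phonological), and post number q_num.
  pilot[, Type := relevel(factor(Type), ref = "C")]
  mod <- lm(Intelligence ~ Type + factor(q_num), data = pilot)

  # Produce a text regression table similar to the one shown above
  stargazer(mod, type = "text", dep.var.labels = "Intelligence")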

Live Study

Our live study was conducted between Mar 31, 2021 - Apr 9, 2021, divided into two main phases: Amazon Mechanical Turk participants had access from Mar 31, 2021 through Apr 1, 2021, and UC Berkeley XLab participants had access to the omnibus survey from Apr 6, 2021 through Apr 9, 2021. Randomization and survey execution were carried out through Qualtrics, as depicted below.

Our Live Study Architecture

Live Study Participation

In our live study, we collected survey feedback from a total of 265 participants, with the vast majority recruited through UC Berkeley XLab (209, or ~78.9%); the remaining participants were recruited through Amazon Mechanical Turk (56, or ~21.1%). The balance of group assignment came out to be nearly uniform, with a slight advantage to the control group.


Methodology

Participants were invited to join a short, anonymous survey with the communicated intent of assessing their opinion on the appropriateness of the length of social media posts. Deception was employed to avoid an anticipated response bias: individuals who occasionally commit typographical or phonological errors themselves might, when asked to consciously consider the intelligence of those who commit errors, provide a more charitable rating. After collection of basic, shared demographic data, participants are presented with a series of seven example social media posts, each followed by eight questions. The questions are delivered in the following order and categories:

  • Attention Question
    • Recall question about the content of the social media post
  • Decoy Questions
    • 5-point Likert question about the appropriateness of the length of the social media post
    • 7-point Likert question about attitude toward the content of the post
  • Outcome of Interest Questions
    • 7-point Likert question asking if the author was 'effective' in communicating their message
    • 7-point Likert question asking the participant's opinion of the 'intelligence' of the author
    • 7-point Likert question asking the participant's opinion of the author's 'writing skills'
  • Decoy Question
    • 7-point Likert question asking the participant's level of interest in meeting the author
  • Attention Question
    • Question asking how many spelling or grammar mistakes the participant noticed

Treatment Experience

For our study, we constructed seven fictitious social media posts: one control post that everyone receives, regardless of branch assignment, and six posts that are identical in content across groups save for deliberate typographical errors (treatment group 1) or deliberate phonological errors (treatment group 2); the control group's versions of these posts contain no errors. The control post contains no grammatical or spelling errors and is presented as the first question for all participants as a mechanism to prime participants for attention and to elicit more careful reading of the following posts. While control group postings are meant to avoid grammar and spelling mistakes, some loose language is used to establish credibility as 'genuine' to a normal social media interaction. All posts cover topics that are intended to be banal so as to avoid evoking the heightened emotional states we assume topics such as religion or politics would provoke.

Attention Social Media Post (Post 0) - All Participants

The control post (dubbed Post 0) in our experiment, which all assignment groups see first to prime participant focus and attention.


Control Group : Posts 1 - 6

Below are the six posts seen only by the control group participants. Click on any image to see a full-sized version.

Post 1 Post 2 Post 3
     
Post 4 Post 5 Post 6


Treatment Group 1 (Typographical) : Posts 1 - 6

Below are the six posts seen only by treatment group 1 (typographical errors) participants. Click on any image to see a full-sized version.

Post 1 Post 2 Post 3
     
Post 4 Post 5 Post 6


Treatment Group 2 (Phonological) : Posts 1 - 6

Below are the six posts seen only by treatment group 2 (phonological errors) participants. Click on any image to see a full-sized version.

Post 1 Post 2 Post 3
     
Post 4 Post 5 Post 6


Pre-Survey Power Analysis

Prior to conducting the experiment, we performed a power analysis to estimate the number of compliant study participants we would need to reach our desired power of 0.8 and alpha of 0.05. Below is the R code we used to perform our pre-experiment power analysis. For a range of effect sizes from 0.2 to 1 (where effect size is the average difference in mean perceived Intelligence between either treatment and control), we calculate the number of participants needed to achieve our desired error rates using a standard t-test from the R package "pwr" (loaded via "pacman").

if (!require("pacman")) install.packages("pacman")
pacman::p_load(pwr, ggplot2, install = TRUE)

effect_sizes <- seq(0.2,1.0,0.05)
participants_needed <- vector(mode="logical", length=length(effect_sizes))

l <- 1
for (j in effect_sizes){
  participants_needed[l] <- ceiling(pwr.t.test(d = j/2, power = 0.80, sig.level = 0.05)$n)
  l <- l + 1}

results <- data.table(
  effect_size = effect_sizes,
  participants_needed = participants_needed)

A visual of the results is presented below. We see that for very small effect sizes of <0.25, over 1,000 subjects would be needed. For a still-conservative effect size of 0.5 (keeping in mind our Intelligence scale ranges from 1-7), we would need approximately 250-300 participants to meet our desired power and significance goals, which is something XLab is able to provide.
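A plot along these lines can be generated from the `results` table with a short ggplot2 call (a sketch, not necessarily the exact code behind the figure):

# Sketch of the power curve from the `results` table built above
# (continues from the previous block; ggplot2 is loaded there via pacman)
ggplot(results, aes(x = effect_size, y = participants_needed)) +
  geom_line() +
  geom_point() +
  labs(x = "Effect size (difference in mean Intelligence rating)",
       y = "Participants needed per group",
       title = "Pre-survey power analysis (power = 0.8, alpha = 0.05)")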



Flow diagram

Below is our CONSORT document showing the flow of our study. Of the 265 recruited individuals, 209 (from UCB XLab) were included in the final analysis.

In the first part of the analysis, we consider compliance a non-issue, as all participants had the opportunity to read their respective posts and thus received either control or treatment. However, even though treatment was always delivered, whether it actually made an impression on the reader is a different question. In order to gauge this, our final "Attention Question" checked whether participants noticed spelling errors when they were supposed to (both treatment groups) and did not notice spelling errors when they were not supposed to (control group). While this measurement is a noisy metric for true compliance, even if it were a perfect metric we would potentially have two-sided noncompliance (some Control subjects noticed errors, and some Treatment subjects did not notice errors). The CONSORT document shows our fraction of compliers and the number of prompts that fall within our compliance parameters. We address this further with an instrumental variables approach in the analysis.
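For reference, a minimal sketch of one way to run such an instrumental-variables (2SLS) estimate, using AER's ivreg, is shown below; the data frame `d` and the column names (noticed_errors as the noisy compliance indicator, Type as the randomized assignment) are assumptions, not the exact code from our analysis notebook.

  # Sketch only: the randomized assignment instruments the noisy "noticed errors"
  # compliance indicator for the Intelligence outcome.
  if (!require("pacman")) install.packages("pacman")
  pacman::p_load(AER)

  iv_mod <- ivreg(Intelligence ~ noticed_errors | Type, data = d)
  summary(iv_mod, diagnostics = TRUE)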

Exploratory Data Analysis

Below is a summary of relevant findings from the preliminary exploration of the live study data. Pilot data has not been included here for brevity. NOTE: This section is not an exhaustive EDA; rather, it focuses only on data anomalies and interesting findings.

Demographics

Demographic data was collected at the beginning of the omnibus study for all experimentation teams; as such, the questions asked are a composite of the questions formulated by the individual teams during research design. Unfortunately, some data elements were captured via free-form text entry rather than a radio button, drop-down, or other fixed selection mechanism, and as a result some of the data has erroneous entries or missing values. Below is a distribution of the demographic data present in the omnibus results files:

Distribution of Missing Demographic Data

           % Missing   Missing Values   Non-Null Values   Density
Year       0.69        11               1576              0.03
Gender     0.00        0                1587              0.17
English    0.00        0                1587              0.33
Race       0.00        0                1587              0.12
Country    1.83        29               1558              0.08
State      9.77        155              1432              0.04
Student    0.38        6                1581              0.20
Degree     0.00        0                1587              0.25

Demographic: YEAR

The YEAR variable was intended to capture the calendar year of the study participant's birth. Of the 265 study participants in the live study, only 255 had valid entries; their distribution is depicted below.

Ten of the entries in YEAR, which was presented to the participant as a free-form text box, are invalid as-is:

import pandas as pd

# df contains a pandas DataFrame with the live study survey data
df2 = df.copy()
df2.Year = df2.Year.fillna('<MISSING>')

# Keep one (participant, Year) pair per respondent, then tally the distinct
# Year values entered across participants
df2 = (df2.drop_duplicates(subset=['ROWID', 'Year'])
          .Year.value_counts(dropna=False)
          .rename_axis('year')
          .reset_index(name='count')
          .sort_values(by='year'))

strange_values = ['19996', '25', '26', '54', '<MISSING>',
                  'Los Angeles', 'Mumbai, India', 'US', '2020']
df2[df2.year.isin(strange_values)].year.unique()

Results:

array(['19996', '2020', '25', '26', '54', '<MISSING>', 'Los Angeles', 'Mumbai, India', 'US'], dtype=object)


Demographic: GENDER

GENDER data was provided by the study participant through a single-select radio button menu. In our data, Cisgender Women drastically outnumber all other genders combined; this may be the result of Cisgender Woman being the first option in the menu (and thus being left selected by 'default' when a participant did not change the option). The option Transgender Woman was presented to the study participants but was never selected.


Demographic: COUNTRY

The COUNTRY variable represents the birth country of the study participant. The vast majority of participants in our study were from the United States of America; however, we did have five missing values in the field, indicating the study participant either did not want to provide a country or elected not to enter one for another reason.


Demographic: STATE

A total of 25 unique U.S. States were represented in the demographic data identifying the birth state of those participants born in the United States. The vast majority of those who participated in the survey were from California; 26 participants did not specify a state during the demographic section of the survey. Fortunately, all 26 missing states are for participants with a missing or non-US country listed as their birth country.


Demographic: STUDENT

The majority of respondents (approximately 72%) were undergraduate students, with a small mix of Other, Graduate Student, Staff, and Faculty represented. Only one missing value for student status was found in the demographic portion of the survey data.


Survey Response Data

Findings from the live survey response data are outlined in this section. Analysis is generally performed by displaying the data collected through Amazon Mechanical Turk and the data collected through UC Berkeley XLab separately.

Descriptive Statistics

Below are descriptive statistics for the response data of interest. We present two tables showing the descriptive statistics associated with our Mechanical Turk and XLab datasets. Numbers in dark blue shading mark the highest values in each column, while lighter blue shading marks the next-highest values. The takeaway is that most users moved through our survey at the pace we expected, judging from the median values for our words-per-minute calculation and the time spent on our prompts: roughly 13 seconds on the social media posts and 24 seconds on the questions. In the XLab data, responses for all post-treatment questions (Interest through Meet) ranged over the entire 1-7 scale, indicating desirable variability in the responses, at least at the extremes. For the measurement of interest (Intelligence), the average is right around the middle of the scale at 3.96.
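As an illustration of the words-per-minute figure referenced above, a minimal sketch is shown below; it assumes PromptTime is recorded in seconds, that a per-response data.table `responses` exists, and that a per-post word count column (here called post_words) is available. All of these names are hypothetical.

  # Sketch only: implied reading speed per displayed post, assuming PromptTime is
  # measured in seconds and post_words (hypothetical) is the post's word count
  if (!require("pacman")) install.packages("pacman")
  pacman::p_load(data.table)

  responses[, wpm := post_words / (PromptTime / 60)]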

Descriptive Statistics : Amazon Mechanical Turk

               count     mean      std     min      25%      50%      75%      max
PromptTime    334.00    30.16    75.98    0.74     9.38    15.29    25.93   967.12
QuestionTime  334.00    43.90    74.77    5.86    19.31    28.61    46.65  1114.19
wpm           334.00   395.53   574.86    3.23   122.29   209.68   346.46  3913.04
Interest      334.00     3.97     1.64    1.00     3.00     4.00     5.00     7.00
Effective     334.00     4.60     1.41    1.00     4.00     5.00     6.00     7.00
Intelligence  334.00     4.55     1.40    1.00     4.00     5.00     6.00     7.00
Writing       334.00     4.33     1.50    1.00     3.25     4.00     5.00     7.00
Meet          334.00     3.68     1.79    1.00     2.00     4.00     5.00     7.00

Descriptive Statistics : Berkeley XLab

               count     mean      std     min      25%      50%      75%      max
PromptTime   1253.00    18.17    47.53    0.54     9.22    13.29    19.28  1570.79
QuestionTime 1253.00    30.20    51.57    7.81    18.05    23.56    31.38  1432.50
wpm          1253.00   347.49   428.14    1.83   165.30   241.83   356.52  5423.73
Interest     1253.00     2.90     1.68    1.00     1.00     3.00     4.00     7.00
Effective    1253.00     4.65     1.65    1.00     4.00     5.00     6.00     7.00
Intelligence 1253.00     3.96     1.47    1.00     3.00     4.00     5.00     7.00
Writing      1253.00     3.62     1.50    1.00     3.00     4.00     5.00     7.00
Meet         1253.00     2.64     1.57    1.00     1.00     2.00     4.00     7.00

Questionable Responses

There were a number of dubious responses in the data that indicated less-than-serious attention, from uniform Likert-scale choices to implausible response times. In this section, we highlight a few of the data issues that gave us pause as to how to proceed.

Variance in Likert Scale Responses

When survey takers are presented with a cluster of vertically-stacked Likert-scale questions, a common shortcut is to simply click straight down a column of the questions without considering their intent. In our study, we present a five-question cluster of this sort for each of the six social media posts. If a survey participant is fatigued or disinterested, they may select "all 1's" or "all 7's" as a way to complete their task as quickly as possible. Only three of our five questions are similar in scope, so the likelihood that a careful participant would, after due consideration, give identical ratings to every question in the cluster is relatively low.

Thus, we can use the signal from zero-variance responders, along with other information about response times and reading speed in words per minute, to handle potentially noisy and meaningless responses.
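A minimal sketch of how such a flag could be constructed is shown below; the data.table `responses` and its column names (ROWID, q_num, wpm, and the Likert columns from the descriptive statistics above) are assumptions, and the thresholds are illustrative rather than the values used in our analysis.

  # Sketch only: flag respondents who gave identical ratings to every Likert
  # question in every post cluster, or whose implied reading speed is implausible.
  if (!require("pacman")) install.packages("pacman")
  pacman::p_load(data.table)

  likert_cols <- c("Interest", "Effective", "Intelligence", "Writing", "Meet")

  # Per (participant, post): did they give the same rating to all five questions?
  per_post <- responses[, .(zero_var = uniqueN(unlist(.SD)) == 1),
                        by = .(ROWID, q_num), .SDcols = likert_cols]

  flags <- per_post[, .(all_zero_variance = all(zero_var)), by = ROWID]

  # Combine with a reading-speed signal (the 1000 wpm threshold is illustrative)
  speed <- responses[, .(median_wpm = median(wpm)), by = ROWID]
  flags <- merge(flags, speed, by = "ROWID")
  flags[, suspect := all_zero_variance | median_wpm > 1000]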