R is a programming language and environment designed for statistical computing, data analysis, and graphical visualization. It originated as an open-source implementation of the S language (from Bell Labs) and has grown into a rich ecosystem used by statisticians, data scientists, bioinformaticians, and researchers across many fields. R emphasizes expressiveness for data manipulation and modeling, with first-class support for vectors, matrices, data frames, and statistical functions.
Historically, R was created in the mid-1990s by Ross Ihaka and Robert Gentleman at the University of Auckland. It became widely adopted because it combined a powerful statistical toolset with an open-source license and an extensible package system, with packages distributed through CRAN. Over time, the community has built thousands of packages that extend R’s capabilities into nearly every domain that requires data analysis.
Core language features include vectorized operations (so you operate on whole vectors without explicit loops), a comprehensive standard library for statistics (regression, time series, clustering, hypothesis testing, ANOVA, etc.), and a flexible formula syntax used by many modeling functions (e.g., y ~ x1 + x2). R supports multiple programming paradigms: functional programming (first-class functions, closures), procedural code, and two main object systems (S3 and S4), with newer systems like R6 for reference-style objects.
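As a rough illustration of the features named above, the following base-R sketch shows vectorized arithmetic, the formula syntax, a closure, and S3 dispatch. The names make_counter, describe, and the "circle" class are hypothetical and exist only for this example; mtcars is a dataset that ships with R.

```r
# Vectorized arithmetic: the operation applies element-wise, no loop needed.
x <- c(1, 2, 3, 4)
y <- x * 2 + 1                 # 3 5 7 9
above_mean <- x[x > mean(x)]   # logical indexing keeps 3 and 4

# Formula syntax: model mpg as a function of weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- coef(fit)             # named vector: (Intercept), wt, hp

# Closure: the returned function keeps access to `count` between calls.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1        # modify `count` in the enclosing environment
    count
  }
}
counter <- make_counter()
counter()
counter()                      # returns 2 on the second call

# S3: a generic function dispatches on the object's class attribute.
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "a generic object"
describe.circle <- function(obj, ...) paste("circle with radius", obj$r)

shape <- structure(list(r = 2), class = "circle")
describe(shape)                # "circle with radius 2"
```

S4 and R6 follow the same idea of attaching behavior to classes but with formal class definitions; they are left out here to keep the sketch short.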
Data structures you’ll use constantly are vectors, matrices, lists, and data frames (and the newer tibble from the tidyverse). Data frames are the canonical tabular structure and integrate with modeling and plotting functions. R’s indexing is 1-based (first element has index 1), which is important to remember if you come from 0-based languages.
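The structures and the 1-based indexing described above can be sketched in base R as follows. The variable names and values are made up for illustration.

```r
# Vector: indexing starts at 1, and a negative index drops an element.
v <- c(10, 20, 30)
v[1]                           # 10
v[-1]                          # 20 30

# Matrix: filled column by column, indexed as [row, column].
m <- matrix(1:6, nrow = 2)
m[2, 3]                        # 6

# List: elements can have different types and are accessed by name.
lst <- list(name = "a", values = v)
lst$values[2]                  # 20
lst[["name"]]                  # "a"

# Data frame: a table whose columns are equal-length vectors.
df <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.0))
high <- df[df$score > 3, ]     # keep rows where score exceeds 3
nrow(high)                     # 2
```

Note that v[0] does not raise an error in R; it returns an empty vector, which is a common source of silent bugs for people used to 0-based languages.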
The R ecosystem is a major strength. CRAN (Comprehensive R Archive Network) hosts thousands of packages for modeling, visualization, reporting, and more. The tidyverse (packages like dplyr, ggplot2, tidyr, readr) provides a modern, consistent grammar for data manipulation and plotting and is hugely popular for data science workflows. For bioinformatics and genomics, Bioconductor is the community-driven collection of packages tailored to high-throughput biological data analysis. R integrates well with other languages and tools: calling C/C++ for speed, using reticulate to interoperate with Python, producing reproducible reports with R Markdown, and building interactive web apps with Shiny.
Typical workflows include loading and cleaning data, exploratory data analysis (summary stats and visualizations), statistical modeling, and communicating results (R Markdown, HTML, PDF, dashboards).
Example
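library(dplyr)    # data manipulation verbs and the %>% pipe
library(ggplot2)  # plotting

# Read the raw data; assumes columns named `category` and `value`
df <- read.csv("data.csv")

# Mean of `value` and row count for each category
summary_tbl <- df %>%
  group_by(category) %>%
  summarize(mean_value = mean(value, na.rm = TRUE),
            n = n())

# Bar chart of the per-category means
ggplot(summary_tbl, aes(x = category, y = mean_value)) +
  geom_col() +
  labs(title = "Mean value by category")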
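
The example above covers loading, summarizing, and plotting. The statistical modeling step of the workflow might look like the sketch below, which uses only base R and the built-in mtcars dataset rather than data.csv, so the variables here (mpg, am, cyl) are stand-ins and not part of the original example.

```r
# Two-sample (Welch) t-test: does mpg differ between automatic (am = 0)
# and manual (am = 1) transmissions?
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value                     # p-value of the test

# One-way ANOVA: does mean mpg differ across cylinder counts?
# factor() makes cyl a grouping variable instead of a numeric predictor.
aov_fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(aov_fit)               # prints the ANOVA table
```

Both calls use the same formula interface as regression functions, which is why the response goes on the left of ~ and the grouping variable on the right.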