Datasets are essential for data analysis and statistical modeling in R. They come in various forms: built-in datasets, external datasets loaded from files, and datasets created programmatically.
Let me explain in detail the different types of datasets and how to work with them in R.
1. Built-in Datasets
R comes with several built-in datasets that are often used for practice and demonstration purposes.
Some commonly used built-in datasets are:
- iris – A classic dataset containing measurements of iris flowers.
- mtcars – A dataset with various attributes of different car models.
- airquality – A dataset containing daily air quality measurements.
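You can load a built-in dataset by passing its name to `data()`. Datasets from the default `datasets` package are also available directly by name, so the `data()` call is optional for them.
Here is an example:

    # Load the iris dataset
    data(iris)

    # Display the first few rows of the iris dataset
    head(iris)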
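Before working with a built-in dataset, it helps to check its size, column types, and missing values. The following is a minimal sketch using only base R functions; the variable names `dataset_names` and `na_counts` are chosen here for illustration.

```r
# Names of the datasets shipped with the default 'datasets' package
dataset_names <- data(package = "datasets")$results[, "Item"]
head(dataset_names)

# Dimensions as c(rows, columns)
dim(iris)      # 150 rows, 5 columns
dim(mtcars)    # 32 rows, 11 columns

# Column names and types of airquality
str(airquality)

# airquality contains missing values; count them per column
na_counts <- colSums(is.na(airquality))
na_counts
```

Running `?iris` (or `help(iris)`) in an interactive session opens the documentation page that describes each column.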
2. External Datasets
You can import datasets from external files like CSV, Excel, or databases.
Base R provides functions like `read.csv()`, `read.table()`, `read.csv2()`, and `read.delim()` to import delimited text files. Excel files need an add-on package such as `readxl`.
Let’s see how to use these functions.

    # Import a CSV file
    my_data <- read.csv("my_data.csv")

    # Import an Excel file (requires the readxl package)
    library(readxl)
    my_data <- read_excel("my_data.xlsx")

    # Import a tab-delimited file
    my_data <- read.delim("my_data.txt", header = TRUE, sep = "\t")
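The file names above are placeholders, so those lines only run if such files exist in your working directory. As a self-contained sketch, the round trip below writes a small, made-up data frame to a temporary CSV file and reads it back. The names `example_df`, `loaded`, and `loaded_tbl` are illustrative.

```r
# Write a small example data frame to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
example_df <- data.frame(ID = 1:3, Name = c("John", "Alice", "Bob"))
write.csv(example_df, tmp, row.names = FALSE)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
loaded <- read.csv(tmp, stringsAsFactors = FALSE)
str(loaded)

# The same file via read.table(), which needs header and sep spelled out
loaded_tbl <- read.table(tmp, header = TRUE, sep = ",",
                         stringsAsFactors = FALSE)

# Remove the temporary file
unlink(tmp)
```

`read.csv()` is a wrapper around `read.table()` with defaults suited to comma-separated files, which is why both calls return the same data here.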
3. Creating Datasets Programmatically
You can also create datasets programmatically using functions like `data.frame()` or by combining existing datasets.

    # Create a data frame with three columns
    my_dataframe <- data.frame(
      ID = c(1, 2, 3),
      Name = c("John", "Alice", "Bob"),
      Age = c(25, 30, 35)
    )
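Combining existing datasets can be sketched with `rbind()` (stack rows) and `cbind()` (add columns) from base R. The data frames and the `City` values below are invented for illustration.

```r
# Two small data frames with the same columns
first_batch <- data.frame(ID = c(1, 2), Name = c("John", "Alice"), Age = c(25, 30))
second_batch <- data.frame(ID = 3, Name = "Bob", Age = 35)

# Stack rows: both data frames must have the same column names
people <- rbind(first_batch, second_batch)

# Add a column: the new vector must supply one value per row
people <- cbind(people, City = c("Paris", "Rome", "Oslo"))

nrow(people)
names(people)
```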
4. Manipulating Datasets
Once you have a dataset loaded, you can perform various operations on it, including:
- Subsetting – Extracting specific rows or columns.
- Filtering – Selecting rows based on certain conditions.
- Aggregation – Computing summary statistics.
- Joining – Combining multiple datasets based on common keys.
The following code demonstrates each of these operations.

    # Subsetting: first five rows of iris
    subset_iris <- iris[1:5, ]

    # Filtering: cars with more than 20 miles per gallon
    filter_mtcars <- mtcars[mtcars$mpg > 20, ]

    # Aggregation: summary statistics for every column
    summary(iris)

    # Joining: dataset1 and dataset2 are placeholders for two
    # data frames that share an "ID" column
    merged_data <- merge(dataset1, dataset2, by = "ID")
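Because `dataset1` and `dataset2` are not defined, the join above cannot run as written. The sketch below uses two made-up data frames, `customers` and `orders`, to show how `merge()` behaves with and without unmatched keys, and adds a group-wise aggregation on `mtcars`.

```r
# Hypothetical data frames sharing an "ID" column
customers <- data.frame(ID = c(1, 2, 3), Name = c("John", "Alice", "Bob"))
orders <- data.frame(ID = c(1, 1, 3, 4), Total = c(10, 20, 5, 7))

# Inner join (default): only IDs present in both data frames
inner <- merge(customers, orders, by = "ID")

# Left join: keep every customer; customers without orders get NA
left <- merge(customers, orders, by = "ID", all.x = TRUE)

# Group-wise aggregation: mean miles per gallon per cylinder count
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```

Here `inner` has three rows (two orders for ID 1, one for ID 3), while `left` has four because Alice is kept with a missing `Total`.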
Example
Let’s do a quick analysis on the iris dataset.

    # Load the iris dataset
    data(iris)

    # Summary statistics
    summary(iris)

    # Scatter plot of petal length vs. petal width, colored by species
    plot(iris$Petal.Length, iris$Petal.Width,
         col = iris$Species, pch = 19,
         xlab = "Petal Length", ylab = "Petal Width",
         main = "Iris Dataset")

    # Boxplot of sepal length for each species
    boxplot(Sepal.Length ~ Species, data = iris,
            main = "Sepal Length by Species")
The code above loads the iris dataset, displays summary statistics, creates a scatter plot of petal length against petal width colored by species, and draws a boxplot of sepal length by species.
Datasets are fundamental to data analysis in R. Whether a dataset is built-in, imported, or created programmatically, understanding how to work with it is essential for any data analysis task.
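The patterns visible in the plots can also be checked numerically. As a small sketch using base R only (the names `species_means` and `petal_cor` are illustrative), the code below computes per-species means of the petal measurements and the correlation between them.

```r
# Mean petal length and width for each species
species_means <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                           data = iris, FUN = mean)
species_means

# Correlation between petal length and petal width across all flowers
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor
```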
Example Dataset of Customer Transactions for a Retail Store
Let’s consider a real-world use case: a dataset of customer transactions for a retail store. The code below performs several operations on the dataset to analyze customer behavior and generate insights, which ties together the concepts covered so far. It assumes that transactions.csv has Category, Amount, CustomerID, and Date columns and that customer_data.csv has CustomerID and Gender columns.

    # arrange() and desc() come from the dplyr package
    library(dplyr)

    # Load the dataset
    transactions <- read.csv("transactions.csv")

    # Display the structure of the dataset
    str(transactions)

    # Summary statistics
    summary(transactions)

    # Filter transactions for a specific product category
    electronics_transactions <- subset(transactions, Category == "Electronics")

    # Compute total sales for each product category
    sales_by_category <- aggregate(Amount ~ Category, data = transactions, FUN = sum)

    # Identify top-selling products (here: the 10 largest transactions)
    top_products <- head(arrange(transactions, desc(Amount)), n = 10)

    # Merge with customer data to analyze demographics
    customer_data <- read.csv("customer_data.csv")
    merged_data <- merge(transactions, customer_data, by = "CustomerID")

    # Compute average transaction amount by gender
    avg_transaction_by_gender <- aggregate(Amount ~ Gender, data = merged_data, FUN = mean)

    # Visualize transaction distribution
    hist(transactions$Amount,
         main = "Transaction Amount Distribution", xlab = "Amount")

    # Plot the number of transactions per day over time
    transactions$Date <- as.Date(transactions$Date)
    daily_counts <- table(transactions$Date)
    plot(as.Date(names(daily_counts)), as.vector(daily_counts), type = "l",
         main = "Transaction Count Over Time",
         xlab = "Date", ylab = "Transaction Count")

Here is a summary of what the code does:
- Start by loading the transaction data from a CSV file.
- Inspect the structure of the dataset using `str()` and generate summary statistics using `summary()`.
- Filter transactions for a specific product category (in this case, “Electronics”) using `subset()`.
- Compute total sales for each product category using `aggregate()`.
- Identify top-selling products, approximated here by the ten largest transactions, by sorting the dataset on transaction amount using `arrange()` and `desc()` from the `dplyr` package.
- Merge transaction data with customer data based on the common column “CustomerID” using `merge()`.
- Compute the average transaction amount by gender using `aggregate()`.
- Visualize the distribution of transaction amounts using a histogram and plot the transaction count over time using a time series plot.
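The steps above depend on transactions.csv and customer_data.csv, which are not provided here. To try the same steps without those files, the sketch below builds two small stand-in data frames. All values are invented, the column names are taken from the code above, and base R's `order()` is used in place of `dplyr::arrange()` so that no extra package is needed.

```r
# Hypothetical stand-in for transactions.csv
transactions <- data.frame(
  CustomerID = c(1, 2, 1, 3, 2),
  Category   = c("Electronics", "Grocery", "Grocery",
                 "Electronics", "Electronics"),
  Amount     = c(200, 15, 30, 120, 80),
  Date       = c("2024-01-01", "2024-01-01", "2024-01-02",
                 "2024-01-04", "2024-01-04")
)

# Hypothetical stand-in for customer_data.csv
customer_data <- data.frame(CustomerID = c(1, 2, 3),
                            Gender = c("F", "M", "F"))

# Total sales per category
sales_by_category <- aggregate(Amount ~ Category, data = transactions, FUN = sum)

# Three largest transactions (base R alternative to arrange(desc(Amount)))
top_transactions <- head(transactions[order(-transactions$Amount), ], n = 3)

# Join with customer data and average the amount by gender
merged_data <- merge(transactions, customer_data, by = "CustomerID")
avg_by_gender <- aggregate(Amount ~ Gender, data = merged_data, FUN = mean)

# Number of transactions per day
transactions$Date <- as.Date(transactions$Date)
daily_counts <- table(transactions$Date)

sales_by_category
avg_by_gender
daily_counts
```

With this toy data, Electronics totals 400 and Grocery totals 45. Note that 2024-01-03 does not appear in `daily_counts`, because `table()` only counts dates that occur in the data.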
Once you learn the basic R functions and libraries for manipulating, analyzing, and visualizing datasets, you will be well prepared for most R-related work and projects.