Classical statistical analysis includes many commonly used statistical methods and techniques for data description, inference, and modeling. This section describes the implementation of classical statistical analysis methods (including t-test, ANOVA, chi-square test, and linear regression) in the R language.
5.1.1 t-test
The sample mean, also known as the sample average, is an important statistic that describes the central tendency of a dataset. It is calculated by summing all the values in the dataset and dividing by the number of observations. In statistics, the variance is the average of the squared differences between each sample value and the sample mean. The variance measures the dispersion of the data, i.e., the spread of the data points relative to the mean. The standard deviation is the square root of the variance and indicates the average degree of deviation of the data from the mean; it is also an important measure of the dispersion of the data.
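As a quick illustration before turning to the t-test, these three statistics can be computed directly in R with mean(), var(), and sd(); note that var() and sd() use the sample (n - 1) denominator. The data below are made up purely for illustration:

x <- c(4, 8, 6, 5, 3)  # illustrative data
mean(x)  # sample mean: 5.2
var(x)   # sample variance, computed with the n - 1 denominator: 3.7
sd(x)    # standard deviation, the square root of the variance: about 1.92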
The t-test is a statistical method used to compare two sample means to determine if they are significantly different. It is based on the sample mean and the sample standard deviation, and determines whether two sample means are significantly different by calculating the t-value. t-tests are used for the comparison of independent or paired samples of a continuous variable, and are widely used in the field of medicine. t-tests can be of two types: one-sample t-tests and two-sample t-tests.
1. One-sample t-test
The purpose of a one-sample t-test is to test whether the mean of a single sample is equal to the mean of a known population.
Examples of application scenarios: randomly select a number of healthy men in a remote area and test whether their average pulse rate is higher than the average of all healthy men; test whether the vision level of all senior high school students in a first-tier city is lower than that of senior high school students nationwide.
In R, you can use the t.test() function to perform a one-sample t-test. The specific steps are as follows:
Set the null hypothesis (H0): the null hypothesis is usually that the sample mean is equal to the population mean.
Collect sample data.
Run the one-sample t-test: call t.test(), specifying the sample data and the population mean; the output includes the t-value, p-value, and confidence interval.
2. Two-sample t-test
The purpose of the two-sample t-test is to test whether the means of two independent samples are significantly different, requiring the two samples to be independent and to follow a normal or approximately normal distribution.
Examples of application scenarios: to study the efficacy of a new drug for treating anemia in children, the conventional drug is used as a control, and after a period of treatment it is examined whether the increase in hemoglobin in children given the new drug is greater than in those given the conventional drug; or, to examine the effectiveness of two drugs for treating high blood pressure, it is tested whether there is a significant difference in the blood-pressure reduction achieved by the two drug groups.
In R, you can also use the t.test() function to perform a two-sample t-test. The specific steps are as follows:
Set the null hypothesis (H0): the null hypothesis is usually that the means of the two samples are equal.
Collect data from two independent samples.
Run the two-sample t-test: call t.test() with the two samples of data; the output includes the t-value, p-value, and confidence interval.
When performing a t-test, the key outputs include the t-value, degrees of freedom, p-value, and confidence interval. The p-value is typically used to judge the significance of the test result: if it is less than the significance level (usually 0.05), the null hypothesis can be rejected and the two sample means can be considered significantly different. Sample R code for performing t-tests is shown below:
# One-sample t-test
set.seed(123)
blood_pressure <- rnorm(30, mean = 120, sd = 10)  # create data blood_pressure
blood_pressure
# Perform a one-sample t-test comparing the mean of the sample
# blood_pressure to the hypothesized population mean mu = 130
t_test_single_blood_pressure <- t.test(blood_pressure, mu = 130)
# Output one-sample t-test results
print(t_test_single_blood_pressure)

# Two-sample t-test
set.seed(123)
baseline_blood_sugar <- rnorm(20, mean = 100, sd = 15)  # create data baseline_blood_sugar
endline_blood_sugar <- rnorm(20, mean = 120, sd = 15)   # create data endline_blood_sugar
# Perform a two-sample t-test to compare the means of the samples
# baseline_blood_sugar and endline_blood_sugar
t_test_double_blood_sugar <- t.test(baseline_blood_sugar, endline_blood_sugar)
# Output two-sample t-test results
print(t_test_double_blood_sugar)
The results of the run are shown in Figure 5-1, which presents the output of the one-sample and two-sample t-tests. The p-values are 2.431e-06 and 0.0003, respectively, both less than 0.05. The results suggest that the mean of blood_pressure differs significantly from the hypothesized population mean of 130, and that the difference between the means of baseline_blood_sugar and endline_blood_sugar is also statistically significant.
Figure 5-1
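The t.test() result can also be inspected programmatically: it is returned as an "htest" list object whose components can be accessed by name. A minimal sketch, reusing the t_test_single_blood_pressure object created above:

t_test_single_blood_pressure$statistic  # the t-value
t_test_single_blood_pressure$p.value    # the p-value
t_test_single_blood_pressure$conf.int   # the confidence interval
t_test_single_blood_pressure$estimate   # the sample mean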
Overall, the t-test is a commonly used hypothesis test for comparing differences between sample means. In R, the t.test() function makes it very convenient to carry out significance tests quickly and accurately.
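For the paired designs mentioned at the beginning of this subsection, t.test() accepts a paired = TRUE argument. A minimal sketch with hypothetical before/after measurements on the same patients (the data are simulated, not taken from the examples above):

# Hypothetical paired design: the same 20 patients measured before and after treatment
set.seed(123)
before <- rnorm(20, mean = 100, sd = 15)        # simulated pre-treatment values
after  <- before + rnorm(20, mean = 5, sd = 5)  # simulated post-treatment values
t.test(before, after, paired = TRUE)            # paired t-test on the differences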
5.1.2 Analysis of variance
Analysis of variance (ANOVA) was pioneered by the British statistician R. A. Fisher; its test statistic, the F-statistic, is named in his honor, so ANOVA is also known as the "F-test". It is used to test the significance of differences among the means of two or more samples.
The most frequently used type of ANOVA is the one-way ANOVA. It is used to analyze the relationship between one categorical (grouping) variable, usually with multiple categories, and one quantitative variable. For example, when analyzing differences in academic performance among different classes (Class 1, Class 2, and Class 3), a one-way ANOVA can compare the mean performance of the three classes (an independent-samples t-test can only compare the means of two groups).
In R, a one-way ANOVA can be performed using the aov function in the stats package or the Anova function in the car package. The specific steps are as follows:
Set the null hypothesis (H0): the null hypothesis is usually that the means of all groups are equal.
Collect sample data from each group.
Run the one-way ANOVA: use the aov function, specifying the dependent variable and the grouping variable, and use the summary function to obtain the ANOVA results, including the F-statistic, degrees of freedom, and p-value.
In an ANOVA, the key outputs are the F-value and the p-value. The F-value is used to judge whether the differences between group means are significant; when the p-value is less than the significance level (usually 0.05), the null hypothesis can be rejected and the group means can be considered significantly different. Sample R code for performing an ANOVA is shown below:
# Create a data frame (containing 3 treatment groups)
data <- data.frame(
  RecoveryTime = c(10, 12, 8, 9, 11, 14, 13, 7, 6, 5, 11),  # recovery times
  Treatment = factor(c("TreatmentA", "TreatmentB", "TreatmentC",
                       "TreatmentA", "TreatmentB", "TreatmentC",
                       "TreatmentA", "TreatmentB", "TreatmentC",
                       "TreatmentA", "TreatmentB"))          # treatments
)
# Perform an analysis of variance (ANOVA)
aov_result <- aov(RecoveryTime ~ Treatment, data = data)
summary(aov_result)
The results of the run are shown in Figure 5-2, which presents the ANOVA results calculated by summary(aov(...)): an F-value of 0.116 and a p-value of 0.892. Since the p-value is greater than 0.05, the treatment method did not have a significant effect on recovery time.
Figure 5-2
Overall, ANOVA is a commonly used statistical method for testing whether there are significant differences between group means. In R, the aov function and the Anova function make it very convenient to perform an ANOVA and to compare differences between groups quickly and accurately.
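As an alternative to summary(aov(...)), the Anova() function from the car package mentioned above can be applied to the same fitted model. A minimal sketch, assuming the car package is installed:

# install.packages("car")  # run once if the package is not yet installed
library(car)
Anova(aov_result, type = "II")  # ANOVA table for the model fitted above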
5.1.3 Chi-square test
The chi-square test is a statistical method used to test whether there is an association between two or more categorical variables. It is used to analyze categorical data and is usually applied to test the significance of the difference between observed frequencies and expected frequencies. A frequency, also known as a "count", is the number of times a particular value of a variable occurs. The chi-square test can be performed in R using the base chisq.test() function. The following is a detailed description of the chi-square test.
In a one-sample chi-square test (a goodness-of-fit test), we test whether the frequencies of a categorical variable conform to some theoretical distribution. The null hypothesis is that the observed frequencies follow the expected distribution. In R, the chi-square statistic and p-value can be obtained with chisq.test(), specifying the observed frequencies and the theoretically expected proportions, as shown in the sketch below.
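A minimal sketch of such a goodness-of-fit test with chisq.test(); the counts and theoretical proportions here are hypothetical, chosen only for illustration:

# Do the observed counts of four categories follow the theoretical
# proportions 0.4, 0.3, 0.2, 0.1?
observed <- c(28, 16, 10, 6)  # hypothetical observed frequencies (n = 60)
chisq.test(observed, p = c(0.4, 0.3, 0.2, 0.1))  # compare with the expected distribution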
In a chi-square test, the key outputs include the chi-square statistic, degrees of freedom, and the p-value. The p-value is usually used to judge the significance of the test result: if it is less than the significance level (usually 0.05), the null hypothesis can be rejected, meaning that there is a significant difference between the observed and expected frequencies, or that there is a significant association between the two categorical variables.
The sample R code to perform the chi-square test is as follows:
# Create a data frame
data <- data.frame(
  PatientID = 1:80,
  Treatment = c(rep("TreatmentA", 40), rep("TreatmentB", 40)),
  CureStatus = c(rep("Cured", 30), rep("Cured", 30), rep("Not Cured", 20))
)
# Create a contingency table
table_data <- table(data$Treatment, data$CureStatus)
# Perform the chi-square test
chi_result <- chisq.test(table_data)
# Output results
print(chi_result)
The results of the code run are shown in Figure 5-3, which presents the output of the chi-square test. The chi-square statistic (X-squared) calculated by chisq.test() is 24.067 and the p-value is 9.306e-07. Since the p-value is less than 0.05, there is a significant association between the treatment and the cure outcome.
Figure 5-3
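The chisq.test() result is also an "htest" list object, so the quantities behind Figure 5-3 can be inspected directly, for example to check the expected frequencies on which the test is based. A minimal sketch, reusing the chi_result object created above:

chi_result$observed   # the observed contingency table
chi_result$expected   # the expected frequencies under independence
chi_result$residuals  # the Pearson residuals, (observed - expected) / sqrt(expected)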
Overall, the chi-square test is a commonly used statistical method for testing the association between categorical variables and can help researchers make sound inferences in data analysis. In R, the chisq.test() function makes it very convenient to perform such tests quickly and accurately.
5.1.4 Simple linear regression analysis
Simple linear regression analysis is a frequently used statistical method for establishing and evaluating the linear relationship between an independent variable and a continuous dependent variable. It describes the relationship between the two variables by fitting a straight line, which can then be used to predict and explain changes in the dependent variable. Here is a detailed description of simple linear regression analysis:
The simple linear regression model can be expressed as:
Y = β0 + β1X + ε
where Y is the dependent variable (response variable), X is the independent variable, β0 is the intercept term, β1 is the slope, and ε is the error term (random error).
The goal of regression analysis is to fit an optimal straight line describing the relationship between the independent and dependent variables by estimating the intercept and slope. The least squares method is usually used to estimate the regression coefficients, so that the sum of squared residuals between the predicted and actual values is minimized. A simple linear regression model can be used to predict the value of the dependent variable or to infer the strength of the relationship between the independent and dependent variables.
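The least-squares estimates can also be computed directly from their formulas, β1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and β0 = ȳ - β1x̄. A minimal sketch using the same weight and blood-pressure values as the lm() example below:

x <- c(61, 75, 83, 92, 100)      # weight in kilograms
y <- c(110, 123, 130, 150, 160)  # blood pressure in mmHg
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
c(intercept = b0, slope = b1)    # matches the coefficients reported by lm()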
In the R language, you can use the lm() function to perform simple linear regression analysis. Specific steps include:
Collect data on the independent and dependent variables.
Run the lm() function, specify the dependent and independent variables, and derive the regression coefficients, fitted straight line, and related statistics.
The summary() function is used to obtain summary statistics of the regression model, including intercepts, slopes, R-squared values, t-values, and p-values.
Simple linear regression analysis can help researchers understand and predict relationships between variables, and it provides a simple yet effective tool to interpret and explore patterns and regularities behind data. The sample R code to perform simple linear regression analysis is shown below:
# Create a data frame
data <- data.frame(
  Weight = c(61, 75, 83, 92, 100),            # weight in kilograms
  BloodPressure = c(110, 123, 130, 150, 160)  # blood pressure in mmHg
)
# Perform a simple linear regression analysis
model <- lm(BloodPressure ~ Weight, data = data)
# View the model output
summary(model)
The "BloodPressure ~ Weight" in the lm() function indicates that a model is being fitted, where BloodPressure is the dependent variable and Weight is the independent variable. " specifies the data frame that contains the data.The Summary() function will provide a detailed summary of the model, including regression coefficients, intercepts, R-squared values, F-statistics, and p-values. These statistics can help us understand how well the model fits and whether the linear relationship between the variables is significant. The results of the code run are shown in Figure 5-4, which demonstrates the results of a simple linear regression.
Figure 5-4
As can be seen in Figure 5-4, the intercept (Intercept) is 26.3778 and the slope (the coefficient of Weight) is 1.3166, which indicates that for every 1 kg increase in body weight, blood pressure increases by an average of 1.3166 mmHg. The p-value of the Weight coefficient is less than 0.05, indicating a significant linear relationship between body weight and blood pressure, and the R-squared value (Multiple R-squared) is 0.965, indicating that the model explains 96.5% of the variation in blood pressure.
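As noted above, a fitted simple linear regression model can also be used for prediction via predict(); the new weight value here is hypothetical, chosen only for illustration:

# Predict blood pressure for a hypothetical 85 kg patient
predict(model, newdata = data.frame(Weight = 85))
# The same prediction, with a prediction interval
predict(model, newdata = data.frame(Weight = 85), interval = "prediction")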