SlimR is an R package for annotating single-cell and spatial-transcriptomics (ST) datasets. It builds a unified marker list, Markers_list, from any of the following sources: the package's built-in curated, species-specific cell type and marker reference databases (e.g., 'Cellmarker2', 'PanglaoDB', 'scIBD', 'TCellSI'), Seurat objects containing cell label information, or user-provided Excel tables mapping cell types to markers.
SlimR can estimate calculation parameters with machine learning algorithms (e.g., 'Random Forest', 'Gradient Boosting', 'Support Vector Machine', 'Ensemble Learning'). Based on Markers_list, Celltype_Calculate() calculates the gene expression of the different cell types, predicts annotation information, and calculates the corresponding AUC; Celltype_Annotation() then annotates the Seurat object, and Celltype_Verification() verifies the result. SlimR can also calculate the gene expression corresponding to each cell type to generate reference maps for manual annotation (e.g., 'Heat Map', 'Feature Plots', 'Combined Plots').
Install the stable version of SlimR directly from CRAN (recommended when the CRAN version matches the GitHub version):
install.packages("SlimR")
Note: If you encounter a version mismatch during installation, try switching the CRAN mirror to Global (CDN) or use BiocManager::install("SlimR").
Install the development version of SlimR directly from GitHub (recommended when the GitHub version is higher than the CRAN version):
devtools::install_github("Zhaoqing-wang/SlimR")
Note: If the function doesn't work, please run install.packages('devtools') first.
Load the package in your R environment:
library(SlimR)
For Seurat objects with multiple layers in the assay, please run SeuratObject::JoinLayers() first.
# For example, if you want to use the 'RNA' assay of a multi-layered Seurat object.
sce@assays$RNA <- SeuratObject::JoinLayers(sce@assays$RNA)
Important: To ensure accurate annotation, make sure that the input Seurat object has gone through the standard processing workflow and that batch effects have been removed.
Note: It is recommended to use the clustree package to determine an appropriate clustering resolution for the input Seurat object.
SlimR requires R (≥ 3.5) and depends on the following packages: cowplot, dplyr, ggplot2, patchwork, pheatmap, readxl, scales, Seurat, tidyr, tools. If installation fails, please install the missing dependencies using:
# Install dependencies if needed ('tools' ships with base R and does not need to be installed):
install.packages(c("cowplot", "dplyr", "ggplot2", "patchwork",
                   "pheatmap", "readxl", "scales", "Seurat",
                   "tidyr"))
SlimR requires a standardized list format for storing marker information, optional metrics, and the corresponding cell types: list names = cell types (necessary), first column = markers (necessary), subsequent columns = metrics (can be omitted).
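To make the format concrete, here is a minimal hand-built sketch. It assumes the list elements are plain data frames; the object name Markers_list_manual and the metric values are hypothetical and only illustrate the layout described above.

```r
# Hypothetical hand-built marker list: one data frame per cell type.
# The list names are the cell types.
Markers_list_manual <- list(
  "T cell" = data.frame(
    marker = c("CD3E", "CD4", "CD8A"),  # first column: markers (necessary)
    logFC  = c(2.1, 1.4, 1.8),          # later columns: metrics (made-up values)
    stringsAsFactors = FALSE
  ),
  "B cell" = data.frame(
    marker = c("CD79A", "MS4A1"),       # metrics omitted for this cell type
    stringsAsFactors = FALSE
  )
)

names(Markers_list_manual)            # the cell types
Markers_list_manual[["T cell"]][[1]]  # the markers of one cell type
```

Whether SlimR accepts a list built this way without further checks is not confirmed here; the built-in readers described in the following sections are the documented route.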
Cellmarker2: A database of cell types and markers covering different species and tissue types.
Reference: Hu et al. (2023) doi:10.1093/nar/gkac947.
Cellmarker2 <- SlimR::Cellmarker2
Cellmarker2_table <- SlimR::Cellmarker2_table
View(Cellmarker2_table)
Markers_list:
Markers_list_Cellmarker2 <- Markers_filter_Cellmarker2(
  Cellmarker2,
  species = "Human",
  tissue_class = "Intestine",
  tissue_type = NULL,
  cancer_type = NULL,
  cell_type = NULL
)
Important: Specify at least the species and tissue_class parameters to ensure accurate annotation.
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.1. Continue to section 3 for the automated annotation workflow.
PanglaoDB: A database of cell types and markers covering different species and tissue types.
Reference: Franzén et al. (2019) doi:10.1093/database/baz046.
PanglaoDB <- SlimR::PanglaoDB
PanglaoDB_table <- SlimR::PanglaoDB_table
View(PanglaoDB_table)
Markers_list:
Markers_list_panglaoDB <- Markers_filter_PanglaoDB(
  PanglaoDB,
  species_input = 'Human',
  organ_input = 'GI tract'
)
Important: Specify the species_input and organ_input parameters to ensure accurate annotation.
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.2. Continue to section 3 for the automated annotation workflow.
scIBD: A database of human intestine markers.
Reference: Nie et al. (2023) doi:10.1038/s43588-023-00464-9.
Markers_list_scIBD <- SlimR::Markers_list_scIBD
Important: This is for human intestinal annotation only. Make sure the input Seurat object contains human intestinal tissue to ensure accurate annotation.
Note: Markers_list_scIBD was generated following section 2.5.2 with the parameters sort_by = "logFC" and gene_filter = 20.
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.3. Continue to section 3 for the automated annotation workflow.
TCellSI: A database of T cell markers for different subtypes.
Reference: Yang et al. (2024) doi:10.1002/imt2.231.
Markers_list_TCellSI <- SlimR::Markers_list_TCellSI
Important: This is for T cell subset annotation only. Make sure the input Seurat object contains T cells to ensure accurate annotation.
Note: Markers_list_TCellSI was generated following section 2.6.
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.4. Continue to section 3 for the automated annotation workflow.
Markers_list:
The standard Markers_list can be generated by the built-in Read_seurat_markers() function after obtaining markers with the Seurat::FindAllMarkers() function.
seurat_markers <- Seurat::FindAllMarkers(
  object = sce,
  group.by = "Cell_type",
  only.pos = TRUE)
Markers_list_Seurat <- Read_seurat_markers(seurat_markers,
  sources = "Seurat",
  sort_by = "FSS",
  gene_filter = 20
)
Note: It is recommended to set sort_by = "FSS" to rank markers by the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio), or to set sort_by = "avg_log2FC" to rank by fold change.
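As an illustration of the FSS ranking described in the note, the sketch below computes the product on a toy marker table. It is not SlimR's implementation: the table values are made up, and taking pct.1 as the "expression ratio" is an assumption.

```r
# Toy marker table using column names that Seurat::FindAllMarkers() returns
# (values are made up for illustration).
toy_markers <- data.frame(
  cluster    = c("T cell", "T cell", "T cell", "B cell", "B cell"),
  gene       = c("CD3E", "CD4", "IL7R", "CD79A", "MS4A1"),
  avg_log2FC = c(2.0, 1.0, 3.0, 2.5, 1.5),
  pct.1      = c(0.9, 0.5, 0.2, 0.8, 0.6),
  stringsAsFactors = FALSE
)

# Assumption: the "expression ratio" is taken to be pct.1.
toy_markers$FSS <- toy_markers$avg_log2FC * toy_markers$pct.1

# Rank within each cell type by descending FSS and keep the top genes,
# analogous in spirit to gene_filter.
top_n_genes <- 2
ranked <- toy_markers[order(toy_markers$cluster, -toy_markers$FSS), ]
top_by_type <- lapply(split(ranked, ranked$cluster), head, n = top_n_genes)
top_by_type
```

In this toy table IL7R has the largest fold change but is detected in few cells, so it ranks below CD3E once the expression ratio is factored in. That is the motivation for ranking by the product rather than by fold change alone.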
presto to speed up (alternative): For large datasets, the presto::wilcoxauc() function can be used to speed up marker detection (roughly 10x faster, at the cost of some accuracy).
seurat_markers <- dplyr::filter(
  presto::wilcoxauc(
    X = sce,
    group_by = "Cell_type",
    seurat_assay = "RNA"
  ),
  padj < 0.05, logFC > 0.5
)
Markers_list_Seurat <- Read_seurat_markers(seurat_markers,
  sources = "presto",
  sort_by = "FSS",
  gene_filter = 20
)
Important: This feature depends on the presto package; please run devtools::install_github('immunogenomics/presto') first.
Note: It is recommended to set sort_by = "logFC", or to set sort_by = "FSS" to rank markers by the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio).
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.3. Continue to section 3 for the automated annotation workflow.
Format Requirements:
Each sheet name = cell type (necessary)
First row = column headers (necessary)
First column = markers (necessary)
Subsequent columns = metrics (can be omitted)
Markers_list_Excel <- Read_excel_markers("D:/Laboratory/Marker_load.xlsx")
Link: The output Markers_list can be used in sections 3.1, 4.1, 4.2, 4.3 and 5.4. Continue to section 3 for the automated annotation workflow.
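The sheet layout above maps naturally onto a named list. The sketch below shows that mapping with readxl; read_marker_workbook() is a hypothetical helper, not a SlimR function, and is not claimed to match how Read_excel_markers() works internally. The demo uses the example workbook bundled with readxl, which is not a marker workbook and serves only to show the sheet-to-list structure.

```r
library(readxl)

# Hypothetical helper (not part of SlimR): read every sheet of a workbook
# into a named list, so that sheet name = cell type.
read_marker_workbook <- function(path) {
  sheets <- readxl::excel_sheets(path)
  marker_list <- lapply(sheets, function(s) {
    # read_excel() uses the first row as column headers by default
    as.data.frame(readxl::read_excel(path, sheet = s))
  })
  names(marker_list) <- sheets
  marker_list
}

# Demo on the example workbook that ships with readxl
demo_path <- readxl::readxl_example("datasets.xlsx")
demo_list <- read_marker_workbook(demo_path)
names(demo_list)  # one list element per sheet
```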
SlimR integrates multiple machine learning algorithms (e.g., Random Forest, Gradient Boosting, Support Vector Machine, Ensemble Learning) to automatically determine optimal values for the min_expression and specificity_weight parameters, which section 3.2 uses to calculate cell type probabilities.
# Basic usage
SlimR_params <- Parameter_Calculate(
  seurat_obj = sce,
  features = c("CD3E", "CD4", "CD8A"),
  assay = "RNA",
  cluster_col = "seurat_clusters",
  method = "ensemble",
  n_models = 3,
  return_model = FALSE,
  verbose = TRUE
)
# Use with custom method: use the genes corresponding to a specific cell type in 'Markers_list' as input
SlimR_params <- Parameter_Calculate(
  seurat_obj = sce,
  features = unique(Markers_list_Cellmarker2$`B cell`$marker),
  assay = "RNA",
  cluster_col = "seurat_clusters",
  method = "rf",
  return_model = FALSE,
  verbose = TRUE
)
Important: This step is optional; you can skip to section 3.2 and calculate cell type probabilities with the default parameters.
Note: The method parameter of Parameter_Calculate() selects the machine learning model, for example method = "rf". Available methods: rf (Random Forest), gbm (Gradient Boosting), svm (Support Vector Machine), or ensemble (Ensemble Learning; default).
Uses Markers_list to calculate probabilities and prediction results, calculates the corresponding AUC (optional), and generates a heat map and ROC graphs (optional) for cell annotation.
SlimR_anno_result <- Celltype_Calculate(seurat_obj = sce,
  gene_list = Markers_list,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  min_expression = 0.1,
  specificity_weight = 3,
  threshold = 0.8,
  compute_AUC = TRUE,
  plot_AUC = TRUE,
  AUC_correction = TRUE,
  colour_low = "navy",
  colour_high = "firebrick3"
)
If you have run the Parameter_Calculate() function in section 3.1 above, you can set min_expression = SlimR_params$min_expression and specificity_weight = SlimR_params$specificity_weight in the function Celltype_Calculate().
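One way to keep the defaults and the tuned values in a single place is sketched below. The SlimR_params object here is a stand-in with made-up values (the real one comes from Parameter_Calculate()), and the commented call only repeats arguments already shown in this section.

```r
# Defaults used in the Celltype_Calculate() call shown above
calc_params <- list(min_expression = 0.1, specificity_weight = 3)

# Stand-in for the object returned by Parameter_Calculate();
# the values are made up for illustration.
SlimR_params <- list(min_expression = 0.2, specificity_weight = 2.5)

# Overwrite the defaults with the tuned values
tuned <- SlimR_params[c("min_expression", "specificity_weight")]
calc_params <- modifyList(calc_params, tuned)
calc_params

# The merged values can then be passed on (not run here; needs a Seurat object):
# SlimR_anno_result <- Celltype_Calculate(
#   seurat_obj = sce, gene_list = Markers_list, species = "Human",
#   cluster_col = "seurat_clusters", assay = "RNA",
#   min_expression = calc_params$min_expression,
#   specificity_weight = calc_params$specificity_weight
# )
```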
Important: The parameter cluster_col must be exactly the same in the function Celltype_Calculate() and the function Celltype_Annotation() to avoid false matches.
Note: Setting AUC_correction = TRUE takes a little longer to compute (~20% longer than setting only plot_AUC = TRUE; ~40% longer than setting only compute_AUC = TRUE), but it is recommended because correcting the predicted cell types this way gives more accurate predictions. The lower the threshold parameter, the more alternative cell types are checked by AUC and the longer the run time.
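For intuition about what an AUC value expresses in this setting, the sketch below computes a rank-based AUC on toy data: how well a per-cell marker score separates cells inside a cluster from cells outside it. This is a generic illustration, not SlimR's internal computation; the function rank_auc() and the toy scores are hypothetical.

```r
# Rank-based AUC (the Mann-Whitney U statistic scaled to [0, 1]):
# the probability that a randomly chosen in-cluster cell scores higher
# than a randomly chosen out-of-cluster cell.
rank_auc <- function(score, in_cluster) {
  n_pos <- sum(in_cluster)
  n_neg <- sum(!in_cluster)
  r <- rank(score)  # average ranks are used for ties
  (sum(r[in_cluster]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data: a per-cell marker score and whether each cell is in the cluster
score      <- c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1)
in_cluster <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc <- rank_auc(score, in_cluster)
auc  # 8 of the 9 in/out pairs are ordered correctly, i.e. 8/9
```

An AUC near 1 means the markers of the predicted cell type separate the cluster well, while a value near 0.5 means they do no better than chance.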
Check the annotation probabilities between the clusters in the input cluster_col column and the cell types in Markers_list with the following code.
print(SlimR_anno_result$Heatmap_plot)
Note: If the heat map is not generated properly, please run library(pheatmap) first.
Cell type information results predicted by SlimR can be viewed with the following code.
View(SlimR_anno_result$Prediction_results)
Furthermore, the ROC curves and AUC values of the clusters in cluster_col and their predicted cell types can be viewed with the following code.
print(SlimR_anno_result$AUC_plot)
Important: This feature requires the parameter plot_AUC = TRUE.
Note: If the ROC plot is not generated properly, please run library(ggplot2) first.
After viewing the list of predicted cell types and the corresponding AUC values, the predicted cell types can be corrected with the following code.
Example 1:
# For example, cluster '15' in 'cluster_col' corresponds to cell type 'Intestinal stem cell'.
SlimR_anno_result$Prediction_results$Predicted_cell_type[
  SlimR_anno_result$Prediction_results$cluster_col == 15
] <- "Intestinal stem cell"
Example 2:
# For example, a predicted cell type with an AUC of 0.5 or less should be labeled 'Unknown'.
SlimR_anno_result$Prediction_results$Predicted_cell_type[
  SlimR_anno_result$Prediction_results$AUC <= 0.5
] <- "Unknown"
After modifying the corresponding predicted cell type, the following code is used to view the updated predicted cell type table.
View(SlimR_anno_result$Prediction_results)
Important: If you need to correct a cell type, it is strongly recommended to choose from the cell types in SlimR_anno_result$Prediction_results$Alternative_cell_type.
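The two correction patterns can be tried on a toy table before touching real results. The data frame below is a made-up stand-in for SlimR_anno_result$Prediction_results that uses only the column names appearing in this section; which() is used so that rows with a missing AUC are left unchanged.

```r
# Made-up stand-in for SlimR_anno_result$Prediction_results
Prediction_results <- data.frame(
  cluster_col         = c(0, 1, 2, 15),
  Predicted_cell_type = c("T cell", "B cell", "Enterocyte", "Goblet cell"),
  AUC                 = c(0.92, 0.45, NA, 0.71),
  stringsAsFactors    = FALSE
)

# Pattern 1: relabel one cluster by hand
rows <- which(Prediction_results$cluster_col == 15)
Prediction_results$Predicted_cell_type[rows] <- "Intestinal stem cell"

# Pattern 2: label low-confidence predictions 'Unknown';
# which() drops rows where AUC is NA, so they keep their label
rows <- which(Prediction_results$AUC <= 0.5)
Prediction_results$Predicted_cell_type[rows] <- "Unknown"

Prediction_results
```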
Assigns the cell types predicted by SlimR in SlimR_anno_result$Prediction_results$Predicted_cell_type to the Seurat object based on cluster annotations, and stores the results in seurat_obj@meta.data$annotation_col.
sce <- Celltype_Annotation(seurat_obj = sce,
  cluster_col = "seurat_clusters",
  SlimR_anno_result = SlimR_anno_result,
  plot_UMAP = TRUE,
  annotation_col = "Cell_type_SlimR"
)
Important: The parameter cluster_col must be exactly the same in the function Celltype_Calculate() and the function Celltype_Annotation() to avoid false matches. Likewise, the parameter annotation_col must be exactly the same in the function Celltype_Annotation() and the function Celltype_Verification().
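The reason the column names must agree is that the annotation is essentially a lookup from cluster label to predicted cell type. The sketch below shows that lookup on toy data frames with base R's match(); it illustrates the idea only and is not the code inside Celltype_Annotation(). Both tables and their values are made up.

```r
# Toy per-cell metadata (stand-in for seurat_obj@meta.data)
meta <- data.frame(
  seurat_clusters = c("0", "1", "0", "2"),
  stringsAsFactors = FALSE
)

# Toy per-cluster predictions (stand-in for Prediction_results)
pred <- data.frame(
  cluster_col         = c("0", "1", "2"),
  Predicted_cell_type = c("T cell", "B cell", "Unknown"),
  stringsAsFactors    = FALSE
)

# Look up each cell's cluster in the prediction table and copy the label
# into a new annotation column
idx <- match(meta$seurat_clusters, pred$cluster_col)
meta$Cell_type_SlimR <- pred$Predicted_cell_type[idx]
meta
```

If the metadata column used for the lookup held a different clustering than the one the predictions were computed on, match() would pair labels with the wrong cells or return NA, which is the kind of false match the note warns about.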
Verifies the annotation using the cell group identity information in seurat_obj@meta.data$annotation_col, with the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio) as the ranking basis.
Celltype_Verification(seurat_obj = sce,
SlimR_anno_result = SlimR_anno_result,
gene_number = 5,
assay = "RNA",
colour_low = "white",
colour_high = "navy",
annotation_col = "Cell_type_SlimR"
)
Important: The parameter annotation_col must be exactly the same in the function Celltype_Annotation() and the function Celltype_Verification() to avoid false matches.
Note: Cell types listed in SlimR_anno_result$Prediction_results are verified using the marker information from SlimR_anno_result$Expression_list; cell types that are not in that list are validated using the marker information from the function FindMarkers().
Generate a heat map that estimates how closely each cell cluster resembles each reference cell type:
Celltype_Annotation_Heatmap(
seurat_obj = sce,
gene_list = Markers_list,
species = "Human",
cluster_col = "seurat_cluster",
min_expression = 0.1,
specificity_weight = 3,
colour_low = "navy",
colour_high = "firebrick3"
)
Note: This function has now been incorporated into Celltype_Calculate(), and it is recommended to use Celltype_Calculate() instead.
Generates a per-cell-type expression dot plot with a metric heat map (when metric information exists):
Celltype_Annotation_Features(
seurat_obj = sce,
gene_list = Markers_list,
gene_list_type = "Cellmarker2",
species = "Human",
save_path = "./SlimR/Celltype_Annotation_Features/",
colour_low = "white",
colour_high = "navy",
colour_low_mertic = "white",
colour_high_mertic = "navy"
)
Each resulting combined image consists of a dot plot above and, if metric information is present, a metric heat map for the corresponding markers below. The dot plot shows the relationship between expression level and expression ratio for the cell type and its corresponding markers.
Generates per-cell-type expression combined plots:
Celltype_Annotation_Combined(
seurat_obj = sce,
gene_list = Markers_list,
species = "Human",
cluster_col = "seurat_cluster",
assay = "RNA",
save_path = "./SlimR/Celltype_Annotation_Combined/",
colour_low = "white",
colour_high = "navy"
)
Each generated combined plot shows the box plot of the expression levels of the corresponding markers for that cell type, with the colors corresponding to the average expression levels of the markers.
The functions in sections 5.1, 5.2, 5.3 and 5.4 have been incorporated into Celltype_Annotation_Features(), and it is recommended to use Celltype_Annotation_Features() with the corresponding parameters (for example, gene_list_type = "Cellmarker2") instead. For more information, please refer to section 4.2.
Celltype_annotation_Cellmarker2(
seurat_obj = sce,
gene_list = Markers_list_Cellmarker2,
species = "Human",
cluster_col = "seurat_cluster",
assay = "RNA",
save_path = "./SlimR/Celltype_annotation_Cellmarkers2/",
colour_low = "white",
colour_high = "navy",
colour_low_mertic = "white",
colour_high_mertic = "navy"
)
Note: To get this behavior from Celltype_Annotation_Features(), set the parameter gene_list_type = "Cellmarker2".
Celltype_annotation_PanglaoDB(
seurat_obj = sce,
gene_list = Markers_list_panglaoDB,
species = "Human",
cluster_col = "seurat_cluster",
assay = "RNA",
save_path = "./SlimR/Celltype_annotation_PanglaoDB/",
colour_low = "white",
colour_high = "navy",
colour_low_mertic = "white",
colour_high_mertic = "navy"
)
Note: To get this behavior from Celltype_Annotation_Features(), set the parameter gene_list_type = "PanglaoDB".
Celltype_annotation_Seurat(
seurat_obj = sce,
gene_list = Markers_list_Seurat,
species = "Human",
cluster_col = "seurat_cluster",
assay = "RNA",
save_path = "./SlimR/Celltype_annotation_Seurat/",
colour_low = "white",
colour_high = "navy",
colour_low_mertic = "white",
colour_high_mertic = "navy"
)
Note: To get this behavior from Celltype_Annotation_Features(), set the parameter gene_list_type = "Seurat".
Celltype_annotation_Excel(
seurat_obj = sce,
gene_list = Markers_list_Excel,
species = "Human",
cluster_col = "seurat_cluster",
assay = "RNA",
save_path = "./SlimR/Celltype_annotation_Excel/",
colour_low = "white",
colour_high = "navy",
colour_low_mertic = "white",
colour_high_mertic = "navy"
)
Note: To get this behavior from Celltype_Annotation_Features(), set the parameter gene_list_type = "Excel". This function also works with a Markers_list without metric information or with metric information generated in other ways.
Thank you for using SlimR. For questions, issues, or suggestions, please submit them in the issue section or discussion section on GitHub (suggested) or send an email (alternative):
zhaoqingwang@mail.sdu.edu.cn
Zhaoqing Wang