MEP_LINCS_Pilot/MEP-LINCS_Preprocessing.Rmd at master · markdane/MEP_LINCS_Pilot · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
---
title: "MEP-LINCs Preprocessing"
author: "Mark Dane"
date:  "`r Sys.Date()`"
output: html_document
---

##Introduction

The MEP-LINCs dataset contains imaging data from a Nikon automated microscope that is analyzed with a CellProfiler pipeline.

This preprocessing of the dataset will be deprecated when the merging of the data and metadata happens within the CellProfiler part of the pipeline. For now, the metadata about the ECM proteins is read from the GAL file and the metadata about the wells (cell line, stains and ligands) is read from Excel spreadsheets.

```{r setup, echo=FALSE}
source("MEPLINCSFunctions.R")

library("limma")#read GAL file and strsplit2
library("MEMA")#merge, annotate and normalize functions
library("data.table")#fast file reads, data merges and subsetting
library("parallel")#use multiple cores for faster processing

#Select a staining set
ss <- "SS3"
#Select a CellLine
cellLine <- "PC3"
#select analysis version
analysisVersion <- "v1"

#Rules-based classifier thresholds for perimeter cells
neighborsThresh <- 0.4 #Gates sparse cells on a spot
wedgeAngs <- 20 #Size in degrees of spot wedges used in perimeter gating
outerThresh <- 0.5 #Defines out cells used in perimeter gating

#Filter out debris based on nuclear area
nuclearAreaThresh <- 50
nuclearAreaHiThresh <- 4000

#Only process a curated set of the data
curatedOnly <- FALSE
curatedCols <- "ImageNumber|ObjectNumber|_Area$|_Eccentricity|_Perimeter|_MedianIntensity_|_IntegratedIntensity_|_Center_|_PA_"

#Do not normalized to Spot level
normToSpot <- TRUE

```


##Summary
This script prepares cell-level data and metadata for the MEP LINCs Analysis Pipeline.

In the code, the variable ss determines which staining set (SS1, SS2 or SS3) to merge and the variable cellLine determines the cell line (PC3,MCF7, etc). All .txt data files in the "./RawData" folder will be merged with the well (xlsx) and log (XML) data from the "./Metadata" folder.

The merging assumes that the actual, physical B row wells (B01-B04) have been printed upside-down. That is, rotated 180 degrees resulting in the spot 1, 1 being in the lower right corner instead of the upper left corner. The metadata is matched to the actual printed orientation.

```{r Read and clean spotmetadata, echo=FALSE}

#Read in the spot metadata from the gal file
smd <- readSpotMetadata(paste0("./",cellLine,"/",ss,"/Metadata/20150515_LI8X001_v1.2.gal"))
#Relabel the column Name to ECMpAnnotID
setnames(smd, "Name", "ECMpAnnotID")

#Add the print order and deposition number to the metadata
ldf <- readLogData(paste0("./",cellLine,"/",ss,"/Metadata/20150512-112336.xml"))
spotMetadata <- merge(smd,ldf, all=TRUE)
setkey(spotMetadata,Spot)
#Make a rotated version of the spot metadata to match the print orientation
spotMetadata180 <- rotateMetadata(spotMetadata)
ARowMetadata <- data.table(spotMetadata,Well=rep(c("A01", "A02","A03","A04"),each=nrow(spotMetadata)))
BRowMetadata <- data.table(spotMetadata180,Well=rep(c("B01", "B02","B03","B04"),each=nrow(spotMetadata180)))

```

The well metadata describes the cell line, ligands and staining endpoints that are all added on a per well basis. There is one mutlisheet .xlsx file for each plate. Each filename is the plate's barcode.


The raw data from all wells in all plates in the dataset are read in and merged with their spot and well metadata. The number of nuclei at each spot are counted and a loess model of the spot cell count is added. Then all intensity values are normalized through dividing them by the median intensity value of the control well in the same plate.

Next, the data is filtered to remove objects with a nuclear area less than `r nuclearAreaThresh` pixels or more than `r nuclearAreaHiThresh' pixels.

```{r merge_normalize_QA, echo=FALSE}
#The next steps are to bring in the well metadata, the print order and the CP data

cellDataFiles <- dir(paste0("./",cellLine,"/", ss,"/RawData/",analysisVersion),full.names = TRUE)
splits <- strsplit2(strsplit2(cellDataFiles,split = "_")[,1],"/")
barcodes <- unique(splits[,ncol(splits)])
expDTList <- mclapply(barcodes, function(barcode){
  #browser()
  plateDataFiles <- grep(barcode,cellDataFiles,value = TRUE)
  wells <- unique(strsplit2(split = "_",plateDataFiles)[,2])
  wellDataList <- lapply(wells,function(well){
    #browser()
    wellDataFiles <- grep(well,plateDataFiles,value = TRUE)
    imageDataFile <- grep("Image",wellDataFiles,value=TRUE,
                          ignore.case = TRUE)
    nucleiDataFile <- grep("Nuclei",wellDataFiles,value=TRUE,
                           ignore.case = TRUE)
    if (ss %in% c("SS1","SS3")){
      cellsDataFile <- grep("Cell",wellDataFiles,value=TRUE,
                            ignore.case = TRUE)
      cytoplasmDataFile <- grep("Cytoplasm",wellDataFiles,value=TRUE,
                                ignore.case = TRUE)
    }
    #Read in csv data
    image <- convertColumnNames(fread(imageDataFile))
    setkey(image,CP_ImageNumber)
    nuclei <- convertColumnNames(fread(nucleiDataFile))
    if (curatedOnly) nuclei <- nuclei[,grep(curatedCols,colnames(nuclei)), with=FALSE]
    setkey(nuclei,CP_ImageNumber,CP_ObjectNumber)
    if (ss %in% c("SS1","SS3")){
      cells <- convertColumnNames(fread(cellsDataFile))
      if (curatedOnly) cells <- cells[,grep(curatedCols,colnames(cells)), with=FALSE]
      setkey(cells,CP_ImageNumber,CP_ObjectNumber)
      cytoplasm <- convertColumnNames(fread(cytoplasmDataFile))
      if (curatedOnly) cytoplasm <- cytoplasm[,grep(curatedCols,colnames(cytoplasm)), with=FALSE]
      setkey(cytoplasm,CP_ImageNumber,CP_ObjectNumber)
    }

    #Add the data location as a prefix in the column names
    setnames(nuclei,paste0("Nuclei_",colnames(nuclei)))
    if (ss %in% c("SS1","SS3")){
      setnames(cells,paste0("Cells_",colnames(cells)))
      setnames(cytoplasm,paste0("Cytoplasm_",colnames(cytoplasm)))
    }

    #Merge the cells, cytoplasm and nuclei data
    if (ss %in% c("SS1","SS3")){
      setkey(cells,Cells_CP_ImageNumber,Cells_CP_ObjectNumber)
      setkey(cytoplasm,Cytoplasm_CP_ImageNumber,Cytoplasm_CP_ObjectNumber)
      setkey(nuclei,Nuclei_CP_ImageNumber,Nuclei_CP_ObjectNumber)

      DT <- cells[cytoplasm[nuclei]]
      setnames(DT,"Cells_CP_ImageNumber","Spot")
      setnames(DT,"Cells_CP_ObjectNumber","ObjectNumber")
    } else {
      DT <- nuclei
      setnames(DT,"Nuclei_CP_ImageNumber","Spot")
      setnames(DT,"Nuclei_CP_ObjectNumber","ObjectNumber")
    }

    #Add the well name as a parameter
    DT <- DT[,Well := well]

    #Merge the data with its metadata based on the row it's in
    m <- regexpr("[[:alpha:]]",well)
    row <- regmatches(well,m)
    setkey(DT,Spot)
    DT <- switch(row, A = merge(DT,spotMetadata,all=TRUE),
                 B = merge(DT,spotMetadata180,all=TRUE))

    return(DT)
  })

  #Create the cell data.table with spot metadata for the plate
  pcDT <- rbindlist(wellDataList, fill = TRUE)
  #Read the well metadata from a multi-sheet Excel file
  wellMetadata <- data.table(readMetadata(paste0("./",cellLine,"/",
                                                 ss,"/Metadata/",barcode,".xlsx")), key="Well")

  #merge well metadata with the data and spot metadata
  pcDT <- merge(pcDT,wellMetadata,by = "Well")
  pcDT <- pcDT[,Barcode := barcode]
  #Count the cells at each spot
  pcDT<-pcDT[,Spot_PA_SpotCellCount := .N,by="Barcode,Well,Spot"]

  pcDT <- pcDT[pcDT$Nuclei_CP_AreaShape_Area > nuclearAreaThresh,]
  pcDT <- pcDT[pcDT$Nuclei_CP_AreaShape_Area < nuclearAreaHiThresh,]

  return(pcDT)
}, mc.cores = 4)

cDT <- rbindlist(expDTList, fill = TRUE)
densityRadius <- sqrt(median(cDT$Nuclei_CP_AreaShape_Area)/pi)

#Create short display names, then replace where not unique
cDT$ECMp <- gsub("_.*","",cDT$ECMpAnnotID)
cDT$Ligand <- gsub("_.*","",cDT$LigandAnnotID)

#Use entire AnnotID for ligands with same uniprot IDs
cDT$Ligand[cDT$Ligand == "NRG1"] <- cDT$LigandAnnotID[cDT$Ligand == "NRG1"]
cDT$Ligand[cDT$Ligand == "TGFB1"] <- cDT$LigandAnnotID[cDT$Ligand == "TGFB1"]
cDT$Ligand[cDT$Ligand == "CXCL12"] <- cDT$LigandAnnotID[cDT$Ligand == "CXCL12"]

cDT$MEP <- paste(cDT$ECMp,cDT$Ligand,sep = "_")

```


After merging the metadata with the cell-level data, several types of derived parameters are added. These include:

The origin of coordinate system is placed at the median X and Y of each spot and the local cartesian and polar coordinates are added to the dataset.

The number of nuclei within three nuclear radii of `r densityRadius ` around each nuclei is counted and stored as a neighbor count parameter. The neighbor count value is thresholded to classify each cell as Sparse or not.The distance from the local origin is used to classify each cell as an OuterCell or not. The Sparse, OutCell and Wedge classifications are used to classify each cell as a Perimeter cell or not.

For staining set 2, each cell is classified as EdU+ or EdU-. The threshold for EdU+ is based on kmeans threshold of the mean EdU intensity from the control well of each plate.

The intensity values are normalized at each spot so that spot-level variations can be analyzed.

The cell level raw data and metadata is saved as Level 1 data. Normalized vaules are added to the dataset and saved as Level 2 data.

```{r, echo=FALSE, message=FALSE, warnings=FALSE}

#Add the local polar coordinates and Neighbor Count
cDT <- cDT[,Nuclei_PA_Centered_X :=  Nuclei_CP_AreaShape_Center_X-median(Nuclei_CP_AreaShape_Center_X)]
cDT <- cDT[,Nuclei_PA_Centered_Y :=  Nuclei_CP_AreaShape_Center_Y-median(Nuclei_CP_AreaShape_Center_Y)]
cDT <- cDT[, Nuclei_PA_AreaShape_Center_R := sqrt(Nuclei_PA_Centered_X^2 + Nuclei_PA_Centered_Y^2)]
cDT <- cDT[, Nuclei_PA_AreaShape_Center_Theta := calcTheta(Nuclei_PA_Centered_X, Nuclei_PA_Centered_Y)]
cDT <- cDT[,Nuclei_PA_AreaShape_Neighbors := cellNeighbors(.SD, radius = densityRadius*7), by = "Barcode,Well,Spot"]

#Rules for classifying perimeter cells
cDT <- cDT[,Spot_PA_Sparse := Nuclei_PA_AreaShape_Neighbors < neighborsThresh]

#Add a local wedge ID to each cell based on conversations with Michel Nederlof
cDT <- cDT[,Spot_PA_Wedge:=ceiling(Nuclei_PA_AreaShape_Center_Theta/wedgeAngs)]

#Define the perimeter cell if it exists in each wedge
#Classify cells as outer if they have a radial position greater than a thresh
cDT <- cDT[,Spot_PA_OuterCell := labelOuterCells(Nuclei_PA_AreaShape_Center_R, thresh=outerThresh),by="Barcode,Well,Spot"]

#Require the cell not be in a sparse region
denseOuterDT <- cDT[!cDT$Spot_PA_Sparse  & cDT$Spot_PA_OuterCell]
denseOuterDT <- denseOuterDT[,Spot_PA_Perimeter := findPerimeterCell(.SD) ,by="Barcode,Well,Spot,Spot_PA_Wedge"]
setkey(cDT,Barcode,Well,Spot,ObjectNumber)
setkey(denseOuterDT,Barcode,Well,Spot,ObjectNumber)
cDT <- denseOuterDT[,list(Barcode,Well,Spot,ObjectNumber,Spot_PA_Perimeter)][cDT]
cDT$Spot_PA_Perimeter[is.na(cDT$Spot_PA_Perimeter)] <- FALSE

# #Set 2N and 4N DNA status
cDT <- cDT[,Nuclei_PA_Cycle_State := kmeansDNACluster(Nuclei_CP_Intensity_IntegratedIntensity_Dapi), by="Barcode,Well"]

cDT <- cDT[,Nuclei_PA_Cycle_2NProportion := sum(Nuclei_PA_Cycle_State==1)/length(ObjectNumber),by="Barcode,Well,Spot"]
cDT <- cDT[,Nuclei_PA_Cycle_4NProportion := sum(Nuclei_PA_Cycle_State==2)/length(ObjectNumber),by="Barcode,Well,Spot"]

#Add spot level normalizations for selected intensities
if(normToSpot){
  intensityNamesAll <- grep("_CP_Intensity_Median",colnames(cDT), value=TRUE)
  intensityNames <- grep("MedNorm",intensityNamesAll,invert=TRUE,value=TRUE)
  for(intensityName in intensityNames){
    #Median normalize the median intensity at each spot
    cDT <- cDT[,paste0(intensityName,"_SpotNorm") := medianNorm(.SD,intensityName),by="Barcode,Well,Spot"]
  }
}

#Create staining set specific derived parameters
if(ss %in% c("SS1")){

} else if (ss == "SS2"){

  cDT <- cDT[,Nuclei_PA_Gated_EduPositive := kmeansCluster(.SD, value="Nuclei_CP_Intensity_MedianIntensity_Edu", ctrlLigand = "FBS"), by="Barcode"]
  #Calculate the EdU Positive Percent at each spot
  cDT <- cDT[,Nuclei_PA_Gated_EduPositiveProportion := sum(Nuclei_PA_Gated_EduPositive)/length(ObjectNumber),by="Barcode,Well,Spot"]

} else if (ss == "SS3"){
  #Calculate a lineage ratio of luminal/basal or KRT19/KRT5
  cDT <- cDT[,Cytoplasm_PA_Intensity_LineageRatio := Cytoplasm_CP_Intensity_MedianIntensity_KRT19/Cytoplasm_CP_Intensity_MedianIntensity_KRT5]

} else stop("Invalid ss parameter")

# Eliminate Variations in the Endpoint metadata
endpointNames <- grep("End",colnames(cDT), value=TRUE)
endpointWL <- regmatches(endpointNames,regexpr("[[:digit:]]{3}|DAPI",endpointNames))
setnames(cDT,endpointNames,paste0("Endpoint",endpointWL))

#Identify parameters that shouldn't be normalized
normParameters <- grep("Sparse|Wedge|OuterCell|Spot_PA_Perimeter|Nuclei_PA_Cycle_State",colnames(cDT),value=TRUE,invert=TRUE)

#Save the un-normalized parameters to merge in later
mdKeep <- cDT[,grep("Barcode|Well|^Spot$|ObjectNumber|Sparse|Wedge|OuterCell|Spot_PA_Perimeter|Nuclei_PA_Cycle_State",colnames(cDT),value=TRUE), with = FALSE]

#Normalized all measured parameters to their plate's control well
cDT <- normDataset(cDT[,normParameters,with=FALSE])
cDT <- merge(cDT,mdKeep)

```

The cell-level data is median summarized to the spot level and coefficients of variations on the replicates are calculated. The spot level data and metadata are saved as Level 3 data.

```{r Level3, echo=FALSE}

#Summarize cell data to spot level (sl) by taking the medians of the parameters
parameterNames<-grep(pattern="(Children|_CP_|_PA_|Barcode|^Spot$|^Well$)",x=names(cDT),value=TRUE)

#Remove any spot-normalized and cell level parameters
parameterNames <- grep("SpotNorm|^Nuclei_PA_Gated_EduPositive$|^Nuclei_PA_Gated_EduPositive_MedNorm$",parameterNames,value=TRUE,invert=TRUE)

cDTParameters<-cDT[,parameterNames,with=FALSE]
slDT<-cDTParameters[,lapply(.SD,numericMedian),keyby="Barcode,Well,Spot"]

#Merge back in the spot and well metadata
metadataNames <- grep("(Row|Column|PrintOrder|Block|^ID$|Array|CellLine|Ligand|Endpoint|ECMp|MEP|Barcode|^Well$|^Spot$)", x=colnames(cDT), value=TRUE)
setkey(cDT,Barcode, Well,Spot)
mDT <- cDT[,metadataNames,keyby="Barcode,Well,Spot", with=FALSE]
slDT <- mDT[slDT, mult="first"]

#Add a count of replicates
slDT <- slDT[,Spot_PA_ReplicateCount := .N,by="LigandAnnotID,ECMpAnnotID"]

#Add the loess model of the SpotCellCount on a per well basis
slDT <- slDT[,Spot_PA_LoessSCC := loessModel(.SD, value="Spot_PA_SpotCellCount", span=.5), by="Barcode,Well"]

#Add well level QA Scores
lthresh <- 0.6
slDT <- slDT[,QAScore := calcQAScore(.SD,threshold=lthresh,maxNrSpot = max(cDT$ArrayRow)*max(cDT$ArrayColumn),value="Spot_PA_LoessSCC"),by="Barcode,Well"]

```

The spot level data is median summarized to the replicate level and is stored as Level 4 data and metadata.

```{r Level3Data, echo=FALSE}

#Summarize spot level data to MEP level by taking the medians of the parameters
mepNames<-grep("AreaShape|Children|Intensity|_PA_|LigandAnnotID|ECMpAnnotID|Barcode", x=names(slDT),value=TRUE)

mepKeep<-slDT[,mepNames,with=FALSE]
mepDT<-mepKeep[,lapply(.SD,numericMedian),keyby="LigandAnnotID,ECMpAnnotID,Barcode"]
#Merge back in the replicate metadata
mDT <- slDT[,list(Well,CellLine,Ligand,Endpoint488,Endpoint555,Endpoint647,EndpointDAPI,ECMp,MEP),keyby="LigandAnnotID,ECMpAnnotID,Barcode"]
mepDT <- mDT[mepDT, mult="first"]


```


```{r WriteData, echo=FALSE, message=FALSE, warnings=FALSE, eval=TRUE}

#Write out cDT without normalized values as level 1 dataset
level1Names <- grep("Norm",colnames(cDT),value=TRUE,invert=TRUE)
write.table(format(cDT[,level1Names, with=FALSE], digits=4), paste0("./",cellLine,"/", ss, "/AnnotatedData/", unique(cDT$CellLine),"_",ss,"_","Level1.txt"), sep = "\t",row.names = FALSE, quote=FALSE)

#Write out cDT with normalized values as level 2 dataset
write.table(format(cDT, digits=4), paste0("./",cellLine,"/", ss, "/AnnotatedData/", unique(cDT$CellLine),"_",ss,"_","Level2.txt"), sep = "\t",row.names = FALSE, quote=FALSE)

write.table(format(slDT, digits = 4), paste0("./",cellLine,"/", ss, "/AnnotatedData/", unique(slDT$CellLine),"_",ss,"_","Level3.txt"), sep = "\t",row.names = FALSE, quote=FALSE)

write.table(format(mepDT, digits = 4), paste0("./",cellLine,"/",ss, "/AnnotatedData/", unique(slDT$CellLine),"_",ss,"_","Level4.txt"), sep = "\t",row.names = FALSE, quote=FALSE)

```