Q7. Why is the default percentile set to '20' in the 'Filter probe set by expression option'? Also, how does changing the number of samples make a difference to the filtering results?
The filtering here depends on two factors: the percentile cutoff, and the number of samples in which a probe set must have an intensity value within the specified range. Together, these two factors determine which probe sets are eliminated.
Factor 1: Percentile cutoff
The percentile cutoff is set so that it eliminates only genes that are not expressed. If the intensity value of a probe set is below the 20th percentile in a sample, the gene is probably not expressed in that sample. It is known that in any given tissue, not all genes in the human genome are expressed. On average across different types of tissues, about 20% of the genes can be expected not to be expressed. Therefore, about 20% of the probe sets on any given genome-wide array (sample) can be expected to have intensity values that represent noise, since the corresponding genes are not expressed.
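As a rough illustration of the percentile cutoff within one sample, the minimal Python sketch below computes the 20th-percentile intensity and lists the probe sets that fall below it. The function name, the probe set names, and the intensity values are all hypothetical, and the interpolation method used for the percentile is an assumption; this is not the software's actual implementation.

```python
import statistics


def percentile_cutoff(intensities, pct=20):
    """Return the intensity at the given percentile (integer 1-99) of one sample."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index pct - 1 is the pct-th percentile.
    return statistics.quantiles(intensities, n=100, method="inclusive")[pct - 1]


# Hypothetical intensities for ten probe sets in a single sample.
sample = {
    "probe_1": 3, "probe_2": 5, "probe_3": 40, "probe_4": 55, "probe_5": 60,
    "probe_6": 72, "probe_7": 80, "probe_8": 95, "probe_9": 110, "probe_10": 150,
}

cutoff = percentile_cutoff(list(sample.values()))

# Probe sets below the cutoff are treated as "probably not expressed" in this sample.
below = [name for name, value in sample.items() if value < cutoff]

print(cutoff, below)
```

With ten probe sets and a cutoff at the 20th percentile, the two lowest-intensity probe sets fall below the cutoff in this sample.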
Factor 2: Number of samples
If probe sets were required to have values within the range (above the 20th percentile) in all samples (or in both conditions), interesting genes could be excluded, and potentially interesting biological changes between experimental conditions could be missed. To reduce the chance of missing these changes, the stringency of the filter is set so that a probe set passes the filter even if the gene is expressed in only one sample in the experiment. However, the user can change this criterion to suit their interest.
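The effect of the number-of-samples criterion can be sketched in Python as follows. This is only an illustration under stated assumptions: `filter_probe_sets`, its `min_samples` parameter, and the data are hypothetical, values exactly at the cutoff are assumed to pass, and the software's own implementation may differ.

```python
import statistics


def filter_probe_sets(expression, pct=20, min_samples=1):
    """Keep probe sets at or above the per-sample percentile cutoff
    in at least min_samples samples.

    expression maps a probe set name to a list of intensities, one per sample.
    """
    n_samples = len(next(iter(expression.values())))

    # One cutoff per sample, computed from that sample's own intensities.
    cutoffs = []
    for s in range(n_samples):
        column = [values[s] for values in expression.values()]
        cutoffs.append(
            statistics.quantiles(column, n=100, method="inclusive")[pct - 1]
        )

    kept = []
    for name, values in expression.items():
        # Count the samples in which this probe set is within the range.
        passing = sum(1 for value, cut in zip(values, cutoffs) if value >= cut)
        if passing >= min_samples:
            kept.append(name)
    return kept


# Hypothetical intensities for ten probe sets across three samples.
expression = {
    "probe_1": [2, 3, 1],        # low in every sample
    "probe_2": [4, 5, 95],       # expressed in one sample only
    "probe_3": [85, 80, 6],      # expressed in two of three samples
    "probe_4": [70, 72, 74],
    "probe_5": [60, 65, 62],
    "probe_6": [100, 110, 105],
    "probe_7": [120, 115, 125],
    "probe_8": [90, 92, 94],
    "probe_9": [130, 135, 140],
    "probe_10": [150, 145, 155],
}

kept_lenient = filter_probe_sets(expression, min_samples=1)  # pass in any one sample
kept_strict = filter_probe_sets(expression, min_samples=3)   # must pass in all samples

print(len(kept_lenient), len(kept_strict))
```

In this made-up data set, requiring only one passing sample removes just the probe set that is low everywhere, while requiring all three samples also removes probe_2 and probe_3, which are the kind of probe sets that may reflect a real difference between conditions.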