Geo-social gradients in predicted COVID-19 prevalence in Great Britain: results from 1 960 242 users of the COVID-19 Symptoms Study app

Understanding the geographical distribution of COVID-19 through the general population is key to the provision of adequate healthcare services. Using self-reported data from 1 960 242 unique users in Great Britain (GB) of the COVID-19 Symptom Study app, we estimated that, concurrent to the GB government sanctioning lockdown, COVID-19 was distributed across GB, with evidence of ‘urban hotspots’. We found a geo-social gradient associated with predicted disease prevalence suggesting urban areas and areas of higher deprivation are most affected. Our results demonstrate use of self-reported symptoms data to provide focus on geographical areas with identified risk factors.

Early in the pandemic, case distribution was not evenly spread across countries, with dense urban centres being the most affected. 1 Individuals in deprived areas have lower life expectancy, 2 are more likely to have multiple underlying comorbidities, have a higher level of influenza-associated hospitalisation 3 and therefore could be more susceptible to COVID-19. 2 Based on the known socioeconomic health gradient, we hypothesised that individuals in deprived areas were at greater risk of contracting COVID-19. Understanding the geographical distribution of the virus in a socioeconomic context is key to supporting adequate healthcare resourcing, particularly intensive care beds. 4 Here we investigated the geographical distribution of COVID-19 in Great Britain (GB) and its association with area-level deprivation using self-reported data from almost 2 million users of the COVID-19 Symptom Study. 5 We studied 1 960 242 unique GB app users (20-69 years old) reporting on COVID-19 symptoms, hospitalisation, reverse-transcription PCR (RT-PCR) test outcomes, demographic information and pre-existing medical conditions (online supplemental methods) over 23 days (29 March-19 April) of major social distancing measures ('lockdown').
We computed a proxy for contracting COVID-19, based on reported symptoms 6 (positive predictive value=0.69 (0.66; 0.71)) (online supplemental methods). We then calculated a predicted prevalence as the proportion of app users that we predicted to have COVID-19 within each area (online supplemental figure S1).
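The prevalence calculation described above can be sketched as follows. This is a minimal illustration, assuming each user record carries an area code and a boolean flag from the symptom-based prediction; the function name, field names and toy data are hypothetical and are not taken from the study.

```python
from collections import defaultdict

def predicted_prevalence(records):
    """Return {area: proportion of app users predicted positive}.

    `records` is an iterable of (area_code, predicted_positive) pairs.
    Both names are illustrative assumptions, not the app's schema.
    """
    users = defaultdict(int)      # number of app users per area
    positives = defaultdict(int)  # number predicted positive per area
    for area, predicted_positive in records:
        users[area] += 1
        if predicted_positive:
            positives[area] += 1
    return {area: positives[area] / users[area] for area in users}

# Toy data, invented for illustration only.
records = [("A", True), ("A", False), ("A", False), ("A", False),
           ("B", True), ("B", True), ("B", False), ("B", False)]
prevalence = predicted_prevalence(records)
```

The denominator here is the number of responders in the area, not the resident population, which matches the definition given in the text.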
Following aggregation of variables to local authority district level (LAD/geographic unit representing ~17 000 individuals), we tested the geographical distribution of predicted prevalence at eight different time points spanning 23 days. We used Local Moran's I tests, which assess for nonrandom spatial distribution and clustering of a feature and can be used to identify disease hotspots and cold spots relative to the mean GB predicted prevalence 7 (online supplemental methods).
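To make the test concrete, the sketch below computes Local Moran's I with row-standardised weights and a conditional-permutation pseudo p-value. It is a pure-Python illustration under stated assumptions: the text does not say which implementation was used, the scaling by the second moment is one common convention, the two-sided permutation rule is a simplification, and the neighbour lists and prevalence values are invented.

```python
import random

def row_standardise(neighbours):
    """Turn {area: [neighbour, ...]} into weights that sum to 1 per area.

    Assumes every area has at least one neighbour.
    """
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbours.items()}

def local_morans_i(values, weights):
    """Local Moran's I per area: I_i = (z_i / m2) * sum_j w_ij * z_j."""
    n = len(values)
    mean = sum(values.values()) / n
    z = {i: v - mean for i, v in values.items()}
    m2 = sum(d * d for d in z.values()) / n  # second moment of the deviations
    return {i: z[i] / m2 * sum(w * z[j] for j, w in weights[i].items())
            for i in values}

def pseudo_p_values(values, weights, permutations=999, seed=0):
    """Conditional permutation: keep area i fixed, shuffle all other values."""
    rng = random.Random(seed)
    observed = local_morans_i(values, weights)
    areas = list(values)
    p = {}
    for i in areas:
        others = [a for a in areas if a != i]
        extreme = 0
        for _ in range(permutations):
            shuffled = [values[a] for a in others]
            rng.shuffle(shuffled)
            permuted = dict(zip(others, shuffled))
            permuted[i] = values[i]
            # Two-sided comparison (a simplifying assumption).
            if abs(local_morans_i(permuted, weights)[i]) >= abs(observed[i]):
                extreme += 1
        p[i] = (extreme + 1) / (permutations + 1)
    return p

# Four hypothetical areas in a row: two high-prevalence, two low-prevalence.
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
values = {"A": 0.08, "B": 0.07, "C": 0.02, "D": 0.01}
weights = row_standardise(neighbours)
local_i = local_morans_i(values, weights)
p_values = pseudo_p_values(values, weights, permutations=99)
```

A positive value means an area resembles its neighbours (high next to high, or low next to low). With only four areas the permutation p-values are not meaningful; the toy data only shows the mechanics.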
Next, we used data from the eight time points in multivariable mixed-effects models to investigate the association between predicted area-level prevalence (at middle super output area (MSOA) level) and deprivation (as captured by the Index of Multiple Deprivation), adjusting for geo-social mediators and confounders (air pollution, general practitioners per MSOA, household density and urbanicity), area-level aggregates of obesity and comorbidities, area-level adjusted mean age and sex, and spatial autocorrelation 8 (online supplemental methods).
See table 1 and online supplemental table S1. The number of predicted COVID-19 positive individuals ranged between 15 991 and 79 378.
Local Moran's I showed that, after adjustment for multiple testing, predicted COVID-19 prevalence clustered in urban areas across GB when considered as a proportion of the population per LAD 7 (figure 1 and online supplemental figure S2). Predicted prevalence decreased over time, consistent with 'lockdown' (figure 1 and online supplemental figure S2) (pairwise Wilcoxon rank-sum tests, prevalence: all time points except T2:T3 and T1:T4, p<0.001), but some hotspots remained.
In the MSOA-level analysis, area-level deprivation was significantly associated with predicted area-level prevalence in all models (M1-M6, see online supplemental table S2), including in the full model (M6) when adjusting for all geo-social covariates and comorbidities (M6: Beta (95% CI)=−0.15 (−0.17 to −0.13), p<0.001). This suggests that people in deprived areas were at higher risk.
Predicted COVID-19 prevalence was higher in urban areas than in rural areas, and higher in more deprived areas than in less deprived areas. This could reflect the likelihood of individuals in more deprived areas working/living with people whose vocations mean they are unable to work from home and are thus more likely to be exposed to circulating COVID-19. Accumulation of socioenvironmental exposures across the life course is known to contribute to a greater health deficit and disease burden 2 ; our results suggest that COVID-19 is no exception.
Moreover, our study illustrates how app data could be used to successfully monitor COVID-19 over time and identify hotspots as the viral pandemic progresses and social distancing measures are implemented or eased. Using this method, we detected a geo-social gradient associated with prevalence in the context of COVID-19, suggesting the focus of resources should be on deprived urban areas.
Our study has some limitations and assumptions. First, we used self-reported data on symptoms, which can lead to bias. For example, should users in deprived areas report more symptoms due to a facet of the socioeconomic environment (eg, higher air pollution), this could lead to an incorrectly higher predicted prevalence in deprived areas. Second, app users are a self-selected group, not representative of the general population. Our approach to adjust for age and sex differences at MSOA level is unlikely to sufficiently overcome selection and collider bias. 9 Third, our predicted COVID-19 prevalence is not from confirmed tests via RT-PCR, but rather based on self-reported symptoms. Additionally, we assume that people who have symptoms or have been exposed to COVID-19 are equally likely to use the app as those who do not. We performed a sensitivity analysis by rerunning the pooled analysis on individuals who were self-reportedly healthy at sign up and found the observed associations remained (online supplemental table S3), suggesting selection bias associated with being unhealthy at sign up is not influencing the observed associations of COVID-19 and deprivation. We also assume that people report symptoms in the same way and that their drop-out patterns do not differ by space, time and symptom reports. Finally, we aggregated data at MSOA level, which could lead to ecological bias. We also cannot conclude that deprivation increased COVID-19 prevalence, as there could be unmeasured confounders or other factors. Future work should check our assumptions and seek to integrate these data with data on area-level morbidity, extended pollution data, ethnicity and disease severity. Indeed, higher mortality has been observed among minority ethnic groups, 10 and disentangling the environmental and biological factors contributing to greater disease burden in both deprived areas and among ethnic minorities is an essential focus of future work to ensure resources and intervention are better assigned.

Figure 1
Geographical distribution of predicted COVID-19 prevalence across four time points. Prevalence is presented as proportional to the responders per local authority district (LAD). Analyses are adjusted for multiple testing using Benjamini-Hochberg false discovery rate correction (p<0.05). Inset highlights London where LAD areas are smaller. Hot and cold spots are defined relative to their neighbours and the mean GB predicted prevalence. Red/blue coloured perimeter lines around each LAD denote hotspot/coldspot.

In this study, we included 1,960,242 unique users as outlined in the flow diagram below (Figure A). Because we were primarily interested in understanding the geography of COVID-19 distribution, and how aspects of an area, in particular area-level deprivation, associated with COVID-19 prevalence, we aggregated user data at different GB geographic areas. This was particularly of use as the geo-social variables considered (please see below) are also defined geographically and are time invariant (as they are not defined by the app users themselves but by GB geographic area).
The maps (Figure 1, S2) were created using a shapefile of Local Authority Districts (LADs) from the Office for National Statistics (ONS) using the geopandas package in Python. Overlaid on the map are statistically significant 'hot-spots' and 'cold-spots' at LAD level.
To assess the significance of these regions, we used Local Moran's I test, as introduced below. In order to do this, spatial weights were […]

Supplemental material: BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s).

Hotspot and Coldspot definition
Predicted prevalence hotspots at LAD level were defined using Local Moran's I. The Moran's I statistic gives a value indicating the spatial clustering of a variable relative to its neighbours. Where there is a significant (false discovery rate (FDR)-adjusted p<0.05), high positive local Moran's I in a high-value neighbourhood (i.e. where the significant area also had a predicted prevalence greater than the mean predicted prevalence and greater than the mean of the lagged variable, which effectively represents how similar COVID-19 prevalence is to the areas that surround it), the area can be considered a 'hotspot'. 3 This method ensures we do not consider areas as hotspots where they may have higher predicted prevalence than the surrounding areas but are lower than average for the UK, although it might miss areas that are surrounded on all borders by other areas which would be considered hotspots. A coldspot is assessed similarly using Local Moran's I, but where the area's predicted prevalence is less than the mean predicted prevalence and the mean of the lagged variable.
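The definition above can be sketched in two steps: a Benjamini-Hochberg adjustment of the local p-values, followed by the hotspot/coldspot rule. This is a minimal illustration, not the study's code. It reads the rule literally (the area's own prevalence is compared with both the mean prevalence and the mean of the lagged variable), and all input numbers are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

def classify(prevalence, lagged, local_i, p_adjusted, alpha=0.05):
    """Label each area 'hotspot', 'coldspot' or 'neither'."""
    mean_prev = sum(prevalence) / len(prevalence)
    mean_lag = sum(lagged) / len(lagged)
    labels = []
    for x, i_val, p in zip(prevalence, local_i, p_adjusted):
        clustered = p < alpha and i_val > 0  # significant positive local I
        # Literal reading of the text: own prevalence against both means.
        if clustered and x > mean_prev and x > mean_lag:
            labels.append("hotspot")
        elif clustered and x < mean_prev and x < mean_lag:
            labels.append("coldspot")
        else:
            labels.append("neither")
    return labels

# Hypothetical inputs for five areas (not study data).
prevalence = [0.08, 0.07, 0.02, 0.01, 0.05]
lagged = [0.07, 0.05, 0.04, 0.02, 0.045]   # neighbour-weighted prevalence
local_i = [0.9, 0.1, 0.1, 0.9, 0.0]
raw_p = [0.001, 0.20, 0.30, 0.004, 0.90]

adjusted = benjamini_hochberg(raw_p)
labels = classify(prevalence, lagged, local_i, adjusted)
```

Under this rule an area with a significant negative local Moran's I (a spatial outlier) is labelled 'neither', since only positive clustering qualifies.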

Index of Multiple Deprivation (IMD)
The IMD was downloaded from the relevant government websites as below, and the most recent […] Therefore, we used within-country defined deciles. As the IMD is calculated for smaller area geographies than MSOA, we calculated the average IMD per MSOA. This was then categorised into quintiles where 1 is the least deprived and 5 is the most deprived.
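The averaging and quintile steps can be sketched as below. This is an illustration under assumptions: the small-area codes, the lookup to MSOA and the scores are hypothetical, and the rank-based cut into fifths is one simple way to form quintiles that may differ from the authors' method. Quintile 1 here is simply the lowest fifth of averages; which end counts as most deprived depends on the direction of the input scores.

```python
from collections import defaultdict
from statistics import mean

def average_by_msoa(small_area_scores, lookup):
    """Average small-area IMD values within each MSOA."""
    grouped = defaultdict(list)
    for small_area, score in small_area_scores.items():
        grouped[lookup[small_area]].append(score)
    return {msoa: mean(scores) for msoa, scores in grouped.items()}

def quintiles(msoa_scores):
    """Rank-based quintiles: 1 = lowest fifth of averages, 5 = highest fifth.

    Ties are broken by input order, which is a simplification.
    """
    ranked = sorted(msoa_scores, key=msoa_scores.get)
    n = len(ranked)
    return {msoa: (position * 5) // n + 1
            for position, msoa in enumerate(ranked)}

# Hypothetical within-country deciles for ten small areas in five MSOAs.
small_area_scores = {"s1": 8, "s2": 8, "s3": 1, "s4": 3, "s5": 10,
                     "s6": 9, "s7": 5, "s8": 7, "s9": 2, "s10": 6}
lookup = {"s1": "M1", "s2": "M1", "s3": "M2", "s4": "M2", "s5": "M3",
          "s6": "M3", "s7": "M4", "s8": "M4", "s9": "M5", "s10": "M5"}

averages = average_by_msoa(small_area_scores, lookup)
imd_quintile = quintiles(averages)
```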

Rural-urban gradient (RUC)
The RUC was downloaded from the relevant government websites as below: […]
Data sources for occupancy data were downloaded from the following sources:
• […]on-households-and-dwellings
• Scotland: https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/households/household-estimates/small-area-statistics-on-households-and-dwellings

MSOA-level mixed-effects models
We employed multivariable mixed-effects models to understand the relationship of predicted COVID-19 prevalence at MSOA level with deprivation. As a reminder, these models were run at MSOA level rather than individual level. The models included the following variables:
The Index of Multiple Deprivation, our primary explanatory variable (IMD, categorised into quintiles generated on the average IMD within each MSOA, where 1 is most deprived and 5 is least, and considered as a continuous variable).
We additionally adjusted for the following variables derived from app response data, considered as percentage of responders within the MSOA: those who reported having kidney, heart or lung disease, and who are diabetic, a smoker or obese (calculated as BMI≥30). We derived mean-adjusted age and sex variables to partially adjust for response bias (i.e. the extent to which responders in an MSOA represented the demographic of that MSOA). This was calculated as the difference between the expected mean/ratio of age/sex in the MSOA (derived from ONS population data) and the observed mean/ratio of age/sex amongst respondents.
We included a spatial lagged variable of the COVID-19 prevalence outcome. Inclusion of the lagged variable is one method that accounts for spatial autocorrelation (SAC). 4 It attempts to adjust for spatial autocorrelation by capturing the variance explained by the influence of neighbouring regions on the value of interest, in this case COVID-19 severity/prevalence. The lagged variable is calculated at MSOA level by applying a spatial weights matrix (calculated in this instance under queen's