CSVS Research
  • Home
  • Antibiogram
  • Chronic Diseases & Environment
  • COVID-19 Archive
  • OB Outcomes
  • ↗ CSVS Main Site

On this page

  • 1 Introduction
  • 2 Overview
    • 2.1 Exploring the Data
      • 2.1.1 Distributions and frequencies
        • 2.1.1.1 Other histograms
        • 2.1.1.2 Gravidarity-Parity distribution by Zipcode
        • 2.1.1.3 Age categories distribution by zip
        • 2.1.1.4 Gravidarity-Parity distribution by Age
        • 2.1.1.5 GAP: Pregnancy - Delivery gap
      • 2.1.2 Time Series
      • 2.1.3 Data
        • 2.1.3.1 DATA FIELDS
    • 2.2 Descriptive statistics
      • 2.2.1 Numeric Variables
      • 2.2.2 Cluster Analysis
    • 2.3 High Risk OB
    • 2.4 Service Area by Zipcode
      • 2.4.1 Map
  • 3 Data Analysis
    • 3.0.0.0.1 Time series decomposition
    • 3.0.0.1 Checking for trend disparity
    • 3.0.1 Time Series Forecasts
    • 3.0.2 Zip Code effect
  • 4 Hospital: Labor & Delivery
    • 4.0.1 Variables
    • 4.0.2 Distribution analysis
      • 4.0.2.1 By zip
      • 4.0.2.2 By High Risk OB
      • 4.0.2.3 By Preterm:
      • 4.0.2.4 Labor type by Delivery method: an example
      • 4.0.2.5 Delivery method by High Risk OB: an example
    • 4.0.3 Survival Analysis (SA)
      • 4.0.3.0.1 Definitions
      • 4.0.3.1 COX Survival, K-M, Cumulative Hazard and Hazard Ratio
        • 4.0.3.1.0.1 Cox Surv, CumHaz, strata = ALL
        • 4.0.3.1.0.2 Cox Surv, CumHaz –stratified examples
        • 4.0.3.1.1 Cox with stratification
          • 4.0.3.1.1.1 Intrapartal Conditions consolidated
          • 4.0.3.1.1.2 Intrapartal events Consolidated
          • 4.0.3.1.1.3 Labor Type consolidated
          • 4.0.3.1.1.4 UDS Age grouping 25-44
          • 4.0.3.1.1.5 Continuous variables internally grouped
          • 4.0.3.1.1.6 By Zip Code
          • 4.0.3.1.1.7 High Risk OB and Preterm
          • 4.0.3.1.1.8 Summary of Significant Cox Proportional Hazard modelling
        • 4.0.3.1.2 COX HR
          • 4.0.3.1.2.1 All Variables
          • 4.0.3.1.2.2 Only significantly positive HR’s
    • 4.0.4 ODDS Ratio: GLM modeling
      • 4.0.4.0.1 Example: Intrapartal events and intrapartal conditions.
      • 4.0.4.0.2 Other outcomes
        • 4.0.4.0.2.1 (with Primary C/S, Non-spontaneous delivery, intrapartal conditions and events)
    • 4.1 Significant Coefficients
    • 4.2 Variable Importance
      • 4.2.0.1 Intrapartal Events and Intrapartal Conditions

CSVS OB DATA Analytics

Author

Oguchi Nkwocha, MD., MS.

1 Introduction

OB service at CSVS is delivered by dedicated clinicians and staff who provide tools- and procedure-validated comprehensive natal education through the CPSP program; prenatal care, hospital and delivery care, as well as post-natal care. OB deliveries are currently performed only at NMC which keeps a standard meticulous record of the delivery event and related issues. Each phase of OB care –prenatal, peripartal and postpartal– presents opportunities and challenges depending on the effects of relevant factors which impact care, especially the central event of OB care, Delivery.

The goal of this document is to explore the data and present analytical results for information, education and action where indicated.

2 Overview

Between 2021-12-30 and 2024-03-30, a period of 9 quarters or 27 months, 1720 CSVS prenatal patients delivered at NMC, which is the principal source of data for this analysis.This included 853 females, 845 males and 22 unknowns (49.6, 49.1, 1.3 % respectively). There were 14 multiple gestations –all twins – a 0.8% incidence. There was 1 Fetal Demise during the period.

Primary Cesarean section rate was 12.2%, constituting 51.4% of all c-sections. Repeat C-Section rate was 11%. Total C-section rate was 23.7%. For primip vertex presentations, the overall C-section rate is 4.4%. This accounts for 18.7% of all C-Sections, which is 36.4% of all Primary Cesaerian Sections.

The average monthly delivery for CSVS for the period was 64, with annualized rate of 764 deliveries. Each day, on the average, 2 deliveries occurred.

Average birth weight was 3300gm, with minimum 180gm and maximum of 5030gm. Maternal age range was between 13 and 44, with a median age of 27. (See Histogram below)

2.1 Exploring the Data

2.1.1 Distributions and frequencies

2.1.1.1 Other histograms

From the figures above, we learn that there

  • relatively fewer pregnancies under age 19 than over 35
  • primips account for 26%; second pregnancies 26%; so slightly over half of all our deliveries are to women with 1 or 2 deliveries, counting the current pregnancy as a delivery
  • most deliveries score APGAR-1 8 and over; and APGAR-5 9

Does the Zip code make any difference with respect to gravidarity or parity?

2.1.1.2 Gravidarity-Parity distribution by Zipcode

Distribution of Gravidarity by Zip code
1 2 3 4 5 6 7 8 9 10 11 12 p test
n 407 403 359 271 154 75 31 12 3 2 1 2
zip (%) 0.507
93901 16 ( 3.9) 16 ( 4.0) 12 ( 3.3) 11 ( 4.1) 10 ( 6.5) 3 ( 4.0) 2 ( 6.5) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93905 150 (36.9) 132 (32.8) 126 (35.1) 78 (28.8) 41 (26.6) 17 (22.7) 7 (22.6) 3 (25.0) 1 (33.3) 0 ( 0.0) 0 ( 0.0) 1 (50.0)
93906 87 (21.4) 80 (19.9) 87 (24.2) 59 (21.8) 28 (18.2) 13 (17.3) 8 (25.8) 1 ( 8.3) 0 ( 0.0) 1 (50.0) 0 ( 0.0) 1 (50.0)
93907 8 ( 2.0) 16 ( 4.0) 9 ( 2.5) 5 ( 1.8) 2 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93908 0 ( 0.0) 2 ( 0.5) 2 ( 0.6) 3 ( 1.1) 0 ( 0.0) 1 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93925 1 ( 0.2) 2 ( 0.5) 1 ( 0.3) 2 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93926 5 ( 1.2) 6 ( 1.5) 2 ( 0.6) 0 ( 0.0) 0 ( 0.0) 1 ( 1.3) 0 ( 0.0) 2 (16.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93927 64 (15.7) 67 (16.6) 54 (15.0) 57 (21.0) 33 (21.4) 22 (29.3) 10 (32.3) 3 (25.0) 0 ( 0.0) 1 (50.0) 1 (100.0) 0 ( 0.0)
93930 10 ( 2.5) 26 ( 6.5) 17 ( 4.7) 13 ( 4.8) 14 ( 9.1) 8 (10.7) 0 ( 0.0) 1 ( 8.3) 1 (33.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93933 2 ( 0.5) 1 ( 0.2) 2 ( 0.6) 0 ( 0.0) 2 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93955 4 ( 1.0) 0 ( 0.0) 1 ( 0.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93960 19 ( 4.7) 13 ( 3.2) 16 ( 4.5) 14 ( 5.2) 8 ( 5.2) 2 ( 2.7) 2 ( 6.5) 1 ( 8.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95012 34 ( 8.4) 34 ( 8.4) 23 ( 6.4) 25 ( 9.2) 11 ( 7.1) 7 ( 9.3) 2 ( 6.5) 1 ( 8.3) 1 (33.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95039 2 ( 0.5) 0 ( 0.0) 1 ( 0.3) 0 ( 0.0) 2 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95076 0 ( 0.0) 6 ( 1.5) 2 ( 0.6) 1 ( 0.4) 2 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
99999 5 ( 1.2) 2 ( 0.5) 4 ( 1.1) 3 ( 1.1) 1 ( 0.6) 1 ( 1.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
          Stratified by grav_fct
           1           2           3           4           5          
  n        407         403         359         271         154        
  zip (%)                                                             
     93901  16 ( 3.9)   16 ( 4.0)   12 ( 3.3)   11 ( 4.1)   10 ( 6.5) 
     93905 150 (36.9)  132 (32.8)  126 (35.1)   78 (28.8)   41 (26.6) 
     93906  87 (21.4)   80 (19.9)   87 (24.2)   59 (21.8)   28 (18.2) 
     93907   8 ( 2.0)   16 ( 4.0)    9 ( 2.5)    5 ( 1.8)    2 ( 1.3) 
     93908   0 ( 0.0)    2 ( 0.5)    2 ( 0.6)    3 ( 1.1)    0 ( 0.0) 
     93925   1 ( 0.2)    2 ( 0.5)    1 ( 0.3)    2 ( 0.7)    0 ( 0.0) 
     93926   5 ( 1.2)    6 ( 1.5)    2 ( 0.6)    0 ( 0.0)    0 ( 0.0) 
     93927  64 (15.7)   67 (16.6)   54 (15.0)   57 (21.0)   33 (21.4) 
     93930  10 ( 2.5)   26 ( 6.5)   17 ( 4.7)   13 ( 4.8)   14 ( 9.1) 
     93933   2 ( 0.5)    1 ( 0.2)    2 ( 0.6)    0 ( 0.0)    2 ( 1.3) 
     93955   4 ( 1.0)    0 ( 0.0)    1 ( 0.3)    0 ( 0.0)    0 ( 0.0) 
     93960  19 ( 4.7)   13 ( 3.2)   16 ( 4.5)   14 ( 5.2)    8 ( 5.2) 
     95012  34 ( 8.4)   34 ( 8.4)   23 ( 6.4)   25 ( 9.2)   11 ( 7.1) 
     95039   2 ( 0.5)    0 ( 0.0)    1 ( 0.3)    0 ( 0.0)    2 ( 1.3) 
     95076   0 ( 0.0)    6 ( 1.5)    2 ( 0.6)    1 ( 0.4)    2 ( 1.3) 
     99999   5 ( 1.2)    2 ( 0.5)    4 ( 1.1)    3 ( 1.1)    1 ( 0.6) 
          Stratified by grav_fct
           6          7          8          9         10        11        
  n        75         31         12         3         2         1         
  zip (%)                                                                 
     93901  3 ( 4.0)   2 ( 6.5)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93905 17 (22.7)   7 (22.6)   3 (25.0)  1 (33.3)  0 ( 0.0)  0 (  0.0) 
     93906 13 (17.3)   8 (25.8)   1 ( 8.3)  0 ( 0.0)  1 (50.0)  0 (  0.0) 
     93907  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93908  1 ( 1.3)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93925  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93926  1 ( 1.3)   0 ( 0.0)   2 (16.7)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93927 22 (29.3)  10 (32.3)   3 (25.0)  0 ( 0.0)  1 (50.0)  1 (100.0) 
     93930  8 (10.7)   0 ( 0.0)   1 ( 8.3)  1 (33.3)  0 ( 0.0)  0 (  0.0) 
     93933  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93955  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     93960  2 ( 2.7)   2 ( 6.5)   1 ( 8.3)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     95012  7 ( 9.3)   2 ( 6.5)   1 ( 8.3)  1 (33.3)  0 ( 0.0)  0 (  0.0) 
     95039  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     95076  0 ( 0.0)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
     99999  1 ( 1.3)   0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0) 
          Stratified by grav_fct
           12        p      test
  n        2                    
  zip (%)             0.507     
     93901 0 ( 0.0)             
     93905 1 (50.0)             
     93906 1 (50.0)             
     93907 0 ( 0.0)             
     93908 0 ( 0.0)             
     93925 0 ( 0.0)             
     93926 0 ( 0.0)             
     93927 0 ( 0.0)             
     93930 0 ( 0.0)             
     93933 0 ( 0.0)             
     93955 0 ( 0.0)             
     93960 0 ( 0.0)             
     95012 0 ( 0.0)             
     95039 0 ( 0.0)             
     95076 0 ( 0.0)             
     99999 0 ( 0.0)             

Distribution of Parity by Zip code
0 1 2 3 4 5 6 7 8 9 11 p test
n 445 447 380 270 110 45 17 2 2 1 1
zip (%) 0.043
93901 16 ( 3.6) 15 ( 3.4) 15 ( 3.9) 16 ( 5.9) 5 ( 4.5) 2 ( 4.4) 1 ( 5.9) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93905 177 (39.8) 139 (31.1) 122 (32.1) 76 (28.1) 27 (24.5) 12 (26.7) 2 (11.8) 0 ( 0.0) 1 (50.0) 0 ( 0.0) 0 ( 0.0)
93906 88 (19.8) 105 (23.5) 90 (23.7) 48 (17.8) 21 (19.1) 8 (17.8) 4 (23.5) 0 ( 0.0) 0 ( 0.0) 1 (100.0) 0 ( 0.0)
93907 11 ( 2.5) 15 ( 3.4) 7 ( 1.8) 4 ( 1.5) 3 ( 2.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93908 0 ( 0.0) 2 ( 0.4) 3 ( 0.8) 1 ( 0.4) 1 ( 0.9) 1 ( 2.2) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93925 1 ( 0.2) 2 ( 0.4) 1 ( 0.3) 2 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93926 5 ( 1.1) 4 ( 0.9) 4 ( 1.1) 1 ( 0.4) 1 ( 0.9) 0 ( 0.0) 0 ( 0.0) 1 (50.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93927 71 (16.0) 66 (14.8) 62 (16.3) 64 (23.7) 26 (23.6) 13 (28.9) 8 (47.1) 1 (50.0) 0 ( 0.0) 0 ( 0.0) 1 (100.0)
93930 17 ( 3.8) 24 ( 5.4) 26 ( 6.8) 12 ( 4.4) 9 ( 8.2) 1 ( 2.2) 0 ( 0.0) 0 ( 0.0) 1 (50.0) 0 ( 0.0) 0 ( 0.0)
93933 2 ( 0.4) 3 ( 0.7) 0 ( 0.0) 2 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93955 3 ( 0.7) 1 ( 0.2) 1 ( 0.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93960 20 ( 4.5) 17 ( 3.8) 18 ( 4.7) 12 ( 4.4) 6 ( 5.5) 2 ( 4.4) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95012 30 ( 6.7) 39 ( 8.7) 24 ( 6.3) 28 (10.4) 9 ( 8.2) 6 (13.3) 2 (11.8) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95039 1 ( 0.2) 1 ( 0.2) 1 ( 0.3) 0 ( 0.0) 2 ( 1.8) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
95076 0 ( 0.0) 8 ( 1.8) 1 ( 0.3) 2 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
99999 3 ( 0.7) 6 ( 1.3) 5 ( 1.3) 2 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
          Stratified by para_fct
           0           1           2           3           4          
  n        445         447         380         270         110        
  zip (%)                                                             
     93901  16 ( 3.6)   15 ( 3.4)   15 ( 3.9)   16 ( 5.9)    5 ( 4.5) 
     93905 177 (39.8)  139 (31.1)  122 (32.1)   76 (28.1)   27 (24.5) 
     93906  88 (19.8)  105 (23.5)   90 (23.7)   48 (17.8)   21 (19.1) 
     93907  11 ( 2.5)   15 ( 3.4)    7 ( 1.8)    4 ( 1.5)    3 ( 2.7) 
     93908   0 ( 0.0)    2 ( 0.4)    3 ( 0.8)    1 ( 0.4)    1 ( 0.9) 
     93925   1 ( 0.2)    2 ( 0.4)    1 ( 0.3)    2 ( 0.7)    0 ( 0.0) 
     93926   5 ( 1.1)    4 ( 0.9)    4 ( 1.1)    1 ( 0.4)    1 ( 0.9) 
     93927  71 (16.0)   66 (14.8)   62 (16.3)   64 (23.7)   26 (23.6) 
     93930  17 ( 3.8)   24 ( 5.4)   26 ( 6.8)   12 ( 4.4)    9 ( 8.2) 
     93933   2 ( 0.4)    3 ( 0.7)    0 ( 0.0)    2 ( 0.7)    0 ( 0.0) 
     93955   3 ( 0.7)    1 ( 0.2)    1 ( 0.3)    0 ( 0.0)    0 ( 0.0) 
     93960  20 ( 4.5)   17 ( 3.8)   18 ( 4.7)   12 ( 4.4)    6 ( 5.5) 
     95012  30 ( 6.7)   39 ( 8.7)   24 ( 6.3)   28 (10.4)    9 ( 8.2) 
     95039   1 ( 0.2)    1 ( 0.2)    1 ( 0.3)    0 ( 0.0)    2 ( 1.8) 
     95076   0 ( 0.0)    8 ( 1.8)    1 ( 0.3)    2 ( 0.7)    0 ( 0.0) 
     99999   3 ( 0.7)    6 ( 1.3)    5 ( 1.3)    2 ( 0.7)    0 ( 0.0) 
          Stratified by para_fct
           5          6          7         8         9          11        
  n        45         17         2         2         1          1         
  zip (%)                                                                 
     93901  2 ( 4.4)   1 ( 5.9)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93905 12 (26.7)   2 (11.8)  0 ( 0.0)  1 (50.0)  0 (  0.0)  0 (  0.0) 
     93906  8 (17.8)   4 (23.5)  0 ( 0.0)  0 ( 0.0)  1 (100.0)  0 (  0.0) 
     93907  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93908  1 ( 2.2)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93925  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93926  0 ( 0.0)   0 ( 0.0)  1 (50.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93927 13 (28.9)   8 (47.1)  1 (50.0)  0 ( 0.0)  0 (  0.0)  1 (100.0) 
     93930  1 ( 2.2)   0 ( 0.0)  0 ( 0.0)  1 (50.0)  0 (  0.0)  0 (  0.0) 
     93933  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93955  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     93960  2 ( 4.4)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     95012  6 (13.3)   2 (11.8)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     95039  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     95076  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
     99999  0 ( 0.0)   0 ( 0.0)  0 ( 0.0)  0 ( 0.0)  0 (  0.0)  0 (  0.0) 
          Stratified by para_fct
           p      test
  n                   
  zip (%)   0.043     
     93901            
     93905            
     93906            
     93907            
     93908            
     93925            
     93926            
     93927            
     93930            
     93933            
     93955            
     93960            
     95012            
     95039            
     95076            
     99999            

Yes, there is a difference in the distribution, but only for Parity (number of deliveries) – for both 93905 (higher in lower parity levels) and 93927 (higher in higher parity levels), with a significant p-value of 0.043. There was no difference with respect to Gravidarity (number of pregnancies)

Is there a difference in the distribution of different age-groups among and within zip codes? As before, first, check the p-value. It is 0.105, so the answer is, No.

2.1.1.3 Age categories distribution by zip

Distribution of Age-groups by Zip Code
<21 21-29 29-35 >35 p test
n 345 763 396 216
zip (%) 0.105
93901 15 ( 4.3) 31 ( 4.1) 14 ( 3.5) 10 ( 4.6)
93905 113 (32.8) 231 (30.3) 125 (31.6) 87 (40.3)
93906 78 (22.6) 169 (22.1) 82 (20.7) 36 (16.7)
93907 9 ( 2.6) 26 ( 3.4) 4 ( 1.0) 1 ( 0.5)
93908 3 ( 0.9) 2 ( 0.3) 3 ( 0.8) 0 ( 0.0)
93925 3 ( 0.9) 2 ( 0.3) 0 ( 0.0) 1 ( 0.5)
93926 1 ( 0.3) 10 ( 1.3) 1 ( 0.3) 4 ( 1.9)
93927 65 (18.8) 139 (18.2) 75 (18.9) 33 (15.3)
93930 9 ( 2.6) 39 ( 5.1) 27 ( 6.8) 15 ( 6.9)
93933 2 ( 0.6) 1 ( 0.1) 3 ( 0.8) 1 ( 0.5)
93955 1 ( 0.3) 4 ( 0.5) 0 ( 0.0) 0 ( 0.0)
93960 13 ( 3.8) 36 ( 4.7) 17 ( 4.3) 9 ( 4.2)
95012 28 ( 8.1) 58 ( 7.6) 35 ( 8.8) 17 ( 7.9)
95039 1 ( 0.3) 1 ( 0.1) 3 ( 0.8) 0 ( 0.0)
95076 1 ( 0.3) 5 ( 0.7) 3 ( 0.8) 2 ( 0.9)
99999 3 ( 0.9) 9 ( 1.2) 4 ( 1.0) 0 ( 0.0)

2.1.1.4 Gravidarity-Parity distribution by Age

Distribution of Age-groups by Parity
<21 21-29 29-35 >35 p test
n 345 763 396 216
para_fct (%) <0.001
0 212 (61.4) 180 (23.6) 39 ( 9.8) 14 ( 6.5)
1 110 (31.9) 256 (33.6) 58 (14.6) 23 (10.6)
2 17 ( 4.9) 188 (24.6) 118 (29.8) 57 (26.4)
3 5 ( 1.4) 110 (14.4) 98 (24.7) 57 (26.4)
4 1 ( 0.3) 20 ( 2.6) 55 (13.9) 34 (15.7)
5 0 ( 0.0) 6 ( 0.8) 21 ( 5.3) 18 ( 8.3)
6 0 ( 0.0) 3 ( 0.4) 7 ( 1.8) 7 ( 3.2)
7 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 2 ( 0.9)
8 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 2 ( 0.9)
9 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 0.5)
11 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 0.5)
Distribution of Age-groups by Gravidarity
<21 21-29 29-35 >35 p test
n 345 763 396 216
grav_fct (%) <0.001
1 209 (60.6) 163 (21.4) 25 ( 6.3) 10 ( 4.6)
2 102 (29.6) 229 (30.0) 54 (13.6) 18 ( 8.3)
3 29 ( 8.4) 190 (24.9) 95 (24.0) 45 (20.8)
4 4 ( 1.2) 119 (15.6) 100 (25.3) 48 (22.2)
5 1 ( 0.3) 40 ( 5.2) 72 (18.2) 41 (19.0)
6 0 ( 0.0) 14 ( 1.8) 33 ( 8.3) 28 (13.0)
7 0 ( 0.0) 5 ( 0.7) 15 ( 3.8) 11 ( 5.1)
8 0 ( 0.0) 2 ( 0.3) 2 ( 0.5) 8 ( 3.7)
9 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 3 ( 1.4)
10 0 ( 0.0) 1 ( 0.1) 0 ( 0.0) 1 ( 0.5)
11 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 0.5)
12 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 2 ( 0.9)

Yes, there is a significant difference (p = <0.001 for both) in the distribution of Gravidarity and of Parity within and among zip codes.

The distribution patterns suggest a trend where younger women (<21) predominantly have their first child, and as age increases, women are more likely to have higher parity deliveries. The 29-35 and >35 age groups, in particular, show a significant percentage of deliveries at higher parity levels, indicating that some women are indeed delaying childbirth to later ages.

We turn our attention next to non-completed or undelivered pregnancies or pregnancies without delivery, the count-difference between gravidarity and parity.This may reflect the state of family planning.

2.1.1.5 GAP: Pregnancy - Delivery gap

Incidence of Pregnancies without Delivery by Age Group
age_group total_pregnancies total_deliveries total_preg_without_delivery incidence_rate annual_incidence_rate
<21 521 373 148 28.40691 13
21-29 2012 1264 748 37.17694 17
29-35 1497 991 506 33.80094 15
>35 959 639 320 33.36809 15
p_value = 0.0012 

Pregnancies without Delivery, by ZIP code
zip <21 21-29 29-35 >35
93901 5.4 3.5 4.2 3.4
93905 29.1 29.9 29.4 35.9
93906 21.6 23.7 20.8 19.7
93907 3.4 3.2 0.6 0.3
93908 3.4 0.0 0.6 0.0
93925 1.4 0.3 0.0 0.3
93926 0.0 1.1 0.2 1.6
93927 20.9 18.3 18.8 16.2
93930 2.7 6.1 9.9 6.9
93933 0.7 0.3 1.2 0.0
93955 0.0 0.4 0.0 0.0
93960 2.7 4.9 4.2 5.6
95012 7.4 6.1 7.1 8.8
95039 0.0 0.0 0.6 0.0
95076 0.7 0.8 1.0 1.2
99999 0.7 1.3 1.6 0.0
p_value = 0.9999953 

For our patient population, annually, an average of 15% of pregnancies were not delivered. There is a statistically significant association between age group and pregnancy outcome (p = 0.0012). This suggests that the likelihood of pregnancies resulting in delivery or not, differs significantly across the different age groups. When considering the annual incidence rate:

  • Age group 21-29 has the highest annual incidence rate of pregnancies without delivery (17 per year).

  • Age group <21 has the lowest annual incidence rate (13 per year).

  • Age groups 29-35 and >35 have similar annual incidence rates (15 per year).

These findings suggest that women in the 21-29 age group experience the highest rate of pregnancies without delivery, while the youngest group (<21) has the lowest rate. The older age groups (29-35 and >35) have intermediate rates. This could imply that younger women (<21) and older women (>35) have different risk factors or behaviors affecting pregnancy outcomes compared to women in their late 20s.

Of note: there is no significant variation by zip code.

Overall, the significant association and the annual incidence rates highlight important age-related differences in pregnancy outcomes, which could be crucial for targeted interventions and healthcare planning.

Numbers were derived by adding the gravidarity and subtracting parity, while crediting one imminent delivery to P0; and accounting for multiple gestation.

For only the first-ever delivery alone (Para 0), here is the outcome:

Incidence of Pregnancies without Delivery for primips by Age Group
age_group total_pregnancies total_deliveries total_preg_without_delivery incidence_rate
<21 237 212 25 11
21-29 236 180 56 24
29-35 67 39 28 42
>35 23 14 9 39

Pregnancies without Delivery at time of First Delivery, by ZIP code
zip <21 21-29 29-35 >35
93901 0 1.8 0.0 0.0
93905 48 32.1 35.7 11.1
93906 4 21.4 21.4 0.0
93907 4 7.1 0.0 0.0
93925 0 0.0 0.0 0.0
93926 0 0.0 0.0 0.0
93927 32 16.1 14.3 77.8
93930 4 3.6 25.0 11.1
93933 0 0.0 0.0 0.0
93955 0 0.0 0.0 0.0
93960 8 16.1 0.0 0.0
95012 0 1.8 3.6 0.0
95039 0 0.0 0.0 0.0
99999 0 0.0 0.0 0.0

p_value = 1.900904e-08 

Focusing on first deliveries only, for every 4 delivered pregnancies, one pregnancy was not delivered, given an overall not-delivered incidence rate of 21%. At the time of the first deliveries only, there is a significant variance by zip code:

  • 93905 has highest delivery gaps in age-group <21, least in >35; even in the rest of the groups
  • 93906 has least gap in <21 and >35; rest are even
  • 93927 has the highest gap in >35 followed by <21; rest are even
  • 93930 has the least gap in 21-29 and <21, highest gap in 29-35, followed by >35

Thus, regarding pregnancies without delivery disparity at the time of the first delivery, the data may show a pattern (See plot above) of:

  • high disparity at younger ages and low disparity for older ages in 93905;

  • high disparity for the middle age groups and low disparity for the younger and the older ages in 93906;

  • high disparity for the younger and the older age groups and low disparity for the middle age groups for 93927;

  • high disparity for the older age groups and low disparity for the younger age groups for 93930

Our records were collected on a continuum of time, so let’s take a look. The time series plot is based on weekly deliveries: One can guesstimate the average weekly delivery and well as get a hint as to the trending of weekly deliveries.

2.1.2 Time Series

2.1.3 Data

Delivery Data for our patients at NMC is kept in NMC OB Delivered Log repository; it captures the following types and categories of information relevant to our analysis:

2.1.3.1 DATA FIELDS

Data Field Names
zip age grav
para adm_date labor_type
membrane_rupture intrapartal_conditions intrapartal_events
delivery_method baby_sq delivery_date
gender weight apg1
apg5 gest_age presentation
gest_age_days NA NA

To facilitate analyses, we initially added these derived columns below:

Derived Fields
Derived_column Explanation
gest_age_days gestational age in DAYS
adm_to_del_tm Admission-to-Delivery time
diff_grav_para grav minus para
event Delivery
time Duration: Delivery time minus Admit time
adm_to_del_tm_cat Duration ranges
uds_age Age ranges used by UDS
hghRsk High Risk OB by definition
conditions_cnsldt Consolidated intrapartum conditions
lbr_type_cnsldt Consolidated labor types
presentation_cnsldt Consolidated presentation
del_method_cnsldt Consolidated Delivery method
cluster PreTerm / Non-Preterm, informed by Clustering

2.2 Descriptive statistics

Let’s summarize the CSVS data from NMC Delivery Log, starting with continuous or numerical variables:

2.2.1 Numeric Variables

Overall
n 1720
age (mean (SD)) 27.42 (6.37)
grav (mean (SD)) 2.90 (1.67)
para (mean (SD)) 1.65 (1.46)
baby_sq (mean (SD)) 1.01 (0.09)
weight (mean (SD)) 3267.80 (576.12)
apg1 (mean (SD)) 7.93 (1.23)
apg5 (mean (SD)) 8.80 (0.85)
gest_age_days (mean (SD)) 273.51 (15.83)
adm_to_del_tm (mean (SD)) 15.68 (17.54)
diff_grav_para (mean (SD)) 1.33 (0.70)
time (mean (SD)) 15.68 (17.54)

Next, we study the delivery frequency among different variables

Exploratory Data analyses yields rich information. However, we need to proceed with analytics so we can query the data in order to get desired information. We need to know how variables of interest affect one another, and how they compare with one another in their contribution to the results or outcome of interest to us. The outcome of interest to us in OB can be found in prenatal care (before delivery), during delivery; the event of delivery itself and right after delivery.

A pre-delivery outcome, for example, is pre-term (prematurity) status; another is High Risk OB status. During delivery, we are concerned about outcomes categorized by “intrapartal conditions” such as preeclampsia and also “intrapartal events” which encompass abnormal labor dynamics. We want to know: Did delivery occur at all? How long was the time from admission to delivery? What are the chances that at a given time delivery will have occurred, and what percent of patients will have delivered by a certain time? For each of these outcomes or combinations thereof, our interest is how and if variables like maternal age, zip code, parity, high risk status, affected the end-result.

We all are curious to know if and how residential geography affects OB outcomes. We will use zipcode information (the variable, “zip”) and a standard MAP to explore this curiosity. High Risk OB is a technical definition; we can explore what contributes to it, as well as determine how it affects other outcomes as mentioned above. A fundamental method in statistical data analytics is to “throw in” all the data, together, “stir it up” and see if the data will settle out in a natural separation into groups–called clustering–that achieves statistical significance. If we can find significant clusters, we can also use them, along with zip and high risk status as probes with which to analyze outcomes.

We applied cluster statistics on the data using fetal weight against gestational age: here is our finding:

2.2.2 Cluster Analysis

Right now, we are going to look into the important OB metrics of Gestational Age and Birth weight of our patients using Cluster Analysis1, a useful method in medical research for identifying patterns among patients that help to characterize clinical states and associated risks and treatment outcomes without predefined categories.

Using k-means clustering algorithm with a choice of 3 clusters (arbitrary), we obtained these stats and plots:

Can we identify this cluster grouping using known and relevant OB metrics for birth weight and gestational age categorization? Yes:

Cluster Stats
cluster mean.gest_age_days mean.weight count percent
Term, full 274 3431 692 40.2
Term, early 277 3267 957 55.6
Pre-term 222 1691 71 4.1

The plot shows the separation of a left lower quad (LLQ) cluster from an un-separated 2 other clusters. Further analysis shows that the LLQ cluster is statistically significantly different from either of the 2 other clusters, while this is not the case between these 2 clusters themselves. By adding the 3000 GRAM birth weight dashed horizontal line and the 37 week dashed vertical line, representing the demarcations for SGA and Preterm respectively we find a correspondence between SGA and preterm delivery, a widely known outcome. The clusters were named after this finding, as: Pre-term; Term, early; and Term, full (latter two are from 37 weeks and over). CAVEAT: Actual preterm deliveries as defined by gestational age (alone) under 37 weeks account for only 8% of the deliveries, compared to the cluster group here defined by both SGA AND Preterm (4.1%); both are relatively small numbers compared to the total number of deliveries.

2.3 High Risk OB

Let’s officially define “High Risk OB”, based on maternal age greater than 35, diagnosis of HTN or Preeclampsia, DM and (for CSVS) multiple gestation and or non-vertex presentation. By these criteria, 21% are classified as High Risk.

2.4 Service Area by Zipcode

Moving on to zip codes…what is CSVS OB service area? Let’s now turn our attention to where our patients come from using a zipcode information (–99999 represents no zip code entered);

Deliveries by Zip Code
Overall
n 1720
zip (%)
93901 70 ( 4.1)
93905 556 (32.3)
93906 365 (21.2)
93907 40 ( 2.3)
93908 8 ( 0.5)
93925 6 ( 0.3)
93926 16 ( 0.9)
93927 312 (18.1)
93930 90 ( 5.2)
93933 7 ( 0.4)
93955 5 ( 0.3)
93960 75 ( 4.4)
95012 138 ( 8.0)
95039 5 ( 0.3)
95076 11 ( 0.6)
99999 16 ( 0.9)

The top 5 are 93905 (one-third of all deliveries), 93906 (one-fifth), 93927 (one-sixth); 95012 (one-twelfth), 93930 (one-twentieth). One thing stands out: considering its size and straddling location, 99308 is a sparse draw for CSVS deliveries.

Let’s place this information within the context of a map of the County:

2.4.1 Map

This map shows that zipcode 93908 sitting smack within “CSVS territory” has extremely low deliveries. Census Information2 reveals that this discrepancy is demographical (socioeconomic) and related to CSVS usual drawing pool. This zipcode, 93908 (“Salinas - Corral De Tierra”), compared with say, 93905:

  • has primarily white residents Vs “primarily other race” residents

  • has Median Household Income $108,093 Vs $41,607

  • has Median Home Value $616,000, Vs $203,4003

3 Data Analysis

Having identified certain parameters of interest (Zip code distribution, Prematurity and High Risk), let’s proceed to see what our data is telling us. We will start by taking another look at the weekly deliveries, to check for patterns (seasonality) and overall direction of the total number of deliveries (trend); and then, predict number of deliveries as forecast for the next quarter.

Is there seasonality to the deliveries? Do we have a trend?

3.0.0.0.1 Time series decomposition

The top graph shows the same weekly deliveries time series seen earlier. The second and third panels tease out the series to extract seasonality information and trend, respectively. The last panel superimposes the seasonality and trend graphs on the original time series. We can see a trend that “trough-ed” around end of the first quarter of 2023 and is rising since then. The seasonality graph nearly matches the original data, indicating that there is a significant seasonal variation in weekly deliveries among our patients.

What might account for the trough? Below we compare the original (full data) trend with the trend of each of the components of various variables to look for patterns.

3.0.0.1 Checking for trend disparity

Trends Contribution Effect
variable_level mann_kendall_p interpretation
zip: 93905 0.6689561 No significant trend, does not contribute uniquely
zip: 93906 0.4620307 No significant trend, does not contribute uniquely
zip: 93927 0.1123541 No significant trend, does not contribute uniquely
zip: 95012 0.9522035 No significant trend, does not contribute uniquely
zip: 93930 0.0309631 Significant trend, contributes uniquely
hghRsk: no 0.0242098 Significant trend, contributes uniquely
hghRsk: yes 0.9042370 No significant trend, does not contribute uniquely
cluster: 0 0.2896511 No significant trend, does not contribute uniquely
cluster: 1 0.1969658 No significant trend, does not contribute uniquely
age_group: <21 0.3806900 No significant trend, does not contribute uniquely
age_group: 21-29 0.2355140 No significant trend, does not contribute uniquely
age_group: 29-35 0.0203920 Significant trend, contributes uniquely
age_group: >35 0.5482932 No significant trend, does not contribute uniquely
uds_age: 15-19 0.5842622 No significant trend, does not contribute uniquely
uds_age: 20-24 0.5104073 No significant trend, does not contribute uniquely
uds_age: 25-44 0.3273173 No significant trend, does not contribute uniquely
uds_age: lt 15 NaN NA

If the trend of the original data (“Overall Trend) parallels the variable’s trend visually, the latter must be contributing significantly to the shape and direction of the former. However, we must rely on p-values for statistical significance. Thus, the following contribute significantly to the trend that we observed:

  • zip: 93930
  • hghRsk: no
  • age_group: 29-35

but, not the rest of the zipcodes, cluster, age groups (including UDS age group)

An important and practical goal of data analysis is to make future forecasts. Based on the weekly delivery numbers for 117 weeks or nine quarters, we can predict the following quarter’s performance: check the plot, and then the table for comparison between actual and predicted data. (Use the cursor cross-mark to expand any segment by pressing the button and dragging the mouse)

3.0.1 Time Series Forecasts

Next Quarter’s Weekly Deliveries Forecast
Date Forecast
2024-04-07 12
2024-04-14 14
2024-04-21 16
2024-04-28 16
2024-05-05 14
2024-05-12 12
2024-05-19 13
2024-05-26 15
2024-06-02 17
2024-06-09 17
2024-06-16 15
2024-06-23 14
2024-06-30 14

We are going to continue the analysis of our data using zip-code distribution, Clustering and High Risk OB before moving to generalized analyses.

How does zip code affect the distribution of outcomes like Preterm (derived from cluster analysis)and High Risk OB? We will throw in another interesting outcome, Delivery Method (vaginal versus Caeserian section)?

3.0.2 Zip Code effect

Distribution of Preterm by Zip code
0 1 p test
n 1559 161
zip (%) 0.508
93901 59 ( 3.8) 11 ( 6.8)
93905 511 (32.8) 45 (28.0)
93906 325 (20.8) 40 (24.8)
93907 38 ( 2.4) 2 ( 1.2)
93908 8 ( 0.5) 0 ( 0.0)
93925 6 ( 0.4) 0 ( 0.0)
93926 13 ( 0.8) 3 ( 1.9)
93927 281 (18.0) 31 (19.3)
93930 81 ( 5.2) 9 ( 5.6)
93933 7 ( 0.4) 0 ( 0.0)
93955 5 ( 0.3) 0 ( 0.0)
93960 72 ( 4.6) 3 ( 1.9)
95012 124 ( 8.0) 14 ( 8.7)
95039 5 ( 0.3) 0 ( 0.0)
95076 10 ( 0.6) 1 ( 0.6)
99999 14 ( 0.9) 2 ( 1.2)
Distribution of High Risk OB by Zip code
no yes p test
n 1357 363
zip (%) 0.792
93901 56 ( 4.1) 14 ( 3.9)
93905 426 (31.4) 130 (35.8)
93906 289 (21.3) 76 (20.9)
93907 37 ( 2.7) 3 ( 0.8)
93908 7 ( 0.5) 1 ( 0.3)
93925 4 ( 0.3) 2 ( 0.6)
93926 12 ( 0.9) 4 ( 1.1)
93927 248 (18.3) 64 (17.6)
93930 70 ( 5.2) 20 ( 5.5)
93933 6 ( 0.4) 1 ( 0.3)
93955 4 ( 0.3) 1 ( 0.3)
93960 63 ( 4.6) 12 ( 3.3)
95012 109 ( 8.0) 29 ( 8.0)
95039 5 ( 0.4) 0 ( 0.0)
95076 8 ( 0.6) 3 ( 0.8)
99999 13 ( 1.0) 3 ( 0.8)
Distribution of Delivery Methods by Zip code
C/S, Primary C/S, Repeat Vaginal p test
n 209 198 1313
zip (%) 0.976
93901 8 ( 3.8) 7 ( 3.5) 55 ( 4.2)
93905 69 (33.0) 65 (32.8) 422 (32.1)
93906 51 (24.4) 40 (20.2) 274 (20.9)
93907 2 ( 1.0) 3 ( 1.5) 35 ( 2.7)
93908 1 ( 0.5) 1 ( 0.5) 6 ( 0.5)
93925 0 ( 0.0) 1 ( 0.5) 5 ( 0.4)
93926 2 ( 1.0) 3 ( 1.5) 11 ( 0.8)
93927 35 (16.7) 33 (16.7) 244 (18.6)
93930 12 ( 5.7) 12 ( 6.1) 66 ( 5.0)
93933 0 ( 0.0) 0 ( 0.0) 7 ( 0.5)
93955 1 ( 0.5) 0 ( 0.0) 4 ( 0.3)
93960 11 ( 5.3) 10 ( 5.1) 54 ( 4.1)
95012 11 ( 5.3) 17 ( 8.6) 110 ( 8.4)
95039 1 ( 0.5) 1 ( 0.5) 3 ( 0.2)
95076 2 ( 1.0) 2 ( 1.0) 7 ( 0.5)
99999 3 ( 1.4) 3 ( 1.5) 10 ( 0.8)

First, always check the p-values for the various distribution tables.

The answer is: there is no significant difference in the relative distribution, among and within different zip codes, of Pre-term deliveries (p=0.508), High Risk OB (p=0.792), Delivery Method (p=0.976).

What about the High Risk OB and the outcomes of Preterm & Delivery Method?

Distribution of Preterm by High Risk OB
0 1 p test
n 1559 161
hghRsk = no/yes (%) 1260/299 (80.8/19.2) 97/64 (60.2/39.8) <0.001
Distribution of Delivery Method by High Risk OB
C/S, Primary C/S, Repeat Vaginal p test
n 209 198 1313
hghRsk = no/yes (%) 116/93 (55.5/44.5) 133/65 (67.2/32.8) 1108/205 (84.4/15.6) <0.001

With a p-value <0.001, a statistically significant difference is obvious in the distribution of Cesarean sections compared to Vaginal deliveries among High Risk OB. The same holds true for Preterm distribution: a significantly higher percent are in High Risk group than non-Preterm (39.8% Vs 19.2%, p-value <0.001).

Finally, same question for Preterm and outcomes of High Risk and Delivery Method:

Distribution of Preterm by High Risk OB
no yes p test
n 1357 363
cluster = 1 (%) 97 (7.1) 64 (17.6) <0.001
Distribution of Preterm by Delivery Method
C/S, Primary C/S, Repeat Vaginal p test
n 209 198 1313
cluster = 1 (%) 21 (10.0) 28 (14.1) 112 (8.5) 0.038
  • The distribution of High Risk OB is significantly higher compared with non_High Risk among Preterm deliveries (p=<0.001).

  • The distribution of C-sections is significantly higher compared with vaginal delivery among Preterm deliveries (p=0.039).

4 Hospital: Labor & Delivery

Let’s now shift our attention to hospital events related to OB care. First, we will extend the previous analyses to hospital events. Then, we will use a different method of analysis known as Survival Analysis (SA) or by Time-to-Event (TTE) Analysis to probe our data. The hospital event of interest is Delivery; along with related matters of labor type, delivery method, intrapartal events and intrapartal conditions. Here are the variables and their respective member-items:

4.0.1 Variables

Variable Items
lbr_type_cnsldt Augmented
Induced
Not Applicable
Spontaneous
Spontaneous, Augmented
Suppressed
del_method_cnsldt C/S, Primary
C/S, Repeat
Vaginal
conditions_cnsldt Abruption Placenta
Hemorrhage
infection
None
Placenta Previa
Preeclampsia
Preeclampsia_Infection
Prolonged ROM
intrapartal_events Ineffective Pushing
None
Precip Labor <3 hrs
Prolonged 2nd Stg >2.5 hr
Prolonged 2nd Stg >2.5 hr, Ineffective Pushing
Prolonged Labor >20 hrs
Prolonged Labor >20 hrs, Ineffective Pushing
Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr
Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr, Ineffective Pushing

We have consolidated some of the items within their respective original variables and named them with “-cnsldt” appendum.

Below are the frequency and the distribution of deliveries by these variables:

Distribution of Deliveries among Hospital Variables
Variable Item Value
n 1720
lbr_type_cnsldt....
Augmented 170 ( 9.9)
Induced 428 (24.9)
Not.Applicable 244 (14.2)
Spontaneous 843 (49.0)
Spontaneous..Augmented 28 ( 1.6)
Suppressed 7 ( 0.4)
presentation_cnsldt...Vertex.... 1636 (95.1)
intrapartal_events....
Ineffective.Pushing 16 ( 0.9)
None 1501 (87.3)
Precip.Labor..3.hrs 95 ( 5.5)
Prolonged.2nd.Stg..2.5.hr 37 ( 2.2)
Prolonged.2nd.Stg..2.5.hr..Ineffective.Pushing 5 ( 0.3)
Prolonged.Labor..20.hrs 54 ( 3.1)
Prolonged.Labor..20.hrs..Ineffective.Pushing 1 ( 0.1)
Prolonged.Labor..20.hrs..Prolonged.2nd.Stg..2.5.hr 8 ( 0.5)
Prolonged.Labor..20.hrs..Prolonged.2nd.Stg..2.5.hr..Ineffective.Pushing 3 ( 0.2)
conditions_cnsldt....
Abruption.Placenta 15 ( 0.9)
Hemorrhage 6 ( 0.3)
infection 259 (15.1)
None.1 1318 (76.6)
Placenta.Previa 8 ( 0.5)
Preeclampsia 69 ( 4.0)
Preeclampsia_Infection 11 ( 0.6)
Prolonged.ROM 34 ( 2.0)
del_method_cnsldt....
C.S..Primary 209 (12.2)
C.S..Repeat 198 (11.5)
Vaginal 1313 (76.3)

4.0.2 Distribution analysis

4.0.2.1 By zip

Like we did with previous situations, we ask if hospital events and hospital conditions at delivery show a biased distribution with regards to zip codes.

Distribution of Labor Types by Zip code
Augmented Induced Not Applicable Spontaneous Spontaneous, Augmented Suppressed p test
n 170 428 244 843 28 7
zip (%) 0.142
93901 8 ( 4.7) 14 ( 3.3) 11 ( 4.5) 34 ( 4.0) 2 ( 7.1) 1 (14.3)
93905 53 (31.2) 159 (37.1) 78 (32.0) 256 (30.4) 8 (28.6) 2 (28.6)
93906 38 (22.4) 86 (20.1) 53 (21.7) 180 (21.4) 7 (25.0) 1 (14.3)
93907 6 ( 3.5) 7 ( 1.6) 4 ( 1.6) 23 ( 2.7) 0 ( 0.0) 0 ( 0.0)
93908 0 ( 0.0) 1 ( 0.2) 1 ( 0.4) 6 ( 0.7) 0 ( 0.0) 0 ( 0.0)
93925 2 ( 1.2) 0 ( 0.0) 1 ( 0.4) 3 ( 0.4) 0 ( 0.0) 0 ( 0.0)
93926 2 ( 1.2) 6 ( 1.4) 2 ( 0.8) 5 ( 0.6) 0 ( 0.0) 1 (14.3)
93927 25 (14.7) 84 (19.6) 33 (13.5) 158 (18.7) 10 (35.7) 2 (28.6)
93930 9 ( 5.3) 14 ( 3.3) 19 ( 7.8) 48 ( 5.7) 0 ( 0.0) 0 ( 0.0)
93933 0 ( 0.0) 0 ( 0.0) 2 ( 0.8) 5 ( 0.6) 0 ( 0.0) 0 ( 0.0)
93955 0 ( 0.0) 4 ( 0.9) 0 ( 0.0) 1 ( 0.1) 0 ( 0.0) 0 ( 0.0)
93960 10 ( 5.9) 21 ( 4.9) 12 ( 4.9) 31 ( 3.7) 1 ( 3.6) 0 ( 0.0)
95012 12 ( 7.1) 30 ( 7.0) 21 ( 8.6) 75 ( 8.9) 0 ( 0.0) 0 ( 0.0)
95039 2 ( 1.2) 0 ( 0.0) 1 ( 0.4) 2 ( 0.2) 0 ( 0.0) 0 ( 0.0)
95076 0 ( 0.0) 1 ( 0.2) 3 ( 1.2) 7 ( 0.8) 0 ( 0.0) 0 ( 0.0)
99999 3 ( 1.8) 1 ( 0.2) 3 ( 1.2) 9 ( 1.1) 0 ( 0.0) 0 ( 0.0)
Distribution of Intrapartal Conditions by Zip code
Abruption Placenta Hemorrhage infection None Placenta Previa Preeclampsia Preeclampsia_Infection Prolonged ROM p test
n 15 6 259 1318 8 69 11 34
zip (%) 0.151
93901 1 ( 6.7) 0 ( 0.0) 16 ( 6.2) 50 ( 3.8) 0 ( 0.0) 3 ( 4.3) 0 ( 0.0) 0 ( 0.0)
93905 7 (46.7) 3 (50.0) 84 (32.4) 418 (31.7) 1 (12.5) 25 (36.2) 6 (54.5) 12 (35.3)
93906 4 (26.7) 1 (16.7) 46 (17.8) 294 (22.3) 3 (37.5) 11 (15.9) 1 ( 9.1) 5 (14.7)
93907 0 ( 0.0) 0 ( 0.0) 8 ( 3.1) 28 ( 2.1) 0 ( 0.0) 1 ( 1.4) 0 ( 0.0) 3 ( 8.8)
93908 0 ( 0.0) 0 ( 0.0) 2 ( 0.8) 5 ( 0.4) 0 ( 0.0) 1 ( 1.4) 0 ( 0.0) 0 ( 0.0)
93925 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 6 ( 0.5) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93926 0 ( 0.0) 0 ( 0.0) 4 ( 1.5) 10 ( 0.8) 0 ( 0.0) 2 ( 2.9) 0 ( 0.0) 0 ( 0.0)
93927 2 (13.3) 1 (16.7) 48 (18.5) 236 (17.9) 3 (37.5) 13 (18.8) 3 (27.3) 6 (17.6)
93930 0 ( 0.0) 0 ( 0.0) 9 ( 3.5) 77 ( 5.8) 1 (12.5) 2 ( 2.9) 0 ( 0.0) 1 ( 2.9)
93933 0 ( 0.0) 0 ( 0.0) 3 ( 1.2) 4 ( 0.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93955 0 ( 0.0) 0 ( 0.0) 1 ( 0.4) 4 ( 0.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93960 0 ( 0.0) 0 ( 0.0) 12 ( 4.6) 60 ( 4.6) 0 ( 0.0) 0 ( 0.0) 1 ( 9.1) 2 ( 5.9)
95012 1 ( 6.7) 0 ( 0.0) 19 ( 7.3) 106 ( 8.0) 0 ( 0.0) 9 (13.0) 0 ( 0.0) 3 ( 8.8)
95039 0 ( 0.0) 0 ( 0.0) 1 ( 0.4) 2 ( 0.2) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 2 ( 5.9)
95076 0 ( 0.0) 0 ( 0.0) 1 ( 0.4) 10 ( 0.8) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
99999 0 ( 0.0) 1 (16.7) 5 ( 1.9) 8 ( 0.6) 0 ( 0.0) 2 ( 2.9) 0 ( 0.0) 0 ( 0.0)
Distribution of Intrapartal Events by Zip code
Ineffective Pushing None Precip Labor <3 hrs Prolonged 2nd Stg >2.5 hr Prolonged 2nd Stg >2.5 hr, Ineffective Pushing Prolonged Labor >20 hrs Prolonged Labor >20 hrs, Ineffective Pushing Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr, Ineffective Pushing p test
n 16 1501 95 37 5 54 1 8 3
zip (%) 0.020
93901 1 ( 6.2) 63 ( 4.2) 4 ( 4.2) 1 ( 2.7) 1 (20.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93905 4 (25.0) 488 (32.5) 30 (31.6) 10 (27.0) 3 (60.0) 20 (37.0) 1 (100.0) 0 ( 0.0) 0 ( 0.0)
93906 3 (18.8) 318 (21.2) 20 (21.1) 12 (32.4) 0 ( 0.0) 8 (14.8) 0 ( 0.0) 2 (25.0) 2 (66.7)
93907 0 ( 0.0) 32 ( 2.1) 3 ( 3.2) 1 ( 2.7) 0 ( 0.0) 4 ( 7.4) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93908 0 ( 0.0) 5 ( 0.3) 2 ( 2.1) 0 ( 0.0) 0 ( 0.0) 1 ( 1.9) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93925 0 ( 0.0) 6 ( 0.4) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93926 0 ( 0.0) 14 ( 0.9) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 1.9) 0 ( 0.0) 1 (12.5) 0 ( 0.0)
93927 3 (18.8) 273 (18.2) 22 (23.2) 5 (13.5) 1 (20.0) 7 (13.0) 0 ( 0.0) 1 (12.5) 0 ( 0.0)
93930 1 ( 6.2) 81 ( 5.4) 4 ( 4.2) 2 ( 5.4) 0 ( 0.0) 2 ( 3.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93933 0 ( 0.0) 7 ( 0.5) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93955 1 ( 6.2) 4 ( 0.3) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
93960 1 ( 6.2) 67 ( 4.5) 1 ( 1.1) 2 ( 5.4) 0 ( 0.0) 3 ( 5.6) 0 ( 0.0) 1 (12.5) 0 ( 0.0)
95012 1 ( 6.2) 116 ( 7.7) 9 ( 9.5) 3 ( 8.1) 0 ( 0.0) 7 (13.0) 0 ( 0.0) 1 (12.5) 1 (33.3)
95039 0 ( 0.0) 3 ( 0.2) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 ( 1.9) 0 ( 0.0) 1 (12.5) 0 ( 0.0)
95076 0 ( 0.0) 11 ( 0.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0)
99999 1 ( 6.2) 13 ( 0.9) 0 ( 0.0) 1 ( 2.7) 0 ( 0.0) 0 ( 0.0) 0 ( 0.0) 1 (12.5) 0 ( 0.0)
  • For labor types (p-value = 0.142) and intrapartal conditions (p-value = 0.151) there is no significant difference in their distribution among and within the zipcodes.

  • A different conclusion for Intrapartal events (p_value = 0.02): Focusing on the 3 most frequent intrapartal events, there is a significant difference in precipitous labor (93905, 93927, 93906); Prolonged Labor >20 hrs (93905, 93906, 93927); and Prolonged 2nd Stg >2.5 hr (93905, 93906, 93927); 93905 had the most difference in each category.

4.0.2.2 By High Risk OB

Distribution of Labor Types by High Risk OB
Augmented Induced Not Applicable Spontaneous Spontaneous, Augmented Suppressed p test
n 170 428 244 843 28 7
hghRsk = no/yes (%) 144/26 (84.7/15.3) 299/129 (69.9/30.1) 159/85 (65.2/34.8) 725/118 (86.0/14.0) 25/3 (89.3/10.7) 5/2 (71.4/28.6) <0.001
Distribution of Intrapartal Conditions by High Risk OB
Abruption Placenta Hemorrhage infection None Placenta Previa Preeclampsia Preeclampsia_Infection Prolonged ROM p test
n 15 6 259 1318 8 69 11 34
hghRsk = yes (%) 5 (33.3) 2 (33.3) 47 (18.1) 219 (16.6) 4 (50.0) 69 (100.0) 11 (100.0) 6 (17.6) <0.001
Distribution of Intrapartal Events by High Risk OB
Ineffective Pushing None Precip Labor <3 hrs Prolonged 2nd Stg >2.5 hr Prolonged 2nd Stg >2.5 hr, Ineffective Pushing Prolonged Labor >20 hrs Prolonged Labor >20 hrs, Ineffective Pushing Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr, Ineffective Pushing p test
n 16 1501 95 37 5 54 1 8 3
hghRsk = yes (%) 0 (0.0) 328 (21.9) 10 (10.5) 3 (8.1) 1 (20.0) 15 (27.8) 1 (100.0) 4 (50.0) 1 (33.3) 0.002

High Risk OB, on the other hand, showed significant differences in all three parameters:

  • Labor Type (p= <0.001) : 30% induced; 14% spontaneous – greater association with labor induction;
  • Intrapartal conditions (p = <0.001): None 16.6% versus 100% Preclampsia, 18.1% infection, and 17.6 % Prolonged Rupture of Membrane
  • Intrapartal events (p = 0.002): None 21.9% Vs 10.5% for Precipitous labor < 3hrs, 8.1 % for Prolonged 2nd stage >2.5 hr and 27.8 % for Prolonged labor > 20 hr

4.0.2.3 By Preterm:

Distribution of Labor Types by Preterm
Augmented Induced Not Applicable Spontaneous Spontaneous, Augmented Suppressed p test
n 170 428 244 843 28 7
cluster = 0/1 (%) 155/15 (91.2/8.8) 400/28 (93.5/6.5) 211/33 (86.5/13.5) 760/83 (90.2/9.8) 28/0 (100.0/0.0) 5/2 (71.4/28.6) 0.009
Distribution of Intrapartal Conditions by Preterm
Abruption Placenta Hemorrhage infection None Placenta Previa Preeclampsia Preeclampsia_Infection Prolonged ROM p test
n 15 6 259 1318 8 69 11 34
cluster = 1 (%) 6 (40.0) 1 (16.7) 10 (3.9) 106 (8.0) 7 (87.5) 21 (30.4) 1 (9.1) 9 (26.5) <0.001
Distribution of Intrapartal Events by Preterm
Ineffective Pushing None Precip Labor <3 hrs Prolonged 2nd Stg >2.5 hr Prolonged 2nd Stg >2.5 hr, Ineffective Pushing Prolonged Labor >20 hrs Prolonged Labor >20 hrs, Ineffective Pushing Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr Prolonged Labor >20 hrs, Prolonged 2nd Stg >2.5 hr, Ineffective Pushing p test
n 16 1501 95 37 5 54 1 8 3
cluster = 1 (%) 1 (6.2) 139 (9.3) 16 (16.8) 0 (0.0) 0 (0.0) 4 (7.4) 0 (0.0) 1 (12.5) 0 (0.0) 0.172

The analysis shows that Preterm

  • is significantly associated with different frequencies of different Labor types (p=0.009) AND different intrapartal conditions (p= <0.001)

  • but NOT different intrapartal events (p=0.172)

In summary, for this series of frequency and distribution analysis, we can conclude that High Risk OB has significant impact on Intrapartal events and Intrapartal conditions, as well as Labor Type.

There are many other ways to analyze the data with this method. For example, we can address questions like: does labor induction lead to more C-sections?

4.0.2.4 Labor type by Delivery method: an example

Distribution of Delivery Method by Labor type
C/S, Primary C/S, Repeat Vaginal p test
n 209 198 1313
lbr_type_cnsldt (%) <0.001
Augmented 24 (11.5) 9 ( 4.5) 137 (10.4)
Induced 76 (36.4) 22 (11.1) 330 (25.1)
Not Applicable 58 (27.8) 136 (68.7) 50 ( 3.8)
Spontaneous 44 (21.1) 27 (13.6) 772 (58.8)
Spontaneous, Augmented 5 ( 2.4) 1 ( 0.5) 22 ( 1.7)
Suppressed 2 ( 1.0) 3 ( 1.5) 2 ( 0.2)

We show a statistically significant effect (p = <0.001): Induced labor is associated with a higher frequency of Primary C-Section when compared to other delivery types (36.4%), but also when Primary C-Section rate is compared to Vaginal Delivery rate for the labor induction group (36.4% Vs 25.1%). That solves out to Odds Ratio of 1.70 (We will see Odds Ratio again later); meaning that the odds of having Primary C-Section are 1.7 times higher for labor induction group, and it is statistically significant at p-value <0.001)

4.0.2.5 Delivery method by High Risk OB: an example

Another example: Is Caesarian section more frequent in High Risk OB?

Distribution of Delivery Method by High Risk
C/S, Primary C/S, Repeat Vaginal p test
n 209 198 1313
hghRsk = yes (%) 93 (44.5) 65 (32.8) 205 (15.6) <0.001

Yes, in fact, at a p-value = <0.001, the percent distribution is: 44.5%, 32.8%, and 15.6% for C/S– Primary; C/S– Repeat; and Vaginal delivery respectively when High Risk OB is a present condition.

Moving on to a different analysis schema: Survival Analysis (SA) or by Time-to-Event (TTE) Analysis.

4.0.3 Survival Analysis (SA)

Survival or Time-to_Event (TTE) Analysis

Survival analysis, or time-to-event analysis, is a statistical method used to analyze the expected duration until one or more events occur, such as death or failure. It involves the use of survival functions to estimate the probability of surviving past a certain time point. The Kaplan-Meier (KM) estimator is a non-parametric statistic used to estimate the survival function from observed survival times. Cumulative hazard (cumHaz) functions represent the accumulated risk of the event over time. Hazard ratios (HR) compare the hazard or event rates between two groups, indicating the relative likelihood of the event occurring in one group compared to another.

Survival Analysis can answer the questions:

  • What is the probability of being undelivered at a given time after admission? (yields Median Survival)

  • What percent of all deliveries would have occurred at a given time after admission? (Cumulative Hazard)

  • What is the comparative likelihood of delivery having occurred given a certain condition versus absence of such a condition at a certain time (Hazard Ratio, (HR))?

These are important questions for planning, predictions and for information for patient.

4.0.3.0.1 Definitions

Here’s a definition of terms relevant to Survival or Time-to_Event (TTE) Analysis :

Event: Delivery or the outcome

There is an associated delivery date with time that defines the event occurrence

Intervention: Admission to the NMC

A date-time is associated with Intervention

Interval: Duration of time between Intervention and Event.

Time: any slice of, or point , in time (not Duration) within Interval

Survival: “Survived” Event at any given Time

Hazard: Risk of Event at any point in time at any given Time

Cumulative Hazard: Accumulated Risk of event up to a given point in time at any given time

Hazard Ratio (HR): Relative hazard rates between two groups at any given point in time at any given Time

In our case, given that the Event is Delivery, Survival means “survived delivery”, ie., delivery has NOT occurred at the specified Time; or, a status of undelivered.

These definitions will facilitate the interpretation of our analyses when we use the popular analytic method called Survival Analysis (SA) or Time to Event (TTE) analysis.

For Survival or TTE Analysis, we will use the method of Cox Proportional Hazards; this permits the taking into account together of all the variables that contribute to an outcome (or event) even while focusing on a particular variable of interest. The data subset will exclude deliveries done before patient had a chance for admission-registration, the latter being the time-stamp to start the clock of Intervention. That leaves 1598 deliveries, a substantial 93% of all deliveries.

4.0.3.1 COX Survival, K-M, Cumulative Hazard and Hazard Ratio

Standard plot and example:

4.0.3.1.0.1 Cox Surv, CumHaz, strata = ALL

General: ALL variables, SA, CumHaz

There’s the top plot (Survival Plot) which shows the probability of Survival at any time; and the bottom plot (Cumulative Hazard Plot) which shows accumulated events at any time. In the top / Survival Plot, the intersecting vertical and horizontal lines determine “median survival (time)”: the time at which survival probability is 50% (time at which half the population will have survived (or experienced) the event). Time, for our purposes, is measured in hours since hospital admission registration.

Keeping in mind that for our data,

  • “Survival” = not delivered (or undelivered);

  • “Event” = “Hazard” = Delivered.

  • “Risk” = the risk of Delivery.

We can interpret the top graph as:

  • median survival is 12 hours; or,

  • the time when half of our patients is expected to be still undelivered is 12 hours; or,

  • Our patients have a 50% probability of being undelivered at 12 hours after admission.

For the bottom graph, we determine the time when the cumulative deliveries show 50% still need to deliver (or are still to deliver). It works out to be about 12 hours. Note: it’s just a coincidence that the median survival time (the time when half of our patients is still undelivered, based on probability) is also 12 hours.

This method is most informative when used to compare of different variable subsets, for example:

4.0.3.1.0.2 Cox Surv, CumHaz –stratified examples

The example above compares the condition of membrane-rupture with no membrane-rupture. First thing to check is the p-value. A p-value of 0.34 is > 0.05, there is therefore no statistically significant difference among this condition-binary with respect to probability of survival (undelivered); or cumulative hazard (total delivered) at any time. Based on the graph, at 11 hours, the probability of being undelivered is 50% for either of them. At 12 hours (your pick of time; it could have been at 20, 30, or 36 hours) the cumulative hazard (accumulated deliveries) is 55% and 61% respectively, which do not meet the test of statistical significance for being different.

On the other hand, below, when comparing the intrapartal condition, Preeclampsia present or not, we can see the difference in Survival (undelivered status) and Cumulative deliveries; p-value <0.0001. 50% undelivered probability is at ~12-hr mark if no Preeclampsia, but doubles to 25 hours when Preeclampsia is present. At the same time , cumulative delivered is 57% with Preeclampsia absent but only 24% if Preeclampsia is present.

Since we are always curious about zipcode effects, we use them for the next example below. First, a p-value of 0.12 means no significant difference among the 4 top zipcodes in terms of the probability of being undelivered or the cumulative deliveries at any point in time:

4.0.3.1.1 Cox with stratification
4.0.3.1.1.1 Intrapartal Conditions consolidated
4.0.3.1.1.2 Intrapartal events Consolidated
4.0.3.1.1.3 Labor Type consolidated
4.0.3.1.1.4 UDS Age grouping 25-44
4.0.3.1.1.5 Continuous variables internally grouped
4.0.3.1.1.6 By Zip Code
4.0.3.1.1.7 High Risk OB and Preterm

We performed the same analysis and obtained plots for the different variables and tabulated the result in a summary table:

4.0.3.1.1.8 Summary of Significant Cox Proportional Hazard modelling
Summary Table, Predictor effects on Survival Probability
Variable Probability
hghRsk_yes Decreases Undelivered Probability
membrane_rupture_None Decreases Undelivered Probability
intrapartal_events_Precip_Labor_lt3_hrs Decreases Undelivered Probability
uds_age_25_44 Decreases Undelivered Probability
cluster_1 Increases Undelivered Probability
intrapartal_events_Prolonged_Labor_gt20_hrs Increases Undelivered Probability
conditions_cnsldt_Placenta_Previa Increases Undelivered Probability
conditions_cnsldt_Preeclampsia Increases Undelivered Probability
conditions_cnsldt_Preeclampsia_Infection Increases Undelivered Probability
Summary Table, Predictor effects on Cumulative Deliveries
Variable Cumulative_hazard
hghRsk_yes Higher Cumulative Deliveries
cluster_1 Higher Cumulative Deliveries
intrapartal_events_Prolonged_Labor_gt20_hrs Higher Cumulative Deliveries
conditions_cnsldt_Placenta_Previa Higher Cumulative Deliveries
conditions_cnsldt_Preeclampsia Higher Cumulative Deliveries
conditions_cnsldt_Preeclampsia_Infection Higher Cumulative Deliveries
membrane_rupture_None Lower Cumulative Deliveries
intrapartal_events_Precip_Labor_lt3_hrs Lower Cumulative Deliveries
uds_age_25_44 Lower Cumulative Deliveries

We see that:

  • the probability at any time during OB admission of not_delivered_yet is higher with PreTerm, Placenta Previa, Preeclampsia, Preeclampsia with infection and Prolonged Labor greater than 20 hours.

  • On the other hand, High Risk OB, No Membrane Rupture, Precipitous Labor less than 3 hours, and Age_Group 25-44 years decrease such probability.

We will continue next with Cox Proportional Hazards Survival Analysis and address Hazard Ratio: the comparative Risk of the Event: (the Risk of) Delivery between two conditions that impact the Event. A reminder:

Hazard: Risk of Event (risk of Delivery) at any point in time at any given Time

Hazard Ratio (HR): Relative hazard rates between two sub-groups (usually of the binarized or “referencialized” group) at any given point in time at any given Time

First, we’ll obtain the HR’s of all variables of interest, then we will select the statistically significant ones which are determined by p-values < 0.05 (and marked by asterisks next to the p-value)

4.0.3.1.2 COX HR

We’ve explored cumulative hazard (where hazard = Delivery) which is accumulated delivery count over time at any time. Now, we will look at Risk Ratio, otherwise known as Hazard Ratio (HR) (risk in this case = Delivery): a comparative Risk analysis, here, between either binarized variables or between subgroups within a group with one subgroup as a reference.

4.0.3.1.2.1 All Variables

The busy HR plot above gives an idea of all the variables considered; it shows which HRs are statistically significant based on p-value less than 0.05. The next plot shows only variables whose HR values are significant; this is followed by the corresponding table.

4.0.3.1.2.2 Only significantly positive HR’s

HR calculations showing conditions or variables with a statistically significant self-referential or binarized relative risk or likelihood of Delivery at any point in time

In summary,

  • Higher Risk of Delivery (HR > 1)
    • Increasing Gestational Age (not by much): Each additional day of gestational age slightly increases the risk of delivery at any point in time.
    • Higher APGAR 1 Score: Higher APGAR scores at 1 minute are associated with an increased risk of delivery at any point in time.
    • Maternal Age Group 25-44 Years: This age group shows a higher risk of delivery compared to the reference group.
    • Absence of Membrane Rupture: When there is no membrane rupture, the risk of delivery is higher.
    • Spontaneous Labor: Spontaneous labor is associated with a higher risk of delivery.
    • Precipitous Labor Less than 3 Hours: Quick labor (less than 3 hours) increases the risk of delivery.
  • Lower Risk of Delivery (HR < 1)
    • Placenta Previa: This condition significantly lowers the risk of delivery at any point in time.
    • Prolonged Rupture of Membrane (ROM): Prolonged ROM is associated with a lower risk of delivery.
    • Preeclampsia (with or without infection): Preeclampsia, whether alone or with infection, decreases the risk of delivery.
    • Prolonged Labor Greater than 20 Hours: Labor lasting more than 20 hours is associated with a reduced risk of delivery.
    • Labor Induction: Induced labor is associated with a lower risk of delivery.
    • APGAR 5: Higher APGAR scores at 5 minutes are associated with a reduced risk of delivery.
    • Increasing Maternal Age (not by much): Each additional year of maternal age slightly decreases the risk of delivery.
    • Zip Codes 95076 and 93927: These specific zip codes show a lower risk of delivery at any point in time.

We leave Cox methods and go on to GLM modelling to determine ODDS Ratio

4.0.4 ODDS Ratio: GLM modeling

ODDS Ratio (OR): measure of association between an exposure and an outcome

OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring without that exposure. (In logistic regression, it is used to quantify the strength of the association between predictor variables and a binary outcome.) For our data, we will pick, as examples, two outcomes (binarized): presence or absence of intrapartal events; and presence or absence of intrapartal conditions.

Odds Ratio can answer the questions:

  • What is the likelihood of an event occurring in the presence of a certain condition compared to its absence?

  • How does exposure to a particular factor influence the probability of an outcome compared to non-exposure?

  • Can we identify the strength of association between a predictor and an outcome in a binary response scenario?

4.0.4.0.1 Example: Intrapartal events and intrapartal conditions.

predictor OR conf.low conf.high p.value model
zip_category93906 0.499 -1.2948598 -0.0700991 0.0255 Conditions
zip_category93930 0.397 -1.7635018 -0.1128271 0.0273 Conditions
age 0.943 -0.0975070 -0.0196921 0.0032 Conditions
uds_age25-44 1.960 0.0638677 1.2919942 0.0315 Conditions
hghRskyes 3.820 1.0065969 1.6774569 0.0000 Conditions
lbr_type_cnsldtNot Applicable 0.346 -1.5914286 -0.5365382 0.0001 Conditions
lbr_type_cnsldtSpontaneous 0.344 -1.4440112 -0.6841074 0.0000 Conditions
del_method_cnsldtVaginal 0.681 -0.7427305 -0.0216892 0.0363 Conditions
lbr_type_cnsldtNot Applicable 0.323 -2.0937576 -0.2522798 0.0151 Events
lbr_type_cnsldtSpontaneous, Augmented 3.155 0.1445341 2.1106558 0.0209 Events
del_method_cnsldtC/S, Repeat 0.345 -1.9429528 -0.2702074 0.0118 Events
del_method_cnsldtVaginal 0.393 -1.3782503 -0.4820587 0.0000 Events
conditions_cnsldtProlonged ROM 9.170 0.3702476 5.1984706 0.0478 Events

We have similar plots of all the outcomes we are interested in, but will summarize the findings below with bullet points, pared down to only the statistically significant.

4.0.4.0.2 Other outcomes
4.0.4.0.2.1 (with Primary C/S, Non-spontaneous delivery, intrapartal conditions and events)

Summary of Significant Odds Ratios from Logistic Regression Models Significant Variables and Their Interpretation:

  • Delivery Method (C/S, Primary)

    • para: OR = 0.689, p = 0.0041

      • Interpretation: Higher parity (number of pregnancies) is associated with lower odds of having a primary C-section.
    • uds_age25-44: OR = 2.543, p = 0.0260

      • Interpretation: Mothers aged 25-44 have higher odds of having a primary C-section compared to the reference age group.
    • hghRskyes: OR = 4.571, p < 0.0001

      • Interpretation: High-risk pregnancies are significantly associated with higher odds of having a primary C-section.
    • lbr_type_cnsldtSpontaneous: OR = 0.401, p = 0.0013

      • Interpretation: Spontaneous labor is associated with lower odds of having a primary C-section.
  • Labor Type (Non-Spontaneous)

    • age: OR = 1.055, p = 0.0038

      • Interpretation: Increasing maternal age is associated with higher odds of non-spontaneous labor.
    • para: OR = 0.834, p = 0.0249

      • Interpretation: Higher parity is associated with lower odds of non-spontaneous labor.
    • hghRskyes: OR = 1.387, p = 0.0473

      • Interpretation: High-risk pregnancies are associated with higher odds of non-spontaneous labor.
    • del_method_cnsldtC/S, Repeat: OR = 1.810, p = 0.0320

      • Interpretation: Having a repeat C-section is associated with higher odds of non-spontaneous labor.
    • del_method_cnsldtVaginal: OR = 0.226, p < 0.0001

      • Interpretation: Vaginal delivery is associated with lower odds of non-spontaneous labor.
  • Conditions

    • zip_category93906: OR = 0.499, p = 0.0255

      • Interpretation: Zip code 93906 is associated with lower odds of having intrapartal conditions.
    • zip_category93930: OR = 0.397, p = 0.0273

      • Interpretation: Zip code 93930 is associated with lower odds of having intrapartal conditions.
    • age: OR = 0.943, p = 0.0032

      • Interpretation: Increasing maternal age is associated with lower odds of having intrapartal conditions.
    • uds_age25-44: OR = 1.960, p = 0.0315

      • Interpretation: Mothers aged 25-44 have higher odds of having intrapartal conditions compared to the reference age group.
    • hghRskyes: OR = 3.820, p < 0.0001

      • Interpretation: High-risk pregnancies are significantly associated with higher odds of having intrapartal conditions.
    • lbr_type_cnsldtNot Applicable: OR = 0.346, p = 0.0001

      • Interpretation: Having a consolidated labor type categorized as “Not Applicable” is associated with lower odds of having intrapartal conditions.
    • lbr_type_cnsldtSpontaneous: OR = 0.344, p < 0.0001

      • Interpretation: Spontaneous labor is associated with lower odds of having intrapartal conditions.
    • del_method_cnsldtVaginal: OR = 0.681, p = 0.0363

      • Interpretation: Vaginal delivery is associated with lower odds of having intrapartal conditions.
  • Events

    • lbr_type_cnsldtNot Applicable: OR = 0.231, p = 0.0011

      • Interpretation: Having a consolidated labor type categorized as “Not Applicable” is associated with lower odds of having intrapartal events.
    • lbr_type_cnsldtSpontaneous, Augmented: OR = 2.960, p = 0.0235

      • Interpretation: Augmented spontaneous labor is associated with higher odds of having intrapartal events.
    • del_method_cnsldtC/S, Repeat: OR = 0.325, p = 0.0062

      • Interpretation: Having a repeat C-section is associated with lower odds of having intrapartal events.
    • del_method_cnsldtVaginal: OR = 0.392, p < 0.0001

      • Interpretation: Vaginal delivery is associated with lower odds of having intrapartal events.
  • Zipcodes

    • High Risk OB and Suppressed Labor Types exert strong influence consistently across all zip codes

Our ongoing analyses of the data now take us to the derivation of significant Coefficients

4.1 Significant Coefficients

In our analysis, we utilized multinomial regression to uncover significant coefficients that explain the relationships between our predictors and the binary outcomes. Unlike the odds ratios, which provide a measure of the strength and direction of the association, the coefficients give a direct indication of the effect size for each predictor.

The plot below displays the significant coefficients (p < 0.05) for our usual outcomes of intrapartal events and intrapartal conditions. Each bar represents the coefficient value for a predictor, indicating the magnitude and direction of its effect on the outcome. Positive coefficients suggest an increase in the likelihood of the outcome with higher predictor values, while negative coefficients indicate a decrease.

This visualization provides a straightforward interpretation of how each significant predictor influences the outcome. The following sections will further detail the specific predictors and their coefficient values for each outcome.

Significant Coefficients Across All Models
model predictor coefficient p.value
model_del para -0.3566279 0.0076588
model_del uds_age25-44 0.9870524 0.0187211
model_del hghRskyes 1.7664275 0.0000000
model_del lbr_type_cnsldtNot Applicable 0.7174672 0.0173116
model_del lbr_type_cnsldtSpontaneous -0.8883419 0.0022149
model_del conditions_cnsldtNone -1.3044247 0.0415526
model_del conditions_cnsldtPreeclampsia -2.1668716 0.0032286
model_lbr age 0.0799769 0.0000456
model_lbr para -0.1787460 0.0300220
model_lbr del_method_cnsldtC/S, Repeat 0.6364332 0.0231146
model_lbr del_method_cnsldtVaginal -1.5524602 0.0000000
model_lbr conditions_cnsldtPlacenta Previa -2.4506119 0.0200594
model_conditions (Intercept) 1.2264470 0.0198288
model_conditions zip_category93906 -0.6949386 0.0255433
model_conditions zip_category93930 -0.9239900 0.0273164
model_conditions age -0.0584206 0.0032245
model_conditions uds_age25-44 0.6731746 0.0314894
model_conditions hghRskyes 1.3403485 0.0000000
model_conditions lbr_type_cnsldtNot Applicable -1.0601195 0.0000801
model_conditions lbr_type_cnsldtSpontaneous -1.0664066 0.0000000
model_conditions del_method_cnsldtVaginal -0.3846327 0.0363224
model_events lbr_type_cnsldtNot Applicable -1.1304858 0.0150603
model_events lbr_type_cnsldtSpontaneous, Augmented 1.1488652 0.0209428
model_events del_method_cnsldtC/S, Repeat -1.0645409 0.0118063
model_events del_method_cnsldtVaginal -0.9347261 0.0000420
model_events conditions_cnsldtProlonged ROM 2.2159081 0.0477904

Summary of results as above: The multinomial logistic regression model was used to analyze the relationship between various predictors and the occurrence of either Intrapartal Events or Intrapartal conditions. The outcome variable was re-coded to model the presence of intrapartal events as “Yes” versus “None”; same for intrapartal conditions. Our interest is in the coefficients.

General Interpretation:

  • Positive Coefficients: Predictors with positive coefficients increase the likelihood of intrapartal events or conditions

  • Negative Coefficients: Predictors with negative coefficients decrease the likelihood of intrapartal events or conditions

  • Magnitude: The larger the absolute value of the coefficient, the stronger the association with intrapartal events or conditions

  • Significance: Predictors with significant p-values (e.g., p < 0.05) are considered to have a statistically significant association with the outcome.

We will next engage one more analysis schema, looking for Variable Importance, because we may use our data to make forecasts.

4.2 Variable Importance

Variable Importance can answer the questions:

  • Which predictors are most influential in determining the outcome?

  • How do the different features rank in terms of their contribution to the predictive power of the model?

  • Can we identify the key drivers of the model’s performance, helping to focus on the most critical variables for decision-making?

Variable Importance: indicates how useful each variable is for making accurate predictions.

The Mean Decrease in Gini Index (MeanDecreaseGini) is one common metric used to measure variable importance. Variables with higher MeanDecreaseGini values are more important for the model: the higher the MeanDecreaseGini, the more influential the variable is in predicting the outcome.

Significance of Variable Importance:

  • Key Predictors: Variables with the highest MeanDecreaseGini values are key predictors for the outcome. They contribute the most to the model’s accuracy.

  • Interpretability: Knowing which variables are most important helps in understanding the underlying patterns and relationships in the data.

We will select the top 20 variables for each outcome of Intrapartal Events and Intrapartal Conditions.

4.2.0.1 Intrapartal Events and Intrapartal Conditions

Variable Importance for Intrapartal Events Outcome
0 1 MeanDecreaseAccuracy MeanDecreaseGini Variable
weight 4.5902849 -2.0778438 3.6540708 35.958210 weight
gest_age_days 11.1593229 -1.4375984 10.7585617 33.689827 gest_age_days
age 8.4521212 -3.0654237 7.4985672 26.659181 age
grav 7.9092019 1.2315512 8.5161817 14.352846 grav
para 8.4484800 -3.0670915 7.9465599 13.033847 para
apg1 2.7544646 2.2923108 3.4342843 12.028137 apg1
del_method_cnsldt_Vaginal 9.7805361 5.5813449 11.4179407 7.943900 del_method_cnsldt_Vaginal
gender_Male 0.5512928 0.1138712 0.5961818 7.088120 gender_Male
membrane_rupture_Spontaneous 0.7602006 2.3857518 1.6014326 6.697260 membrane_rupture_Spontaneous
zip_93905 -1.3000893 1.4802916 -0.6276862 5.796317 zip_93905
diff_grav_para 0.7170627 0.8549659 0.9716276 5.697163 diff_grav_para
zip_93906 -2.8706686 3.2095937 -1.5080081 5.549586 zip_93906
lbr_type_cnsldt_Spontaneous -4.4252211 4.2839805 -2.9622716 5.420347 lbr_type_cnsldt_Spontaneous
zip_93927 -0.5453530 3.6164299 0.8124317 5.401203 zip_93927
lbr_type_cnsldt_Induced 0.6478530 -0.8183917 0.2538101 5.397835 lbr_type_cnsldt_Induced
conditions_cnsldt_None 1.0131830 6.0949542 3.5427495 4.549934 conditions_cnsldt_None

Variable Importance for Intrapartal Conditions Outcome
0 1 MeanDecreaseAccuracy MeanDecreaseGini Variable
weight 11.658143 -0.5611922 11.0115895 59.590272 weight
gest_age_days 11.878062 -0.3231742 11.5805753 52.827189 gest_age_days
age 13.906581 -4.7875646 12.5280274 44.203519 age
para 8.744769 -0.4700082 8.1882656 23.829359 para
hghRsk_yes 30.871103 10.3736320 33.2838444 23.608181 hghRsk_yes
grav 8.969830 -2.7862747 8.0067804 22.822463 grav
apg1 -1.406105 0.9545298 -0.7386330 19.243622 apg1
lbr_type_cnsldt_Spontaneous 11.989385 5.5702236 13.9069416 13.635588 lbr_type_cnsldt_Spontaneous
membrane_rupture_Spontaneous -2.115340 4.5288393 0.3290470 12.172031 membrane_rupture_Spontaneous
diff_grav_para 4.013150 0.8659881 4.0052569 11.953602 diff_grav_para
gender_Male -1.503673 -0.9301480 -1.8796093 10.989558 gender_Male
lbr_type_cnsldt_Induced 8.394840 1.7907287 8.4203040 10.170030 lbr_type_cnsldt_Induced
zip_93905 -1.320391 0.4502400 -0.9657428 10.039409 zip_93905
del_method_cnsldt_Vaginal 4.869490 3.1102970 6.3000130 9.475491 del_method_cnsldt_Vaginal
zip_93906 3.108489 -0.1518147 2.6927133 8.996988 zip_93906
intrapartal_events_None 9.627765 5.6751728 11.0684859 8.465820 intrapartal_events_None

Footnotes

  1. Cluster analysis is a statistical technique used to group sets of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It is especially useful in medical research for identifying patterns among patients, aiding in understanding behaviors, disease progression, and treatment outcomes without predefined categories.↩︎

  2. (source: https://www.unitedstateszipcodes.org/93908/, viewed 2024-05-27)↩︎

  3. (source: https://www.unitedstateszipcodes.org/93908/, viewed 2024-05-27)↩︎

Source Code
---
title: "CSVS OB DATA Analytics"
author: "Oguchi Nkwocha, MD., MS."
format:
  html:
    page-layout: full
    code-tools: true
    toc: true
    toc-depth: 6
    number-sections: true
    toc-location: left
    theme: cosmo
    comments:
      hypothesis: true
---


```{r echo=FALSE, warning=FALSE, message=FALSE}
#setwd("C:/Users/onkwocha.CSVSLINK/Dropbox/OB_DATA")
library(readr)
library(openxlsx)
library(xlsx)
library(DataExplorer)
library(explore)
library(XICOR)
library(survival)
library(survminer)
library(klaR)
library(forestplot) 
library(mice)
library(readxl)
library(dplyr)
library(purrr)
library(stringr)
library(randomForest)
library(leaflet)
library(sf)
library(tableone)
library(DT)
library(kableExtra)
library(tidyr)
#library(shiny)
library(quarto)
library(fastDummies)
library(downlit)
library(tidyverse)
library(magrittr)
library(ggplot2)


```

```{r echo=FALSE, warning=FALSE, message=FALSE}
ob_data_chr <- readRDS("Working OB Dataset.RDS")
ob_data_fctr <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS")
```

# Introduction

OB service at CSVS is delivered by dedicated clinicians and staff who provide tools- and procedure-validated comprehensive natal education through the CPSP program; prenatal care, hospital and delivery care, as well as post-natal care. OB deliveries are currently performed only at NMC which keeps a standard meticulous record of the delivery event and related issues. Each phase of OB care --prenatal, peripartal and postpartal-- presents opportunities and challenges depending on the effects of relevant factors which impact care, especially the central event of OB care, Delivery.

The goal of this document is to explore the data and present analytical results for information, education and action where indicated.

# Overview

```{r echo=FALSE, warning=FALSE, message=FALSE}
ob_stats_block <- ob_data_chr

# Calculate days and months
library(lubridate)

# Define start and end dates
ob_data_orig <- readRDS("Base OB Data DB.RDS")
start_date <- min(ob_data_orig$adm_date)
end_date <- max(ob_data_orig$adm_date)

# Calculate the number of days
number_of_days <- as.numeric(end_date - start_date)
#print(paste("Number of days:", number_of_days))

# Calculate the number of months
date_interval <- lubridate::interval(start_date, end_date)
number_of_months <- time_length(date_interval, "months")
#print(paste("Number of months:", number_of_months))

# Other quick stats

dte_range_NMC_data <- range(ob_stats_block$adm_date)
del_tm_dur_range <- round(range((ob_stats_block$gest_age_days)/7),1)
age_range <- range(ob_stats_block$age)
nbr_del_nmc <- nrow(ob_stats_block)
prcnt_preterm <- round(sum(ob_stats_block$gest_age_days <259) / sum(ob_stats_block$gest_age_days >=140)*100,0)
gest_age_range <- range(ob_stats_block$gest_age_days, na.rm = TRUE)
prcnt_hghRsk <- round(sum(ob_stats_block$hghRsk=="yes")/nrow(ob_stats_block)*100,0)
mltpl_gest <- sum(ob_data_chr$baby_sq>1)
mltpl_gest_prcnt <- round(mltpl_gest/nrow(ob_data_chr)*100,1)
mltpl_gest_lvls <- levels(factor(ob_data_chr$baby_sq))
gndr <- table(ob_data_chr$gender)
gndr_prcnt <- round(prop.table(gndr)*100,1)
c_section <- sum(ob_stats_block$del_method_cnsldt!="Vaginal")
c_section_prcnt <- round(c_section / nrow(ob_stats_block)*100,1)
prim_c_section <- sum(ob_stats_block$del_method_cnsldt=="C/S, Primary")
#prim_c_section <- sum(ob_stats_block$del_method_cnsldt=="C/S, Primary")
prim_c_section_prcnt <- round(prim_c_section/nrow(ob_stats_block)*100, 1)
prim_c_section_prcnt_all.cs <- round(prim_c_section/c_section*100, 1)
repeat_cs_prcnt <- round(sum(ob_stats_block$delivery_method =="Repeat Cesarean Section")/nrow(ob_stats_block)*100, 1)

#Primip C/S with vertex presentation
primip_c_s <- ob_stats_block %>% 
  filter(grav ==1 & para ==0 & baby_sq == 1 & presentation == "Vertex" & str_detect(delivery_method, "Prim" ))

primip_prct_all_del <- round(nrow(primip_c_s)/nrow(ob_stats_block)*100,1)

primip_prct_all_c_section <- round(nrow(primip_c_s)/c_section*100, 1)
                                   
primip_prct_primary_c_section <- round(nrow(primip_c_s)/prim_c_section*100,1)

ave_daily_del <- round(nrow(ob_stats_block)/number_of_days,0)
ave_monthly_del <- round(nrow(ob_stats_block)/number_of_months,0)
ave_annual_del <- round(nrow(ob_stats_block)/9*4,0)
min_birth_wt <- min(ob_stats_block$weight[ob_stats_block$weight > 0])
max_birth_wt <- max(ob_stats_block$weight)
ave_birth_wt <- round(mean(ob_stats_block$weight[ob_stats_block$weight > 0],1))

sa_data <- nrow(ob_stats_block %>% filter(time >= 0))

sa_data_prcnt <- round(sa_data/nrow(ob_stats_block)*100,0)
```

Between `r dte_range_NMC_data[1]` and `r dte_range_NMC_data[2]`, a period of **9 quarters** or 27 months, ***`r nbr_del_nmc`*** CSVS prenatal patients delivered at NMC, which is the principal source of data for this analysis.This included **`r gndr[1]` females**, **`r gndr[2]` males** and **`r gndr[3]` unknowns** (`r gndr_prcnt[c(1:3)]` % respectively). There were **`r mltpl_gest` multiple gestations** --all twins -- a `r mltpl_gest_prcnt`% incidence. There was **1 Fetal Demise** during the period.

Primary Cesarean section rate was **```r prim_c_section_prcnt```%**, constituting **`r prim_c_section_prcnt_all.cs`%** of all c-sections.  Repeat C-Section rate was **```r repeat_cs_prcnt```%**. Total C-section rate was **```r c_section_prcnt```%**.
_For **primip vertex** presentations_, the overall C-section rate is **```r primip_prct_all_del```%**. This accounts for **```r primip_prct_all_c_section```%** of all C-Sections, which is **```r primip_prct_primary_c_section```%** of all Primary Cesaerian Sections.

The average monthly delivery for CSVS for the period was **`r ave_monthly_del`**, with _annualized rate_ of **```r ave_annual_del```** deliveries. Each day, on the average, **`r ave_daily_del` deliveries** occurred. 

Average birth weight was **`r ave_birth_wt`gm**, with minimum **`r min_birth_wt`gm** and maximum of **`r max_birth_wt`gm**. Maternal age range was between ***`r age_range[1]`*** and ***`r age_range[2]`***, with a ***median age of `r median(ob_stats_block$age)`***. (See Histogram below)

## Exploring the Data

### Distributions and frequencies

```{r echo=FALSE, warning=FALSE, message=FALSE}

# Load necessary libraries
library(ggplot2)
library(ggdist)
library(dplyr)

# Function to create a raincloud plot
create_raincloud_plot <- function(data, variable, title, xlabel) {
  ggplot2::ggplot(data, aes_string(x = variable, y = "1")) +
    ggdist::stat_halfeye(fill = 'dodgerblue4') +
    ggdist::stat_dots(
      aes(y = 0.8),
      fill = 'dodgerblue4', 
      color = 'black',
      side = 'bottom'
    ) +
    geom_boxplot(width = 0.25, color = 'black', linewidth = 1) +
    theme_minimal(base_size = 9) +
    labs(x = xlabel, y = element_blank(), title = title) +
    theme(
      panel.grid.minor = element_blank(),
      panel.grid.major.y = element_blank(),
      plot.title = element_text(size = rel(1.5), face = 'bold'),
      legend.position = 'none'
    ) +
    coord_cartesian(xlim = range(data[[variable]], na.rm = TRUE))
}

# Ensure columns are present and handle NA values
ob_data_chr_violin <- ob_data_chr %>%
  dplyr::select(age, weight) %>%
  na.omit()

# Create plots for age and weight
p1 <- create_raincloud_plot(ob_data_chr_violin, "age", "Raincloud plot of Distribution of Age", "Age (years)")
p2 <- create_raincloud_plot(ob_data_chr_violin, "weight", "Raincloud plot of Distribution of Birth Weight", "Weight (g)")

# Print plots
print(p2)
print(p1)


```

#### Other histograms

```{r echo=FALSE, message=FALSE, warning=FALSE}
#Gravity:
hist(ob_stats_block$grav, main = "gravidarity", xlab = "gravida")
#parity
hist(ob_stats_block$para, main ="parity", xlab = "para")
#APGAR 1
hist(ob_stats_block$apg1, main ="APGAR 1", xlab = "APGAR1")
#APGAR 2
hist(ob_stats_block$apg5, main ="APGAR 5", xlab = "APGAR5")

```

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
library(lubridate)
library(plotly)
library(ggplot2)

para_0_dta <- ob_stats_block %>%
  filter(para ==0)
para_1_dta <- ob_stats_block %>%
  filter(para ==1)

prcnt_p_0 <- round(nrow(para_0_dta)/nrow(ob_stats_block)*100,0)
prcnt_p_1 <- round(nrow(para_1_dta)/nrow(ob_stats_block)*100,0)

```


From the figures above, we learn that there 

- relatively fewer pregnancies under age 19 than over 35
- primips account for **```r prcnt_p_0```%**; second pregnancies **```r prcnt_p_0```%**; so slightly over half of all our deliveries are to women with 1 or 2 deliveries, counting the current pregnancy as a delivery
- most deliveries score APGAR-1 8 and over; and APGAR-5 9



Does the Zip code make any difference with respect to gravidarity or parity?

#### Gravidarity-Parity distribution by Zipcode

```{r echo=FALSE, warning=FALSE, message=FALSE}
ob_data_tbl_xtra <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster)) %>% 
  mutate(grav_fct = factor(grav),
         para_fct = factor(para)) %>% 
  mutate(age_group = cut(age, breaks = c(-Inf, 21, 29, 35, Inf), labels = c("<21", "21-29", "29-35", ">35")),
        age_group = factor(age_group))


##gravidarity

tmp <- tempfile()
sink(tmp)

strata_tbl_grav_zip <- CreateTableOne(data = ob_data_tbl_xtra, strata = "grav_fct", vars = "zip")



kableone(strata_tbl_grav_zip, cramVars = "grav_fct", caption = "Distribution of Gravidarity by Zip code")

sink()

# Remove the temporary file
unlink(tmp)

#Heat map to explain p-Value

library(stringr)

# Suppress print output by redirecting to a temporary file
tmp <- tempfile()
sink(tmp)
grav_zip_plt <- as.data.frame(print(strata_tbl_grav_zip, cramVars = "grav_fct"))
sink()

# Remove the temporary file
unlink(tmp)
# Verify that grav_zip_plt is now a data frame
#head(grav_zip_plt)

grav_zip_plt.hb <- grav_zip_plt %>%
  slice(-1:-2) %>%
  select(-p, -test)

# Clean and separate the percentages
cleaned_data <- grav_zip_plt.hb %>%
  rownames_to_column(var = "zip") %>%
  mutate(across(`1`:`12`, ~ str_extract(., "(?<=\\().+?(?=\\))"))) %>%
  rename_with(~ paste0("grav_", 1:12), `1`:`12`)

# Reshape the data into long format
long_data <- cleaned_data %>%
  pivot_longer(cols = starts_with("grav_"), names_to = "gravidity", values_to = "percent") %>%
  mutate(percent = as.numeric(percent)) %>%
  filter(!is.na(percent)) %>%
  mutate(gravidity = factor(gravidity, levels = paste0("grav_", 1:12)))

# Create the heatmap plot
ggplot(long_data, aes(x = gravidity, y = zip, fill = percent)) +
  geom_tile() +
  labs(title = "Heatmap of Gravidity Levels by ZIP Code",
       x = "Gravidity Level",
       y = "ZIP Code",
       fill = "Percentage") +
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "blue", na.value = "grey50") +
  scale_y_discrete(limits = rev(levels(as.factor(long_data$zip))))



##parity
strata_tbl_para_zip <- CreateTableOne(data = ob_data_tbl_xtra, strata = "para_fct", vars = "zip")
kableone(strata_tbl_para_zip, cramVars = "para_fct", caption = "Distribution of Parity by Zip code")


##parity -Zip heatmap


library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)

# Assuming para_zip_plt is your data frame

# Suppress print output by redirecting to a temporary file
tmp <- tempfile()
sink(tmp)
para_zip_plt <- as.data.frame(print(strata_tbl_para_zip, cramVars = "para_fct"))
sink()

# Remove the temporary file
unlink(tmp)
# Verify that grav_zip_plt is now a data frame
#head(grav_zip_plt)

para_zip_plt.hb <- para_zip_plt %>%
  slice(-1:-2) %>%
  select(-p, -test)


# Clean and separate the percentages
cleaned_data <- para_zip_plt.hb %>%
  rownames_to_column(var = "zip") %>%
  mutate(across(`0`:`9`, ~ str_extract(., "(?<=\\().+?(?=\\))"))) %>%
  mutate(across(`11`, ~ str_extract(., "(?<=\\().+?(?=\\))"))) %>%
  rename(para_0 = `0`, para_1 = `1`, para_2 = `2`, para_3 = `3`, para_4 = `4`, 
         para_5 = `5`, para_6 = `6`, para_7 = `7`, para_8 = `8`, para_9 = `9`, 
         para_11 = `11`)

# Reshape the data into long format
long_data <- cleaned_data %>%
  pivot_longer(cols = starts_with("para_"), names_to = "parity", values_to = "percent") %>%
  mutate(percent = as.numeric(percent)) %>%
  filter(!is.na(percent)) %>%
  mutate(parity = factor(parity, levels = paste0("para_", c(0:9, 11))))

# Create the heatmap plot
ggplot(long_data, aes(x = parity, y = zip, fill = percent)) +
  geom_tile() +
  labs(title = "Heatmap of Parity Levels by ZIP Code",
       x = "Parity Level",
       y = "ZIP Code",
       fill = "Percentage") +
  theme_minimal() +
  scale_fill_gradient(low = "white", high = "blue", na.value = "grey50") +
  scale_y_discrete(limits = rev(levels(as.factor(long_data$zip))))

```

Yes, there is a difference in the distribution,  but only for Parity (number of deliveries) -- for both 93905 (higher in lower parity levels) and 93927 (higher in higher parity levels), with a significant p-value of 0.043. There was no difference with respect to Gravidarity (number of pregnancies)

Is there a difference in the distribution of different age-groups among and within zip codes? As before, first, check the p-value. It is 0.105, so the answer is, No.

#### Age categories distribution by zip

```{r echo=FALSE, message=FALSE, warning=FALSE}
strata_tbl_age_cat_para_zip <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "zip")
kableone(strata_tbl_age_cat_para_zip,  caption = "Distribution of Age-groups by Zip Code")

# strata_tbl_age_cat_grav_zip <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "zip")
# kableone(strata_tbl_age_cat_grav_zip, cramVars = "grav_fct", caption = "Distribution of Age-groups by Gravidarity by Zip Code")
```


<!-- #### Age categories distribution by garavidarity-parity -->

<!-- ```{r echo=FALSE, message=FALSE, warning=FALSE} -->
<!-- strata_tbl_age_cat_para <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "para_fct") -->
<!-- kableone(strata_tbl_age_cat_para, cramVars = "age_group", caption = "Distribution of Age-groups by Parity") -->

<!-- strata_tbl_age_cat_grav <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "grav_fct") -->
<!-- kableone(strata_tbl_age_cat_grav, cramVars = "grav_fct", caption = "Distribution of Age-groups by Gravidarity") -->
<!-- ``` -->




#### Gravidarity-Parity distribution by Age

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Load required libraries
library(dplyr)
library(ggplot2)
library(tableone)
library(knitr)

# Read data
ob_data_tbl_xtra <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster)) %>% 
  mutate(grav_fct = factor(grav), para_fct = factor(para)) %>% 
  mutate(age_group = cut(age, breaks = c(-Inf, 21, 29, 35, Inf), labels = c("<21", "21-29", "29-35", ">35")),
         age_group = factor(age_group))

# Create table one
strata_tbl_age_cat_para <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "para_fct")
kableone(strata_tbl_age_cat_para, cramVars = "age_group", caption = "Distribution of Age-groups by Parity")

strata_tbl_age_cat_grav <- CreateTableOne(data = ob_data_tbl_xtra, strata = "age_group", vars = "grav_fct")
kableone(strata_tbl_age_cat_grav, cramVars = "grav_fct", caption = "Distribution of Age-groups by Gravidarity")


# Summarize data for plotting
plot_data_para <- ob_data_tbl_xtra %>%
  group_by(age_group, para_fct) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group) %>%
  mutate(percentage = count / sum(count) * 100)

# Plot the distribution
ggplot(plot_data_para, aes(x = para_fct, y = percentage, fill = age_group)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%.1f", percentage)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
  labs(title = "Distribution of Age-groups by Parity",
       x = "Parity",
       y = "Percentage",
       fill = "Age Group") +
  theme_minimal()


# Summarize data for plotting
plot_data <- ob_data_tbl_xtra %>%
  group_by(age_group, para_fct) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group) %>%
  mutate(percentage = count / sum(count) * 100) %>%
  mutate(para_fct = as.numeric(as.character(para_fct))) %>%
  arrange(age_group, para_fct)

# Plot the distribution with lines and area shading
ggplot(plot_data, aes(x = para_fct, y = percentage, color = age_group, fill = age_group)) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = 0, ymax = percentage), alpha = 0.3) +
  labs(title = "Distribution of Age-groups by Parity",
       x = "Parity",
       y = "Percentage",
       fill = "Age Group",
       color = "Age Group") +
  scale_x_continuous(breaks = seq(min(plot_data$para_fct), max(plot_data$para_fct), by = 2)) +
  theme_minimal()



# Summarize data for plotting
plot_data_grav <- ob_data_tbl_xtra %>%
  group_by(age_group, grav_fct) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group) %>%
  mutate(percentage = count / sum(count) * 100)

# Plot the distribution
ggplot(plot_data_grav, aes(x = grav_fct, y = percentage, fill = age_group)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = sprintf("%.1f", percentage)), 
            position = position_dodge(width = 0.9), vjust = -0.5, size = 2) +
  labs(title = "Distribution of Age-groups by Gravidarity",
       x = "Gravidarity",
       y = "Percentage",
       fill = "Age Group") +
  theme_minimal()

# Summarize data for plotting
plot_data <- ob_data_tbl_xtra %>%
  group_by(age_group, grav_fct) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  group_by(age_group) %>%
  mutate(percentage = count / sum(count) * 100) %>%
  mutate(grav_fct = as.numeric(as.character(grav_fct))) %>%
  arrange(age_group, grav_fct)

# Plot the distribution with lines and area shading
ggplot(plot_data, aes(x = grav_fct, y = percentage, color = age_group, fill = age_group)) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = 0, ymax = percentage), alpha = 0.3) +
  labs(title = "Distribution of Age-groups by Gravidarity",
       x = "Gravidarity",
       y = "Percentage",
       fill = "Age Group",
       color = "Age Group") +
  scale_x_continuous(breaks = seq(min(plot_data$grav_fct), max(plot_data$grav_fct), by = 2)) +
  theme_minimal()
 

```

Yes, there is a significant difference (p = <0.001 for both) in the distribution of _Gravidarity_ and of _Parity_ within and among zip codes. 

The distribution patterns suggest a trend where younger women (<21) predominantly have their first child, and as age increases, women are more likely to have higher parity deliveries. The 29-35 and >35 age groups, in particular, show a significant percentage of deliveries at higher parity levels, indicating that some women are indeed delaying childbirth to later ages.

We turn our attention next to non-completed or undelivered pregnancies or pregnancies without delivery, the count-difference between gravidarity and parity.This may reflect the state of family planning. 

#### GAP: Pregnancy - Delivery gap

```{r echo=FALSE, warning=FALSE, message=FALSE}

# Load required libraries
library(dplyr)
library(ggplot2)
library(knitr)

# Read data
ob_data_tbl_xtra_initial <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(grav_fct = factor(grav),
         para_fct = factor(para),
         age_group = cut(age, breaks = c(-Inf, 21, 29, 35, Inf), labels = c("<21", "21-29", "29-35", ">35")),
         age_group = factor(age_group))

# Special count for pending first deliveries
ob_data_filtered <- ob_data_tbl_xtra_initial %>%
  mutate(corrected_para = if_else(baby_sq > 1, para - 1, para),
         para.xtr= if_else(corrected_para==0, 1, corrected_para))
         #grav.xtr= if_else(corrected_para==0 & grav==1, 1, grav))

# # Correct para based on baby_sq condition
# ob_data_corrected <- ob_data_filtered %>%
#   mutate(corrected_para = if_else(baby_sq > 1, para - 1, para))

# Calculate the number of pregnancies without delivery
ob_data_preg_without_delivery <- ob_data_filtered%>%
  mutate(preg_without_delivery = grav - para.xtr)

# Summarize the data by age group
summary_data_gap <- ob_data_preg_without_delivery %>%
  group_by(age_group) %>%
  summarise(total_pregnancies = sum(grav),
            total_deliveries = sum(para.xtr),
            total_preg_without_delivery = sum(preg_without_delivery),
            incidence_rate = (total_preg_without_delivery / total_pregnancies * 100),
            annual_incidence_rate = round(incidence_rate/9*4,0))



# Display the summary table
kable(summary_data_gap, caption = "Incidence of Pregnancies without Delivery by Age Group")

# Test for statistical significance:

# Assuming the dataframe is named summary_data_gap and has the required columns

# Creating the contingency table
contingency_table <- as.table(rbind(summary_data_gap$total_deliveries, summary_data_gap$total_preg_without_delivery))
colnames(contingency_table) <- summary_data_gap$age_group
rownames(contingency_table) <- c("total_deliveries", "total_preg_without_delivery")

# Performing the chi-square test
chi_square_test <- chisq.test(contingency_table)

# Displaying the results
p_val <- round(chi_square_test$p.value, 4)

cat("p_value =", p_val, "\n")

# Plot the incidence rates by age group
ggplot(summary_data_gap, aes(x = age_group, y = annual_incidence_rate, fill = age_group)) +
  geom_bar(stat = "identity") +
  labs(title = "Incidence of Pregnancies without Delivery by Age Group",
       x = "Age Group",
       y = "Incidence Rate (%)",
       fill = "Age Group") +
  theme_minimal()

overall_gap_annual  <- round((sum(summary_data_gap$total_preg_without_delivery)/sum(summary_data_gap$total_pregnancies))*100/9*4,0)


##By Zip:

# Summarize the data by ZIP & age group
gap_zip.df <- ob_data_preg_without_delivery %>%
  group_by(zip, age_group) %>%
  summarise(total_pregnancies = sum(grav),
            total_deliveries = sum(para.xtr),
            total_preg_without_delivery = sum(preg_without_delivery),
            incidence_rate = (total_preg_without_delivery / total_pregnancies * 100),
            annual_incidence_rate = round(incidence_rate/9*4,0)) %>% 
            select(zip, age_group, total_preg_without_delivery) %>% 
            pivot_wider(names_from = age_group, values_from = total_preg_without_delivery)


# Calculate column sums
col_sums <- colSums(gap_zip.df[,-1], na.rm = TRUE)

# Apply the percentage calculation
gap_zip.df.prcnt <- gap_zip.df %>%
  mutate(across(`<21`:`>35`, ~ round(. / col_sums[as.character(cur_column())] * 100, 1))) %>%
  mutate(across(`<21`:`>35`, ~ replace_na(., 0)))

kableExtra::kable(gap_zip.df.prcnt, caption ="Pregnancies without Delivery, by ZIP code")

# Step 2: Sum observed counts for each age group
observed_counts <- colSums(gap_zip.df.prcnt[,-1], na.rm = TRUE)

# Step 3: Calculate expected counts assuming uniform distribution
total_count <- sum(observed_counts)
num_age_groups <- length(observed_counts)
expected_counts <- rep(total_count / num_age_groups, num_age_groups)

# Step 4: Perform chi-squared test
chisq_test <- chisq.test(observed_counts, p = expected_counts / total_count)

# Extract the p-value
p_value <- chisq_test$p.value

# View the p-value
cat("p_value =", p_value, "\n")


```

For our patient population, annually, an average  of **```r overall_gap_annual```%** of pregnancies were not delivered. There is a statistically significant association between age group and pregnancy outcome (p = ```r p_val```). This suggests that the likelihood of pregnancies resulting in delivery or not, differs significantly across the different age groups.
When considering the annual incidence rate:

- Age group 21-29 has the highest annual incidence rate of pregnancies without delivery (17 per year).

- Age group <21 has the lowest annual incidence rate (13 per year).

- Age groups 29-35 and >35 have similar annual incidence rates (15 per year).

These findings suggest that women in the 21-29 age group experience the highest rate of pregnancies without delivery, while the youngest group (<21) has the lowest rate. The older age groups (29-35 and >35) have intermediate rates. This could imply that younger women (<21) and older women (>35) have different risk factors or behaviors affecting pregnancy outcomes compared to women in their late 20s.

_Of note: there is no significant variation by zip code._

Overall, the significant association and the annual incidence rates highlight important age-related differences in pregnancy outcomes, which could be crucial for targeted interventions and healthcare planning. 

Numbers were derived by adding the gravidarity and subtracting parity, while crediting one imminent delivery to P0; and accounting for multiple gestation.

For only the first-ever delivery alone (Para 0), here is the outcome:

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Load required libraries
library(dplyr)
library(ggplot2)
library(knitr)

# Read data
ob_data_tbl_xtra_initial <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(grav_fct = factor(grav),
         para_fct = factor(para),
         age_group = cut(age, breaks = c(-Inf, 21, 29, 35, Inf), labels = c("<21", "21-29", "29-35", ">35")),
         age_group = factor(age_group))

# Include only current pending first deliveries only, so para == 0
ob_data_filtered <- ob_data_tbl_xtra_initial %>%
  filter(para == 0)

# Calculate the number of pregnancies without delivery
ob_data_preg_without_delivery <- ob_data_filtered %>%
  mutate(preg_without_delivery = grav - para)

# Summarize the data by age group
summary_data <- ob_data_preg_without_delivery %>%
  group_by(age_group) %>%
  summarise(total_pregnancies = sum(grav),
            total_deliveries = n(),
            total_preg_without_delivery = sum(grav) - total_deliveries,
            incidence_rate = round(total_preg_without_delivery / total_pregnancies * 100, 0))

# Display the summary table
kable(summary_data, caption = "Incidence of Pregnancies without Delivery **for primips** by Age Group")

# Plot the incidence rates by age group
ggplot(summary_data, aes(x = age_group, y = incidence_rate, fill = age_group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(total_preg_without_delivery, "/", total_pregnancies)),
            position = position_stack(vjust = 1.05)) +
  labs(title = "Incidence of Pregnancies without Delivery by Age Group",
       x = "Age Group",
       y = "Incidence Rate (%)",
       fill = "Age Group") +
  annotate("text", x = 1, y = max(summary_data$incidence_rate) + 5, 
           label = "total_preg_without_delivery / total_pregnancies", 
           hjust = 0, vjust = 1, size = 4, color = "black") +
  theme_minimal() +
  theme(legend.position = "none")

overall_incidence_rate <- round(sum(summary_data$total_preg_without_delivery)/sum(summary_data$total_pregnancies)*100,0)

##by ZIP code

library(dplyr)
library(tidyr)

# Step 1: Summarize the data
summarized_df <- ob_data_preg_without_delivery %>%
  group_by(zip, age_group) %>%
  summarise(total_pregnancies = sum(grav),
            total_deliveries = n(),
            total_preg_without_delivery = sum(grav) - total_deliveries,
            incidence_rate = round(total_preg_without_delivery / total_pregnancies * 100, 0)) %>%
  select(zip, age_group, total_preg_without_delivery) %>%
  pivot_wider(names_from = age_group, values_from = total_preg_without_delivery)

# View the summarized data
#print(summarized_df)

# Step 2: Calculate column sums
col_sums <- colSums(summarized_df[,-1], na.rm = TRUE)

# Step 3: Calculate percentages
percent_df <- summarized_df %>%
  mutate(across(`<21`:`>35`, ~ round(. / col_sums[as.character(cur_column())] * 100, 1))) %>%
  mutate(across(`<21`:`>35`, ~ replace_na(., 0)))

# View the data with percentages
#print(percent_df)

kableExtra::kable(percent_df, caption = "Pregnancies without Delivery at time of First Delivery, by ZIP code")


##plotting top 4 zipcodes:

# Manually extracted data
data <- tibble::tribble(
  ~zip,   ~`<21`, ~`21-29`, ~`29-35`, ~`>35`,
  "93901",    0,    1.8,    0.0,    0.0,
  "93905",   48,   32.1,   35.7,   11.1,
  "93906",    4,   21.4,   21.4,    0.0,
  "93907",    4,    7.1,    0.0,    0.0,
  "93925",    0,    0.0,    0.0,    0.0,
  "93926",    0,    0.0,    0.0,    0.0,
  "93927",   32,   16.1,   14.3,   77.8,
  "93930",    4,    3.6,   25.0,   11.1,
  "93933",    0,    0.0,    0.0,    0.0,
  "93955",    0,    0.0,    0.0,    0.0,
  "93960",    8,   16.1,    0.0,    0.0,
  "95012",    0,    1.8,    3.6,    0.0,
  "95039",    0,    0.0,    0.0,    0.0,
  "99999",    0,    0.0,    0.0,    0.0
)

# Transform the data into long format
long_data <- data %>%
  pivot_longer(cols = `<21`:`>35`, names_to = "age_group", values_to = "percentage")

# Filter the data to retain only the specified zip codes
filtered_data <- long_data %>%
  filter(zip %in% c("93905", "93906", "93927", "93930"))

# Ensure the age groups are in the correct order
filtered_data$age_group <- factor(filtered_data$age_group, levels = c("<21", "21-29", "29-35", ">35"))

# Create the area plot with lines
ggplot(filtered_data, aes(x = age_group, y = percentage, fill = zip, group = zip)) +
  geom_area(position = "identity", alpha = 0.5) +
  geom_line(aes(color = zip), size = 1) +
  scale_color_brewer(palette = "Set3") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Percentage of Total Pregnancies Without Delivery at the first delivery by Age Group and Top Four Zip Codes",
       x = "Age Group",
       y = "Percentage",
       fill = "Zip Code",
       color = "Zip Code") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))


#facet

ggplot(filtered_data, aes(x = age_group, y = percentage, fill = zip, group = zip)) +
  geom_area(position = "identity", alpha = 0.5) +
  geom_line(aes(color = zip), size = 1) +
  scale_color_brewer(palette = "Set3") +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Percentage of Total Pregnancies Without Delivery at the first delivery by Age Group and Top Four Zip Codes",
       x = "Age Group",
       y = "Percentage",
       fill = "Zip Code",
       color = "Zip Code") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) +
  facet_wrap(~zip, ncol = 2)




# Step 4: Sum observed counts for each age group
observed_counts <- colSums(summarized_df[,-1], na.rm = TRUE)

# Step 5: Calculate expected counts assuming uniform distribution
total_count <- sum(observed_counts)
num_age_groups <- length(observed_counts)
expected_counts <- rep(total_count / num_age_groups, num_age_groups)

# View observed and expected counts
# print(observed_counts)
# print(expected_counts)

# Step 6: Perform chi-squared test
chisq_test <- chisq.test(observed_counts, p = expected_counts / total_count)

# Extract the p-value
p_value <- chisq_test$p.value

# View the p-value
#print(p_value)
cat("p_value =", p_value, "\n")

```
Focusing on first deliveries only, for every 4 delivered pregnancies, one pregnancy was not delivered, given an overall not-delivered incidence rate of  **```r overall_incidence_rate```%**. At the time of the first deliveries __only__, there is a significant variance by zip code:

- 93905 has highest delivery gaps in age-group <21, least in >35; even in the rest of the groups
- 93906 has least gap in <21 and >35; rest are even
- 93927 has the highest gap in >35 followed by <21; rest are even
- 93930 has the least gap in 21-29 and <21, highest gap in 29-35, followed by >35

Thus, regarding pregnancies without delivery disparity _at the time of the first delivery_, the data may show a pattern (See plot above) of:

- high disparity at younger ages and low disparity for older ages in 93905; 

- high disparity for the middle age groups and low disparity for the younger and the older ages in 93906;

- high disparity for the younger and the older age groups and low disparity for the middle age groups for 93927; 

- high disparity for the older age groups and low disparity for the younger age groups for 93930




Our records were collected on a continuum of time, so let's take a look. The time series plot is based on weekly deliveries: One can guesstimate the average weekly delivery and well as get a hint as to the trending of weekly deliveries.

### Time Series

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
library(lubridate)
library(plotly)

# Example data transformation
tst <- ob_data_chr %>%
  mutate(delivery_date = as.Date(delivery_date),
         week_start = floor_date(delivery_date, "week")) %>%
  group_by(week_start) %>%
  summarise(deliveries = sum(event, na.rm = TRUE)) %>%
  ungroup()


deliveries_ts <- ts(tst$deliveries, start = c(2022, 1), frequency = 52)

# Plot the time series with basic customizations


 # Customize the x-axis
dates <- seq.Date(from = as.Date("2022-01-01"), by = "week", length.out = length(tst$deliveries))

# Extract year and week number
years <- format(dates, "%Y")
weeks <- as.numeric(format(dates, "%V"))  # ISO week numbers from 1-52
 
# Correct the week number for the last week if it exceeds 52
weeks[weeks > 52] <- 52

# Combine year and week number with conditional year display
formatted_dates <- ifelse(weeks == 1, years, weeks)

# Create x-axis labels at year boundaries and weekly
year_changes <- which(weeks == 1)
week_indices <- seq(1, length(dates), by = 1)


library(ggplot2)
library(plotly)

dates <- seq.Date(from = as.Date("2022-01-01"), by = "week", length.out = length(tst$deliveries))
data <- data.frame(Date = dates, Deliveries = tst$deliveries)

# Create the ggplot with updated linewidth aesthetic
p <- ggplot2::ggplot(data, aes(x = Date, y = Deliveries)) +
  geom_line(color = "blue", linewidth = .5) +  # Use `linewidth` instead of `size`
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  labs(title = "Weekly Deliveries", y = "Deliveries", x = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

# Print the ggplot to verify it
#p

# Convert ggplot to plotly
plotly::ggplotly(p)

```

### Data

Delivery Data for our patients at NMC is kept in NMC OB Delivered Log repository; it captures the following types and categories of information relevant to our analysis:

#### DATA FIELDS

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
col_names_nmc <- names(ob_data_chr)
# Omit the last 9 column names
col_names_nmc <- head(col_names_nmc, -12)

# Determine the number of rows needed for three columns
n <- length(col_names_nmc)
n_rows <- ceiling(n / 3)

# Pad the column names with NA to fit the matrix
col_names_nmc <- c(col_names_nmc, rep(NA, 3 * n_rows - n))

# Create a matrix with 3 columns
col_matrix <- matrix(col_names_nmc, ncol = 3, byrow = TRUE)

# Convert the matrix to a data frame for better printing
col_df <- as.data.frame(col_matrix)

# Print the table using kable for better formatting in Quarto
kableExtra::kable(col_df, col.names = NULL, caption = "Data Field Names")
```



To facilitate analyses, we initially added these derived columns below:

```{r echo=FALSE, warning=FALSE, message=FALSE}
column_names <- data.frame(Column_Names = names(ob_data_chr))
derived_cols <- data.frame(column_names[(nrow(column_names) - 12):nrow(column_names), ]) 
colnames(derived_cols) =NULL
derived_cols_df <- data.frame(Derived_column = derived_cols,
                              Explanation = c("gestational age in DAYS", "Admission-to-Delivery time", "grav minus para", "Delivery", "Duration: Delivery time minus Admit time",
                           "Duration ranges", "Age ranges used by UDS", "High Risk OB by definition", "Consolidated intrapartum conditions",
                            "Consolidated labor types", "Consolidated presentation","Consolidated Delivery method" , "PreTerm / Non-Preterm, informed by Clustering"))
kable(derived_cols_df, caption = "Derived Fields")
```

## Descriptive statistics

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Load necessary libraries
#library(shiny)
library(tableone)
library(dplyr)
library(DT)
library(kableExtra)


ob_data_tbl1 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster))
practiceTbl <- CreateTableOne(data = ob_data_tbl1)
vars <- setdiff(names(ob_data_chr), "gest_age")
practiceTbl_no_gest_age <- CreateTableOne(data = ob_data_tbl1, vars = vars)

#Num vars
num_var_ds <- ob_data_tbl1 %>% 
  dplyr::select_if(is.numeric) %>% 
  dplyr::select(-event)

num_var_Tbl <- CreateTableOne(data = num_var_ds)
num_var_prn <- kableone(num_var_Tbl)

# Cat vars
non_num_var_ds <- ob_data_tbl1 %>%
  dplyr::select(-c(age, grav, para, weight, apg1, apg5, diff_grav_para, gest_age_days, adm_to_del_tm))
vars <- setdiff(names(non_num_var_ds), "gest_age")
non_num_var_no_gest_age <- CreateTableOne(data = non_num_var_ds, vars = vars)
cat_var_prn <- kableone(non_num_var_no_gest_age)
#print(practiceTbl, showAllLevels = TRUE)

# practiceTbl$CatTable
# practiceTbl$ContTable
#practiceTbl$MetaData

strata_tbl_clst <- CreateTableOne(data = ob_data_tbl1, strata = "cluster", vars = "zip")
#strt_clst_prn <- print(strata_tbl_clst, nonnormal = "cluster", cramVars = "cluster")

strata_tbl_hghRsk <- CreateTableOne(data = ob_data_tbl1, strata = "hghRsk", vars = "zip")
#strt_hghRsk_prn <- print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )

# Use Kable / KableExtra/ kableone

cat_tbl <- kableone(practiceTbl$CatTable)
#kableExtra::kable(cat_tbl, format = "html" )
#cat_tbl

cont_tbl <- kableone(practiceTbl$ContTable)
#kableExtra::kable(cont_tbl, format = "html")
#cont_tbl

strata_tbl_clst <- CreateTableOne(data = ob_data_tbl1, strata = "cluster", vars = "zip")
#print(strata_tbl_clst, nonnormal = "cluster", cramVars = "cluster" )
#kableone(print(strata_tbl_clst, nonnormal = "cluster", cramVars = "cluster" ))
#kableExtra::kable(kableone(strta_zip_clst_prn), format = "html")

strata_tbl_hghRsk <- CreateTableOne(data = ob_data_tbl1, strata = "hghRsk", vars = "zip")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
#kableone(print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk"))
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

strata_tbl_clst_rsk <- CreateTableOne(data = ob_data_tbl1, strata = "hghRsk", vars = "cluster")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
#kableone(print(strata_tbl_clst_rsk, nonnormal = "hghRs", cramVars = "ghgRsk"))
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")


```


Let's summarize the CSVS data from NMC Delivery Log, starting with continuous or numerical variables:

### Numeric Variables

```{r echo=FALSE, message=FALSE, warning=FALSE}
num_var_prn
```

Next, we study the delivery frequency among different variables

<!-- ### Frequency View -->

<!-- ```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE} -->
<!-- # Load necessary libraries -->
<!-- #library(shiny) -->
<!-- library(tableone) -->
<!-- library(dplyr) -->

<!-- # Load your data frame -->
<!-- ob_data_tbl1_shny <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%  -->
<!--   dplyr::select(-c(age, grav, para, weight, apg1, apg5, diff_grav_para, gest_age_days, adm_to_del_tm, gest_age, adm_date, delivery_date, event, time, baby_sq)) %>%  -->
<!--   dplyr::mutate(cluster = as.factor(if_else(cluster == 2, 1, 0))) -->
<!-- ``` -->



Exploratory Data analyses yields rich information. However, we need to proceed with analytics so we can query the data in order to get desired information. We need to know how variables of interest affect one another, and how they compare with one another in their contribution to the results or outcome of interest to us. The outcome of interest to us in OB can be found in prenatal care (before delivery), during delivery; the event of delivery itself and right after delivery.

A pre-delivery outcome, for example, is pre-term (prematurity) status; another is High Risk OB status. During delivery, we are concerned about outcomes categorized by "intrapartal conditions" such as preeclampsia and also "intrapartal events" which encompass abnormal labor dynamics. We want to know: Did delivery occur at all? How long was the time from admission to delivery? What are the chances that at a given time delivery will have occurred, and what percent of patients will have delivered by a certain time? For each of these outcomes or combinations thereof, our interest is how and if variables like maternal age, zip code, parity, high risk status, affected the end-result.

We all are curious to know if and how residential geography affects OB outcomes. We will use zipcode information (the variable, "zip") and a standard MAP to explore this curiosity. High Risk OB is a technical definition; we can explore what contributes to it, as well as determine how it affects other outcomes as mentioned above. A fundamental method in statistical data analytics is to "throw in" all the data, together, "stir it up" and see if the data will settle out in a natural separation into groups--called clustering--that achieves statistical significance. If we can find significant clusters, we can also use them, along with zip and high risk status as probes with which to analyze outcomes.

We applied cluster statistics on the data using fetal weight against gestational age: here is our finding:

### Cluster Analysis

Right now, we are going to look into the important OB metrics of Gestational Age and Birth weight of our patients using Cluster Analysis[^1], a useful method in medical research for identifying patterns among patients that help to characterize clinical states and associated risks and treatment outcomes without predefined categories.

[^1]: Cluster analysis is a statistical technique used to group sets of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It is especially useful in medical research for identifying patterns among patients, aiding in understanding behaviors, disease progression, and treatment outcomes without predefined categories.

Using k-means clustering algorithm with a choice of 3 clusters (arbitrary), we obtained these stats and plots:

```{r echo=FALSE, warning=FALSE, message=FALSE}
#Cluster Analysis----

#setwd(getwd())
ob_clustering <- ob_data_fctr %>% 
  dplyr::select(age, gest_age_days, weight)

ob_clustering_scaled <- scale(ob_clustering)


# Cluster Analysis
set.seed(333)

# Perform K-means clustering
k <- 3
kmeans_result <- kmeans(ob_clustering_scaled, centers = k, nstart = 25)
 
# Add the cluster assignments to the original dataset
#Remove current var cluster
ob_data_fctr$cluster <- NULL
ob_data_fctr$cluster <- as.factor(kmeans_result$cluster)
 
# Map numeric cluster IDs to meaningful names
cluster_names <- c("Term, full", "Term, early",   "Pre-term")
names(cluster_names) <- 1:3
ob_data_fctr$cluster <- factor(ob_data_fctr$cluster, levels = 1:3, labels = cluster_names)

# Now, summarizing clusters with across
cluster_summary <- ob_data_fctr %>%
    group_by(cluster) %>%
    summarise(
        across(
            .cols = c(gest_age_days, weight),
            .fns = ~ round(mean(.), 0),
            .names = "mean {.col}"
        ),
        count = n(),
        percent = round((n() / nrow(ob_data_fctr)) * 100, 1)
    )

#kable(cluster_summary, caption = "Cluster Stats")


# Plotting
library(ggplot2)

p <- ggplot2::ggplot(ob_data_fctr, aes(x = gest_age_days, y = weight, color = cluster)) +
    geom_point(alpha = 0.5) +
    #geom_hline(yintercept = 3000, linetype = "dashed", color = "red") +
    #geom_vline(xintercept = 259, linetype = "dashed", color = "blue") +
    theme_minimal() +
    scale_x_continuous(
        name = "Gestational Age in completed Weeks",  # Rename the x-axis to "Weeks"
        breaks = seq(0, max(ob_data_fctr$gest_age_days), by = 7),  # Set breaks every 7 days
        labels = function(x) floor(x / 7)  # Convert days to floor weeks
    ) +
    labs(title = "Cluster Analysis on Gestational Age and Birth Weights",
         x = "Gestational Age", y = "Delivery Birthweightin grams")

p

```

Can we identify this cluster grouping using known and relevant OB metrics for birth weight and gestational age categorization? Yes:

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Plotting
library(ggplot2)
id_clust <- p +
geom_hline(yintercept = 3000, linetype = "dashed", color = "red") +
geom_vline(xintercept = 259, linetype = "dashed", color = "blue") 
id_clust 

```

```{r echo=FALSE, warning=FALSE, message=FALSE}
clst_smm <- cluster_summary %>% 
  data.frame()
#kable(cluster_summary, caption = "Cluster Stats")

kableExtra::kable(clst_smm, caption = "Cluster Stats", format = "html", table.attr = "class='table' style='font-size: 14px; width: auto !important;'") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center", font_size = 14)

  
```

The plot shows the separation of a left lower quad (LLQ) cluster from an un-separated 2 other clusters. Further analysis shows that the LLQ cluster is statistically significantly different from either of the 2 other clusters, while this is not the case between these 2 clusters themselves. By adding the 3000 GRAM birth weight dashed horizontal line and the 37 week dashed vertical line, representing the demarcations for SGA and Preterm respectively we find a correspondence between SGA and preterm delivery, a widely known outcome. The clusters were named after this finding, as: Pre-term; Term, early; and Term, full (latter two are from 37 weeks and over). CAVEAT: Actual preterm deliveries as defined by gestational age (alone) under 37 weeks account for only **`r prcnt_preterm`%** of the deliveries, compared to the cluster group here defined by both SGA AND Preterm (**`r clst_smm[3,5]`%**); both are relatively small numbers compared to the total number of deliveries.

## High Risk OB

Let's officially define "High Risk OB", based on maternal age greater than 35, diagnosis of HTN or Preeclampsia, DM and (for CSVS) multiple gestation and or non-vertex presentation. By these criteria, **`r prcnt_hghRsk`%** are classified as High Risk.

## Service Area by Zipcode

Moving on to zip codes...what is CSVS OB service area? Let's now turn our attention to where our patients come from using a zipcode information (--99999 represents no zip code entered);

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(tableone)
library(dplyr)
library(knitr)
library(kableExtra)
library(tidyr)
library(tibble) 

ob_data_tbl2 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster))

# Create the table
zip <- CreateTableOne(data = ob_data_tbl2, vars = "zip")

# Convert to a clean data frame
zip_df <- as.data.frame(print(zip, printToggle = FALSE))

# Print the table to HTML
kableExtra::kable(zip_df, format = "html", caption = "Deliveries by Zip Code", table.attr = "class='table' style='font-size: 14px; width: auto !important;'") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "center", font_size = 14)

# ggplot horizontal bar plot
zip_df_plt <- zip_df %>%
  rownames_to_column(var = "zip") %>%
  slice(-1:-2) %>%  # Remove the first two rows
  separate(Overall, into = c("freq", "percent"), sep = "\\(") %>%
  mutate(freq = as.numeric(trimws(freq)),
         percent = as.numeric(gsub("[()%]", "", percent)))

# Print the resulting data frame
#print(zip_df_plt)


library(ggplot2)

# Create the bar plot
p <- ggplot2::ggplot(zip_df_plt, aes(x = reorder(zip, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = paste0(freq, " (", round(freq / sum(freq) * 100, 1), "%)")), hjust = -0.1) +
  coord_flip() +
  labs(title = "Zip Code Frequency",
       x = "Zip Code",
       y = "Frequency") +
  theme_minimal() +
  expand_limits(y = max(zip_df_plt$freq) * 1.2)  # Adjust y-axis limit to provide space for labels

# Print the plot
print(p)

# Save the plot with adjusted dimensions if needed
ggsave("zip_code_frequency_plot.png", plot = p, width = 10, height = 6)

```

The top 5 are 93905 (one-third of all deliveries), 93906 (one-fifth), 93927 (one-sixth); 95012 (one-twelfth), 93930 (one-twentieth). One thing stands out: considering its size and straddling location, 99308 is a sparse draw for CSVS deliveries.

Let's place this information within the context of a map of the County:

### Map

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(leaflet)
library(sf)
library(dplyr)
library(RColorBrewer)

# Load spatial and data
zip_code_sf <- readRDS("CSVS service area with Zipcode and Population  with sf and shp.RDS")
ob_data_map <- readRDS("Working OB Dataset.RDS")

# Summarize data by ZIP code
ob_data_map <- ob_data_map %>%
  count(zip)

# Convert ZIP codes to character
ob_data_map$zip <- as.character(ob_data_map$zip)
zip_code_sf$ZCTA5CE20 <- as.character(zip_code_sf$ZCTA5CE20)

# Convert zip_code_sf to WGS84 datum
zip_code_sf <- st_transform(zip_code_sf, crs = 4326)

# Perform the join
ob_data_map_joined <- zip_code_sf %>%
  left_join(ob_data_map, by = c("ZCTA5CE20" = "zip"))

# Check for missing values and handle them
ob_data_map_joined$n[is.na(ob_data_map_joined$n)] <- 0

# Create a color palette function
colorPalette <- colorBin(palette = "YlOrRd", domain = ob_data_map_joined$n, bins = 5)

# Generate the leaflet map
ob_data_map <- leaflet(data = ob_data_map_joined) %>%
  addTiles() %>%
  addPolygons(
    fillColor = ~colorPalette(n),
    color = "#BDBDC3",
    fillOpacity = 0.7,
    weight = 1,
    opacity = 1,
    highlight = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.7,
      bringToFront = TRUE
    ),
    label = ~paste("ZIP Code:", ZCTA5CE20, "<br/>Count:", n),
    labelOptions = labelOptions(
      direction = 'auto',
      noHide = FALSE,
      textOnly = TRUE
    )
  ) %>%
  addLegend(
    pal = colorPalette, 
    values = ~n, 
    opacity = 0.7, 
    title = "Count",
    position = "bottomright"
  ) %>%
  setView(lng = -121.895, lat = 36.674, zoom = 9)

ob_data_map
```

This map shows that zipcode 93908 sitting smack within "CSVS territory" has extremely low deliveries. Census Information[^2] reveals that this discrepancy is demographical (socioeconomic) and related to CSVS usual drawing pool. This zipcode, 93908 ("Salinas - Corral De Tierra"), compared with say, 93905:

[^2]: (source: <https://www.unitedstateszipcodes.org/93908/>, viewed 2024-05-27)

-   has primarily white residents Vs "primarily other race" residents

-   has Median Household Income \$108,093 Vs \$41,607

-   has Median Home Value \$616,000, Vs \$203,400[^3]

[^3]: (source: <https://www.unitedstateszipcodes.org/93908/>, viewed 2024-05-27)

# Data Analysis

Having identified certain parameters of interest (Zip code distribution, Prematurity and High Risk), let's proceed to see what our data is telling us. We will start by taking another look at the weekly deliveries, to check for patterns (seasonality) and overall direction of the total number of deliveries (trend); and then, predict number of deliveries as forecast for the next quarter.

Is there seasonality to the deliveries? Do we have a trend? 

##### Time series decomposition

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(dplyr)
library(lubridate)
library(forecast)
library(ggplot2)
library(tidyr)

# Example data transformation
tst <- ob_data_chr %>%
  mutate(delivery_date = as.Date(delivery_date),
         week_start = floor_date(delivery_date, "week")) %>%
  group_by(week_start) %>%
  summarise(deliveries = sum(event, na.rm = TRUE)) %>%
  ungroup()

# Create the time series object
deliveries_ts <- ts(tst$deliveries, start = c(2022, 1), frequency = 52)

#Original ts:
plot(deliveries_ts, xaxt = "n", main = "Weekly Deliveries", ylab = "Deliveries", xlab = "", col = "blue", lwd = 2)
# Rotate x-axis labels 60 degrees
par(xpd = TRUE)
# text(x = seq(2022, 2022 + (length(tst$deliveries) - 1) / 52, by = 1 / 52), 
#      y = par("usr")[3] - 1, 
#      labels = formatted_dates, 
#      srt = 60, 
#      adj = 1, 
#      cex = 0.7)

# Add year labels
text(x = seq(2022, 2022 + (length(tst$deliveries) - 1) / 52, by = 1 / 52)[year_changes], 
     y = par("usr")[3] - 2, 
     labels = years[year_changes], 
     srt = 60, 
     adj = 1, 
     cex = 0.7, font = 2)


# STL decomposition facet:

stl_fit <- stl(deliveries_ts, s.window = 20)

# Extract the components
components <- as.data.frame(stl_fit$time.series)
components$date <- seq.Date(from = as.Date("2022-01-01"), by = "week", length.out = nrow(components))
components$data <- tst$deliveries

# Reorder the components for plotting
components_long <- components %>%
  dplyr::select(date, data, seasonal, trend) %>%
  pivot_longer(cols = -date, names_to = "component", values_to = "value")

# Plot the components ordered by Data, Seasonal, Trend
plotly::ggplotly(
  ggplot2::ggplot(components_long, aes(x = date, y = value, color = component)) +
  geom_line(size = 1) +
  facet_wrap(~component, scales = "free_y", ncol = 1) +
  labs(title = "STL Decomposition of Weekly Deliveries",
       y = "Value",
       x = "Date",
       color = "Component") +
  scale_color_manual(values = c("data" = "blue", "seasonal" = "green", "trend" = "red")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
)
###

#STL decomposition superimposition
# Perform STL decomposition
stl_fit <- stl(deliveries_ts, s.window = "periodic")

# Extract the components
components <- as.data.frame(stl_fit$time.series)
components$date <- seq.Date(from = as.Date("2022-01-01"), by = "week", length.out = nrow(components))
components$data <- tst$deliveries

# Adjust the seasonal component to match the scale of the data
components$seasonal_adjusted <- components$seasonal + mean(components$data)

# Superimpose data, trend, and adjusted seasonal components
ggplot2::ggplot(components, aes(x = date)) +
  geom_line(aes(y = data, color = "Data"), size = 1) +
  geom_line(aes(y = trend, color = "Trend"), size = 1) +
  geom_line(aes(y = seasonal_adjusted, color = "Seasonal"), size = 1) +
  labs(title = "STL Decomposition of Weekly Deliveries",
       y = "Deliveries",
       x = "Date",
       color = "Component") +
  scale_color_manual(values = c("Data" = "blue", "Trend" = "red", "Seasonal" = "green")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

```

The top graph shows the same weekly deliveries time series seen earlier. The second and third panels tease out the series to extract seasonality information and trend, respectively. The last panel superimposes the seasonality and trend graphs on the original time series. We can see a trend that "trough-ed" around end of the first quarter of 2023 and is rising since then. The seasonality graph nearly matches the original data, indicating that there is a significant seasonal variation in weekly deliveries among our patients.

What might account for the trough? Below we compare the original (full data) trend with the trend of each of the components of various variables to look for patterns.

#### Checking for trend disparity

```{r echo=FALSE, warning=FALSE, message=FALSE}

library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(scales)
library(trend)

# Load your dataset
ob_data_fctr <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS")

# Ensure date is in the correct format
ob_data_fctr_1 <- ob_data_fctr %>%
  mutate(delivery_date = as.Date(delivery_date))

# Get top 5 zip codes by delivery frequency
top_zips <- ob_data_fctr_1 %>%
  count(zip, sort = TRUE) %>%
  top_n(5, n) %>%
  pull(zip)

# Refactor the zip column to include only the top 5 zip codes
ob_data_fctr_2 <- ob_data_fctr_1 %>%
  mutate(zip = factor(zip, levels = top_zips))

# Cut age into specified groups
ob_data_fctr_3 <- ob_data_fctr_2 %>%
  mutate(age_group = cut(age, breaks = c(-Inf, 21, 29, 35, Inf), labels = c("<21", "21-29", "29-35", ">35")))

# Compute the overall trend
overall_trend <- ob_data_fctr_3 %>%
  mutate(week_start = floor_date(delivery_date, "week")) %>%
  group_by(week_start) %>%
  summarise(deliveries = sum(event, na.rm = TRUE), .groups = 'drop') %>%
  mutate(trend = predict(loess(deliveries ~ as.numeric(week_start), span = 0.3)))

# Function to create trend plot for a given variable
create_trend_plot <- function(data, var) {
  if (var == "zip") {
    # Filter only top 5 zip codes
    data <- data %>% filter(zip %in% top_zips)
  }
  
  # Aggregate weekly deliveries for the given variable
  ts_data <- data %>%
    mutate(week_start = floor_date(delivery_date, "week")) %>%
    group_by(week_start, !!sym(var)) %>%
    summarise(deliveries = sum(event, na.rm = TRUE), .groups = 'drop') %>%
    ungroup() %>%
    complete(week_start, !!sym(var), fill = list(deliveries = 0))
  
  # Apply Loess smoothing to capture the trend
  ts_data <- ts_data %>%
    group_by(!!sym(var)) %>%
    mutate(trend = predict(loess(deliveries ~ as.numeric(week_start), span = 0.3))) %>%
    ungroup()
  
  # Plot the trend
  ggplot2::ggplot(ts_data, aes(x = week_start, y = trend, color = !!sym(var))) +
    geom_line(size = 1) +
    facet_wrap(as.formula(paste("~", var)), scales = "free_y", ncol = 1) +
    geom_line(data = overall_trend, aes(x = week_start, y = trend, color = "Overall Trend"), linetype = "dashed", size = 1, inherit.aes = FALSE) +
    scale_color_manual(values = c(setNames(hue_pal()(length(unique(ts_data[[var]]))), unique(ts_data[[var]])), "Overall Trend" = "black")) +
    labs(title = paste("Trend of Weekly Deliveries by", var),
         y = "Trend",
         x = "Date",
         color = var) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
}

# Function to calculate Mann-Kendall trend test p-values
calculate_trend_tests <- function(data, var) {
  data %>%
    mutate(week_start = floor_date(delivery_date, "week")) %>%
    group_by(week_start, !!sym(var)) %>%
    summarise(deliveries = sum(event, na.rm = TRUE), .groups = 'drop') %>%
    filter(!is.na(!!sym(var))) %>%
    group_by(!!sym(var)) %>%
    summarise(
      mann_kendall_p = mk.test(deliveries)$p.value
    ) %>%
    mutate(variable = var)
}

# List of variables to test
variables <- c("zip", "hghRsk", "cluster", "age_group", "uds_age")

# Calculate trend tests for each variable and combine results
trend_results_list <- lapply(variables, function(var) {
  calculate_trend_tests(ob_data_fctr_3, var)
})

# Combine all results into one data frame
trend_results <- bind_rows(trend_results_list)

# Concatenate the variable levels with the variable name and remove NA columns

# Convert variables to characters before using coalesce
trend_results <- trend_results %>%
  mutate(
    zip = as.character(zip),
    hghRsk = as.character(hghRsk),
    cluster = as.character(cluster),
    age_group = as.character(age_group),
    uds_age = as.character(uds_age),
    variable_level = paste(variable, coalesce(zip, hghRsk, cluster, age_group, uds_age), sep = ": ")
  ) %>%
  select(variable_level, mann_kendall_p)

# Add interpretative column
trend_results <- trend_results %>%
  mutate(interpretation = ifelse(mann_kendall_p <= 0.05, "Significant trend, contributes uniquely", "No significant trend, does not contribute uniquely"))

# Print the cleaned p-values summary table with interpretation
#print(trend_results)
#kableExtra::kable(trend_results, caption = "Trends Contribution Effect")

# Create and print trend plots for each specified variable
zip_plot <- create_trend_plot(ob_data_fctr_3, "zip")
print(zip_plot)

hghrsk_plot <- create_trend_plot(ob_data_fctr_3, "hghRsk")
print(hghrsk_plot)

ob_data_fctr_3$cluster <- as.factor(ob_data_fctr_3$cluster)
cluster_plot <- create_trend_plot(ob_data_fctr_3, "cluster")
print(cluster_plot)

age_plot <- create_trend_plot(ob_data_fctr_3, "age_group")
print(age_plot)

uds_age_plot <- create_trend_plot(ob_data_fctr_3, "uds_age")
print(uds_age_plot)

kableExtra::kable(trend_results, caption = "Trends Contribution Effect")


```

If the trend of the original data ("Overall Trend) parallels the variable's trend visually, the latter must be contributing significantly to the shape and direction of the former. However, we must rely on p-values for statistical significance. Thus, the following contribute significantly to the trend that we observed:

- zip: 93930
- hghRsk: no
- age_group: 29-35

but, not the rest of the zipcodes, cluster, age groups (including UDS age group)


An important and practical goal of data analysis is to make future forecasts. Based on the weekly delivery numbers for 117 weeks or nine quarters, we can predict the following quarter's performance: check the plot, and then the table for comparison between actual and predicted data. (Use the cursor cross-mark to expand any segment by pressing the button and dragging the mouse)

### Time Series Forecasts

```{r echo=FALSE, warning=FALSE, message=FALSE}
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(timetk)
library(tsibble)
library(prophet)
library(Metrics)
library(feasts)
library(fabletools)
library(plotly)



# Summarize deliveries by week
ob_data_ts <- readRDS("Working OB Dataset.RDS") %>%
  mutate(delivery_date = as.Date(delivery_date),
         week_start = floor_date(delivery_date, "week")) %>%
  group_by(week_start) %>%
  summarise(deliveries = sum(event, na.rm = TRUE)) %>%
  ungroup()

# Ensure date columns are Date objects
ob_data_ts$week_start <- as.Date(ob_data_ts$week_start)

# Prepare data for prophet
prophet_data <- ob_data_ts %>%
  rename(ds = week_start, y = deliveries)

# Fit prophet model with weekly seasonality
prophet_model <- prophet(prophet_data, weekly.seasonality = TRUE)

# Generate future dataframe
future <- make_future_dataframe(prophet_model, periods = 13, freq = "week")

# Forecast
forecast <- predict(prophet_model, future)

# Ensure forecast dataframe ds column is Date object
forecast$ds <- as.Date(forecast$ds)

# Combine historical data and forecast
combined_data <- bind_rows(
  prophet_data %>% mutate(type = "Historical"),
  forecast %>% mutate(type = "Forecast", y = yhat) %>% dplyr::select(ds, y, type)
)

# Check for and remove duplicate rows if any
combined_data <- combined_data %>%
  distinct(ds, .keep_all = TRUE)

# Round forecast values
forecast <- forecast %>% 
  mutate(yhat = round(yhat))

# Extract forecast values as DataFrame
forecast_df <- combined_data %>%
  filter(type == "Forecast") %>%
  dplyr::select(ds, y) %>%
  rename(Date = ds, Forecast = y)


# Plot: Full historical + forecast with explicit weeks for x-axis
p_ts_frcst <- ggplot2::ggplot() +
  geom_line(data = combined_data, aes(x = ds, y = y, color = type), size = 1) +
  scale_x_date(date_breaks = "1 week", date_labels = "%Y-%U") +
  labs(title = "Forecast for the Next Quarter",
       x = "Week",
       y = "Number of Deliveries",
       color = "Legend") +
  scale_color_manual(values = c("Historical" = "black", "Forecast" = "orange")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

#print(p_ts_frcst)
ggplotly(p_ts_frcst)

# Output forecast values as DataFrame
forecast_df <- forecast_df %>% mutate(Forecast = round(Forecast)) %>% data.frame(.)

kableExtra::kable(forecast_df, caption = "Next Quarter's Weekly Deliveries Forecast")


```



We are going to continue the analysis of our data using zip-code distribution, Clustering and High Risk OB before moving to generalized analyses.

How does zip code affect the distribution of outcomes like Preterm (derived from cluster analysis)and High Risk OB? We will throw in another interesting outcome, Delivery Method (vaginal versus Caeserian section)?

### Zip Code effect

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
library(tableone)
ob_data_tbl3 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  mutate(cluster = as.factor(cluster)) %>% 
  dplyr::select(-adm_date, -delivery_date) 


strata_tbl_clst_zip <- CreateTableOne(data = ob_data_tbl3, strata = "cluster", vars = "zip")
#print(strata_tbl_clst, nonnormal = "cluster", cramVars = "cluster" )
kableone(strata_tbl_clst_zip, nonnormal = "cluster", cramVars = "cluster" , caption = "Distribution of Preterm by Zip code")
#kableExtra::kable(kableone(strta_zip_clst_prn), format = "html")

strata_tbl_hghRsk_zip <- CreateTableOne(data = ob_data_tbl3, strata = "hghRsk", vars = "zip")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_hghRsk_zip, cramVars = "hghRsk", caption = "Distribution of High Risk OB by Zip code")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

strata_tbl_del_zip <- CreateTableOne(data = ob_data_tbl3, strata = "del_method_cnsldt", vars = "zip")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_del_zip, cramVars = "del_method_cnsldt", caption = "Distribution of Delivery Methods by Zip code")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

```


First, always check the p-values for the various distribution tables. 

The answer is: there is no significant difference in the relative distribution, among and within different zip codes, of Pre-term deliveries (p=0.508), High Risk OB (p=0.792), Delivery Method (p=0.976).

What about the High Risk OB and the outcomes of Preterm & Delivery Method?

```{r echo=FALSE, warning=FALSE, message=FALSE}
strata_tbl_clst_rsk <- CreateTableOne(data = ob_data_tbl3, strata = "cluster", vars = "hghRsk")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_clst_rsk, nonnormal = "cluster", cramVars = "hghRsk", caption = "Distribution of Preterm by High Risk OB")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

strata_tbl_del_rsk <- CreateTableOne(data = ob_data_tbl3, strata = "del_method_cnsldt", vars = "hghRsk")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_del_rsk, nonnormal = "del_method_cnsldt", cramVars = "hghRsk", caption = "Distribution of Delivery Method by High Risk OB")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

```

With a p-value \<0.001, a statistically significant difference is obvious in the distribution of Cesarean sections compared to Vaginal deliveries among High Risk OB. The same holds true for Preterm distribution: a significantly higher percent are in High Risk group than non-Preterm (39.8% Vs 19.2%, p-value \<0.001).


Finally, same question for Preterm and outcomes of High Risk and Delivery Method:

```{r echo=FALSE, warning=FALSE, message=FALSE}

strata_tbl_rsk_clst <- CreateTableOne(data = ob_data_tbl3, strata = "hghRsk", vars = "cluster")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_rsk_clst, nonnormal = "cluster", cramVars = "hghRsk", caption = "Distribution of Preterm by High Risk OB")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

strata_tbl_del_clst <- CreateTableOne(data = ob_data_tbl3, strata = "del_method_cnsldt", vars = "cluster")
#print(strata_tbl_hghRsk, nonnormal = "cluster", cramVars = "hghRsk" )
kableone(strata_tbl_del_clst, nonnormal = "cluster", cramVars = "del_method_cnsldt", caption = "Distribution of Preterm by Delivery Method")
#kableExtra::kable(kableone(strta_zip_hghRsk_prn), format = "html")

```

- The distribution of High Risk OB is significantly higher compared with non_High Risk among Preterm deliveries (p=\<0.001).

- The distribution of C-sections is significantly higher compared with vaginal delivery among Preterm deliveries (p=0.039).






<!-- Next, we perform correlation analysis on our data -->

<!-- ### Correlation Analysis (Spearman method) -->

<!-- Correlation -->

<!-- :   Correlation is a statistical measure that describes the strength and direction of a relationship between two variables. It ranges from -1 to 1, where values close to 1 or -1 indicate strong positive or negative relationships, respectively, and values close to 0 indicate no relationship. -->

<!-- We have prepared an interactive table to facilitate the exercise. Keep in mind that the included p-value determines whether there is a statistical significance or not. ***There are 15x14 possible pair-combinations, so, think, "parsimony".*** -->

```{r echo=FALSE, warning=FALSE, message=FALSE}
#library(shiny)
library(XICOR)
library(dplyr)
library(ggplot2)
library(DT)

# Assuming the dataset is loaded here
ob_data_clst_xi <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  mutate(cluster = as.factor(cluster)) %>% 
         dplyr::select(zip, lbr_type_cnsldt, membrane_rupture, intrapartal_conditions,
                presentation_cnsldt, conditions_cnsldt, uds_age, adm_to_del_tm_cat, hghRsk, del_method_cnsldt, cluster,
                age, grav, para, diff_grav_para) %>%
  mutate(across(c(zip, lbr_type_cnsldt, membrane_rupture, intrapartal_conditions,
                  presentation_cnsldt, conditions_cnsldt, uds_age, adm_to_del_tm_cat, hghRsk, del_method_cnsldt, cluster), as.factor)) %>%
  mutate(across(c(zip, lbr_type_cnsldt, membrane_rupture, intrapartal_conditions,
                  presentation_cnsldt, conditions_cnsldt, uds_age, adm_to_del_tm_cat, hghRsk, del_method_cnsldt, cluster), as.numeric))
```





# Hospital: Labor & Delivery

Let's now shift our attention to hospital events related to OB care. First, we will extend the previous analyses to hospital events. Then, we will use a different method of analysis known as Survival Analysis (SA) or by Time-to-Event (TTE) Analysis to probe our data. The hospital **event of interest is Delivery**; along with related matters of labor type, delivery method, intrapartal events and intrapartal conditions. Here are the variables and their respective member-items:

### Variables

```{r echo=FALSE, warning=FALSE, message=FALSE}
# List of factor variable names

library(dplyr)
library(tidyr)

ob_data_tbl4 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster= as.factor(cluster))

# List of factor variable names
factor_vars <- c("lbr_type_cnsldt", "del_method_cnsldt", "conditions_cnsldt", "intrapartal_events")  # Replace with your actual factor variable names

# Function to get levels and create a data frame
get_levels_df <- function(var_name) {
  levels_df <- data.frame(
    Variable = var_name,
    Items = levels(ob_data_tbl4[[var_name]]),
    stringsAsFactors = FALSE
  )
  return(levels_df)
}

# Apply the function to each factor variable and combine the results
levels_list <- lapply(factor_vars, get_levels_df)
combined_levels_df <- bind_rows(levels_list)

# Group by Variable and keep only the first instance of Variable name
formatted_df <- combined_levels_df %>%
  group_by(Variable) %>%
  mutate(Variable = ifelse(row_number() == 1, Variable, "")) %>%
  ungroup()

# Print the formatted data frame
#print(formatted_df)
tableone::kableone(formatted_df)


```

We have consolidated some of the items within their respective original variables and named them with "-cnsldt" appendum.

Below are the frequency and the distribution of deliveries by these variables:

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(tableone)
library(dplyr)
library(knitr)
library(kableExtra)
library(tidyr)
library(tibble) 

ob_data_tbl5 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%  
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster))

vars <- c("lbr_type_cnsldt", "presentation_cnsldt", "intrapartal_events", "conditions_cnsldt", "del_method_cnsldt")

# Create the table
selected_hosp <- tableone::CreateTableOne(data = ob_data_tbl5, vars = vars)


# Convert to a clean data frame
selected_hosp_df <- as.data.frame(print(selected_hosp, printToggle = FALSE))

# Move row names to a column
selected_hosp_df <- rownames_to_column(selected_hosp_df, var = "RowID")

# Identify and clean header and item rows
selected_hosp_df <- selected_hosp_df %>%
  mutate(
    Variable = ifelse(grepl("^X\\.*", RowID), "", RowID),
    Item = ifelse(grepl("^X\\.*", RowID), gsub("^X\\.*", "", RowID), ""),
    Value = Overall
  ) %>%
  dplyr::select(Variable, Item, Value) %>%
  fill(Variable, .direction = "down")

# Print the table to HTML
kableExtra::kable(selected_hosp_df, format = "html", caption = "Distribution of Deliveries among Hospital Variables", table.attr = "class='table' style='font-size: 12px; width: auto !important;'") %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left", font_size = 12)

```

### Distribution analysis

#### By zip

Like we did with previous situations, we ask if hospital events and hospital conditions at delivery show a biased distribution with regards to zip codes.

```{r echo=FALSE, warning=FALSE, message=FALSE}
ob_data_tbl6 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster))


strata_tbl_lbr_zip <- CreateTableOne(data = ob_data_tbl6, strata = "lbr_type_cnsldt", vars = "zip")
kableone(strata_tbl_lbr_zip, cramVars = "lbr_type_cnsldt", caption = "Distribution of Labor Types by Zip code")

strata_tbl_conditions_zip <- CreateTableOne(data = ob_data_tbl6, strata = "conditions_cnsldt", vars = "zip")
kableone(strata_tbl_conditions_zip, cramVars = "conditions_cnsldt", caption = "Distribution of Intrapartal Conditions by Zip code")

strata_tbl_events_zip <- CreateTableOne(data = ob_data_tbl6, strata = "intrapartal_events", vars = "zip")
kableone(strata_tbl_events_zip, cramVars = "intrapartal_events", caption = "Distribution of Intrapartal Events by Zip code")

```

- For labor types (p-value = 0.142) and intrapartal conditions (p-value = 0.151) there is no significant difference in their distribution among and within the zipcodes. 

- A different conclusion for Intrapartal events (p_value = 0.02): Focusing on the 3 most frequent intrapartal events, there is a significant difference in precipitous labor (93905, 93927, 93906); Prolonged Labor \>20 hrs (93905, 93906, 93927); and Prolonged 2nd Stg \>2.5 hr (93905, 93906, 93927); 93905 had the most difference in each category. 

#### By High Risk OB

```{r echo=FALSE, warning=FALSE, message=FALSE}
ob_data_tbl7 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  dplyr::select(-adm_date, -delivery_date) %>% 
  mutate(cluster = as.factor(cluster))


strata_tbl_lbr_rsk <- CreateTableOne(data = ob_data_tbl7, strata = "lbr_type_cnsldt", vars = "hghRsk")
kableone(strata_tbl_lbr_rsk, cramVars = "hghRsk", caption = "Distribution of Labor Types by High Risk OB")

strata_tbl_conditions_rsk <- CreateTableOne(data = ob_data_tbl7, strata = "conditions_cnsldt", vars = "hghRsk")
kableone(strata_tbl_conditions_rsk, cramVars = "conditions_cnsldt", caption = "Distribution of Intrapartal Conditions by High Risk OB")

strata_tbl_events_rsk <- CreateTableOne(data = ob_data_tbl7, strata = "intrapartal_events", vars = "hghRsk")
kableone(strata_tbl_events_rsk, cramVars = "intrapartal_events", caption = "Distribution of Intrapartal Events by High Risk OB")


```

High Risk OB, on the other hand, showed significant differences in all three parameters: 

- Labor Type (p= \<0.001) : 30% induced; 14% spontaneous -- greater association with labor induction;  
- Intrapartal conditions (p = \<0.001): None 16.6% versus 100% Preclampsia, 18.1% infection, and 17.6 % Prolonged Rupture of Membrane
- Intrapartal events (p = 0.002): None 21.9% Vs 10.5% for Precipitous labor < 3hrs, 8.1 % for Prolonged 2nd stage \>2.5 hr and 27.8 % for Prolonged labor \> 20 hr

#### By Preterm:

```{r echo=FALSE, warning=FALSE, message=FALSE}

ob_data_tbl8 <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  mutate(cluster = as.factor(cluster)) %>% 
  dplyr::select(-adm_date, -delivery_date) 

strata_tbl_lbr_clstr <- CreateTableOne(data = ob_data_tbl8, strata = "lbr_type_cnsldt", vars = "cluster")
kableone(strata_tbl_lbr_clstr, cramVars = "cluster", caption = "Distribution of Labor Types by Preterm")

strata_tbl_conditions_clstr <- CreateTableOne(data = ob_data_tbl8, strata = "conditions_cnsldt", vars = "cluster")
kableone(strata_tbl_conditions_clstr, cramVars = "conditions_cnsldt", caption = "Distribution of Intrapartal Conditions by Preterm")

strata_tbl_events_clstr <- tableone::CreateTableOne(data = ob_data_tbl8, strata = "intrapartal_events", vars = "cluster")
kableone(strata_tbl_events_clstr, cramVars = "intrapartal_events", caption = "Distribution of Intrapartal Events by Preterm")


```


The analysis shows that Preterm

- is significantly associated with different frequencies of different Labor types (p=0.009) AND different intrapartal conditions (p= \<0.001)

- but NOT  different intrapartal events (p=0.172)


In summary, for this series of frequency and distribution analysis, we can conclude that High Risk OB has significant impact on Intrapartal events and Intrapartal conditions, as well as Labor Type.


There are many other ways to analyze the data with this method. For example, we can address questions like: _does labor induction lead to more C-sections?_

#### Labor type by Delivery method: an example 

```{r echo=FALSE, message=FALSE, warning=FALSE}
example_question <- ob_data_tbl1

strata_tbl_del_method_rsk <- CreateTableOne(data = example_question, strata = "del_method_cnsldt", vars = "lbr_type_cnsldt")

kableone(strata_tbl_del_method_rsk, cramVars = "lbr_type_cnsldt", caption = "Distribution of Delivery Method by Labor type")

```

We show a statistically significant effect (p = <0.001): Induced labor is associated with a higher frequency of Primary C-Section when compared to other delivery types (36.4%), but also when Primary C-Section rate is compared to Vaginal Delivery rate for the labor induction group (36.4% Vs 25.1%). That solves out to **Odds Ratio of 1.70** (We will see Odds Ratio again later); meaning that the odds of having Primary C-Section are 1.7 times higher for labor induction group, and it is statistically significant at p-value <0.001)


#### Delivery method by High Risk OB: an example

Another example: _Is Caesarian section more frequent in High Risk OB?_

```{r echo=FALSE, warning=FALSE, message=FALSE}

example_question_2 <- ob_data_tbl1

strata_tbl_del_method_rsk_2 <- CreateTableOne(data = example_question_2, strata = "del_method_cnsldt", vars = "hghRsk")

kableone(strata_tbl_del_method_rsk_2, cramVars = "del_method_cnsldt", caption = "Distribution of Delivery Method by High Risk")

```

Yes, in fact, at a p-value = \<0.001, the percent distribution  is: 44.5%, 32.8%, and 15.6% for C/S-- Primary;  C/S-- Repeat; and   Vaginal delivery respectively when High Risk OB is a present condition.



Moving on to a different analysis schema: Survival Analysis (SA) or by Time-to-Event (TTE) Analysis.

### Survival Analysis (SA)

Survival or Time-to_Event (TTE) Analysis

:   Survival analysis, or time-to-event analysis, is a statistical method used to analyze the expected duration until one or more events occur, such as death or failure. It involves the use of survival functions to estimate the probability of surviving past a certain time point. The Kaplan-Meier (KM) estimator is a non-parametric statistic used to estimate the survival function from observed survival times. Cumulative hazard (cumHaz) functions represent the accumulated risk of the event over time. Hazard ratios (HR) compare the hazard or event rates between two groups, indicating the relative likelihood of the event occurring in one group compared to another.

Survival Analysis can answer the questions:

- What is the probability of being undelivered at a given time after admission? (yields Median Survival)

- What percent of all deliveries would have occurred at a given time after admission? (Cumulative Hazard)

- What is the comparative likelihood of delivery having occurred given a certain condition versus absence of such a condition  at a certain time (Hazard Ratio, (HR))?

These are important questions for planning, predictions and for information for patient. 

##### Definitions

Here's a definition of terms relevant to Survival or Time-to_Event (TTE) Analysis :

**Event**: *Delivery* or the outcome

There is an associated delivery date with time that defines the event occurrence

**Intervention**: *Admission to the NMC*

A date-time is associated with Intervention

**Interval**: *Duration* of time between Intervention and Event.

**Time**: any *slice* of, or *point* , in time (not Duration) within Interval

**Survival**: *"Survived" Event* at any given Time

**Hazard**: *Risk of Event at any point in time* at any given Time

**Cumulative Hazard**: *Accumulated Risk of event up to a given point in time* at any given time

**Hazard Ratio (HR)**: *Relative hazard rates between two groups at any given point in time* at any given Time

In our case, given that the Event is Delivery, **Survival** means "_survived_ delivery", ie., **delivery has NOT occurred** at the specified Time; or, a status of **undelivered**.

These definitions will facilitate the interpretation of our analyses when we use the popular analytic method called Survival Analysis (SA) or Time to Event (TTE) analysis.

For Survival or TTE Analysis, we will use the method of Cox Proportional Hazards; this permits the taking into account together of all the variables that contribute to an outcome (or event) even while focusing on a particular variable of interest. The data subset will exclude deliveries done before patient had a chance for admission-registration, the latter being the time-stamp to start the clock of Intervention. That leaves **```r sa_data```** deliveries, a substantial **```r sa_data_prcnt```%** of all deliveries.

#### COX Survival, K-M, Cumulative Hazard and Hazard Ratio

Standard plot and example:

###### Cox Surv, CumHaz, strata = ALL

####### General: ALL variables, SA, CumHaz

```{r echo=FALSE, warning=FALSE, message=FALSE}

ob_data_coxph <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  filter(time >= 0) %>% 
  dplyr::select(-adm_date, -delivery_date, -labor_type, -intrapartal_conditions, -delivery_method, 
                -baby_sq, -gest_age, -adm_to_del_tm, -adm_to_del_tm_cat, -presentation) %>%
  mutate(cluster = as.factor(cluster))
  

# One-hot encode categorical variables
categorical_vars <- c("zip", "membrane_rupture", "intrapartal_events", "gender", "uds_age", "hghRsk", "conditions_cnsldt", "lbr_type_cnsldt", "presentation_cnsldt", 
                      "del_method_cnsldt", "cluster")

ob_data_encoded <- dummy_cols(ob_data_coxph, select_columns = categorical_vars, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

ob_data_encoded_coxph <- dummy_cols(ob_data_coxph, select_columns = categorical_vars, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Ensure no negative or zero survival times
ob_data_encoded_coxph <- ob_data_encoded_coxph %>%
  filter(time > 0)

# Sanitize variable names
sanitize_names <- function(names_vector) {
  names_vector <- gsub(" ", "_", names_vector)
  names_vector <- gsub(",", "", names_vector)
  names_vector <- gsub("<", "lt", names_vector)
  names_vector <- gsub(">", "gt", names_vector)
  names_vector <- gsub("-", "_", names_vector)
  names_vector <- gsub("/", "_", names_vector) # Handle slashes
  return(names_vector)
}

colnames(ob_data_encoded_coxph) <- sanitize_names(colnames(ob_data_encoded_coxph))


all_vars <- names(ob_data_encoded_coxph)
all_vars <- setdiff(all_vars, c("time", "event"))
saveRDS(ob_data_encoded_coxph, "OB data encoded ALL VARIABLES for CoxPH.RDS")  

cox_model_all <- coxph(as.formula(paste("Surv(time, event) ~", paste(all_vars, collapse = " + "))), data = ob_data_encoded_coxph)

#surv_fit <- survfit(Surv(time, event) ~ data = ob_data_encoded_coxph)


surv_fit <- survfit(cox_model_all, data=ob_data_encoded_coxph)
surv_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_encoded_coxph,
  surv.median.line = "hv",
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  risk.table = TRUE,
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Cox Survival Plot without Stratification")
  #legend.title = "ALL"
)
  
print(surv_plot)

# CumHaz
cumhaz_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_encoded_coxph,
  risk.table = TRUE,
  pval = TRUE,
  fun = "cumhaz",
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Cox Cumulative Hazard Plot without Stratification")
  #legend.title = "ALL"
)

# Extract the number of individuals at risk at time zero
at_risk_zero <- surv_fit$n.risk[1]

# Calculate the time point where 50% of individuals are at risk
half_at_risk_time <- min(surv_fit$time[surv_fit$n.risk <= at_risk_zero / 2])


# Add the line for 50% at risk time
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = half_at_risk_time, linetype = "dashed", color = "blue") +
  annotate("text", x = half_at_risk_time, y = max(surv_fit$cumhaz), 
           label = paste("50% At Risk at t =", round(half_at_risk_time, 2)), 
           color = "blue", vjust = -1.5)

# Print the plot
print(cumhaz_plot)
  
#print(cumhaz_plot)


```
There's the top plot (Survival Plot) which shows the probability of Survival at any time; and the bottom plot (Cumulative Hazard Plot) which shows accumulated events at any time. 
In the top / Survival Plot, the intersecting vertical and horizontal lines determine "median survival (time)": the time at which survival probability is 50% (time at which half the population will have survived (or experienced) the event). Time, for our purposes, is measured in hours since hospital admission registration.

Keeping in mind that for our data,

- "Survival" = not delivered (or undelivered);

- "Event" = "Hazard" = Delivered.

- "Risk" = the risk of Delivery.

We can interpret the top graph as: 

- median survival is 12 hours; or,

- the time when half of our patients is __expected__ to be still undelivered is 12 hours; or, 

- Our patients have a 50% probability of being undelivered at 12 hours after admission.


For the bottom graph, we determine the time when the cumulative deliveries show 50% still need to deliver (or are still to deliver). It works out to be about 12 hours. Note: it's just a coincidence that the median survival time (the time when half of our patients is still undelivered, based on probability) is also 12 hours.

This method is most informative when used to compare of different variable subsets, for example:

###### Cox Surv, CumHaz --stratified examples

```{r echo=FALSE, warning=FALSE, message=FALSE}


library(survival)
library(survminer)

ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                               membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                               intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                               conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                               conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                               lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                               lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Choose a categorical variable for stratification
strat_var <- "membrane_rupture_None"

# Add the stratification variable to the data frame
ob_data_encoded$strat_var <- ob_data_encoded[[strat_var]]

# Survival plot with stratification
surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)

surv_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_encoded,
  surv.median.line = "hv",
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Survival Plot with Stratification by", strat_var),
  legend.title = strat_var
)

# Print the survival plot
print(surv_plot)

# Cumulative hazard plot with stratification
cumhaz_plot <- ggsurvplot(
  surv_fit,
  fun = "cumhaz",
  data = ob_data_encoded,
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Cumulative Hazard Plot with Stratification by", strat_var),
  legend.title = strat_var
)

# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)


```

The example above compares the condition of __membrane-rupture__ with __no membrane-rupture__. First thing to check is the p-value. A p-value of 0.34 is \> 0.05, there is therefore no statistically significant difference among this condition-binary with respect to probability of survival (undelivered); or cumulative hazard (total delivered) at any time. Based on the graph, at 11 hours, the probability of being undelivered is 50% for either of them. At 12 hours (your pick of time; it could have been at 20, 30, or 36 hours) the cumulative hazard (accumulated deliveries) is 55% and 61% respectively, which do not meet the test of statistical significance for being different.

On the other hand, below, when comparing the intrapartal condition, __Preeclampsia present or not__,  we can see the difference in Survival (undelivered status) and Cumulative deliveries; p-value \<0.0001. 50% undelivered probability is at ~12-hr mark if no Preeclampsia, but doubles to 25 hours when Preeclampsia is present. At the same time , cumulative delivered is 57% with Preeclampsia absent but only 24% if Preeclampsia is present.

```{r echo=FALSE, warning=FALSE, message=FALSE}
library(survival)
library(survminer)

ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Choose a categorical variable for stratification
strat_var <- "conditions_cnsldt_Preeclampsia"

# Add the stratification variable to the data frame
ob_data_encoded$strat_var <- ob_data_encoded[[strat_var]]

# Survival plot with stratification
surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)

surv_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_encoded,
  surv.median.line = "hv",
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Survival Plot with Stratification by", strat_var),
  legend.title = strat_var
)

# Print the survival plot
print(surv_plot)

# Cumulative hazard plot with stratification
cumhaz_plot <- ggsurvplot(
  surv_fit,
  fun = "cumhaz",
  data = ob_data_encoded,
  risk.table = TRUE,
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = paste("Cumulative Hazard Plot with Stratification by", strat_var),
  legend.title = strat_var
)

# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)



```

Since we are always curious about zipcode effects, we use them for the next example below.  First, a p-value of 0.12 means no significant difference among the 4 top zipcodes in terms of the probability of being undelivered or the cumulative deliveries at any point in time:

```{r echo=FALSE, message=FALSE, warning=FALSE}
#https://rpkgs.datanovia.com/survminer/survminer_cheatsheet.pdf
# Load required packages
library(survival)
library(survminer)
library(dplyr)

# Load and preprocess the data
ob_data_coxph_zip <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  filter(time >= 0) %>% 
  dplyr::select(-adm_date, -delivery_date, -labor_type, -intrapartal_conditions, -delivery_method, 
                -baby_sq, -gest_age, -adm_to_del_tm, -adm_to_del_tm_cat, -presentation) 

# Sanitize variable names
sanitize_names <- function(names_vector) {
  names_vector <- gsub(" ", "_", names_vector)
  names_vector <- gsub(",", "", names_vector)
  names_vector <- gsub("<", "lt", names_vector)
  names_vector <- gsub(">", "gt", names_vector)
  names_vector <- gsub("-", "_", names_vector)
  names_vector <- gsub("/", "_", names_vector) # Handle slashes
  return(names_vector)
}

colnames(ob_data_coxph_zip) <- sanitize_names(colnames(ob_data_coxph_zip))

# Identify top 4 ZIP codes by count
top_4_zips <- ob_data_coxph_zip %>%
  count(zip, sort = TRUE) %>%
  top_n(4, n) %>%
  pull(zip)

# Create a new variable for top 4 ZIP codes
ob_data_coxph_zip <- ob_data_coxph_zip %>%
  mutate(zip_top_4 = if_else(zip %in% top_4_zips, as.character(zip), "Other")) %>% 
  filter(zip_top_4 != "Other")

# Convert zip_top_4 to a factor
ob_data_coxph_zip$zip_top_4 <- as.factor(ob_data_coxph_zip$zip_top_4)

# Fit Cox proportional hazards model
cox_model_all <- coxph(Surv(time, event) ~ zip_top_4 + age + grav + para + membrane_rupture + intrapartal_events + gender + weight + apg1 + apg5 + gest_age_days + diff_grav_para + uds_age + 
                         hghRsk + conditions_cnsldt + lbr_type_cnsldt + presentation_cnsldt + del_method_cnsldt + cluster, data = ob_data_coxph_zip)

# Fit survival curves using survfit
surv_fit <- survfit(Surv(time, event) ~ zip_top_4, data = ob_data_coxph_zip)

# Plot survival curves
surv_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_coxph_zip,
  surv.median.line = "hv",
  pval = TRUE,
  conf.int = TRUE,
  xlim = c(0, 100),
  risk.table = TRUE,
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = "Cox Survival Plot with Zip Top 4 Stratification"
)

print(surv_plot)

# Plot cumulative hazard curves
cumhaz_plot <- ggsurvplot(
  surv_fit,
  data = ob_data_coxph_zip,
  risk.table = TRUE,
  pval = TRUE,
  fun = "cumhaz",
  conf.int = TRUE,
  xlim = c(0, 100),
  break.time.by = 10,
  ggtheme = theme_minimal(),
  title = "Cox Cumulative Hazard Plot with Zip Top 4 Stratification"
)

print(cumhaz_plot)


```


##### Cox with stratification

###### Intrapartal Conditions consolidated

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}


ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Variables to use for pairwise combinations
conditions_vars <- c("conditions_cnsldt_Placenta_Previa", "conditions_cnsldt_Preeclampsia", "conditions_cnsldt_Preeclampsia_Infection", "conditions_cnsldt_Prolonged_ROM")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Generate plots for presence and absence of each condition
for (var in conditions_vars) {
  # Create a binary stratification variable
  ob_data_encoded$strat_var <- factor(ifelse(ob_data_encoded[[var]] == 1, "Present", "Absent"))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)

}

```

###### Intrapartal events Consolidated

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Variables to use for pairwise combinations
conditions_vars <- c("intrapartal_events_Precip_Labor_lt3_hrs", "intrapartal_events_Prolonged_Labor_gt20_hrs")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Generate plots for presence and absence of each condition
for (var in conditions_vars) {
  # Create a binary stratification variable
  ob_data_encoded$strat_var <- factor(ifelse(ob_data_encoded[[var]] == 1, "Present", "Absent"))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
 # Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)

}

```

###### Labor Type consolidated

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}

ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Variables to use for pairwise combinations
conditions_vars <- c("lbr_type_cnsldt_Induced", "lbr_type_cnsldt_Not_Applicable", "lbr_type_cnsldt_Spontaneous")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Generate plots for presence and absence of each condition
for (var in conditions_vars) {
  # Create a binary stratification variable
  ob_data_encoded$strat_var <- factor(ifelse(ob_data_encoded[[var]] == 1, "Present", "Absent"))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
 # Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)
}

```

###### UDS Age grouping 25-44

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}

ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Variables to use for pairwise combinations
conditions_vars <- c("uds_age_25_44")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Create the Cox model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous, data = ob_data_encoded)

# Generate plots for presence and absence of each condition
for (var in conditions_vars) {
  # Create a binary stratification variable
  ob_data_encoded$strat_var <- factor(ifelse(ob_data_encoded[[var]] == 1, "Present", "Absent"))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", var),
    legend.title = var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)


}

```

###### Continuous variables internally grouped

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
library(survival)
library(survminer)
library(dplyr)
library(fastDummies)

# Load the encoded dataset with all variables
ob_data_encoded_all <- readRDS("OB data encoded ALL VARIABLES for CoxPH.RDS")

# Create categories for numerical variables with the specified cut points
ob_data_encoded_all_pls <- ob_data_encoded_all %>%
  mutate(age_group = factor(cut(age, breaks = c(-Inf, 35, Inf), labels = c("lte 35", "gt 35"))),
         apg1_group = factor(cut(apg1, breaks = c(-Inf, 6, Inf), labels = c("lt 6", "gte 6"))),
         apg5_group = factor(cut(apg5, breaks = c(-Inf, 7, Inf), labels = c("lt 7", "gte 7"))),
         gest_age_days_group = factor(cut(gest_age_days, breaks = c(-Inf, 259, Inf), labels = c("lt 259", "gte 259"))),
         weight_group = factor(cut(weight, breaks = c(-Inf, 3000, Inf), labels = c("lt 3000", "gte 3000"))))

# One-hot encode the newly created groups
ob_data_encoded_all_recode <- dummy_cols(ob_data_encoded_all_pls, select_columns = c("age_group", "apg1_group", "apg5_group", "gest_age_days_group", "weight_group"))

# List of all significant variables
significant_vars <- c("age", "apg1", "apg5", "gest_age_days", "weight",
                      "membrane_rupture_None", "intrapartal_events_Precip_Labor_lt3_hrs",
                      "intrapartal_events_Prolonged_Labor_gt20_hrs", "uds_age_25_44",
                      "conditions_cnsldt_Placenta_Previa", "conditions_cnsldt_Preeclampsia",
                      "conditions_cnsldt_Preeclampsia_Infection", "conditions_cnsldt_Prolonged_ROM",
                      "lbr_type_cnsldt_Induced", "lbr_type_cnsldt_Not_Applicable",
                      "lbr_type_cnsldt_Spontaneous")

# Ensure significant variables are in the dataset
missing_vars <- setdiff(significant_vars, names(ob_data_encoded_all_recode))
if (length(missing_vars) > 0) {
  stop(paste("The following significant variables are missing in the dataset:", paste(missing_vars, collapse = ", ")))
}

# Sanitize variable names
sanitize_names <- function(names_vector) {
  names_vector <- gsub(" ", "_", names_vector)
  names_vector <- gsub(",", "", names_vector)
  names_vector <- gsub("<", "lt", names_vector)
  names_vector <- gsub(">", "gt", names_vector)
  names_vector <- gsub("-", "_", names_vector)
  names_vector <- gsub("/", "_", names_vector) # Handle slashes
  return(names_vector)
}

colnames(ob_data_encoded_all_recode) <- sanitize_names(colnames(ob_data_encoded_all_recode))

saveRDS(ob_data_encoded_all_recode, "OB DATA encoded with encoded Numericals.RDS")

# Create the Cox model using all significant variables
cox_model_nums <- coxph(as.formula(paste("Surv(time, event) ~", paste(significant_vars, collapse = " + "))), data = ob_data_encoded_all_recode)

# Function to create stratified survival and cumulative hazard plots
stratified_plot <- function(variable, strat_var) {
  print(paste("Plotting for variable:", variable, "with stratification by", strat_var))
  
  # Ensure the stratification variable is in the dataset
  if (!strat_var %in% names(ob_data_encoded_all_recode)) {
    stop(paste("Stratification variable", strat_var, "not found in the dataset"))
  }
  
  # Adjusted survival plot
  tryCatch({
    surv_fit <- survfit(Surv(time, event) ~ get(strat_var), data = ob_data_encoded_all_recode)
    surv_plot <- ggsurvplot(
      surv_fit,
      data = ob_data_encoded_all_recode,
      surv.median.line = "hv",
      risk.table = TRUE,
      pval = TRUE,
      conf.int = TRUE,
      xlim = c(0, 100),
      break.time.by = 10,
      ggtheme = theme_minimal(),
      title = paste("Survival Plot for", variable, "stratified by", strat_var),
      legend.title = strat_var,
      legend.labs = levels(ob_data_encoded_all_recode[[strat_var]])
    )
    print(surv_plot)
  }, error = function(e) {
    print(paste("Error in adjusted survival plot for", variable, ":", e$message))
  })
  
  # Adjusted cumulative hazard plot
  tryCatch({
    cumhaz_plot <- ggsurvplot(
      surv_fit,
      fun = "cumhaz",
      data = ob_data_encoded_all_recode,
      risk.table = TRUE,
      pval = TRUE,
      conf.int = TRUE,
      xlim = c(0, 100),
      break.time.by = 10,
      ggtheme = theme_minimal(),
      title = paste("Cumulative Hazard Plot for", variable, "stratified by", strat_var),
      legend.title = strat_var,
      legend.labs = levels(ob_data_encoded_all_recode[[strat_var]])
    )
    print(cumhaz_plot)
  }, error = function(e) {
    print(paste("Error in adjusted cumulative hazard plot for", variable, ":", e$message))
  })
# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)


}

# List of significant numerical variables and their one-hot encoded groups
numerical_vars_with_groups <- list(
  age = c("age_group_lte_35", "age_group_gt_35"),
  apg1 = c("apg1_group_lt_6", "apg1_group_gte_6"),
  apg5 = c("apg5_group_lt_7", "apg5_group_gte_7"),
  gest_age_days = c("gest_age_days_group_lt_259", "gest_age_days_group_gte_259"),
  weight = c("weight_group_lt_3000", "weight_group_gte_3000")
)

# Plot adjusted survival and cumulative hazard for each significant numerical variable with grouping
for (var in names(numerical_vars_with_groups)) {
  strat_vars <- numerical_vars_with_groups[[var]]
  for (strat_var in strat_vars) {
    stratified_plot(var, strat_var)
  }
}


```

###### By Zip Code

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}

#https://rpkgs.datanovia.com/survminer/survminer_cheatsheet.pdf
# Load necessary libraries
# Load necessary libraries
library(survival)
library(survminer)

# Load the dataset
ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# Variables to use for stratification
zip_vars <- c("zip_93905", "zip_93906", "zip_93907", "zip_93908", "zip_93925")

# Create the Surv object
surv_obj <- Surv(time = ob_data_encoded$time, event = ob_data_encoded$event)

# Fit the Cox proportional hazards model using significant variables
cox_model_significant <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days + 
                                 membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs + 
                                 intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 + 
                                 conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia + 
                                 conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM + 
                                 lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable + 
                                 lbr_type_cnsldt_Spontaneous + strata(zip_93905, zip_93906, zip_93907, zip_93908, zip_93925),
                               data = ob_data_encoded)

# Summary of the Cox model
#summary(cox_model_significant)

# Generate plots for each zip code
for (zip_var in zip_vars) {
  # Create a binary stratification variable
  ob_data_encoded$strat_var <- factor(ifelse(ob_data_encoded[[zip_var]] == 1, "Present", "Absent"))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", zip_var),
    legend.title = zip_var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", zip_var),
    legend.title = zip_var,
    legend.labs = levels(ob_data_encoded$strat_var)
  )
  
# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)

}

```

###### High Risk OB and Preterm

```{r echo=FALSE, warning=FALSE, message=FALSE, eval=FALSE}
library(survival)
library(survminer)
library(dplyr)

# Load the dataset
ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# List of binary stratification variables
strat_vars <- c("hghRsk_yes", "cluster_1")

# Function to fit Cox model and generate plots for stratification variable
generate_cox_model_and_plots <- function(strat_var) {
  # Ensure the stratification variable is a factor
  ob_data_encoded$strat_var <- factor(ob_data_encoded[[strat_var]])
  
  # Fit the Cox proportional hazards model
  cox_model <- coxph(Surv(time, event) ~ age + apg1 + apg5 + gest_age_days +
                       membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs +
                       intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 +
                       conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia +
                       conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM +
                       lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable +
                       lbr_type_cnsldt_Spontaneous + strata(strat_var), data = ob_data_encoded)
  
  # Print the model summary
  #print(summary(cox_model))
  
  # Survival plot with stratification
  surv_fit <- survfit(Surv(time, event) ~ strat_var, data = ob_data_encoded)
  
  surv_plot <- ggsurvplot(
    surv_fit,
    data = ob_data_encoded,
    surv.median.line = "hv",
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Survival Plot with Stratification by", strat_var),
    legend.title = strat_var
  )
  
  # Print the survival plot
  print(surv_plot)
  
  # Cumulative hazard plot with stratification
  cumhaz_plot <- ggsurvplot(
    surv_fit,
    fun = "cumhaz",
    data = ob_data_encoded,
    risk.table = TRUE,
    pval = TRUE,
    conf.int = TRUE,
    xlim = c(0, 100),
    break.time.by = 10,
    ggtheme = theme_minimal(),
    title = paste("Cumulative Hazard Plot with Stratification by", strat_var),
    legend.title = strat_var
  )
  
# Choose a fixed time point for the vertical line
fixed_time <- 12

# Extract the number of individuals at risk at time zero for each stratum
n_risk_at_zero <- sapply(1:length(surv_fit$strata), function(i) {
  surv_fit$n.risk[1 + sum(head(surv_fit$strata, i - 1))]
})

# Extract the number of individuals at risk at the fixed time point for each stratum
n_risk_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$n.risk[i]
})

# Calculate the percentage reduction in the number at risk for each stratum
percent_reduction_at_fixed_time <- 100 * (n_risk_at_zero - n_risk_at_fixed_time) / n_risk_at_zero

# Extract cumulative hazard at the fixed time point for each stratum
cumhaz_at_fixed_time <- sapply(1:length(surv_fit$strata), function(i) {
  summary(surv_fit, times = fixed_time, extend = TRUE)$cumhaz[i]
})

# Add the vertical line at the fixed time point
cumhaz_plot$plot <- cumhaz_plot$plot + 
  geom_vline(xintercept = fixed_time, linetype = "dashed", color = "blue")

# Add labels for cumulative hazard percentage at the fixed time point for each stratum
for (i in 1:length(cumhaz_at_fixed_time)) {
  cumhaz_plot$plot <- cumhaz_plot$plot +
    annotate("text", x = fixed_time, y = cumhaz_at_fixed_time[i] + (i - 1) * 0.1, 
             label = paste(round(percent_reduction_at_fixed_time[i], 2), "%"), 
             color = "red", vjust = -0.5, hjust = -0.2)
}

# Print the plot
print(cumhaz_plot)

}

# Apply the function to each stratification variable
lapply(strat_vars, generate_cox_model_and_plots)

```

We performed the same analysis and obtained plots for the different variables and tabulated the result in a summary table:

###### Summary of Significant Cox Proportional Hazard modelling
 
```{r echo=FALSE, warning=FALSE, message=FALSE}

library(survival)
library(dplyr)
library(tibble)
library(stringr)
library(purrr)

# Load the dataset
ob_data_encoded <- readRDS("OB data encoded Significant Variables for CoxPH.RDS")

# List of significant stratification variables
significant_vars <- c("hghRsk_yes", "cluster_1", "membrane_rupture_None", 
                      "intrapartal_events_Precip_Labor_lt3_hrs", "intrapartal_events_Prolonged_Labor_gt20_hrs", 
                      "uds_age_25_44", "conditions_cnsldt_Placenta_Previa", "conditions_cnsldt_Preeclampsia",
                      "conditions_cnsldt_Preeclampsia_Infection")

# Initialize an empty data frame to store all results
summary_table <- tibble(Variable = character(), Probability = character())
cumhaz_table <- tibble(Variable = character(), Cumulative_hazard = character())

# Function to extract HRs and determine the overall effect on survival probability
extract_effect <- function(strat_var) {
  # Fit the Cox proportional hazards model without stratifying by the variable
  formula <- as.formula(paste("Surv(time, event) ~ age + apg1 + apg5 + gest_age_days +",
                              "membrane_rupture_None + intrapartal_events_Precip_Labor_lt3_hrs +",
                              "intrapartal_events_Prolonged_Labor_gt20_hrs + uds_age_25_44 +",
                              "conditions_cnsldt_Placenta_Previa + conditions_cnsldt_Preeclampsia +",
                              "conditions_cnsldt_Preeclampsia_Infection + conditions_cnsldt_Prolonged_ROM +",
                              "lbr_type_cnsldt_Induced + lbr_type_cnsldt_Not_Applicable +",
                              "lbr_type_cnsldt_Spontaneous +", strat_var))
  cox_model <- coxph(formula, data = ob_data_encoded)
  
  # Extract HR for the stratification variable
  summary_cox <- summary(cox_model)
  hr <- summary_cox$coefficients[strat_var, "exp(coef)"]
  effect <- ifelse(hr > 1, "Decreases Undelivered Probability", "Increases Undelivered Probability")
  
  # Handle reference category (NA values)
  if (is.na(hr)) {
    effect <- "Reference Category"
  }
  
  # Compile results into a data frame
  results <- tibble(
    Variable = strat_var,
    Probability = effect
  )
  
  return(results)
}

# Function to extract cumulative hazard effects
extract_cumhaz_effect <- function(strat_var) {
  # Fit survival curves using survfit
  surv_fit <- survfit(Surv(time, event) ~ get(strat_var), data = ob_data_encoded)
  
  # Check the cumulative hazard at time t = 12
  cumhaz_summary <- summary(surv_fit, times = 12)
  
  # Calculate the median cumulative hazard
  median_cumhaz <- median(cumhaz_summary$cumhaz, na.rm = TRUE)
  
  # Determine the effect on cumulative hazard
  cumhaz_effect <- ifelse(cumhaz_summary$cumhaz > median_cumhaz, "Higher Cumulative Deliveries", "Lower Cumulative Deliveries")
  
  # Handle reference category (NA values)
  cumhaz_effect[is.na(cumhaz_effect)] <- "Reference Category"
  
  # Compile results into a data frame
  results <- tibble(
    Variable = strat_var,
    Cumulative_hazard = cumhaz_effect
  ) %>%
    distinct()
  
  return(results)
}

# Apply the function to each significant variable and combine results into one data frame
results_list <- lapply(significant_vars, extract_effect)
summary_table <- bind_rows(results_list) %>% arrange(Probability)

cumhaz_results_list <- lapply(significant_vars, extract_cumhaz_effect)
cumhaz_table <- bind_rows(cumhaz_results_list) %>%
  group_by(Variable) %>%
  filter(row_number() == 1) %>%
  ungroup() %>% 
  arrange(Cumulative_hazard)

# Print the summary tables
# print(summary_table)
# print(cumhaz_table)

kableExtra::kable(summary_table, caption = "Summary Table, Predictor effects on Survival Probability")

kableExtra::kable(cumhaz_table, caption = "Summary Table, Predictor effects on Cumulative Deliveries")

```

We see that: 

- the probability at any time during OB admission of **not_delivered_yet** is **higher** with PreTerm, Placenta Previa, Preeclampsia, Preeclampsia with infection and Prolonged Labor greater than 20 hours. 

- On the other hand, High Risk OB, No Membrane Rupture, Precipitous Labor less than 3 hours, and Age_Group 25-44 years **decrease** such probability.


We will continue next with Cox Proportional Hazards Survival Analysis and address Hazard Ratio: the comparative Risk of the Event: (the Risk of) Delivery between two conditions that impact the Event. A reminder:

**Hazard**: *Risk of Event (risk of Delivery) at any point in time* at any given Time

**Hazard Ratio (HR)**: *Relative hazard rates between two sub-groups (usually of the binarized or "referencialized" group) at any given point in time* at any given Time

First, we'll obtain the HR's of all variables of interest, then we will select the statistically significant ones which are determined by p-values \< 0.05 (and marked by asterisks next to the p-value)

##### COX HR

We've explored cumulative hazard (where hazard = Delivery) which is accumulated delivery count over time at any time. Now, we will look at Risk Ratio, otherwise known as Hazard Ratio (HR) (risk in this case = Delivery): a comparative Risk analysis, here, between either binarized variables or between subgroups within a group with one subgroup as a reference.

###### All Variables

```{r echo=FALSE, warning=FALSE, message=FALSE}



# Load necessary libraries
library(survival)
library(survminer)
library(dplyr)
library(fastDummies)

# Load your dataset
ob_data_coxph <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  filter(time >= 0) %>% 
  dplyr::select(-adm_date, -delivery_date, -labor_type, -intrapartal_conditions, -delivery_method, 
                -baby_sq, -gest_age, -adm_to_del_tm, -adm_to_del_tm_cat, -presentation) %>% 
  mutate(cluster=as.factor(cluster))

# One-hot encode categorical variables
categorical_vars <- c("zip", "membrane_rupture", "intrapartal_events", "gender", "uds_age", "hghRsk", "conditions_cnsldt", "lbr_type_cnsldt", "presentation_cnsldt", 
                      "del_method_cnsldt", "cluster")

ob_data_encoded <- dummy_cols(ob_data_coxph, select_columns = categorical_vars, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Ensure no negative or zero survival times
ob_data_encoded <- ob_data_encoded %>%
  filter(time > 0)

# Sanitize variable names
sanitize_names <- function(names_vector) {
  names_vector <- gsub(" ", "_", names_vector)
  names_vector <- gsub(",", "", names_vector)
  names_vector <- gsub("<", "lt", names_vector)
  names_vector <- gsub(">", "gt", names_vector)
  names_vector <- gsub("-", "_", names_vector)
  names_vector <- gsub("/", "_", names_vector) # Handle slashes
  return(names_vector)
}

# Sanitize the key variables
key_vars <- names(ob_data_encoded)[!names(ob_data_encoded) %in% c("time", "event")]
sanitized_key_vars <- sanitize_names(key_vars)

# Rename columns in the dataset to match sanitized names
names(ob_data_encoded) <- sanitize_names(names(ob_data_encoded))

saveRDS(ob_data_encoded, "OB data encoded Significant Variables for CoxPH.RDS")

# Create a formula for the Cox model
cox_formula <- as.formula(paste("Surv(time, event) ~", paste(sanitized_key_vars, collapse = " + ")))

# Fit the Cox proportional hazards model
cox_model <- coxph(cox_formula, data = ob_data_encoded)

cox_summary <- summary(cox_model)
# Summarize the model
#summary(cox_model)

# Create a forest plot for the Cox model using ggforest
ggforest(cox_model, data = ob_data_encoded)


# Extract p-values and hazard ratios for table
p_values <- cox_summary$coefficients[, "Pr(>|z|)"]
hazard_ratios <- cox_summary$coefficients[, "exp(coef)"]

# Select variables with p-value < 0.05
significant_vars <- names(p_values[p_values < 0.05]) 

# Extract significant variables with their p-values and hazard ratios
significant_p_values <- p_values[p_values < 0.05]
significant_hazard_ratios <- hazard_ratios[significant_vars]

# Create a data frame for significant variables with their p-values and hazard ratios
significant_vars_df_all <- data.frame(Variable = significant_vars, P_Value = significant_p_values, Hazard_Ratio = significant_hazard_ratios, row.names = NULL) %>% arrange(Hazard_Ratio)

# Round p-values and hazard ratios to 3 decimal places
significant_vars_df_all$P_Value <- round(as.numeric(significant_vars_df_all$P_Value), 3)
significant_vars_df_all$Hazard_Ratio <- round(as.numeric(significant_vars_df_all$Hazard_Ratio), 3)


# Add interpreter column
significant_vars_df_all <- significant_vars_df_all %>%
  mutate(interpreter = if_else(Hazard_Ratio < 1, "Delivery Risk lower", "Delivery Risk higher")) %>% 
  filter(Variable != "lbr_type_cnsldt_Not_Applicable")


# Display the significant variables with p-values and hazard ratios
#print(significant_vars_df_all)
#kableExtra::kable(significant_vars_df_all)



```

The busy HR plot above gives an idea of all the variables considered; it shows which HRs are statistically significant based on p-value less than 0.05. The next plot shows only variables whose HR values are significant; this is followed by the corresponding table.



###### Only significantly positive HR's

```{r echo=FALSE, warning=FALSE, message=FALSE}



# Load necessary libraries
library(survival)
library(survminer)
library(dplyr)
library(fastDummies)

# Load your dataset
ob_data_coxph <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>%
  filter(time >= 0) %>% 
  dplyr::select(-adm_date, -delivery_date, -labor_type, -intrapartal_conditions, -delivery_method, 
                -baby_sq, -gest_age, -adm_to_del_tm, -adm_to_del_tm_cat, -presentation)

# One-hot encode categorical variables
categorical_vars <- c("zip", "membrane_rupture", "intrapartal_events", "gender", "uds_age", "hghRsk", "conditions_cnsldt", "lbr_type_cnsldt", "presentation_cnsldt", 
                      "del_method_cnsldt", "cluster")

ob_data_encoded <- dummy_cols(ob_data_coxph, select_columns = categorical_vars, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

# Ensure no negative or zero survival times
ob_data_encoded <- ob_data_encoded %>%
  filter(time > 0)

# Sanitize variable names
sanitize_names <- function(names_vector) {
  names_vector <- gsub(" ", "_", names_vector)
  names_vector <- gsub(",", "", names_vector)
  names_vector <- gsub("<", "lt", names_vector)
  names_vector <- gsub(">", "gt", names_vector)
  names_vector <- gsub("-", "_", names_vector)
  names_vector <- gsub("/", "_", names_vector) # Handle slashes
  return(names_vector)
}

# Sanitize the key variables
key_vars <- names(ob_data_encoded)[!names(ob_data_encoded) %in% c("time", "event")]
sanitized_key_vars <- sanitize_names(key_vars)

# Rename columns in the dataset to match sanitized names
names(ob_data_encoded) <- sanitize_names(names(ob_data_encoded))

# Create a formula for the Cox model
cox_formula <- as.formula(paste("Surv(time, event) ~", paste(sanitized_key_vars, collapse = " + ")))

# Fit the full Cox model
cox_model_full <- coxph(cox_formula, data = ob_data_encoded)

# Summarize the full model
cox_summary <- summary(cox_model_full)

# Extract significant variables
significant_vars <- cox_summary$coefficients[cox_summary$coefficients[, "Pr(>|z|)"] < 0.05, ]

# Create a data frame for plotting
plot_data <- data.frame(
  variable = rownames(significant_vars),
  hr = exp(significant_vars[, "coef"]),
  lower = exp(significant_vars[, "coef"] - 1.96 * significant_vars[, "se(coef)"]),
  upper = exp(significant_vars[, "coef"] + 1.96 * significant_vars[, "se(coef)"])
)

# Plot using ggplot2
ggplot2::ggplot(plot_data, aes(x = variable, y = hr)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  geom_text(aes(label = sprintf("%.2f", hr)), hjust = -0.3, vjust = -0.5) + # Add HR values
  coord_flip() +
  labs(
    title = "Forest plot of significant hazard ratios",
    x = "Variables",
    y = "Hazard Ratio"
  ) +
  theme_minimal()

```

HR calculations showing conditions or variables with a statistically significant self-referential or binarized relative risk or likelihood of Delivery at any point in time 

In summary,

-   Higher Risk of Delivery (HR \> 1)
    -   Increasing Gestational Age (not by much): Each additional day of gestational age slightly increases the risk of delivery at any point in time.
    -   Higher APGAR 1 Score: Higher APGAR scores at 1 minute are associated with an increased risk of delivery at any point in time.
    -   Maternal Age Group 25-44 Years: This age group shows a higher risk of delivery compared to the reference group.
    -   Absence of Membrane Rupture: When there is no membrane rupture, the risk of delivery is higher.
    -   Spontaneous Labor: Spontaneous labor is associated with a higher risk of delivery.
    -   Precipitous Labor Less than 3 Hours: Quick labor (less than 3 hours) increases the risk of delivery.
-   Lower Risk of Delivery (HR \< 1)
    -   Placenta Previa: This condition significantly lowers the risk of delivery at any point in time.
    -   Prolonged Rupture of Membrane (ROM): Prolonged ROM is associated with a lower risk of delivery.
    -   Preeclampsia (with or without infection): Preeclampsia, whether alone or with infection, decreases the risk of delivery.
    -   Prolonged Labor Greater than 20 Hours: Labor lasting more than 20 hours is associated with a reduced risk of delivery.
    -   Labor Induction: Induced labor is associated with a lower risk of delivery.
    -   APGAR 5: Higher APGAR scores at 5 minutes are associated with a reduced risk of delivery.
    -   Increasing Maternal Age (not by much): Each additional year of maternal age slightly decreases the risk of delivery.
    -   Zip Codes 95076 and 93927: These specific zip codes show a lower risk of delivery at any point in time.




We leave Cox methods and go on to GLM modelling to determine ODDS Ratio

### ODDS Ratio: GLM modeling

**ODDS Ratio (OR)**: *measure of association between an exposure and an outcome*

OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring without that exposure. (In logistic regression, it is used to quantify the strength of the association between predictor variables and a binary outcome.) For our data, we will pick, as examples, two outcomes (binarized): presence or absence of intrapartal **events**; and presence or absence of intrapartal **conditions**.

Odds Ratio can answer the questions:

- What is the likelihood of an event occurring in the presence of a certain condition compared to its absence?

- How does exposure to a particular factor influence the probability of an outcome compared to non-exposure?

- Can we identify the strength of association between a predictor and an outcome in a binary response scenario?


##### Example: Intrapartal events and intrapartal conditions.

```{r echo=FALSE, message=FALSE, warning=FALSE}
# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(purrr)
library(knitr)
library(kableExtra)

# Read the dataset
ob_data_fctr <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS") %>% 
  mutate(cluster = as.factor(cluster))

# Design binary outcomes for GLM
ob_data_fctr$binary_outcome <- as.factor(ob_data_fctr$cluster)

ob_data_fctr.bin <- ob_data_fctr %>% 
  mutate(
    bin_conditions = as.factor(if_else(conditions_cnsldt == "None", 0, 1)),
    bin_events = as.factor(if_else(intrapartal_events == "None", 0, 1)))

# Create zip_category based on the frequency threshold
zip_freq <- table(ob_data_fctr.bin$zip)
threshold <- 60
ob_data_fctr.bin <- ob_data_fctr.bin %>%
  mutate(zip_category = ifelse(zip_freq[zip] >= threshold, as.character(zip), "Other"),
         zip_category = as.factor(zip_category))

# Suppress warnings and fit the models
suppressWarnings({
  model_conditions <- glm(bin_conditions ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt, data = ob_data_fctr.bin, family = binomial())
  model_events <- glm(bin_events ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt + conditions_cnsldt, data = ob_data_fctr.bin, family = binomial())
})

# Function to extract results and plot
extract_and_plot <- function(model, title, reference_levels) {
  # Extract coefficients, confidence intervals, and p-values
  coef_summary <- summary(model)$coefficients
  
  # Handle potential errors in confidence interval calculation
  conf_int <- tryCatch(
    confint(model),
    error = function(e) {
      warning("Error in calculating confidence intervals: ", conditionMessage(e))
      return(matrix(NA, nrow = nrow(coef_summary), ncol = 2))
    }
  )
  
  # Combine results into a single data frame
  results <- as.data.frame(coef_summary)
  results$conf.low <- conf_int[, 1]
  results$conf.high <- conf_int[, 2]
  
  # Add predictor names
  results$predictor <- rownames(results)
  rownames(results) <- NULL
  
  # Convert log-odds to odds ratios
  results <- results %>%
    mutate(OR = exp(Estimate),
           conf.low = exp(conf.low),
           conf.high = exp(conf.high),
           p.value = `Pr(>|z|)`)
  
  # Remove rows with missing values
  results <- results[complete.cases(results), ]
  
  # Create the plot
  plot <- ggplot2::ggplot(results, aes(x = reorder(predictor, OR), y = OR)) +
    geom_point(size = 3) +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
    geom_text(aes(label = sprintf("p = %.3f", p.value), color = ifelse(p.value < 0.05, "red", "black")), 
              hjust = -0.3, size = 2) +
    geom_text(aes(label = sprintf("OR = %.2f", OR)), vjust = -1.5, size = 3) +
    geom_hline(yintercept = 1, linetype = "dashed", color = "grey") +  # Add a line at OR = 1 for reference
    scale_y_log10() +  # Use log scale for y-axis
    labs(title = title,
         x = "Predictor",
         y = "Odds Ratio (log scale)",
         caption = reference_levels) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          legend.position = "none",
          plot.caption = element_text(size = 10, hjust = 1, vjust = 1, face = "italic")) +
    coord_flip()  # Flip coordinates for better readability
  
  return(plot)
}

# List of models and titles
models <- list(
  model_conditions = model_conditions,
  model_events = model_events
)

titles <- c(
  "Odds Ratios for Conditions",
  "Odds Ratios for Events"
)

# Reference levels for each model
reference_levels_list <- c(
  "Reference Levels:\nbin_conditions = 0 (No Intrapartal Conditions)",
  "Reference Levels:\nbin_events = 0 (No Intrapartal Events)"
)

# Apply the function to each model and print the plots
plots <- map2(models, titles, ~ extract_and_plot(.x, .y, reference_levels_list[[which(titles == .y)]]))

# Print the plots
for (plot in plots) {
  print(plot)
}

# Function to extract significant results
extract_significant_results <- function(model, p_value_threshold = 0.05) {
  # Extract coefficients, confidence intervals, and p-values
  coef_summary <- summary(model)$coefficients
  
  # Handle potential errors in confidence interval calculation
  conf_int <- tryCatch(
    confint(model),
    error = function(e) {
      warning("Error in calculating confidence intervals: ", conditionMessage(e))
      return(matrix(NA, nrow = nrow(coef_summary), ncol = 2))
    }
  )
  
  # Combine results into a single data frame
  results <- as.data.frame(coef_summary)
  results$conf.low <- conf_int[, 1]
  results$conf.high <- conf_int[, 2]
  
  # Add predictor names
  results$predictor <- rownames(results)
  rownames(results) <- NULL
  
  # Convert log-odds to odds ratios
  results <- results %>%
    mutate(OR = exp(Estimate),
           p.value = `Pr(>|z|)`) %>%
    filter(predictor != "(Intercept)" & p.value < p_value_threshold) %>%
    dplyr::select(predictor, OR, conf.low, conf.high, p.value)
  
  # Format p-values
  results$p.value <- as.numeric(format(round(results$p.value, 4), scientific = FALSE))
  
  return(results)
}

# Function to plot only significant predictors
plot_significant_predictors <- function(model, title, reference_levels, p_value_threshold = 0.05) {
  significant_results <- extract_significant_results(model, p_value_threshold)
  
  plot <- ggplot2::ggplot(significant_results, aes(x = reorder(predictor, OR), y = OR)) +
    geom_point(size = 3) +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
    geom_text(aes(label = sprintf("p = %.4f", p.value), color = ifelse(p.value < 0.05, "red", "black")), 
              hjust = -0.3, size = 3) +
    geom_text(aes(label = sprintf("OR = %.2f", OR)), vjust = -1.5, size = 3) +
    geom_hline(yintercept = 1, linetype = "dashed", color = "grey") +  # Add a line at OR = 1 for reference
    scale_y_log10() +  # Use log scale for y-axis
    labs(title = title,
         x = "Predictor",
         y = "Odds Ratio (log scale)",
         caption = reference_levels) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          legend.position = "none",
          plot.caption = element_text(size = 10, hjust = 1, vjust = 1, face = "italic")) +
    coord_flip()  # Flip coordinates for better readability
  
  return(plot)
}

# Plot significant predictors for each model
significant_plots <- map2(models, titles, ~ plot_significant_predictors(.x, .y, reference_levels_list[[which(titles == .y)]]))

# Print the significant predictor plots
for (plot in significant_plots) {
  print(plot)
}

# Extract significant variables from each model
significant_vars_conditions <- extract_significant_results(model_conditions)
significant_vars_events <- extract_significant_results(model_events)

# Add a column to indicate the model
significant_vars_conditions$model <- "Conditions"
significant_vars_events$model <- "Events"

# Combine all significant variables into one data frame
significant_vars_all <- bind_rows(significant_vars_conditions, significant_vars_events)

# Round the OR to 3 decimal places
significant_vars_all <- significant_vars_all %>%
  mutate(OR = round(OR, 3))

# Filter out 'lbr_type_cnsldtNot Applicable'
significant_vars_filtered <- significant_vars_all %>%
  filter(predictor != "lbr_type_cnsldtNot_Applicable")

# Add an interpreter column based on the model

# Print the updated dataframe
#print(significant_vars_filtered)

kableExtra::kable(significant_vars_filtered)


```



We have similar plots of all the outcomes we are interested in, but will summarize the findings below with bullet points, pared down to only the statistically significant. 

##### Other outcomes

###### (with Primary C/S, Non-spontaneous delivery, intrapartal conditions and events)

```{r echo=FALSE, message=FALSE, warning=FALSE, eval=FALSE}
#Design binary outcomes for GLM.
ob_data_fctr$binary_outcome <- as.factor(ob_data_fctr$cluster)


ob_data_fctr.bin <- ob_data_fctr %>% 
  mutate(bin_del_CS = as.factor(if_else(del_method_cnsldt =="C/S, Primary", 1, 0)),
         bin_lbr_non_spont = as.factor(if_else(lbr_type_cnsldt =="Spontaneous", 0,1)),
         bin_conditions = as.factor(if_else(conditions_cnsldt == "None", 0,1)),
         bin_events = as.factor(if_else(intrapartal_events == "None", 0, 1)))

# Create zip_category based on the frequency threshold
zip_freq <- table(ob_data_fctr.bin$zip)
threshold <- 60
ob_data_fctr.bin <- ob_data_fctr.bin %>%
  mutate(zip_category = ifelse(zip_freq[zip] >= threshold, as.character(zip), "Other"),
         zip_category = as.factor(zip_category))


suppressWarnings({
  model_del <- glm(bin_del_CS ~ zip_category + age + grav + para + gender + uds_age + hghRsk  + lbr_type_cnsldt + conditions_cnsldt, 
                   data = ob_data_fctr.bin, family = binomial())
  
  model_lbr <- glm(bin_lbr_non_spont ~ zip_category + age + grav + para + gender + uds_age + hghRsk + del_method_cnsldt+ conditions_cnsldt, 
                   data = ob_data_fctr.bin, family = binomial())
  
  model_conditions <- glm(bin_conditions ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt, 
                          data = ob_data_fctr.bin, family = binomial())
  
  model_events <- glm(bin_events ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt+ conditions_cnsldt, 
                      data = ob_data_fctr.bin, family = binomial())
})

library(dplyr)
library(ggplot2)
library(purrr)

# Function to extract results and plot
extract_and_plot <- function(model, title, reference_levels) {
  # Extract coefficients, confidence intervals, and p-values
  coef_summary <- summary(model)$coefficients
  
  # Handle potential errors in confidence interval calculation
  conf_int <- tryCatch(
    confint(model),
    error = function(e) {
      warning("Error in calculating confidence intervals: ", conditionMessage(e))
      return(matrix(NA, nrow = nrow(coef_summary), ncol = 2))
    }
  )
  
  # Combine results into a single data frame
  results <- as.data.frame(coef_summary)
  results$conf.low <- conf_int[, 1]
  results$conf.high <- conf_int[, 2]
  
  # Add predictor names
  results$predictor <- rownames(results)
  rownames(results) <- NULL
  
  # Convert log-odds to odds ratios
  results <- results %>%
    mutate(OR = exp(Estimate),
           conf.low = exp(conf.low),
           conf.high = exp(conf.high),
           p.value = `Pr(>|z|)`)
  
  # Remove rows with missing values
  results <- results[complete.cases(results), ]
  
  # Create the plot
  plot <- ggplot2::ggplot(results, aes(x = reorder(predictor, OR), y = OR)) +
    geom_point(size = 3) +
    geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
    geom_text(aes(label = sprintf("p = %.3f", p.value), color = ifelse(p.value < 0.05, "red", "black")), 
              hjust = -0.3, size = 3) +
    geom_text(aes(label = sprintf("OR = %.2f", OR)), vjust = -1.5, size = 3) +
    geom_hline(yintercept = 1, linetype = "dashed", color = "grey") +  # Add a line at OR = 1 for reference
    scale_y_log10(0.1,4) +  # Start y-axis from 0.5
    labs(title = title,
         x = "Predictor",
         y = "Odds Ratio (log scale)",
         caption = reference_levels) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1), 
          legend.position = "none",
          plot.caption = element_text(size = 10, hjust = 1, vjust = 1, face = "italic")) +
    coord_flip()  # Flip coordinates for better readability
  
  return(plot)
}


# List of models and titles
models <- list(
  model_del = model_del,
  model_lbr = model_lbr,
  model_conditions = model_conditions,
  model_events = model_events
)

titles <- c(
  "Odds Ratios for Delivery Method (C/S, Primary)",
  "Odds Ratios for Labor Type (Non-Spontaneous)",
  "Odds Ratios for Conditions",
  "Odds Ratios for Events"
)

# Reference levels for each model
reference_levels_list <- c(
  "Reference Levels:\nbin_del_CS = 0 (No Primary C/S)",
  "Reference Levels:\nbin_lbr_non_spont = 0 (No Spontaneous Delivery)",
  "Reference Levels:\nbin_conditions = 0 (No Intrapartal Conditions)",
  "Reference Levels:\nbin_events = 0 (No Intrapartal Events)"
)

# Apply the function to each model
plots_OR <- map2(models, titles, ~ extract_and_plot(.x, .y, reference_levels_list[[which(titles == .y)]]))

#plot
plots_OR

library(dplyr)
library(ggplot2)
library(purrr)

# Function to extract significant results
extract_significant_results <- function(model, p_value_threshold = 0.05) {
  # Extract coefficients, confidence intervals, and p-values
  coef_summary <- summary(model)$coefficients
  
  # Handle potential errors in confidence interval calculation
  conf_int <- tryCatch(
    confint(model),
    error = function(e) {
      warning("Error in calculating confidence intervals: ", conditionMessage(e))
      return(matrix(NA, nrow = nrow(coef_summary), ncol = 2))
    }
  )
  
  # Combine results into a single data frame
  results <- as.data.frame(coef_summary)
  results$conf.low <- conf_int[, 1]
  results$conf.high <- conf_int[, 2]
  
  # Add predictor names
  results$predictor <- rownames(results)
  rownames(results) <- NULL
  
  # Convert log-odds to odds ratios
  results <- results %>%
    mutate(OR = exp(Estimate),
           p.value = `Pr(>|z|)`) %>%
    filter(predictor != "(Intercept)" & p.value < p_value_threshold) %>%
    dplyr::select(predictor, OR, p.value)
  
  # Format p-values
  results$p.value <- format(round(results$p.value, 4), scientific = FALSE)
  
  return(results)
}

# Extract significant variables from each model
significant_vars_del <- extract_significant_results(model_del)
significant_vars_lbr <- extract_significant_results(model_lbr)
significant_vars_conditions <- extract_significant_results(model_conditions)
significant_vars_events <- extract_significant_results(model_events)

# Add a column to indicate the model
significant_vars_del$model <- "Delivery Method (C/S, Primary)"
significant_vars_lbr$model <- "Labor Type (Non-Spontaneous)"
significant_vars_conditions$model <- "Conditions"
significant_vars_events$model <- "Events"

# Combine all significant variables into one data frame
significant_vars_all <- bind_rows(significant_vars_del, significant_vars_lbr, significant_vars_conditions, significant_vars_events)

# Round the OR to 3 decimal places
significant_vars_all <- significant_vars_all %>%
  mutate(OR = round(OR, 3))

# Filter out 'lbr_type_cnsldtNot Applicable'
significant_vars_filtered <- significant_vars_all %>%
  filter(predictor != "lbr_type_cnsldtNot_Applicable")

# Add an interpreter column based on the model
significant_vars_filtered <- significant_vars_filtered %>%
  mutate(
    interpreter = case_when(
      model == "Delivery Method (C/S, Primary)" ~ paste("Odds of having a Primary C/S delivery with", predictor),
      model == "Labor Type (Non-Spontaneous)" ~ paste("Odds of having a Non-Spontaneous labor with", predictor),
      model == "Conditions" ~ paste("Odds of having intrapartal conditions with", predictor),
      model == "Events" ~ paste("Odds of having intrapartal events with", predictor)
    )
  ) %>%
  dplyr::select(-model)  # Drop the model column

# Print the updated dataframe
print(significant_vars_filtered)

kableExtra::kable(significant_vars_filtered)


```

Summary of Significant Odds Ratios from Logistic Regression Models Significant Variables and Their Interpretation:

-   Delivery Method (C/S, Primary)

    -   para: OR = 0.689, p = 0.0041

        -   Interpretation: Higher parity (number of pregnancies) is associated with lower odds of having a primary C-section.

    -   uds_age25-44: OR = 2.543, p = 0.0260

        -   Interpretation: Mothers aged 25-44 have higher odds of having a primary C-section compared to the reference age group.

    -   hghRskyes: OR = 4.571, p \< 0.0001

        -   Interpretation: High-risk pregnancies are significantly associated with higher odds of having a primary C-section.

    -   lbr_type_cnsldtSpontaneous: OR = 0.401, p = 0.0013

        -   Interpretation: Spontaneous labor is associated with lower odds of having a primary C-section.

-   Labor Type (Non-Spontaneous)

    -   age: OR = 1.055, p = 0.0038

        -   Interpretation: Increasing maternal age is associated with higher odds of non-spontaneous labor.

    -   para: OR = 0.834, p = 0.0249

        -   Interpretation: Higher parity is associated with lower odds of non-spontaneous labor.

    -   hghRskyes: OR = 1.387, p = 0.0473

        -   Interpretation: High-risk pregnancies are associated with higher odds of non-spontaneous labor.

    -   del_method_cnsldtC/S, Repeat: OR = 1.810, p = 0.0320

        -   Interpretation: Having a repeat C-section is associated with higher odds of non-spontaneous labor.

    -   del_method_cnsldtVaginal: OR = 0.226, p \< 0.0001

        -   Interpretation: Vaginal delivery is associated with lower odds of non-spontaneous labor.

-   Conditions

    -   zip_category93906: OR = 0.499, p = 0.0255

        -   Interpretation: Zip code 93906 is associated with lower odds of having intrapartal conditions.

    -   zip_category93930: OR = 0.397, p = 0.0273

        -   Interpretation: Zip code 93930 is associated with lower odds of having intrapartal conditions.

    -   age: OR = 0.943, p = 0.0032

        -   Interpretation: Increasing maternal age is associated with lower odds of having intrapartal conditions.

    -   uds_age25-44: OR = 1.960, p = 0.0315

        -   Interpretation: Mothers aged 25-44 have higher odds of having intrapartal conditions compared to the reference age group.

    -   hghRskyes: OR = 3.820, p \< 0.0001

        -   Interpretation: High-risk pregnancies are significantly associated with higher odds of having intrapartal conditions.

    -   lbr_type_cnsldtNot Applicable: OR = 0.346, p = 0.0001

        -   Interpretation: Having a consolidated labor type categorized as "Not Applicable" is associated with lower odds of having intrapartal conditions.

    -   lbr_type_cnsldtSpontaneous: OR = 0.344, p \< 0.0001

        -   Interpretation: Spontaneous labor is associated with lower odds of having intrapartal conditions.

    -   del_method_cnsldtVaginal: OR = 0.681, p = 0.0363

        -   Interpretation: Vaginal delivery is associated with lower odds of having intrapartal conditions.

-   Events

    -   lbr_type_cnsldtNot Applicable: OR = 0.231, p = 0.0011

        -   Interpretation: Having a consolidated labor type categorized as "Not Applicable" is associated with lower odds of having intrapartal events.

    -   lbr_type_cnsldtSpontaneous, Augmented: OR = 2.960, p = 0.0235

        -   Interpretation: Augmented spontaneous labor is associated with higher odds of having intrapartal events.

    -   del_method_cnsldtC/S, Repeat: OR = 0.325, p = 0.0062

        -   Interpretation: Having a repeat C-section is associated with lower odds of having intrapartal events.

    -   del_method_cnsldtVaginal: OR = 0.392, p \< 0.0001

        -   Interpretation: Vaginal delivery is associated with lower odds of having intrapartal events.

-   Zipcodes

    -   High Risk OB and Suppressed Labor Types exert strong influence consistently across all zip codes





Our ongoing analyses of the data now take us to the derivation of significant Coefficients

## Significant Coefficients

In our analysis, we utilized multinomial regression to uncover significant coefficients that explain the relationships between our predictors and the binary outcomes. Unlike the odds ratios, which provide a measure of the strength and direction of the association, the coefficients give a direct indication of the effect size for each predictor.

The plot below displays the significant coefficients (p < 0.05) for our usual outcomes of intrapartal events and intrapartal conditions. Each bar represents the coefficient value for a predictor, indicating the magnitude and direction of its effect on the outcome. Positive coefficients suggest an increase in the likelihood of the outcome with higher predictor values, while negative coefficients indicate a decrease.


This visualization provides a straightforward interpretation of how each significant predictor influences the outcome. The following sections will further detail the specific predictors and their coefficient values for each outcome.



```{r echo=FALSE, message=FALSE, warning=FALSE}
# Load the necessary libraries
library(ggplot2)
library(dplyr)
library(purrr)
library(knitr)
library(kableExtra)

# Read the dataset
ob_data_fctr <- readRDS("All Factored Complete Ready OB Dataset for Analytics.RDS")

# Design binary outcomes for GLM
ob_data_fctr.bin <- ob_data_fctr %>% 
  mutate(bin_del_CS = as.factor(if_else(del_method_cnsldt == "C/S, Primary", 1, 0)),
         bin_lbr_non_spont = as.factor(if_else(lbr_type_cnsldt == "Spontaneous", 0, 1)),
         bin_conditions = as.factor(if_else(conditions_cnsldt == "None", 0, 1)),
         bin_events = as.factor(if_else(intrapartal_events == "None", 0, 1)))

# Create zip_category based on the frequency threshold
zip_freq <- table(ob_data_fctr.bin$zip)
threshold <- 60
ob_data_fctr.bin <- ob_data_fctr.bin %>%
  mutate(zip_category = ifelse(zip_freq[zip] >= threshold, as.character(zip), "Other"),
         zip_category = as.factor(zip_category))

# Suppress warnings and fit the models
suppressWarnings({
  model_del <- glm(bin_del_CS ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + conditions_cnsldt, 
                   data = ob_data_fctr.bin, family = binomial())
  
  model_lbr <- glm(bin_lbr_non_spont ~ zip_category + age + grav + para + gender + uds_age + hghRsk + del_method_cnsldt + conditions_cnsldt, 
                   data = ob_data_fctr.bin, family = binomial())
  
  model_conditions <- glm(bin_conditions ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt, 
                          data = ob_data_fctr.bin, family = binomial())
  
  model_events <- glm(bin_events ~ zip_category + age + grav + para + gender + uds_age + hghRsk + lbr_type_cnsldt + del_method_cnsldt + conditions_cnsldt, 
                      data = ob_data_fctr.bin, family = binomial())
})

# Function to extract significant coefficients and plot
extract_and_plot <- function(model, title) {
  # Extract coefficients and p-values
  coef_summary <- summary(model)$coefficients
  
  # Combine results into a single data frame
  results <- as.data.frame(coef_summary)
  
  # Add predictor names
  results$predictor <- rownames(results)
  rownames(results) <- NULL
  
  # Filter significant coefficients (e.g., p < 0.05)
  significant_results <- results %>%
    filter(`Pr(>|z|)` < 0.05) %>%
    mutate(p.value = `Pr(>|z|)`,
           coefficient = Estimate) %>%
    dplyr::select(predictor, coefficient, p.value)
  
  # Check if there are any significant coefficients to plot
  if (nrow(significant_results) > 0) {
    # Plot the significant coefficients
    plot <- ggplot2::ggplot(significant_results, aes(x = reorder(predictor, coefficient), y = coefficient, fill = predictor)) +
      geom_bar(stat = "identity", position = "dodge") +
      geom_text(aes(label = round(coefficient, 2)), hjust = -0.3, size = 3) +
      coord_flip() +
      labs(title = title,
           x = "Predictor",
           y = "Coefficient Value") +
      theme_minimal() +
      theme(legend.position = "none")
    
    print(plot)
  } else {
    print(paste("No significant coefficients for", title))
  }
  
  return(significant_results)
}

# List of models and titles
models <- list(
  model_del = model_del,
  model_lbr = model_lbr,
  model_conditions = model_conditions,
  model_events = model_events
)

titles <- c(
  "Significant Coefficients for Delivery Method (C/S, Primary)",
  "Significant Coefficients for Labor Type (Non-Spontaneous)",
  "Significant Coefficients for Conditions",
  "Significant Coefficients for Events"
)

# Apply the function to each model and store the results
significant_results_list <- map2(models, titles, ~ extract_and_plot(.x, .y))

# Combine all significant variables into one data frame
significant_vars_all <- bind_rows(significant_results_list, .id = "model")

# Print the updated dataframe
#print(significant_vars_all)

kableExtra::kable(significant_vars_all, caption = "Significant Coefficients Across All Models")



```


Summary of results as above: The multinomial logistic regression model was used to analyze the relationship between various predictors and the occurrence of either Intrapartal Events or Intrapartal conditions. The outcome variable was re-coded to model the presence of intrapartal events as "Yes" versus "None"; same for intrapartal conditions. Our interest is in the coefficients.

General Interpretation: 

- Positive Coefficients: Predictors with positive coefficients increase the likelihood of intrapartal events or conditions
  
- Negative Coefficients: Predictors with negative coefficients decrease the likelihood of intrapartal events or conditions 

- Magnitude: The larger the absolute value of the coefficient, the stronger the association with intrapartal events or conditions 

- Significance: Predictors with significant p-values (e.g., p \< 0.05) are considered to have a statistically significant association with the outcome.



We will next engage one more analysis schema, looking for Variable Importance, because we may use our data to make forecasts.

## Variable Importance

Variable Importance can answer the questions:

- Which predictors are most influential in determining the outcome?

- How do the different features rank in terms of their contribution to the predictive power of the model?

- Can we identify the key drivers of the model's performance, helping to focus on the most critical variables for decision-making?

**Variable Importance**: *indicates how useful each variable is for making accurate predictions.*

The Mean Decrease in Gini Index (MeanDecreaseGini) is one common metric used to measure variable importance. 
Variables with higher MeanDecreaseGini values are more important for the model: the higher the MeanDecreaseGini, the more influential the variable is in predicting the outcome.

Significance of Variable Importance: 

- Key Predictors: Variables with the highest MeanDecreaseGini values are key predictors for the outcome. They contribute the most to the model's accuracy.  

- Interpretability: Knowing which variables are most important helps in understanding the underlying patterns and relationships in the data.

We will select the top 20 variables for each outcome of Intrapartal Events and Intrapartal Conditions.

#### Intrapartal Events and Intrapartal Conditions

```{r echo=FALSE, message=FALSE, warning=FALSE}

# Load necessary libraries
library(caret)
library(randomForest)
library(ggplot2)
library(dplyr)

# Load the dataset
ob_data_all_encoded_vip <- readRDS("OB DATA encoded with encoded Numericals.RDS")
ob_data_all_encoded_vip$time <- NULL

# Inspect the dataset to understand its structure
#str(ob_data_all_encoded_vip)

# Derive the Intrapartal_events_Yes variable from Intrapartal_events_None
ob_data_all_encoded_vip <- ob_data_all_encoded_vip %>%
  mutate(Intrapartal_events_Yes = if_else(intrapartal_events_None == 1, 0, 1))

# Derive the Intrapartal_conditions_Yes variable from conditions_cnsldt_None
ob_data_all_encoded_vip <- ob_data_all_encoded_vip %>%
  mutate(Intrapartal_conditions_Yes = if_else(conditions_cnsldt_None == 1, 0, 1))

# Convert response variables to factors
ob_data_all_encoded_vip$Intrapartal_events_Yes <- as.factor(ob_data_all_encoded_vip$Intrapartal_events_Yes)
ob_data_all_encoded_vip$Intrapartal_conditions_Yes <- as.factor(ob_data_all_encoded_vip$Intrapartal_conditions_Yes)

# Define the outcome variable
outcome_events <- 'Intrapartal_events_Yes'

# Remove related variables
related_vars_events <- grep("intrapartal_events", names(ob_data_all_encoded_vip), value = TRUE)
x_events <- ob_data_all_encoded_vip %>% dplyr::select(-one_of(c(outcome_events, related_vars_events)))
y_events <- ob_data_all_encoded_vip[[outcome_events]]

# Fit the Random Forest model for intrapartal events
set.seed(123) # For reproducibility
rf_model_events <- randomForest(x = x_events, y = y_events, importance = TRUE)

# Print the model summary
#print(rf_model_events)

# Compute variable importance
importance_df_events <- as.data.frame(importance(rf_model_events))
importance_df_events$Variable <- rownames(importance_df_events)
importance_df_events <- importance_df_events %>% arrange(desc(MeanDecreaseGini)) %>% 
  slice(1:20)

# Plot variable importance for intrapartal events
var_imp_event <- ggplot2::ggplot(importance_df_events, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip() +
  labs(title = "Variable Importance for Intrapartal Events",
       x = "Variables",
       y = "Mean Decrease in Gini Index") +
  theme_minimal() +
  theme(legend.position = "none")

# Save the plot with adjusted dimensions if needed
ggsave("var_imp_event_plot.png", plot = var_imp_event, width = 12, height = 6)
var_imp_event
### Intrapartal Events ###

# Compute variable importance for intrapartal events
importance_df_events <- as.data.frame(importance(rf_model_events))
importance_df_events$Variable <- rownames(importance_df_events)
importance_df_events <- importance_df_events %>% arrange(desc(MeanDecreaseGini))

# Extract significant variables for intrapartal events
significant_vars_events <- importance_df_events %>%
  filter(MeanDecreaseGini > quantile(MeanDecreaseGini, 0.75)) %>%
  arrange(desc(MeanDecreaseGini))

# Print significant variables for intrapartal events
# print("Significant variables for intrapartal events:")
# print(significant_vars_events)

kableExtra::kable(significant_vars_events, caption = "Variable Importance for Intrapartal Events Outcome")



# Define the outcome variable
outcome_conditions <- 'Intrapartal_conditions_Yes'

# Remove related variables
related_vars_conditions <- grep("conditions", names(ob_data_all_encoded_vip), value = TRUE)
x_conditions <- ob_data_all_encoded_vip %>% dplyr::select(-one_of(c(outcome_conditions, related_vars_conditions)))
y_conditions <- ob_data_all_encoded_vip[[outcome_conditions]]

# Fit the Random Forest model for intrapartal conditions
set.seed(123) # For reproducibility
rf_model_conditions <- randomForest(x = x_conditions, y = y_conditions, importance = TRUE)

# Print the model summary
#print(rf_model_conditions)

# Compute variable importance
importance_df_conditions <- as.data.frame(importance(rf_model_conditions))
importance_df_conditions$Variable <- rownames(importance_df_conditions)
importance_df_conditions <- importance_df_conditions %>% arrange(desc(MeanDecreaseGini)) %>% 
  slice(1:20)

# Plot variable importance for intrapartal conditions
var_imp_cond <- ggplot2::ggplot(importance_df_conditions, aes(x = reorder(Variable, MeanDecreaseGini), y = MeanDecreaseGini, fill = MeanDecreaseGini)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip() +
  labs(title = "Variable Importance for Intrapartal Conditions",
       x = "Variables",
       y = "Mean Decrease in Gini Index") +
  theme_minimal() +
  theme(legend.position = "none")

# Save the plot with adjusted dimensions if needed
ggsave("var_imp_cond_plot.png", plot = var_imp_cond, width = 14, height = 6)
var_imp_cond

### Intrapartal Conditions ###

# Compute variable importance for intrapartal conditions
importance_df_conditions <- as.data.frame(importance(rf_model_conditions))
importance_df_conditions$Variable <- rownames(importance_df_conditions)
importance_df_conditions <- importance_df_conditions %>% arrange(desc(MeanDecreaseGini))

# Extract significant variables for intrapartal conditions
significant_vars_conditions <- importance_df_conditions %>%
  filter(MeanDecreaseGini > quantile(MeanDecreaseGini, 0.75)) %>%
  arrange(desc(MeanDecreaseGini))

# Print significant variables for intrapartal conditions
# print("Significant variables for intrapartal conditions:")
# print(significant_vars_conditions)

kableExtra::kable(significant_vars_conditions, caption = "Variable Importance for Intrapartal Conditions Outcome")

```

© 2025 Clínica de Salud del Valle de Salinas · research.csvs.org

 

FQHC · Joint Commission Accredited · 501(c)3