Predictions of CD4 lymphocytes’ count in HIV patients from complete blood count

Background: HIV diagnosis, prognosis and treatment require the T CD4 lymphocyte count obtained by flow cytometry, an expensive technique that is often unavailable in developing countries. The aim of this work is to apply a previously developed methodology, based on set theory, that predicts the T CD4 lymphocyte count from the total white blood cell (WBC) count and the lymphocyte count reported in the Complete Blood Count (CBC).
Methods: Set theory was used to classify into groups named A, B, C and D the number of leukocytes/mm³, lymphocytes/mm³ and the CD4/μL subpopulation measured by flow cytometry in 800 patients diagnosed with HIV. The union of sets A and C and the union of sets B and D were assessed, and the intersection of both unions was described in order to establish the percentage of cases belonging to these sets. Results were classified into nine ranges of 1000 leukocytes/mm³ each, and the belonging percentage of each range was calculated with respect to the whole sample.
Results: The intersection (A ∪ C) ∩ (B ∪ D) showed a prediction effectiveness of 81.44% for the range between 4000 and 4999 leukocytes, 91.89% for the range between 3000 and 3999, and 100% for the range below 3000.
Conclusions: The usefulness and clinical applicability of a methodology based on set theory were confirmed for predicting the T CD4 lymphocyte count from the WBC and lymphocyte counts of the CBC. This methodology is new, objective, and less costly than flow cytometry, which is currently considered the gold standard.


Background
HIV infection has affected around 60 million people to date [1]. In 2009, there were 33.3 million people living with HIV worldwide; 2.6 million new cases were presented and 1.8 million deaths were secondary to AIDS in the same year [2]. By 2009, Sub-Saharan Africa was the leading region in the world for deaths caused by AIDS, recording 1.3 million cases [2]. Even though AIDS is a global problem, countries with fewer resources are mostly affected [2,3].
HIV is a retrovirus that mainly affects T cells and other cells that express CD4, such as macrophages, follicular dendritic cells and lymph nodes [4]. In the natural history of HIV infection, there is an initial decrease in the number of T CD4 lymphocytes that relates to the clinical primary infection (2 weeks after infection); then a partial recovery occurs, due to atypical lymphocytes and to an increase in T CD8 lymphocytes (3-4 weeks after exposure). Finally the number of lymphocytes decreases again, slowly during the latent period and faster during the final stage, which is characterized by a marked immunodeficiency with CD4 counts below 500 CD4/μL [4]. For this reason, both the percentage of T CD4 lymphocytes and the occurrence of opportunistic infections define the stages of HIV infection and provide treatment guidelines. Currently this percentage is one of the reference biological and immunological markers for the control of HIV infection and AIDS, and it is also a predictor of mortality [5]. Its determination is the result of three laboratory steps: the WBC count, the percentage of WBC that are lymphocytes (differential count), and the percentage of CD4 lymphocytes. This last step is performed by a technique known as "immunophenotyping by flow cytometry", which consists of detecting CD4 antigenic determinants on the surface of WBC using monoclonal antibodies labeled with fluorescein [6,7]. However, this procedure has several limitations, such as a delay of more than 24 hours between blood collection and processing, and the costs of equipment and reagents for flow cytometry, which make it inaccessible to some developing countries, especially in Africa [5][6][7][8][9].
Given the large impact of HIV/AIDS on global public health, efforts have been made to make flow cytometry more accessible by implementing simplified flow cytometers that can be charged by battery or solar panels [8]. Other efforts have sought to replace it with methods that predict the CD4 lymphocyte count from CBC parameters [10,11], epidemiological variables [12,13] or machine learning [14]. One cross-sectional study of CD4 prediction from CBC parameters used the combined values of total T lymphocytes and hemoglobin to deduce CD4 counts <200 cells/μL; however, when this prediction was compared to the deduction based on total lymphocytes alone, sensitivity increased with no change in specificity in male patients, whereas in female patients sensitivity did not change and specificity decreased [10]. A cross-sectional study that assessed the usefulness of the total lymphocyte count as a surrogate marker of the T CD4 lymphocyte count in HIV-positive patients found a high correlation [11]. However, the total lymphocyte count showed low sensitivity in the classification of patients with CD4 counts <200 cells/μL [11]. Another epidemiological study sought to predict the variability of the decrease in T CD4 lymphocytes in seropositive patients by determining the distribution of CD4 counts in seronegative patients and the survival rates after acquiring HIV infection [12]. This model was applied to different populations and individuals, showing a prediction accuracy over 75% with respect to the real value of T CD4 lymphocyte variability [13]. In the machine-learning model proposed by Singh and Mars, the final CD4 count is obtained from the viral load values and the number of weeks elapsed since the first T CD4 lymphocyte count, with an accuracy of 83% with respect to the real value [14].
In a previous study, Rodríguez et al. [15] developed a new methodology applying set theory to predict the T CD4 lymphocyte count from the individual values of total WBC and lymphocytes obtained from CBCs. In that work 110 CBCs were analyzed and classified into four sets named A, B, C and D; the union of sets A and C and the union of sets B and D were evaluated, as well as the intersection of both unions. These results were classified into eight ranges of 1,000 leukocytes/mm³ each for their evaluation. The conclusion was that ranges below 5000-4000 leukocytes/mm³ predict CD4 counts lower than 570 CD4/μL with effectiveness percentages between 90% and 100% [15]. This showed that the variation of the T CD4 lymphocyte count reveals an underlying mathematical order when observed through theoretical abstractions; this order allows simple predictions that are independent of virus characteristics or patient variables.
The aim of this work is to validate the clinical applicability of this methodology based on set theory by applying it to a larger sample of HIV-positive cases.

Definitions
The sets A, B, C and D determined in [15] for the study of the leukocytes/mm³, lymphocytes/mm³ and CD4/μL populations are sets of triplets (x, y, z), where x is the number of WBC, y the number of lymphocytes and z the T CD4 lymphocyte count.
This study applies a previously developed physical-mathematical methodology, based on set theory, to predict the T CD4 lymphocyte count. The methodology rests on the mathematical analysis of the total WBC and lymphocyte counts of HIV-positive patients.

Sample
Printed CBCs of 800 patients diagnosed with HIV were used, without distinction of gender, age, population type, or clinical variables such as infection stage, hemoglobin value or medications used. The CBCs were taken from test records stored in the physical database of the infectologist who participated in the study.

Procedure
First, records of the leukocytes/mm³, lymphocytes/mm³ and CD4/μL subpopulation counts measured by flow cytometry were taken. Then, they were organized in descending order of WBC number, establishing ranges of 1000 leukocytes/mm³. Values higher than 10,000/mm³ were assigned to a single range, as were values lower than 3,000/mm³, so a total of 9 ranges were established in order to observe mathematical relationships between populations, independent of time or of the patient's evolution.
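The range assignment described above can be sketched in a few lines. This is only an illustrative sketch, not the software used in the study: the function name `leukocyte_range` and the label strings are hypothetical, and the top range is assumed to start at exactly 10,000/mm³ ("10000 leukocytes or more").

```python
def leukocyte_range(wbc_per_mm3: int) -> str:
    """Return the label of the leukocyte range that contains a WBC count.

    Assumed layout (hypothetical labels): one open range for 10000 or more,
    seven closed ranges of width 1000 from 9000-9999 down to 3000-3999,
    and one open range below 3000, giving nine ranges in total.
    """
    if wbc_per_mm3 < 0:
        raise ValueError("a cell count cannot be negative")
    if wbc_per_mm3 >= 10000:
        return ">=10000"
    if wbc_per_mm3 < 3000:
        return "<3000"
    lower = (wbc_per_mm3 // 1000) * 1000
    return f"{lower}-{lower + 999}"


# Labels in descending order, matching the ordering used in the procedure.
RANGE_LABELS = (
    [">=10000"]
    + [f"{lo}-{lo + 999}" for lo in range(9000, 2999, -1000)]
    + ["<3000"]
)

print(len(RANGE_LABELS))      # number of ranges
print(leukocyte_range(4500))  # range label for 4500 leukocytes/mm3
```

Integer division by 1000 is enough here because every intermediate range has the same width; only the two open-ended ranges need explicit checks.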
According to the previously developed methodology, records were evaluated by establishing whether or not they belonged to sets A∪C and B∪D, as well as to set (A∪C) ∩ (B∪D), using previously developed software based on set algebra [15]. Starting from the WBC and lymphocyte numbers of the CBC, this software applies the evaluated predictive methodology to calculate the range of values in which the T CD4 lymphocyte count lies.
Results for the 9 leukocyte ranges were assessed by determining the number of elements that belong to each set in each range and the corresponding percentage of success with respect to the total number of cases found in that range. The same values were also established for the whole sample. In this work, the belonging percentage of each range to each set is equivalent to the effectiveness percentage of the prediction for that range. A triplet of values belongs to all sets when it meets one of two conditions: either it has a leukocyte value equal to or higher than 6800/mm³, a lymphocyte value equal to or higher than 1800/mm³ and a CD4 cell value equal to or higher than 300/μL, or it has a leukocyte value lower than 6800/mm³, a lymphocyte value equal to or lower than 2600/mm³ and a CD4 cell value equal to or lower than 570/μL.
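The membership conditions stated above can be expressed as a short sketch. It is not the software of [15]. The thresholds come from the text, but the way they are split between the two unions is an assumption made here for illustration: A∪C is taken to constrain the pair (leukocytes, lymphocytes) and B∪D the pair (leukocytes, CD4). All function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

WBC_CUTOFF = 6800  # leukocytes/mm3, threshold quoted in the text


def in_a_union_c(wbc: float, lymph: float) -> bool:
    """Assumed reading of A∪C: condition on leukocytes and lymphocytes."""
    if wbc >= WBC_CUTOFF:
        return lymph >= 1800
    return lymph <= 2600


def in_b_union_d(wbc: float, cd4: float) -> bool:
    """Assumed reading of B∪D: condition on leukocytes and CD4 count."""
    if wbc >= WBC_CUTOFF:
        return cd4 >= 300
    return cd4 <= 570


def in_intersection(wbc: float, lymph: float, cd4: float) -> bool:
    """Triplet belongs to (A∪C) ∩ (B∪D), i.e. the prediction was correct."""
    return in_a_union_c(wbc, lymph) and in_b_union_d(wbc, cd4)


def predicted_cd4_bounds(
    wbc: float, lymph: float
) -> Optional[Tuple[Optional[int], Optional[int]]]:
    """Return (lower, upper) bounds on CD4 predicted from the CBC alone.

    None means the CBC values fall outside A∪C, so no prediction is made.
    """
    if not in_a_union_c(wbc, lymph):
        return None
    return (300, None) if wbc >= WBC_CUTOFF else (None, 570)


def belonging_percentage(triplets: Iterable[Tuple[float, float, float]]) -> float:
    """Percentage of triplets in the intersection (effectiveness percentage)."""
    items = list(triplets)
    if not items:
        raise ValueError("at least one triplet is required")
    hits = sum(in_intersection(*t) for t in items)
    return 100.0 * hits / len(items)
```

With made-up values, `predicted_cd4_bounds(2500, 900)` yields an upper bound of 570 CD4/μL, and a measured triplet such as `(2500, 900, 120)` then counts as a correct prediction in `belonging_percentage`.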

Statistical analysis
Performance measures were established for each range by means of a binary classification, where True Positive (TP) is the number of cases in the range with a correct prediction with respect to the real values, False Negative (FN) is the number of wrong predictions in the range with respect to the real values, and True Negative (TN) is the total number of correct predictions in the other ranges. The performance measures calculated for each range were Sensitivity (SENS) and Negative Predictive Value (NPV). Sensitivity was calculated with the following equation:
$$\mathrm{SENS} = \frac{TP}{TP + FN}$$
Negative Predictive Value was calculated with the following equation:
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
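As a small worked sketch, the two measures can be computed from the counts of one range, assuming the standard definitions SENS = TP / (TP + FN) and NPV = TN / (TN + FN). The function names and the counts in the example are hypothetical and are not taken from the study's tables.

```python
def sensitivity(tp: int, fn: int) -> float:
    """SENS = TP / (TP + FN): share of cases in the range predicted correctly."""
    if tp + fn == 0:
        raise ValueError("the range contains no cases")
    return tp / (tp + fn)


def negative_predictive_value(tn: int, fn: int) -> float:
    """NPV = TN / (TN + FN), with TN counted over the other ranges."""
    if tn + fn == 0:
        raise ValueError("no negative predictions to evaluate")
    return tn / (tn + fn)


# Hypothetical counts for a single range (illustration only).
tp, fn, tn = 90, 10, 450
print(round(sensitivity(tp, fn), 2))
print(round(negative_predictive_value(tn, fn), 2))
```

Under these definitions the sensitivity of a range coincides with its effectiveness percentage expressed as a fraction, since both divide the correct predictions in the range by the number of cases in the range.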

Ethical aspects
This study complies with articles 11 and 13 of resolution 008430 of 1993 of the Colombian Ministry of Health, given that physical calculations were made on the results of medically prescribed tests from clinical practice, taken from an anonymous database that was evaluated retrospectively, with no risk to patients, protecting the integrity and anonymity of participants and with no need for informed consent. The approval of an ethics committee of a specific institution is not needed because only numerical values of the database were accessed (without access to the names, data source or clinical history of patients), which were collected specifically for research purposes by one of the authors.

Results
The belonging of leukocyte, lymphocyte and CD4 cell values to each set in 27 specific samples is shown in Table 1. Table 2 shows that the effectiveness percentage of the prediction according to each range was between 68.42% and 100% for set A ∪ C, between 65.66% and 100% for set B ∪ D, and between 55.64% and 100% for the intersection set (A ∪ C) ∩ (B ∪ D). For the total number of cases, the effectiveness percentage of the prediction was 81% for set A ∪ C and 80% for set B ∪ D, whereas for the intersection set (A ∪ C) ∩ (B ∪ D) it was 73.25% (see Table 2), being equal to or above 73.91% in 6 out of the 9 established ranges, and equal to or above 81.44% in 5 ranges. The effectiveness percentage for the intersection (A ∪ C) ∩ (B ∪ D) was higher for the upper and lower ranges: it was 83.05% and 83.33% for the ranges of 8000-8999 and 9000-9999, respectively, and 81.44% and 91.89% for the ranges of 4000-4999 and 3000-3999, respectively. For the range of leukocytes below 3000, which is the most useful in the clinical setting, the effectiveness percentage was 100% (see Table 2).

Statistical analysis results
TP values ranged between 17 and 136, TN values between 450 and 569, and FN values between 0 and 59. Values for SENS ranged between 0.56 and 1, and values for NPV between 0.89 and 1. The highest SENS values were for the ranges of 10000 leukocytes or more, 9999-9000, 3999-3000, and 2999 leukocytes or less; the first three had values of 0.99 and the last one a value of 1. The NPVs were equal to or greater than 0.98 in 5 out of the 9 assessed ranges (see Table 3).

Discussion
This is the first work in which a new predictive methodology for the T CD4 lymphocyte count is applied to a sample of 800 HIV-positive patients. This methodology was developed from the analysis of the WBC and lymphocyte counts of the CBC, and it is based on set theory. Its predictive percentages are equal to or above 73.91% for 6 out of the 9 measured ranges, confirming its predictive capacity and clinical applicability independently of epidemiological and clinical variables. Sensitivity values over 0.80 were found in 5 of the 9 measured ranges; specificity was not calculated, given that there are no False Positives. Taking into account that starting anti-retroviral treatment is suggested at 300 CD4 cells/mm³, this predictive methodology showed a prediction effectiveness of 100% when leukocyte values were below 3000. This means that a CD4 value lower than 570/mm³ is predicted for all these cases.
The belonging percentage to set A ∪ C is greater than the percentage to set B ∪ D, showing the specificity of the T CD4 lymphocyte values and evidencing the difficulty of finding results that allow their prediction. In contrast to the previous work [15], one more range of leukocytes, from 3999 to 3000, was quantified in this work in order to study more specifically the ranges of values that have greater clinical importance. High prediction values were found, with percentages over 73% and even of 100% for the high and low ranges, which are clinically the most important.
The mathematical theory through which predictions are obtained does not allow False Positives to be established in the statistical analysis, given that each set, as well as the intersections that constitute the prediction, excludes the possibility of finding triplets that would yield a False Positive prediction. For this reason it is not possible to establish a positive predictive value, which shows that this mathematical inductive way of thinking cannot be assessed directly with traditional statistical parameters; instead, the set algebra approach achieves deductive predictions of clinical importance.
Works performed with the aim of simplifying and reducing the costs of follow-up of HIV patients are mostly epidemiological and have limitations in the prediction of immunological biomarkers [10][11][12][13], given that studying these biomarkers from epidemiological variables or virus characteristics does not allow a complete, rigorous, objective, simple and reproducible analysis of the immunological response of such patients. Such is the case of cross-sectional descriptive studies that try to deduce T CD4 lymphocytes from CBC parameters such as the total lymphocyte count or hemoglobin. Although a correlation between these parameters has been found, the sensitivity of the deductions varies according to gender [10] or to the CD4 count itself [11]. Also, studies have sought to predict the variability of the decrease in T CD4 lymphocytes [12,13]. Other studies based on machine learning, such as the one proposed by Singh and Mars [14] to obtain the latest CD4 count, have an accuracy not greater than 90% and require a previous T CD4 lymphocyte count and viral load values, which represent a moderate additional cost. Based on neural networks and machine learning, some methodologies propose viral load measurement as a marker of treatment response in HIV-infected patients. The limitation of these experimental methodologies is that they take into account genotypic virus characteristics rather than the immune response of the patient [16][17][18][19], which still results in expensive flow cytometry tests [8,9]. In contrast to these studies, the present study applies a methodology that uses more accessible data, such as the CBC, and analyzes them objectively from a set theory approach in order to deduce the value of T CD4 lymphocytes with a high success percentage. This study provides scientific contributions that are useful for the development of control and management measures for the HIV/AIDS pandemic, with a clinical applicability that may optimize the care and follow-up of patients who suffer from this disease.
In the context of dynamic systems, a descriptive but not predictive model of the dynamics of the immune response to HIV was developed by plotting how T CD4, CD8 and B lymphocytes and antibodies act, and how the viral load progresses [20]. The present paper analyzes the variation of the WBC and lymphocyte populations in HIV patients, and also allows the prediction of the CD4 subtype. Furthermore, since it is based on a mathematical approach, it does not require statistical analysis, just as none is required in the study of physical phenomena such as predicting the trajectory of planets or an eclipse.

Conclusions
This study confirms the predictive capacity of the developed methodology, based on set theory, to determine the number of T CD4 cells from the WBC and lymphocyte counts, achieving an effectiveness of 91.89% for the range between 3000 and 3999 leukocytes, and of 100% for the range below 3000 leukocytes. This methodology can be useful to determine the number of CD4 cells in places where there is no easy access to flow cytometry, reducing the cost of determining the state of patients with HIV/AIDS.