Analysis of Hospital Data using Hadoop: A Case Study

Ganesh Ramesh Gholap

ABSTRACT

This paper presents an analysis of healthcare (hospital) data. It lists data analytics tools and techniques that have been used to improve healthcare performance in areas such as medical operations, reporting, decision making, and prediction and prevention systems. Data analysis is a major challenge in healthcare today. Unstructured data are growing much faster than semi-structured and structured data, and about 90 percent of big data is unstructured. The major steps of big data management in the healthcare industry are data acquisition, storage, management, analysis and visualization. Recent research targets big data visualization tools. In this paper we analyze the effective tools used for visualization of big data. The paper should help readers understand the processes and use of big data in healthcare management.

General Terms:

Big data, Hive tools, Data analytics, Hadoop, Distributed file system.

Keywords:

Hospital data set, Pig tools.

1. INTRODUCTION

The medical world is growing along with the growth of hospitals. From medical records to network operations, a hospital that does not take advantage of big data is missing out on its benefits.

The amount of data in the field of medicine is expanding rapidly, and there are several ways in which big data can improve a hospital. Every patient brings their own set of medical data to the hospital. By using big data, the hospital management team can create a better healthcare infrastructure. Big data systems collect and store huge amounts of hospital data, which can be used for everything from patient recovery rates to hospital finances. Patient health records are an important step in ensuring that a hospital gives the most appropriate care to its visitors. The problem is that records which are not intuitive and immediately available have less effect on a patient's recovery rate or on the quality of their future visits. Big data analytics in healthcare helps physicians track their patients' medical conditions better by collecting and analyzing each piece of the medical data puzzle.

2. CHALLENGES IN BIG DATA

1. Data storage and quality:

Companies and organizations are growing at a very fast pace, and this growth rapidly increases the amount of data they produce. Storing this data is becoming a huge challenge for every organization. Options such as data lakes and data warehouses are used to collect and store massive quantities of data; data lakes in particular keep unstructured data in its native format. The problem, however, is that when a data lake or warehouse tries to combine inconsistent data from disparate sources, it encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges.

2. Security and privacy of the data:

Once companies and organizations figure out how to use big data, it gives them a wide range of opportunities. However, it also involves big risks when it comes to the security and privacy of the data. The tools used for analysis store, manage, analyze, and utilize data from a variety of sources. This ultimately leads to a risk of exposure of the data, making it highly vulnerable. The production of more and more data therefore increases security and privacy concerns, which makes it essential for analysts and data scientists to consider these issues and to handle the data in a manner that does not violate privacy.

3. Various sources of data:

Data arrives from various sources and in various forms such as audio, video, images, and files. Dealing with such a volume and variety of data is a big challenge today.

4. Searching of Data:

Searching for specific data in such a large data set is difficult. Special tools and techniques are required so that the specific data can be found without undue delay.
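
The data quality problems named under challenge 1 (duplicates and missing values) can be checked before records are loaded. The following is a minimal Python sketch, not part of the original study; the field names and the three sample records are hypothetical and serve only as illustration.

```python
from collections import Counter

# Hypothetical records for illustration only; not taken from the hospital data set.
records = [
    {"patient_id": 1, "patient_name": "patient A", "cost": 1000},
    {"patient_id": 2, "patient_name": "", "cost": 2500},           # missing name
    {"patient_id": 1, "patient_name": "patient A", "cost": 1000},  # duplicate id
]

def duplicate_ids(rows, key="patient_id"):
    """Return the sorted key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def rows_with_missing(rows):
    """Return the indexes of rows in which any field is None or an empty string."""
    return [i for i, row in enumerate(rows)
            if any(v is None or v == "" for v in row.values())]

print(duplicate_ids(records))      # ids that appear more than once
print(rows_with_missing(records))  # positions of incomplete rows
```

Records flagged this way could be corrected or set aside before the file is put into HDFS.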

3. ANALYSIS OF HOSPITAL DATA

The proposed method considers the following scenario. A hospital holds a large amount of data: patient details, appointment dates, discharge dates, the number of patients treated in each hospital, and the list of regular patients in each hospital together with their diseases. The intention of the proposed method is to develop a model of the hospital data that supports new analytics based on the queries listed in Section 4.

The data are described in Table 1 and Table 2.

Table 1 Hospital Data Set:

Attribute	Description
Patient ID	Unique identifier for patient.
Patient Name	Name of a patient
City	City of patient
Country	Country of patient
Patient phone no	Phone no of patient
Age	Age of patient
Gender	Gender of patient
Birth date	Birth date of patient
Appointment date	Date of appointment
Discharge date	Date of Discharge
Cost	Bill of Patient

Table 2 Dataset for Hospital:

Attribute	Description
Hospital ID	Unique identifier for hospital.
Hospital Name	Name of Hospital
City	City of hospital
Country	Country of Hospital
Address	Address of the hospital

4. METHODOLOGY

The tool used for the proposed method is Hadoop, which can store and process structured, semi-structured and unstructured data. We assume that all the Hadoop tools have been installed and that the hospital data are available in semi-structured form.

1. Put the data set in the Hadoop directory.

2. Load the semi-structured data into a relation using the Pig LOAD command.

3. Analyze the data with the following queries:

a) List of patients for a specific hospital ID.

b) List of patients of a particular age.

c) List of patients with the highest hospital bill, etc.
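
Queries (b) and (c) can be sketched locally in Python before they are written in Pig. This is an illustration under assumptions, not the method used in this paper: the records are the four rows printed in the Query 2 result of this paper, reduced to id, age and cost, and the function names are hypothetical. Query (a) is omitted because the patient rows shown in this paper carry no hospital ID column.

```python
# (patient_id, patient_age, cost) taken from the four result rows shown later in this paper.
patients = [
    (4, 20, 69000),
    (6, 48, 52412),
    (7, 56, 78500),
    (8, 20, 74963),
]

def patients_with_age(rows, age):
    """Query (b): ids of patients whose age equals the given value."""
    return [pid for pid, patient_age, _ in rows if patient_age == age]

def highest_bill(rows):
    """Query (c): the row with the largest cost."""
    return max(rows, key=lambda row: row[2])

print(patients_with_age(patients, 20))
print(highest_bill(patients))
```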

Basic HDFS Commands:

1. Print the Hadoop version
hadoop version

2. List the contents of the root directory in HDFS
hadoop fs -ls /

3. Report the amount of space used and available on the currently mounted filesystem
hadoop fs -df hdfs:/

4. Count the number of directories, files and bytes under the paths that match the specified file pattern
hadoop fs -count hdfs:/

5. Run a DFS filesystem checking utility
hadoop fsck /

6. Run a cluster balancing utility
hadoop balancer

7. Create a new directory named “hadoop” below the /user/training directory in HDFS. Since you’re currently logged in with the “training” user ID, /user/training is your home directory in HDFS.
hadoop fs -mkdir /user/training/hadoop

8. Add a sample text file from the local directory named “data” to the new directory you created in HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/hadoop

9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop

10. Add the entire local directory called “retail” to the /user/training/hadoop directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop

11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you’ve just added there.
hadoop fs -ls

12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail

13. Delete a file ‘customers’ from the “retail” directory.
hadoop fs -rm hadoop/retail/customers

14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers

15. Delete all files from the “retail” directory using a wildcard.
hadoop fs -rm hadoop/retail/*

16. To empty the trash
hadoop fs -expunge

17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail

18. List the hadoop directory again
hadoop fs -ls hadoop

19. Add the purchases.txt file from the local directory “/home/training/” to the hadoop directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/

20. View the contents of the text file purchases.txt in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt

21. Copy the purchases.txt file from the “hadoop” directory in HDFS to the “data” directory in your local filesystem
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

22. cp is used to copy files between directories present in HDFS
hadoop fs -cp /user/training/*.txt /user/training/hadoop

23. The ‘-get’ command can be used as an alternative to the ‘-copyToLocal’ command
hadoop fs -get hadoop/sample.txt /home/training/

24. Display last kilobyte of the file “purchases.txt” to stdout.
hadoop fs -tail hadoop/purchases.txt

25. New files in HDFS get permissions 666 masked by the configured umask (644 with the default umask of 022). Use the ‘-chmod’ command to change the permissions of a file
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt

26. The default owner and group are training, training. Use ‘-chown’ to change the owner name and group name simultaneously
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt

27. The default group is training. Use the ‘-chgrp’ command to change the group name
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt

28. Move a directory from one location to another
hadoop fs -mv hadoop apache_hadoop

29. The default replication factor of a file is 3. Use the ‘-setrep’ command to change the replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt

30. Copy a directory from one cluster (name node) to another. Use the ‘distcp’ command to copy, the -overwrite option to overwrite existing files, and the -update option to synchronize both directories
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

31. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave

32. List all the hadoop file system shell commands
hadoop fs

33. Last but not least, always ask for help!
hadoop fs -help

Fig 1 Put the file in HDFS
Command to load the file into HDFS (Hadoop Distributed File System):
[training@localhost ~]$ hadoop fs -put /home/training/Patient.txt /user/training

Command to display the loaded file in HDFS:
[training@localhost ~]$ hadoop fs -cat /user/training/Patient.txt
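
The two commands above can also be issued from a script. The sketch below is a convenience wrapper in Python written under assumptions, not something used in this paper: it only builds the `hadoop fs` argument lists for `-put` and `-cat`, and runs a command only if a `hadoop` executable is found on the PATH. The helper names are hypothetical.

```python
import shutil
import subprocess

def put_command(local_path, hdfs_dir):
    """Argument list for: hadoop fs -put <local_path> <hdfs_dir>."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]

def cat_command(hdfs_path):
    """Argument list for: hadoop fs -cat <hdfs_path>."""
    return ["hadoop", "fs", "-cat", hdfs_path]

def run_if_available(argv):
    """Run the command only when its executable is installed; otherwise return None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout

# Paths are the ones used in this paper.
put = put_command("/home/training/Patient.txt", "/user/training")
cat = cat_command("/user/training/Patient.txt")
print(" ".join(put))
print(" ".join(cat))
```

Passing `put` and then `cat` to `run_if_available` on a machine with a configured Hadoop client would perform the upload and print the file; elsewhere the helper returns None without raising.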

Load the file in Pig (grunt shell):

grunt> A = LOAD 'Patient.txt' USING PigStorage(',') AS (patient_id:int, patient_name:chararray, patient_city:chararray, patient_state:chararray, patient_mobno:chararray, patient_age:int, patient_address:chararray, patient_appointment_date:chararray, discharge_date:chararray, patient_disease:chararray, cost:int);

grunt> DUMP A;

Query 1- Find the patients aged 20 or older.

grunt> C = FILTER A BY patient_age >= 20;

grunt> DUMP C;

Query 2- Find the patients having Expense >= 50000.

grunt> C = FILTER A BY cost >= 50000;

grunt> DUMP C;

Result of the above query:

4,pravin awhad,gandhi nagar,punjab,20,gurunagar colony,3/6/2018,3/7/2018,headache,69000

6,akash kale,pimpri,maharashtra,48,pimprichinchwad,21/1/2019,18/2/2019,dengue,52412

7,aniket gole,surat,gujrat,56,gandinagar,17/2/2019,21/2/2019,cancer,78500

8,naru choudhari,raju galli,rajasthan,20,jaipur,12/5/2019,21/5/2019,stomach pain,74963
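
As a cross-check, the age and cost filters can be re-applied to the four rows printed above. The sketch below uses plain Python instead of Pig. It assumes that the printed rows have ten comma-separated fields, that is, that the mobile-number column of the LOAD schema is absent from the listing; the column names are shortened versions of the schema names and are otherwise hypothetical.

```python
import csv
import io

# The four result rows exactly as printed in this paper.
RESULT = """\
4,pravin awhad,gandhi nagar,punjab,20,gurunagar colony,3/6/2018,3/7/2018,headache,69000
6,akash kale,pimpri,maharashtra,48,pimprichinchwad,21/1/2019,18/2/2019,dengue,52412
7,aniket gole,surat,gujrat,56,gandinagar,17/2/2019,21/2/2019,cancer,78500
8,naru choudhari,raju galli,rajasthan,20,jaipur,12/5/2019,21/5/2019,stomach pain,74963
"""

# Assumed column order for the printed rows (no mobile-number column).
FIELDS = ["patient_id", "name", "city", "state", "age", "address",
          "appointment_date", "discharge_date", "disease", "cost"]

def load_rows(text):
    """Parse comma-separated text into dicts, converting the integer columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), fieldnames=FIELDS):
        for key in ("patient_id", "age", "cost"):
            row[key] = int(row[key])
        rows.append(row)
    return rows

rows = load_rows(RESULT)
age_filter = [r["patient_id"] for r in rows if r["age"] >= 20]       # age of 20 or more
cost_filter = [r["patient_id"] for r in rows if r["cost"] >= 50000]  # cost of 50000 or more
print(age_filter)
print(cost_filter)
```

All four printed rows satisfy both conditions, which is consistent with the listing being the output of the cost filter.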

5. CONCLUSION

Big data analytics has the potential to transform the way hospital providers use sophisticated technologies to gain insight from their clinical and other data repositories and to make appropriate decisions. In the future we expect to see rapid, widespread implementation and use of big data analytics across hospital organizations and the healthcare industry. To that end, the challenges highlighted above must be addressed. As big data analytics becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies will garner attention.

From the above work, we conclude that big data tools like Hadoop can efficiently handle huge amounts of data from any sector and can produce the information that users want to work with. Companies are increasingly moving to big data analytics to deal with their massive amounts of data. The need for big data analytics will grow in the future, because an estimated 2.5 quintillion bytes of data are generated every day, and it is very difficult for organizations to handle this amount of data with traditional methods. Big data analytics and its applications in healthcare are at a nascent stage of development, but rapid advances in platforms and tools can accelerate their maturing process.

6. REFERENCES

[1] Challenges and opportunities with Big Data http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf

[2] Data set is taken from edureka http://www.edureka.co/my-course/big-data-and-hadoop

[3] big data in organization https://www.researchgate.net/publication/264555968_Big_Data_Analytics_A_Literature_Review_Paper

[4] Big Data Analytics in Healthcare https://www.degruyter.com/view/j/jib.ahead-of-print/jib-2017-0030/jib-2017-0030.xml

[5] Analysis on Big Data in health and hospital https://www.amazon.com/Data-Analytics-Healthcare-Research-Strategies/dp/1584264438

[6] Big data Analytics in Medicine and Healthcare https://www.degruyter.com/view/j/jib.ahead-of-print/jib-2017-0030/jib-2017-0030.xml

[7] The data challenges at scale and The scope of Hadoop https://intellipaat.com/tutorial/big-data-and-hadoop-tutorial/the-data-challenges-at-scale-and-the-scope-of-hadoop/