ANALYSIS OF HOSPITAL DATA USING HADOOP: A CASE STUDY
Ganesh Ramesh Gholap
ABSTRACT
The main aim of this paper is to analyze healthcare (hospital) data. The paper lists data analytics tools and techniques that have been used to improve healthcare performance in areas such as medical operations, reporting, decision making, and prediction and prevention systems. Data analysis is currently a big challenge in healthcare. Unstructured data is growing much faster than semi-structured and structured data, and about 90 percent of big data is unstructured. The major steps of big data management in the healthcare industry are data acquisition, storage, management, analysis, and visualization. Recent research targets big data visualization tools. In this paper, we analyze effective tools for the visualization of big data. This paper will help readers understand the processes and uses of big data in healthcare management.
General Terms: Big Data, Hive Tools, Data Analytics, Hadoop, Distributed File System.
Keywords: Hospital data set, Pig tools.
1. INTRODUCTION
The medical world is growing along with the growth of hospitals. From medical records to network operations, a hospital that does not take advantage of big data is missing out. Big data can improve a hospital in several ways. The amount of data in the field of medicine is expanding rapidly, and every patient arrives with their own set of medical data. By using big data, a hospital management team can build a better healthcare infrastructure. Big data systems collect and store huge amounts of hospital data, which can be used for everything from patient recovery rates to hospital finances. Patient health records are important for ensuring that a hospital gives the most appropriate care to its patients. The problem is that records that are not intuitive and immediately available have less effect on a patient's recovery rate or on the quality of their future visits. Big data analytics in healthcare helps physicians track their patients' medical conditions better by collecting and analyzing each piece of the medical data puzzle.
2. CHALLENGES IN BIG DATA
1. Data storage and quality:
Companies and organizations are growing at a very fast pace, and this growth rapidly increases the amount of data they produce. Storing this data is becoming a huge challenge for every organization. Options such as data lakes and data warehouses are used to collect and store massive quantities of unstructured data in its native format. The problem arises when a data lake or warehouse tries to combine inconsistent data from disparate sources and encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges.
2. Security and privacy of the data:
Once companies and organizations figure out how to use big data, it opens up a wide range of opportunities. However, it also involves big risks to the security and privacy of the data. The tools used for analysis store, manage, analyze, and use data from a variety of sources, which creates a risk of exposure and makes the data highly vulnerable. As more and more data is produced, security and privacy concerns increase. Analysts and data scientists therefore need to consider these issues and handle the data in a way that does not compromise privacy.
3. Various sources of data:
Data comes from various sources and in various forms, such as audio, video, images, and files. Dealing with such a volume of data in so many forms is a big challenge.
4. Searching of data:
Searching for specific data in such a large data set is a critical task. Special tools and techniques are required so that the specific data can be found without delay.
3. ANALYSIS OF HOSPITAL DATA
The proposed method considers the following scenario. A hospital holds a large amount of data: patient details, appointment dates, discharge dates, the number of patients treated in each hospital, and the list of regular patients in each hospital with their diseases. The proposed method intends to develop a model of the hospital data for new analytics based on the queries listed in Section 4.
The data is described in Table 1 and Table 2.
Table 1. Hospital Data Set:
Attribute | Description
Patient ID | Unique identifier for patient
Patient Name | Name of patient
City | City of patient
Country | Country of patient
Patient phone no | Phone number of patient
Age | Age of patient
Gender | Gender of patient
Birth date | Birth date of patient
Appointment date | Date of appointment
Discharge date | Date of discharge
Cost | Bill of patient
Table 2. Dataset for Hospital:
Attribute | Description
Hospital ID | Unique identifier for hospital
Hospital Name | Name of hospital
City | City of hospital
Country | Country of hospital
Address | Address of the hospital
4. METHODOLOGY
The tool used for the proposed method is Hadoop, which can store and process structured, semi-structured, and unstructured data. We assume that the Hadoop tools are installed and that the hospital data is available in semi-structured form.
1. Put the data set in the Hadoop directory.
2. Extract the semi-structured data into a table using the LOAD command.
3. Analyze the data for the following queries:
a) List of patients in a specific hospital ID.
b) List of patients of a particular age.
c) List of patients with the highest hospital bill, etc.
Basic HDFS Commands:
1. Print the Hadoop version
hadoop version
2. List the contents of the root directory in HDFS
hadoop fs -ls /
3. Report the amount of space used and available on currently mounted filesystem
hadoop fs -df hdfs:/
4. Count the number of directories, files, and bytes under the paths that match the specified file pattern
hadoop fs -count hdfs:/
5. Run a DFS filesystem checking utility
hadoop fsck /
6. Run a cluster balancing utility
hadoop balancer
7. Create a new directory named “hadoop” below the /user/training directory in HDFS. Since you’re currently logged in with the “training” user ID, /user/training is your home directory in HDFS.
hadoop fs -mkdir /user/training/hadoop
8. Add a sample text file from the local directory named “data” to the new directory you created in HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/hadoop
9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
10. Add the entire local directory called “retail” to the /user/training/hadoop directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop
11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you’ve just added there.
hadoop fs -ls
12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail
13. Delete a file ‘customers’ from the “retail” directory.
hadoop fs -rm hadoop/retail/customers
14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers
15. Delete all files from the “retail” directory using a wildcard.
hadoop fs -rm hadoop/retail/*
16. To empty the trash
hadoop fs -expunge
17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
18. List the hadoop directory again
hadoop fs -ls hadoop
19. Add the purchases.txt file from the local directory named “/home/training/” to the hadoop directory you created in HDFS
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/
20. To view the contents of your text file purchases.txt which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt
21. Copy the purchases.txt file from the “hadoop” directory in HDFS to the “data” directory on your local file system.
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
22. cp is used to copy files between directories present in HDFS
hadoop fs -cp /user/training/*.txt /user/training/hadoop
23. The ‘-get’ command can be used as an alternative to the ‘-copyToLocal’ command.
hadoop fs -get hadoop/sample.txt /home/training/
24. Display last kilobyte of the file “purchases.txt” to stdout.
hadoop fs -tail hadoop/purchases.txt
25. Default file permissions are 666 in HDFS. Use the ‘-chmod’ command to change the permissions of a file.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt
26. The default owner and group names are training, training. Use ‘-chown’ to change the owner name and group name simultaneously.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt
27. The default group name is training. Use the ‘-chgrp’ command to change the group name.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt
28. Move a directory from one location to other
hadoop fs -mv hadoop apache_hadoop
29. Default replication factor to a file is 3. Use ‘-setrep’ command to change replication factor of a file
hadoop fs -setrep -w 2 apache_hadoop/sample.txt
30. Copy a directory from one cluster (name node) to another. Use the ‘distcp’ command to copy, the -overwrite option to overwrite existing files, and the -update option to synchronize both directories.
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
31. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave
32. List all the hadoop file system shell commands
hadoop fs
33. Last but not least, always ask for help!
hadoop fs -help
Fig 1 Put the file in HDFS
Command to load the file into HDFS (Hadoop Distributed File System):
[training@localhost ~]$ hadoop fs -put /home/training/Patient.txt /user/training
Command to display the loaded file in HDFS:
[training@localhost ~]$ hadoop fs -cat /user/training/Patient.txt
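The upload step above can be wrapped in a small script that skips the upload when the file is already in HDFS. This is a minimal sketch, not the author's procedure: the `run` helper, the `upload_patient_file` function, and the `DRY_RUN` variable are hypothetical names introduced here so that the commands can be previewed on a machine without Hadoop. The paths are the ones used above, and `hadoop fs -test -e` is the standard shell test for path existence.

```shell
#!/bin/sh
# Sketch: upload a local file to HDFS only if it is not already there.
# DRY_RUN, run() and upload_patient_file() are illustrative helpers,
# not part of Hadoop.
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands (default)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"                # preview the command instead of running it
    else
        "$@"
    fi
}

upload_patient_file() {
    local_file="$1"              # e.g. /home/training/Patient.txt
    hdfs_dir="$2"                # e.g. /user/training
    target="$hdfs_dir/$(basename "$local_file")"

    # In a real run, -test -e returns 0 when the path exists in HDFS.
    if [ "$DRY_RUN" != "1" ] && hadoop fs -test -e "$target"; then
        echo "already in HDFS: $target"
        return 0
    fi
    run hadoop fs -put "$local_file" "$hdfs_dir"
    run hadoop fs -ls "$target"
}

upload_patient_file /home/training/Patient.txt /user/training
```

With the default `DRY_RUN=1` the script only prints the two `hadoop fs` commands it would issue; setting `DRY_RUN=0` on a node with Hadoop installed runs them.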
Load the file in Pig (grunt terminal):
grunt> A = LOAD 'Patient.txt' USING PigStorage(',') AS (patient_id:int, patient_name:chararray, patient_city:chararray, patient_state:chararray, pateint_mobno:chararray, patient_age:int, patient_address:chararray, patient_appoiment_date:chararray, discharge_date:chararray, patient_disease:chararray, cost:int);
grunt> DUMP A;
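Because the LOAD schema assigns fields by position, a line with a different number of comma-separated fields than the schema declares can end up with null or shifted values. The following is a minimal sketch of a local pre-check, assuming a local copy of the data file is available; `check_field_count` is a hypothetical helper, not a Pig or Hadoop command. The two sample rows are taken from the results printed in this paper and contain 10 fields each, so the expected count is passed as a parameter and should be set to match the schema of the actual file.

```shell
#!/bin/sh
# Sketch: report lines whose comma-separated field count differs from
# the number of fields the LOAD schema declares.
# check_field_count is an illustrative helper, not a Pig or Hadoop command.
check_field_count() {
    expected="$1"    # number of fields declared in the schema
    file="$2"        # local copy of the data file
    awk -F',' -v n="$expected" '
        NF != n { printf "line %d: %d fields (expected %d)\n", NR, NF, n; bad++ }
        END     { exit (bad > 0) }
    ' "$file"
}

# Example input: two result rows printed in this paper (10 fields each).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
4,pravin awhad,gandhi nagar,punjab,20,gurunagar colony,3/6/2018,3/7/2018,headache,69000
7,aniket gole,surat,gujrat,56,gandinagar,17/2/2019,21/2/2019,cancer,78500
EOF

check_field_count 10 "$sample" && echo "all lines have 10 fields"
check_field_count 11 "$sample" || echo "some lines do not have 11 fields"
rm -f "$sample"
```

The function exits with status 0 when every line matches and 1 otherwise, so it can gate the upload step in a larger script.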
Query 1: List of patients having age greater than 20.
grunt> C = FILTER A BY patient_age > 20;
grunt> DUMP C;
Query 2: Find the patients having expense >= 50000.
grunt> C = FILTER A BY cost >= 50000;
grunt> DUMP C;
Result of the above query:
4,pravin awhad,gandhi nagar,punjab,20,gurunagar colony,3/6/2018,3/7/2018,headache,69000
6,akash kale,pimpri,maharashtra,48,pimprichinchwad,21/1/2019,18/2/2019,dengue,52412
7,aniket gole,surat,gujrat,56,gandinagar,17/2/2019,21/2/2019,cancer,78500
8,naru choudhari,raju galli,rajasthan,20,jaipur,12/5/2019,21/5/2019,stomach pain,74963
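The queries can also be cross-checked locally on a small extract before running them on the cluster. The sketch below is one possible way to do this with awk and sort, not part of the paper's method. It uses the four result rows printed above as input, and the column positions (field 5 for age, field 10 for cost) are an assumption based on those 10-field rows. The function names `cost_at_least`, `older_than`, and `highest_bill` are hypothetical.

```shell
#!/bin/sh
# Sketch: re-run the paper's filters locally on the four result rows.
# Assumed columns: 2 = name, 5 = age, 10 = cost (from the rows above).
rows="$(mktemp)"
cat > "$rows" <<'EOF'
4,pravin awhad,gandhi nagar,punjab,20,gurunagar colony,3/6/2018,3/7/2018,headache,69000
6,akash kale,pimpri,maharashtra,48,pimprichinchwad,21/1/2019,18/2/2019,dengue,52412
7,aniket gole,surat,gujrat,56,gandinagar,17/2/2019,21/2/2019,cancer,78500
8,naru choudhari,raju galli,rajasthan,20,jaipur,12/5/2019,21/5/2019,stomach pain,74963
EOF

# Rows whose cost is at least the given threshold (the expense query).
cost_at_least() { awk -F',' -v min="$1" '$10 + 0 >= min + 0' "$2"; }

# Names of patients older than the given age (the age query).
older_than() { awk -F',' -v age="$1" '$5 + 0 > age + 0 { print $2 }' "$2"; }

# Name and cost of the patient with the highest bill:
# numeric sort on field 10, descending, then keep the first row.
highest_bill() {
    sort -t',' -k10,10nr "$1" | head -n 1 | awk -F',' '{ print $2 "," $10 }'
}

cost_at_least 50000 "$rows" | wc -l    # all four rows qualify
older_than 20 "$rows"                  # akash kale, aniket gole
highest_bill "$rows"                   # aniket gole,78500
rm -f "$rows"
```

The `highest_bill` function corresponds to query (c) in the methodology, which the Pig session above does not show.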
5. CONCLUSION
Big data analytics has the potential to transform the way hospital providers use sophisticated technologies to gain insight from their clinical and other data repositories and make appropriate decisions. In the future, we expect to see rapid, widespread implementation and use of big data analytics across hospital organizations and the healthcare industry. To that end, the challenges highlighted above must be addressed. As big data analytics becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies will garner attention.
From the above work, we conclude that big data tools such as Hadoop make it possible to handle huge amounts of data from any sector efficiently and to produce the information that users need. Companies are increasingly moving toward big data analytics to deal with their massive amounts of data. The need for big data analytics will grow, because about 2.5 quintillion bytes of data are generated every day, and it is very difficult for an organization to handle this amount of data with traditional methods. Big data analytics and its applications in healthcare are at a nascent stage of development, but rapid advances in platforms and tools can accelerate their maturing process.