4/3/2023 0 Comments Goldencheetah weather data![]() ![]() Then I proceeded with cleaning up the data. ![]() However, relative humidity is calculated from temperature and dew point, if the dew point is extremely low and the temperature is high, it gives the relative humidity values below 8 percent, so I did not exclude these values from the analysis. For the relative_humidity data, the humidity below 8 percent was marked as an outlier. Then, the following function would add outliers to the list. This function was used on the heat_index column and a relative humidity column. Then, the function find_outliers has been created. In a separate script, after the connection to the database was established, the pandas table was created from SQL query. The data was also checked on the presence of outliers in a more rigorous statistical way. Then I connected to PostgreSQL to get data and created a pandas table from SQL query. So I found out 99.85 and 0.15 percentiles, which were 5 and -5 and I used them for excluding data, which might have appeared by measuring mistakesĪfter that, I installed all necessary packages to connect to PostgreSQL and for data analysis, which are SQLAlchemy (1.3.12), pandas (0.25.3), psycopg2 (2.8.4) and psycopg2-binary (2.8.4). However, the rest (0.3 % of the data) should be excluded. Given the concern that there are some missing hours in data, I decided that everything that lies within three standard deviations (99.7% of the data) of the mean is normal temperature change per hour. As was expected, the most frequent change was zero. So I calculated the differences between rows for the whole dataset. But first I wanted to check what is the most frequent change per hour. I wrote a line in Python, which would compare every row with the following row to check if the change in temperature between two observations is less than 5 degrees. These temperatures could be caused by the mistake of the thermometer. However in my dataset, there were 135 values which exceeded this. I looked at the records of temperature in Wikipedia and for the majority of countries the records were around 50 ☌ degrees. ![]() I’ve started with a similar basic approach for detection of outliers in temperature values. With SQL code I checked that the data do not have negative values or the values over 100. As an example, relative humidity should only vary from 0 to 100, so seeing a value above 100 means a problem in the dataset. I checked if the data columns which I will use for this project do not have extreme outliers. Next step was to identify outliers in the data. Then, I checked if the data is consistent, for instance, if datetime observations were taken at the same time, ideally rounded every hour which was not the case. I checked if there are missing values in the dataset and identified a few gaps in terms of hours/days missing or humidity/temperature data being unavailable for some days. I manually went over the yearly data for the first city in the PostgreSQL viewer. ![]() Our first step is to evaluate the quality of the data. Technically, since I was processing the data one city at a time, I was using the chunksize approach as I wasn’t loading the entire database into memory. In my case, the total database size was around 900 megabytes with 21 million observations, so I’ve opted for Pandas. Above you can see a simple decision tree that will help us figure out an optimal approach. Pandas, Dask or PySpark? What Should You Choose for Your Dataset? (Alina Zhang, )īefore we begin working with the data, we need to decide which library to use. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |