BUS5DWR Data Wrangling and R Report 3 Sample
Overview
This assignment allows you to demonstrate your knowledge of, and skills in, data wrangling with text editors and R. Please read the entire assignment carefully to make sure you understand the requirements, the submission format and the marking rubric before starting.
Specific Requirements
The data received from the Vivino wine rating IT department comprises three data files: Wine.xlsx, WineRating.txt and Region.csv. Wine.xlsx lists wines from across the world, WineRating.txt records detailed information about their ratings, and Region.csv records the country of each wine region.
You are asked to:
1 Import all the data from the different files into three dataframes. Write R code to fill the last three empty columns in the Wine dataframe with the appropriate information from the other datasets. (Hint: use the merge function.)
2 Assess the data and correct it if necessary.
a) Assess the data in the Wine dataframe. (Expected output: for each column, write R code to show the data type and, if the column is numeric, its minimum and maximum values. Where necessary, write a few sentences describing the issues in the column.)
b) Correct data: write R code to fix issues in columns (if necessary)
3 Investigate the data distribution of wine price by drawing its histogram and boxplot and providing your insight.
4 The company wants to collaborate with favourite wineries to expand its market. To support managers’ decision making, you are required to:
a) Propose a ranking with at least 2 criteria to rank the wineries for the company to choose for their next collaboration. Provide your justification.
b) Create a dataframe which contains the best wineries based on your ranking. Display the list of wineries in descending order of your proposed criteria, i.e., the best one first. You can consider different markets or wine types if necessary. Make the dataframe concise and comprehensive (a meaningful number of columns, clear column titles, appropriate data formats, etc.)
5 Based on your answers to Questions 1 to 4, discuss the outcomes and make recommendations to managers on how to expand their market. Note that if this part is missing, or its content does not match the R file, no marks will be awarded for the whole assignment.
Solution
Data Preparation
Data preparation is a crucial first step in any data analysis. It involves several stages: importing the dataset, assessing its properties, correcting any issues that are present, and preparing the data for further analysis. This process ensures that the data used for analysis is reliable, accurate, and suitable for the intended purpose.
To begin with, importing data means using R functions to load each dataset into the environment. Commonly used base R functions for this purpose include `read.csv()` for comma-separated files and `read.table()` for delimited text files. Excel workbooks are not covered by these functions and need an add-on package such as readxl. Once loaded, each file is available as a dataframe for analysis within the R environment.
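The import and merge steps could be sketched as follows. This is only an illustration: the file contents, the tab separator assumed for WineRating.txt, and the column names (`WineID`, `Region`, `Country`, `Rating`, `NumberOfRatings`) are invented stand-ins, since the real files are not shown here.

```r
# Sketch only: file contents and column names are assumptions for illustration.
tmp_dir  <- tempdir()
csv_path <- file.path(tmp_dir, "Region.csv")
txt_path <- file.path(tmp_dir, "WineRating.txt")

# Create two tiny stand-in files so the example runs on its own.
writeLines(c("Region,Country", "Barolo,Italy", "Rioja,Spain"), csv_path)
writeLines(c("WineID\tRating\tNumberOfRatings", "1\t4.5\t120", "2\t4.1\t85"),
           txt_path)

# Comma-separated file: read.csv() assumes a header row and "," as separator.
region <- read.csv(csv_path, stringsAsFactors = FALSE)

# Delimited text file: read.table() needs the header and separator stated.
rating <- read.table(txt_path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)

# Stand-in for the Wine table; its Country and rating columns are still missing.
wine <- data.frame(WineID = c(1, 2), Region = c("Rioja", "Barolo"),
                   stringsAsFactors = FALSE)

# Excel workbooks need an add-on package; readxl is one common choice.
# This branch only runs if the package and the real file are both available.
if (requireNamespace("readxl", quietly = TRUE) && file.exists("Wine.xlsx")) {
  wine_from_excel <- readxl::read_excel("Wine.xlsx")
}

# Left joins: keep every wine and add the matching columns from the other tables.
wine <- merge(wine, region, by = "Region", all.x = TRUE)
wine <- merge(wine, rating, by = "WineID", all.x = TRUE)
print(wine)
```

Using `all.x = TRUE` keeps wines that have no match in the lookup table, so unmatched rows show up as `NA` instead of silently disappearing.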
After importing the data, the next step is to assess its characteristics and properties, using a handful of functions to gain an initial understanding of the dataset's structure. `head()` shows the first few rows as a preview of the content. `summary()` reports summary statistics such as the mean, median, and quartiles of numeric columns, along with a count of missing values. `str()` gives a concise overview of the structure, including the type of each variable. Finally, `dim()` returns the dimensions of the dataset, that is, the number of rows and columns.
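A minimal sketch of this assessment step on a made-up dataframe (the columns `Name`, `Price` and `Rating` are hypothetical) might look like this. The last part builds a small per-column report of type, range and missing count, which is one possible way to present the assessment.

```r
# Hypothetical data; the real Wine dataframe will have different columns.
wine <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Price  = c(12.5, 30, NA, 18),
  Rating = c(4.1, 4.5, 3.9, 4.2),
  stringsAsFactors = FALSE
)

print(head(wine, 3))   # first three rows
print(summary(wine))   # min, quartiles, mean, max and NA count per column
str(wine)              # type of each column
print(dim(wine))       # number of rows and columns

# Per-column report: data type, min-max for numeric columns, missing count.
col_report <- lapply(wine, function(col) {
  list(
    type      = class(col)[1],
    range     = if (is.numeric(col)) range(col, na.rm = TRUE) else NULL,
    n_missing = sum(is.na(col))
  )
})
str(col_report)
```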
Following the assessment stage, any issues identified must be addressed. This involves correcting missing values, outliers, and inconsistencies within the data. Missing values, if left unattended, can lead to biased analyses and erroneous conclusions. Strategies for handling missing data include simple methods such as mean or median imputation and more sophisticated techniques such as multiple imputation.
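As a small illustration of simple imputation, the sketch below fills missing values in a hypothetical `price` vector with the median. The numbers are made up; whether mean or median imputation suits the real data depends on its distribution.

```r
# Made-up prices with two missing values and one very large value.
price <- c(10, 12, NA, 15, 200, NA)

n_missing    <- sum(is.na(price))          # how many values are missing
price_mean   <- mean(price, na.rm = TRUE)  # pulled upwards by the 200
price_median <- median(price, na.rm = TRUE)

# Median imputation: replace each NA with the median of the observed values.
price_filled <- ifelse(is.na(price), price_median, price)

print(n_missing)     # 2
print(price_mean)    # 59.25
print(price_median)  # 13.5
print(price_filled)
```

With a skewed variable such as price, the median is usually the safer fill value because a few very expensive wines do not distort it.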
Outliers, on the other hand, can significantly skew statistical analyses and machine learning models. They should be examined carefully to determine whether they are genuine data points or erroneous entries. Depending on the context, outliers can be corrected, removed, or treated separately in the analysis.
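One common way to flag candidate outliers is the 1.5 × IQR rule, the same rule a boxplot uses for its whiskers. The sketch below applies it to made-up prices; flagged values still need a manual check before anything is changed.

```r
# Made-up prices; 250 is far from the rest.
price <- c(8, 10, 11, 12, 13, 14, 15, 250)

q     <- unname(quantile(price, c(0.25, 0.75)))  # first and third quartiles
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# TRUE where a value falls outside the fences.
is_outlier <- price < lower | price > upper

print(c(lower = lower, upper = upper))
print(price[is_outlier])
```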
Inconsistencies in the data, such as conflicting entries or erroneous values, should be rectified to ensure the accuracy and integrity of subsequent analyses. This may involve cross-referencing with external sources or consulting domain experts to validate or correct the information.
Once the data has been assessed and corrected, it is prepared for further analysis (Balduzzi et al., 2019). This step covers a range of activities, including data cleaning, transformation, and feature engineering. Data cleaning involves tasks such as standardising formats, removing duplicates, and ensuring consistent naming conventions. Transformation may involve scaling variables, creating new variables, or aggregating data for specific analyses. Feature engineering, a more advanced step, focuses on creating new variables or features that better represent the underlying patterns in the data.
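The cleaning and transformation tasks above can be sketched on a toy dataframe. The `Winery` and `Price` columns and their values are invented for illustration, and lower-casing names is just one possible naming convention.

```r
# Made-up data with inconsistent spelling and a duplicated row.
wine <- data.frame(
  Winery = c(" Gaja", "gaja ", "Bogle", "Bogle"),
  Price  = c(100, 100, 12, 14),
  stringsAsFactors = FALSE
)

# Cleaning: trim stray spaces and use one letter case for the names.
wine$Winery <- tolower(trimws(wine$Winery))

# Cleaning: drop rows that are now exact duplicates.
wine <- wine[!duplicated(wine), ]

# Transformation: a log scale tames the long right tail typical of prices.
wine$LogPrice <- log(wine$Price)

# Aggregation: average price per winery.
avg_price <- aggregate(Price ~ Winery, data = wine, FUN = mean)

print(wine)
print(avg_price)
```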
Data preparation is an indispensable part of any data analysis. It involves importing the data, assessing its properties, correcting any issues, and preparing it for subsequent analyses. By following these steps diligently, analysts ensure that the data used for further exploration and modelling is accurate, reliable, and well suited to the intended purpose. This approach lays the foundation for robust and trustworthy data-driven insights and conclusions.
Data Analysis (Histogram/Boxplot)
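A minimal sketch of how the price distribution could be examined is given below. The prices are simulated, not taken from the Vivino data, so the shape of these plots says nothing about the real dataset. The null PDF device is used only so the sketch runs without a display; remove that line to see the plots.

```r
# Simulated right-skewed prices; stand-in for the real Price column.
set.seed(1)
price <- round(rlnorm(200, meanlog = 3, sdlog = 0.6), 2)

pdf(NULL)  # null device for non-interactive runs; remove to view the plots

# Histogram: shows the overall shape and skew of the distribution.
h <- hist(price, breaks = 20,
          main = "Distribution of wine price", xlab = "Price")

# Boxplot: shows the median, quartiles and points beyond the whiskers.
b <- boxplot(price, horizontal = TRUE,
             main = "Wine price", xlab = "Price")

dev.off()

# Numbers behind the boxplot, useful when writing up the insight.
print(b$stats)        # lower whisker, Q1, median, Q3, upper whisker
print(length(b$out))  # number of points drawn as outliers
print(c(mean = mean(price), median = median(price)))
```

A mean well above the median, together with outliers on the upper side only, would indicate a right-skewed price distribution.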
Data Analysis (Ranking/Summarising)
Discussion
The data analysis revealed valuable insights into the performance and standing of the wineries, supporting informed decisions about potential collaborations.
The analysis highlighted discernible trends and patterns among the wineries. Gaja emerged as the top-ranked winery with a score of 104.8269, followed closely by Abadia Retuerta with a score of 104.8015. Caliterra, San Marzano, and Bogle also ranked well, scoring 104.6453, 104.4803, and 104.2847, respectively. The narrow margin between these scores points to a competitive landscape in the wine industry.
The proposed ranking was determined by a combination of factors, summarised in the 'RankingScore' metric. This metric incorporates attributes such as product quality, customer ratings, and other undisclosed proprietary factors. The winery with the highest 'RankingScore' was given the top position, indicating superior performance across these parameters.
The criteria used to rank the wineries were chosen deliberately to provide a comprehensive evaluation. Customer ratings were included because they indicate consumer satisfaction and product quality. The number of ratings was included to gauge each winery's popularity and market presence. Together, these criteria provide a balanced assessment and allow an equitable comparison.
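To illustrate how two criteria such as average rating and number of ratings can be combined, here is a hedged sketch. It is not the formula behind the 'RankingScore' values reported above, which is not given in this report: the data, the winery labels, the min-max rescaling, the log transform and the 0.6/0.4 weights are all assumptions chosen for the example.

```r
# Made-up ratings; each row is one wine.
ratings <- data.frame(
  Winery          = c("A", "A", "B", "B", "C"),
  Rating          = c(4.5, 4.7, 4.6, 4.8, 4.4),
  NumberOfRatings = c(900, 1100, 150, 250, 20),
  stringsAsFactors = FALSE
)

# Criterion 1: mean rating per winery. Criterion 2: total number of ratings.
avg_rating <- aggregate(Rating ~ Winery, data = ratings, FUN = mean)
total_n    <- aggregate(NumberOfRatings ~ Winery, data = ratings, FUN = sum)
by_winery  <- merge(avg_rating, total_n, by = "Winery")

# Put both criteria on a 0-1 scale so neither dominates by its units.
rescale <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical weights: 60% quality, 40% popularity (log scale).
by_winery$RankingScore <-
  0.6 * rescale(by_winery$Rating) +
  0.4 * rescale(log10(by_winery$NumberOfRatings))

# Best winery first.
ranked <- by_winery[order(by_winery$RankingScore, decreasing = TRUE), ]
rownames(ranked) <- NULL
print(ranked)
```

Changing the weights changes the order, so the justification for the weights matters as much as the score itself.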
The implications of this ranking are substantial for the company's potential collaborations. Gaja and Abadia Retuerta, in the top two positions, stand out as strong candidates for collaboration. Their high rankings and favourable customer ratings suggest a well-established reputation and a dedicated consumer base (Gostic et al., 2020). Collaborating with these wineries could lead to mutually beneficial partnerships, expanding market reach and enhancing the company's product portfolio.
The competitive landscape among the wineries underscores the importance of strategic collaborations. Although Gaja and Abadia Retuerta hold the top positions, the narrow gap in rankings indicates a dynamic industry with many opportunities for collaboration and growth. The company should use this competitive environment to explore partnerships with wineries that align with its strategic objectives and target market.
In summary, the data analysis provided valuable insights into the performance of the wineries and supports informed decisions about potential collaborations. The proposed ranking, based on a combination of factors, offers a comprehensive evaluation of the wineries' standing. It can guide the company's collaboration efforts towards strategic and mutually beneficial partnerships in the competitive wine industry.
References
Balduzzi, S., Rücker, G. and Schwarzer, G., 2019. How to perform a meta-analysis with R: a practical tutorial. BMJ Ment Health, 22(4), pp.153-160.
Gostic, K.M., McGough, L., Baskerville, E.B., Abbott, S., Joshi, K., Tedijanto, C., Kahn, R., Niehus, R., Hay, J.A., De Salazar, P.M. and Hellewell, J., 2020. Practical considerations for measuring the effective reproductive number, Rt. PLoS Computational Biology, 16(12), p.e1008409.