If you've ever used an app to find the best way through traffic or received product recommendations while shopping online, you're already familiar with data analytics from the consumer end. According to some estimates, each person creates at least 1.7 megabytes of data per second on average, and over 2.5 quintillion bytes of data are produced every single day worldwide. As such, there is a huge demand, now and for the foreseeable future, for people who can organize data and interpret the stories locked within. As you approach this career pathway, you're likely to bring practical experience in problem solving, decision making, allocating resources, time management, and many other skills that are particularly suited for the job of a data professional. Many companies are searching for candidates to fill positions in this fast-growing, high-paying field. My name is Cassie and I've been a data scientist since before we called it data science. I lead decision intelligence here at Google and I'll be your instructor for the first course of this certificate program. Before I became Google Cloud's chief decision scientist, I worked as a data scientist in Google Research where I was involved with over 400 projects all across Google. One of my favorite things about the data science career is the tremendous variety, especially if, like me, you're a naturally curious person. There are so many different flavors of project and challenge. Some of us choose to work on one project for years, others get involved with several new projects every week. The possibilities are endless. A data professional is a term used to describe any individual who works with data and/or has data skills. At a minimum, a data professional is capable of exploring, cleaning, selecting, analyzing, and visualizing data. They may also be comfortable writing code and have some familiarity with the techniques used by statisticians and machine learning engineers, including building models, developing algorithmic thinking, and/or building machine learning models. Machine learning is an alternative approach to automation: expressing the way you want a task done by using data instead of explicit instructions. Machine learning is an important component of the modern data professional's toolkit. To train a machine learning model, specialists put a bunch of potential data inputs through algorithms, tweak the settings, and keep iterating until promising outputs are produced. But training a model is only one small step in the professional machine learning journey. Machine learning techniques can also be used for data analytics and exploration with far fewer steps, and that's what you'll learn in this program. You'll discover additional opportunities to explore machine learning through the course resources, so do check those out. Data professionals' work spans a wide range of industries and affects a multitude of products and services. As we'll discuss later, there are also lots of different roles and titles that focus on data professional work. Think of them as data detectives, analyzing and interpreting their findings to reveal the stories within. I'm excited for you to get to meet some of them in this program. Google career certificates are designed by industry professionals with decades of experience here at Google. You'll have a different expert from Google to guide you through each course in this program. 
We'll share our knowledge in videos, help you practice with hands-on activities, and guide you through scenarios that you might encounter on the job. This certificate is designed to prepare you for a job in three to six months if you work on the certificate part-time, so this program is really flexible. You can complete all of the courses on your own terms and at your own pace. Throughout the program, we'll give you resources that will prepare you to advance in your career as a data professional. As you progress, you'll also build a repository of portfolio projects and a comprehensive capstone project that will showcase your abilities beyond your resume. You'll also have a supportive network of peer learners taking the certificate with you, and you can connect with them in the discussion forums. This program is designed to give you experience by building upon the knowledge and skills that you've developed to this point. Regardless of your experience with data and analytics as you begin the program, you'll learn about different experiences that are relevant and helpful for starting or advancing your career. In addition to building your skill set, we'll examine how teams of data professionals collaborate and contribute in the workplace. By the end of this program, you'll be ready to pursue a position in the data career space. By completing a Google career certificate like this one, you'll develop the skills and knowledge necessary for a job in this expanding career field. Once you graduate, you can connect with hundreds of employers who are interested in hiring Google career certificate graduates. Whether you're seeking to switch careers, level up your skills, or start your own business, the Google career certificates can help you take that next step. Throughout this first course, I'll be here to help you gain foundational knowledge needed to succeed in the field. Again, I'm so glad you're here. I'm excited for you to take these next steps forward in your career. I'll see you in the next video. Now that you have a general understanding about this program overall, let's talk a bit more specifically about what you can expect in this course. We'll start with the basics and a little background on what the data field offers. While you may already be familiar with work within the data field, I'm excited to dig deep into some history that shows us where we've been, where we are, and where we're going in this field. These key developments and applications will really showcase all the opportunities this program will help prepare you for. In this first course, we'll discuss the specific skills and characteristics that organizations look for in future employees. You'll develop the core skills necessary to advance on your journey as a data professional and integrate these skills with your own pre-existing abilities. We'll focus specifically on technical and workplace expectations. There will also be plenty of chances to practice along the way. Then we'll explore your job market opportunities. You'll learn about the variety of roles and positions that match your skillset. You'll also investigate the responsibilities and ethics that underpin all roles within the data career space. With the increasing number of industries that are turning to the data professions, you're bound to find a role that fits your interests. And we'll also explore some of these industries so you can see where data professionals are employed to make more informed decisions. 
Then we'll examine how larger organizations create teams of data professionals to approach larger-scale projects. We'll also peek into the future of data careers and the trajectory of the field in general, so that you have a good sense of what the future holds for you after completing this program. We'll also investigate elements of effective communication and discover how it can empower you as a data professional. Throughout this course and the whole program, you will see how effective communication can elevate productivity and promote general understanding during the data analytics process. As you progress, you will gain hands-on data analysis experience through your portfolio projects, beginning with the first one in this course. We've designed a few different options for you to apply your data skills to actual scenarios with data shared by our industry partners. And you can use these projects to showcase your skills to future employers. Lastly, our instructors will provide a few career tips and tactics to guide you on your journey. That's a short preview of what to expect later in this course. Next, you'll have the opportunity to go over some resources that will help you get the most out of this program. I'll see you in the next video. As a kid, I was always playing with Excel. I had this gemstone collection, and I loved putting the data into a spreadsheet. And this collection would grow, not for the gemstones themselves, but so that I could put more of it into the spreadsheets. I'd get excited. Oh, a purple one. I don't have an amethyst yet. You know, now I get to put purple into the color column. I was a weird kid. But to me, data was the most beautiful thing on earth. But as my career progressed, I began to think a lot about the why of data because I took it for granted that data is pretty. But there has to be something important to motivate action. And the important bit was decision making. If a data point falls in the forest and doesn't lead to any kind of action, in my opinion, there's no point to it. It becomes valuable when it's related to decisions or real world actions. And so that's why I got really interested in the decision sciences and studied those alongside what we today call data science. Even though back then it was called statistics, and big data eventually showed up as one of those fields as well. So I studied all of those things together. I remember back when I was in college, some career counselors asked me, what even is this major? You can't even get a job with this. Well, today, combining the decision sciences, thinking carefully about the why, plus the data sciences, the information piece, you get to use information to drive better action. And that is what I'm passionate about. A lot of people don't realize that without any courses in data analytics, they are already data analysts. We're all already data analysts. You're watching this course on your computer or your smartphone. And the information that is being recorded now, as I speak, by the video camera, is stored in a bunch of matrices as a bunch of numbers that don't make any sense when you look at that raw data. When you open it with the correct software, in this case, your browser, you are extracting meaning from that raw data and you're learning something. So right now, in this very second, you are doing data analysis. There are so many different ways to work with data. There are so many different ways to make it useful. Some of them are going to fit your personality better than others. 
Just guide yourself towards what's most fun, because there's also a lot of different people in these careers and it's a team sport overall. So they're going to cover the parts that you are less inclined towards. And so just lean in to the most fun that you can have pursuing your individual data personality. Just make it useful, and all the good things will follow. Now again, let's discuss some of the course items you'll encounter in your learning journey. In this program, you'll code in Python, discover the stories that data holds, develop data visuals, use statistical tools, build models, and even dabble with some machine learning. Along the way, you'll build a portfolio full of data projects, in addition to this program's capstone. Whether you're looking to switch careers, start a new career, improve your skills, or advance beyond your current role in a company, the Google career certificates can help guide you as you take steps towards new opportunities. We've gathered some amazing instructors to support you on your journey, and they'd like to introduce themselves now. Hello, I'm Adrian, and I'm a customer engineer at Google. Together we will explore one of the fastest growing programming languages, Python. You'll learn the basics, which will help you write scripts that perform a number of key mathematical operations on data sets. All designed to help you unlock the stories within data. Hi there, I'm Rob, I'm a consumer product leader. I work on marketing projects here at Google. I'm excited to talk to you about how to tell stories using data. We'll discuss the six practices of exploratory data analysis, and how to identify the trends and patterns in it. We will also learn about the importance of designing and presenting data visualizations using Python and Tableau, which can help you understand your data and convey it to others. Hello, my name is Evan. I'm an economist, and I consult with various teams across Google. Statistics helps you generate more complex ideas from the data itself. In our time together, you'll discover how you can generate insights, draw conclusions, make inferences, create estimates, and make predictions. Hello, I'm Tiffany, and I'm a marketing science lead, and I work with marketing data here at Google. I will guide you through the process of modeling relationships between variables. Together, we'll explore different regression models and hypothesis tests. We'll also talk about model assumptions, construction, evaluation, and interpretation as the means for answering data driven questions. Hello, I'm Sushila. I'm a data scientist, and I work on projects for YouTube here at Google. I'll guide you through building systems that can learn and adapt without a specific set of instructions. We'll discuss how machine learning is transforming the process of data analysis as you construct your own models. Hello, I'm Tiffany, and I lead teams focused on building AI responsibly here at Google. I'll introduce you to career resources and portfolio projects, and guide you through the capstone course at the end of the program. I'll assist you with different opportunities and tools that will set you up for success on the job market. And of course, you already know I'll be guiding you through course one. This is such a great time to grow and advance your career as a data professional. Your path to a career full of new opportunities awaits. Earlier, I referenced the tremendous amount of data generated each day. It has become a byproduct of modern life. 
For companies and organizations, all this data can provide insight into the ways they operate and ultimately interact with their users and consumers. To obtain these insights, organizations need people with the ability to access, interpret, and share the stories within their data. Organizations understand that data can inform decision making and explain consumer trends and user behavior. Data professionals use data insights to optimize products or services. There's a common phrase in data-driven decision making that references the untapped potential of data. The phrase is, imagine if we knew what we know. Basically, it's a way of asking the following question. How can we take all that data that may already exist and translate it into meaningful and actionable insights? To gain insights, businesses rely on data professionals to acquire, organize, and interpret data, which helps inform internal projects and processes. Businesses seek those who can access data and understand its metrics. As a reminder, metrics are methods and criteria used to evaluate data. Both are necessary before creating predictive models that can identify trends and inform best practices. And that's where you'll come in. The combination of all these skills, from statistics and scientific methods to data analysis and artificial intelligence, all fall within the category of data science. Data science is the discipline of making data useful. To me, the idea of usefulness is tightly coupled with influencing real-world actions. Some individuals with these skills may work on developing business insights and supporting strategic decision makers. Others may use data skills to fuel automation, testing, and analytic tool development. Still, others may focus on the analytic process itself by adapting modeling approaches to incorporate new and emerging technologies. Data is the foundation for making future decisions. It's through our actions and decisions that we affect the world around us. Businesses and organizations need people like you who can think critically and analytically about how to directly address challenges and opportunities through data-focused projects. The work of data professionals can provide businesses and organizations with details about their practices that can promote new approaches and innovation. This might make a little more sense if we take a closer look at an example. A global delivery company, for instance. Generally speaking, delivery companies are responsible for transporting goods to consumers. A company as complex as this is going to have a number of different data inputs or streams that influence, impact, and affect the ways the business operates. These data inputs and streams may include, but are not limited to, weather and traffic patterns, which affect when deliveries are projected to arrive, gas price trends and fuel economy, which affect shipping costs and profit margins, truck loading times in relation to the number of workers available, which affects the time it takes for the delivery to reach its final destination, how users interact with the company's app to track their packages, which affects the customer experience and the company's ratings, and whether users engage with marketing emails sent after they make specific purchases, which impacts future and repeat sales. My point is, each of these variables affects the way organizations harness data to transform decisions, automating and adapting machine learning where applicable. 
The ability to unlock transformative information within data is a skill that businesses seek. As you progress within this program, you'll discover how data professionals can make meaningful contributions to almost any organization by finding action-oriented solutions within data. All professions require a certain set of tools for success, and data-driven work is no different. In this video, we'll open our analytics toolbox and look at some of the most common items. Before we begin, I want to emphasize that each of these items serves its own individual purpose. However, when used together, they help build and tell stories with data, which can then inform, influence and impact business decisions. Programming languages are the first tools we'll investigate. They allow data professionals to work efficiently with large datasets and dissect them. Most languages have been developed over time and each data professional has their own preferences. We'll mention two in this video that have become very popular for data analysis: the R programming language and Python. R is a programming language that's used extensively by researchers and academics. It was my primary language during graduate studies in statistics, and some people say that R captures the statistician's mindset. I'd say there's something to that sentiment. If you're after implementations of the latest statistical breakthroughs, R is a great place to look, but it's used for more than statistics. You'll find many new technologies and ideas programmed with it. One of the best features of R is that you can create complex statistical models from just a few lines of code. If you're curious about R or need a refresher, be sure to check out our Google Data Analytics certificate, also offered here on this platform. This program teaches the Python programming language. It's a great choice for a few reasons. First of all, it emphasizes readability, making it one of the easiest programming languages to learn and write. Second, unlike R, Python wasn't born in the data community. While this might sound like a minus, it can also be a huge plus. In the modern world, data is used in increasingly creative ways. So there's a massive advantage to learning a programming language that's capable not only of handling the data side of things, but can also be used to build and deploy the applications that data will be fueling. Although R was my first love, these days I find that I lean more heavily on Python because of its flexibility. Python can perform a wide variety of data-related tasks, which makes it very popular among data professionals. If you're a novice or new to coding completely, Python is a very approachable language. Its formatting is visually uncluttered. It's one of the most beginner-friendly languages and it has enormous online communities and plenty of resources to help you if you get stuck. We will interact with Python within a web-based computing platform called Jupyter Notebooks, which allows you to run code in real time and helps you identify errors easily. To visualize the stories in the data, we're going to teach you how to share complex data through a graphical interface. Those who experienced our data analytics program will be familiar with a platform called Tableau. In this program, we'll take a more detailed look at how this powerful tool can help others understand the results of your analysis. Additionally, we'll look at effective communication in data-driven careers. 
At first glance, it might seem like less of a concern, but describing the sometimes complex processes of data analytics to non-technical stakeholders may be one of the most important skills a data professional can have. Since communication is something we all do regularly, it's easy to forget about the importance of how data professionals share and present data stories. Our goal here is to strengthen the communicative skills that you already possess so that you can leave this program equipped to excel. In this course specifically, and across other segments of this program, communication will be a key component that is directly tied to the work you'll do as a data professional. Programming languages allow data professionals to interact with and interpret data. Visual data tools like Tableau enrich the stories within data with visual elements that bring attention to specific details, but the most important element of any story is the storyteller. That's you. Your prior experiences and knowledge inform your storytelling abilities, and your distinct background is what will set you apart from others in these roles. Regardless of your eventual career path, remaining determined and developing the proper skills is essential to personal and professional transformation, and the tools we're offering you in this program will also help you along the way. I'm thrilled to continue alongside you in your journey. The best is yet to come. I'll see you soon. Congratulations on completing the first part of this program. You've officially begun your journey to new opportunities in the data field. Let's revisit what we've covered so far. First, we covered the basic logistics of what's included in the certificate. You met each of your instructors and we previewed some of the different course topics you'll encounter. Next, we looked at the basics of data-focused work and some of the industries that are incorporating data insights. We also discussed the future of data-driven careers. In addition to exploring what it's like to work in the data field, we also discussed the data professional's toolbox and the skills this program will help you develop. Congratulations on your progress so far. I can't wait to see you in the next video. My name is Loyanne and I'm an industry intelligence lead at Google. What that means is that I help to make both our internal teams and our partners and advertisers a lot smarter about the industry that they work in. Early in my analytics career, it was actually pretty lonely and so I had a really big need to connect. I could not find a job immediately out of college and I had done lots of things, worked different places, didn't really build deep enough connections to have a community of people in this area. Early on, I started joining some networks. Networking is critical when entering a new field. For me, what was most jarring about this new thing was just the newness of being new. So you feel like a fish out of water. You are also subject to that very pesky imposter syndrome. I'm not good enough to be here. They're going to find me out. Having a network that was able to rein me back in and say, no, you're qualified, and here are all the reasons why you're qualified, and actually, here's how these other skill sets you have make you an asset here, was immensely valuable. In my role currently, I collaborate with tons of different people in data analytics. Collaboration is important because there are so many diverse parts of the story. 
And so you need people who are willing to, able to, and good at asking the questions: what exactly are we looking for? Are we looking in the right place? You need people who are then able to query and find a set of data, find the needle in the haystack. You need other people to find the pattern and to find what the insight is from the data. And then you need people to tell the story. Practicing curiosity is something that will always help you in any role that you're in, especially when you're dealing with data. I highly recommend participating in the forums in this certificate program because it's a great way to practice curiosity. Getting comfortable asking questions, and getting comfortable working on a team and asking questions of your team, is super important. And beyond that, you have this opportunity to connect with your cohort. You are the next generation of advanced data analytics practitioners. And so getting to know one another, getting to understand what you're all thinking, and even opening up the door to innovate together is a super important and exciting opportunity. As you know, this program asks you to complete a graded assessment at the end of each section and course. And now's the time to prepare for your first one. This assessment will verify your understanding of key data analytics concepts. It will also help build confidence in your understanding of data analysis while identifying any areas where you can continue to improve. Assessments can sometimes feel overwhelming, but approaching them with a strategy makes them more manageable. Here's a list of tips to set yourself up for success. Before taking an assessment, review your notes, the videos, the readings, and the most recent glossary to refresh yourself on the content. During the assessment, take your time. Review the whole test before filling in any answers. Then answer the easy questions. Skip the ones you don't know the answer to right away. For multiple-choice questions, focus on eliminating the wrong answers first. Also, it's a good idea to read each question twice. There are often clues that you might miss the first time. If you start to feel anxious, calm yourself with some mental exercises. One way to do that is by completing a simple math problem in your head or spelling your name backwards. This also helps you recall information more easily. Before you submit the assessment, check your work, but be confident. Sometimes people change an answer because it feels wrong, but it's actually correct. Your first instinct is often the best one. Finally, remember to trust yourself. Often people know a lot more than they give themselves credit for. Everyone learns at different speeds and in different ways, but it's important to maintain your momentum. So take the time you need. And when you feel ready, keep moving ahead. You've got this. Great to have you back. I'm so delighted to be your guide as you continue on your way to becoming a data professional. As you've been discovering, data can be used in many ways. But no matter how it's applied, the key thing to keep in mind is that knowledge is power. The power to improve your business, your work, your personal life, and the world around you. With all of the data surrounding us, there's just so much potential waiting to be unlocked. I'm so glad that you've chosen to learn more and to be a part of it. And speaking of your learning journey, in this section of the course, we'll start by finding out more about data careers. 
We'll explore different industries and examine some direct applications of data-driven work. And we'll investigate where data careers are headed in the future. It's such an impactful and rewarding field. And it just keeps getting better. I'm so excited for you to get started. Let's go. How's it going? My name is Adrian and I'm a customer engineer at Google. A customer engineer at Google is the bridge between the technical and business components or aspects of corporate America. We help customers leverage our technology to meet their business needs. One of the most exciting and interesting projects that I worked on was creating a unified patient record. And this project was for a lab diagnostics company that was really looking into creating a 360 view of a patient. Our project focused on consolidating information across different systems. You can imagine taking your lab diagnostics data for your most recent routine blood work or even something that's a little bit more specific, you know, a diabetes or glucagon test. Taking that and all that information that you've provided in terms of biological data, getting that centralized and standardized in one environment so that if it is your physician looking at the data or your radiologist or your nutritionist, they have a complete view of all of your data and they can not only just access it but they can visualize it in real time. One of the key components of this project was the use of virtual reality headsets as the main visualization component. For me, what this meant was I got the ability to do some game design, create some human skins in terms of the body and use different colors and programming functionality to indicate, for example, whether a system is looking healthy or whether it looks like a cause for concern that needs intervention. We can see over time how the diagnostic test results have changed and the impact that this has on our patients. At the end of the day, what we had was a project that unified all of the data for the patient and gave the key medical professionals a way to visualize it. The best piece of advice that I have for you is to remain confident in yourself. While you might be new to this new role or this new company, you're coming with skills that are repeatable, that are transferable, whether it be the way that you approach problem solving, whether it be the rapport you're able to build with your stakeholders and the soft skills that ensure you can build those relationships, or even when it comes to communication and writing a succinct and concise email, there are going to be ways where you're going to take what you've done before and apply it in your new role. You might have to think outside of the box, but you'll get there. Data professionals are so valuable to their companies. They determine which data streams are most important to specific business projects, challenges and initiatives. They identify key goals for the future and they give their organizations the ability to take meaningful action by reimagining processes and improving operations. To do all of this, data teams require individuals with diverse skills, knowledge and interests. Therefore, there are countless different data-focused roles, responsibilities, and project types, which are further differentiated by the industries and businesses they support. Among all of these possibilities, data careers can be loosely categorized into two complementary types of work: technical and strategic. In this video, we'll investigate both. 
First, let's find out about the folks whose work requires a heavy emphasis on technical skills. Some examples of these professionals are machine learning engineers and statisticians. These people provide high-effort solutions to specific problems. Through their expertise in mathematics, statistics and computing, they build models and make predictions. Using tools such as R and Python, they help their teams extract value from business data sets. The result is a solution that has a direct positive impact. Another highly technical role is the expert data analyst, whose work involves exploring vast and complex data sets to identify directions worth pursuing in the first place. They ensure that an organization's data science efforts are directed as efficiently as possible, bridging the gap between other technical data professionals and the strategic work we'll cover shortly. Like most technical data professionals, you'll learn how to acquire, scale, organize, structure and manipulate data so that it's packaged in a way that others can work with. In other words, you'll know how to transform raw data into something useful for decision-making. Okay, now let's consider data professionals on the more strategic side. These people include business intelligence professionals and technical project managers, to name a couple. Strategic data professionals use their skills to interpret information that affects an organization's operations, finance, research and development and so much more. Their work aligns closely with the overall business strategy and involves seeking solutions to problems through data analytics. In short, strategic data professionals maximize information to guide how a business works. Sometimes you'll find a company has roles that blend specialist technical knowledge with strategic data expertise, often in unusual and very creative ways. Soon we'll learn more about some of these opportunities, as well as the more specialized technical and strategic roles, and of course, we'll discover some proven ways to tap into them as a data professional. Lots more to come. Nowadays, it seems that data is generated everywhere. Computer functions have been integrated into a multitude of everyday technologies. Smart home voice assistants, step trackers, TV streaming apps, even some coffee makers are connected. All of these touch points create data that businesses can use to understand trends and advance their products and services. And with every update, the ability to gather environmental and user data expands even further. Within this massive reserve of data is a wealth of untapped potential, awaiting data professionals and the organizations they support. That's why these people are so in demand. Companies need them to use data to refine business strategies, meet consumer preferences, react to emerging trends and realign internal efforts. Let's consider some examples of how data professionals use their expertise to transform industries. First, the world of big finance was an early adopter of the power of data science. And with the way information drives this industry, it's easy to understand why. Data professionals help financial institutions assess investment risks, monitor market trends, detect anomalies to reduce fraud, and create a more stable financial system overall. Hundreds of millions of financial transactions occur in the financial world each day. And data is part of each and every one of them. As another example, data analytics is key in healthcare. 
And here, the benefits of data can actually be life-saving. For instance, the information collected by smartwatches is making a huge difference in the lives of many people. Sensors in these devices record biometric data, such as heart rate and oxygen levels. And of course, all this information can be shared with healthcare professionals. Together, the patient and the physician can better understand sleep trends, stress levels, and much more. Then individualized wellness plans can be created and modified for the patient's well-being. Plus, on a larger scale, data analytics is helping healthcare organizations process large amounts of clinical data, which supports the early detection of a health condition and leads to more precise diagnoses. Thirdly, data has a big impact in manufacturing. Data professionals predict when to perform preventative maintenance to avoid production line breakdowns and use data to maximize quality assurance and defect tracking, while artificial intelligence models help respond to logistical issues and reduce delivery truck miles on the road, advancing key sustainability goals. And in a time when supply chains reach every corner of the world, data enables clear and near real-time communication with suppliers, retailers, and other network partners. It also helps supply chains maintain optimal levels of inventory to avoid stockouts and empty retail shelves. Data professionals are also advancing approaches to agriculture. With data insights, farmers develop new ways to approach crop production, livestock care, forestry, and aquaculture. The inclusion of autonomous machinery, tractors, and irrigation systems is improving harvesting technologies as well. If you'd like to keep learning about how various industries use data analytics, refer to the course resources on this topic. And here's a little piece of advice from me. Don't miss an opportunity to learn from someone in real life. I love asking business owners, store managers, and client support professionals about how they use data each and every day. Who knows? One of these conversations could open a door to a future opportunity for you. Recently, you've been learning about how businesses use data to guide decision-making, answer questions, and solve problems. In this video, we'll investigate how nonprofits use data analysis to pursue their unique goals. Nonprofit groups are created to further a social cause or provide benefit to the public. As the name suggests, their main purpose is not about profit, but to foster a collective, public, or social advantage. There are some very rewarding and inspiring opportunities for data professionals in the nonprofit sector. In particular, data can be applied in order to help these groups more effectively anticipate and respond to the greatest areas of need. For instance, maybe a U.S. charity that provides bicycles for children would like to determine which neighborhoods are most in need. They could ask their data professional to access data from the U.S. Census Bureau. The professional would use their talents to navigate the census database, identify key metrics, and summarize findings with analyses and data visualizations. This report would highlight where there are larger numbers of school-aged children in need who would benefit from the resources of this program. And there you go. Data insights led to informed decisions about where this nonprofit can do the most good. Now, nonprofits do more than use data. Many of them also collect it. 
As you likely know, public entities and government agencies can be excellent resources for useful data. And much of it is open data that's available for general use. Open data is data that is available to the public. It's free to use, and guidance is provided to help navigate the datasets and acknowledge the source. While sourcing open data is a good way to interact with data on your own, there are other opportunities that enable you to refine your skills while helping others. Data volunteers contribute to many projects that help nonprofits benefit communities all over the world. To find out more, here are some organizations to check out. First, the Data Science for Social Good Foundation was launched at the University of Chicago back in 2013. In 2020, they joined forces with UNICEF to analyze various aspects of air pollution around the world to help monitor children's health. DataKind launched in 2011 in New York City with chapters in the United Kingdom, Bengaluru, San Francisco, Singapore and Washington, D.C. This organization analyzes the cost of environmental cleanup in different underserved communities to guide restorative efforts. You can view both organizations' latest efforts through the links in the transcript for this video. Another option for putting your data skills to good use is a hackathon. A hackathon is an event where data professionals and programmers come together and collaborate on a particular project. The goal is to create a solution to an existing problem using technology. Some examples include developing better tools for predicting extreme weather events, creating tech to help elementary school kids learn important reading skills, and identifying ways that community development groups can use their data to advance home accessibility and affordability. Volunteering your data skills to public projects is an excellent way to contribute to the greater good while gaining experience and networking with others in your field. Coming up, we'll take a deeper look at some data-oriented projects in the public sector and how they're making an impact around the world. One of the most important responsibilities for those of us in data-centered careers involves how our organizations manage and protect data. This has a lot to do with communication exchanges between a company and its customers. As you've been learning, almost all communication generates data, whether it's a shopping receipt, confirmation of an order, or even earning customer loyalty points. And businesses have a big responsibility to their customers, especially when it comes to maintaining and protecting user privacy. Any data gathered from individuals or consumers is referred to as personally identifiable information, or PII. PII permits the identity of an individual to be inferred by either direct or indirect means. This includes things like biometric records, user names, and social security or national identification numbers. Because this information is often associated with medical, financial and employment records, PII is sensitive and must be managed with great care. After all, when someone's personal data is improperly handled, they become vulnerable to identity theft, fraud and other issues. Recently, there have been great efforts to take a wider view of data collection practices and protect individuals. Industries are trending towards aggregate information. This is data from a significant number of users that has had personal information removed. 
Aggregating the data and removing PII protects people and gives them more control over their own data. Similarly, as more industries become interconnected, the amount of data available to them increases. And just as with aggregate information, the more data collected, the more likely it is that it will be representative of a wider population rather than a single user. A key thing to keep in mind is that data gathering is a task managed by humans. And that process can be informed by different backgrounds, experiences, beliefs and worldviews. These and other types of biases can affect the way that data is communicated and how the results are shared, which in turn can have an impact on business decisions. Effective data professionals know that when collecting, analyzing, interpreting, or communicating sensitive data, bias should always be considered. So be very careful when interpreting data where there is a clear source of bias, and be on the lookout for subtle biases as well. In addition to thinking through bias in the data, data professionals should also try to emphasize that there can be a multitude of possible interpretations for every data insight. So the main trick is to avoid jumping to conclusions until you've really done your homework. One method of addressing bias is to make sure that the data that you're working with has the same characteristics as the greater population that you're interested in. In data analytics, this is called a sample. A good sample is a segment of a population that is representative of that entire population. Here's an example. A clothing company is analyzing sales in their highest-growth market. They want to determine what color shirts will be most popular in the upcoming season. One person notes that red and blue shirts accounted for 80% of their sales in this market over the past three months. This is a big number, so they suggest ordering lots of red and blue shirts. But another person points out that the local sports team's colors are red and blue, and this team recently won a championship. It's very likely that sales of red and blue shirts will have spiked as consumers purchased tees to support the local team. Plus, they note that although this market is high-growth, it only represents 40% of the retailer's total sales. With all this information in mind, decision makers at this retailer instead choose to evaluate color popularity over a full year and across all markets. This will provide a much more complete picture. We'll investigate more about bias later in this program. And as you progress, you'll discover many more strategies for ensuring that you're aware of bias and proactively working to counter it in all of your data work. When investigating a possible new career path, one of the most important things to consider is its outlook and potential for growth. Predictions about careers related to data analysis show that there is no shortage of need for professionals in this field. Over the last decade, data-focused careers have surged. According to estimates by LinkedIn, the data science field grew by over 650% between 2012 and 2017. Many experts believe that we have not yet seen the full potential of these careers. In fact, the US Bureau of Labor Statistics stated that data science is one of the fastest-growing career fields in the United States, projecting a 30% increase over the next decade. Among the data science professions, one of the fastest-growing areas is artificial intelligence and machine learning. 
And we've seen significant advances in these areas in recent years. At its core, artificial intelligence, or AI, is the development of computer systems able to perform tasks that normally require human intelligence. Thanks to growth in the data sciences, AI is now becoming more commonplace. These technologies will continue to evolve and provide more accurate results and richer insights. And as AI increasingly becomes an essential component of data work, it's important to be aware of the human bias that can be imprinted within your work. To counter this, organizations benefit most from building diverse teams of professionals from different backgrounds and different life experiences. Incorporating a wide range of perspectives and worldviews promotes wider representation and yields more accurate results. As we study the future of the data professions, I want to emphasize that data professionals have yet to realize the full potential of artificial intelligence. As these types of technological innovations continue to evolve, we can expect that organizations will grow and adapt their business practices accordingly. With wider and wider adoption of data analysis techniques, the most likely area for growth is in specialization. And we expect to see further subdivision of roles within data-focused teams. Ultimately, what I want you to keep in mind is this. The world is generating more and more data every year, so it's reasonable to expect labor that extracts business value from it to be able to earn its keep. More data means more demand for the three main activities covered by the data professions, statistical inference, machine learning, and data analytics. So those skills will stay very relevant, though their names might evolve over time. In addition, constant innovation in the field offers you the opportunity for perpetual learning, growth, and development. As you may already know, being a data professional means that your growth and success in this field depend on a desire to keep learning. In fact, that just might be the reason you enrolled in this program. And for that, I'm so proud of you. Continue to explore opportunities to evolve throughout your career. Be proactive in acquiring new skills, keep growing, and you will always be ready for the future. In this section of the course, we explored many different facets of data careers. You learned that data professional is a broad term that encompasses different roles and responsibilities in the data space. You discovered that the work we do in this field involves countless possibilities, such as determining important data streams, identifying and focusing on future business goals, and reimagining internal and external processes. You also thought about some of the ways that organizations are being transformed by data professionals and how these talented individuals use their skills to positively impact communities around the world. You've come so far already, but there's so much more to learn. Thanks for joining me on this exciting exploration, and I'll catch up with you again soon. Welcome back. I'm really excited to share this section with you. We'll start by looking at some of the skills needed in the data field today. We'll look at what happens after you land a data position and what to expect in those first few weeks on the job. We'll investigate some of the roles within the data professions and take a closer look at the general categories that you'll encounter. 
Then we'll explore the importance of networking and building relationships within an organization. We'll also look at the responsibilities of different data professionals. So let's get started. See you in the next video. All data professionals share a love of data and a desire to solve problems. While wearing their analytics hat, data professionals lay out the story that they're tempted to tell and then they poke at it from several angles with follow-up investigations to see if it holds water before bringing it to their decision makers. In doing so, they rely on their programming and investigative skills to guide others towards informed decisions. Data professionals also combine knowledge about how to do practical tasks with an awareness of what makes communication and collaboration successful. Later, we'll dig deeper into the elements of communication and discuss the ways communication enhances and structures your work as a data professional. For now, let's examine some skills and attributes that are applicable across data-driven careers. Working in data analytics requires a mix of business sense and knowledge in gathering, manipulating and analyzing data. Our goal is to prepare you to develop the competencies needed to succeed. Let's start by discussing some interpersonal skills. Often, these are referred to as people skills. They focus on communicating and building relationships. Interpersonal skills are critical. In this field, there's a high degree of interaction between stakeholders. This is especially relevant now with team members often working collaboratively across the globe. Very often, work conversations are the starting point and the fuel that drives projects. And because of the cyclical processes within data analysis, communication is always ongoing. Another important skill is active listening. This means allowing team members, bosses, and other collaborative stakeholders to share their own points of view before offering responses so that each exchange improves mutual understanding. You can actually practice active listening. Next time you speak with someone, put extra effort into listening beyond their words. Focus on what they're trying to communicate. Your listening and communication skills will play a huge role in helping you capture effective insights and support informed decisions. We'll take a closer look at communication a little later in this course. There are other things you'll need to consider. As a data professional, you'll search for information hidden within large amounts of data by applying critical thinking skills. Along the way, you'll investigate the connections between a variety of different data sources as you search for trends and indicators. Think of yourself as a data detective. Project data can come directly from your organization or from other sources. You might be lucky and receive a well-formatted spreadsheet or database, but quite often, you will need to prepare the data to get started. This process is known as data cleaning. This is where the data is reorganized and reformatted. The goal is to remove anything that could create an error during analysis. This process includes tagging and consolidating duplicates and addressing irrelevant entries, structural errors, and empty values. Once you have everything in the proper format, you can then filter out unwanted material. Now your data is ready to be analyzed. It's time to look for trends and tendencies. 
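To make that cleaning step a little more concrete, here is a minimal sketch of what such a pass might look like in Python using the pandas library, one common tool for this kind of work. The file name, column names, and filter rule are hypothetical stand-ins for whatever your project data actually contains.

import pandas as pd

# Load the raw project data (hypothetical file and columns).
df = pd.read_csv("deliveries_raw.csv")

# Tag and consolidate duplicate records.
df = df.drop_duplicates()

# Address a structural error: standardize inconsistent casing and stray whitespace.
df["region"] = df["region"].str.strip().str.title()

# Handle empty values: drop rows missing the field the analysis depends on.
df = df.dropna(subset=["delivery_time_minutes"])

# Filter out irrelevant entries, such as canceled orders.
df = df[df["status"] != "canceled"]

# The data is now ready for analysis.
print(df.describe())

Once a pass like this is done, the trends and tendencies you look for next are far less likely to be artifacts of messy input.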
Often, it's very helpful to render the data visually to reveal additional insights through charts, dashboards, and reports. Graphic tools will be very useful in identifying patterns, as well as in sharing information with others. You will explore this in greater detail later and have opportunities to practice compiling visualizations. You'll also learn about more advanced skills like building models and machine learning algorithms. These tools will help you and other data professionals assess information accuracy, analyze specific data segments, and predict future business outcomes. Your hard work will assist leaders and other decision makers in your company, providing them access to a rich variety of perspectives on different sets of information. With demand for data analytics increasing across all types of companies and businesses, you will likely find opportunities in an industry that you are personally interested in. Next, we'll take a look at working in the data field. My name is Cliff and I'm a workforce planning and people analytics lead at Google. I use data to help our employees be more productive, more connected, and just overall improve their well-being. I also use data to improve our HR practices and focus on our hybrid work policies as well as our location strategy. I've always been interested in issues of workforce development, people strategy, and human resources, but I didn't anticipate what a central role analytics would play in my work and how much I would come to love analytics. One of the things that has helped me develop a confident voice in this space has just been an understanding that we work in teams that work cross-functionally and I don't need to bring all of the solutions to a problem. I'll bring a perspective on how we can use data to solve this problem, but I'm working with people who also bring a wealth of skills. Looking at it really as a partnership, it's really about leveraging the best of everyone on the team. That's helped me bring a lot of confidence to the work. My go-to strategy for communications when working with partners is to first set up a few low-stakes meetings just to understand what their broader business goals are. I'm not even thinking about the specific project we're working on, but more broadly, how do they define success? That helps me understand where the work that we're doing fits into the context of their bigger picture. The second thing I do from a communication standpoint is to try to play back what I think I heard somebody say, to repeat it back to them, whether it's my understanding of their question or the output that they're trying to see from the data, just to test if I actually really understand their goals. When I'm working with somebody and I feel we're not getting to the root of a question or a problem, what I find really helpful is laying out, from a data perspective, a set of different options or possibilities for them and engaging in a conversation around which of those really resonate for them. So it's finding a balance between listening and telling as a way to unlock some ideas that they might not have thought about themselves. You can learn a lot about a career by looking at job postings. If you've searched for opportunities in the data space, you may have noticed different data-related job titles with similar responsibilities, or postings with similar titles listing different responsibilities. Here's an example. 
At one company, the role of data analyst will focus on using statistics and models to craft insights that inform business decisions. Another job with the same title at a different company may focus on optimizing the tools and products that automate analytical processes. One reason for these inconsistencies is that data tasks and responsibilities are dependent on an organization's data team structure and how they make use of insights and analytics. As such, some organizations choose to be very specific with responsibilities. Others leave job tasks quite broad in scope. That's why this program refers to the field as a career space. With that said, when you're comparing positions that have similar titles, I encourage you to classify them based on the skills used in their day-to-day activities. Two of the most common titles are data analyst and data scientist. These can cover a wide range of job responsibilities, many of which you'll gain experience with in this program. Traditionally, a data scientist was expected to be a three-in-one expert in data analytics, statistics and machine learning. But not all employers use these conventions when writing their job descriptions. Generally, any role that includes analytics expects candidates to be able to function as technically skilled social scientists, looking for patterns and identifying trends within big data sets. Also, they develop new inquiries and questions as they uncover the stories inside their data. Their hard work can help steer companies' future actions and guide decision-making. They allow their organizations to keep a finger on the pulse of what's going on in the business, interpreting and translating key information into visualizations such as graphs and charts that allow every stakeholder to understand their findings. At times they may be tasked with creating computer code and models to recognize patterns in the data and make predictions. When you investigate job postings, you'll encounter other titles with similar responsibilities. For example: junior data scientist, entry-level data scientist, associate data scientist, or data science associate. All of these roles include a mix of technical and strategic skills to help others make informed decisions. In your career, you might encounter other professionals in roles that use data and analytical skills. Some of these may overlap with the skills you will learn, but these roles are specific to certain tasks or are supervisory positions. Let's take a look at a few. Data scientists depend on systems within their companies to collect, organize and convert raw data. Designing and maintaining these processes are some of the most important responsibilities of a data engineer. Their goal is to make data accessible so that it can be used for analysis. They also ensure that the company's data ecosystem is healthy and produces reliable results. These positions are highly technical and typically deal with the infrastructure for data, usually across an entire enterprise. You also need to have the ability to get data before it even makes sense to talk about data analysis. Most of the technical work leading up to the birthing of the data may comfortably be called data engineering, and everything done once some data have arrived is data science. Similar to how a data engineer oversees the data infrastructure, there are data roles that manage all aspects of data analytics projects for a company. 
Insights managers or analytics team managers often supervise the analytical strategy of the team or of the organization as a whole. As a data analyst, you will likely report to someone working in this capacity. They are often responsible for managing multiple groups of customers and stakeholders and they're often a hybrid between the data scientist and the decision maker. Since this combination of skills is rare, these positions are often more difficult to fill. This role can have other titles like analytics team director, head of data or data science director. You may encounter another job role in your scan of job postings, business intelligence engineer or business analyst. This role is highly strategic, focused on organizing information and making it accessible. BI analysts synthesize data, build dashboards and prepare reports to address specific needs for a business or requests from leadership. If you're interested in learning more about business intelligence and its opportunities, I encourage you to look into the Google Business Intelligence Certificate. Now that you have some idea of the roles found within the data analytics career space, we will begin to take a closer look at how data professionals function within their larger organizations. Compared with many other professions, the data career space is relatively young. This application of data-driven work in organizations has grown exponentially in the last several decades, which means there are many different opportunities and much job security for you in the future. Now that organizations have the technical capacity to take on their own data-focused work, they're looking for people like you with the right skills to fill these jobs. Traditionally, companies have filled jobs in the data career space with those from computer engineering backgrounds or from statistics. Increasingly, there's been a shift towards de-emphasizing engineering and instead promoting analytical skills. These skills can be learned in different forums, like the program that you're currently enrolled in. Let's look at a scenario. Let's say that an enthusiastic and enterprising person, that's you, is starting a new position at a company as a data professional. Your company is a recognized leader in its industry. Its workforce spans the globe and you are its newest member. It's your first day on the job and you are ready to start working. During your orientation, your company grants you systems access and onboarding documentation. You're starting to have a clearer picture of how information is generally shared with employees. You still have many questions about the responsibilities of the position. Later, you watch a video from the quarterly review meeting led by a company executive. Watching the presentation, you get insight into the quarterly budget, recent client interactions, and some general information on an upcoming project. You now have a broad understanding of the company. At this point, you still lack details about your specific responsibilities. During your first week, you're invited to a virtual meeting of the data professionals involved on the project that you've been assigned to. As each data professional outlines their job responsibilities, you take note of the differences among them. After each participant speaks, you begin to realize that not all data tasks are universal and that many data professionals end up adapting to meet the needs of the current project and the needs of the data. 
When you're new to a job, I would discourage you from over-specializing immediately. Instead, taking on a variety of tasks within a project is a great way for newer data professionals to continue developing their skillset. As a member of a larger group of data professionals, you're able to observe and learn from your team members. Once the analytical process is complete, the results of the project will need to be shared, allowing everyone in the organization to have access to the information. This includes building user-friendly interfaces and communicating the findings to different departments. Working for a large company means that there's a good chance that you will be dealing with vast amounts of information. This will require more work than a single data professional can reasonably provide. Because of this, you might encounter scenarios where organizations have created teams of data professionals. Throughout the rest of this section, you'll take a closer look at how complex organizations are incorporating data professionals through data teams and the division of responsibilities within these teams. Hi, I'm Tiffany, and I lead teams focused on building AI responsibly here at Google. I've served in the United States Army, worked as a consultant, and worked as a program manager in privacy and machine learning fairness. Data, and having a rich understanding of data, has always been an important part of my job. Today, we have more data available to us than ever before, and it's important to be able to derive insights to help decision makers make the best possible decisions. I'm so glad you're here, and I really hope this program is giving you all kinds of new possibilities to think about. You've already learned so much. We've covered the basics of data-driven fields and looked at career roles, how data professionals are being used by different industries, and how those in the field can make a valuable contribution. You're gaining a vast range of knowledge and skills, which is going to be extremely valuable as you prepare to join us in the amazing field of data-driven careers. At this point in the program, I encourage you to take some time to reflect on how your experiences so far are setting you up for a great career. And one way to do that is by enhancing your current online presence. In the Google Data Analytics certificate, we covered numerous job-related materials, including how to create an effective resume and LinkedIn profile. This video is about improving your existing career assets. Those of us who were involved in the Google Data Analytics certificate always love receiving learner feedback, especially when it has to do with someone else's professional success. I remember one person who took the initiative to refine her LinkedIn profile as soon as she began the program. She noted that she was currently working through the program, and she added to her profile many of the technologies she had become familiar with. Well, not long after, she saw an advertisement for her dream job. Even though she was early on in her data analytics education, she decided to apply for it. And she got it. The hiring manager told her that the fact that she had familiarity with those data tools really set her apart from other candidates. There are tons of stories just like this one that prove the value of having a compelling and professional LinkedIn presence. So let's get into that now. A professional online presence enables you to better connect with others in the field.
You can share ideas, ask questions, or provide links to a useful website or an interesting article in the news. These are great ways to meet other people who are passionate about data-focused jobs. Even if you're already part of the community, strengthening your network makes it even more dynamic. LinkedIn is an amazing way to follow industry trends, learn from thought leaders, and stay engaged with the global data analytics community. And of course, it has job boards and recruiters who are actively looking for data professionals for all sorts of organizations and industries. So it's a good idea to always keep your profile up to date and to be sure to include a professional photo. Beyond that, consider including a link to some of the relevant projects you've done in data analytics, such as the portfolio project you'll work on during this program. As you continue expanding your online presence to represent the work you're doing in data analytics, the connections you make will be an important part of having a truly fulfilling networking experience. Plus, there are also many rewarding in-person networking opportunities, which we'll explore soon. See you then. My name is Tiffany, and I work at Google, and my job is to ensure that our products are fair and inclusive. My first role in data analytics probably harkens back to my time spent in the United States Army. There I did a lot of work with data, trying to understand which decisions I should make for my soldiers and for my unit, making sure I was making the best data-driven decisions to ensure their safety and well-being. Coming out of the military, I felt a tremendous amount of imposter syndrome. I felt insecure. I wasn't sure of what I could be good at since I had such a highly specialized job, but I talked to a lot of mentors and a lot of friends who encouraged me and told me about transferable skills. Some of the skills that I gained in the Army were very, very clearly helpful to me in my career today. The one that stands out to me the most is the ability to frame a problem. That is, the ability to think about what someone needs, the data that you have, and how to connect the two in the middle, how to frame it out so there's no scope creep, so that you have a very crisp and clear articulation of the problem and a very clear and crisp articulation of the solution set. I learned that in the military, and I continued to build upon my skill base, continuing to go online, read books, and shore up that knowledge, and over time I became more and more confident of what I could accomplish and the things that I could reach for. All of the courses that I took, all of the hard work, all of the imposter syndrome really all led me to the job, my dream job, that I have today. It's important for people who have non-traditional backgrounds or non-traditional paths to get into a certificate program such as this, because we know that education is uneven. Opportunities are uneven. I'm one of the first people in my family to go to college and get an education, and being able to do so has opened up so many doors for me. And so many of you may be in the same situation. If I were to give anyone advice as they're transitioning out of the military and into a data analytics career, I would tell them to use some Google products. Google has a career translator where you can put your military service, your branch, and your job into that translator, and it will spit out transferable skills that you may have that you can place on your resume.
I'd encourage you to take the leap of faith, shed the imposter syndrome, and apply for as many jobs as you can. And finally, I would encourage everyone to try to find someone who works at the company, try to get that referral, be bold on LinkedIn, be bold on other platforms, and just try to make your way into the job that you see yourself in. Recently, you learned about the value of maintaining a professional online presence and connecting with others in the data field. As I noted, there are many professional networking sites, such as LinkedIn, that are well worth your time and involvement. But here's something that many people don't realize. Some of the best opportunities are never actually shared on a networking site. Sometimes there are professional opportunities that are not publicly advertised by the employer. There are lots of reasons why some positions are not posted. Maybe an employer is concerned about revealing details about confidential projects to its competitors through a job posting. Or perhaps the company HR department doesn't have the resources to review a flood of applications. Often, a business may choose to use a recruiter instead of posting jobs. So let's start exploring how you can increase your visibility and access more opportunities through building valuable relationships. After all, the more people you connect with professionally, the greater your chances are of being referred. Be sure to follow best-in-class organizations and visionary business leaders on Twitter, Facebook, and Instagram. Interact with them and share their content. If there's a post you like, consider commenting with a response or a thank you. You can also search for data field webinars featuring interesting speakers, and many of these events are free. This is another fascinating way to learn while connecting with peers, colleagues, and experts. And there are also lots of blogs and online communities that focus on data fields. Data and tech companies will often talk about what's new and important from their point of view, but there's a growing community of bloggers and podcasters who offer great perspectives of their own. Now let's move to in-person networking opportunities. The easiest way to find events is by simply searching for data science or data analytics events in your area. You'll likely discover a wide range of engagement opportunities, from more formal conferences and seminars to casual meetups and get-togethers. Non-profit associations are also wonderful resources and may offer free or reduced-rate memberships for students. In addition to networking, learning from a mentor can positively influence your career and life. As you may know, a mentor is someone who shares knowledge, skills, and experience to help you grow both professionally and personally. Mentors are trusted advisors and valuable resources. The first step in finding a mentor is to determine what you're looking for, to narrow down your potential list. Think about any challenges you face or foresee and how to address them in order to advance professionally. Then, consider who can help you grow in these areas as well as fortify your existing strengths. Share these things openly when you formally ask someone to be your mentor. It's also helpful to note any common experiences. Perhaps you grew up in the same city. Maybe you both worked in the same industry. Your mentor doesn't have to be someone you work with currently.
Many people find mentors on LinkedIn, an association mentorship program or at a mentor matching event. This really taught me the value of mentorship. I also learned that having a successful mentorship experience requires effort and investment in time, whether you're preparing to ask the right questions internalizing the feedback or scheduling follow-up sessions. But it's well worth it. Always be open to connecting with new people. You never know where a single conversation will lead. Congrats on your progress so far and on taking meaningful action to advance your career. I wanted to let you know about some of the great career-building activities and resources you'll encounter in the rest of this program. In the next course and those that come after it, you'll have the chance to complete a number of hands on activities based on data-driven scenarios. They'll let you put what you're learning into practice and help you discuss your skills with hiring managers in a concrete way. Be sure to save your work from these activities. They'll be useful to you as you near the end of the program and start thinking about the next stage of your data-driven career. When you get to the last course in the program, we'll go in-depth on preparing for a job search. We'll cover how to find and apply for jobs that interest you. I'll also share some tips to help you prepare for the interview process so you'll know what to expect going in. You'll learn how to put together an online portfolio that will help you demonstrate your knowledge and experience. You'll also complete a scenario-based project from beginning to end that you can put in your portfolio and use to present your working process to potential employers. With your past working and educational experiences, your career journey will be unique to you. But whatever path you choose, the knowledge and resources you gain from this program will give you a strong start. You've accomplished so much already, and there's so much more to come. Good luck on the next part of your journey. I'm excited to meet up with you again soon. As you approach the end of this section, let's take a few moments to review some key concepts before moving ahead. We saw that the data career space has experienced amazing growth over the last decade. Future predictions indicate that this should continue. You discovered that all kinds of companies and organizations are looking to drive their future decisions with data, which is creating opportunities for data professionals. Additionally, we saw that innovation is the engine driving this continued growth. You also discovered that data skills and tools are becoming more universal. At the same time, experts foresee a continued specialization of roles within the different fields. We were introduced to artificial intelligence and saw how it has become an important tool for data professionals. We also took a brief look at some of the common roles and responsibilities in today's data professions. While we reflect on the information we've covered so far, please remember that you're not taking these steps of personal and professional growth alone. One of the most beneficial resources available to you is our discussion board. Take advantage of the opportunity to interact with like-minded learners and gain knowledge from their experience. Coming up, we'll take a closer look at the skills needed by data professionals and we'll investigate how larger organizations incorporate data analysis. I'm looking forward to joining you as we continue your journey. 
I'll see you in the next video. We began our journey into the data professional world by covering the basics of data science, exploring careers, and discussing skills needed to succeed. In this section, we'll investigate the workflow within data-driven careers. You'll be introduced to a helpful organizational tool called the PACE model that can provide structure while allowing great flexibility when working on projects. We'll explore the elements of effective communication and some best practices when sharing your ideas with others. All of these skills and tools will help you prepare for your upcoming portfolio project. I know you're excited to continue. So, let's get started. The most important part of any project is preparation. That involves thinking through all the necessary steps and anticipated tasks. Let's say you were planning a dinner party. You would start by planning the theme, menu, the guest list, and all the other details. Next, you might check the reservation list, dietary restrictions, or take a trip to the grocery store for ingredients. Afterward, you would return home to prepare your dishes, clean your space, set up the table, and get ready. Because you set up so much beforehand, the evening will be awesome. While this scenario may be fictional or not, it offers some wise advice. As a data professional, being able to visualize data, predict outcomes, and quickly pivot away from obstacles makes you a problem solver and a great asset to any organization. Benjamin Franklin once said, by failing to prepare, you are preparing to fail. After almost three centuries, it still applies to something as simple as a dinner party or as complex as a deep space mission. Regardless of the project, having a structural framework in place for how to get that work done can be immensely beneficial. I've done a lot of data science consulting in my time, and one of the most common problems I've seen is teams coming to me thinking that they need advice on which tool or equation to use, and all of us finding out during the meeting that they're solving entirely the wrong problem. All the math in the universe won't help you if you don't point it in the right direction, but it's easy to get excited about those nitty gritty details and rush headlong into a waste of time. The best teams I've worked with have adopted a framework to help them focus on the most impactful actions in the most efficient order, and they've had the discipline to use it to stay on track instead of running off into the weeds. For those of you who completed the Google Data Analytics certificate, you'll recall a structure for the data workflow that was divided into six phases. Ask, prepare, process, analyze, share, and act. This framework is quite useful for a multitude of projects, but often with bigger datasets you need more freedom and flexibility. The PACE framework designed for this program offers the same workflow and structure, but in a simpler way. PACE is a framework developed with input and feedback from our team of data professionals. The intent of PACE is to provide an initial structure that will help guide you through projects. The goal is to lay a foundation upon which you will develop your own workflow practices. The PACE framework helps you solve problems and make judgments quickly and efficiently. PACE is an acronym. Each one of the letters represents an actionable stage in a project. Plan, analyze, construct, and execute. In the plan stage, you'll define the scope of your project. 
You'll begin by identifying the informational needs of your organization. This is where you'll want to ask yourself questions like the following. What are the goals of the project? What strategies will be needed? What will be the business or operational impacts of this plan? Taking inventory of the project and the tasks required will help you get a better understanding of the context of your work and prepare for success. During analyze, you'll engage with the data. You'll start by preparing it for your project. Here you'll acquire the necessary data from primary and secondary sources. And then you'll clean it, reorganize it, and transform it for analysis. Then you'll conduct a methodological examination of the data. You'll also engage in exploratory data analysis or EDA for short. This will involve converting the data into usable formats, assessing the quality of the data, and then diving into the data to find as many potentially useful insights and directions as possible. You'll then work with your stakeholders to see which of those areas are worth pursuing in more detail, which brings us to the construct stage, where you're going to pursue a limited subset of all those potential insights that looked interesting to you in your EDA. And here is when you will work with other data professionals, potentially statisticians and machine learning engineers to do things like building machine learning models and revising those, uncovering relationships within your data, and doing statistical inference about those relationships. Finally, in the execute stage, you will share the result of your analyses and your collaboration with your stakeholders, as well as the value that you've unlocked from your data. Here you'll present findings to internal and external stakeholders, answering questions and considering different viewpoints. You'll also have an opportunity to present recommendations based on what you found in the data. You may discover that you revisit the planning and analysis stages as you refine models and incorporate feedback. A good way to visualize the PACE framework is as a completed circuit. Each of the four stages must be engaged for it to function correctly. The electricity or flow of energy in the PACE circuit is the communication between you, your team, and all the other stakeholders and collaborators involved. When you look at the stages in this manner, you might think that communication only flows one way. Well, you always do start with planning, but don't be afraid to go back and iterate. With PACE, new information and feedback can be incorporated in any part of the process. You might need to return to Analyze to clarify some aspect of the data, and then jump back to Execute to present this aspect to your stakeholders without the need to construct new models or dashboards. Along the way, you'll see how the PACE framework can be scaled to fit within the scope of any project. The model's adaptability will prepare you for a dynamic profession that requires a high degree of professional flexibility and communication. Regardless of where your career takes you, the PACE framework is a tool that provides a clear foundation and structure. Through the continued application of the PACE framework, you'll prepare for each course's portfolio project. Then you'll have the opportunity to practice your evolving skills. Each portfolio project will introduce opportunities to develop and strengthen your organizational methods. 
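Before moving on, here's a minimal sketch of what the analyze stage described above might look like in Python with the pandas library, which is covered later in the program. The file name sales.csv and the column names revenue, order_date, and region are hypothetical placeholders, not part of any course dataset.

import pandas as pd

# Acquire: load a hypothetical dataset into a DataFrame
df = pd.read_csv("sales.csv")  # placeholder file name

# Clean and reorganize: standardize column names and drop rows
# that are missing the value we want to analyze
df.columns = df.columns.str.lower().str.strip()
df = df.dropna(subset=["revenue"])  # assumed column

# Transform: parse dates so the data can be grouped by month later
df["order_date"] = pd.to_datetime(df["order_date"])  # assumed column

# Exploratory data analysis (EDA): get a quick feel for the data
print(df.shape)                      # number of rows and columns
print(df.describe())                 # summary statistics for numeric columns
print(df["region"].value_counts())   # counts per category, an assumed column

Every step in this sketch maps onto the analyze stage, and the findings it surfaces are exactly the kind of material you'd bring back to stakeholders before deciding what to pursue in the construct stage.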
As you develop your own intuitive workflow, the PACE framework can be a great organizational tool. Next, we'll look at how communication is so crucial. When I was a kid, I used to love to take things apart so that I could see how everything came together on the inside to make something work. I actually still love doing that today. And this is so much easier to do with tangible things that you can hold and look at, like a pen, for example. I can twist the tip to pull each section apart. From there, I can pull the ink cartridge out and check out the way that the spring works in tandem with the top part of the pen to make it click. Studying the way things work becomes much harder when we start thinking about abstract concepts like communication. And you're probably wondering why it's important to talk about it in the first place. In your role as a data professional, you are the direct connection between the information inside the data and other project stakeholders. So let's take a closer look at some key considerations when crafting a message. All the communicative exchanges have three key elements that we need to keep in mind. Purpose, receiver, and sender. When we think about purpose, I want you to think about the reason why the communication is taking place. In analytically oriented settings, you might find yourself in situations where technical pockets of information require analysis or reporting. On the other hand, there may be contexts that depend on strategic insights which will be used to direct a company's financial or organizational efforts. The receiver is your audience. When you think about the receivers of your messages, I want you to think about who you are talking to. It's helpful when crafting communications to ask, what does my audience already know? What do they need to know? And it's important to keep in mind that every exchange can cause a rippled chain of events. As a data professional, you're often working as part of a distributed team across an organization. And that's why a message shared with one receiver may be used for reporting with or to others. The sender is the person responsible for crafting that message or communication. Yes, you. You're the sender. The sender is a crucial part of any communicative exchange. As the sender, I encourage you to think about the following. What's your relationship to the receivers? What's your role in this exchange? Are you reporting insights? Are you pitching ideas? Or are you identifying potential data inputs? Also, what personal biases might affect the message you're trying to share? At the heart of the relationship between purpose, receiver and sender is the message or communication that you intend to share, which is impacted by all three elements. For this reason, the same message might be shared in dramatically different ways when the purpose and the receivers change from scenario to scenario. For example, you might know a data professional working on a complex project. They've been involved since the project was sparked by an idea in a meeting. The way they articulate the project pitch, structure, research, models and findings will change depending on whom they're talking to. With non-technical audiences, they're much less likely to focus on details about the code used to program the model and instead they'll focus on the impact of the project. When the audience shifts to other colleagues, your friend may instead choose to be very detailed about the code and project logistics. 
In both cases, the overall message exchanged, that is, information about the project, was the same. How it was crafted, the details included, and the way the information was organized, that's what was different. Now that we've examined the key elements of communication, we can really start to think about some of the best practices that will set you up for success when you communicate in your future job. My name is Molly, and I'm an analytics manager on Google Photos. We use data to make the Google Photos app better for our users so that they can reminisce on all the memories of their lives. Communication is everything in data analytics. You could say that coding is just communicating with computers, and at the end of it, what the computer gives you, you then have to communicate back to other humans. Most of the time, they don't have a background in analytics or statistics. So you really have to put it in a way that anyone could understand. Being able to communicate not only what to do and how to do it, but then what the results are, I think is as important, if not more important, than the actual code. I would say communication is first really important when you're working through what your priorities are. As an analyst, you'll get a lot of requests. Everyone's going to be your best friend and asking you for a lot of things. But it's really important to zoom out of the trees and look at the forest and say, what is the most important thing I could be spending my time on? And usually, that's not the request of the person next to you asking for something. So instead of saying yes to everything, which I did when I was early in my career, and then just trying to burn the midnight oil and do everything in order of first in, first out, I now say no to a lot of things. It's better to say, sorry, this is not one of my top priorities and I'm not going to be able to get to it in the next couple of weeks. That's better than saying yes and then not being able to get to it in the next couple of weeks. So communication is important. Usually you have to be transparent, and if you model transparency and trust, you'll generally get that back as well. You want to be transparent in what you're working on and keep your updates coming in frequently. There are very few times I've been annoyed that I got too many updates on something. Most of the time it's the opposite, and you're like, what is the status of this? I haven't heard anything on this. And your stakeholders are thinking that as well. We create these requirements documents for every analytics project, and you get sign-off, and you say, this is what I'm going to do. Here's what it's going to look like. Here's how long it's going to take. It's a little, you know, over-communication, but being on the same page is really important. There's a lot of data out there, and it is only growing. So there's a growing need for people who can translate what's going on from zeros and ones on a computer into a data story, produce really beautiful visualizations, and find problems. You know, at the end of the day, humans write code and humans make coding errors. So how do you spot those in a way that a computer couldn't? Or how do you think about the new things that we haven't even touched on that we could work on? There are unlimited things you can do with data, and I think what excites me is that 10 years ago, nothing that I'm doing today was even possible.
So what I'm doing 10 years from now is probably not going to be possible today, and the opportunity to get into the data field really means the opportunity to keep learning and growing at all times. Have you ever stopped to think about what makes communication successful? Surely you can think of a time when you witnessed a communication failure. I know I can. Communication is one of the most essential components in the exchange of ideas and concepts. It's a large part of everyday life and the foundation of all human interaction. It's also an important element of analytically focused work that isn't discussed as often as technical skills. In this video, I'm going to share a few tips for successful communication. My hope is that you'll rely on this advice in your future role as a data professional. Having a clear vision of why you're communicating is the first thing you need to consider. Your why depends on the context set by the business or organization that you work for, as we discussed in earlier videos. Your why is also informed by the project's mission statement, that is, the goals orienting the project. And when crafting any form of communication, use your why to guide the main ideas so that your audience can identify how to act or respond with purpose. Time is a currency in the professional world. It's very important to be efficient. As you draft or prepare to communicate, I encourage you to use direct language, minimize wordiness, and avoid unnecessary details. You should always strive for clarity. Use proper grammar and punctuation. Regarding vocabulary, keep it simple, choosing specific terms and avoiding technical language where possible. Break complex ideas into shorter sentences, especially when addressing a diverse audience of professionals from outside data analytics. This will make the points within your communication easier to understand and remember. You may recall from earlier videos that I mentioned the three key elements of communication. These are the sender, receiver, and purpose. Now I want to add to that list and consider the setting for your communication as well. Are you having a discussion at lunch with a co-worker? Are you writing an email to update all the stakeholders about your progress, or presenting the results of your analysis to a boardroom of executives? Where and how you'll be delivering your message will have a direct impact on the way you shape it. Let's use a simple example to help illustrate effective communication. It's like using a hose to water a plant: the straighter the hose, the better the water flows. Now bend the hose or tie a knot into it, and the flow of water is interrupted. In this analogy, the water is like the intended message. The hose is whatever method you've selected to deliver your message. Knots in the hose symbolize misunderstandings within the mind of your audience. When the knots are created, the free flow of information stops, even if you continue to deliver information. As a data professional, you may find that a significant aspect of your work involves your ability to communicate with non-technical audiences. Your goal should be to meet your audience where they are. What I mean by that is you want to be efficient in your communication and tailor it to the people you're sharing information with. Depending on how familiar they are with what you're sharing, you'll likely want to break down technical concepts into simpler terms. Avoid jargon, acronyms, and technical buzzwords.
When everyone understands the intended message, you can avoid the knots and the flow of ideas can continue. Carry these practices through all types of interactions: email, messaging, and both virtual and in-person communications. As you continue throughout the program, consider ways that you would communicate the technical concepts you're acquiring to different audiences. Also remain focused on the purpose and setting of your communication. All these factors add up and make a big difference to how effectively you communicate as a data professional. Previously, you were introduced to the PACE framework. As a reminder, PACE has four stages: plan, analyze, construct, and execute. We also examined the rationale and key elements involved in effective communication. Now let's investigate how they can be integrated within data projects. At first glance, the PACE framework enables a wide view of an entire project's workflow; that is, each stage from planning to executing is logically sequenced with the beginning, middle, and end of a data project. As you may recall, though, PACE is incredibly flexible, and it doesn't need to be employed in a linear progression. Often, iteration makes sense in projects. So let's unpack that with an example. For reference, we can compare it to the construction of a building. First, a blueprint is created. This is where the planning happens. Next, we lay the foundation. We can do this once we've analyzed the lot space and considered other variables like cost. Then we add the frame. After that, the roof can be installed. By the time the structure is ready for move-in, the builder will have progressed through the entire workflow: planning, analyzing, constructing, and executing the client's vision. In practice, the PACE workflow is meant to serve as a navigational tool. We created it with the goal of helping you understand the data professional's workflow and as an aid you can consult in your future role. Now let's return to the example. With the roof completed, it becomes possible to begin the inside construction, like installing drywall, electrical components, and plumbing systems. Each of these jobs also has its own PACE workflow, demonstrated on its own planning document within the larger blueprint, each requiring planning, analyzing, constructing, and executing. Just like in the building example we've been discussing, as data projects move into the construct and execute stages on the global level, you may need to return to the earlier stages to incorporate additional data or feedback from other stakeholders. Even while the global project is transitioning into a new phase of PACE, there can still be upcoming tasks that are just beginning their PACE cycles. Regardless of where you might be in the PACE workflow, communication is what drives the framework to the realization of the project. At each stage within the framework, there will always be a need for communication to improve the workflow. This could be asking questions about your data, gathering additional sources, updating stakeholders on progress, or presenting findings and receiving feedback. One of the most important considerations behind the development of PACE was providing a flexible structure that allows you to adjust to changes within a project. Let's return to our building example. Let's say that during the installation of the electrical system, the property owner has communicated that they wish to have an additional charging port for electric vehicles added to the plan.
To facilitate the change, you would revisit the PACE framework to plan, analyze, construct, and execute this new request. Just like in this example, requests from other stakeholders can come in at any time. Regardless of the timing of an additional request or task, data professionals need to be available and accessible throughout the entire project cycle. Sometimes you may need to speak at a meeting or participate in a progress update. Additionally, you may update your progress within a tracking system. Email conversations and chat discussions will keep others involved and up to date with where you are in your workflow. I'm excited for you to develop some hands-on experience practicing different communication strategies through each stage of PACE. You'll have opportunities to do this later in the program. For now, I want you to remember that a good data professional is a proactive communicator who responds to questions in a timely fashion. Keeping other stakeholders up to date with clear explanations can make you the most effective data professional that you can be. You're nearing the end of your first course, and we've discussed some key concepts, especially in the areas of communication and the data workflow through the PACE framework. As you're preparing for your first portfolio project, I'd like to point out a vital consideration when using the PACE framework. The stages of the workflow are not mutually exclusive. Different roles, teams, projects, and workflows may put emphasis on different stages of the PACE framework. So, although you may be working on a task that is primarily focused on, for example, analyzing data, there will still be elements from the other stages of the PACE framework that affect your work as well. Let's take a look at the way this program applies the PACE framework. We can take a global perspective and say that the earlier courses are heavier on planning and analysis and the later courses on construction and execution, but that's a very wide view. If you take a closer look, you'll discover that each course is operating within its own PACE framework. In your Python course, you'll be acquiring knowledge to enable you to use the language for data analysis. Because you're acquiring new skills, much of the course may be classified in the planning stage of PACE. As you move on to that course's portfolio project, you'll shift into analysis, but this will have elements of planning, constructing, and even execution as you create the final product. Next, you'll learn how to prepare data to reveal the stories within. Here you'll lean into the analyze stage. As you become familiar with the foundations of statistics, you'll continue analyzing data, and you'll construct some new tools for your toolbox that you'll apply in your portfolio project. While you're exploring regression, you'll extend your experience in the analysis stage and put it into practice by constructing data models. You'll also practice executing by summarizing the results and insights to provide value to your stakeholders. Advanced modeling will further expand your analysis and construction skills within that particular portfolio project. And even while you're expanding your knowledge of career resources, you'll be operating in the planning stage as you collect information, the analysis stage while processing information about the job market, and the construction stage while assembling your resume and portfolio. You'll have an opportunity to bring all your newly acquired knowledge and skills together in the capstone project.
Here you'll have the chance to use the entire PACE framework through scenarios and data provided by our industry partners. Even though some stages are more prominent than others in each course, you'll still see evidence of all the stages throughout. You can look at PACE as the scaffold that surrounds the exterior of a building. With this scaffold in place, different tasks within the building can use the same workflow structure without disrupting the entire project. As you can see, the PACE framework is an excellent way to help guide you as you gain knowledge and confidence. Understanding how to classify tasks and proactively interact will help you develop good habits and eventually develop your own professional workflow. You've reached the end of the data workflow overview. You've covered a lot of material and concepts. Great job! You saw how the PACE framework can be used to help structure and guide you through projects. Next, we learned about effective communication and how it can be used within the PACE framework. We also looked at the external factors that can have an impact on data analysis, like bias within data. As you're preparing for the weekly challenge, don't forget that you can review any of the materials we covered. In the next section, you'll begin working on your first portfolio project. I wish you the best of luck. Attending a class, watching an instructional video, or reading information are all great ways to gain new knowledge. However, there's nothing like applying that knowledge. When you actually do something, it really helps confirm that you understand what you've learned. This concept is called experiential learning, which simply means understanding through doing. It involves immersing yourself in a situation where you can practice what you've learned, further develop your skills, and reflect on your education. Experiential learning gives you a broader view of the world, provides valuable insights into your particular interests and passions, and helps build self-confidence. In the context of this program, experiential learning will give you the opportunity to discover how organizations use data analysis every day. This type of activity can help you identify the specific types of industries and projects that are most interesting to you and gain the confidence necessary to discuss them with potential employers. This can really help you stand out during a job search. Soon you'll put experiential learning into practice by working on a portfolio project. A portfolio is a collection of materials that can be shared with potential employers. Portfolios can be stored on either a public or a personal website. They can be linked within your digital resume or any online professional presence you may have, such as your LinkedIn profile. Your portfolio project for this course will involve using the PACE model to set up the tasks of a project. Creating a portfolio project is a useful opportunity, since companies will often ask you to complete some type of project during the interview process. Employers commonly use this method to assess you as a candidate and gain insight into how you approach common business challenges. Completing this portfolio will prepare you if you encounter this situation when applying for data-focused jobs. Coming up, you'll be introduced to the specifics involving your portfolio project. You'll also receive clear instructions to follow.
As you begin working, consider the knowledge and skills you've acquired in this course and how they can be applied to your project. Within each portfolio project, you'll prepare a PACE strategy document. This will help you identify key points within each project to share with a hiring manager, such as the many transferable skills you've gained. A transferable skill is a capability or proficiency that can be applied from one job to another. Highlighting your transferable skills is especially important when changing jobs or industries. For instance, if you learned how to solve customer complaints while working as a host at a restaurant, you could emphasize the transferable skill of problem-solving when applying for a job in the data field. Or maybe you learned how to meet deadlines, take notes, and follow instructions while working in administration at a non-profit organization. You could discuss how your organizational skills are transferable to the data analysis field. The point is, you've developed the ability to problem-solve or keep things organized in one role. You can apply that knowledge anywhere. There are all kinds of transferable skills that you can add to your resume. Reflecting on your transferable skills and the notes you take in your PACE strategy document will help you consider how to convey technical concepts clearly. This will also help you demonstrate how you would apply your expertise across all kinds of tools and scenarios in the data career space. And by the time you're done, you'll not only have a very useful data analysis process document, but also a comprehensive set of artifacts for your portfolio. Sounds exciting, doesn't it? Let's get going. When I interview people for jobs here at Google, I love checking out their online portfolios. I find that I feel more confident in candidates who can demonstrate their knowledge in a clear and compelling format. Having a portfolio has become incredibly common in the data field. During a job hunt, it's so valuable to showcase your ability to understand business scenarios, communicate effectively, and use tools to solve complex problems. Your portfolio can really help you stand out from other candidates. So far in this course, you've gained lots of knowledge and job-ready skills to help you excel. You've discovered the role of data professionals within an organization and typical career paths. You've explored core analysis practices and tools and witnessed how data professionals use them to make a positive impact. All of these things will help you successfully complete your portfolio project. In addition, you will apply what you've learned about team members, stakeholders, and clients, such as their particular roles or priorities. You'll begin by reading about the specific project you'll be working on. This reading will describe the type of organization you're working with, the people involved, the business problem to be solved, and other key details. This will enable you to further define the project, understand the stakeholders, and consider key questions to answer in order to achieve a successful result. Then you will create a PACE strategy document outlining the project's purpose, stakeholders, deliverables, and much more. In this document, you will begin to integrate the PACE model to identify steps at each stage in the project. For each portfolio project, you will continue to use the PACE model to guide you. By completing each PACE strategy document, you will be well on your way to developing your own data analytics workflow.
Then, in later courses, you will continue working on your portfolio project and continue using the PACE model to guide your process. And by the time you're done, you'll have designed something that you can use to really impress hiring managers. Plus, you'll have a dynamic example of your data analytics skills, demonstrating your thought process, your approach to problems, the key skills you've gained, and lots more. These are all great things to talk about during an interview. All right, let's get started. It's time to discover how you will help an organization advance through the exciting world of data. You've completed the first portfolio project. Congrats. Planning and documenting your approach to the example situation is a useful experience as you begin thinking about your future job hunt. Soon, you'll be able to impress hiring managers by discussing your data professional experience, including understanding stakeholder requests, establishing a clear and straightforward project plan, and completing an effective strategy document. In addition, you'll understand how to plan out and organize the necessary workflow. This is a big part of a data professional's process. Also, as you've learned, it's helpful to communicate about your transferable skills with potential employers. The information you've added to your PACE strategy document will be valuable during job interviews. There will be another opportunity to complete a portfolio project at the end of the next course. At that point, you will use the PACE strategy document that you started here to continue developing your skills as a data professional. Then, at the end of the program, you'll bring everything together in order to finalize your unique approach to this example situation. The goal is to have a great example of your work that clearly demonstrates your skills to potential employers. Congratulations again, and I hope you have a really rewarding experience as you continue working on your portfolio project throughout this program. Congratulations on finishing this first course. You've already learned so much, and now you're ready to take your new knowledge and skills and keep moving forward. But remember, if one day you feel like you need a refresher or just some extra practice, these videos, readings, and activities will still be here whenever you need them. Now I'm excited for you to begin working with the instructor for the next course. They're ready to help you take your next step towards finishing this program and continuing your journey as a data professional. This course builds directly on all the interesting topics you've learned so far. It will give you more insight into the Python programming language and how it can be used to enable data analytics. As you progress, you'll continue building your data professional skillset. And by the end of the course, you'll be prepared to take the next step towards your portfolio project. Before you get started though, I'd like to thank you for joining me in this course and choosing to pursue this exciting learning opportunity. I strongly believe that education is a lifelong journey, and I have no doubt that all the time and effort you put into experiences like this one will better equip you for anything you choose to pursue. You've come a long way, so take a moment to celebrate everything that you've already accomplished. And then, when you're ready, head on over to the next course. Hey there, welcome to the next stage of your learning journey. Congratulations on completing your first course.
You've learned how data professionals contribute to the success of an organization and the main tools and techniques they use on the job. Now you'll learn how to use one of the most powerful tools available to data professionals, the Python programming language. Computer programming refers to the process of giving instructions to a computer to perform an action or set of actions. You can use different programming languages to write these instructions. You might choose a specific language based on the project you're working on or the problem you want to solve. The Python programming language is super useful for working with data. Data professionals use Python to analyze data in faster, more efficient, and more powerful ways because it optimizes every phase of the data workflow, from exploring, cleaning, and visualizing data to building machine learning models. This course will give you a strong foundation in Python fundamentals and prepare you for more advanced data work in your future career. If this is your first experience with the Python programming language, welcome. This course does not assume you have any prior knowledge of Python. We'll begin from the beginning and work through each concept step by step. Take it one step at a time and go at your own pace. As you develop your Python skills, you'll apply what you learn to gain valuable practice working with data. And if you have experience with Python, that's great too. I'll help you apply your knowledge in a new way and demonstrate how to use Python for data analytics specifically. Let me introduce myself. My name's Adrian. I work as a customer engineer at Google Cloud. This means that I work with our customers to understand the technologies they have at their disposal to solve data analytics needs. I first learned Python when I went to create an electronic journal. I got tired of buying a new physical one every year. I learned how to password-protect it, and to this day it's still one of my proudest moments of using Python. Throughout your career as a data professional, you'll have the opportunity to continually learn and grow. To me, that's one of the coolest aspects of the job. And learning Python is one of the most rewarding parts of that growth process. I'm still learning new ways to use Python all the time, both at work and for fun. Now let's review what you will learn. We'll start with a general introduction to Python and discuss why it's such a popular programming language among data professionals. You'll learn fundamental coding concepts such as variables and data types and how they help store and organize your data. You'll also get a chance to start writing your own Python code. Next, you'll explore functions, or reusable chunks of code that let you perform specific tasks. Functions help you work with data quickly and efficiently. You'll also learn about conditional statements, which tell the computer how to make decisions based on your instructions. Then, you'll discover the power of loops, which repeat a portion of code until a process is complete. You'll also learn how to work with strings, which are sequences of characters such as letters and punctuation marks. After that, you'll explore data structures in Python, which are methods of storing and organizing data in a computer. You'll review the most useful structures for data professionals, such as lists, sets, dictionaries, and data frames. Finally, you'll apply your Python skills in an end-of-course project that you can add to your professional portfolio.
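As a hedged preview of those topics, here's a tiny sketch showing a variable, a function, a conditional statement, a loop, a string, and a few of the data structures mentioned above. The names and values are invented for illustration; the course builds up each of these ideas step by step.

# Variables and data types
count = 3                  # integer
price = 9.99               # float
product = "notebook"       # string

# A function: a reusable chunk of code
def total_cost(quantity, unit_price):
    return quantity * unit_price

# A conditional statement: the computer decides based on a condition
if total_cost(count, price) > 25:
    print("Order qualifies for free shipping")
else:
    print("Add more items for free shipping")

# A loop: repeat a portion of code for each item in a list
for item in ["pen", "notebook", "stapler"]:
    print(item.upper())    # strings come with useful methods

# Two more data structures you'll meet later
unique_tags = {"office", "paper"}          # set
inventory = {"pen": 120, "notebook": 45}   # dictionary

Every one of these building blocks will come up again as you work toward the end-of-course project.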
The end-of-course project features a unique dataset based on a workplace scenario. In future job interviews, you can share your project as a demonstration of your skills and impress potential employers. Learning Python will take your data analysis skills to the next level. It'll also be a great addition to your resume. Knowing how to use Python is a key credential for data professionals and will give you a big boost as a job candidate. I'm here to help you every step of the way, and remember, you set the pace. Feel free to watch the videos as many times as you like and review topics that are new to you. By the end of this course, you'll know how to use Python to explore and analyze data. Let's get started. Hi there. I can't wait to explore Python with you. During our journey together, you'll learn how to use the Python programming language to power your data analysis. This course will give you a foundation of essential Python skills that you can continue to build on throughout this program and your career. We'll begin with an overview of Python's main features and capabilities and discuss the basics of Python programming. You'll discover how Python can help you work with your data quickly and efficiently. Then we'll consider Jupyter Notebooks, an interactive environment for coding and data work. You'll learn about the useful functionality of Jupyter Notebooks and how to write Python code in the notebook environment. Python is an object-oriented programming language, based on objects that contain data and useful code. Next, you'll explore the benefits of object-oriented programming for data professionals and become familiar with its basic concepts. After that, you'll discover variables, one of the building blocks of Python programming. You'll learn how variables help you store and label your data and how to assign specific values to variables. We'll also review the conventions, or widely accepted rules, for naming variables. Learning these conventions will help you write code that is clear, precise, and consistent. Finally, we'll explore fundamental data types such as integers, floats, and strings. You'll learn how to convert and combine data types to organize your data. When you're ready, join me in the next video. My name's Adrian. I work as a customer engineer at Google Cloud. This means that I work with our customers to understand the technologies they have at their disposal to solve data analytics needs. I come from a non-traditional background; I started out in nursing. I spent several years there before I realized that while dealing with patients and helping patients was definitely fulfilling, it wasn't exactly what I wanted to do for the rest of my life. There were a number of transferable skills that I learned from my previous career as a nurse: the first being critical thinking, the second problem-solving, and the third assessment. Within advanced data analytics, critical thinking and problem-solving become key when trying to debug. Furthermore, when looking for answers online, you need to be able to assess what information you're being given and understand how to apply it to fix your problem. Another set of skills that I was able to bring over from nursing is soft skills, or interpersonal skills, which are critical in advanced data analytics as you try to work with others in a collaborative space. I got into this career because I had gained skills related to programming just throughout my life. I didn't really understand how I could apply it, or the fact that I could actually apply it.
I got an undergraduate degree in English and History, and it was after realizing what the job opportunities were for humanities that I had to think again about what my next step would be. That's where technology came into play, and I found that I could actually use technology and integrate that with what I was doing from a humanities point of view with English and History and get into digital humanities. That led me to do data analytics through the process, or the concept, of knowledge management. We're in a space where every company is going to become a data company. Whether you're working in medicine, whether you're working in retail, whatever it might be, data and the ability to manipulate and leverage it is going to be critical. My favorite thing about data analytics is that it can be incredibly entry level. Once you understand those basics, you can get started on your own. You don't need to have a formalized university degree. You don't need to have years and years of experience. Getting started is something that can be done as long as you put in the work to get the elementary foundation down. Python is a powerful coding language that has become one of the preferred tools of data professionals worldwide, and for good reason. In this video we'll explore what Python is and why it has become so popular. But first let's discuss some basic elements of programming languages in general. Programming languages originated with the development of electronic computers. They were, and still are, the words and symbols that we use to write instructions for computers to follow. Communicating with the computer ultimately relies on computer hardware. A transistor is the most fundamental component of the computer because it controls the flow of electricity through a circuit. A transistor can exist in two states, on or off, like a switch. When a transistor is on, electricity passes through it. When it's off, it blocks the electricity. This duality defines how computers operate. If you chain enough transistors together, each either on or off, you can create complex logic. So how does this relate to computer programming? Well, because computers are just billions of transistors, or switches, they understand things only in binary terms. You may have encountered this concept before: binary. Binary is represented as ones and zeros. These numbers are just an easier way to refer to the on and off sequences of transistors when a computer receives instructions from a program. Computers are powerful, but they still need to be given instructions, and they can only understand instructions that are given in binary. The engineers who first designed computers encountered this and discovered a problem. Computers are great at understanding binary. Humans are not. This problem is what gave rise to the first programming languages. The very first programming languages were difficult to use, required lots of training, and often only worked on the specific machine that each was designed for. These types of languages are known as low-level languages. Over time, new coding languages developed to simplify and generalize programming instructions. The programming languages became easier to learn because they were designed with simpler rules and structures, known as syntax. Modern programming languages use syntax that's much more familiar to humans. These languages are known as high-level languages, and this brings us back to Python. Python is a high-level language that's versatile and easy to learn. Simply put, Python is friendly.
In fact, some people might think that the name itself is scary, but the creator of Python didn't name it after a giant snake. He named it after a British comedy troupe, Monty Python, because he wanted it to be easy and approachable. In addition to Python being versatile and easy to learn, it's also powerful. This combination of qualities has made it a favorite not only of data professionals but of scientists and web developers too. Part of what makes it so powerful is that it's open source, and developers have created many libraries and tools to make many jobs requiring Python easier. A library is a reusable collection of code. For instance, you could hand code a function that takes two numbers, adds them together, and returns the sum. But what if now you want to add three numbers, or four? You can write a more complex function that lets you input any combination of numbers and it will return the sum, but summing is a super common task, so you can save yourself a lot of time by just using a math library that contains a sum function. There are thousands of Python libraries that contain code for tasks as simple as summing numbers and as complex as building a neural network for an artificial intelligence application. You'll learn more about libraries soon, and you'll learn about neural nets and AI and how they fit into the world of data analytics in a later course. This certificate program focuses on advanced data analytics, so you'll learn to use Python as it's most often applied in data analysis work. You'll also learn about NumPy, pandas, statsmodels, Matplotlib, Seaborn, scikit-learn, and more. These are code libraries that are used every day by data professionals on the job. You'll explore these in detail later. The ease of learning, ease of use, versatility, and power of Python make it one of the most used coding languages today. Because it's so widely used, it has a large and active community of users who are willing to help and provide support, which makes it a great coding language to discover and explore. As you move through this course and the entire certificate program, always remember that coding is simple and complex. In other words, each line of code represents a small, simple idea, but together those lines of code can express very complex logic. Coding can be frustrating at times but also a lot of fun and very rewarding. You will practice coding a lot in this course so you can get better and better. Lastly, don't be afraid to make a mess. Experimenting is part of the process, and practicing will help you quickly improve your coding skills. Previously you learned that Python is a high-level programming language. This means that Python uses more human-friendly syntax and more closely resembles a spoken language. In fact, Python is simple enough that you can learn some of its basic concepts just by example. In this video I'll demonstrate some of the fundamentals of Python. I'll introduce you to some new terms, but we won't take into account formal definitions and processes. We'll consider these things in more detail later. For now, just take a moment to get familiar with the code and how it works. The first thing we'll do is print to the console. If we tell the computer to print Hello World, it will output Hello World for us. In fact, the print function will output whatever we enter in its parentheses. Of course, Python is also capable of performing computations. If we divide the sum of five and four by three, we get three. In Python, we can also assign variables.
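Just to make these first demonstrations and the library idea concrete, here's a minimal sketch you might type in a notebook cell; add_two_numbers is a hypothetical name for the hand-coded version:

    def add_two_numbers(a, b):      # a hand-coded function that adds two numbers
        return a + b

    print(add_two_numbers(2, 3))    # 5
    print(sum([2, 3, 4]))           # 9, using Python's built-in sum function instead

    print("Hello World")            # print outputs whatever is in its parentheses
    print((5 + 4) / 3)              # 3.0, the sum of five and four divided by three
    answer = (5 + 4) / 3            # the same result, assigned to a variable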
You can think of a variable like a container that you can name. A container's contents are known as its value. For example, here we create two new variables. The first variable is called country, and we're assigning to it the value of Brazil. The second variable is called age. We'll assign it the value of 30. Now we can refer to the values Brazil and 30 by their variable names. When I give the country variable to the print function, it returns Brazil. Similarly, when I give the age variable to the print function, it returns 30. Python can also be used to evaluate statements. For example, I can check if 10 to the third is equal to 1000. It is, so the computer returns true. Notice that so far I've used the plus sign for addition, the forward slash for division, and two asterisks to indicate an exponent. Like in mathematics, these signs are known as operators. A lot of them are straightforward; however, some might not be so simple. Take this last case, for example. I use two equal signs to check the equivalence of two values. When used properly, the computer will return either true or false. What happens when I use one equal sign? I get a syntax error. You might have noticed from earlier examples that the single equal sign is reserved for assigning variables. Just like in spoken languages, Python has rules that govern its construction. You'll learn many of these rules in this course, but I won't discuss them in detail now. You won't easily learn a new spoken language by studying a big book of its rules. You're more likely to learn it by speaking it, hearing it, reading it, and writing it. It's also a lot more fun that way. It's the same with Python, so for now don't worry about memorizing any rules. First, let's observe Python in action. As you might expect, if I make a false equivalency like 10 times 3 equals 40, the computer will return false. I can even use a variable I defined previously, age, in a new statement. 10 times 3 does equal age. You'll recall we assigned age a value of 30, so the computer returns true. Python also lets me perform actions based on conditional logic. Here I tell the computer that if the value of my age variable is greater than or equal to 18, then it should print the word adult. Otherwise, it should print the word minor. The value stored in the age variable was 30, so the computer returns adult. Another common task in Python is looping. Looping performs the same action on each element of something. Here's a simple loop. For each number in this list of 1 through 5, we print the number. The computer outputs 1, 2, 3, 4, 5. Here's another example. This time I'll create a list containing the numbers 3, 6, and 9 and assign it to a variable named myList. Now I'll loop over my list, and for each number in the list the computer will print that number divided by 3. And there you have it. It outputs 1, 2, and 3. Now let's return to the conditional statement we wrote. If age is greater than or equal to 18, print adult; otherwise, print minor. What if we want to repeat the same action for many different age values? We can write a function. A function is a chunk of code that can be reused to perform the same task. We'll define this function and call it isAdult, and it will accept an argument called age. By the way, an argument is information that you give a function in its parentheses. Now in the body of the function we'll use the same code we used for the conditional statement. Nothing happens when we run this code. But check out what happens when we call the function and give it an argument of 14.
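Here's a minimal sketch of the examples just described, ending with the function call we're about to run; the values match the narration above:

    country = "Brazil"              # assigning variables
    age = 30
    print(country)                  # Brazil
    print(age)                      # 30

    print(10 ** 3 == 1000)          # True: two equal signs check equivalence
    print(10 * 3 == 40)             # False
    print(10 * 3 == age)            # True, because age is 30

    if age >= 18:                   # conditional logic
        print("adult")
    else:
        print("minor")

    for number in [1, 2, 3, 4, 5]:  # a simple loop
        print(number)

    myList = [3, 6, 9]
    for number in myList:
        print(number / 3)           # each number divided by 3

    def isAdult(age):               # a reusable function
        if age >= 18:
            print("adult")
        else:
            print("minor")

    isAdult(14)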
The computer returns the word minor. Now we can perform the same evaluation as many times as we want simply by using the isAdult function we just created. And remember how I told you that libraries are part of what make Python such a useful and powerful language? Python has its own library of built-in functions that perform common tasks. For example, here's a list of numbers: 20, 25, 10, and 5. We'll assign this list to a variable named newList. We can use the built-in Python function called sorted and enter newList as its argument. It returns our list with its values sorted from least to greatest. These are just a few simple examples of what Python can do. These simple processes can be stacked and layered and combined to create algorithms and programs that could possibly change the world. Python's power is bounded only by the limits of your imagination. Now that we've explored what Python is and what it can do, I'm excited for you to learn more about it. Recently, you've been investigating the power of Python. There are many environments programmers use when exploring all of Python's capabilities, and one of the most popular is Jupyter Notebook. You've already encountered Jupyter Notebook. The examples that I've presented to you previously were done in one. We'll use this platform to write code and perform analysis throughout the course. We'll also provide you with information about how to set up Jupyter Notebook on your own computer. But that's optional and not necessary to complete the Advanced Data Analytics program. Jupyter Notebook is an open source web application for creating and sharing documents containing live code, mathematical formulas, visualizations, and text. Using a Jupyter Notebook lets you collaborate on data projects and integrate code. Plus, it puts all of your output in one document, which is very useful, especially when first learning about programming. To illustrate why, take this example. In the computing world, most code is written in an environment similar to this. This is a terminal-based text editor. Notice that it's like a single page that's infinitely long. If I perform an operation or write a line of code, it executes immediately when I move to the next line, and I can only move forward. I can't return to an earlier line of code and set my cursor there and change or run it. A terminal-based text editor is a very useful environment for many, many situations, but it's not always the best or easiest to work with for data analytics projects. Now compare that to our notebook environment from before. Here I can more easily modularize my code into cells to organize them in sections. Cells are the modular code input and output fields into which Jupyter Notebooks are partitioned. I can move code around, add it, and delete it with the click of a mouse or the press of a button. And it's great for visualizations and presentations. I can add comments, annotations, and explanations using markdown syntax. Markdown lets you write formatted text in a coding environment or plain text editor. For example, I can add titles and bullets, tables and mathematical formulas. These are just a few of the many features of Jupyter Notebooks that make it a preferred platform among data professionals. As you move through this course, you'll create projects in Jupyter Notebooks to showcase your skills as a data professional. You're well on your way now. In this video you'll learn more about Python and what makes it both approachable and powerful as a coding language.
You're going to learn about object-oriented programming. Specifically, we're going to discuss classes, methods, and attributes. What follows is a brief introduction to object-oriented programming. A more detailed discussion is beyond the scope of this course. You'll have the opportunity to further explore object-oriented programming in your future career as a data professional. Object-oriented programming is a programming system that is based around objects, which contain both data and useful code that manipulates that data. An object is an instance of a class. Think of it like a fundamental building block of Python. Lists, functions, strings, these are all objects. The main idea behind object-oriented programming is to have both data and the methods that manipulate it within objects, allowing for more organized, accessible, and reusable code. The most important concept in object-oriented programming is the class. A class is an object's data type that bundles data and functionality together. In other words, the reason why it's useful for an object to have a type, or belong to a class, is because it allows us to build a bunch of useful tools that can be packed directly into the object itself. This will make a lot more sense with an example. When we put the words hocus pocus inside quotation marks and assign that to a variable called magic, this variable becomes an instance of the string class. Because it belongs to the string class, it behaves in a certain way and has a lot of built-in functionality reserved for strings. We can swap the case of the characters by typing magic.swapcase with empty parentheses after it. We can replace some characters with new characters by typing magic.replace and entering the characters we want to replace and what we want to replace them with. And we can split the string into a list of two strings using dot split and an empty pair of parentheses. These actions are known as methods. A method is a function that belongs to a class and typically performs an action or operation. They use parentheses. In our examples, each method acted on the value of our variable. It changed it in some way. That's what I mean when I say that methods typically perform an action or operation. By the way, you don't have to memorize these methods. There aren't many people who know all of them. Most coding environments have ways to access a list of methods available to a given class. In Jupyter Notebook, we type a dot and then hit the tab key. Notice that we're attaching the method to its class instance using a dot. That is called dot notation, and it's how we access the methods and attributes that belong to an instance of a class. There are many different classes in Python. You've encountered some of these already. The core classes of Python are integers, floats, strings, Booleans, lists, dictionaries, tuples, sets, frozen sets, functions, ranges, and None, which is a data type that returns an empty value. There are also many additional custom-defined classes that come with libraries and packages, and you can even make your own. Okay, the last thing we're going to discuss is attributes. Attributes are values associated with an object or class which are referenced by name using dot notation. They don't use parentheses. Attributes are especially important for custom-built classes and more complex data structures like data frames. You'll learn more about these later, but here's an example.
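Here's a minimal sketch of those string methods in action; the data frame example mentioned above comes right after this:

    magic = "hocus pocus"                      # an instance of the string class
    print(magic.swapcase())                    # HOCUS POCUS
    print(magic.replace("pocus", "focus"))     # hocus focus
    print(magic.split())                       # ['hocus', 'pocus']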
Suppose we have a data frame called planets that contains a row for each planet and columns that represent the planet's name, its radius, and the number of moons it has. One attribute of this data frame would be its shape. This data frame is eight rows by three columns. Another attribute of the data frame class is columns. Calling this attribute on the data frame object returns an index object containing the column names of the data frame. Attributes allow you to access characteristics of a class, but they don't do anything to it or change it. Perhaps you're beginning to appreciate how object-oriented programming is an ideal structure for data analysis. By packaging data together with ways to manipulate it and learn about it, objects are the fundamental building blocks of Python and part of what makes it such a powerful tool for data professionals. My name is Hamza and I'm an Applied Machine Learning Engineer. I love building models. I love building large-scale systems, and that is exactly what I do at work. It's a form of art for me that you get to create something from scratch that does not exist, and you put it into production and it is used by about 100 million users. The scope of my work is to build, maintain, and productionize large-scale models. So Python is pretty much the central theme of my job. Python is a programming language which helps you manipulate data, which helps you build models, and you can use it to create production-level models and software. There are a lot of other programming languages which will do the same things that you're looking for. For me personally, I use Python because there is a lot of documentation, there's a lot of help, and there is a lot of learning from past failures from different programming languages which has been incorporated to help Python be user-friendly, be used by everyone around, and be easy to adapt. You never get hired as a good chef because you have good knives. You get hired as a good chef because you have good cooking skills. This is the same thing. Python has helped me become a better data scientist, a better machine learning engineer, and it has helped me understand a diverse set of mathematical applications in machine learning which previously I was not aware of. I think the unique strength about Python is that it's just a multifaceted tool. It's not just one thing, it's not just for data manipulation or data cleaning. You can do transformations, data cleaning, you can build models, you can put them into production, you can make an API out of it, you can build monitoring systems on top of that. Those are the strengths of Python that make you a sort of master of it all when it comes to data science. The one most important thing, when you do an online course or you're studying anything, is that learning is not linear. There's a very steep curve, and then eventually you just get to the aha moment. The thing is that in the first two weeks when you're doing a certain course, you might be like, okay, this is not making sense and it's not working for me and it's not doing what I'm looking for. So just stick to it and be consistent. The learning curve is always steep in these things, but then you eventually get to a point where you say, oh, I know all of these things and I can put all the knowledge that I have together to build something great. In this video we're going to focus on variables. Variables give meaning to code. Think about nouns in language. Nouns are used to identify people, places, or things in a sentence.
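Before we continue with variables, here's a rough pandas sketch of the planets example above; only the shape and the idea of the columns come from the description, and the values themselves are approximate and just for illustration:

    import pandas as pd

    # A small data frame with one row per planet (values are illustrative)
    planets = pd.DataFrame({
        "name": ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
        "radius_km": [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
        "number_of_moons": [0, 0, 1, 2, 95, 146, 28, 16],
    })

    print(planets.shape)      # (8, 3): attributes use dot notation and no parentheses
    print(planets.columns)    # Index(['name', 'radius_km', 'number_of_moons'], dtype='object')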
Variables in Python are like nouns. Variables point to values. They're not the values themselves. If you have the expression x equals three, x is the variable and its stored value is three. The three exists in a specific location in the computer's memory. The x points to that location. Another way to think of a variable is like it's a container with a label on it. A container is a separate thing from whatever it contains, but if I ask you to pass me the ketchup, you'll know which bottle to pass me even if you can't see the ketchup inside it, because the bottle is labeled. Variables can store values of any data type. A data type is an attribute that describes a piece of data based on its values, its programming language, or the operations it can perform. In Python this includes strings, integers, floats, lists, dictionaries, and more. You've already encountered some of these in this course. We'll explore these data types throughout the program. When assigning a new variable, it's helpful to answer these questions before you code. What's the variable's name? What's the variable's type? And what's the variable's starting value? These questions help you create variable names that remain meaningful and easy to reference again later. Naming variables is important because these names are reminders for yourself and others about what you are storing in the variable. Calling the data type also helps you understand what the data can and can't do. Next, consider how you will assign expressions, which can help make your code more concise. An assignment describes the process of storing a value inside a variable. And an expression is a combination of numbers, symbols, or other variables that produce a result when evaluated. Now let's examine that in Python. We'll translate a variable algorithm into Python code. Let's start with a list of the ages of the starting five players on a professional basketball team. We'll assign the list to a variable called age list. Notice that we didn't call it, for example, x, because x doesn't tell us anything about the value it contains. And if we encountered an x later, we might not remember that it's a list of ages. One nice thing about Python is that the computer interprets the data type for us automatically when we assign a new variable. This is called dynamic typing. Dynamic typing means variables can point to objects of any data type. Also, there are no default types for most new variables, so we need to assign or initialize them before calling them. Let's return to our example. We'll find the maximum age of the basketball players by using Python's built-in max function and passing our age list to it as an argument. We'll assign the result to a new variable called max age. When we call the variable, the computer returns a value of 34. This is an integer. So the max age variable contains a value whose data type is integer. Now, suppose we want the max age variable to contain a string value. We can convert it to a string by using the string function, represented by str, and reassigning the result back to the max age variable. Now, our variable has a data type of string, which is indicated by the quotation marks in the output. You'll learn more about strings later. For now, just remember that those quotation marks are unique to strings. We can also overwrite the contents of the variable entirely if we want. For instance, we can store in it the text string 99. Now, when we call the variable, the computer returns our new string. There are a couple of important things to note here.
First, notice that when we converted max age from an integer to a string, we reassigned it back to itself. If we hadn't done this and we had simply used the string function on the max age variable, the computer would have returned a string, but the contents of the variable would not have changed. Generally, when you want to modify the contents of a variable, you have to reassign it. The second important thing to be aware of is that the order that you run your cells in matters. For example, if I rerun the cell where I first assigned the max age variable and then call this variable in a new cell, you'll notice that its value has reverted back to the integer 34. It's no longer the string 99. In these examples, the value contained in the max age variable changed every time we reassigned it. That's why it's dynamic. Variables are convenient because you can refer to them instead of the values they contain. So if we define a new variable that contains the minimum age in our list of ages, we can subtract the two variables to find the age difference of the oldest and the youngest player. There is so much we can do with variables and expressions in Python. The program asks a question, and the variable helps us capture the answer based on input from a specific source. Just remember that if you want to modify the contents of a variable, you usually need to reassign it. It's also important to consider that the order you run the cells in matters when coding in Jupyter Notebook. Soon, we'll learn about variable naming conventions and restrictions. I'll meet you there. Python has certain spelling or grammatical rules to follow, just like any other language. In programming, we call these rules naming conventions and naming restrictions. If you completed the Google Data Analytics Certificate, you might remember that naming conventions are consistent guidelines that describe the content, creation date, and version of a file in its name. Naming restrictions are rules built into the syntax of the language itself that must be followed. In Python, there are some important naming conventions to keep in mind. One of these is to avoid keywords. A keyword is a special word that is reserved for a specific purpose and that can only be used for that purpose. You've already encountered some keywords, such as for, in, if, and else. There are others, too, which you'll learn about soon. Keywords should never be used when naming variables. Thankfully, you don't have to worry about accidentally using a keyword. First of all, keywords will appear in a special color in most coding environments. So a good rule to follow is that if you're naming a variable and the name changes color, don't use that name. And if you're thinking, but I like colorful variable names, and you try to assign a value to a keyword anyway, the computer will say not so fast and throw an error. Python will let you make spelling mistakes, but it won't let you use a keyword for a variable name. By the way, throw an error is a common phrase that every coder knows only too well. It just means that the computer returns an error and your code doesn't successfully run. Also, some names are reserved for existing functions, for example, print and str. So you should also avoid using names of existing functions. To sum up, a big thing to keep in mind regarding variable naming conventions is that you don't want to use reserved keywords or functions. Precision is essential when programming. This is why there are naming restrictions for variables.
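Before we get to those restrictions, here's a minimal sketch that pulls the variable examples above into one place; the ages in the list are hypothetical apart from the maximum of 34:

    age_list = [25, 30, 34, 21, 27]    # ages of the starting five players
    max_age = max(age_list)            # Python's built-in max function
    print(max_age)                     # 34, an integer

    max_age = str(max_age)             # reassigning converts the stored value to the string '34'
    max_age = "99"                     # overwriting the contents entirely
    print(max_age)                     # 99, our new string

    min_age = min(age_list)            # a new variable for the minimum age
    max_age = max(age_list)            # reassign max_age back to the integer 34
    print(max_age - min_age)           # 13, the age difference between oldest and youngest

    # for = 3   # uncommenting this would throw a SyntaxError, because for is a reserved keyword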
For instance, variables can only include letters, numbers, and underscores. This means you can't use spaces, tabs, or special characters such as the dollar sign or ampersand. Another rule to keep in mind is that while variable names can contain numbers, they must start with a letter or underscore. Also, Python is case sensitive, which means capitalization matters. Lastly, variable names cannot include parentheses. This is because parentheses have other uses in Python. Let's review some examples of effective and ineffective variable names. A name like a_variable is a valid variable name. a_variable_2 is also good. However, 1_is_a_number would be invalid because variable names must start with a letter or underscore. apples_&_oranges is also invalid because it uses the special character ampersand. You do have some flexibility when naming your variables. Since these are references you create, these conventions and restrictions just help make them consistent and useful. Okay, now let's go back to parentheses to understand more about how these function in Python. When doing calculations, the rules for parentheses follow the mathematical order of operations. For instance, if we input two times three plus four, with three plus four in parentheses, Python will read three plus four first because it's following the order of operations. This is equal to fourteen. But two times three in parentheses, plus four, equals ten. That's because the operation within parentheses will always be completed first. By the way, if we don't use any parentheses, Python will follow the standard mathematical order of operations. Naming conventions and restrictions for variables help maintain consistency and usefulness as you continue to use Python for a variety of activities. A key part of working with Python as a data analytics professional is being able to effectively name your variables to create meaningful code. Soon, we'll explore conversions and data types. Bye for now. Previously, you learned about variables and how to name them. You learned that variables point to values that are stored in the computer's memory. In other words, the variables are like containers and the values they store are their contents. Now, you're going to learn more about the values that your variables can contain. As you've learned, text written between quotes in Python is called a string. Programs need to manipulate data, which can come in a lot of different forms or types. These data types include strings, integers, and floats. First, a string is a sequence of characters and punctuation that contains textual information. Strings are instantiated with single or double quotation marks or the string function. This is an immutable data type, which means the values can never be altered or updated. An integer is a data type used to represent whole numbers without fractions, and float data types represent numbers that contain decimals. Most computers understand when you tell them to add two integers or add two strings. But generally speaking, computers don't know how to work with a mix of different data types. If you try to mix different data types, it can sometimes throw an error. As always, the computer tells us the cause of the error. It's like a little clue to help you improve your programming skills. Read the errors carefully, understand what they're trying to tell you, and use that knowledge to fix the mistake.
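As a quick sketch of that kind of mix-up:

    print(7 + "8")
    # TypeError: unsupported operand type(s) for +: 'int' and 'str'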
In this example, the last line of the error message says we've encountered something called a type error. 7 is being read as an integer and 8 is being read as a string because of the quotation marks. No wonder we got an error. You can't add 7 to a word. As a data professional, you'll often need to aggregate a lot of data of different types. This will require converting the various types so they can be combined successfully. There's an effective method for doing this, but first, it's important to know what you're working with. Python offers a helpful way to identify data types with the type function. You can use the type function to have the computer tell you the data type. For instance, the type function here tells us that the letter A belongs to the str class, which is short for string. The number 2 belongs to the int class, or integer, and 2.5 belongs to the float class. As a reminder, a class is an object's data type. The class bundles data and functionality together. Now, let's find out how to combine these different data types. There are two ways to convert data in Python. First is implicit conversion. Implicit conversion automatically converts one data type to another without user involvement. Here's an example. In arithmetic operations involving both integers and floats, the interpreter works in the background and converts integers to floats. You don't have to specify anything in your code to do this. However, if you want to convert numerical values to strings, you will need to do an explicit conversion. Explicit conversion is where users convert the data type of an object to a required data type. We use the predefined functions int, float, and str. This is sometimes called typecasting, because the user casts or changes the data type. Let's use the str function on the value we want to have interpreted as a string in the output. And now, the output of this calculation will be stored as a string. Great work. You'll continue exploring data types and seeking solutions to errors throughout this program. Debugging, or figuring out what's wrong when your code won't work, is a really useful skill for any data professional. Also, as a final tip, all of us in the profession, even experienced data professionals and code developers, search for answers online when we encounter an error. This is a common strategy and a huge time saver. Always look to the data community for answers and inspiration. We've come to the end of the first section of the Python course. You've developed a lot of new Python skills already. Well done. Along the way, you've discovered that Python is a powerful tool for data professionals and learned how Python can help you work with your data quickly and efficiently. We began with a general introduction to the Python programming language and explored how data professionals use Python to power their data analysis. Then, we discussed Jupyter Notebooks. You learned about the main features of Jupyter Notebooks and how to write Python code in the notebook environment. Next, you explored the benefits of object-oriented programming for data professionals and learned about its basic concepts. After that, we focused on working with variables. You learned how to assign specific values to variables and effectively store and label your data. We also reviewed the standard naming conventions for variables. You learned useful guidelines for making your code clear, precise, and consistent. Finally, we explored different data types in Python, such as integers, floats, and strings.
You learned how to convert and combine data types to organize your data. Coming up, you have a graded assessment. To prepare, review the reading that lists all the new terms you've learned, and feel free to revisit videos, readings, and other resources that cover key concepts. Congratulations on your progress so far, and I'll meet you again soon. Welcome back. I'm excited to continue our learning journey together. I really enjoyed exploring the basics of Python programming with you. You learned that Python is a powerful tool for data professionals and lets you analyze data with speed, accuracy, and efficiency. You now know how to use variables and data types to store and organize your data and have already started writing your own Python code. In this course, we'll continue to build on your foundation of Python knowledge. By the end of the course, you'll be able to write statements in Python code to perform multi-step operations on your data. You'll also learn how to write clean, readable code that can be easily understood and reused by other data professionals. Being able to collaborate with teammates is one of the most important skills for a data professional, and writing clean code is a great way to collaborate with your teammates and help your team achieve its goals. Working with clean code helps your team work faster, communicate more effectively, and produce better results. We'll start with functions, or reusable chunks of code that let you perform specific tasks. Functions are like the verbs or action words of a programming language. You can call on a function at any time to help you perform useful actions on your data such as sorting, classifying, summarizing, and much more. Then we'll discuss how to write clean code that can be easily understood by your teammates and collaborators. You'll learn about two important elements of writing clean code: reusability and modularity. Both of these practices speed up project development and help data professionals focus on core business needs and avoid spending time doing rework. After that, we'll examine another key aspect of writing clean code: commenting. Commenting is a useful practice because it helps you consider your thought process while documenting your workflow for teammates. Using comments to describe the component parts of a problem helps you solve it in clear, simple steps. Next, we'll discuss how to use operators to compare values. We'll review two types of operators: comparators and logical operators. Comparators, such as greater than or less than, allow you to compare two values. Logical operators, such as and and or, let you connect multiple statements together and perform more complex comparisons. Data professionals use operators to analyze and make decisions about their data. Lastly, we'll consider conditional statements, which tell the computer how to make decisions based on your instructions. You'll learn how to write if, else, and elif statements. Data professionals use conditional statements to structure complex operations and to perform all kinds of practical tasks such as binning data and organizing files. Conditional statements make your Python code more flexible and powerful. When you're ready, I'll meet you in the next video. My name is Michelle and I'm a data engineer here at Google. I actually started as a documentation specialist right out of college, but being surrounded by so many technical people working within analytics really made me interested in the field and made me want to join the team and work in that world.
I anticipated that I would face judgment, or, you know, people looking down on the fact that I didn't have a degree in analytics when I was breaking into the field, and I am happy to report, from my experience, that did not happen. It was only my own negative self-talk in my head. Everyone around me was welcoming and accepting and really happy to have someone who took a non-traditional path into engineering and analytics on the team, because of the unique viewpoints that I brought. Impostor syndrome is a very real thing. I think everybody experiences it, and so did I. There were so many times where I would stop and think, maybe I'm not meant to be here. I don't have an advanced degree in analytics or information science. Is there really anything that I could even contribute, being in this room right here, right now? And the way I got through it was realizing that having a career in data analytics and data science is not about memorizing every possible answer for every possible scenario. That's not it at all. The purpose is you're supposed to develop the skills to be able to approach a problem with an analytical mindset. There was a project earlier in my career where I really wanted to automate part of an analytics workflow, and I knew what I needed to do, but I didn't know exactly how to do it. The way I approached this problem was to first write down what I wanted to do in plain English, without any computer code, any programming, nothing like that. Then I had to do a lot of searching on Google. I was looking on various forums for how to do XYZ in Python, how to use a for loop, how to use Python for data science and automate analytics, and I slowly made my way over to completely automating the workflow that I wanted to automate, one piece at a time. Being able to automate that workflow has left me with a sense of accomplishment that has remained with me to this day. Sometimes it can seem really discouraging when you have a mountain of work in front of you, or somewhere that you want to get to, and you're thinking, this is going to be impossible. But it's really not. If you just break things down into smaller, more manageable chunks, you absolutely will get there, and then you'll get to a point where you look back and think, oh my gosh, I did it, I've made it all the way here. Recently we've been exploring variables, expressions, and data types. In this video we'll consider another important component of programming in Python: functions. A function is a body of reusable code for performing specific processes or tasks. We've come across a few built-in Python functions so far. For instance, the print function writes text on the screen, the type function tells us the data type contained within a variable, and the str function converts an object into a string. Note that in previous versions of Python, print was handled as a statement and did not use parentheses, but in Python 3 the print syntax is a function and requires parentheses, even when there are no arguments used and the parentheses are empty. Okay, so we know that Python has many built-in functions, but if we want to tell a computer to do other things particular to our own use cases, it's important to know how to define our own functions. To define a function, we use the keyword def at the start of the function block. You've encountered a function definition once before in a previous video, but let's consider another example. When defining a new function, always begin with the def keyword. The name of the function comes next. Let's call it greeting.
After that we have the function arguments, also known as parameters. A function's arguments are always written in parentheses. The arguments are the things you give to the function to modify in some way. You can call them anything you want. Whatever we call them here when we define the function is how we'll have to refer to them below in the function's body. In this example, our function will have just one parameter, name. When we're done defining the arguments, close the parentheses, put a colon at the end, and hit enter to get to a new line. Now we can write the body of the function. This is where we say what we want the function to actually do. Note how the body is automatically indented to the right. In Python, lines of code are hierarchical: any line that is indented pertains specifically to the less-indented code that precedes it. We can add as many lines as we'd like to the body of the function, but each line must be indented to the right. Here it's indented four spaces. You can use however many spaces you like as long as you're consistent. However, four spaces is usually the preferred way because it makes code more readable. Our greeting function will take a name and output a greeting using that name. We'll have the function print welcome, the person's name, and then on a new line print, you are part of the team. To finish defining the function, simply unindent the next line of code. Now we can call the function using the word greeting. Inside the parentheses we'll type the name Rebecca, then we'll run the cell. Of course, functions can do a lot more than print messages. This is just one simple example of defining your own function. Next, let's consider how to get values out of a function. This is where return values can be used. Return is a reserved keyword in Python that makes a function do work to produce new results. But instead of printing the results, the function saves them for later use. Let's define a new function that accepts two arguments, the base and height of a triangle, and returns the area of the triangle. The area is calculated as base times height divided by two. We use the keyword return to tell Python that this is the value that we want to come out of the function. Instead of printing, return lets us store this value in a variable. So suppose we have two triangles and want to add the sum of both areas. Here's what we would do. First, calculate the two areas separately, storing each value in its own named variable. Then add the two areas together, assigning the result to a variable called total area. If we call this variable, the Jupyter notebook returns its value, but we don't have to call it. We could continue writing code if we want. This demonstrates the power of the return statement. It enables us to combine function calls with other operations, which makes the code reusable. Reusability involves defining code once and using it many times without having to rewrite it. There's more information on reusability soon, but for now just understand that reusing something takes a lot less time and effort than recreating it every time. Let's do one more. Here's a function called get seconds. This function takes hours, minutes, and seconds as inputs and returns the total number of seconds those inputs represent. In the first line, we begin with the keyword def and name the function get seconds. In the parentheses, we give it three parameters: hours, minutes, and seconds.
The next line performs a computation that calculates the total number of seconds and assigns that value to a variable called total seconds. The third and final line is the return statement that returns the value of total seconds. When we call the function, we have to give it three arguments: hours, minutes, and seconds. We'll use 16 hours, 45 minutes, and 20 seconds. And there's our result, 60,320 seconds. Now you understand more about functions and how to use the return keyword to save the results of a function for later output. Code reuse is a key element of Python that you will continue to appreciate as a data analytics professional. Your data toolbox is growing and growing, and there's more on the way. In the early years of software development, it was common for developers to write each bit of code themselves. Now we know it's much more efficient to reuse code that others have written and put in online code repositories. Or we can develop modular code, which you'll learn about in this video. Both of these practices speed up development and help data professionals focus on using code logic to meet business needs instead of doing rework. As we've discussed, reusability involves defining code once and using it many times without having to rewrite it. Consider this example. This script uses the len function, which returns the length of an object. In this case, it's the number of characters in the string. Then it uses that length to calculate a number, which we're calling the lucky number. Finally, it prints a message with the name and the number. Each time we want to perform the calculation, we change the values of the variables and write the formula. Notice how there are exactly two lines that are the same in the first and second part of the code. When you find code duplication in your scripts, it's a good idea to check if you can clean things up by using a function. Let's rewrite this code, creating a function to group all the duplicated code into just one line. This updated script gives us the exact same results as the original one, but it's cleaner and easier to understand. Best of all, it's now reusable. We can execute the code inside the lucky number function as many times as we need to by just calling it with a different name. Because of its modular nature, Python is well suited to making code reusable. Modularity is the ability to write code in separate components that work together and that can be reused for other programs. Modularity is closely related to reusability because it lets you reuse blocks and sections of code. Reusing code blocks can help you more effectively collaborate with data engineers on larger projects, so that they don't have to start their code from the beginning. Here's an example. These variable names don't really tell us anything about what this code is trying to do. We can run it, and yes, it does something, but it was pretty difficult to read and understand that code. So let's try to make this code clear for other users. Refactoring is the process of restructuring code while maintaining its original functionality. This is a part of creating self-documenting code. Self-documenting code is code written in a way that is readable and makes its purpose clear. This involves everything from selecting your variable names to writing clear, concise expressions. Comments are a helpful supplemental explanation of the code. When your computer registers the hashtag character in front of a comment line, it knows to ignore everything that comes after that character on that line.
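As a rough illustration of the reuse and commenting ideas described above, before we come back to the messy example with the unclear variable names, here's a sketch; the names, the printed message, and the lucky-number formula are hypothetical stand-ins for the script in the video:

    # Duplicated version: the same two lines appear twice
    name = "Kay"
    number = len(name) * 7                                   # hypothetical lucky-number formula
    print("Hello " + name + ". Your lucky number is " + str(number))

    name = "Cameron"
    number = len(name) * 7
    print("Hello " + name + ". Your lucky number is " + str(number))

    # Refactored version: the duplicated lines live in one reusable function
    def lucky_number(name):
        number = len(name) * 7
        print("Hello " + name + ". Your lucky number is " + str(number))

    lucky_number("Kay")
    lucky_number("Cameron")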
So let's refactor this code to make it self-documenting. Now the intent and construction of our code is more clear. It's also broken up into functions and commented sections. Commenting is a useful practice because it helps you think about your process while documenting your workflow for other collaborators. Although messy code doesn't necessarily cause a script to fail, the cleaner the code, the more useful it is for the rest of your team. Your colleagues will appreciate clean code because they can understand and reuse it to save themselves both time and effort. Plus, code reuse and modularity reduce errors, enhance teamwork, and build trust. Previously you learned about writing clean code. In that video I mentioned that lines of code that begin with a hashtag don't get executed and instead serve as comments for the human reading the code. In this video you'll learn more about commenting and why it's such an important part of writing good code. Building good coding habits will enable you to use Python effectively when using data to inform solutions to business problems. Let's begin with algorithms, which will help you learn to think like a programmer and translate instructions into Python code. In programming languages, an algorithm is a set of instructions for solving a problem or accomplishing a task. One everyday example of an algorithm is a recipe consisting of specific instructions for baking bread. First you preheat the oven to 425 degrees Fahrenheit. Then you mix two cups of flour, three eggs, two cups of water, and a teaspoon of yeast in a bowl using an electric hand mixer. Then you let the dough rise for an hour. After that, you transfer the dough from the bowl to the baking pan. Finally, you insert the baking pan in the oven. In a similar way, every computerized device is given instructions in the form of algorithms, as hardware- or software-based routines, to perform its functions. That's why it's important to know how to explain things logically to the computer. This is what it means to think algorithmically. You've already started to think this way because you've learned about functions, and functions are algorithms. As your coding skills develop, you'll be able to write longer and more complex functions. The best way to approach writing a new function is to break it into small, simple pieces, beginning with the comments. Outlining the comments and steps before you even write the code helps you to better understand the problem. Let's review an example. Suppose we have a square fountain and we want to plant grass in a border around that square. Let's write a function to calculate the amount of grass seed we'll need if we know the length of the side of the fountain and the width of the grass border. As always, begin with the def keyword. We'll name the function seed calculator. Its parameters will be the two things we know: the fountain side length and the width of the grass border. Now we'll write the body of the function, breaking it into small steps that we'll outline with comments. First we'll find the area of the fountain. Next we'll calculate the total area of the square and the grass border combined. From these we can derive the area of just the grass border by subtracting one from the other. Then we'll calculate the amount of seed we'll need, which is 35 grams per square meter. We'll have to convert that to kilograms because that's what we said the function would output. And finally, we have the return statement. So let's review what we've done.
We use comments to create a logical scaffolding before writing any code in the body of this function. In other words, we use comments to break down the thought process that outlines each segment of code that we'll need in order to meet our goal. The only thing left for us to do is fill it in with code, step by step. We can get the total area by finding the length of one side of the larger outer square and squaring it. The length of the large square is equal to the width of the border times two, plus the length of the side of the fountain. So we'll code that as total area equals fountain side plus two times the grass width, and we'll take that whole expression and square it. The area of the grass border is equal to the total area minus the fountain area. Then the amount of seed we'll need is the grass area times 35 grams per square meter. Next, we convert grams to kilograms by dividing by 1,000. And return our seed variable. We're almost done. There's another important part of writing functions that are user friendly and easy to understand. It's called a document string, or docstring as it's most commonly referred to. The docstring is a string at the beginning of a function's body that summarizes the function's behavior and explains its arguments and return values. A function's docstring begins and ends with three quotation marks. They can be single quotes or double quotes. First, we write what the function does. It takes the form of a command and ends in a period, like, calculate the number of kilograms of grass seed needed for a border around a square fountain. Next, we'll describe the function's parameters. Ours has two. We have fountain side, which is numerical data that represents the length of one side of the fountain in meters. Then we have grass width, which is also numerical data, and it represents the width of the grass border in meters. Lastly, we'll describe what the function returns. This function returns the seed variable, which is a float that indicates the amount of grass seed needed for the border in kilograms. Great, we have a function that performs a complex task and can be used as many times as we need. Using comments to break up the parts of the problem allowed us to solve it in clear, simple steps. Best of all, other people can use this code and understand exactly what it's doing because we've written a docstring and concise comments. So how much seed do we need if our fountain is a square that's 12 meters long on a side and we want a 2 meter border of grass around it? 3.92 kilograms. To recap, comments act as a scaffolding that breaks up your code into manageable pieces. Along with the function's docstring, they help you and others understand and use your code. It's important for data professionals to get in the habit of writing well-documented code. It's a little more work up front, but you'll thank yourself later, and so will your colleagues. You've learned about data types like integer, string, and float. Another data type is Boolean data. This is data that has only two possible values, usually true or false. The word Boolean comes from George Boole, a 19th century English mathematician. Every time you compare things in Python, the result is Boolean type data. Data professionals use Boolean data every day to control logical flows in their code. In previous lessons, you discovered how Python can be used like a calculator for basic arithmetic. Now we'll find out how to use the power of Python to compare values with comparators and operators.
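Before we move on to comparators, here's what the seed calculator might look like with its comments and docstring in place; this is a reconstruction of the function just described, so small details may differ from the version in the video:

    def seed_calculator(fountain_side, grass_width):
        """
        Calculate the number of kilograms of grass seed needed for
        a border around a square fountain.

        Parameters:
            fountain_side (num): length of one side of the fountain, in meters
            grass_width (num): width of the grass border, in meters

        Returns:
            seed (float): amount of grass seed needed, in kilograms
        """
        # Area of the fountain
        fountain_area = fountain_side ** 2
        # Total area of the fountain plus the grass border
        total_area = (fountain_side + 2 * grass_width) ** 2
        # Area of just the grass border
        grass_area = total_area - fountain_area
        # Amount of seed needed at 35 grams per square meter
        seed = grass_area * 35
        # Convert grams to kilograms
        seed = seed / 1000
        return seed

    seed_calculator(12, 2)    # returns 3.92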
Comparators are operators that compare two values and produce Boolean values. All right, now let's consider an example. If we print 10 is greater than 1, a comparator produces the result, a Boolean value, true. There are six comparators in Python. They let us confirm whether something is greater than, greater than or equal to, less than, less than or equal to, equal to, or not equal to something else. Data professionals take the results of comparator expressions and use them to make decisions about their data based on these descriptions. Here's an example. Cat is not, in fact, equal to dog. So it produces the Boolean value false. Now, let's pair an exclamation mark and an equal sign, which is the not-equals comparator. So the comparator checks that one isn't equal to two, and produces the Boolean value of true. As we've learned, the plus operator doesn't work between integers and strings. So let's consider what will happen if we try to compare an integer and a string. Yep, another type error. The good news is that Python also has a set of logical operators. Logical operators are operators that connect multiple statements together and perform more complex comparisons. Examples of these are the words and, or, and not. The and operator needs both expressions to be true to return a true result. Here we're comparing strings. When used on strings of text, comparators evaluate the first letter of each string, with a being least and z being greatest. If two strings have the same first letter, the second letter will be compared. In this case, the y in yellow is greater than the c in cyan. But the b in brown doesn't come after the m in magenta. So this means that the first statement is true, but the second one isn't true. So if only part of an expression is true, the result of the whole and statement is false. Or statements are the opposite. If we use the or operator, the expression will be true if either of the expressions is true, and false only when both expressions are false. Try it out. Print this code: open parentheses, 25, the greater-than comparator, 50, the or operator, 1, the not-equals comparator, 2, close parentheses. 25 is definitely not greater than 50, but 1 is not equal to 2. So in the end, the whole expression is true. Now, the not operator inverts the value of the expression that follows it. If it's true, it becomes false. If it's false, it becomes true. Because there is a not statement in front of 42 equals the string answer, the result is true. Comparators and logical operators are very useful in the data field because they make it possible to write much more complex code. Later, I'll be back to demonstrate some examples of these expressions. Keep practicing and reviewing comparators and operators. I'll be with you again soon. Now that we know about variables, expressions, functions, data types, comparators, and logical operators, we can perform exciting actions in our scripts based on their values using branching, which is exactly what we'll learn about in this video. Branching describes the ability of a program to alter its execution sequence. This is a key component of writing useful scripts. Branching uses if statements based on certain conditions. If is a reserved keyword that sets up a condition in Python. If statements, also known as conditional statements, are just like using the word if in everyday life.
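Before those everyday examples, here's a minimal sketch of the comparators and logical operators demonstrated above:

    print(10 > 1)                                       # True
    print("cat" == "dog")                               # False
    print(1 != 2)                                       # True

    print("yellow" > "cyan" and "brown" > "magenta")    # False: only the first comparison is True
    print(25 > 50 or 1 != 2)                            # True: only one expression needs to be True
    print(not 42 == "answer")                           # True: not inverts the False comparison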
For example, if it's before noon, you'll greet someone by saying good morning rather than good afternoon or good evening. If it's raining outside, you might choose to carry an umbrella. And if it's snowing, you'll probably wear a jacket. Here's an example of this concept in a business context. At a company, new employees can choose their user names. However, the user names need to fit a given set of guidelines. Maybe a valid username requires at least eight characters. As the data professional at this business, you're tasked with writing a program that will tell the user if their choices are valid or not. To accomplish this task, we'll write a function. The goal is to define the function so that it generates a username hint using an if statement. As a reminder, the built-in len function will return the length of an object, and that can be paired with the less than comparator to identify user names that don't meet the criteria. Great. Now your function checks whether the length of the username is less than eight. If it is, the function prints a message saying that the username is invalid. Let's review our if statement. We write the keyword if, followed by the condition that we want to check for, followed by a colon. After that, we have the body of the if block, which is indented further to the right. Here's a very important point. The body of the if block will only execute when the condition evaluates to true. Otherwise, it does not execute. What this means is that if you run an if block and the condition is not met, the indented code beneath it gets ignored. The if statement is a useful construct in Python syntax. But what if we could extend it to make it even more powerful? What if we want the computer to do something else? Else is a reserved keyword that executes when preceding conditions evaluate as false. The else statement lets us set a piece of code to run only when the condition of the if statement is false. Here's an everyday example. If you're hungry, you eat. But if you're not hungry, if that concept is false, then you'll do something else. Maybe you'll make another choice and take a nap. Think about our company username example. Maybe now we want to print a message when a username is valid. The function can now go in different directions, depending on the length of the username. If the username is not long enough, a message indicates that it's invalid. But if the function verifies that the username is long enough, it will print a message saying that it's valid, which is dictated by the else statement. Notice the structure of the function right now. The if statement is indented in the body of the function, and the action we want to happen, if that statement is true, is indented beneath it. We could write as many lines as we want here, and as long as they're all indented beneath the if statement, they'll all execute when that if statement is true. Then we have the else statement. Note that it's unindented to the same level as the if statement. An if statement and its corresponding else statement are always written at the same level. Beneath the else statement, we indent once more to indicate that this is what must execute when the if statement is not true. Sometimes you don't need to add an else statement because that logic is already built into the code. Let's explore a new operator that will help with this. We're going to use a new operator, the modulo.
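Before the modulo, here is a minimal sketch of that username check with an if/else branch. The function name validate_username and the exact messages are assumptions; the transcript doesn't show the code itself.

```python
def validate_username(username):
    # The indented body runs only when the condition evaluates to True.
    if len(username) < 8:
        print("Invalid username: it must be at least 8 characters long.")
    else:
        print("Username is valid.")

validate_username("data")         # Invalid username: it must be at least 8 characters long.
validate_username("data_pro_22")  # Username is valid.
```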
Represented by the percentage sign, the modulo is an operator that returns the remainder when one number is divided by another. The division between integers yields two results which are both integers, the quotient and the remainder. So for an integer division between five and two, the quotient is two and the remainder is one. For an integer division between eleven and three, the quotient is three and the remainder is two. Even numbers are all multiples of two, which means the remainder of the integer division between an even number and two is always going to be zero. So the result of ten modulo two is zero. Let's review an example. This function checks whether a number is even by dividing it by two and checking that the remainder is zero using the modulo operator. If the remainder is zero, the function will return true. Now here's the interesting part. You can put an else statement here. That would work, but it's not strictly necessary because of the way if statements work. Remember, when the if statement evaluates to true, the code indented beneath it will execute. But when the if statement evaluates to false, nothing indented beneath it will execute. The code will then continue running until it gets to the end of the function. Let's try entering the odd number 19 using the is even function we defined. The function returned false because 19 modulo two is not zero, so the if condition doesn't evaluate to true. So the code indented beneath the if statement doesn't execute. And then the function continues running. In this case, the only remaining code in the function is return false. So that's what it does. It returns false. At first, you might prefer to include the else statement in such instances. And that's okay. It's important to know that both ways are correct. But keep in mind, this technique can only be used when returning a value inside the if statement. To recap, an if statement branches the execution based on a specific condition being true. And the else statement sets a piece of code to run only when the condition of the if statement is false. For situations where there are more conditions to consider, the elif statement, short for else if, is useful. The elif keyword is a reserved keyword that checks subsequent conditions when the previous conditions are not true. It's Python's way of saying if the previous conditions were not true, then try this condition. Let's consider an example to better understand elif. It is likely the case that the weather might influence what you choose to do with your afternoon. If the weather is nice, you might go to the park. If it's raining, you might go to the movies instead. Depending on which of these activities you choose to do, you also need to decide how to get there. And the activity may determine your mode of transportation. So the choices you make depend on different conditions at each point. These are if else if statements you may encounter in daily life. Let's return to the username validation example. Perhaps now we want to limit how long the username can be. Maybe our business has a rule that usernames longer than 15 characters aren't allowed. Let's type our first condition. Usernames of less than 8 characters are invalid. Now we want to add another condition to this, limiting the username to a maximum of 15 characters. Notice there are two else statements. The first else is the second course of action: it runs if the first condition is not met and the length of the username is greater than or equal to 8.
In other words, if the first if statement is false, then the code executes the first else statement. This else statement has two more nested conditions: an if and an else. The indentations make the relationships between the different branching statements easier to read. But the nesting adds some complexity. Remember, you can choose to use as many or as few spaces as you want for the indentation. But generally, it's best to use four spaces for readability, and it's important to be consistent. To avoid unnecessary nesting and to make the code clearer, Python's elif keyword lets us handle more than two comparison cases. In fact, the elif keyword allows us to handle an unlimited number of comparison cases. The elif statement is similar to the if statement. The abbreviation for else if prevents a lot of nested if and else statements. If all the above conditions are false, then the final else statement executes. Now, let's run our function on a really long username. This script works just the same as the one with the nested if-else comparisons I just demonstrated, but it's much easier to understand. Let's examine it. The function first checks whether the username is less than 8 characters long. If that's the case, it prints a message. Next, if the username has at least 8 characters, the function then checks if it's longer than 15 characters and prints a message if that's true. If neither of the above conditions are met, the function prints a message indicating that the username is valid. Now you know how to use if, elif, and else statements inside functions. This kind of branching is super helpful when determining the flow of your scripts. Use branching to pick between different pieces of code to execute, making your script pretty flexible and efficient. Branching also helps with all kinds of practical things such as binning data based on its value, backing up files, or only allowing login access to a server during a certain time of day. Anytime your program needs to make a decision, you can specify its behavior with a branching statement. Now you have a strong foundation to build branches in code, which is going to enable you to do a lot of useful work in Python as a data professional. You've learned so much already and I hope you're enjoying the discovery as much as I am. You've come to the end of the second section of the Python course. You've added a lot of new Python skills to your skill set and gained valuable practice working with data. Well done. Along the way, we've explored how Python code can help you quickly perform complex operations on your data. You've also learned how to write clean, readable code that can be easily understood and reused by other data professionals. This is an important part of collaborating with teammates on any data project. Writing clean code will help your team reduce errors, work faster, communicate more effectively and deliver better results. We began by exploring functions, or reusable chunks of code that let you perform specific tasks. Next, we discussed two important elements of writing clean code, reusability and modularity. We also discussed another best practice for writing clean code, commenting. After that, we reviewed two types of Python operators, comparators and logical operators. And finally, you learned how to write conditional statements such as if, else, and elif statements. Coming up, you have a graded assessment.
To prepare, review the reading that lists all the new terms you've learned, and feel free to revisit videos, readings and other resources that cover key concepts. Congratulations on your progress. Let's keep it going. Welcome back. Wow, you've learned a lot of new Python skills. You can use variables to store and label your data and convert and combine different data types such as integers and floats. You can call functions to perform useful actions on your data and use operators to compare values. You also know how to write clean code that can be easily understood and reused by other data professionals. Finally, you can write conditional statements to tell the computer how to make decisions based on your instructions. Your knowledge of fundamental coding concepts is the first stage of a journey that leads to more advanced methods of data analysis. And this learning journey will continue throughout your future career as a data professional. To me, that's one of the most exciting parts of the job. I'm always learning new ways to use Python for data analysis, whether from teammates at work or from the super supportive online community. Learning new Python skills helps me become a more effective data professional. In this section of the course, you'll learn how to use Python code to automate repetitive tasks. Often, data professionals need programs to perform an action repeatedly. For example, if you're analyzing sales data, you may want to perform the same calculation on hundreds of price values. Rather than write new code each time you want to make the calculation, you can instead write an iterative statement or loop. Loops automatically repeat a portion of code until the process is complete. We'll discuss two types of loops, while loops and for loops. Using Python to automate repetitive tasks saves me tons of time and effort and reduces the risk of human error. It also reduces my overall workload and increases my productivity. I have more free time to focus on the main goal of any data analysis project: generating insights for stakeholders. You can iterate over many different data types in Python such as strings, lists, sets and dictionaries. In this section of the course, we'll focus on strings. Later on, we'll discuss the other data types in detail. As a data professional, you'll often work with strings when analyzing data. For example, you might examine textual data related to a company's products, services, customer feedback, and more. Operations like indexing and slicing allow you to select, filter, and edit data quickly and efficiently. These are valuable Python skills for any data professional. When you're ready to learn more, I'll meet you in the next video. If you had to do the same thing over and over and over again, well, you might get a bit loopy. And you'd probably think to yourself, I wish I could spend my time on something more meaningful. Well, that's where computers can help. They do the looping for you. As a refresher, a loop is a block of code used to carry out iterations. An iteration is the repeated execution of a set of statements, where one iteration is the single execution of a block of code. And an iterable is an object that's looped or iterated over. Typically, data professionals use for loops and while loops to work with iterables. In this video, we'll focus on while loops. A while loop is a loop that instructs your computer to continuously execute your code based on the value of a condition. Magali and Fido are here to help me explain.
Notice that Fido is eating treats while the treat bag is in Magali's lap. Fido stays until Magali puts the treat bag away. Then Fido leaves. This is a while statement in action. While Magali has the treats, Fido is there. Once the condition of Magali holding the treat bag is no longer met, Fido exits. After all, there are no more treats. While loops work in a similar way to branching if statements. The difference is that in while loops, the body of the block can be executed multiple times instead of just once. This is great for avoiding redundancy in code. Let's review an example. Assign the value of 0 to the variable x. As you've learned, we call this action initializing, to give an initial value to a variable. Next, start the while loop. Set a condition for this loop that x needs to be less than 5. The prior line just initialized it, so this condition is currently true. Then end the while loop line with a colon. On the next two lines, you may notice a block that's indented to the right. This is the while loop's body. There are two lines in the body of the loop. In the first line, print a message followed by the current value of x. In the second line, increment the value of x by adding 1 to its current value and assigning it back to x. In the body of the while loop, we told the computer to print the results of each iteration. Notice how each iteration changed the value of x. Our while loop started at x equals 0, then printed the message, and incremented the value of x to 1. But because this is a loop, the computer doesn't just continue executing with the next line in the script. Instead, it loops back around to reevaluate the condition for the while loop. This is the second iteration. And because 1 here is still less than 5, it executes the body of the loop again, incrementing x by 1. Now the value of x is 2. The computer will keep doing this until the condition isn't true anymore. Let's add a print statement outside the body of the while loop to let us know the final value of x. When the loop finishes, the next line of code is executed. So now that x has reached the value 5, the loop statement ends. The computer prints the last line of output as x equals 5. We can also use the logical operators and, or, and not that we worked with earlier. Similar to if statements, using these operators allows us to combine the values of several expressions to get the result we want. The most important thing to remember is that the condition used by the while loop needs to evaluate to true or false. It doesn't matter if this is done by using comparison operators or calling additional functions. I think you're ready for an example that uses several different concepts that you've learned so far and even a couple that you haven't. To help you explore the example, I'll guide you as we go and I'll explain the concepts to you. You'll probably find that as you continue on your journey learning Python, you'll encounter new concepts that can be understood in context. It's the same way with spoken language. If I say the elephant was so gargantuan that it crushed the car by stepping on it, the word gargantuan may or may not be in your vocabulary. But the context helps us understand that it has something to do with being big or heavy. It's the same with coding. Pause the video and try to figure out what this code is doing. When you're ready, continue and I'll lead you through it.
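The code referred to here was shown on screen rather than in the transcript, so here is a rough reconstruction of the sort of program being described, based on the walkthrough that follows. The exact messages and variable names are assumptions.

```python
import random

# Pick a random number between 1 and 25, inclusive.
number = random.randint(1, 25)
number_of_guesses = 0

while number_of_guesses < 5:
    print('Guess a number between 1 and 25:')
    guess = input()
    guess = int(guess)          # Convert the input string to an integer.
    number_of_guesses += 1      # Same as: number_of_guesses = number_of_guesses + 1
    if guess == number:
        break                   # Correct guess: escape the loop immediately.
    elif number_of_guesses == 5:
        break                   # Out of guesses: escape the loop.
    else:
        print('Nope, try again!')

if guess == number:
    print('Correct! You guessed it in', number_of_guesses, 'tries.')
else:
    print('You did not guess the number. The number was', number)
```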
In this example, we're going to write a short program that generates a random number and then gives the user five chances to guess it. We begin by importing a package called random. This package has many uses, but the one we want is its ability to generate a random number. Now, instantiate a variable called number and use the randint function from the random package to assign it a value. The value is a random number between 1 and 25 inclusive. Next, we instantiate another variable called number of guesses and assign it a value of zero. This variable is important because it will behave as a counter that controls the logic of our program. Now comes our while loop. We write, while the number of guesses is less than five, and below it, we give the instructions for what should happen at each iteration of the loop. First, the script will print a message telling the user to guess a number from 1 to 25. Then, we instantiate a new variable called guess, which will capture a string of whatever the user inputs. The input function used here will create a prompt that allows the user to type their guess. The next line converts the guess to an integer. This is important because without it, the input function would capture the number as a string, and it would be impossible to win the game because even if they guessed the right number, the string of the number will never evaluate as being equal to the integer of the number. Now, we increase the number of guesses variable by one. This is an important step because, remember, the while loop will continue to iterate for as long as the number of guesses variable is less than five. If we left this out, the loop would never stop running and the user would have infinite guesses until they got it right. By the way, this syntax is a helpful shortcut. Instead of writing number of guesses equals number of guesses plus one, we can simply write number of guesses plus equals one. They mean the same thing, but one is way shorter. Now we're going to have our branching logic that determines what the program does. We want the behavior to be different depending on how many guesses the user has made and whether their guess was correct or not. This code is still within the body of the while loop, so it will execute on every iteration. The first line checks if the guess is correct. If it is, the while loop breaks. This word break is a keyword that lets you escape a loop without triggering any else statement that follows it in the loop. In this case, if the user guesses the correct number, the while loop would break and the code would continue at the if statement below. Because the guess is correct, we print a message telling the user that they are correct and include how many tries it took them in the message. But let's go back up to the while loop and go through what happens if the guess is not correct. We use the elif statement to check if the guess was their fifth guess. If it was, we break the while loop. Evaluation would continue below, and because the guess was incorrect, it would trigger the very last else statement that outputs a message telling the user they were unsuccessful and revealing the correct number. Without this elif statement, the code would still work. Kinda. The user would get a message saying nope, try again, and then immediately get another message telling them they were unsuccessful and revealing the number. But if the guess was not their fifth guess, the code would proceed to the else statement which prints nope, try again.
Let's run the program and try to guess the number. Congratulations. You just integrated a bunch of concepts into a single script of code and worked on a problem that incorporated some complex logic. These skills are invaluable to data professionals. Have you ever tried to withdraw money from your bank's ATM but keyed in your password incorrectly? As you probably experienced, you only have a certain number of tries before the system locks you out. Sure, it's frustrating if you're just typing too quickly or temporarily forget your PIN, but it's a great safeguard that helps protect your bank accounts. Well, the software in many cash machines runs using a for loop. This detects how many attempts have been made on a single transaction before blocking the user. And in this video, we're going to learn how that works. A for loop iterates over a sequence of values. A simple example of a for loop is to iterate over a sequence of numbers. For x in range 5, print x. Notice how the structure of a for loop is similar to the syntax of a typical Python statement. The first line indicates a distinguishing keyword, in this case for. And like functions and other expressions that start a distinct code block, it ends with a colon. The body of the for loop is indented to the right. What's different in this case is that we have the keyword in. Also, between the keywords for and in, we have the name of a variable, x. This variable will take each of the values in the sequence the loop iterates through. In this example, x takes the values zero, one, two, three, and four. We don't have to use x. We could use any term we want. N, number, monkey, it doesn't matter. As long as we maintain consistency between what we name it here and how we refer to it below. Okay, now let's examine the range function. The range function is a Python function that returns a sequence of numbers starting from zero, increments by one by default, and stops before the given number. It can be used in while or for loops. In Python and many other programming languages, a range of numbers will start with the value zero by default. The last number generated will be one less than the given value. Check it out. By default, our range function starts with zero. After the first iteration, the value will be one. The second iteration outputs two, and so forth. Whatever code we put in the body of the loop will be executed on each of the values, one value at a time. You can also use a for loop to read in a file and iterate over the file line by line. The with open statement uses the file path to read in the file. In this case, it's a text file containing the Zen of Python, a famous poem written by software engineer Tim Peters in 1999. For easier notation, assign it to the variable f. Otherwise, we'd have to write the file path again. In the next line, start the for loop for each line in the file. Inside, you indent, and on the next line, tell the computer to print each line. After the loop is complete, tell the computer to print, I'm done. I once had to locate all of the unique words in a 2D array of text. I used a for loop inside of another for loop. The outer loop ran over each column, and the inner loop iterated over each cell in the column. That's just one example. Let's check in on Magali and Fido for another. Magali doesn't want Fido to eat too many treats, so she only gives him five treats per day. Fido wags his tail each time he gets a treat. Once he has all five, he stops wagging his tail. Good dog, Fido.
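Here are rough sketches of the two for loops described in this video. The file name zen_of_python.txt is a placeholder, since the actual file path isn't given in the transcript.

```python
# A for loop over a sequence of numbers: x takes the values 0, 1, 2, 3, 4.
for x in range(5):
    print(x)

# A for loop over the lines of a file, read in with a with open statement.
with open('zen_of_python.txt') as f:
    for line in f:
        print(line)
print("I'm done.")
```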
Data professionals use for loops in Python and other programming languages all the time. They're one of the fundamental tools of coding. They're great for performing a repeated procedure on an object with a fixed length to create something new. Next, we'll explore more ways to use loops. Earlier, you learned about the range function and how it generates a sequence of numbers starting with zero. However, your work as a data professional won't always need to start with zero. Instead, you might want to use a step parameter in the range function. This makes it possible to specify which elements of a variable or data set we want to start and end with. In this video, you'll learn how to use the three parameters of the range function: start value, stop value, and step value. Here's a for loop that calculates the factorial of nine. Let's explore how it works. We begin by assigning a variable called product with the number one. Next, we have a range function that starts at one and stops at 10. Since the stop value is not included in the calculation, this means that the code multiplies a variable by every whole number in the sequence beginning with one and ending with nine. So for n in range one to 10, we multiply our product variable by one and reassign the result back to itself. Then we multiply the product variable by two and reassign the result back to itself, then by three, all the way up to nine. This produces the factorial of nine. The factorial of nine equals 362,880. Note that we started with one and not with zero. If we multiplied by zero, the product would be zero. The range function also lets us specify a third parameter to change the size of each step in our range. In other words, instead of going one by one, you can have a larger difference between each of the values in your range. Let's explore an example. First, define a function that converts a temperature value from Fahrenheit to Celsius. Use the standard conversion formula. The temperature in Fahrenheit that needs to be converted is identified by x. We subtract 32 from x, then multiply by five and divide by nine. Next, we have a for loop that will print a table of temperature conversions every 10 degrees from zero to 100 degrees Fahrenheit. Notice that the for loop starts at zero and goes up to 100 in steps of 10. Remember that the range excludes the final value in the sequence. So to include 100 in our sequence, we put the end value as 101. The body of the for loop prints the value in Fahrenheit and the value in Celsius to create a conversion table. For every 10 degrees Fahrenheit, the code prints the corresponding value in Celsius on each line. So now you know more about setting parameters for the range function when you're working with for loops. As a quick reminder, use for loops when there's a sequence of elements that you want to iterate over. For example, to loop over a variable such as a record in a data set, it's always better to use for loops. This also improves the readability of your code. Use while loops when you want to repeat an action until a Boolean condition changes, without having to write the same code repeatedly. Remember, Booleans are a data type that represents one of two possible states, usually true or false. And if whatever you're trying to do can be done with either a for loop or a while loop, just use whichever you prefer. I happen to love them both and I'm really glad I have both tools in my Python toolbox. We've covered a lot of material so far.
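As a sketch of the two range examples from this video, with the function name to_celsius assumed since the transcript doesn't show it:

```python
# Factorial of 9: multiply product by every whole number from 1 through 9.
product = 1
for n in range(1, 10):
    product = product * n
print(product)  # 362880

# Fahrenheit-to-Celsius conversion table, every 10 degrees from 0 to 100.
def to_celsius(x):
    return (x - 32) * 5 / 9

for x in range(0, 101, 10):
    print(x, to_celsius(x))
```

The output here is left unformatted; the neater two-decimal table formatting is covered later with the format method.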
You've learned about functions, conditional statements, and loops. Now we'll learn more about different data types in Python, beginning with strings. Recall that a string is a sequence of characters and punctuation that contains textual information. This is an immutable data type, which means the values can never be altered or updated. Even though strings are immutable, we can still do a lot with them. For instance, we can concatenate them. To concatenate means to link or join together. So concatenating strings is making a single longer string from two or more shorter ones. To concatenate strings in Python, we simply use the addition operator. If we have two strings, hello and world, we can join them by adding them together. The result is a single string, but it's also a single word. This is because blank spaces, or white spaces as they're known in computer programming, count as their own characters. If you want a white space between your concatenated strings, one of the strings must contain a white space. Or you must add a third string between them that contains just a white space. The same rules apply when using variables that point to strings. If hello is assigned to greeting one and world is assigned to greeting two, we can concatenate the strings by adding the two variables. We can also multiply strings using the multiplication operator. Danger times three equals danger, danger, danger. However, we cannot divide or subtract strings, and trying to do so will throw an error. As you know, some characters are reserved for specific purposes when working with strings. For example, quotation marks are used to indicate the beginning and end of strings. But what if we want our string to contain quotation marks? There are two ways we can approach this. The first way takes advantage of the fact that strings can be written with either single quotes or double quotes. If you want to include double quotation marks in your string, use single quotation marks to begin and end your string, and vice versa. The second way is to use the backslash, which functions as an escape character. An escape character changes the typical behavior of the characters that follow it. In this case, the typical behavior of quotation marks is to begin or end the string. But if we precede each with a backslash, they'll behave as regular punctuation marks in the string. The backslash character is useful as an escape character for other special characters in strings too. For example, backslash n is a special character combination that's used to indicate a new line when printing a string. But if you want to include backslash n as characters in your string when you print it, you must precede the combination with an initial backslash. Moving on, we can also iterate over strings with loops. In this example, we use a for loop to iterate over each letter of the word Python and print the letter plus the letters u-t. These are just a few of the ways to work with strings. As a data professional, you'll often work with strings when analyzing data. Coming up, we'll cover more useful operations that we can perform over strings. Recently, you learned the basics of strings. In this video, you'll add to your knowledge by exploring a new way to work with strings, slicing. But before you learn about slicing, you'll need some background information about how Python works. As a reminder, an object is iterable if you can sequence through all of its values or items.
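Before getting into indexing, here is a quick sketch of the string operations covered above; the exact strings used in the video aren't shown, so these values are placeholders.

```python
# Concatenation with the + operator; include a space in one of the strings.
greeting_one = 'Hello '
greeting_two = 'world'
print(greeting_one + greeting_two)   # Hello world

# Repetition with the * operator.
print('Danger! ' * 3)                # Danger! Danger! Danger!

# Escape characters: \" prints a quotation mark, \n starts a new line.
print("She said \"hello\" twice.\nThen she left.")

# Iterating over a string with a for loop.
for letter in 'Python':
    print(letter + 'ut')
```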
Indexing is Python's way of letting us refer to individual items within an iterable by their relative position. Indexing is a very important part of Python because it allows us to select, filter, edit, and manipulate data. And it opens up many possibilities to the data professional. By the way, indexing isn't just for strings. It also works on many other data types, like lists, tuples, sets, and others, as long as they're iterable. You'll learn about these other data types soon. Python uses zero-based indexing. That means that the first element of a sequence is indexed as zero. With strings, indexing works by interpreting a string as a sequence of characters where each character has a numbered slot. If you're reading from left to right, the first character is located at slot zero. The second character is located at slot one and the third at slot two, and so on. Indexing lets us slice strings to create smaller strings or substrings. Here's an example that many data professionals have experience with. A column in a data set contains employee salary information. In the same field, there will be both text and numbers: the currency symbol and the salary amount. In this case, Python would automatically interpret the mix of data types as a string data type. This is often a problem because we usually want values that represent money to behave like numbers so we can perform mathematical operations on them. If they're strings, we can't do that. So to fix the problem, slicing helps us remove the non-numeric characters, like the dollar sign, from the string. In this case, we could drop the character at the zero index of each value in the salary column. This gives us salary information without the currency prefix. Let's explore some ways of working with indices. One useful tool is the index method. Index is a string method that outputs the index number of a character in a string. Remember, a method is a function that applies to a variable. We can call it by following the variable with a dot. We use the index method to identify the location of a character or substring in a string. Here, we have a variable called pets, which has been assigned the string cats and dogs. We use the index method by attaching it to the pets variable with a dot. In its parentheses, we enter the character we want the index of. Let's find s. When we run the cell, the computer returns the number three. This means that index three of our string contains the letter s. What if there's more than one of the same character or substring? Here, we know that there are two s's in cats and dogs, but only one index is returned to us: three. That's because the index method just returns the first position that matches. And if you search for a substring that is not there, say z, you'll get a value error because the substring is not found. We can also use an index number to find a specific character in that position. For example, we'll assign the string Jolene to a variable called name. By placing the index number in brackets after the variable, we can access the character at that position. Name at index zero is J and name at index five is e. What happens if we put six instead? We get an index error indicating that the string index is out of range. You can access the last character of a string even when you don't know how long the string is by using negative indices. Let's consider an example. We don't know the length of this string, but it doesn't matter.
Since it isn't super efficient to count each character out, we can reference it by starting from the last position with a negative index. By using the index negative one, we get the last character, an exclamation point. And if we use the index negative two, we get the second to last character, a. Now that we've gone over the fundamentals of indexing, let's do some slicing. A string slice is a portion of a string. It's also known as a substring. String slices can contain more than one character. Here's an example of how we slice a string of the word orange. Let's start by putting some index numbers inside square brackets and separating the numbers by a colon. This defines the range of characters in the new slice. We'll go from index one up to index four. The closing index is not included in the range that's returned to us. So this would capture indices one, two, and three. And we've extracted a slice that contains the characters that correspond to these indices: r, a, n. We can also use slice notation with just one of the two indices. Omitting the first number in the range implies that the range begins at zero. So if the string is pineapple and we indicate our slice using colon four, we'll capture the first four letters, pine. Similarly, if we slice using four colon, we'll capture everything beginning with index four all the way to the end, apple. Great. We have one more thing to learn. Sometimes data professionals want to check whether or not a substring is contained in a string. To check whether or not a substring is contained in a string, use the keyword in. Let's find out if banana is in the string contained in our fruit variable. It's false. There is no banana in our pineapple. But is there apple? Let's check. Yes, apple is a substring of pineapple, so the computer returns a value of true. Confirming whether a substring is contained in a string is a common practice in all kinds of data careers. I encourage you to take some time to go through the steps again on your own. The more you apply what you learn, the more comfortable you will become. As you're discovering, understanding strings is a big part of Python and programming success. Now that you have a foundational understanding of strings, let's learn some new approaches to save time when creating and manipulating strings. In this video, we'll focus on formatting strings using the format method. The format method formats and inserts specific substrings into designated places within a larger string. Let's examine how this appears in code. We have two variables, name and number. The format method uses the curly braces to designate where the variables should be inserted into the string. If we pass the variable names as parameters to the format method, we find that it doesn't matter that the name is a string and the number is an integer. The format method will insert strings of the values represented by these variables. The order that they're inserted in is the order that they're entered as parameters into the format method. In this case, name corresponds to the first set of braces and number corresponds to the second set. You can also be more explicit with how you designate what substring goes where. You can name your own keywords and insert them into the braces. In this case, we'll use name and num. Now we explicitly assign our variables to those keywords in the method parameters. When we run the cell, the values represented by the variables get inserted into the printed string according to their keywords.
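A rough sketch of what those two versions might look like; the sentence and values are made up, since the transcript doesn't show the actual code.

```python
name = 'Marisol'
number = 27

# Positional: values are inserted in the order they're passed to format().
print('Hello {}, you are number {} in line.'.format(name, number))

# Keyword: each value is inserted wherever its keyword appears in the string.
print('Hello {name}, you are number {num} in line.'.format(name=name, num=number))
```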
Notice that when we do it this way, the order that we enter the method arguments doesn't matter. Name will be inserted into the name field in the string and number will be inserted into the num field of the string. This approach is very helpful, for example, in cases where an output message needs to be translated into another language. Many languages change the order of words to communicate the same message. This method makes rearranging strings fast and easy. And yet another way to insert values into strings is to use integer values in the braces to indicate the order in which to insert the arguments. Notice how in this example we can enter the variables number and name in the argument field in a different order than they get inserted into the printed string. As a data professional, these different ways to insert values into strings offer you a lot of flexibility in how you choose to work and solve problems. Here's an example that not only inserts substrings into a larger string but also formats them. Imagine you want to print the price of an item with and without tax. Depending on the tax rate, the number might extend more than two places beyond the decimal. We can use string formatting to set a limit on the number of decimal places in the output, which makes it more readable. In this case, our item costs seven dollars and seventy five cents without tax, and the tax rate is seven percent, so the price with tax would be 8.2925 dollars. To limit the output to two places beyond the decimal point, we use special syntax. Start with a colon to separate the expression from the keyword name, if you decide to use one. After the colon, write dot two f. Dot two refers to two places beyond the decimal and f stands for float. Now let's check what happens when we run this cell. Nice. The price with tax has two decimals now. You can replace the two in the expression with any number of places beyond the decimal you want. If you put zero, only a whole number will print. We're not done yet. There are even more ways to use the format method to improve expressions. Let's explore our conversion of temperature values from Fahrenheit to Celsius from earlier. At the top is the function we wrote to calculate the conversion, but now instead of just printing the results, we'll format them too. Again, begin with a colon, then use the greater than operator to align the text to the right so that the output is neatly formatted. Greater than three will align the output three spaces to the right. For the converted Celsius value, we'll use greater than six, which will align the Celsius temperatures six spaces to the right. Notice how clean the output is. Our decimals are cut at the hundredths place and the values output in a nice table. Everything you've been learning about strings will help you work more effectively, streamline processes, and save your company lots of time and resources. Using Python is all about maximizing productivity while minimizing effort, making it the perfect tool to help you achieve these goals. This is the end of the third section of the Python course. You've come a long way since the beginning of the course. Congratulations on all your progress. In this section of the course, we focused on using Python code to automate repetitive tasks. Rather than write new code each time you want the computer to repeat an action, you can instead write an iterative statement or loop. Loops automatically repeat a portion of the code until a process is complete.
Using Python to automate repetitive tasks will help you work more effectively, streamline processes, and save tons of time and effort. As a data professional, you'll have more time available for your most important task: analyzing data to generate useful insights for stakeholders. We discussed two different approaches to automating repetitive tasks: while loops and for loops. You learned how to write code for both while and for loops and when to use each approach. We also discussed strings, or sequences of characters like letters and punctuation marks. You learned how to manipulate strings by slicing, indexing, and formatting them. As a data professional, you'll often work with textual data such as product information or customer feedback. Operations like slicing and indexing enable you to select, filter, and edit data quickly and efficiently. Learning Python is an exciting journey that will continue throughout your future career. Each data project that I work on has its own specific challenges. I'm always exploring online or chatting with teammates to learn new Python skills on the job. This helps me solve problems and work more efficiently. As you continue learning and practicing Python, your data analytics skill set will continue to grow. Coming up, you have a graded assessment. To prepare, review the reading that lists all the new terms you've learned, and feel free to revisit videos, readings, and other resources that cover key concepts. You're doing great. Keep it up. Hello again. You've come so far on your learning journey. Just think of all the new Python skills you've developed along the way. You've learned how to use variables to store and label your data and how to work with different data types such as integers, floats, and strings. You can call functions to perform actions on your data and use operators to compare values. You also know how to write clean code that can be easily understood and reused by other data professionals. You can write conditional statements to tell the computer how to make decisions based on your instructions. And recently, you learned how to use loops to automate repetitive tasks. Coming up, we'll explore data structures, which are collections of data values or objects that contain different data types. Data professionals use data structures to store, access, organize, and categorize their data with speed and efficiency. Knowing which data structures fit your specific task is a key part of data work and will help streamline your analysis. We'll focus on data structures that are among the most useful for data professionals: lists, tuples, dictionaries, sets, and arrays. Part of what makes Python such a powerful and versatile programming language are the libraries and packages that are available to it. After we review fundamental data structures, we'll discuss two of the most important libraries and packages for data professionals. The first is Numerical Python, or NumPy, which is known for its high performance computational power. Data professionals use NumPy to rapidly process large quantities of data. I often use NumPy in my job because it's so useful for analyzing large and complex data sets. The second is the Python Data Analysis Library, or pandas, which is a key tool for advanced data analytics. pandas makes analyzing data in the form of a table with rows and columns easier and more efficient because it has tools specifically designed for the job. When you're ready, I'll meet you in the next video. Great to be with you again.
In this video, we'll discuss the differences between data types and data structures. Then we'll explore lists, which are a specific kind of data structure. As you've learned, a data type is an attribute that describes a piece of data based on its values, its programming language, or the operations it can perform. In the context of Python, this includes classes such as integers, strings, floats, and Boolean expressions, among others. A data structure is a collection of data values or objects that can contain different data types. So data structures can contain data type elements such as a float or a string. Data structures also enable more efficient storage, access, and modifications. They allow you to organize and categorize your data, relate collections of data to each other, and perform operations accordingly. One data structure in Python is a list. A list is a data structure that helps store, and if necessary, manipulate an ordered collection of items, such as a list of email addresses associated with a user account. A list is a lot like a string, and you can do many of the same things with lists. For example, both strings and lists allow duplicate elements as well as indexing and slicing. Additionally, both are sequences. A sequence is a positionally ordered collection of items. However, strings are sequences of characters while lists store sequences of elements of any data type. There are some other key differences between lists and strings. First, note that different data structures are either mutable or immutable. Mutability refers to the ability to change the internal state of a data structure. Immutability is the reverse, where a data structure's or element's values can never be altered or updated. Lists and their contents are mutable, so their elements can be modified, added or removed. But strings are immutable. Think of a list like a long box with the space inside divided into different slots. Each slot contains a value and each value can store data. This could be another data structure such as another list, or an integer, string, float, or output from another function. When working with lists, you use an index to access each of the elements. Recall that an index provides the numbered position of each element in an ordered sequence. In this case, our sequence is a list. Let's go through an indexing example together. First, assign the following list of words to a variable x: now, we, are, cooking, with, seven, ingredients. In Python, we use square brackets to indicate where the list starts and ends and commas to separate each element contained in it. To print an element of a list, use its index number. So, to print the word cooking, print the element at index three of the list variable x. This is just like focusing on a specific character or substring in a string. The first index in a list, as with strings, is zero. So, if we print the element with slot number three, we get the item or word cooking from our list of seven words. Remember that indexing always starts at zero. So, if we had typed seven to try to access the last word of our list, we'd get an index error. We can also use indices to create a slice of the list. For this, use ranges of two numbers separated by a colon. To get the second and third words of our list, use one colon three. That gives us we and are. You can also use colon two to get all words up to index slot two. We get now and we, which have index slots zero and one. And to leave one of the range indexes empty, use two colon. This will give us the other part of the list.
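Here's a sketch of that list and those slices, assuming the words are stored exactly as described:

```python
x = ['now', 'we', 'are', 'cooking', 'with', 'seven', 'ingredients']

print(x[3])    # cooking
print(x[1:3])  # ['we', 'are']
print(x[:2])   # ['now', 'we']
print(x[2:])   # ['are', 'cooking', 'with', 'seven', 'ingredients']
```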
So, just as with string indexing, the first value, if left empty, defaults to zero. And the second value, if left empty, defaults to the length of the list. To check if a list of words contains a certain element, like the word this, use the keyword in to generate a Boolean statement. This verifies whether the word exists. The result of this check is a Boolean, which we can then use as a condition for branching or looping in the rest of the code. Lists are very useful when you're working with many related values. They enable you to keep the right data together, simplify your code, and perform the same operations on multiple values at once. Coming up, we have even more on lists. In this video, we'll continue with lists. You'll learn how to modify the contents of a list. This will give you greater control over your lists because you can add, remove, and change the items that they contain. Previously, we thought about a list as a box divided into different slots. Modifying it means we can keep the box, but we add, remove, or change what's inside. When thinking about modifying lists, there are a few methods that can be used. We'll begin with the append method. The append method adds an element to the end of a list. This requires one argument because this function adds the incoming element to the end of the list as a single new entry. You can even start with an empty list and all new elements will be added at the end. Let's explore an example. We'll begin by typing a list of fruits. Upon further inspection, it seems we forgot to add kiwi to the list. So we can use the append method to add it. This uses one parameter, in this case, the string kiwi. Another common method for modifying lists is insert, which requires two arguments: the index number of the element to be modified and the contents being put in that slot, such as a string or integer. Let's investigate how insert works. Insert is a function that takes an index as the first parameter and an element as the second parameter. It then inserts the element into the list at the given index. Returning to our list of fruits. Now, orange is inserted at the second spot, at index one, in our fruit list. Let's add an element at the beginning of the list. In the first parameter, we'll put zero. Then, type mango as the second parameter. To remove an element from a list, let's consider the remove method. Remove is a method that removes an element from a list. Similar to the append method, remove only requires one parameter. Now, our fruit list no longer has a banana. If we try to remove an element that is not in the list, like strawberries, for example, we get a value error. Another common way to remove elements is with the pop method, which uses an index. Pop extracts an element from a list by removing it at a given index. So, to remove orange, pop the third element in the list with index number two. Now, suppose that after removing orange, you decide to also remove pineapple and replace it with mango. Simply reassign its value. Reference the pineapple item's index number one and replace it with mango. This renders the list without orange, which we already removed, and with mango in place of pineapple. Our fruits have changed a lot since we started, but it's always the same list, the same box. We've just modified what's inside. At this point, I want to address something that new learners of Python often wonder about. You will recall that strings are immutable and lists are mutable. What this means exactly might not be clear at first.
After all, didn't we have multiple videos about how to manipulate strings? A new example will help make this more clear. Whenever we modify a string, we always have to reassign the change back to the variable that contains the string. This is overwriting the existing variable with a brand new one. Notice that we can't, say, overwrite the character at index zero. We get an error. However, we can do this with a list. That's why lists are considered mutable. Great work modifying lists. Now, after all that talk about tasty fruit, I think we deserve a snack break. I'll catch up with you again very soon. As a data professional, sometimes it will be more important to access and reference data than to change and manipulate it. When you simply need to find some information but keep the data intact, you can use a data structure called a tuple. A tuple is an immutable sequence that can contain elements of any data type. Tuples are kind of like lists, but they're more secure because they cannot be changed as easily. They're helpful because they keep data that needs to be processed together in the same structure. Tuples are instantiated, or expressed, with parentheses or the tuple function. Here, we have a tuple that represents someone's full name. Notice that it was instantiated using parentheses. The first element of the tuple is their first name. The second is the first letter of their middle name and the third is their last name. The position of the element is fixed in tuples. So you can't add new elements in the middle and you can't change any of the elements. If we try to change the last name, which lives in index number two, from Hopper to Copper, the code will throw an error. You can add a value to the end, but only if you reassign the tuple. Another way we can create tuples is by using the tuple function to transform input into tuples. In this case, our name is represented as a list. We can convert the list to a tuple by using the tuple function. Notice that the name is no longer a list, so it doesn't have the brackets anymore. Tuples are also used to return values in functions. In fact, when a function returns more than one value, it's actually returning a tuple. For example, here's a function that takes as an argument a float value that represents a price. The function returns the number of dollars and cents. Notice that when we use the function to convert $6.55 to dollars and cents, the return value is represented as a tuple that contains two numbers. Interestingly, even though tuples are immutable, they can be split into separate variables. When we run the dollars and cents function, we can directly assign the output to distinct variables. The information stored as a tuple in the result variable has now been reassigned to two separate variables that we can manipulate as we please. This process is known as unpacking a tuple. Notice that the unpacked variables themselves are no longer tuples. In this case, they're integers. A big advantage of working with tuples is that they let you store data of different types inside other data structures. Here's an example of how this might be useful. This is a list of the starting five players on a university women's basketball team. Each player is represented by a tuple that contains their name, age, and position. This is a useful way of working with this type of information. The order of the players doesn't matter that much, and we might want to add to or rearrange them. So we use a list, which is mutable.
However, the players themselves are individual records that are represented by tuples. They are a bit more secure because tuples are immutable and more difficult to accidentally change. Because lists and tuples are iterable, we can extract information from them using loops. For example, we can write a for loop that unpacks each tuple into three separate variables and then prints one of the variables for each iteration. This is equivalent to looping over each player record and printing the record at index zero. Using tuples in data professional work helps to make your processes more efficient. It saves memory and can really optimize your programs too. Plus, when others collaborate with you on your code, your use of tuples will make it clear to them that your sequences of values are not intended to be modified. This is yet another great way to save your team time and effort. Hello again. In this video, we'll consider more complex examples of different ways to work with loops, lists, and tuples. I'll also introduce a few new tools that are useful for data professionals. Let's return to the women's basketball team list of players from the previous video. In this example, we'll integrate string formatting, loops, tuples, and lists. Remember, we have a list of tuples where each tuple contains the name, age, and position of a player. Let's define a function that extracts the name and position of each player into a list that we'll use to print the information. We'll call the function player position, and its argument will be a list of tuples that contain player information. Next, we'll instantiate an empty list that will hold our result, which we'll build as we loop through the data. Now we'll use a for loop to unpack the tuples in our list of players. The variables we assign in the for loop must align with the format of the tuples. There are three components of each tuple in our list: name, age, and position. So we need three variables in our for loop. If we try to unpack the tuples using only two variables, like for name, age in players, the computer will throw an error. It wouldn't know what to do because there are three elements of the tuple, but we're only giving the computer two containers to put them in. So we'll begin our for loop: for name, age, position in players. Then we'll use string formatting to append each name and position to the result list. Each string will include some positional formatting and a new line too. Finally, we'll call this function in a for loop. This for loop will iterate over the results list that is output by the function and print each one. Now we have a nicely formatted, easily readable table of players and positions. Here's another application of loops and lists. This is an example of nested loops. A nested loop is when you have one loop inside another loop. These loops create all the different domino tiles in a set of dominoes, which is a tile-based game played with numbered gaming pieces. Feel free to pause the video and try to figure out what's happening. We start with the numbers on the left side of the dominoes. These numbers represent the dots, or pips, on the domino. They range from zero to six. For each number in this range, we'll run another loop to generate the pips on the right side of the domino. Then, we insert the left number and right number into a formatted print statement. And here are the dominoes. Notice that in this first print statement, we've included a parameter called end, whose value is a white space.
When a print statement executes, by default, it will end with a new line. So without this parameter, all the dominoes would have been printed in a vertical line, each one beneath the next. But when we set the end character to a white space, it prints a space between each domino instead. Here's the same code, but instead of printing the dominoes as strings, it stores each one as a tuple of integers in a list called dominoes. Now, suppose we want to check the second number of the tuple at index four. We can do that by using indexing. Start with the list we want to access, dominoes, and put the index of the tuple we want to access in brackets. Then, add another pair of brackets containing the index of the value within that tuple. What if we want to calculate the total number of pips on each domino? We can do that with a for loop that iterates over each tuple, sums the value at index zero and the value at index one, and appends the sum to a list. But there's a much easier way of doing this. It's called a list comprehension. A list comprehension formulaically creates a new list based on the values in an existing list. Here's how it works. Begin by assigning a variable for our new list. We'll call this one pips from list comp. For its value, open a pair of brackets. Now, we basically write a for loop, only in reverse. We begin with the calculation that creates each element of the list. In this case, we want each element to be the total number of pips on the domino, which is the domino at index zero plus the domino at index one. Then, we add a for statement that iterates over the dominoes list. We can check to make sure it gave the same result as our for loop. They're the same. Note what happened. This is why I said a list comprehension is like a for loop written in reverse. The for part of it is at the end of the statement, and the computation is at the beginning. Both the for loop and the list comprehension do the same thing, but the list comprehension is much more elegant and usually faster to execute too. Hopefully by now, you can appreciate how powerful the building blocks of Python can be. I encourage you to explore this notebook on your own and play with the code to discover what happens when you add something here or change something there. Playing with code is one of the best ways to learn. I'll see you soon. Dictionaries are one of the most widely used and important data structures in Python. A dictionary is a data structure that consists of a collection of key-value pairs. They are instantiated with braces or the dict function. We'll discuss that more shortly. Both veteran and entry-level data professionals use dictionaries to analyze large data sets with fast processing power. This helps them gather and transform user information. Dictionaries also provide a straightforward way to store data, making it easier for users to find specific information. To use a regular dictionary, not the data structure, but the actual book with words and definitions, you look up a word, find it, then read its definition. It's essentially the same with the Python dictionary. You look up a key, which will let you access the values associated with that key. That's what's meant by key-value pairs. Here's a simple example to illustrate the concept. Suppose we have a zoo, and the zoo has different pens that contain different animals. We could have a dictionary that stores this information for us, with the pen numbers as keys and the animals as values. We could use this dictionary to look up which animals are in each pen.
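As a sketch, such a dictionary might look like this; the pen labels and animals are made up for illustration, and the keys here are strings.

# Pen labels as keys, animals as values.
zoo = {
    'pen_1': 'lions',
    'pen_2': 'zebras',
    'pen_3': 'elephants',
}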
For example, if we want to know what animals are in pen 2, we type the name of the dictionary, zoo, followed by the pen in brackets. Accessing a dictionary this way always searches over the keys and returns the values of the corresponding key. It doesn't work the other way around. I can't use indexing to search for zebras and find out their pen. I will get a key error, because zebras is not a key in the dictionary. Dictionaries are instantiated mainly in two ways. The first way is with braces. With this approach, each key is separated from its value by a colon, and each key-value pair is separated from the next by a comma. The second way to create a dictionary is with the dict function. When using the dict function, the syntax is a little different. When the keys are strings, you can type them as keyword arguments. The last time we made this dictionary, we used quotation marks to indicate that the keys were strings. Here, we don't because they're keyword arguments. Also, instead of using a colon between the keys and their values, you use an equal sign. The dictionary lookup is the same, irrespective of whether braces or the dict function is used. The dict function is also a little more flexible in how it can be used. For example, we can create this same dictionary once again by passing a list of lists as an argument, or even a list of tuples, a tuple of tuples, or a tuple of lists. They all give us the same result. If we want to add a new key-value pair to an existing dictionary, say, to put crocodiles in pen 4, we can assign it like this. A dictionary's keys must be immutable. Immutable keys include, but are not limited to, integers, floats, tuples, and strings. Lists, sets, and other dictionaries are not included in this category since they are mutable. Another important thing to note about dictionaries is that they're unordered. That means you can't access them by referencing a positional index. If I try to access our zoo dictionary at index 2, I get a key error, because the computer is interpreting the 2 as a dictionary key, not as an index. Also, because dictionaries are unordered, you'll sometimes find that the order of the entries can change as you're working with them. If the order of your data is important, it's better to use an ordered data structure like a list. Finally, you can check if a key exists in your dictionary simply by using the in keyword. Note that this only works for keys. You can't check for values this way. There's a lot that we can do with dictionaries, and what we've reviewed here is only the beginning. Next, we'll consider more examples that show the power of dictionaries. You'll also learn about some tools that make working with dictionaries easy and convenient. Meet you there. Previously, you were introduced to dictionaries and learned a little bit about how they work. Let's continue our exploration of dictionaries and how to use them. Let's consider a previous example and revisit the women's basketball team roster. Recall that the roster was encoded as a list of tuples. Each tuple represented the name, age, and position of a player on the team. The list of tuples was useful when we had a single team and one player per position. What if we add more players beyond the starting five? A dictionary can help us organize the data according to our specific needs. For example, what if we want to be able to look up players by their position? We can create a dictionary where the positions are the keys and the players are the values.
Each represented as a tuple that contains two values, their name and age. We could retype the information into a dictionary or cut and paste it, but if you find yourself doing these things, you can take this as an opportunity to improve your coding skills. Consider that this is the information for just 10 players. What if we had data for a whole league? We can convert this information to a dictionary with a for loop and some conditional logic. We'll begin by instantiating an empty dictionary and assigning it to a variable named new team. The idea is to loop over each tuple in the list, extract the position and assign it as a dictionary key, and extract the player's name and age and assign them as a tuple within a list, representing the value for that key. The process would repeat for each iteration of the loop until all of the players are recorded in the dictionary. Notice that each position is only represented once as a key, so the final dictionary has five keys and there are two players in the list at each key's value. Now let's write the loop. First, we'll assign the empty dictionary to a variable called new team. Then we'll write a for loop that unpacks the information in the original tuples: for name, age, position in team. And here's where the conditional logic comes in. If the position already exists as a key in our dictionary, then we want to append the name and age tuple to the list of tuples. Remember, the value for each key will be a list that contains tuples of player information. If the position does not already exist as a key in the dictionary, we'll have to assign it. We'll use an else statement to do this. With only a few lines of code, we have converted our list of tuples to a dictionary. Let's check that it works. It sure does. Creating dictionaries this way is a common practice in data analytics with Python, so learning this process will help make you a more capable data professional. Now, let's learn about some useful methods that we can use on dictionaries to really take advantage of their power. Specifically, we'll discuss the keys, values, and items methods. If you run a loop over a dictionary, the loop will only access the keys, not the values. For example, if we loop over the dictionary we just created and print each iteration, the computer will return five positions, the keys of the dictionary. But you don't need to write a loop every time you want to access the keys of your dictionary. That's what the keys method is for. The keys method lets you retrieve only the dictionary's keys. Returning to our new team dictionary, when we apply the keys method to the dictionary, the computer returns a list of all its keys. Similarly, the values method lets you retrieve only the dictionary's values. When applied to our new team dictionary, we get the values returned as a list. Since our values are lists of tuples, it means the result of calling this method is a list of lists of tuples. But what if you want to access both the keys and their values? You can, using the items method, which lets you retrieve both the dictionary's keys and values. Let's use a loop to print what the items method returns so the output is prettier. Dictionaries make storing and retrieving data fast and efficient. Keep exploring the many things you can do with them. With time, you'll find that they become an important tool in your data analytics toolbox. Welcome back. In this video, we're going to discover sets.
A set is a data structure in Python that contains only unordered, unique elements. Sets are instantiated with the set function or non-empty braces. Each set element is unique and immutable. However, the set itself is mutable. Sets are valuable when storing mixed data in a single row or record in a data table. They're also frequently used when storing a lot of elements, and you want to be certain that each one is only present once. Because sets are mutable, they cannot be used as keys in a dictionary. There are two ways to create a set. The first way is with the set function. The set function takes an iterable as an argument and returns a new set object. Let's examine the behavior of sets by passing lists, tuples, and strings through the set function. To turn the list containing foo, bar, baz, and foo into a set, pass the list through the set function, and notice the list loses the second foo. As I've mentioned, each element must be unique in sets. Pass a tuple through the set function using two sets of parentheses, one to tell the computer that the data we are working with is a tuple, the other because the set function only takes a single argument. Again, the same result. Only one foo element can be present in the set. Finally, pass a string through the set function. It doesn't return the string, just the singular occurrence of the letters in the string, o and f, in an unordered way. This is because the set function accepts a single argument and that argument must be iterable. A string is iterable, so the set function splits it into individual characters and keeps only the unique ones. The second way to instantiate a set is with braces. However, you have to put something inside the braces. Otherwise, the computer will interpret your empty braces as a dictionary. Note here that instantiating a set with braces treats what's inside the braces as literals. So, when instantiating a set of only a single string using braces, it returns a set with a single element, and the element is the string itself. Remember, to define an empty set or a new set, it is best to use the set function. You can only use curly braces when the set is not empty and you are assigning the set to a variable. Also, keep in mind that because sets are unordered, a set cannot be indexed or sliced. Now, let's discuss some additional operations you can use on sets. First, intersection finds the elements that two sets have in common. Union finds all the elements from both sets. Difference finds the elements present in one set, but not the other. And symmetric difference finds elements from both sets that are mutually not present in the other. Python provides built-in methods for performing each of these operations. Let's start with the intersection method, which can also be written with the ampersand operator. First, define two sets. Then, apply the intersection function either by attaching the intersection method to set one and passing set two as the method's argument, or by using the ampersand operator. Great. Now, let's apply the union function. Use braces to define the two sets, X1 and X2. The goal is to observe where they overlap. Print them using the union method on the X2 variable or the union operator, a vertical bar. Union is a commutative operation in mathematics, so the result will be the same no matter what order you put your variables in. The difference operation on sets, however, is not a commutative operation. Just like in math, if you subtract 5 from 7, you get a different result than if you subtract 7 from 5.
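Here is a quick sketch of these set operations; the element values are made up for illustration.

x1 = {'foo', 'bar', 'baz'}
x2 = {'baz', 'qux', 'quux'}

x1.intersection(x2)   # elements the two sets have in common: {'baz'}
x1 & x2               # the ampersand operator does the same thing

x1.union(x2)          # all elements from both sets
x1 | x2               # the vertical bar operator does the same thing

x1.difference(x2)     # elements in x1 that are not in x2: {'foo', 'bar'}
x1 - x2               # the minus sign does the same thing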
You can use either the difference method or the minus sign as a set operator. Subtracting set two from set one gives us only the elements in set one that are not shared with set two. But we don't know if set two contains any elements that are not shared with set one. The inverse is also true. Subtracting set one from set two gives us only the elements in set two that are not shared with set one. But we don't know if set one contains any elements that are not shared with set two. To get around this and observe the difference between two sets mutually, use the symmetric difference function. As you might have guessed, you can use the symmetric difference method or the symmetric difference operator, expressed by a caret. Symmetric difference outputs all the elements that the two sets do not share. Excellent work with sets in Python. You've learned so much in this section of the course already, and everything you're learning is preparing you for a really rewarding career working with data. Can't wait to be with you soon again. Python has many advanced calculation capabilities that are used for data work and other scientific applications. These features make it possible to extend, enhance, and reuse parts of the code. To access these features, you can import them from libraries, packages, and modules. These features are not included in basic Python, so it's necessary to add them to your scripts. The additional functionality can save you time constructing functions and objects in your own work. Using these features can also help you obtain extra data types for analyzing data or building machine learning models. Let's start with libraries. A library, or package, broadly refers to a reusable collection of code. It also contains related modules and documentation. You'll often encounter the terms library and package used interchangeably. Commonly used libraries for data work are Matplotlib and Seaborn. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Seaborn is a data visualization library that's based on Matplotlib. It provides a simpler interface for working with common plots and graphs. This certificate program integrates two other commonly used libraries for data work, NumPy and pandas. NumPy, or Numerical Python, is an essential library that contains multi-dimensional array and matrix data structures and functions to manipulate them. This library is used for scientific computation. And pandas, which stands for Python Data Analysis, is a powerful library built on top of NumPy that's used to manipulate and analyze tabular data. There are many other popular Python libraries and packages for data professional work, such as scikit-learn, statsmodels, and others. Scikit-learn, a library, and statsmodels, a package, consist of functions data professionals can use to test the performance of statistical models. They're used across various scientific fields. Scikit-learn and statsmodels are pretty advanced, so you won't be working with them in this course, but you'll have opportunities to work with these libraries elsewhere in the program. Again, different practitioners across the field often conflate libraries and packages, so you may hear them referred to as one, the other, or both. Libraries and packages provide sets of modules that are essential for data professionals. Modules are accessed from within a package or a library. They are Python files that contain collections of functions and global variables.
Global variables differ from other variables because these variables can be accessed from anywhere in a program or script. Modules are used to organize functions, classes, and other data in a structured way. Internally, modules are set up through separate files that contain these necessary classes and functions. When you import a module, you are using pre-written code components. Each module is an executable file that can be added to your scripts. Commonly used modules for data professional work are math and random. Math provides access to mathematical functions, and random is used to generate random numbers. This is useful when selecting random elements from a list, shuffling elements randomly, or working with random sampling, which you'll explore in a later course. There are several ways to import modules, depending on whether you want to use the whole package or just a single predefined function or feature. This adds functionality for carrying out specialized operations. There's lots more to learn about libraries, packages, and modules, so feel free to refer to the course resources for more information on installing these features and to continue growing your Python knowledge. But as a reminder, you don't have to install anything, because everything you need to complete the different sections of the certificate program is already built into the notebooks you'll be using in Coursera. I'll introduce you to some libraries in the next video. You've learned that part of what makes Python such a powerful and dynamic language is the packages and libraries that are available to it. One of the most widely used and important of these is NumPy. Recall that NumPy contains multi-dimensional array and matrix data structures, as well as functions to manipulate them. You'll learn more about these data structures and functions soon, but for now, let's just consider NumPy at a high level and learn more about it. NumPy's power comes from vectorization. Vectorization enables operations to be performed on multiple components of a data object at the same time. This is particularly useful for data professionals, because their jobs often involve working with very large quantities of data, and vectorized code saves a lot of time, because it computes more efficiently. Let's explore this a little more. Suppose we have list A and list B, both the same length, and we want to create a new list, C, that's the element-wise product of each list. In other words, we want to multiply the first element of list A by the first element of list B, then multiply the second element of list A by the second element of list B, and so on. If we try to multiply lists A and B together, the computer throws an error. To perform this operation, we could write a for loop. We'd start by defining an empty list, list C, then make a range of indices to loop over, and append the product of list A and list B at each index to list C. This gets the job done, but if you're thinking there's got to be an easier way, you're right. We can use NumPy to perform this operation as a vectorized computation. Simply convert each list to a NumPy array, and multiply the two arrays together using the multiplication operator. The results are the same, but the vectorized approach is simpler, easier to read, and faster to execute, because, while loops iterate over one element at a time, vector operations compute simultaneously in a single statement. The efficiency of this might not be noticeable now, but when working with large datasets it will be.
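Here is a minimal sketch of the comparison being described, with made-up values.

import numpy as np

list_a = [1, 2, 3, 4]
list_b = [5, 6, 7, 8]

# Loop-based approach: append the element-wise product at each index.
list_c = []
for i in range(len(list_a)):
    list_c.append(list_a[i] * list_b[i])

# Vectorized approach: convert the lists to arrays and multiply them directly.
array_a = np.array(list_a)
array_b = np.array(list_b)
array_c = array_a * array_b   # element-wise product in a single statement

print(list_c)    # [5, 12, 21, 32]
print(array_c)   # [ 5 12 21 32]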
Vectors also take up less memory space, which is another factor that becomes important when working with a lot of data. You might have noticed that when we use NumPy, we first had to import it. This is called an import statement. An import statement uses the import keyword to load an external library, package, module, or function into your computing environment. Once you import something into your notebook and run that cell, you don't need to import it again unless you restart your notebook. When we import NumPy, we import it as np. This is known as aliasing. Aliasing lets you assign an alternate name, or alias, by which you can refer to something. In this case, we're abbreviating numpy to np. Notice the np's in the code below the import statement where we create our arrays. If we didn't give NumPy an alias of np, we'd have to type out numpy here in order to access its array function. Aliasing as np makes the code shorter and easier to read. Note that np is the standard alias. If you use something else, other people might get confused when reading your code. In addition to being highly useful in its own right, NumPy powers a lot of other Python libraries, like pandas. So understanding how NumPy works will help you use these other packages. There's a lot more to discover about NumPy. Coming up, you'll learn about its core data structures and functionalities. See you soon. Welcome back. You recently learned that the NumPy library uses vectorization to make working with data faster and more efficient, which makes your job easier. I demonstrated how NumPy performed an element-wise multiplication of two lists by converting the lists to arrays and then simply multiplying them together. Now we're going to continue learning about arrays and how to work with them. The array is the core data structure of NumPy. The data object itself is known as an n-dimensional array, or ndarray for short. The ndarray is a vector. Recall that vectors enable many operations to be performed together when the code is executed, resulting in faster run times that require less computer memory. You can create an ndarray from a Python object by passing the object to the array function. ndarrays are mutable, so you can change the values they contain. If I want to change this last value from a 4 to a 5, I can do that by identifying the index number. Since we're dealing with the last value here, we can use index negative one. But I can't change the size of the array without reassigning it. If I try to add a number to the end of this array, the computer throws an error. So if you want to change the size of an array, you have to reassign it. Another requirement of the array is that all of its elements be of the same data type. If I create an array with the integers 1 and 2 and then a string of coconut, NumPy will create an array that forces everything to the same data type, if possible. In this case, everything becomes a string, represented here by U21, a Unicode string type of up to 21 characters. So be careful when creating your arrays that they all contain data of the same type, or if they don't, that this is intentional and useful to what you're doing. You previously learned that calling the type function on an object will return the data type of the object. If we do that with an array, as you might expect, we get a NumPy array. We can use the dtype attribute if we want to check the data type of the contents of an array. Here, the dtype attribute indicates that this array consists of integers.
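As a brief sketch of what was just demonstrated (the values are illustrative):

import numpy as np

arr = np.array([1, 2, 3, 4])
arr[-1] = 5          # arrays are mutable: overwrite the last value
print(type(arr))     # <class 'numpy.ndarray'>
print(arr.dtype)     # an integer dtype, such as int64

# Mixing types forces everything to a single type, here a Unicode string dtype.
mixed = np.array([1, 2, 'coconut'])
print(mixed.dtype)   # <U21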
As the name implies, nd arrays can be multidimensional. For a one-dimensional array, NumPy takes an array-like object of length x, like a list, and creates an nd array in the shape of x. A one-dimensional array is neither a row nor a column. We can use the shape attribute to confirm the shape of an array. We can also use the ndim attribute to confirm the number of dimensions the array has. Data professionals will often need to confirm the shape and number of dimensions of their array. For example, if they're trying to attach it to another existing array. These methods are also commonly used to help understand what's going wrong when your code throws an error. A two-dimensional array can be created from a list of lists, where each internal list is the same length. You can think of these internal lists as individual rows, so the final array is like a table. Notice that this array has a shape of four rows by two columns and is two dimensions. If a two-dimensional array is a list of lists, then a three-dimensional array is a list that contains two of these, so a list of two lists of lists. This array can be thought of as two tables, each with two rows and three columns. Thus, it has three dimensions. This can go on indefinitely. Thankfully, there are ways to help simplify working with multi-dimensional arrays, which you'll learn about later. And unless you're doing very advanced scientific computations, you typically won't be working directly with NumPy arrays that are more than three dimensions. NumPy also lets us reshape an array, using the reshape method. Our two-dimensional array was four rows and two columns. But what if we wanted this data to be two rows by four columns? We just plug these values into the reshape method and reassign the result back to the array 2D variable. Reshaping data is a common task in data analysis, so it's important for you to be familiar with what it means and how it works. There are many other operations that can be performed with arrays, and you'll surely learn those as the need for them arises in your projects. But there are other helpful functions and methods in NumPy that you'll use regularly. These include functions to calculate things like mean and natural logarithm and floor and ceiling operations, which round numbers to the nearest, lesser and greater whole number, respectively. And many other frequently used mathematical and statistical operations. NumPy is very robust. There are so many things you can do with it that we can only briefly consider them here. As you know, NumPy powers many other useful libraries and packages. In this certificate program, we won't be working a lot directly with NumPy, but we will be working a lot with the library that depends on it, pandas. It's important that you understand the basics of NumPy, because it will help you when you start working with pandas. As you develop your skills as a data professional, you'll find yourself returning to NumPy time and time again, because it's such an integral part of advanced data analysis. Bye for now. Hi there. We've previously discussed NumPy and how it's an important tool for data professionals and anyone else whose job requires high-performance computational power. We've investigated how other libraries and packages use NumPy because of the efficiencies that come with vectorization. One of these libraries is pandas, a quintessential tool both in this certificate program and in the world of data analytics. 
In this lesson, you're going to learn more about pandas and why it's so useful. Because pandas is a library that adds functionality to Python's core toolset, you have to import it. Just as we imported NumPy as np, pandas has its own standard alias of pd. Typically, when using pandas, you import both NumPy and pandas together. This is just for convenience, given that NumPy is often used in conjunction with pandas. Strictly speaking, you don't have to import NumPy to work in pandas. Pandas is fully operational on its own. Pandas' key functionality is the manipulation and analysis of tabular data, that is, data that's in the form of a table with rows and columns. A spreadsheet is a common example of tabular data. While NumPy is capable of many of the same functions and operations as pandas, it's not always as easy to work with, because it requires you to work more abstractly with the data and keep track of what's being done to it even if you can't see it. Pandas, on the other hand, provides a simple interface that allows you to display your data as rows and columns. This means that you can always follow exactly what's happening to your data as you manipulate it. In this video, I'll give you a demonstration of pandas and what it's like to use it. Later, we'll go into greater detail on its unique classes, processes, and functions. First of all, you can load data into pandas easily from different formats like comma-separated values files, or CSVs, Excel and other spreadsheets, databases, and more. Here, I'm loading a CSV file that I'm accessing via a web URL. The file contains information for some of the passengers from the Titanic, including their names, what class ticket they had, their age, ticket price, and cabin number. By the way, this table of data is called a data frame. The data frame is a core data structure in pandas. Notice that the data frame is made up of rows and columns, and it can contain data of many different data types, including integers, floats, strings, Booleans, and more. If I want to calculate the average age of the passengers, I do so by selecting the age column and calling the mean method on it. I can also get the max, min, and standard deviation with minimal effort. I can also quickly check how many passengers were in each class. Checking summary statistics of the entire dataset only requires one line of code. This method gives me the number of rows, as well as the mean, standard deviation, minimum, and maximum values, along with the quartiles for every numeric column. These concepts are all covered in greater depth elsewhere in the program. I just want you to pay close attention to the power of pandas and all that you can accomplish with it. Pandas also allows me to filter based on simple or complex logic. For example, here I'm selecting only the third-class passengers who were older than 60. In addition to all of these data analysis tools, pandas also gives us ways to manipulate and change the data. For example, I can add a column that represents the inflation-adjusted price of a ticket from 1912 to 2023. Florence Briggs Thayer paid 71.28 pounds for her first-class ticket. Today, that ticket would have cost her 10,417 pounds sterling. If you're wondering how I knew her name was Florence Briggs Thayer, it's because I can also select rows, columns, or individual cells from the data using indexing. Her name is in row 1, column 3. I can also do more complex data groupings and aggregations.
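Here is a rough sketch of the kinds of calls shown in this demonstration; the file path is a placeholder and the column names are assumptions about the dataset.

import pandas as pd

# Load the passenger data (placeholder path standing in for the CSV used in the video).
titanic = pd.read_csv('titanic.csv')

titanic['Age'].mean()              # average passenger age
titanic['Fare'].max()              # most expensive ticket
titanic['Pclass'].value_counts()   # how many passengers were in each class
titanic.describe()                 # summary statistics for every numeric column

# Filtering with logic: third-class passengers older than 60.
titanic[(titanic['Pclass'] == 3) & (titanic['Age'] > 60)]

# Grouping and aggregating: mean fare by class and sex.
titanic.groupby(['Pclass', 'Sex'])['Fare'].mean()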
For example, here I'm grouping the passengers by class and sex and then calculating the mean cost of a ticket for each group. Hopefully you're excited to start working with pandas. I know I'm looking forward to guiding you as you learn more about this powerful and fun data analysis tool. Now that you have a good understanding of the core structures and routines of Python and some of the basics of NumPy, you're ready to start working with pandas. Pandas is one of the primary tools that you'll use throughout the rest of the certificate program, as well as in a large and growing number of data professions. In this video, you'll learn about the main classes in the pandas library and some important ways to work with them. Pandas has two core object classes, data frames and series. Let's begin with a review of data frames. A data frame is a two-dimensional, labeled data structure with rows and columns. You can think of a data frame like a spreadsheet or a SQL table. It can contain many different kinds of data. Data professionals use data frames to structure, manipulate, and analyze data in pandas, just like we did in the previous video with the Titanic example. We can create a data frame using the pandas DataFrame function. This function has a lot of flexibility and can convert numerous data formats to a data frame object. In this example, we created a data frame from a dictionary where each key of the dictionary represents a column name and the values for that key are in a list. Each element in the list represents a value for a different row at that column. We can also create one from a NumPy array resembling a list of lists, where each sub-list represents a row of the table. Notice that in this example we included separate keyword arguments for columns and index. This approach lets us name the columns and rows of the data frame. These are just a couple of the many different ways to create a data frame with the DataFrame function. For examples of some others, be sure to review the available pandas documentation on this topic. Often, data professionals need to be able to create a data frame from existing data that's not written in Python syntax. For example, maybe we want to take an existing spreadsheet and manipulate it in pandas. Spreadsheets can be saved as CSV files, which can then be read into pandas as a data frame. CSV stands for comma-separated values, and it refers to a plain text file that uses commas to separate distinct values from one another. Here is a sample of the first few lines of the source data from the Titanic data set that we used previously. This is what a CSV file looks like. In this file, you'll find values for passenger name, age, sex, fare, and more. Notice that a comma is used to separate each value from the next. To create a data frame from a CSV file, pandas has the read_csv function. Here's the same Titanic data rendered as a data frame. For the sake of an example, it's defined here as df3. The read_csv function can read files from a URL, like in this example, and it can also read files directly from your hard drive. Instead of a URL, you just provide the file path to your file. Now, let's discuss the other main class in pandas, series. A series is a one-dimensional labeled array. Series objects are most often used to represent individual columns or rows of a data frame. So, if we select a row or a column from this Titanic data frame and call the type function on it, it will return as a pandas series object.
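Here is a minimal sketch of these different ways to build data frames; the column names and values are made up, and the file path is a placeholder.

import numpy as np
import pandas as pd

# From a dictionary: each key becomes a column name, each list holds that column's values.
df1 = pd.DataFrame({'col_a': [1, 2, 3], 'col_b': ['x', 'y', 'z']})

# From a NumPy array resembling a list of lists, naming the columns and rows explicitly.
df2 = pd.DataFrame(np.array([[1, 2], [3, 4]]),
                   columns=['col_a', 'col_b'],
                   index=['row_1', 'row_2'])

# From a CSV file.
df3 = pd.read_csv('titanic.csv')

# Selecting a single column returns a Series object.
type(df3['Age'])   # <class 'pandas.core.series.Series'>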
Like data frames, individual series can be created from various data objects, including from NumPy arrays, dictionaries, and even scalars. Again, refer to the pandas documentation for examples. Now, let's use the Titanic data set to review some of the basics of working with data frames and series. The data frame and series classes have many super useful methods and attributes that make common tasks easier. Remember, a method is a function that belongs to a class. It performs an action on the object. An attribute is a value associated with a class instance. It typically represents a characteristic of the instance. Both methods and attributes are accessed using dot notation, but methods use parentheses, while attributes do not. Earlier in the video, we named the Titanic data set df3, but let's change the name to titanic for clarity. We can do this by simple reassignment. If we want to access the columns of the data frame, we can use the columns attribute. This returns an index of all of the column names. We can use the shape attribute to check the number of rows and columns contained in the data frame. This data frame has 891 rows and 12 columns. And we can get some summary information about the data frame by calling the info method. This tells us that there are 891 rows and 12 columns, and it also gives us the column names, the data type contained in each column, the number of non-null values in each column, and the amount of memory the data frame uses. By the way, I want to address a couple of points about terminology in pandas. First, null values in pandas are represented by NaN, which stands for not a number. And second, if a series object contains mixed or string data types, when you check its data type, it will come back as an object. This is an example of how pandas is built on NumPy, but the details of this are beyond the scope of this video. One of the most common tasks when working in pandas is selecting or referencing parts of a data frame. This has many similarities with indexing and slicing. For example, if you want to select a single column, you can type the name of the data frame, followed by brackets, and within the brackets enter the name of the column as a string. This returns a series object of that column. You can also use dot notation, but this only works if the column name does not contain any white spaces. Using dot notation is faster to type, so for very simple lines of code, you may prefer to do this, but if the code begins to get more complex, it's generally better to use bracket notation because it makes the code easier to read. To select multiple columns of a data frame by name, use bracket notation. Within the brackets, enter a list of column names. This returns a view of your data frame as a new data frame object. If you want to select rows or columns by index, you'll need to use iloc. iloc is a way to indicate in pandas that you want to select by integer location-based indexing. If you enter a single integer into the iloc brackets, you'll get a series object representing a single row of your data frame at that index. Because I entered zero here, I got the very first row in my data frame as a series. If you enter a list of a single integer in the iloc brackets, you'll get a data frame object of a single row of the data frame at that index. You can access a range of rows by entering the indices of the beginning and ending rows separated by a colon. Pandas will return every index starting with the beginning index up to but not including the last index.
So, 0 colon 3 returns row indices 0, 1, and 2. You can select subsets of rows and columns together too. This returns a data frame view of rows 0, 1, and 2 at columns 3 and 4 only. So, if you want a single column in its entirety, you select all rows and then enter the index of the column you want. And you can even get a single value at a particular row in a particular column by using two indices separated by a comma. loc is similar to iloc, but instead of selecting by index location, loc is used to select pandas rows and columns by name. Let's investigate loc with the Titanic data frame. In this example, I'm selecting rows 1, 2, and 3 at just the name column. Note that in this example, we're referring to the rows with numbers even though we're using loc to select. This is because our rows are indexed by number. If we had a named index, however, we'd have to use row names, like what we're doing for columns. And one more thing: if you want to add a new column to a data frame, you can do that with a simple assignment statement. Now, we have a new column at the end here. There are so many things you can do in pandas, and I can only share so much in the time we have. In this case, the documentation is your friend. There will inevitably be times where you need to do something that wasn't explicitly covered here. In those cases, the documentation almost always has simple examples that demonstrate how to do the thing you need to do. There's still more to come though, so I'll meet you soon. Previously, we investigated several different ways of selecting data in a data frame, like column selection, row selection, and selection of combinations of rows and columns using name-based and integer-based indexing. In this video, you'll learn how to filter the data in a data frame based on value-based conditions. You know that Boolean is used to describe any binary variable with two possible values, true or false. With pandas, you can use a powerful technique known as Boolean masking. Boolean masking is a filtering technique that overlays a Boolean grid onto a data frame in order to select only the values in the data frame that align with the true values of the grid. Data professionals use Boolean masking all the time, so it's important that you understand how it works. Here's an example. Suppose you have a data frame of planets, their radii, and the number of moons each planet has. Now suppose that you want to keep the rows of any planets that have fewer than 20 moons and filter out the rest. A Boolean mask is a pandas series object indicating whether this condition is true or false for each value in the moons column. The data contained in this series is of type bool. Boolean masking effectively overlays this Boolean series onto the data frame's index. The result is that any rows in the data frame that are indicated as true in the Boolean mask remain in the data frame, and any rows that are indicated as false get filtered out. Here's how to perform this operation in pandas. We'll begin by creating a data frame from a pre-defined dictionary using the pandas DataFrame function. The data frame is called planets. The next step is to create the Boolean mask by writing a logical statement. Remember, the objective is to keep planets that have fewer than 20 moons and to filter out the rest, so we define the mask by writing planets at the moons column is less than 20.
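Here is a sketch of that operation; the planet values are approximate and only for illustration.

import pandas as pd

planets = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter'],
    'moons':  [0, 0, 1, 2, 80],
})

# The Boolean mask: True where the condition holds, False where it doesn't.
mask = planets['moons'] < 20

# Applying the mask keeps only the rows where the mask is True.
planets[mask]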
This results in a series object that consists of the row indices, where each index contains a true or false value depending on whether that row satisfies the given condition. This is the Boolean mask. To apply this mask to the data frame, insert it into selector brackets and apply it to the data frame. It's also possible to apply the conditional logic directly to the data frame, skipping the part where we assign it to a variable named mask, but breaking out the steps individually can make the code easier to follow. Note that we haven't permanently modified the data frame. Applying the Boolean mask using the conditional logic only gives a filtered view of it, so when you call the planets variable again, it returns the full data frame. However, we can assign the result to a named variable. This may be useful if you'll need to reference the list of planets with moons under 20 again later. Sometimes you'll need to filter data based on multiple conditions. Pandas uses logical operators to indicate which data to keep and which to filter out in statements that use multiple conditions. These operators are the ampersand for and, the vertical bar for or, and the tilde for not. Let's review how this works. Here's how to create a Boolean mask that selects all planets that have fewer than 10 moons or more than 50 moons. Notice that each condition is self-contained in a set of parentheses, and the two conditions are separated by a vertical bar, which is the logical operator that represents or. It's very important that each component of the logical statement be in parentheses. Otherwise, the statement will throw an error, or worse, return something that isn't what you intended. To apply the mask, call the data frame and put the statement, or the variable it's assigned to, in selector brackets. Here's an example of how to select all planets that have more than 20 moons, but not planets with 80 moons and not planets with a radius less than 50,000 kilometers. Let's break it into pieces. First is the statement for planets. In parentheses, we put planets at the moons column must be greater than 20 and close the parentheses. Then we use the and-not operators before parentheses: planets at the moons column equals 80, close parentheses. Then we again use the and-not operators before parentheses: planets at the radius column less than 50,000, close parentheses. When we apply the mask to the data frame, we're left with just one planet, Saturn, with 83 moons and a radius of 58,232 kilometers. There are a near-infinite number of ways to select and filter data using the basic tools that you've learned so far. As always, it takes a lot of practice before you know exactly how to execute every selection statement that your work requires. So make sure to bookmark any and all resources that you find helpful to reference. Keep up the good work and I'll meet you in the next video. Now that you've learned how to select and filter data using name- and location-based indexing, as well as Boolean masking, it's time to take the next step. In this video, you'll learn how to group your data, aggregate it, and perform calculations on these groupings to help you discover what the data is telling you. One of the most important and commonly used tools to group data in pandas is the groupby method. Groupby is a pandas data frame method that groups rows of the data frame together based on their values at one or more columns, which allows further analysis of the groups.
Let's return to our planets data set to demonstrate some different ways to use the groupby method. This time we'll add a little more data, including the type of planet, whether or not it has a ring system, its average temperature in degrees Celsius, and whether or not it has a global magnetic field. As always, when learning a new data tool, it's helpful to begin by applying it to a small example. This will better enable you to understand exactly what's happening. First, let's examine the mechanics of what happens when you use groupby. When you call the groupby method on a data frame, it creates a groupby object. If you do nothing else, the groupby object isn't very helpful. You'll basically get a statement saying, here's your object, it's stored at this address in the computer's memory. But once you have this object, there are all kinds of things you can do with it. For example, if we group the data frame by the type column and then apply the sum method to the groupby object, the computer returns a data frame object with three rows, one for each planet type, and three columns, one for each numerical column. Only the numerical columns are returned because the sum method only works on numerical data. The type column is the index of this data frame. This information can be interpreted as the sum of all the values in each group at these respective columns. So, for example, the radii of all the gas giant planets sum to 128,143 kilometers. That information probably isn't very useful in this case, but the total number of moons could definitely be something we want to calculate. If you want to isolate the information at particular columns, just insert the columns as a list in selector brackets following the groupby statement. You can also use other methods instead of sum, for example, min, max, mean, median, count, and many others. Groupby will work on multiple columns too. When we pass a list containing the type and magnetic field columns to the groupby method and then apply the mean method to the result, we get a data frame that contains a row for each unique combination of planet type and magnetic field. Again, the columns contain the mean calculated values for each numerical column for each group. Groupby is very useful because it helps you to better understand your data. It's also useful to organize data that you want to plot on a graph. You'll learn more about this later. Another important method to use on groupby objects is the agg method. Agg is short for aggregate. The agg method allows you to apply multiple calculations to groups of data. Let's start simple. What if we want to group the planets by their type and then calculate both the mean and the median values of the numeric columns for each group? We call the agg method after the groupby statement. In its argument field, we enter a list of the calculations we want to apply to the data. If these calculations are existing methods of groupby objects, they can just be entered as strings. We can group by multiple columns and apply multiple aggregation functions to each group. For instance, we can group the planets by type and whether or not they have a magnetic field, and then use the agg method to calculate the mean and max values of each group. And we can even define our own functions and apply them. For example, suppose we want to calculate the 90th percentile of each group. We can define a function called percentile 90 that uses the quantile method on the array and returns the value at the 90th percentile.
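Here is a sketch of these grouping and aggregation calls. The column names and values are assumptions based on the description, and the ring and temperature columns are omitted for brevity.

import pandas as pd

planets = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars',
               'Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [2440, 6052, 6371, 3390, 69911, 58232, 25362, 24622],
    'moons': [0, 0, 1, 2, 80, 83, 27, 14],
    'type': ['terrestrial'] * 4 + ['gas giant'] * 2 + ['ice giant'] * 2,
    'magnetic_field': [True, False, True, False, True, True, True, True],
})

# Group by planet type and sum the numeric columns for each group
# (recent pandas versions need numeric_only=True to skip the text columns).
planets.groupby('type').sum(numeric_only=True)

# Isolate particular columns after grouping.
planets.groupby('type')[['moons']].sum()

# Group by two columns and apply several calculations at once with agg;
# built-in calculations can be passed as strings.
planets.groupby(['type', 'magnetic_field'])[['radius_km', 'moons']].agg(['mean', 'max'])

# A custom function is passed as an object rather than a string.
def percentile_90(x):
    return x.quantile(0.9)

planets.groupby(['type', 'magnetic_field'])[['radius_km', 'moons']].agg(['mean', percentile_90])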
Then we can call this custom function in our aggregation. Notice that we can enter mean as a string because it's an existing method of groupby objects, but we type the percentile 90 function as an object because it's custom defined. Groupby and aggregate are two tools that together can give deep insight into the story that your data is telling you. The types of calculations that we just viewed are daily tasks of data professionals in nearly every field. Even though we only applied them to a very tiny dataset, the same exact operations would work on a dataset of every planet in the galaxy, if we knew them all and if we had enough computing power to perform the aggregations. There's a lot more we can do with groupby and aggregate, and as always, I encourage you to explore more on your own, but you should now have a solid understanding of how and when to apply these tools. Hello again. You've learned a lot about pandas: that it's a powerful library that makes working with tabular data easier and more efficient, how to select and index data in a data frame, how to filter data using Boolean masks, and how to group and aggregate data to derive insights from the story it's telling. In this video, you'll learn how to add new data to existing data frames. This is a common task for data professionals, but it's not as simple as just adding two data frames together. There are important considerations to be aware of. By the end of this lesson, you'll have a good understanding of what these considerations are so you can make informed decisions about how best to add data to your project. We're going to learn about two pandas functions, concat and merge. There's considerable overlap between the capabilities of these functions, but it's most important that you learn the basics of each because you will encounter them regularly as a data professional. We'll start with the concat function. Recall that to concatenate means to link or join together. The pandas concat function combines data either by adding it horizontally as new columns for existing rows or vertically as new rows for existing columns. It's also capable of handling many data-specific complexities that arise, which allows for a high degree of user control. In this video, I'll demonstrate how to use the concat function to add new rows to existing columns. But remember, there's plenty of support documentation if you'd like additional information. Pandas has a specific way to indicate which way we want the data to be concatenated. We do this by referring to axes. In fact, many pandas and NumPy functions have an axis keyword, so you can specify whether you want to apply the function across rows or down columns. The two axes of a data frame are 0, which runs vertically over rows, and 1, which runs horizontally across columns. We'll use our basic planets data set to demonstrate how concat works. This data has four planets, their radii, and the number of moons, but it's missing the data for Jupiter, Saturn, Uranus, and Neptune. Now, suppose we want to add this data, which exists as a separate data frame. Let's examine this second data set with information about Jupiter, Saturn, Uranus, and Neptune before joining them. Notice that this data is in the same format as the data in the df1 data frame. It has the same columns for planet, radius, and moons. To combine the two data frames, we'll want to add df2 as new rows below df1.
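Sketching this out, the two data frames and the concatenation might look like the following; the radius and moon counts are approximate values used for illustration.

import pandas as pd

df1 = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
    'radius_km': [2440, 6052, 6371, 3390],
    'moons': [0, 0, 1, 2],
})

df2 = pd.DataFrame({
    'planet': ['Jupiter', 'Saturn', 'Uranus', 'Neptune'],
    'radius_km': [69911, 58232, 25362, 24622],
    'moons': [80, 83, 27, 14],
})

# Stack df2 below df1 along axis 0 (the vertical axis), then renumber the rows.
df3 = pd.concat([df1, df2], axis=0).reset_index(drop=True)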
To concatenate the first data set, with information about Mercury, Venus, Earth, and Mars, with the second, which has information about Jupiter, Saturn, Uranus, and Neptune, we call pd.concat and insert a list of the data frames we want to concatenate. Then, we need to include an axis keyword argument. This instructs the function to combine the data either side by side or one on top of the other. We want our resulting data frame to have eight rows and three columns, which means we want to combine the data vertically. In other words, we want to add new data by extending axis 0, the vertical axis. Perfect! The data was added as new rows. Notice that each row retains its index number from its original data frame. If you want the numbering to restart, just reset the index. We can include the drop equals true argument because otherwise a new index column will be added to the data frame, which we don't want in this case. Now, the enumeration of the row indices goes from zero to seven. The concat function is great for when you have data frames containing identically formatted data that simply needs to be combined vertically. If you want to add data horizontally, consider the merge function. The merge function is a pandas function that joins two data frames together. It only combines data by extending along axis 1, horizontally. Let's return to the planets. Now, we have the radius and number of moons for all eight planets. But suppose we want to add the data for the planet type, whether it has rings, its average temperature, whether it has a magnetic field, and whether it has life on it. Perhaps this data exists as a separate data frame, but it's missing Mercury and Venus, and it has some recently discovered planets from other star systems, Janssen and Tadmore. That's okay. We can still work with this. First, let's conceptualize how data joins work. For two data sets to connect, they need to share a common point of reference. In other words, both data sets must have some aspect of them that is the same in each one. These are known as keys. Keys are the shared points of reference between different data frames, what to match on. In our case, the keys are the planets. Each data frame contains planets for us to match on. Now let's consider the different ways that we can join this data. We can join it so only the keys that are in both data frames get included in the merge. This is known as an inner join. Alternatively, we can join the data so all of the keys from both data frames get included in the merge. This is known as an outer join. We can also join the data so all of the keys in the left data frame are included, even if they aren't in the right data frame. This is called a left join. Finally, we can join the data so all the keys in the right data frame are included, even if they aren't in the left data frame. This is called a right join. Let's examine how each type of join affects our planet data. We can use the merge function and enter df3 and df4 as the left and right positional arguments, respectively. Then we include the keyword argument on, which lets us specify what our keys to match on should be. In this case, we want to use the planet column. Next is the how keyword argument. This is where we enter the kind of join we want. Let's try inner first. We only kept the planets that appeared in both data frames. This means we're missing data for Mercury and Venus from the left data frame, as well as for Janssen and Tadmore from the right data frame. Now, let's try an outer join.
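As a rough sketch, the inner and outer calls differ only in the how argument; df4's extra columns are abbreviated here, and its values are illustrative rather than real data.

import pandas as pd

# df4 holds additional planet information keyed by planet name. It's missing Mercury
# and Venus and includes two planets from other star systems. df3 is the combined
# planets data frame from the concatenation above.
df4 = pd.DataFrame({
    'planet': ['Earth', 'Mars', 'Jupiter', 'Saturn',
               'Uranus', 'Neptune', 'Janssen', 'Tadmore'],
    'type': ['terrestrial', 'terrestrial', 'gas giant', 'gas giant',
             'ice giant', 'ice giant', 'super earth', 'terrestrial'],
    'life': [True, False, False, False, False, False, False, False],
})

# Keep only the planets that appear in both data frames.
pd.merge(df3, df4, on='planet', how='inner')

# Keep every planet from either data frame; missing values are filled with NaN.
pd.merge(df3, df4, on='planet', how='outer')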
Our function call will remain the same, except for the how keyword argument, which we'll set to outer. As expected, this results in a data frame that contains all the keys from both initial data frames. Notice that because Janssen and Tadmore aren't in the left data frame, they don't have information for radius and moons, so these columns get filled with NaNs. Similarly, because Mercury and Venus aren't in the right data frame, they too are missing some information in the final table, which is represented by NaNs. Next, we'll do a left join. Again, the function gets the same syntax except for the how argument, which is set to left. This results in a data frame that retains all the keys from the left data frame and only the keys from the right data frame that exist in the left data frame too. So Janssen and Tadmore are excluded. Finally, we'll perform a right join. As expected, the result is a data frame that has all the keys from the right data frame, but none of the keys from the left that weren't also in the right. So Mercury and Venus are excluded. Nice job. Now that you know the fundamentals, you can use these pandas tools to do the most common kinds of data joins, which will be useful for a wide variety of data projects. And as you advance in your career, you'll discover even more about joining data and how it can get very complex. These tools will be a big help as you do. We've come a long way, and you're now ready to start using pandas to explore your data like a true data professional. See you soon. This is the end of the fourth section of the Python course. You now have a strong foundation of Python skills that you can continue to build on throughout your future career as a data professional. In this section of the course, you learned how data professionals use data structures to store, access, and organize data. Understanding which data structures fit your specific task is a key part of data work and will help you analyze your data with speed and efficiency. We reviewed fundamental data structures that are super useful for data professionals: lists, tuples, dictionaries, sets, and arrays. We also discussed two of the most widely used and important Python tools for advanced data analysis. The first was NumPy, which data professionals use for computational power. You learned how NumPy can help you rapidly process large amounts of data and perform useful calculations. The second Python tool you learned about was pandas, which is a powerful tool for analyzing tabular data. You learned how pandas can help you perform key tasks such as filtering, grouping, and merging data. Data professionals often work with tabular data. You'll use pandas throughout the rest of the certificate program and your future career. Coming up, you have a graded assessment. To prepare, review the reading that lists all the new terms you've learned, and feel free to revisit videos, readings, and other resources that cover key concepts. Congratulations on all your progress. Way to go. Hi, it's great to be with you again. You might recognize me from the last course. I'm Tiffany and I lead the responsible AI program management team here at Google. I'm back to talk more with you about your portfolio projects and how you can use them in your job search. Now that we've had some time to explore Python, I'm excited to help you work on a project that you can add to your professional portfolio. As we complete this segment of the program, you'll have the opportunity to begin showcasing your coding skills.
This portfolio project is a really valuable opportunity to develop your interview skills. When potential employers assess you as a candidate, they might ask for specific examples of how you tackled coding challenges in the past. You could use your portfolio as a way to discuss a real problem you have solved. Additionally, some employers might ask you to load, clean, and structure data during an interview to prove your proficiency. Getting some practice creating a database structure to address data-driven projects means you'll be prepared for that type of situation. You have already learned about experiential learning, or the idea of understanding by doing. This portfolio project is a great opportunity to really discover how organizations manage data with Python and practice the skills you've been learning in this course. To complete the portfolio project, you'll be presented with details about some business cases and some unstructured data files. Choose one business scenario and use the instructions to complete a new entry in your PACE strategy document. Based on the scenario, your task is to load, clean, and structure the data so that your end product is a tidy data set. Data tidying is structuring data sets to facilitate analysis. Tidy data sets are easy to manipulate, model, and visualize, and they have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. By the time you complete this project, you'll have a structured data set that you'll use in the next course's portfolio project. In your PACE strategy document, you'll also have documentation of the steps you took along the way, which you can use to explain your work and thought process to future hiring managers. At this point, you're almost finished with this course, which means you've learned everything you need to keep advancing as a data professional. This part of the project will focus on demonstrating mastery of data manipulation and understanding how data professionals use Python to explore and extract information through custom functions. Ready? Then let's get started. In this course, you've been learning about the advantages and simplicity of Python, as well as basic Python syntax, loops, strings, data structures, and object-oriented programming. Now it's time for an exciting next step: putting all this to work for your portfolio project. In the previous course, you learned about the flexibility of a data professional career and the ways communication has a direct impact on data-driven work. You also practiced thinking like a data professional as you assessed a business scenario and recorded project considerations in your PACE strategy document. These skills will also be applicable to this new project. In this part of the course, you'll be presented with some unstructured data files. Your goal is to load, clean, and structure their data into a tidy data set that is targeted toward a specific business scenario. Coming up, you'll begin to explore what it means to be a data professional. In other sections of this program, you'll work on developing additional skills to help you succeed in the data career space. There's so much more to learn about data visualization, statistics, models, and machine learning. The skills you learn and strengthen through this program will help you be a better collaborator when completing future data projects. Learning how to use and navigate Python will also make you an ideal candidate for data professional roles.
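As a quick, hedged illustration of the tidy structure described above, here is a minimal sketch in pandas. The column names and values are hypothetical, not taken from the project files; the point is just the reshaping from one column per month into one row per observation.

```python
import pandas as pd

# Hypothetical "untidy" table: each month is its own column,
# so a single variable (sales) is spread across several columns.
untidy = pd.DataFrame({
    'store': ['A', 'B'],
    'jan_sales': [100, 80],
    'feb_sales': [120, 95],
})

# Tidy version: each variable is a column (store, month, sales),
# and each observation (one store in one month) is a row.
tidy = untidy.melt(id_vars='store', var_name='month', value_name='sales')
tidy['month'] = tidy['month'].str.replace('_sales', '')
print(tidy)
```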
As a data professional, a large part of your job involves engaging with data to help your team and others in your organization develop critical insights that ultimately drive business decisions. Often, there is so much data that tools like Python are needed to successfully complete daily work. This part of the portfolio project is a great opportunity to demonstrate to potential employers that you can do exactly that: take unstructured data and clean, organize, and manage it to achieve an actionable goal. And remember, developing your skills as a data professional is an iterative process, so you can continue to improve as you have new ideas or learn new things. I appreciate how far you've come at this point in the program. You've done a lot of work already. You completed two entries in your PACE strategy document and began writing your own code. As you continue to work on your portfolio projects, you will want to consider how you can document your process and explain what you've done to potential employers and hiring managers in future interviews. First, it's important to recognize that as a data professional, you may be asked to learn new tools. There are a lot of great options out there, and different businesses have their own preferences depending on their needs. As you apply for jobs, keep in mind that you have learned a lot of transferable skills that can be applied across organizations and industries. For example, in the part of your portfolio project you just completed, you used Python to build a tidy data set focused on solving a data-driven business scenario. Python is a great tool, and knowing how to use it is a tremendous skill. But even more importantly, you've learned to consider how a data professional's work contributes to business decisions and strategic insights. You've learned the importance of communication, the value of the tools available to you, and how to use Python to manage large datasets. These are skills worth highlighting in job interviews, no matter what tools the position requires. This portfolio project is a great way to showcase these transferable skills and give interviewers insight into your approach to problems, your thought processes, and why you made certain decisions. In addition to making sure you're highlighting transferable skills when talking about your portfolio project, you'll also want to make sure you're considering your audience. As you have been learning throughout these courses, you will often work with different kinds of stakeholders who have different levels of technical knowledge. When you're communicating with them about technical processes, you'll want to keep in mind who your audience is, what their goals are, what they already know, and what they need to know. This is just as true when you're discussing your portfolio project with interviewers. Often, there will be people conducting or joining your interview who aren't necessarily data professionals. For example, hiring managers may not have the same detailed understanding of data processes as you do. In order to keep your presentation relevant to them, try to remember key questions about your audience. Your interviewers have a business challenge, just like stakeholders on data projects. They have an open job position they need to fill. Think about what they need to know about you to make a decision that solves that challenge. Coming up, you're going to learn all about how to tell stories with data.
Then you'll have an opportunity to perform some exploratory data analysis and create data visualizations. By the end of the program, you'll have a strong portfolio. Congratulations! You have completed the end-of-course project. You now have a tangible product you can present to future employers that demonstrates your Python proficiency. Wow, you've learned so many new Python skills. First, you learned how to use variables to store and label your data, and how to convert and combine different data types such as integers and floats. Next, you learned how to call functions to perform useful actions on your data and use operators to compare values. You also learned how to write conditional statements to tell the computer how to make decisions based on your instructions. And you practiced writing clean code that can be easily understood and reused by other data professionals. Then you discovered how to use loops to automate repetitive tasks. You also learned how to manipulate strings by slicing, indexing, and formatting them. After that, you explored fundamental data structures such as lists, tuples, dictionaries, sets, and arrays. Lastly, you learned about two of the most widely used and important Python tools for advanced data analysis: NumPy and pandas. Coming up, you have even more exciting discoveries to make. Now that you understand how to create systems to prepare data for stakeholders, it's time to start thinking about how to present that data and make it useful for decision making. You now have a strong foundation of Python skills that you can continue to build on in your future career as a data professional. So get ready and continue your learning journey. Imagine you're an archaeologist, someone who unearths artifacts to study ancient civilizations and preserve the stories of history. You're excited because you're the first to explore a new dig site. The rising sun creates a warm glow on the orange and yellow rock of the ancient riverbed in front of you. You forget your early morning tiredness and yesterday's 15-hour flight as you breathe in the crisp, clean air. You pause as you remember the words of your site leader the previous evening. This spot has never been studied, but it's absolutely perfect for preservation. You're guaranteed to find something, and anything we find we'll be able to present at the International Archaeological Research Institute next summer. What could be hidden under that rock? What ancient mysteries will be uncovered? What stories will be unearthed? The possibilities seem endless. Hi there. My name is Rob. I'm a consumer product marketing leader. I work on marketing projects here at Google. And every time I get the opportunity to analyze data, it feels like I could be an archaeologist on the verge of an incredible discovery. Some people see a data sheet or a table of disorganized numbers and they think, ugh, this is so lame. But data professionals know better, don't we? We know that hidden inside the numbers, columns, and rows are golden nuggets of information, never-before-seen insights, or compelling trends. These interesting bits of hidden knowledge are stories waiting to be shared. And stories are one of the most impactful ways to communicate ideas. The amazing truth for us data professionals is that all data has stories to tell. So whether you're an aspiring data professional or you want to learn to tell a good story, preferably both, welcome to this course. By now you have a fairly good idea of the scope of this program and the basics of Python coding.
Now it's time to dig deep into unexplored data and try to make sense of it. Are you ready to explore? We will begin with how to find and sculpt stories using a six-part process called exploratory data analysis. Then we'll discuss how PACE applies to telling stories using data. Do you remember the data professional workflow acronym PACE? That is plan, analyze, construct, and execute. We'll talk about how PACE applies to exploring data and learn about the necessity of visualizations in understanding the data. Finally, I'll show you how to perform exploratory data analysis on a data set in Python, expanding on those Python skills you learned previously. Later in this course, you'll learn about data sources, data types, data structuring, and data cleaning. Using Python notebooks, we'll discuss how to work with missing values, outliers, and categorical data. We'll talk about the ethics of exploring and cleaning raw data and how to communicate your questions and findings to various audiences. Later in this course, you'll also learn more about the visual analytics platform Tableau. We'll talk about how to enhance your data story with visuals and presentations. Along the way, I'll share tips on shaping your data stories for different audiences, like how to build data visualizations that target an audience's needs. Data professionals at Google identified the essential skills that are fundamental to working in data. Throughout this course, there will be opportunities to practice finding and telling stories with data. In short, stories can change lives. And data-driven stories are particularly compelling because they are based in numbers. They can also communicate principles, concepts, cautions, and new ideas to others. To give you an example from my own life, I actually started my career as a data analyst for a healthcare consulting firm. I was responsible for identifying and recommending the most effective treatments for patients with serious medical conditions ranging from asthma to diabetes to cancer. To develop my recommendations, I analyzed the medical records of millions of patients and compared and contrasted the outcomes, side effects, and medical costs of each treatment. Through my findings, we were able to recommend treatments that would not only help these patients heal but also help improve their quality of life and reduce their medical bills. The stories are out there just waiting to be found. So get out your shovels and magnifying glasses and let's start exploring. I'm Rob. I'm a product marketing manager at Google. I kind of grew up in the late 90s and the early 2000s in Boston. As an Asian American, I was often really ridiculed for all the standard stereotypes of being Asian, and among those were being good at or passionate about math or science. When I got into high school, I really found myself rejecting math and science and really trying to avoid those classes, trying to avoid studying them, and leaning more into things like humanities or even sports, just because I wanted to fit in so badly. I was lucky enough to be hired as essentially a literature review specialist at an economic healthcare consulting firm.
What I found when I was first hired there was that there was this whole other arm of the company devoted to analyzing millions and millions of medical records to understand and help test the efficacy and safety of pharmaceutical medications, to help people with really severe issues, and they would do this by analyzing data for, I think it was, over 30 million users. When I heard about this, I was so excited. I said, wow, that is a really cool field to potentially get into. I had a chat with my manager, who was really open to me taking on some side projects, of course after I finished my day-to-day job, and I remember I would spend time after work, after hours. I had this textbook, and I was reading up on standard data analytics and statistics, but also something called SAS programming, which is a statistical analysis software that was used by the analysts at our company. So I dove straight, head-on into that, and eventually I became so fluent at it that I was able to transition to becoming a data analyst. During my time as a data analyst, I was so passionate about understanding more and more about statistics and math in general that I actually ended up taking night courses at a local community college, because it was my goal to really try to study more about statistics. Eventually I was lucky enough to apply to a lot of master's programs in statistics, and given my background and my experience, I was fortunate enough to get into one. I really wish that I hadn't listened to the other people making fun of me and had just done what I wanted to do. I do, to a degree, regret that and that I didn't have that foundation. But what makes me almost proud, to a degree, is that I was able as an adult to really transition into this field, and it really just highlights in my mind that you can do whatever you want and accomplish whatever you want at any time in your life. Take a step back and just realize that we're all different people on different paths at the end of the day. Take your time. Each person learns in so many different ways, whether it be through these awesome online courses that we offer, or enrolling in a university program, or talking to their friends and working with them. It's okay whichever way you are best at learning. The point is, take your time and do what you need to do to learn. You shouldn't put yourself under pressure and compare yourself to anybody else. Just be you, and it's okay. Hello again, and welcome to the first section of this course. In the next few lessons, we'll be exploring the foundations of data analysis and how the six practices of exploratory data analysis help us find and tell stories using data. We'll begin by discussing each of the six practices of exploratory data analysis. You'll learn about what they are and why they're important. Then we'll learn about how our data professional workflow, PACE, or plan, analyze, construct, and execute, fits into the process of exploratory data analysis. Finally, you'll learn the importance of data visualizations in the data exploration process. We'll consider what data visualizations are and how data professionals use them to learn about and share data. The next few lessons will lay the foundation for how to tell stories using data. I hope you're as excited as I am to get started. Imagine you work for a company that's been hired to clean out a very old one-room warehouse full of antiques.
Your manager says no one has used the building for decades and doesn't know what the owners used it for over the years. You've been assigned to clean it out and create a detailed inventory of what you find inside so that the items can be sold at auction. The manager suspects there'll be interesting antiques inside. Since many items are covered by blankets, you don't know what the room contains. You start cleaning from one side of the room to the other and keep a digital inventory of the items you find. As you explore, you find a pile of metal gears, chains, and connection parts of all different sizes and shapes. You count them and write Metal Parts: 47. You also find 5 large metal sheets, 12 large tricycles, and 7 metal poles, among other things. As you write down these new items, you start to wonder if they're related or not. This simple example demonstrates how data professionals explore data and learn more about the stories and trends along the way. As you uncover and clean up what is in front of you, the individual parts and pieces within the data will start to tell a larger story. Of course, data requires a few more techniques and practices than the warehouse example, but the general idea is the same. You'll need to make sense of raw data content by reordering, categorizing, and reshaping it. You'll find a lot of different terms for this process: data wrangling, data remediation, data munging, and data cleaning are some of the most common. We combine all of these practices into a term that most data professionals are familiar with: exploratory data analysis, or EDA. Exploratory data analysis, or EDA, is the process of investigating, organizing, and analyzing data sets and summarizing their main characteristics, often employing data wrangling and visualization methods. The 6 main practices of EDA are discovering, structuring, cleaning, joining, validating, and presenting. These practices do not necessarily have to go in this order, and depending on the needs of the data team and the type of data they study, they may perform EDA in different ways. You'll also find that often the EDA process is iterative, which means you'll go through the 6 practices multiple times in no particular order to prepare the data for further use. Let's spend time on each of the 6 practices so that you have a better idea of what I mean. Discovering is typically EDA's first practice. During this practice, data professionals familiarize themselves with the data so they can start conceptualizing how to use it. They review the data and ask questions about it. During this phase, data professionals may ask, what are the column headers and what do they mean? How many total data points are there? In the example of the old warehouse from the beginning of the video, the discovering practice might involve walking through the room and removing coverings to get an idea of the amount and types of items. After some initial discovering, the next step is to start organizing. This EDA practice is called structuring. Structuring is the process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled. Structuring refers to categorizing data columns based on the data already in the data set. In terms of calendar data, for example, it might look like categorizing data into months or quarters rather than years. From the old warehouse analogy, structuring could be categorizing the items into metal and non-metal categories and getting a total count for each type.
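As a minimal, hedged sketch of what structuring can look like in pandas, the column names and values here are hypothetical, chosen to mirror the calendar and warehouse examples rather than any real project data:

```python
import pandas as pd

# Hypothetical daily records with a date and an item description.
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-15', '2024-02-03', '2024-05-20', '2024-11-02']),
    'item': ['gear', 'blanket', 'metal sheet', 'tricycle'],
})

# Structuring calendar data: derive month and quarter columns from the date.
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.to_period('Q')

# Structuring categories: group items into metal and non-metal, then count each.
metal_items = {'gear', 'metal sheet', 'metal pole'}
df['category'] = df['item'].apply(lambda x: 'metal' if x in metal_items else 'non-metal')
print(df.groupby('category').size())
```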
Before we move to the next practice of EDA, let's take a moment to talk about bias. Bias in the context of data structuring is organizing data in groupings, categories, or variables that don't accurately represent the whole data set. Most experts would agree that eliminating all bias from how data is structured is almost impossible because each person's individual ideas, training, and experiences are different. As professionals, however, it's important to try to avoid bias while structuring data. For example, imagine you want to know what percent of the population has a college degree in the UK, but your data only contains London residents. The data would be considered biased until you add data from the rest of the UK. In our warehouse example, trying to avoid bias might involve remaining flexible with the categories, potentially creating new categories as you uncover more and more items. The next EDA practice is data cleaning. Cleaning is the process of removing errors that may distort your data or make it less useful. Missing values, misspellings, duplicate entries, or extreme outliers are all fairly common issues that need to be addressed during data cleaning. In the warehouse example, you might decide to put broken or unusable items in a separate box away from other items. Next, we'll move to another practice called joining. Joining is the process of augmenting or adjusting data by adding values from other data sets. In other words, you might add more value or context to the data by adding more information from other data sources. For example, you might find during the discovering, structuring, or cleaning processes that the data set doesn't have enough data for you to complete a specific project. In that case, you should enrich the data by adding more to it. As an example, remember when I talked about the UK and college degrees to help understand bias? In that instance, joining would be adding the data from the rest of the UK rather than just including data on London residents. Going back to our warehouse analogy, imagine a museum manager sorts through the items and gives you the dates when each was made. The information from the museum manager acts as a different group of data you can join with your own. Next on the list of EDA practices is validating. Validating refers to the process of verifying that the data is consistent and high quality. Validating data is the process of checking for misspellings and inconsistent number or date formats, and checking that the data cleaning process didn't introduce more errors. Data professionals typically use digital tools such as R, JavaScript, or Python to check for consistency and errors in a data set and its data types. As for our warehouse analogy, validating could be like using the museum manager's knowledge to get an idea of how old the items are. The last EDA practice is presenting. Presenting involves making your cleaned data set or data visualizations available to others for analysis or further modeling. In other words, the presenting practice is sharing what you've learned through EDA and asking for feedback, whether in the form of a clean data set or a data visualization. We will be using the term data visualization a lot. To be clear, a data visualization is a graph, chart, diagram, or dashboard that is created as a representation of information. You might think that presenting always comes at the end of the EDA process. However, presenting can come at any point in EDA.
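To make the cleaning and validating practices a bit more concrete before we continue, here is a minimal sketch in pandas. The data frame and its column names are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical raw data with a few common problems:
# a duplicate row and a missing value.
raw = pd.DataFrame({
    'item': ['gear', 'gear', 'blanket', 'metal sheet'],
    'count': [47, 47, None, 5],
    'recorded': ['2024-01-15', '2024-01-15', '2024-01-20', '2024-02-03'],
})

# Cleaning: drop exact duplicates and decide how to handle missing values.
cleaned = raw.drop_duplicates()
cleaned = cleaned.dropna(subset=['count'])

# Validating: make sure columns have the types you expect and that
# no impossible values slipped in during cleaning.
cleaned['recorded'] = pd.to_datetime(cleaned['recorded'])
assert (cleaned['count'] >= 0).all(), 'counts should never be negative'
print(cleaned.dtypes)
```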
Data visualizations aren't exclusive to the presenting practice. They should be used throughout EDA. They help you understand data and point out trends and insights to others. In the warehouse analogy, presenting could mean showing your manager the progress on the warehouse and how many different items were found. In the workplace, as a data professional, presenting might look like preparing visuals and a slide presentation to share with your team. As you begin to plan your presentation, you should consider people with visual or auditory impairments by providing robust descriptions of the data. You can use things like alt text, descriptive text, or captioned recordings of the data so that your audience can explore the data themselves. We will cover each of the EDA practices in detail later on, but one of the most important things about the process is to ensure your EDA work does not misrepresent the data itself. The story you uncover should come from the data, not from your mind or biases in the data. It is your duty to convey your data in both an ethical and accessible way. Consider the warehouse analogy. To ensure you communicate your data ethically, you should give an accurate count of the materials and not overestimate the quantities of the antiques. In the workplace, communicating data ethically would be presenting sales numbers in context, year over year, so that rises and falls don't appear exaggerated in data visualizations. Once you complete the warehouse project, you realize you've uncovered a story. You find a dusty pair of signs that advertise food for sale. Suddenly the items you found start to make sense as a whole. These antiques appear to be a collection of supplies from food vendors. How about that for discovering a story? Of course, you should confirm this information using historical records before sharing. Hi, I'm Benj. I'm a product analyst at Google. I lead a team of analysts attempting to understand Google Chrome through real-world usage and insights. Exploratory data analysis is a key piece of my work. In particular, when you get a new piece of data or when you have a new question, the first thing you have to do is take a deep look through the data that's there and understand what's in front of you. You have to understand who the data is from. You have to understand why it was created, what isn't there, and what important caveats are sitting on top of it. Storytelling is the way that your insights make it to other people and really make change. It's oftentimes the case that we'll make reports that are tens of pages long, but what really changes the way people think is when you can tell a short and clear story about what's going on. A really good way to tell a story with data is to think about categories of users, categories of devices, or categories of use cases. When you can tell people that a lot of people are using Chrome on low-end phones in India, it tells a story about who your user base is and what they're doing. Knowing who they are and what that looks like allows them to think differently about who the audience is and what product they should build. One thing I try to do to make sure that I'm really staying true to the data is to check my own biases and make sure that I am giving back what I actually see. One way that I attempt to make sure that my reporting isn't biased is to try to come into new analyses with as few preconceptions as I can.
You need to know some things to be true and verify them against the data set: to make sure, for example, that users aren't spending 27 hours a day on their phone, because there aren't 27 hours in a day. To not assume that everyone is going to look the same. To not assume that all places in the world are going to act the same when it comes to any data. It's very important ethically to make sure that your data is working for everyone. To make sure that if you try to remove identifiers and take data and de-personalize it, that you've really done that well and effectively. The simplest methods aren't always the best. It isn't enough to just say, oh, I tried something, and then stop. You have to explore a little bit more and make sure you're really de-identifying the data. Curiosity is one of the most important traits for an analyst in general, and by trying to learn something new, you're showing that curiosity already. Have you ever walked into a room to grab a specific item and then become distracted and forgotten why you were there? As data professionals, our curiosity and excitement for finding stories in data might cause us to forget the original purpose for our data exploration. Obviously we want to maintain our natural curiosity, but most importantly we want to focus that curiosity on what questions need to be answered or what problems need to be solved. We seek a balanced mindset, one of targeted curiosity. This balance can be achieved by using PACE. As you learned, PACE, or plan, analyze, construct, and execute, is the workflow some data professionals use to remain focused on the end goal of any given data set. Imagine you work for a multinational company that manufactures office equipment. The finance department asks you to use the last 10 years of data to predict sales for the next 6 months. How might you approach your EDA of the data given this task? In this video you'll learn to combine PACE and EDA practices. Remember, EDA stands for exploratory data analysis. You might think that because analysis is in its name, EDA falls only in the analyze part of the PACE workflow. The truth is EDA applies to every part of PACE. You'll find the 6 practices of EDA all intersect with the other parts of PACE. Discovering, for example, is in line with the planning part of PACE, and presenting can be a major part of the executing part of PACE. As you will learn, data professionals don't just find and tell any random story from the data. Data insights are guided by a project's purpose and goals. The plan may come from a stakeholder or manager. You can think of it as the project plan or the stated goal of the company. For example, let's consider the sales prediction for the office equipment company from earlier in the video. As you recall, your task is to predict the next 6 months of sales performance based on 10 years of sales data. When you start your EDA, you realize the data sets you've been given contain a lot more data than you need. It would benefit you and your company to extract only the columns you need to predict sales, which would be date of sale, item, price, and sales rep, for a total of 4 columns. While cleaning, joining, and validating, you can exclude data on material cost, item ID numbers, database cost center numbers, and vendor names. Cleaning 10 years of data sets with 4 columns is much more manageable than with 8 columns.
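Here is a small, hedged sketch of that kind of column reduction in pandas. The file name and column names below are hypothetical stand-ins for the office equipment scenario, not actual project data:

```python
import pandas as pd

# Hypothetical 10-year sales extract with more columns than the forecast needs.
sales = pd.read_csv('office_equipment_sales.csv')   # hypothetical file name

# Keep only the columns the plan actually calls for.
needed_columns = ['date_of_sale', 'item', 'price', 'sales_rep']
sales_subset = sales[needed_columns].copy()

# Everything else (material cost, item IDs, cost centers, vendor names)
# is left out of the working data frame, which keeps cleaning manageable.
print(sales_subset.shape)
```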
Of course, in a typical workplace, it shouldn't surprise you that after you've done a lot of the work to complete your EDA based on the plan, the finance department says they actually wanted the predicted profit margin, not the sales targets. Miscommunication happens in every workplace, even in data analysis. If stakeholders, engineers, and data specialists are not clear on the plan, the results won't tell a cohesive or effective story. Instead, the results will likely lead to confusion, disagreement, and wasted time. One example of good communication would be sharing the PACE plan with anyone who might be involved. Another example could be sharing the analysis with a working group to get feedback before sharing the analysis more broadly. And finally, it's important to understand stakeholders' most important goals for the company before presenting to them. We will address details about communication strategies in upcoming videos. The point is, when you have a data set in front of you, remind yourself of the reasons for your analysis. Once again, consider the office equipment company. You've been given two data sets, one with transaction ID number and date, the other with transaction ID number and total cost. You're asked to forecast the next six months of sales. If you consider the purpose of this task, what is something you might do with the two data sets? You might consider the EDA joining practice and bring all the data from the two data sets together: transaction ID, date, and total cost. For another example, let's say you have a hospital's data on treatments performed over the last year. If the goal is to prepare data to make a purchase order for the materials and supplies for the upcoming year, what type of EDA might you consider? It would be a good idea to start by using the EDA practice of structuring to group the data by the type of items needed for each treatment. Data professionals are set up for success when they follow a framework like PACE. When a professional applies that framework to the way they perform the six practices of EDA, they keep priorities in order and they focus on achieving the project's purpose. Of course, as data professionals, our first priority is to accurately represent the data itself. If your company's project plan does not align with what the data is telling you, it is your responsibility to communicate that to stakeholders. Let me return to the sales forecast request at the office equipment company. Imagine you're given data on sales revenue from two specific geographic regions, but stakeholders are requesting a global forecast. As a data professional, it is your responsibility to ask for globally representative data to complete the task. Working with data from only two regions would be inadequate to make a global projection. Timelines, stakeholder pressure, or client needs should never cause a data professional to bypass what is required by the data. Misrepresentation of data is never warranted. Let's review. Data professionals should strive to align their work with the plan part of PACE. Keeping the focus on PACE helps determine the most effective ways to perform EDA practices while maintaining ethical representations of data. Coming up, we'll explore how PACE can help guide the development of data visualizations. I'll meet you there. Let's review a scenario, shall we? You're given a giant table of numbers and a data visualization. Which would you choose to most quickly communicate data insights?
Data visualizations are far more effective at quickly communicating complex information than data tables. Now, let's consider how this relates to EDA. Data visualizations are important tools that data professionals use to tell stories with data throughout their workflow, particularly in the EDA practice of presenting. Plotting parts or all of your data set on a bar graph, a scatter plot, a pie chart, or a histogram will help you and others to understand the data no matter where you are in the EDA process. As a data professional, you won't be working with simple, clean data sets that have only a few hundred lines of data. You'll be working with data frames that have thousands or hundreds of thousands of data points spanning months, years, or even decades. The more data you have, the more visualizations you'll need to create to understand how each variable impacts the others. When you start looking at new data, a popular and valuable tactic is to visualize it, like plotting time series data on line charts to understand periodicity, or scatter plots to get a good idea of the data's distribution. Data visualization can also help you explain your data set to stakeholders and other data professionals. It is your job to discover the important points, trends, biases, and stories that need to be shared, then design data visualizations in an effective way for different audiences. For example, imagine you're a data professional who works for an appliance manufacturer. Let's say you perform an analysis for the manufacturing team. During the analysis, you discover a delay in the manufacturing process. You need to communicate these findings to two different audiences: the manufacturing supervisors and the executive leadership team. When you share your findings, some stakeholders may not understand the data insights without the help of data visualizations. The ways data visualizations are designed will communicate different messages to stakeholders with vested interests in different parts of the project. For example, the manufacturing supervisors might need to review time series data plotted out over time to identify the manufacturing delays. Meanwhile, the executive leadership team would likely be more interested in a financial impact analysis. Those two data visualizations would need to be designed differently based on the needs of your audience. We will go into detail on exactly how to do that in the upcoming videos, but in the meantime, understand that the balance of words and data visuals in a presentation impacts the business decisions made from data insights. A carefully prepared data visualization can mean the difference between changing the mind of a stakeholder and the data being ignored. However, visualizations can also cause confusion or even misrepresent the data. For instance, imagine a data professional develops a visualization and changes the scale of the axes or the ratio of the graph height and width to make the line chart look flat or steep. This kind of skewing is the opposite of what data analysis is about. Because you will be the person most familiar with the data and its story, it is essential that your visualizations not mislead your audiences. Returning again to the manufacturing equipment example, if you provide a data visualization showing a sharp increase in sales over the last six months, your audience might assume the company is doing very well.
But if data from the last two years shows the six-month increase comes right after a long 18-month decline, the last six months take on a different context. Showing only the last six months as opposed to the two years misrepresents the sales data. Your data-driven storytelling is an opportunity to present facts and visualizations that are ethical, accessible, and representative of the data. Being an ethical presenter of the data you analyze means being honest and very clear about what is and isn't in the data. Along with that, remember to design your data visualizations in a way that is accessible to everyone. For example, avoid pairing red and green in data visualizations, as it can be difficult for people with color blindness to read. Generally, blue and orange are a better choice to use in data visualizations. We'll go into more detail on ethics and accessibility in upcoming videos. In other videos, we'll use digital visualization tools like Tableau and Python packages like Matplotlib, Seaborn, and Plotly to understand how to present your data in graphs and charts in ethical and accessible ways. These tools will be part of your everyday work as a data professional. Remember, creating visualizations of your data sets will help you throughout the entire EDA process. There's often no better tool for telling a good data-driven story than a well-made visualization. We've talked a lot about finding stories within data and about how it is your job to tell those stories to the best of your ability. There is, however, another story weaving its way through the readings, videos, and quizzes, and it is still continuing to unfold as you listen to this video. If it isn't clear, it's your story. Coming this far in the program shows you are determined to become a data professional. I hope you feel inspired by the idea of defining not only the stories of your data, but your own story as well. I'm sure you noticed that this part of the course had very little coding instruction in it. That's by design. The job of a data professional is about more than coding and statistical modeling. The first part of the course reflects that. It is also important to recognize how exploratory data analysis, or EDA, applies to finding, sculpting, and telling data-driven stories. If you talk to data professionals in the career space, many will share how beneficial EDA is in day-to-day work. So I took some time to talk about the practices of EDA, discovering, structuring, cleaning, joining, validating, and presenting, and to review the value of these practices. We talked about the PACE workflow and how it can apply to EDA in our search for stories within the data. You also learned about the ethics of working with data and the importance of representing data accurately when we tell data stories. Lastly, you learned how data visualizations are essential for understanding, forming, and presenting data-driven stories. You learned that telling a data-driven story that accurately represents the data and is inclusive of its audience is an example of exceptional data analysis. The story doesn't end here though. We have many more concepts to learn in this course, including using Python tools and code blocks to help you in your EDA practices of discovering, structuring, and cleaning raw data sets. In this program you will work with data sets that represent the type of raw data you'll likely see every day in your career as a data professional. Remember, all data has stories to tell.
These stories are often hidden well, and it will take an especially curious and determined professional to discover them. The data-driven stories you'll be discovering have the potential to change an entire company or even the world. I am looking forward to continuing our journey together through the rest of this course as you discover questions, solve problems, and learn new concepts. I hope with each new concept you learn, you discover a little more about your own story as well. I'll see you soon. Have you ever tried to explain something to a friend that you didn't understand very well? Maybe it was a difficult math assignment, a complex news story, or the way a family member cooks a favorite dish. When you're trying to teach a topic you're not an expert on, it can be challenging to explain the details and give clear instructions. As a data professional, you never want to be in this position with the data you analyze. In fact, your goal should be to know your data very well. When you're reviewing a table of data, it's important to understand where the data is from, what the column headers mean, what the data will be used for, what the imperfections are, and the small details in between. Making sense of raw data is why we're here. Welcome. I'm excited for you to build on the knowledge that you've learned so far to perform exploratory data analysis, or EDA, on the kind of data sets you'll see on the job. You'll start by learning about the many different data types and data sources that you will encounter in your work and how to study them. After that, you will return to those Python notebooks to start coding the concepts you've learned. We'll also go into more detail on the first two practices of EDA, discovering and structuring. While learning about the discovering practice, we will use popular Python functions to get to know the information contained in the data sets. You'll learn to use different visualization techniques to uncover hidden correlations and connections in the data. As for structuring, you will learn to apply Python functions to large data sets for many different operations, including sorting, filtering, extracting, slicing, joining, and indexing large data sets. You'll learn to make basic corrections or formatting improvements to whole data columns or entire data sets. All of these techniques and practices will help you learn about the data and find the story that needs to be told. Along with all the Python functions and coding scripts, we will talk about what to do when questions about the data set arise that you can't answer, like why there are missing data fields, for example. We'll also make sure the EDA you perform aligns with the PACE workflow that we have set. At its core, this section of the course is about digging through a data set for the first time and investigating it as meticulously as you can. It is up to you to find the stories in the data. Most data professionals will tell you that comprehensive EDA is the key to useful visualizations and models. If there are questions left unanswered or misunderstandings still present in the data after EDA, any presentation or machine learning model based on it will not be particularly useful. It might feel like trying to explain a math assignment you don't fully understand. Remember, all data sets have stories to tell. Let's get to work and find them. My name is Alassair and I'm a senior financial analyst at Google. I was born in Nicaragua and I immigrated to the United States when I was two years old.
I grew up in inner-city Philadelphia. After high school, I joined the United States Marine Corps and spent time in Japan, South Africa, Saudi Arabia, and Panama. One of the big reasons I got interested in finance is because, when I was at my first duty station in South Africa, I befriended the ambassador, who had worked on Wall Street. He took me under his wing and mentored me and taught me a lot of things about personal finance, corporate finance, and the global economy, and my mind shifted from I'm going to be a Marine forever to I'm going to get out of the Marine Corps after my four years and go back to school, and it worked out. I learned about advanced data analytics on my own, actually, between random YouTube channels and just Googling stuff. I got into big data analytics partly because, to understand and drive value in your organization, you have to really understand the underlying data. As a senior financial analyst, the typical day in the life really is focused around a couple of things, like working with strategic business partners. I spend a lot of time trying to figure out what the right financial view is so they can have the insights into their business to influence their decision making. Completing an online course like this is difficult. When you look at that course schedule and see that it's three months or six months long, just focus on that next day, stick to your schedule, and show up every day. It's like running a marathon: you're not running 26.2 miles, you're running one mile 26.2 times. Just focus on the very next thing. I believe in you; you can do it. Imagine for a moment you're preparing a meal for friends. You have a recipe and raw ingredients, but this is the first time you'll cook the dish. Of course you want your friends to enjoy it. If you're like me, the idea of being judged on perfectly preparing a recipe the first time you've made it is scary. Believe it or not, this cooking example is what it's like to be a data professional during the discovering practice in EDA. The recipe you're trying to follow is equivalent to your company's project plan. You have everything you need at your disposal: a kitchen, or a digital workspace with dedicated servers for data analysis computing, and raw ingredients, or a data set. It's up to you how to best mix, blend, and cook the ingredients in order to make a winning dish that your friends will enjoy. In this video, let's focus on raw ingredients. As a data professional, you will be handling all kinds of different data in a number of different formats and file types. I will share with you the most common data sources, data formats, and data types, along with a few Python functions. Once you're familiar with a data set's source, format, and the data types within it, you'll be ready to handle questions or challenges as they arise during the analysis. First, when you're given data, you'll want to know the source. The term data source can have many different meanings in different contexts, but for our purposes we'll be defining data source as the location where the data originates. One good thing to know about a data source is how and when to contact subject matter experts, such as engineers or database owners. These are the people who either generate the data or are in charge of delivering the data sets. When you have questions about the data as you do your discovery, they will be the ones you reach out to. Knowing the ownership and source of the data is critical because, by understanding the data source and who is responsible for it, you can determine its reliability.
Does the data owner have experience in storing data? Does the data owner have any financial stake in the data's output? Understanding the source of the data will help you in telling the story of the data and making ethical decisions about its use. Another important part of determining the data source is understanding how it was collected. Whether the data was collected through a report from a computer system, a custom selection from a large online database, or a data table that has been manually entered, knowing how the data was gathered will help you understand questions that may come up during EDA. For example, missing values could mean many different things. Maybe the database owners either didn't know or didn't want to disclose the data for a manual entry, or there might be lagging data or a system bug in an online database. The next thing you'll need to know about your data sources is the data file format. The main data formats you'll experience as a data professional are tabular files, XML files, CSV or comma-separated values files, spreadsheets, database files, and JSON files, which stands for JavaScript Object Notation. Here are a few examples of what these file types look like. If you've gotten this far into the program, you'll be familiar with tabular and Excel files by now. As you know, they organize data in tables, with data variables organized by rows and columns, rows representing the objects and columns representing aspects of the objects in the data set. The advantage of this file type is a clear identification of patterns between variables. CSV files are simple text files which can be easy to import or store in other software, platforms, and databases. They look like rows of text and numbers separated by commas or other separators. The rows of data are broken up by commas rather than strict columns. The advantage of CSV files is more from a computer science aspect, in that it is a file type that is easily read, even in a text editor. It is also easy to create and manipulate. In Python, you can use the pandas read_csv function to read and work with tabular data that is in the CSV format. Database, or DB for short, is another way of storing data, often in tables, indexes, or fields. Database files are great for searching and storage. They often require some basic knowledge of structured query language, or SQL. We will explore how to query data from databases in an upcoming section. Lastly, JSON files are data storage files that are saved in a JavaScript format. You'll find the information in these files resembles a different language in its function and format. JSON files may contain nested objects within them. You can think of nested objects as expandable file folders or drop-down menus within the code itself. For example, a JSON file might have the ingredients for a recipe listed, and under each ingredient you would have nested information like weight, calories, and price objects that define the ingredients. There are a few advantages of JSON files. They have a small message size, they are readable by almost any programming language, and they help coders easily distinguish between strings and numbers. There are Python libraries and functions built for working with JSON files. As you learned previously, you can import the json Python module into your Python notebook to use as an encoder and decoder of JSON files. There are also tools in pandas called read_json and to_json, which will read JSON files into a data frame and convert an object to the JSON format, respectively.
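Here is a minimal, hedged sketch of reading these formats with pandas. The file names are hypothetical placeholders; pd.read_csv, pd.read_json, DataFrame.to_json, and the built-in json module are the real tools being illustrated:

```python
import json
import pandas as pd

# Read tabular data from a CSV file into a DataFrame (hypothetical file name).
df_csv = pd.read_csv('recipes.csv')

# Read a JSON file into a DataFrame (hypothetical file name).
df_json = pd.read_json('recipes.json')

# Convert a DataFrame back into a JSON string, one object per row.
json_string = df_csv.to_json(orient='records')

# The built-in json module can decode that string into Python objects.
records = json.loads(json_string)
print(records[:2])
```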
As a data professional, you may also be tasked with finding data stories in other formats like HTML, audio files, photos, email text, images, or text files. In every case, there are Python libraries or functions that can help you discover and structure the data for your research project. There's no best format for the data. It will depend on the project and storage type. You're just looking for the format that best fits that particular data set. Finally, the last raw ingredient in understanding data is the types of data. You may already be familiar with the many different categories of data. As a reminder, there are first-, second-, and third-party data. First-party data is data that was gathered from inside your own organization. Second-party data is data that was gathered outside your organization but directly from the original source. Finally, there's third-party data, which is data gathered outside your organization and aggregated. Knowing the type of data, whether first, second, or third party, will help you more efficiently answer or seek answers to the questions that may arise during analysis. For example, if there are missing values in first-party data, then someone in your organization can help you determine whether the missing data can be recovered. For third-party data, you'll likely have to reach out to a separate organization. You'll also be familiar with different types of data like geographic, demographic, numeric, time-based, financial, and qualitative data. Your job as a data professional will require that you understand and work with all these types of data. Knowing the data source, format, and type you're working with will help you to answer two very important questions. First, given what you know of the data so far, does it align with the plan as defined by your PACE workflow? And second, do you have enough data to follow through with the plan in the PACE workflow? If you've answered either of these questions with a no or not sure, it is your job to reach out to the owners of the data and project stakeholders to inform them of the issue you found. For example, let's say you're assigned by a project manager to predict the number of customers a retailer can expect to see next month. Unfortunately, you're only given profit margin data and only two months' worth of customer purchase data. The profit margin data won't help you with returning customers, and only two months of data, rather than the last few years, won't give you much confidence in accurately predicting an upcoming month. Returning to the data source and requesting more data will keep everyone in the process, including you, focused on the plan that was established as part of your PACE workflow. When you're working as a data professional, keeping this focus will be essential in identifying and carrying out high-priority and value-add tasks. As a data professional, you're going to know what the data means and how to work with it to find solutions to business problems, like cooking a meal. You'll be given the tools and ingredients you need to work on the problem. It is up to you to work with your team if there aren't enough ingredients or if what you've been given won't make a great dish. Now that you have a better understanding of exploratory data analysis and why it's important, let's explore a data set similar to one you might work with in your career as a data professional.
The National Oceanic and Atmospheric Administration, or NOAA, keeps a daily record of lightning strikes across much of North America. Let's imagine we've been tasked with performing EDA on this data set so that it can be used to predict future lightning strikes in this region. In this video, we will use Python to do the EDA practice of discovering on data gathered by the NOAA in 2018. We will talk through the first steps data professionals typically take when encountering a data set for the first time. Let's open up a Jupyter notebook to begin. First, we'll want to import the data set we wish to analyze and the applicable Python libraries we want to use. This preparation step is similar to the way a painter gathers paintbrushes, paints, and an easel before they start painting. Similarly, we want to collect all our tools and data. The NOAA data set we will use is in the public domain, but we have provided a file to download so you can follow along. To begin, import the Python libraries and packages you plan to use. In our case, we'll use pandas, NumPy, and matplotlib.pyplot. To save time and increase efficiency, rename each library or package with its standard abbreviation: pd, np, and plt, respectively. Another quick tip for saving time: instead of clicking Run Cell, you can always use the keyboard shortcut Control or Command Enter. You may recall that Python packages like pandas and NumPy are open source code packages that have been pre-designed and coded to help analyze and manipulate data more efficiently. Pyplot is a package focused on plotting charts and visualizations. Most data professionals import all their commonly used libraries, packages, and modules at the beginning of any coding session, but you can import them at any time while working with data in a Python notebook. The pandas conversion function to_datetime is incredibly helpful when working with dates and times. We will first convert our date column to datetime, which will let us split the date objects into separate data points of year, month, and day. This is good practice because throughout your career you'll be grouping data by various time segments. We'll get to that later. After running these initial functions, you are ready to begin your EDA discovering practice with this data set. Let's begin with the head method, or function. Head will return as many rows of data as you input in the argument field. We'll look first at the top 10 rows. When I input head with a 10 in the argument field and click run, the first 10 rows of the data set appear in our notebook. This data set contains three columns of data: date, number of strikes, and center point geom. At this point in the exercise, we need to clearly understand what each of these columns means. Date and number of strikes are fairly straightforward. You'll want to pay attention to any dates and their format, which in this case is year-month-day. You'll also find, when looking at the date column, that lightning strike data is recorded almost every day for at least one location in the year 2018. Some column headers are obvious, but some, like the column header center point geom, aren't. If you are not sure what a column header means while you're working on a project, you can go back to the public documentation available from the NOAA to confirm its meaning. For this video, the information is provided. The center point geom column refers to the longitude and latitude of the recorded strikes.
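Here is a minimal sketch of those first steps in a notebook. The file name and the exact column spelling are assumptions based on what's described in this walkthrough, so treat them as placeholders:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the lightning strike data (hypothetical file name).
df = pd.read_csv('eda_lightning_strikes_2018.csv')

# Convert the date column from strings to datetime so we can work
# with the year, month, and day components later.
df['date'] = pd.to_datetime(df['date'])

# Discovering: look at the first 10 rows to get a feel for the data.
df.head(10)
```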
We know the number of columns, but we also want to know how many rows of data we'll be working with. Another great EDA discovering tool is the info function. The method called info gives the total number and data types of individual entries. Keep in mind that data types are called dtypes in pandas. If we type df.info into our notebook and run it, we will get this output. Our range index tells us our total number of entries: nearly 3.5 million. We also find that the data values in our date and center point geom columns are classified as objects, and the data in our number of strikes column is classified as int64. int64 refers to a 64-bit integer. It means the data type contains whole numbers between negative 2 to the power of 63 and positive 2 to the power of 63 minus 1. Simply put, int64 is a standard integer somewhere between negative 9 quintillion and positive 9 quintillion. There is one more dtype you might see when performing the info function: str, which refers to a string. Like objects, you'll likely be familiar with strings already. Strings are immutable sequences of characters. Getting back to the data set, we have a pretty good idea about its size and scope. We know it has 3 columns, and we know it has 3,401,012 rows. We know what the columns mean. There are other methods or functions that we could use during our practice for discovery. For example, the methods describe, sample, size, and shape are all useful for learning about a data set. For this data set, though, we have what we need in order to understand what is going on from a discovery standpoint. So next, let's determine which months have the most lightning strikes by plotting our first visualization. Remember, your data set has over 3 million rows. If we don't do any categorizing or grouping, we can end up with a notebook that is stuck endlessly trying to run the code. We've already converted our date column to datetime, which makes it easy for us to manipulate the data in the column. We want to group our daily strikes into something more manageable, like months, for example. So let's make our date reference the number each month corresponds to in a year, that is, January is 1, February is 2, and so on. We'll do that by creating the column month using the code df.date.dt.month; dt in this instance stands for datetime. Next, to make it easier to interpret on a chart, let's convert the month data, which are currently numbers, to month name abbreviations. We'll do this by creating another column, month underscore txt. This code will take the string of characters, or month names, and slice it to include only the first three letters. Before we try to plot this data on a chart, we need to sum the number of strikes in all locations by month. We can do this by creating a data frame using the groupby function, in which we will include the columns month and month underscore txt, because these are the columns we will use to group and order the lightning strikes. We'll use the sum function to add the lightning strikes for each month together. For our last lines of code in this video, let's plot the number of strikes by month for the year 2018 into a bar chart, and then the total number of strikes by state on a map. This will help us get a good sense of the data's story. You'll find that this looks like a big block of code, but because we are using matplotlib.pyplot, the actual code for the visualization is not too difficult to follow.
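To make that walkthrough concrete, here is a minimal sketch of the grouping and plotting steps, assuming the column names date and number_of_strikes; the next part steps through the plotting calls in detail:

```python
import matplotlib.pyplot as plt

# Numeric month (1-12) and a three-letter month abbreviation.
df['month'] = df['date'].dt.month
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)

# Total strikes across all locations for each month.
df_by_month = (
    df.groupby(['month', 'month_txt'])['number_of_strikes']
      .sum()
      .reset_index()
      .sort_values('month')
)

# Bar chart of strikes per month.
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'],
        label='Number of strikes')
plt.xlabel('Month')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes in 2018 by month')
plt.legend()
plt.show()
```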
First, we have the plt.bar function, where we enter the data columns we wish to plot: the x-axis first, month underscore txt, followed by the height, or y-axis, which we fill in with number underscore of underscore strikes. Next, we give the bars a legend, or label as it's called in Python, which we input as number of strikes. Lastly, we fill in our bar graph details with the x- and y-axis labels and the visualization title, and we leave the legend and show argument fields blank for now. After we run the code, you'll see that August 2018 had the most lightning strikes, while November and December 2018 had the fewest. Now, you've completed your first experience coding for an EDA practice in Python. Very well done. We have a lot more to talk about, and this is a great start. I've often found that stepping away from a challenging project and taking a break helps me find new perspectives on my work. I like to get a can of seltzer from the fridge, take a walk, and reflect on what I know and what I don't know about the project I'm working on. The EDA discovering process requires a similar shift in perspective. After I perform an initial discovering of a data set, I've usually learned enough about the project to make a hypothesis based on that data. A hypothesis is a theory or an explanation, based on evidence, that has not yet been proved true. Data professionals often use hypotheses as a starting point for continued investigation or testing. Once I form my hypothesis, I'm in a better position to discover more about the data and achieve my ultimate goal: telling a story. So far in this course, we've discussed how to begin the discovering practice of EDA. You learned how to examine data sources, data formats, and data types. You've considered column header information and averages, and you've made some initial visualizations to represent your data. You used Python to determine the size and scope of the data set and learned when you need to ask the owner of the data clarifying questions. After you've made sense of the raw data, you're ready for the next step of the process: drafting a list of questions and forming hypotheses. In this video, you'll learn how to ask meaningful questions about the goal you've outlined in the PACE workflow to better understand what is missing from your data set and what you still need to find out. One way I like to do this is by breaking the original problem into smaller chunks. Some questions you might ask include: how can I break this data into smaller groups so I can understand it better? How can I prove my hypothesis? Or, in its current form, can this data give me the answers I need? Let's consider these questions in context. For example, imagine you work as a data analyst for an international airline and you need to determine whether lowering the prices of tickets will attract more customers on certain days of the week. To solve this problem, you might ask: which months have the most passenger traffic? Which weeks, dates, or known holidays have the highest number of passengers? When are tickets typically purchased? Then you'd form your hypothesis. In this case, your hypothesis might be: I predict that Tuesdays and Wednesdays of a normal business week have the fewest passengers and flights. So, if the airline lowers prices for Tuesdays and Wednesdays during non-holiday weeks, then they will sell more tickets.
Eventually, you would test your hypothesis by analyzing the data to understand whether the airline would attract more customers by lowering the prices on those specific flights. The purpose of asking questions and forming a hypothesis is to better understand what you want to learn from the data and what the results of your testing might show. Later, when you're performing other practices of EDA, you can refer to these questions and your hypothesis to determine whether you've supported or refuted your original theory. To answer the questions and test the hypothesis you or your team formed, you will need a plan. For instance, you might need to contact the subject matter expert who owns the database or is more familiar with the data source. Or, you may need to do your own research. In other words, leave no stone unturned. It's an old saying from an ancient Greek legend, and it means to search everything you can think of to find what you're looking for. If you discover that your search for answers only brings more questions, that's a good thing. You're eliminating the possibility of misinterpreting or misrepresenting the data each time you learn more. At some point, you may decide you need to organize or alter the data to find the answers to your questions. For example, you may need to regroup entries into months or years rather than days or weeks. Or, you might want to group customer ages into age ranges to help you understand trends more effectively. Sometimes combining or splitting data columns will be necessary for creating models to answer questions. Other times, changing date formats or time zones in time-bound data may be all that's required. For example, in your work with the international airline, you were tasked with finding days to enact lower ticket prices so the airline could attract new passengers. Imagine the data you were given listed ticket prices in US dollars, but the original request was to lower ticket prices for passengers departing from Europe. One change you would need to make immediately is to convert US dollars to euros. Making small changes to your data, like formatting the time, changing a unit, or converting the currency, is all part of the discovering process. However, with every change you make, stay focused on the problem you are assigned or the plan that you established as part of the PACE framework. As we discussed, every data set is different. Asking questions and forming a hypothesis will take time and effort, but ultimately, answering questions and testing your hypotheses will be the way you find the stories hidden in your data. If you get stuck, it might help to step away from your initial discovering work and think through your questions and hypotheses again. Any visualization rendered, conversion made, question answered, or hypothesis tested must be true to the data set's story. Who knows, maybe you'll find the answer on your walk break, and wouldn't that be refreshing. As a data professional, you can expect to work with time objects and date strings. In this video, we'll continue coding in Python and practice converting, manipulating, and grouping data. By the end of this video, we'll create a widely used data visualization: a bar graph that tells a story with your data. Working with date strings will often require breaking them down into smaller pieces. Breaking date strings into days, months, and years allows you to group and order the other data in different ways so that you can analyze it.
Manipulating date and time strings is a foundational skill in EDA. In this video, you will learn to convert date strings in the NOAA lightning strike data set into datetime objects. We will discuss how to combine these date objects into different groups by segments of time, such as quarters and weeks. Let's open a Python notebook and I'll show you what I mean. Let's begin by importing Python libraries and packages. To start, import Matplotlib and pandas, which you've used before. To review, Pyplot is a very helpful package for creating visualizations like bar, line, and pie charts. Pandas is one of the more popular packages in data science because its specific focus is a series of functions and commands that help you work with data sets. The last package, Seaborn, may be new to you. Seaborn is a visualization library that is easy to use and produces nice-looking charts. For this video, we'll use the NOAA lightning strike data for the years 2016, 2017, and 2018 to group lightning strikes by different time frames. This will help us understand total lightning strikes by week and quarter. As I mentioned at the beginning of this video, when manipulating date strings, the best thing to do is to break the date information, like day, month, and year, into parts. This allows us to group the data into any time series grouping we want. Luckily, there's an easy way to do that, which is to create a datetime object. Now, as you'll recall from a previous video, this NOAA data set has three columns giving us the date, the number of lightning strikes, and the latitude and longitude of the strike. For us to manipulate the date column and its data, we'll first need to convert it into a datetime data type. We do that by simply coding df.date and making it equal to pd.to_datetime with df.date input in parentheses. Doing this conversion gives us the quickest and most direct path to manipulating the date string in the date column, which currently is in the format of a four-digit year, followed by a dash, then the two-digit month, a dash, and lastly, the two-digit day. Okay. This is the exciting part. Because our dates are converted into pandas datetime objects, we can create any sort of date grouping we want. Let's say, for example, we want to group the lightning strike data by both week and quarter. All we would need to do is create some new columns. You'll see here we're creating four new columns: week, month, quarter, and year. With the first line of code, we're creating a column called week by taking the data in the date column and using the function strftime. This function from the datetime package formats our datetime data into new strings. In this case, we want the year followed by a dash and the week number out of 52. If we want that string, we need to code it as %Y-W%V. The percent sign is the command that tells strftime which piece of the datetime to place in the string: %Y gives us the year. The W is a literal character marking this as a week, and %V gives the week number, a sequential number running from 1 to 52. The final string output for the column data will be in this format: 2016-W27. The next line of code gives us the new month column. The argument is written as %Y-%m. This will output the four-digit year, followed by a dash, then the two-digit month. Essentially, we're removing the two-digit day from the original date string. Next, we will create a column for quarters. In this case, a quarter is three months.
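As a small sketch of those first two columns (the date column name and the format strings come from the walkthrough above; everything else is an assumption):

```python
import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# Format the datetime values into new grouping strings.
df['week'] = df['date'].dt.strftime('%Y-W%V')   # e.g. '2016-W27'
df['month'] = df['date'].dt.strftime('%Y-%m')   # e.g. '2016-07'
```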
Many corporations divide their financial year into quarters, so knowing how to divide data into quarter years is a very useful skill. In this case, it only takes one line of code to create the column. We'll call the new column quarter, and we'll use our date column with the to_period function to create it. Pandas has a pre-made code for dividing datetimes into quarters: in the to_period argument field, we only need to place the letter Q, and then we can use the strftime function to complete the string. For the argument, we put %Y-Q%q. The first Q is placed into the string as a literal character to indicate we are talking about quarters. The percent sign followed by the lowercase q tells pandas that we want the date formatted into quarters. Our final column will be the easiest of them all to code. The year column is created by taking our original date column data and creating a string that includes only the argument %Y. This creates a column of data with only the year in it. Now that we have formatted some strings, let's quickly review our work by using the head function we learned in the previous video. When we run this code, our four new columns are there: week, month, quarter, and year. They are all formatted just as we discussed. We can use these new strings to learn more about the data. For example, let's say we want to group the number of lightning strikes by week. An organization whose employees primarily work outdoors might be interested in knowing the week-to-week likelihood of dealing with lightning strikes. In order to do that, we'll want to plot a chart. We've reviewed a couple of charts coded in Python by now. Next, let's code a chart with the lightning strike data. For plotting the number of lightning strikes per week, let's use a bar chart. Our graph would be a bit confusing using all three years of data, so let's just use the 2018 data and limit our chart to 52 weeks rather than 156 weeks. We can do this by creating a data frame that groups the data for that year and then orders it by week. We will learn more about the structuring functions in another video. For now, let's focus on plotting this bar chart. We'll use the plt.bar function to plot. Within our argument field, we select the x-axis, which is our week column, then the y-axis, or height, which we input as number of strikes. Next, we'll fill in some of the details of our chart. Using plt.xlabel, plt.ylabel, and plt.title, we will place the arguments week number, number of lightning strikes, and number of lightning strikes per week 2018, respectively. This renders a graph, but the x-axis labels are all joined together and difficult to read, so let's fix that. We can do that with the plt.xticks function. For the rotation, we can put 45, and for the font size, let's scale it down to 8. After we use plt.show, the x-axis labels are much cleaner. Given our bar chart illustrating lightning strikes per week in 2018, you could conclude that a group planning outdoor activities for weeks 32 to 34 might want a backup plan to move indoors. Of course, this is a broad generalization to make on behalf of every North American location in the data set. But, for our purposes and in general, it is a good understanding of our data set to have. For our last visualization, let's plot lightning strikes by quarter.
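Here is a hedged sketch of the quarter and year columns plus the weekly bar chart just described; variable names like df_by_week_2018 are invented for this example:

```python
import matplotlib.pyplot as plt

# Quarter (e.g. '2016-Q3') and year (e.g. '2016') columns.
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

# Limit to 2018 and total the strikes for each week of that year.
df_by_week_2018 = (
    df[df['year'] == '2018']
      .groupby('week')['number_of_strikes']
      .sum()
      .reset_index()
)

plt.bar(x=df_by_week_2018['week'], height=df_by_week_2018['number_of_strikes'])
plt.xlabel('Week number')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per week (2018)')
plt.xticks(rotation=45, fontsize=8)   # tidy up the crowded x-axis labels
plt.show()
```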
For a visualization, it will be far easier to work with numbers in millions, such as 25.2 million rather than 25,154,365, for example. Let's create a column that divides the total number of strikes by 1 million. We do this by typing df by quarter and entering the relevant column in the argument field; in this case, we want number of strikes. Next, we add on .div to get our division function. Lastly, for the argument field, we enter 1 million. When we run this cell, we have a column that provides the number of lightning strikes in millions. Next, we'll group the number of strikes by quarter using the groupby and reset index functions. This code divides the number of strikes into quarters for all three years. Each number is rounded to the first decimal, and the letter M represents 1 million. As you'll soon discover, this calculation will help with the visualization. You'll learn more about these functions in another video. We will plot our chart using the same format as before. We'll use plt.bar, with our x coming from our df by quarter data frame with quarter in the argument field. For the height, we put the number of strikes column in the argument field. It would be helpful if each quarter had the total lightning strike count at the top of each bar. To do that, we need to define our own function, which we will call add labels. Let's type add labels, then input our two column axes, quarter and number of strikes, separated by commas and brackets. At the end, we use the formatted column we created earlier, number of strikes formatted, to label the number of strikes by quarter. To finish the bar chart, we label the x- and y-axes and add the title. Before we show the data visualization, there are a few small things we want to add just to make it friendlier to read. Let's set our width and height to 15 by 5. Next, let's make the bar labels cleaner by defining those numbers and centering the text. Our bar chart now gives us the number of strikes by quarter from 2016 to 2018. To make the information easier to digest, let's do one more visualization. Here is the code for plotting a bar chart that groups the total number of strikes year over year by quarter. Review the code carefully and consider what each function and argument does in order to create this final polished bar chart. Each year is assigned its own color to highlight the differences between quarters. And now we have our chart. Coming up, you'll learn more about the different methods for structuring data. I'll see you there. As a data analytics professional, you will often need to learn more about your data sets. This is where the structuring practice of EDA can help. Let's explore the valuable methods that are part of this practice. As you'll recall from earlier in the program, structuring helps you to organize, gather, separate, group, and filter your data in different ways to learn more about it. Next, we'll talk about the methods involved in structuring, and later you'll practice these concepts using Python. First on the list of structuring methods is sorting. Sorting is the process of arranging data into a meaningful order. Imagine that you're given a data set about kangaroos. These furry creatures are native to Australia and Papua New Guinea, and they're known for their strong tails and the belly pouch they use to cradle their babies, called joeys. The kangaroo data set contains information about kangaroo characteristics like pouch size, tail length, total body length, and much, much more.
The first data we'll consider is a column measuring the volume of the kangaroo pouches in cubic centimeters. We can sort those values in ascending or descending order, from smallest to biggest or biggest to smallest. Another useful structuring tool is extraction. Extracting is the process of retrieving data from a data set or source for further processing. You can think of extraction as retrieving whole columns of data. An example of extraction is to take the kangaroo data from before, then evaluate just two of the columns from the data set, such as pouch volume and tail length. You can use the resulting data for analysis, comparisons, or visualization. Another structuring method is filtering. Filtering is the process of selecting a smaller part of your data set based on specified parameters and using it for viewing or analysis. You can think of filtering as selecting rows of a data set. In the case of our kangaroo data set, filtering could look like viewing only the pouches of kangaroos that also have tails shorter than one meter. This is useful in finding meaningful groups or trends in the data. Next on the list of structuring methods is slicing. Slicing breaks information down into smaller parts to facilitate efficient examination and analysis from different viewpoints. Think of slicing as selecting either or both columns and rows, a combination of extraction and filtering. In the kangaroo data set, let's say you have a column of their body length called total body length. In another column, you have the kangaroos identified as one of three different regional populations. If you were to take the body length of only one of the three regional populations, you would be pulling a slice of the data. Grouping is our next structuring method. Grouping, sometimes called bucketizing, is aggregating individual observations of a variable into groups. An example of grouping is to add a new column next to the kangaroo tail length column that groups all the tail lengths into three types, long, average, and short, based on the measurements in the tail length column. You can then find and organize the total body length values based on the kangaroo tail length groups. The last structuring method is merging. Merging is a method to combine two different data frames along a specified starting column. For example, imagine we had an additional data set of kangaroo information from a different field study, but with the same parameters and variables. We might use the merge or join functions to align the columns and combine the new data into one data set. It's essential that you do not change the meaning of the data while performing your filtering, sorting, slicing, joining, and merging operations. For example, if we did not merge the kangaroo pouch measurements correctly with their matching kangaroo name and ID, the data would not be representative, and our analysis would be far less useful. Being true to the data is being true to its story. Hopefully, you are beginning to understand the value of organizing and structuring data in order to analyze it. Coming up, we will practice structuring in Python. You have learned about how structuring data can help professionals analyze, understand, and learn more about their data. Now, let's use a Python notebook and discover how it works in practice. We will continue using our NOAA lightning strike data set.
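Before we turn back to the NOAA data, here is a tiny, entirely hypothetical sketch of those structuring methods in pandas; every column name and value below is invented purely for illustration:

```python
import pandas as pd

kangaroos = pd.DataFrame({
    'kangaroo_id': [1, 2, 3, 4],
    'pouch_volume_cc': [950, 1200, 800, 1100],
    'tail_length_m': [0.9, 1.1, 0.8, 1.0],
    'total_body_length_m': [1.6, 1.9, 1.4, 1.8],
    'region': ['east', 'west', 'east', 'north'],
})

sorted_df = kangaroos.sort_values('pouch_volume_cc', ascending=False)           # sorting
extracted = kangaroos[['pouch_volume_cc', 'tail_length_m']]                     # extracting columns
filtered = kangaroos[kangaroos['tail_length_m'] < 1.0]                          # filtering rows
sliced = kangaroos.loc[kangaroos['region'] == 'east', ['total_body_length_m']]  # slicing rows and columns

# Grouping (bucketizing): bin tail lengths, then summarize body length per bin.
kangaroos['tail_group'] = pd.cut(kangaroos['tail_length_m'], bins=3,
                                 labels=['short', 'average', 'long'])
grouped = kangaroos.groupby('tail_group', observed=False)['total_body_length_m'].mean()

# Merging: combine a second field study along the shared kangaroo_id column.
second_study = pd.DataFrame({'kangaroo_id': [1, 2, 3, 4],
                             'weight_kg': [55, 70, 48, 66]})
merged = kangaroos.merge(second_study, on='kangaroo_id')
```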
For this video, we will consider the data for 2018 and use our structuring tools to learn more about whether lightning strikes are more prevalent on some days than others. Before we do anything else, let's import our Python packages and libraries. These are all packages and libraries you're familiar with: pandas, NumPy, Seaborn, datetime, and matplotlib.pyplot. For a quick refresher, let's convert our date column to datetime and take a look at our column headers. We do this to get our dates ready for any future string manipulation we may want to do and to remind us of what is in our data. As you remember, there are three columns in the data set: date, number of strikes, and center point geom, which you'll find after running the head function. Next, let's learn about the shape of our data by using the df.shape function. When we run the cell, we get 3,401,012 rows and 3 columns. Take a moment to picture the shape of this data set. We're talking about only three columns wide and nearly three and a half million rows long. That's incredibly long and thin. We'll use a function for finding any duplicates. When we enter df.drop underscore duplicates with an empty argument field, followed by dot shape, the notebook will return the number of rows and columns remaining after duplicates are removed. Because this returns the exact same numbers as our shape function, we know there are no duplicate values. Let's discuss some of those structuring concepts we learned about earlier, starting with sorting. We'll sort the number of strikes column in descending value, or most to least. While we do this, let's consider the dates with the highest number of strikes. We'll input df.sort underscore values. Then, in the argument field, type by, then the equal sign. Next, we input the column we want to sort, number of strikes, followed by ascending equals false. If we add the head function to the end, the notebook outputs the top 10 cells for us to analyze. We find that the highest numbers of strikes are in the lower 2000s. That does seem like a lot of lightning strikes in just one day, but given that it happened in August, when storms are likely, it is probable these 2000-plus strikes were counted during a storm. Next, let's look at the number of strikes based on the geographic coordinates, latitude and longitude. We can do this by using the value underscore counts function. We type in df followed by the center point geom column, then we type dot value underscore counts with an empty argument field. Based on our result, we learn that the locations with the most strikes have lightning on average every one in three days, with counts in the low 100s. Meanwhile, some locations are reporting only one lightning strike for the entire year of 2018. We also want to learn whether we have an even distribution of values or whether 108 is a notably high value for lightning strikes in the U.S. To do this, copy the same value counts function but input a colon 20 in the brackets so that you can see the first 20 lines. The rest of the coding here is to help present the data clearly. We rename the index and values columns to unique values and counts, respectively. Lastly, we'll add a gradient background to the counts column for visual effect. After running the cell, we discover no notable drops in lightning strike counts among the top 20 locations. This suggests that there are no unusually high lightning strike data points and that the data values are evenly distributed. Next, let's use another structuring method: grouping.
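Before moving on to grouping, here is a short sketch of the sorting and counting steps just described, assuming the column names number_of_strikes and center_point_geom:

```python
# No duplicate rows: dropping duplicates leaves the shape unchanged.
df.drop_duplicates().shape

# Days with the highest strike counts, most to least.
df.sort_values(by='number_of_strikes', ascending=False).head(10)

# How often each latitude/longitude point appears in 2018.
df['center_point_geom'].value_counts()

# Top 20 locations, renamed and styled for easier reading.
top20 = df['center_point_geom'].value_counts()[:20].reset_index()
top20.columns = ['unique_values', 'counts']
top20.style.background_gradient(subset=['counts'])
```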
You'll often find stories hidden among different groups in your data, like the most profitable times of day for a retail store, for instance. For this data set, one useful grouping is categorizing lightning strikes by day of the week, which will tell us whether any particular day has fewer or more lightning strikes than others. Let's first create some new data columns. We create a column called week by inputting df.date.dt.isocalendar, leaving the argument field blank, and adding .week at the end. This will assign a week number from 1 to 52 to each of the days in the year 2018. Let's also add a column that names the weekday. Type in df.date.dt.day underscore name, leaving the argument field blank. For this last part, let's input df.head again. The dates now have week numbers and assigned weekdays. We have some new columns, so let's group the number of strikes by weekday to determine whether any particular day of the week has more lightning strikes than others. Let's create a data frame with just the weekday and the number of lightning strikes. We'll do this by inputting df followed by double brackets containing weekday and number of strikes, both in single quotes. Next, we'll add one of our structuring functions, groupby, with weekday in the argument field, followed by .mean. What we're telling the notebook here is to create a data frame with each weekday and the mean number of strikes for that day. To understand what this data is telling us, let's plot a box plot chart. A box plot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles. For this data set and notebook, a box plot visualization will be the most helpful because it will tell us a lot about the distribution of lightning strike values. Most of the lightning strike values will be shown as grouped into colored boxes, which is why this visualization is called a box plot. The rest of the values will string out to either side with a straight line that ends in a T. We will discuss more about box plots in an upcoming video. Now, before we plot, let's set the weekday order to start with Monday. To code that, input g equals sns.boxplot. Next, in the argument field, let's have x equal weekday and y equal number of strikes. For order, let's use weekday order, and for the showfliers field, let's input false. Showfliers refers to outliers that may or may not be included in the box plot. If you input true, outliers are included; if you input false, outliers are left off the box plot chart. Keep in mind we aren't deleting any outliers from the data set when we create this chart; we're only excluding them from our visualization to get a good sense of the distribution of strikes across the days of the week. Lastly, we will plug in our visualization title, lightning distribution per weekday for 2018, and click run cell. Now you'll discover something really interesting. The median, indicated by these horizontal black lines, remains the same on all of the days of the week. For Saturday and Sunday, however, the distributions are both lower than the rest of the week. Let's consider why that is. What do you think is more likely: that lightning strikes across the United States take a break on the weekends, or that people do not report as many lightning strikes on weekends? While we don't know for sure, we have clear data suggesting the total quantity of weekend lightning strikes is lower than on weekdays.
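A minimal sketch of the weekday grouping and box plot described above, with column and variable names assumed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df['week'] = df['date'].dt.isocalendar().week   # ISO week number
df['weekday'] = df['date'].dt.day_name()        # 'Monday', 'Tuesday', ...

# Mean number of strikes recorded for each day of the week.
df[['weekday', 'number_of_strikes']].groupby('weekday').mean()

# Distribution of strike counts by weekday, outliers hidden for readability.
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
g = sns.boxplot(data=df, x='weekday', y='number_of_strikes',
                order=weekday_order, showfliers=False)
g.set_title('Lightning distribution per weekday (2018)')
plt.show()
```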
We've also learned a story about our data set that we didn't know before we tried grouping it this way. Let's get back into our notebook and learn some more about our lightning data. One common structuring method we learned about in another video was merging, which you'll remember means combining two different data sources into one. We'll need to know how to perform this method in Python if we want to learn more about our data across multiple years. Let's add two more years to our data: 2016 and 2017. To merge three years of data together, we need to make sure each data set is formatted the same. The new data sets do not have the extra columns, week and weekday, that we created earlier. So to merge them successfully, we need to either remove the new columns or add them to the new data sets. There's an easy way to merge the three years of data and remove the extra columns at the same time. Let's call our new data frame union underscore df. We'll use the pandas function concat to merge, or more accurately concatenate, the three years of data. Inside the concat argument field, we'll type in df.drop to pull the weekday and week columns out, and we input axis equals one so that drop removes columns rather than rows. Lastly, and most essentially, we add the name of the data frame we are concatenating, df underscore two. We also input true for ignore index because the two data frames already align along their columns. And now you've just learned to merge three years of data. To help us with the next part of structuring, create three date columns following the same steps you used previously. We've already added the columns for year, month, and month underscore text to the code. Now let's add all the lightning strikes together by year so we can compare them. We can do this by simply taking the two columns we want to look at, year and number of strikes, and grouping them by year with the function .sum on the end. You'll find that 2017 did have fewer total strikes than 2016 and 2018. Because the totals are different, it might be interesting, as part of our analysis, to see lightning strike percentages for each month of each year. So let's call this lightning by month, grouping our union data frame by month text and year. Additionally, let's aggregate the number of strikes column by using the pandas function named agg. In the argument field, we place our column name and set our aggregate function equal to sum so that we get the totals for each of the months in all three years. When we input the head function, we have the months in alphabetical order along with the sums for each month. We can do the same aggregation for year and year strikes to review those same numbers we saw before, with 2017 having fewer strikes than the other two years. We created those two data frames, lightning by month and lightning by year, in order to derive our percentages of lightning strikes by month and year. We can get those percentages by typing lightning by month dot merge, with lightning by year comma on equals year in the argument field. You'll find that the merge function is merging lightning by year into our lightning by month data frame according to the year. Lastly, we can create a percentage lightning per month column by dividing the monthly number of strikes by the yearly number of strikes, after which we'll multiply by 100 to get a percentage. Now, when we use our head function, we have a restructured data frame. To more easily review our percentages by month, let's plot a data visualization. For this one, a simple grouped bar graph will work well.
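Here is a rough sketch of the concatenation, aggregation, merge, and percentage steps described above, plus the grouped bar plot walked through next; the data frame name df_2 (standing in for the 2016-2017 data) and all column names are assumptions:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Concatenate 2018 (df) with 2016-2017 (df_2), dropping the extra columns so
# the data frames line up; axis=1 tells drop to remove columns, not rows.
union_df = pd.concat([df.drop(['weekday', 'week'], axis=1), df_2],
                     ignore_index=True)
union_df['date'] = pd.to_datetime(union_df['date'])

# Recreate the date-part columns on the combined data.
union_df['year'] = union_df['date'].dt.strftime('%Y')
union_df['month_txt'] = union_df['date'].dt.month_name().str.slice(stop=3)

# Totals by year, and by month within each year.
lightning_by_year = (union_df.groupby('year')
                     .agg(year_strikes=('number_of_strikes', 'sum'))
                     .reset_index())
lightning_by_month = (union_df.groupby(['month_txt', 'year'])
                      .agg(number_of_strikes=('number_of_strikes', 'sum'))
                      .reset_index())

# Merge the yearly totals in, then derive each month's share of its year.
percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')
percentage_lightning['percentage_lightning_per_month'] = (
    percentage_lightning['number_of_strikes']
    / percentage_lightning['year_strikes'] * 100
)

# Grouped bar plot: one bar per month, colored by year.
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.figure(figsize=(10, 6))
sns.barplot(data=percentage_lightning, x='month_txt',
            y='percentage_lightning_per_month', hue='year', order=month_order)
plt.xlabel('Month')
plt.ylabel('% of lightning strikes')
plt.title('% of lightning strikes each month (2016-2018)')
plt.show()
```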
We'll adjust our figure size to 10 by 6 first. Then we use the Seaborn library bar plot with our x-axis as month text and our y-axis as percentage lightning per month. For some color, we'll have our hue change according to the year column, with the bars ordered by the month order column. Finally, let's input our x and y labels and our title and run the cell. When you analyze the bar chart, August 2018 really stands out. In fact, more than one third of the lightning strikes for 2018 occurred in August of that year. The next step for a data professional trying to understand these findings might be to research storm and hurricane data to learn whether those factors contributed to a greater number of lightning strikes for this particular month. Now that you've learned some of the Python code for the EDA practice of structuring, you'll have time to try it out yourself. Good luck finding those stories in your data. You know that moment in the puzzle process when you're starting to see the full picture take shape? Like, maybe you have the frame pieces connected but still have a way to go. So far, you've learned about some practices of EDA. You've learned how to gather, analyze, organize, and structure data. Coming up, you will continue to put these pieces together, and the picture will become clear. You're making great progress. You've learned about data sources, data types, and data formats, and the importance of knowing the basics about your data. We've worked through how to use Python to uncover big-picture understandings of your data sets, like column headers, dtypes, size, and shape, as well as basic visualizations. Along with these parts of EDA and discovering, we've learned about date and time transformations in Python. As for the EDA practice of structuring, you've learned to create order from chaos with functions like sorting, extracting, filtering, slicing, joining, merging, and grouping. In Python, you practiced applying these functions to data sets that are similar to those you might work with in your career as a data professional. Data professionals commonly use each of these concepts, and you will continue to build your skills with them as you search for the stories hidden in data. As a data professional, I can say that cleaning and organizing your data set is 90% of the battle. Once you've structured your tables, uncovering insights and trends can be like a walk in the park. I mentioned earlier how I used to work as a data analyst in healthcare consulting and would analyze vast amounts of medical record data to help recommend treatments for patients with severe illnesses. Typically, data is hosted across multiple different sources and tables, making them difficult to merge together. Plus, medical records may be organized so that the medications a patient took and their corresponding conditions are separated. So, in order for me to understand what type of treatment a patient took for which illness, the side effects they experienced, and whether the treatment worked, I had to merge hundreds of tables together. Once I did, it was incredibly easy for me to compare and contrast the different types of treatments taken and their impact on each patient. You've started to learn the skills needed to complete similar tasks. Along the way, you also learned some key workplace skills, like the timing for communicating updates and posing questions to project stakeholders, managers, and subject matter experts.
We also talked about making and testing hypotheses on your data sets, which narrows the scope and sharpens the detail of your data-driven stories. In short, you are understanding more and more what it means to perform EDA on a data set. Great work! Later in the program, we'll discuss how to put together the rest of the data story by learning what to do with missing data and outliers, as well as making and using visualizations to help tell that story. I'll meet you there. Let's imagine you work at a board game manufacturer. As the Quality Assurance Manager, your job is to make sure every game is put together correctly. One day, you discover that the machine that produces the games is having technical issues, causing misprinted cards and incorrect counts of game pieces. Your manager asks you and the rest of the Quality Assurance team to go through the affected boxes and replace misprinted cards and missing pieces. Your goal is to make sure each box is usable. Searching through game boxes for missing and incorrect game pieces is similar to cleaning, joining, and validating your data sets as part of exploratory data analysis, or EDA. Earlier, you learned how to discover and structure data in order to understand and tell its stories in impactful ways. In this part of the course, we will cover three of the other practices of EDA: cleaning, joining, and validating data. Although there are many different ways to clean, join, and validate data, we will focus on missing values and outliers, the need for transforming categorical data into numerical data, and the importance of input validation. Of course, we won't just talk about the concepts of cleaning, joining, and validating. You will learn how to apply them in a Python notebook setting. We'll use data sets that are comparable to what you will see on the job. Along the way, we'll go through some important tips for improving your workplace skills, like communication. We will discuss when to communicate with stakeholders and engineers about missing or outlier values, and about the ethical implications you must consider when dealing with missing and outlier data values. These practices of EDA, along with effective communication and Python coding, are all essential to preparing a data set for the next steps in the PACE workflow. By following PACE while performing the six practices of EDA, you will not only save time and energy on future processes, but you will also be more effective in finding and telling the data story. One of my first jobs at Google was as a quantitative analyst on Google Translate. It was my responsibility to help identify areas that we could improve to help users. I recall a project where I was overly eager to dive into the data. I spent countless hours identifying opportunities, ranging from when a user first downloads the app to their second, third, and twentieth time using the product. I was proud of my findings, and I developed a long presentation to share with my executives. Now, while my insights were well received, the team was reluctant to invest in any of my ideas because there were so many and they weren't sure which ones to prioritize. I realized after that meeting that I needed a plan. I had spent so much time mining for insights that I was suffering from what my manager called analysis paralysis. Instead, what I needed to do was find out which issues were affecting the largest number of users and how severe those issues were.
By organizing my thoughts using this strategy, I was then able to hone my analysis in on one specific issue and then recommend clear actions to help. Like in your imagined job as a Quality Assurance Manager at the beginning of this video, as a data professional you may be responsible for cleaning up data sets so that they are ready to use. You won't always be sure what obstacles you'll encounter when cleaning data, so let's get started on some of the most common. Remember earlier when you imagined yourself as a quality assurance manager at a company assembling a board game? When you discovered that the machine that produces the games was having technical issues and causing game boxes to have pieces missing, you and your company could have chosen a number of different paths forward. One, you could have thrown away the impacted game boxes. Two, you could have left them as is and sold them at a discounted rate. Or three, and the one you chose in the example, you rummaged through the boxes and corrected the mistakes. There was a plan, strategy, or protocol for dealing with missing pieces. The same should be true for any missing values you find in a data set. When you have missing entries in your data set, you have to decide on a plan for dealing with them. Even more than that, you must be ready to communicate with stakeholders if missing values impact your analysis. Missing data, which is often encoded as NA, NaN, or a blank, is defined as a value that is not stored for a variable in a set of data. This is different from a data point of zero; I'll explain more on that in a bit. Data professionals often experience the challenge of missing data. Every data set is different, so there are a wide variety of reasons for values to be missing, everything from a computer error during a data upload to someone simply forgetting to input it. Depending on the number of entries in your data set and the quantity of missing values, the impact of data fields with not-a-number values, or NaNs, can range from negligible to substantial. The impact will also determine how you communicate with stakeholders or clients, ranging from a note at the bottom of a data visualization about the fact that there's missing data, to a face-to-face meeting about how analyses cannot be completed due to the large number of missing values. Here's an example of the impact missing data has on a data set and the ethical questions missing data creates. Imagine you're a data professional trying to learn more about sleep habits. You email a questionnaire to, say, 100 people and all of them complete it. One of the questions asks: do you sleep exactly 8 hours each night? The answer choices are yes, no, or unsure. Nine people answer yes, nine answer no, and 15 answer unsure. Because only 33 people responded to that question, it's hard to draw a definitive conclusion about the sleep habits of all 100 people. Since 100 people completed the questionnaire and only 33 responded to this question, 67% of the replies would be considered missing. These missing values greatly impact the question's resulting data. It would be unwise for a data professional to try to use the 33% response rate in order to draw conclusions about the entire population. If you encounter this level of missing values while working as a data professional, you should communicate the inability to complete any analysis and suggest possible solutions. Another challenge that can come up during EDA is with the value zero.
In some data sets, a zero could be considered a missing value, but in other data sets it could be a legitimate data point. Likewise, in some data sets a NaN or blank space might be a mistake, as in someone forgot to fill it in, or the field may have been left blank because the column doesn't apply to that data point. Data professionals who are unsure whether a blank space is intentional should ask a stakeholder or the owner of the data to confirm the space's legitimacy. Whenever you find missing data, you have a choice to make about how to handle it. As a data professional, you should consider how the missing data might impact stakeholders and who should be made aware. Here are four common ways to handle missing data. One, you request that the missing values be filled in by the owner of the data. Two, you delete the missing columns, rows, or values, which works best if the total count of missing data is relatively low. Three, you create a NaN category. Or four, you derive new representative values, such as taking the median or average of the values that aren't missing. First, we'll talk about filling in the missing data. If there are large quantities of missing values in the data you receive, the best method to handle the missing data would be to contact the owners of the data and request that the data be filled in. For example, in our sleep example, you could send a follow-up request for people to log a response to that particular question. Or, you could rephrase the question: how many hours do you sleep each night? However, with many data sets you'd see while working as a data professional, the option to get missing data filled in may not be feasible for a variety of reasons. For example, if you're working on a project that is time sensitive, you might not have time to gather more data before your deadline. Or the data cannot be retrieved because the study happened too long ago. If filling in missing data isn't an option, you can also choose the option of deleting the missing data. Though some might assume it's not appropriate to delete data, removing rows or whole columns of data is ideal if there isn't a large percentage of NaNs or if the values are not going to impact the business plan for the data. One thing to watch out for, though, is discarding missing data that is not missing at random. Deleting values that have been left blank intentionally can skew the results of your analysis. For example, in the sleep questionnaire, if you were to remove all the people that left that question blank, you'd delete a majority of the data. A third option is to make the NaNs their own category, which is a good strategy if the missing data itself is categorical rather than numerical. For example, in the sleep questionnaire, if you were to put all the non-responses into their own category called answer not recorded, you would be creating a category for the missing data. Finally, there's the strategy of filling in the missing data by creating a representative value. This strategy can be more useful with business plans that call for a predicted value or forecast. There are multiple methods within this filling-in strategy to choose from, including the four most common: forward filling, backward filling, deriving mean values, and deriving median values. We will talk about how to do all these missing data operations using Python in another video, and a brief sketch of the four strategies follows below. These are the most common methods for handling missing data. Choosing which method to use comes with experience, intuition, and reasoning.
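As a minimal sketch of those four options in pandas (the tiny survey data frame and its column names are invented purely for illustration):

```python
import pandas as pd
import numpy as np

survey = pd.DataFrame({'sleeps_8_hours': ['yes', 'no', np.nan, 'unsure', np.nan],
                       'hours_slept': [8.0, 6.5, np.nan, 7.0, np.nan]})

# 1. Request the missing values from the data owner (a conversation, not code).

# 2. Delete rows with missing values (best when very little data is missing).
dropped = survey.dropna()

# 3. Give missing categorical values their own category.
survey['sleeps_8_hours_filled'] = survey['sleeps_8_hours'].fillna('answer not recorded')

# 4. Derive representative values for missing numeric data.
forward_filled = survey['hours_slept'].ffill()                              # forward fill
backward_filled = survey['hours_slept'].bfill()                             # backward fill
mean_filled = survey['hours_slept'].fillna(survey['hours_slept'].mean())    # mean
median_filled = survey['hours_slept'].fillna(survey['hours_slept'].median())  # median
```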
Sometimes you'll have the opportunity to use more than one option with a data set. Every data set and project plan will be different, so you will be making this determination each time you encounter new data. Sometimes you'll want to confer with peers, managers, or stakeholders in order to make a decision. The impact missing values have on your analysis should be the determining factor in whom you contact. As a data professional, it is essential that you are thoughtful and intentional about how missing data are addressed. Consider the quantity of NaNs and their importance in relation to the project plan. Ask yourself: how will this approach impact this data set, and what are the ethical considerations? Just like the board game scenario, it is essential to determine a strategy and decide on a plan. Now that we've reviewed the importance of having a plan to address missing data, let's talk about how to identify and decide how to work with missing data in a Python notebook. We'll use data you're familiar with: the group of NOAA lightning strike data sets. The goal of this video is to learn how to identify missing data. You'll be provided two different slices of data from August 2018. One slice will have the columns date, center point geom, latitude, longitude, and number of strikes. The second slice of the data will include the same columns as the first slice, along with zip code, state, city, and state code. You will learn, through comparisons of these two data sets, how to find missing data. Before we look for any missing values, we need to import some Python libraries and packages. We'll use pandas, numpy, seaborn, datetime, and pyplot from matplotlib. We will begin by looking at only the number of lightning strikes for August 2018. Using less data in a Python notebook allows you to spend more time coding and less time waiting for the code to run. For this exercise, our first data set will have two additional columns called latitude and longitude, which were points pulled from the center point geom column. Let's explore the column headers and the overall size of this first data set using the functions that you're already familiar with, head and shape. We'll find the two additional columns we just mentioned, latitude and longitude, as well as a shape of 717,530 rows and five columns. Let's put this aside as our first data frame, saved as df. Next, we'll use the second data set for August 2018. This one has columns for zip code, which is a postcode delivery area number in the US, as well as city, state, and a column titled state code that has the two-letter abbreviations of states in the US. We'll call the second data frame df zip, so we don't get it confused with the first data set we created earlier. Let's run the same two functions, head and shape, on df zip. The head function returns what we expected out of the data set: the data columns zip code, city, state, state code, center point geom, and number of strikes. The shape function, however, returned 323,700 rows and seven columns. Given that this data frame is also from August 2018, we expected to see the same number of rows as the first data frame, 717,530. To further explore this, let's put the two data frames together with the merge function. We'll need to create a new data frame. We'll call it df joined, followed by an equal sign and df.merge. This df.merge indicates that we want to merge the first data frame we saved, named df, with another data frame.
For the merge, put df underscore zip in the argument field, which is the data frame we want to merge with df. Next, fill in the parameters for the merge, including how equals left, followed by a comma. Finally, input the last parameter, on, equal to the two column names date and center point geom. These two parameters, how we want to merge and which columns to merge on, tell Python which way to join the data. When we run the head function on our new data frame, we find NaNs listed in the columns zip code, city, state, state code, and number of strikes y. Now that the data is merged, let's use a basic function we learned in a previous lesson to search for NaNs. When we run the cell, we find that the total count of lightning strikes has been divided in two: number of strikes x and number of strikes y. When we merged the two data frames, pandas automatically renamed the number of strikes columns, with x coming from the first data frame and y from the second; rows with no match in the second data frame are the ones left with missing entries. We can find the total amount of data that's missing using this next piece of code. We'll create a data frame called df null geo by taking our joined data frame, df joined, and then adding the pandas function isnull, all one word. The term null indicates that the data is missing. This function pulls all of the rows with missing values from the df joined data frame. We already know which of the columns have missing values, and since we're interested in finding the total, we can use any one of those columns. For this code, we selected state code. After that, we'll use the shape function to give us the total rows and columns, which comes to 393,830 and 10, respectively. For even more detail, use the info function, which we used in another video. When we input df joined .info with the argument field left blank, we get the column names and another column called non-null count. Non-null count is the total number of data entries for a data column that are not blank. A quick check of the non-null count column helps us confirm which columns have missing data: zip code, city, state, and state code. In fact, if we subtract the non-null count of number of strikes y from that of number of strikes x, the result is the number we identified earlier: 393,830. Let's take a look at the top portion of the new data frame we created, df null geo, using the head function. The output tells us exactly what we expected to find in the data frame we created. In the top five lines of data, the columns zip code, city, state, state code, and number of strikes y have NaN values in the cells. Now that we've pinpointed exactly which values are missing in our data frame, the last thing we'll do in this notebook is learn how these missing values impact our data. The best way to do that is to create a data visualization. A plotted map will help us see where the majority of the missing values are located geographically. To design the map, let's start by creating another data frame. This one we'll call top underscore missing. We are gathering only the data columns we will need in order to plot a geographic visualization: latitude, longitude, and number of strikes x. As you'll recall from earlier in this video, number of strikes x includes all of the 717,530 data rows, while number of strikes y only has values for a segment of those 717,530 rows; the rest are missing data in the zip code, state, and state code columns. You'll find that it's helpful to group the columns by latitude first, then longitude. This will make the data easier to plot. Lastly, we'll sort the lightning strikes x column by the sum of its values.
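Here is a hedged sketch of the merge, the missing-value check, and the map that follows; the data frame and column names (df, df_zip, state_code, number_of_strikes_x, and so on) are assumptions based on this walkthrough:

```python
import plotly.express as px

# Left merge keeps every row of df; unmatched rows get NaNs in df_zip's columns.
df_joined = df.merge(df_zip, how='left', on=['date', 'center_point_geom'])
df_joined.head()

# Rows where the second data set had no match.
df_null_geo = df_joined[df_joined['state_code'].isnull()]
df_null_geo.shape       # (393830, 10) in the walkthrough above

df_joined.info()        # non-null counts confirm which columns have gaps

# Locations with missing zip/state information, grouped and sorted.
top_missing = (df_null_geo[['latitude', 'longitude', 'number_of_strikes_x']]
               .groupby(['latitude', 'longitude'])
               .sum()
               .sort_values('number_of_strikes_x', ascending=False)
               .reset_index())

# Plot only the busier points (more than 300 strikes), limited to the US map.
fig = px.scatter_geo(top_missing[top_missing['number_of_strikes_x'] > 300],
                     lat='latitude', lon='longitude',
                     size='number_of_strikes_x')
fig.update_layout(title_text='Missing data', geo_scope='usa')
fig.show()
```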
In the ascending field, we'll input false because we want the largest sums of lightning strikes at the top of the data frame. Finally, let's take a look at what we built using the head function again. After running the cell, the first 10 rows show the most strikes are falling on and around the same latitudes and longitudes. Let me give you a helpful hint before we plot a visualization using this data frame. You should import the express package from Plotly first. Express is a helpful package that speeds up coding by doing a lot of the back-end work for you. If we don't use Express for this particular data set, which has hundreds of thousands of points to plot, your run cell times could be long, or the code could even break. Now it's time to create the map. Our format for the graph will be the Plotly Express chart type called scatter geo. As indicated by its name, this graph type is used for scatter plots on a geographic map. In the argument field, we'll input the data frame we just created, top missing. For the parameters in the argument field, we want to filter the values to be only those latitudes and longitudes with more than 300 lightning strikes. Naturally, we plug latitude and longitude into their appropriate parameter spots. Then, for the size parameter, we plug in the final column, number of strikes x. Lastly, let's use the update layout function on the figure to give the plot the title missing data. Now, we create it. It's a nice geographic visualization, but we really don't need the global scale. Let's scale it down to only the geographic area that we are interested in, the United States. So, let's copy the same code we just wrote. This time, though, we add the parameter geo underscore scope equals USA. This will limit the geographic scope to just the United States of America. Then, we can plot it again and check the output. The resulting map shows a majority of these missing values cropping up along the borders or in spots over bodies of water, like lakes, the ocean, or the Gulf of Mexico. Given that the missing data were in the state abbreviation and zip code columns, it makes sense why those data points were left blank. There are no zip codes for bodies of water. You'll also notice some other locations with missing data that are not over bodies of water. These types of missing data, latitude and longitude points on land, are the kind of missing data you would want to reach out to the NOAA about. What's interesting about cleaning the lightning strike data and looking for missing data values is that we learned something new in the process. We found one of the stories hidden in the data: the data values were missing because the lightning strikes were over bodies of water. It goes to show that data cleaning may sound tedious, but you never know what you'll learn. Hi, my name is Remi, and I'm a customer engineer for Google Cloud. A customer engineer is someone who focuses on a particular technical field, for me, it's data analytics and machine learning, and is the expert in that field who helps their customers answer questions and build things related to data analytics. I had, and this is no exaggeration, around 30 different jobs before I got into the field that I'm currently in. I was a cashier, a hostess, a cocktail waitress, an office assistant, and quite frankly, I wasn't very good at any of those jobs. If something's repetitive, I can't concentrate, and then I end up making silly mistakes.
So that's part of what's so great about data science: for a lot of the minutiae and repetitive tasks, software does it for you, or you can write a script to do it. There is definitely no typical day in the life of a customer engineer. Usually I will start the day by looking at emails. I'll usually have some kind of question from a customer or another colleague related to data that I can help answer. There are a lot of informational emails about new features, products, or techniques. I have to read those and learn about them because it's my job to be the person who knows about all of these things for our customers. I'll also typically have a few meetings where I'm working with our account and customer teams, where we try to strategize how we're going to help customers and get them to do new things with our tools. A new feature that I've been working with is related to our data warehouse, BigQuery. You can actually create machine learning models within BigQuery, so you don't have to move the data anywhere. I have a variety of customers that use BigQuery machine learning, so I can come to them, share this new feature, and show them how it actually works. And if, while they're using it, an issue comes up or they get some kind of error, I can help troubleshoot that problem for them. There's so much going on in this field of data, and because it's also new, it's going to feel overwhelming. I've been doing it for almost a decade and I still get overwhelmed on a daily basis. The best thing you can do is pick a software, pick a language, maybe a specific area, and just focus on that. Get really good at one thing and then you can expand from there. As a child, did you ever play a puzzle game in which you were supposed to look at a picture and find an object that doesn't belong? Maybe sometimes the object was obvious, but other times it may have been more difficult. For example, if you were given a picture of a bustling marketplace full of people selling fruits, vegetables, and grains, it would probably take some time to find the item for sale that's not in the right place. Similar to visiting a busy market, when you encounter a data set that hasn't been cleaned, it can be really challenging. Data may at first seem structured, but it often can be hard to determine whether there are any data points that are measurably different from the others. These extreme data observations, the ones that stand out from the rest, are known as outliers. Outliers are observations that are an abnormal distance from other values or an overall pattern in a data population. In this video, we will discuss the three main types of outliers and why it is important for data professionals to identify them in the data we analyze. As a data professional, you should be fully aware of the means and ends, the highs and lows, and the extreme points of your data across every variable in the data set. These values will often be your outliers. It is essential that you recognize them, what their values are, and where they are. Otherwise, they will likely skew any conclusions you draw or models you build from the data. There are three different types of outliers we will discuss in this video: global outliers, contextual outliers, and collective outliers. Later, you will learn how to find outliers and work with them in Python. The first outliers we will discuss tend to be the easiest data points to detect. Global outliers are values that are completely different from the overall data group and have no association with any other outliers.
They may be inaccuracies, typographical errors, or simply extreme values that stand apart from the rest of the data set. For example, if we had a set of human heights, 1.7 meters, 1.9 meters, 1.6 meters, 1.8 meters, 7.9 meters, and 1.7 meters, the outlier is fairly obvious. Typically, global outliers should be thrown out before creating a predictive model. Contextual outliers can be trickier to spot. Contextual outliers are normal data points under certain conditions but become anomalies under most other conditions. As an example, movie sales are expected to be much larger when a film is first released. If there is a huge spike in sales a decade later, that would typically be considered abnormal, or a contextual outlier. These outliers are more common in time series data. Another example might be an outlier only in a specific single category of data. For instance, 2.5 meters may be a normal enough size for a category called mammal heights, but if the mammal turns out to be a mouse, we would likely consider this an outlier, despite it fitting in with other mammal heights. Lastly, we have collective outliers. Collective outliers are a group of abnormal points that follow similar patterns and are isolated from the rest of the population. Think of a parking lot at a store. It is not uncommon to have cars or scooters coming and leaving consistently during the hours a store is open, but to have a full parking lot after the store is closed would be considered a collective outlier. It could be that there is a company party or a local event nearby, which would explain the outlier of cars parked in the store parking lot after hours. One useful way to find these different types of outliers in our data is something we've already discussed quite a bit: visualization. It will be easier to see any big dips or giant blips in the data when we plot it on a line graph or a bar chart. No matter how you discover that your data contains global, contextual, or collective outliers, it is essential that your EDA includes a strategy for dealing with them. It will be up to you to decide how outliers need to be represented or whether they need to be removed completely. The decision on what to do must always be made in the context of the data set and the business plan for the data. As mentioned in other videos, always consider the ethical implications of any decision you make about outliers. When you're working as a data professional, it may be tempting to remove outliers to improve results, predictions, or forecasts, but you will need to ask yourself: does that change the data story? For example, let's imagine that you are a data professional at a retail business. You review two years of sales data and find that there's one month where sales of a typically popular item are down dramatically. You might at first assume that there's a typo in the reported data, but as a diligent data professional, you ask questions about it. You discover that several top salespeople were on leave during that month and that your company advertised a new product, which slowed sales of the existing item. It is likely that both of these issues led to a drop in sales. Rather than dismissing the drop in sales as an outlier, you recognize that this information will be helpful for the business team. Using this information, managers can prevent a future drop in sales numbers ahead of product launches by hiring additional sales team members or limiting requested employee leave during those critical times. Later, you'll learn how to find outliers and work with them in Python.
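As a quick, hypothetical illustration of how obvious a global outlier can be, here is a tiny sketch that flags the implausible value in the human heights example above using the 1.5 times IQR rule, a rule we'll cover in more detail later in this section.

```python
import numpy as np

# The example human heights from above, in meters.
heights = np.array([1.7, 1.9, 1.6, 1.8, 7.9, 1.7])

# Compute the interquartile range and the standard 1.5 * IQR fences.
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as a global outlier.
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
print(outliers)   # [7.9] -- the clearly implausible height
```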
Earlier, you learned about three types of outliers: global, contextual, and collective. Now that we've discussed outliers on a conceptual level, I'll show you how to identify them and analyze their impact in a Python notebook. Our goal will be to identify outliers across a 33 year span of total lightning strike counts in the United States. As usual, let's start by importing our libraries and packages: Pandas, NumPy, Seaborn, and Pyplot from Matplotlib. We'll continue using our NOAA Lightning Strike data set. This time, we'll group the total sum of the lightning strikes in the United States by year, from 1987 to 2020. Let's first reveal the top 10 rows of data using the head function. The data set has two numeric columns: year and number of strikes. As you'll notice, we're dealing with some fairly large numbers for the lightning strike totals. It would be helpful to make them a bit shorter and easier to read in visualizations. To do that, let's write a readable numbers function. Below the function, there is a long explanatory text inside a triple set of quotation marks. This is called a documentation string, or doc string. A documentation string, or doc string, is a line of text following a method or function that is used to explain to others using your code what this method or function does. A doc string represents good documentation practice in Python. It makes the code easier to understand and can be easily exported to create library documentation. A doc string was already provided for this readable numbers function. Next, we'll form an if else statement. We want to code any number above six digits to be formatted to one decimal place followed by an M. Under the else statement, we are formatting numbers with more than three digits to the same one decimal place followed by the letter K. Under these statements, we define the new column we want, titled number of strikes readable. We then use the apply function to apply our readable numbers function to only the number of strikes column in the data frame. Using the head function, we'll see the output of our newest code. The new column has readable values of 15.6 million, 209,000, and 44.6 million in its first three rows, where before there were just long strings of numeric digits. This indicates that we coded the readable numbers function correctly. Now that we have more readable values, let's plot our data. As you'll recall, our goal is to find any outliers among the total number of lightning strikes in 33 years of NOAA data. One way to find any outliers would be to investigate the mean and median of this data. The mean is the average of the given values and the median is the middle point of the given values. When we code for these two values, we use the numpy functions mean and median. We're using the readable numbers for simplicity. The resulting output is a mean of 26.8 million and a median of 28.3 million. With the mean being about 2 million strikes less than the median, we suspect the data distribution is likely skewed to the left. The left side of the distribution will be a good place to investigate. One effective way to visualize outliers is a box plot. As we discussed in a previous video, a box plot divides the distribution of data points into four main quadrants, or quartiles. The box plot will be helpful for visualizing and confirming the outliers we suspect are there. Let's build the box plot and look at it in more detail. To design a box plot using Python, we'll use the seaborn box plot function.
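Here is a minimal sketch of the readable numbers function and the mean and median check described above, assuming a data frame df with a number of strikes column; the exact formatting rules and column names in the notebook may differ slightly.

```python
import numpy as np
import pandas as pd

def readable_numbers(x):
    """Return a large number as a short, readable string (for example, 44.6M or 209.0K)."""
    if x >= 1_000_000:
        return f'{x / 1_000_000:.1f}M'
    elif x >= 1_000:
        return f'{x / 1_000:.1f}K'
    return str(x)

# Apply the formatter to create a human-readable column (column names assumed).
df['number_of_strikes_readable'] = df['number_of_strikes'].apply(readable_numbers)
print(df.head())

# Comparing the mean and median gives a first hint about skew in the distribution.
print('Mean:', readable_numbers(np.mean(df['number_of_strikes'])))
print('Median:', readable_numbers(np.median(df['number_of_strikes'])))
```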
We'll use the number of strikes column from our data frame as the basis for the plot. Next, we'll set the x tick labels based on the readable numbers function we created earlier. X tick labels simply allow us to give names to the x marks so that they are read more easily. Lastly, we input the x and y labels and the title. You'll recall that the purpose of a box plot is to show the distribution of values separated into quadrants, or quartiles. What we are most focused on in our plot are the two points beyond the far left of the line, which are below 10 million. This data visualization is showing us very clearly that there are outliers included in our group of data points. The blue rectangle in your box plot is called the interquartile range. This range represents the difference between the third quartile and the first quartile of a set of data values. A standard rule in statistics, which you'll learn more about in another course, is that any data point that falls beyond 1.5 times the interquartile range from the box is considered an outlier. This next piece of code will help us define what value lies 1.5 times the interquartile range below the box. We'll use the quantile function to define what we'll call percentile 25 and percentile 75. We enter our data column, number of strikes, for our data frame and then 0.25 and 0.75 in the argument fields. The values that occur between the 75th and 25th percentiles are the interquartile range. We create a statement for the interquartile range, IQR, defining IQR as percentile 75 minus percentile 25. Now let's define two statements, upper limit and lower limit. Upper limit is equal to percentile 75 plus 1.5 times IQR. Lower limit is percentile 25 minus 1.5 times the IQR. Lastly, let's have Python provide the exact value of the lower limit. We'll input print, followed by the text lower limit is, a colon, a plus sign, and then readable numbers with lower limit in the argument field. After running the cell, we learn that the lower limit is 8.6 million. Don't worry if you're still trying to understand the interquartile range. This concept will be explored in more depth later in the program. Next, let's plot the values that lie below our lower limit of 8.6 million. Place the number of strikes column inside the brackets next to df. Then, use the less than symbol followed by the lower limit we just derived. And now, here are the outliers. One great way of seeing these outliers in relation to the rest of the data points is a data visualization. For this plot, let's do a scatter plot. You may recall that a scatter plot represents relationships between different variables with individual data points, without a connecting line. To start the scatter plot, let's first add labels to each point on the plot. We'll define them using the add labels function for the x and y points. The plt.text function allows us to define how we want the text of each data point to appear. In this case, we tell the text to start 0.5 pixels to the left of the data point and to start 0.5 pixels above the data point. We also format the number values using our readable numbers function from the beginning of the video. Next, let's add color to the data points. Let's make the outlier points red and the other points blue. To do that, we'll define colors by the lower limit. For those that fall below the previously defined lower limit, we'll code them as r for red, and everything else as b for blue. Our next line of code should feel familiar. We start by configuring our visualization size.
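To summarize the box plot and interquartile range steps just described, here is a rough sketch; it reuses the readable numbers function from the earlier sketch and assumes the same data frame and column name.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the yearly strike totals (column name assumed).
ax = sns.boxplot(x=df['number_of_strikes'])
ax.set_xlabel('Number of lightning strikes')
ax.set_title('Yearly number of lightning strikes')
plt.show()

# Quartiles, interquartile range, and the 1.5 * IQR fences.
percentile_25 = df['number_of_strikes'].quantile(0.25)
percentile_75 = df['number_of_strikes'].quantile(0.75)
iqr = percentile_75 - percentile_25
upper_limit = percentile_75 + 1.5 * iqr
lower_limit = percentile_25 - 1.5 * iqr
print('Lower limit is: ' + readable_numbers(lower_limit))

# The rows below the lower fence are the candidate outliers.
print(df[df['number_of_strikes'] < lower_limit])
```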
In this case, we code 16 and 8 inches, which are the default units for this function. Next is our code for the scatter plot: ax.scatter. The ax refers to axes, which tells the notebook to plot the data on an x versus y graphic. Inside the argument field, we first input the x-axis and then the y-axis, which are year and number of strikes, respectively. For the rest of the parameters, we fill in the x and y labels and the title. We finish the parameters of our scatter plot by defining the x-axis tick labels and setting the rotation of the x-axis ticks to 45 degrees. The resulting plot shows us that the years 1987 and 2019, the two values in red, are the two outliers. Now that we've narrowed our scope of the outliers to just two years, 1987 and 2019, we can do a little more investigating. Let's start with the 2019 lightning strike data. If we import just the 2019 data, the first thing we'll want to do, as we learned in a previous video, is convert our date column to date time. Next, we'll create some new columns in our data frame, month and month text. You'll notice that, for the month text, we use the str.slice function to cut the month names down to just the first three letters. Finally, we create a data frame that groups the total number of lightning strikes for each month. The result explains why the year is an outlier. For the 2019 data set, only lightning strikes from the month of December have been inputted. If you were a data professional tasked with calculating lightning strike risk, you would first research the documentation available to see if there's a reason given. If you don't find anything based on your research, you would then ask the NOAA why lightning strikes have not been recorded for the other 11 months of 2019. When we run the same code for 1987, the resulting output is different from 2019. We find there are lightning strikes recorded for each month in 1987. This difference, that 2019 does not have data for all months while 1987 does, helps us know how to handle these two outliers. For 2019, it would make sense to exclude it from our analysis. As for 1987, we recognize that it is indeed an outlier, but it should not be excluded from our analysis because it does have lightning strike totals included for each month of the year. Now, before we end, let's return to the mean and median values that we calculated earlier. First, let's create a new data frame including only the data points that exclude the outliers we identified. Our goal here is to show the mean and median of the data set excluding the outliers. To do this, we'll use the data frame without outliers that we just created, along with our readable numbers function, keeping only data points that are greater than our lower limit. When we exclude those outliers, the mean and median are much closer together, suggesting a fairly evenly distributed data set. Now, let's revisit the goal we set earlier: to identify outliers across the 33 year span of total lightning strike counts in the United States. We covered a lot of new material, and we achieved this goal. Have you ever tried to charge your phone with the wrong type of plug? It is a frustrating problem, but definitely not an insurmountable one. You will probably experience a similar frustration when you work with categorical data. Categorical data is data that is divided into a limited number of qualitative groups. For example, demographic information tends to be categorical, like occupation, ethnicity, and educational attainment.
Another way to think about categorical data is data that uses words or qualities rather than numbers. As a data professional, you will likely work with data sets that contain categorical data. Categorical data entries can typically be identified quickly because they are often represented in words and have a limited number of possible values. Many data models and algorithms don't work as well with categorical data as they do with numerical data. Assigning representative numerical values to categorical data is often the quickest and most effective way to learn about the distribution of categorical values in a data set. There are some algorithms that work well with categorical variables in word form, like decision trees, which you'll learn about in another course. However, with many data sets, the categorical data will need to be turned into numerical data. This conversion is often essential for predicting, classifying, forecasting, and more. There are several ways to change categorical data to numerical. In this video, we will focus on two common methods: creating dummy variables and label encoding. One way to work with categorical data is to create dummy variables to represent those categories. Dummy variables are variables with values of 0 or 1, which indicate the presence or absence of something. It helps to open a data set with dummy variables already created to understand what their function is. You'll find in this data set that for any value that has been determined to be mild, a 1 has been input. All other values in the mild column are labeled with 0s. The same goes for the other categories and values. You can think of the 1 as representing a yes and the 0s as representing a no. Dummy variables are especially useful in statistical models and machine learning algorithms. Another way to work with categorical data is called label encoding. Label encoding is a data transformation technique where each category is assigned a unique number instead of a qualitative value. For instance, let's imagine you had a data set about mushrooms in which you wanted to understand the distribution among different types of mushrooms. Imagine you've been given a data set about mushrooms with a column titled type and the options of black truffle, crimini, king trumpet, button, hedgehog, morel, portobello, toadstool, or shiitake. You can use label encoding to transform each mushroom type into a number. Black truffle would be changed to a 0, button would become a 1, crimini a 2, hedgehog a 3, king trumpet a 4, morel a 5, portobello a 6, shiitake a 7, and lastly, toadstool would be an 8. So why do data professionals use label encoding? Data is much simpler to clean, join, and group when it is all numbers. It also takes up less storage space. An algorithm or model typically runs smoother when we take the time to transform our categorical data into numerical data. For example, let's say you try to run a prediction model using the mushroom data set, which will try to anticipate the percent chance that a new mushroom introduced to the data is a portobello. If we try to create a model without first performing the label encoding, the prediction model will not function. Of course, there will be models and algorithms for which you may not want to perform label encoding. Let the business need be your guide on whether or not you'll need to perform label encoding. The type of model or algorithm you choose will determine whether or not you encode labels. You'll learn more about algorithms and models in an upcoming course.
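As a quick illustration of the two techniques just described, here is a minimal sketch in pandas using the hypothetical mushroom column from the example above. The data frame and column names are made up for illustration and are not from the course notebooks.

```python
import pandas as pd

# A toy mushroom data set, matching the hypothetical example above.
mushrooms = pd.DataFrame({
    'type': ['black truffle', 'button', 'crimini', 'hedgehog', 'king trumpet']
})

# Label encoding: each category gets a unique integer code
# (pandas assigns codes in alphabetical order, so black truffle = 0, button = 1, ...).
mushrooms['type_code'] = mushrooms['type'].astype('category').cat.codes

# Dummy variables: one 0/1 column per category, indicating presence or absence.
dummies = pd.get_dummies(mushrooms['type'])

print(mushrooms)
print(dummies)
```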
Much like making sure you have the right charger to plug into your phone, as a data professional, you need to make sure your data is ready to use with models or algorithms. In a previous video, you learned about the importance of label encoding, which transforms categorical data into numerical data. Now it's time to open our Python notebooks and learn how to do it. Let's continue with the same NOAA data set we've been using for lightning strike counts. For this notebook, we'll look at data from 2016, 2017, and 2018. Let's start by importing Python libraries and packages. For this notebook, let's import Datetime, Pandas, Seaborn, and Pyplot from Matplotlib. As you learned previously, start by converting the date column into DateTime, which makes it easier to manipulate. Next, create a column called month. Like we did earlier, we will use str.slice to cut the month names down to only the first three letters. When working with this data, it is helpful to make sure the month names stay in chronological order. Our next line of code, then, is months, which simply defines the group of months in order. You'll find this helpful in the next line of code, where we use the pandas function Categorical to tell pandas that the month column should follow the order defined in the months list. Because we have three years' worth of data, let's also create a column listing the year. To do this, we name the column year and use the DateTime function strftime. This function will return a string representing only the year in the original date column. Inside the argument field, we input %Y, indicating we only want the year from the DateTime string. Finally, let's create a data frame called df by month, which groups the number of strikes first by year and then by month. The last part of the code here is making sure that the number of strikes is added together for each month using the sum method. We'll tack on the reset index function to tell the notebook to renumber the rows from zero instead of keeping the grouped index. When we look at the top of the data frame using the head function, we get the first five rows of data with columns of year, month, and number of strikes. Let's create a column for categorical variables called strike level. In this column, we will group, or bucket, the total number of strikes into categories: mild, scattered, heavy, and severe. This will set us up to perform label encoding in Python. To do this, we'll create a new column in our df by month data frame called strike level. This new column will be coded using the pandas function q cut. The q in this function refers to quantile, meaning that this pandas function cuts the data into equally sized quantile groups, in our case, four. For the parameters, we first input the desired column we want to cut into quantiles, which will be the column called number of strikes. Then we input the number of quantiles we want, which is four, one for each classification. We type in those classifications under the labels parameter: mild, scattered, heavy, and severe. Let's use the head function to find out how our code modified the data frame df by month. It may seem counterintuitive to create categorical data in order to then transform it to numerical data, but this process will allow us to segment the data into useful chunks, which can then be plotted on a graph in interesting ways. Not to mention, string categories are easier to understand in a data visualization than numbers.
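Here's a condensed sketch of the preparation steps just described, assuming a data frame df with date and number of strikes columns; the exact column names in the course notebook may differ slightly.

```python
import pandas as pd

# Assumed column names: 'date' and 'number_of_strikes'.
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month_name().str.slice(stop=3)   # 'Jan', 'Feb', ...

# Keep months in calendar order rather than alphabetical order.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df['month'] = pd.Categorical(df['month'], categories=months, ordered=True)

# Year as a string, then total strikes per year and month.
df['year'] = df['date'].dt.strftime('%Y')
df_by_month = (
    df.groupby(['year', 'month'])['number_of_strikes']
    .sum()
    .reset_index()
)

# Bucket the monthly totals into four equal-sized quantile groups.
df_by_month['strike_level'] = pd.qcut(
    df_by_month['number_of_strikes'], 4,
    labels=['Mild', 'Scattered', 'Heavy', 'Severe'])
print(df_by_month.head())
```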
One other column that will help us later assigns numerical values to the strike levels we just defined: mild, scattered, heavy, and severe. We'll call this new column strike level code. We'll define it by taking the column strike level and adding dot cat dot codes. Cat codes is a pandas function that takes categories and assigns them a numeric code. In this case, when we run the cell, we find that mild has been assigned a zero, scattered a one, heavy a two, and severe a three. And that's how we perform label encoding. Having both the categorical terms for the different levels of lightning strikes and the numeric codes to go with them, we can create data visualizations or models that are straightforward to interpret. One other helpful way to work with categorical data is to create dummy variables to represent those categories. As you'll recall from a previous video, dummy variables are variables with values of zero or one which indicate the presence or absence of something. The pandas function we'll use to achieve this in our Python notebooks is called get dummies. To clarify, new columns of ones and zeroes are also known as dummies. The get underscore dummies function converts the categorical variables mild, scattered, heavy, and severe from the column called strike level into numerical, or dummy, variables. If we run the cell, you'll find this function creates four new columns and puts a one in any row where the number of lightning strikes falls into the labeled category and a zero for anything else. Now that we have our new columns of categorical data, numeric data, and dummy variables, let's discover what you can potentially learn from these groupings. To do that, let's plot our data. We'll first create a data frame called df5 month plot. Let's reshape the data by using the pandas function pivot. Pivot allows us to reshape our data frame into any given index or set of values we want, based on the parameters inside the pivot function. Those parameters are index, columns, and values, in that order. To help us with the visualization, let's use year in the index field, month in the columns field, and strike level code in the values field. If we use our head function, we find all three years in the rows, the months in the columns, and the strike level codes for each month, just like we coded. Next, let's visualize our data using what is called a heat map, so we have a better idea of where most of the lightning strike values are. A heat map is a type of data visualization that depicts the magnitude of an instance or set of values based on two colors. It is a very valuable chart for showing the concentration of values between two different points. Let's create a heat map using Python. Using the heatmap plot from Seaborn, we can place df5 month plot in the parameter field for the data, along with cmap equals Blues. This cmap definition will give us a preset color gradient from dark blue to white. Next, we fill in some of the parameters to help with understanding the visualization. In this case, we'll access the color bar through the plot's collections, using index 0. This will just give us the preset gradient color bar, which saves us a lot of time from having to code our own colors on a color bar. Next, we'll set the color bar ticks to 0 through 3 and label those ticks according to our four strike levels: severe, heavy, scattered, and mild. Lastly, when we code the show function, we are given a beautiful heat map.
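To make those last few steps concrete, here is a rough sketch of the encoding, reshaping, and heat map code described above. It assumes the df by month frame from the previous sketch, and the variable names, including the plotting frame, are illustrative rather than the exact notebook code.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Numeric codes for the ordered strike level categories (0 = Mild ... 3 = Severe).
df_by_month['strike_level_code'] = df_by_month['strike_level'].cat.codes

# Dummy (0/1) columns, one per strike level.
dummies = pd.get_dummies(df_by_month['strike_level'])

# Reshape to years in rows and months in columns, then draw the heat map.
df_by_month_plot = df_by_month.pivot(
    index='year', columns='month', values='strike_level_code')

ax = sns.heatmap(df_by_month_plot, cmap='Blues')
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([0, 1, 2, 3])
colorbar.set_ticklabels(['Mild', 'Scattered', 'Heavy', 'Severe'])
plt.show()
```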
Using the heat map, we can find the most severe lightning strike months across all three years. Thanks to Pandas and Seaborn, transforming data into categories and back into numerical data is possible with just a few lines of code. As we discussed, it's easier to plot a visualization by using strike level code as the base values. The categories we created of mild, scattered, heavy, and severe became the labels on our heat map plot. Label encoding can be as simple as that, and it is an essential skill for data professionals. When I think about input validation and the EDA practice of joining, I like to think about vegetables. Allow me to explain. When you're at the market picking out leafy greens and root vegetables, you check that they're fresh first, right? And not just at the store; you also check on them at home when they're in the refrigerator. Plus, you probably ensure that they're still edible one more time before you cook and eat them. Lastly, if for some reason the vegetables have gone bad, or you didn't buy enough for a recipe, you would need to add more. Performing EDA on a data set is similar to checking for the freshness of vegetables. You are searching, exploring, and checking that the data is as error free as possible. You should check and recheck your data sets to make sure that they are correct. As I mentioned before, as a data professional you should know your data set thoroughly. One way to make sure you know a data set is validating it, which is, as you recall, one of the six practices of EDA. There are many different ways to validate a set of data, but here we'll discuss input validation. Input validation is the practice of thoroughly analyzing and double checking to make sure data is complete, error free, and high quality. Input validation is intended to be an iterative practice, meaning you should perform it again and again, in between and during the other five practices of EDA, which are discovering, structuring, cleaning, joining, and presenting. Most often, as a data professional, you will perform input validation when starting a new analysis project or getting familiar with a new data source. We will discuss more about the how of performing input validation later in the program. Now we will focus on the why and the what. Why should we take the time to validate data? What exactly are we looking for? When we validate data, we help make more accurate business decisions and we improve complex model performance. Think about it like this. If a gourmet chef doesn't double check the vegetables for freshness before cooking a dish, the food could taste horrible, or worse, make people sick. It is much the same with data. The more careful we are in checking and rechecking our data after each manipulation during EDA, the less likely we are to have problems later. Clean and validated data can help prevent future system crashes, coding issues, or wrong predictions. So, while we are performing our validation work, what is it that we are searching for? No two data sets are alike. There will be different things to check for based on the type. Here are some questions to consider when validating data. Are all the data entries in the same format? For example, are people's ages expressed as solitary numbers, like 23 and 47, or within a range, like 18 to 35 and 35 to 50? Are all entries in the same range? For instance, in finance, are some of the values expressed in thousands of euros and others in millions of euros? Finally, are the applicable data entries expressed in the same data type?
For example, are all the date entries expressed in the same format of month, day, and year? While asking these questions, or performing EDA in general, you may find that the data you have been given is not sufficient to answer the business questions that you have been tasked with. For example, let's go back to the comparison we made earlier in the video to a gourmet chef and their vegetables. Imagine that the gourmet chef is introducing a new recipe to their restaurant. They are not sure how many vegetables will be needed for the new dish. In the end, the chef realizes halfway through the day that they did not account for additional vegetables to be used in vegan and vegetarian dishes. It is at that point the chef would buy more vegetables. The EDA practice of joining is much the same principle. As you've learned, joining is different from the structuring technique of merging. Joining is the process of augmenting data by adding values from other data sets. The practice of joining will be most useful if we validate the data to ensure formatting and data entries align and are the same data type. For example, while adding the new vegetables, the chef will want to make sure the consistency and taste are similar to the vegetables they bought earlier in the day. Otherwise, the dish may not turn out the same. You will need to use your own logic, common sense, and experience to understand the ways in which you should join or validate each particular data set you work on. There won't be a rigid process to show you, down to the detail, how to handle each data file. Data science takes a lot of analytical thinking and shifting of perspectives to be thorough in your EDA validation. Experience and effort will be the best way to improve your performance in analytical thinking. We'll also explore examples in our Python notebooks. Keeping your validation practices in line with the PACE workflow will also help you keep a focus on ethics. Validation should be about cleaning and correcting data for the sake of quality and correctness. You probably won't be surprised to learn that the EDA practice of validating fits squarely in the analyze phase of our PACE workflow, but it is also a practice you should use across all four phases. Remember, PACE stands for plan, analyze, construct, and execute. This means that what and how we join and validate data should be in alignment with the plan phase of the workflow. For example, if your task is to discover which week within a given month is most profitable for a business, it will be important that you check and recheck that you've grouped the revenue dates into weeks correctly. Your analysis won't be effective if you think you've grouped revenue by week but have actually done it by day or month. When you validate your data to this level of detail, it will help you toward meeting your PACE goals, specifically for the construct and execute phases. Before we end, there's one more thing that can help you with the validating and joining aspects of EDA. Look to your peers and managers to help you. A peer-reviewed data set is one of the best ways to check yourself on bias, keep ethics as a focus, and ensure you are on track with the PACE workflow. Remember, validating data is like checking your vegetables to see if they're the right choice for the recipe.
It may take extra time up front, but it may save you from headaches and stomachaches down the road. In this video, you'll perform input validation in Python. Remember, input validation is the practice of thoroughly analyzing and double checking to make sure data is complete, error free, and high quality. Just like checking your vegetables before you cook them. With that knowledge fresh in your mind, it's time to discuss how to perform input validation in Python. We'll focus our validation on the NOAA data. The goal will be to check for errors and prepare this data set for publication. Let's open our Python notebooks and get started. First step: as usual, we will import the Python libraries and packages that we need for input validation. You'll notice we have Pyplot from Matplotlib, Pandas, Plotly Express, and Seaborn. Next, with our lightning strike data, libraries, and packages all imported in our notebook, let's use the head function to take a look at what we have in terms of columns and the top five rows. You'll find our familiar lightning strike data set also includes a fourth and fifth column: longitude and latitude, respectively. If we were to check the data types now using df.dtypes, we would discover the date column is a string data type. As we did previously, let's convert the column to date time. One thing we should include in our input validation is a check for missing data. You'll recall from earlier that we likely will have already done this check for missing entries during the cleaning part of EDA. The definition of validate assumes an action or process has already been completed. So if we input isnull dot sum into our notebook in order to check for missing values a second time, it is a practical and not redundant thing to do. After running the code, we find that there are indeed no missing values. Next, for this data set, let's review the ranges of all the variables. In this review, we will check the highest and the lowest values of each column and the overall distribution of values. To do this, we'll use the describe function, which you may recall from a previous video. In the argument field, we'll use the parameter include equals all. This will tell the notebook to include every column covered by the describe function. You'll notice NaN values in some of the describe fields, like unique, last, and max. Remember that these are not fields in the data set but rather a summary table of attributes. For example, you'll notice that there are NaNs in the date column for mean, 50%, and max. These are to be expected because dates are not data types that would be averaged. Speaking of dates, let's double check our date column. First, are there any calendar days missing? We know from prior EDA work on this data set that the date column should include every day of each year. Let's confirm each day is listed by designing a calendar index. We'll call it full date range. We'll then use the pandas date range function with the first and last dates of 2018 in the argument field, under the start and end parameters. After that, let's add full date range dot difference. Then, in the argument field, we'll add df and, in brackets, the column header date, which will limit this function to checking only the date column. The result is interesting. There are four consecutive days missing in June of 2018, then two consecutive days in September, and two consecutive days in December. This finding would be something to investigate, or to ask the owner of the data about, to find out the reason why.
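Here's a compact sketch of the validation checks just described, assuming a data frame df with date, number of strikes, latitude, and longitude columns; the column names are taken from the narration and may not match the notebook exactly.

```python
import pandas as pd

# Confirm each column's data type, then convert the date column to datetime.
print(df.dtypes)
df['date'] = pd.to_datetime(df['date'])

# Re-check for missing values; validation repeats earlier cleaning checks on purpose.
print(df.isnull().sum())

# Review ranges and distributions for every column, not just the numeric ones.
print(df.describe(include='all'))

# Build a full 2018 calendar and list any dates that are absent from the data.
full_date_range = pd.date_range(start='2018-01-01', end='2018-12-31')
missing_dates = full_date_range.difference(df['date'])
print(missing_dates)
```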
Given that the number of missing days is relatively small, you could complete your analysis by making a note of the missing days in the presentation. This will ensure that anyone who analyzes your visualization or presentation will know that the data depicted doesn't include those missing dates. Now that we've done some initial validation, let's get a better sense of the range of the total numbers of lightning strikes. If we use a simple box plot data visualization with the number of strikes as our data parameter in the argument field, we'll find that the distribution is very skewed. Most days of the year have fewer than five lightning strikes, but some days have more than 2,000. As you'll recall from a previous video, we can remove the outliers from the box plot visualization to help us understand where the majority of our data is distributed. We can do that by adding showfliers equals False into the argument field. The result is much easier to interpret. From what we know of lightning strikes in the North American regions of the United States, Mexico, and the Caribbean, this distribution is conceivable. If the highest distribution of strikes were all in the 2,000s, we might be a little more skeptical. The last bit of input validation we will do is to verify that all of the latitudes and longitudes included in this data set are in the United States. This will ensure we don't have any mistakes in our geographic values. To do that, we'll first create a new data frame that removes all the duplicate values. We'll type df underscore points equals df, then latitude and longitude in double brackets. Following that, we'll add dot drop underscore duplicates. The reason we want to drop duplicate points for the latitude and longitude is that we don't need to check a particular location on a map twice. After running that code, we have a new data frame that only includes two columns, latitude and longitude. Within those two columns, we find that there are no duplicate data points. Finally, we will plot these points on a map. Here's a tip: when you're working with hundreds of thousands of data points plotted on a map, it takes a lot of computing power. To keep the runtime shorter, we'll use the Plotly Express package. The package is designed to keep run times as low as possible. To plot these latitude and longitude points on a map, we use the scatter geo plot from Plotly Express. Within the argument field, we'll fill in the proper parameters for a scatter geo plot. In this case, we'll first input the data frame name, df points, followed first by lat for latitude and then lon for longitude. And the last code is to show the scatter plot. After running the cell, you'll notice the runtime is still slow because of the sheer volume of the data. Most data professionals use input validation regularly, so these are important concepts to learn and important technical skills to practice. As you've learned, data sets can be busy and chaotic. There are so many different things happening in them that it can be difficult to figure out where to focus and how to uncover those stories buried beneath the surface. In this section of the course, you learned some important strategies to help you correct mistakes and discover hidden data stories. We focused our story finding efforts by learning the EDA practices of cleaning, joining, and validating. We discussed a lot of ways to engage in those practices. We focused on working with missing values and outliers, transforming categorical data into numerical data, and input validation.
And we reminded ourselves not to forget to visualize our data in Python to help further our understanding. We discussed identifying missing data and outliers in Python and why it's important, both from an ethical perspective and a business sense, to find them. We considered the difference between categorical and numerical data and why transformation using Python is important. As for input validation, you learned what it is, why it's important, and how to perform it using Python. We also discussed some workplace skills along the way, such as understanding when to communicate with stakeholders or other subject matter experts about missing values. We also reviewed ethical considerations you should make when performing cleaning, joining, and validating work. You learned why these practices are vital when performing EDA on a data set as part of the PACE workflow. EDA can be an exciting process. Thorough EDA not only saves time and energy on future processes, but it also helps you find the trends, patterns, and stories in the data. Later, you'll learn how to use Tableau to design and present data visualizations to key stakeholders and business managers in a business setting. In the meantime, you've learned some amazing skills that data professionals use almost every day. The ability to perform careful EDA on a data set in Python is an essential building block in a data professional's career. Great job on your work so far. We've reached the final practice of EDA, presenting. As we discussed, visualizing and presenting don't necessarily come at the end of your data exploration. You may create visuals of your own data throughout the data analysis process. The practice of visualizing your data will help you discover insights and further your own understanding. The presenting practice of EDA is an important part of both the analyze and execute phases of the PACE framework. Sometimes you'll create visualizations to analyze the data during the analyze phase. At other times, you'll utilize data visualizations as part of executing an algorithm. In this video, we will focus on designing data visualizations for the purpose of presenting. In this course, we'll discuss concepts such as accessibility, Tableau basics, dashboards, and data visualizations. You may recall some of these concepts from the Google Data Analytics certificate. If you'd like, take a few minutes to review that content before moving ahead. Coming up, you will learn how to improve your data visualization skills. The accurate representation of data is one of the most important aspects of a successful presentation. You'll learn specific techniques for making data visualizations that represent the data with precision. Another aspect of designing successful data visualizations is ensuring they are inclusive. We will talk about tips and techniques for making your data visualizations accessible to a diverse audience. You will also learn about the data visualization platform Tableau. Tableau is a versatile data visualization software primarily used for presenting data to inform and improve businesses. Using the software, you'll learn how to create data visualizations that tell stories and explain technical concepts to non-technical audiences. You'll learn to create dynamic visualizations with interactive and motion elements, and to alter visualizations based on different audiences' needs.
We will also discuss how to choose the right graphic or chart, give context during presentations, and select the proper order and timing for your data visualizations while presenting. Data professionals often need to create data visualizations and deliver presentations. That's why we'll discuss the most important skills to help you thrive in this area. After all, you want your audience to understand your data story. As a data professional, you will have an opportunity to tell data-driven stories that change your team, department, company, industry, or even the world. Now let's get started. Often, creating a data visualization is an iterative process. The first graph you design may not be the one you share with stakeholders. The colors, text, labels, scale, and areas of emphasis are all aspects of a graph that may change to meet different business needs. But how do we get to that final product? What decisions should we make along the way to create a successful visualization? To answer these questions, we'll take a sample data set and create a data graphic. Then, we'll discuss how to alter the visualization along the way. This will illustrate the types of decisions data professionals make when creating graphs and charts for presentations. Now let's get started. Imagine you're working on a new app which will gather a list of the top-rated rental homes and apartments in Europe in one easy-to-access place. Your company asks you to start by gathering the best rental listings in Athens, Greece. You want to know the locations of rental units where the owners have 40 or more property listings. You only want properties with many good reviews and a rental listing price between 90 and 250 euros. This helps to clarify the type of visualization you need. Since you want the locations of rental property units where owners have 40 or more property listings, you need a geographic map. Additionally, you limit the count of listings based on the criteria of your analysis. To create a map with the locations of rental properties, a program like Tableau will be the most effective. If you're familiar with Tableau, feel free to add this data set into Tableau Public and try it yourself. First, we'll start with a map of Athens. We'll plot all the rental listings on that map. We'll use filters to remove rental properties that don't meet the analysis criteria, such as prices outside our defined range. Finally, we'll adjust the formatting to make our map easily understandable and accessible. Let's consider the dimensions of the graphic. We know that the end result needs to be a map of listings in Athens, Greece. Let's input the latitude and the longitude from the data set and plot those listings with the host name as the labeled spot on the map. Now that we've added those data points to Tableau, let's see what it looks like. We have a map of Greece, but we need to refine it so that it's easier to read. Next, let's filter out the listings we don't want based on our analysis criteria. Narrow the results to only those owners who have more than 40 rental listings. We can do that with the column in the data set called calculated host listings count. We limit the amount to only values over 40. This narrows the field. Next, we limit the entries by price. Our criteria include listings between 90 and 250 euros, which gives us a much more manageable number of listings. We want to keep only the listings with a lot of reviews. The term a lot, of course, is relative.
So we'll use a percentile filter to give us just the top 50% of total reviews. This is the result. Now we have a reasonable number of data points. But what's next? We need to make it easier to comprehend. To do so, we add a title and color coding for the owners of each listing. We also add the price for each listing, and this is the result. At this point, we have a workable visualization that meets our criteria. This is a successful first version. However, if we want to make this visualization part of a formal presentation, we need to make it accessible. We do that by including a descriptive caption beneath it and making the marks slightly smaller so more of the prices display. We also want to be sure the colors are friendly to individuals who might have difficulty seeing color. This is the result. The goal with visualizations is to meet an audience's needs. You want to share the geolocation of rental property owners. You accomplish that with your map of Athens. But what if the app's business team decides they need a different data display? Or if you want to review property listings by neighborhood, you need to update the visualization. The main point here is that the resulting visualization may not always fit the business need. Designing visualizations, as you'll remember, is a multi-step process. Even after you've completed a visualization, you will sometimes find you need to update it. You may also get new information that requires you to change the original criteria. Making changes and fixing mistakes is all part of the process. We've talked a lot about the power and usefulness of Tableau. Are you ready to log in and try it out? Great. We'll discuss how to access the free online version of Tableau, build basic data visualizations, and understand when to use a variety of charts. You may remember some of these concepts from the Google Data Analytics Certificate, and you can revisit that program for a quick refresher. When you access Tableau Public, it may appear different from this video. Keep in mind that Tableau Public may have updated its user interface. This should not be a problem because the steps you follow are almost the same. To begin, go to the Tableau Public website, which is the free online version of Tableau you can access from your browser. Sign in to your Tableau Public account. Next, you'll need to access a new data source from Tableau Public. We'll use the NOAA Lightning Strike data set again for this video. On the Tableau Public home page, select Create a Viz. Upload your data source from your computer using the file provided for you. Once the data is uploaded, you will be redirected to Tableau's data source screen. The data set is divided into four columns: Date, Number of Strikes, X coordinate, and Y coordinate. There are symbols above or to the side of these field names representing the types of data that are included in the data set. There is a calendar icon for the date, a pound sign for the numeric column, number of strikes, and globe icons for the latitudes and longitudes, or X and Y coordinates. There are also tabs representing a new worksheet, a new dashboard, and a new story. Create a new worksheet. A worksheet in Tableau Public is a data page that contains a single view of a data visualization. The new worksheet is blank other than the data column. The data source field of the worksheet is pre-loaded with the column headers we identified on the data source page. There is a thin line separating the list into two sections.
These sections indicate the data type of each field, and color indicates whether the data is discrete or continuous. Tableau divides all the data fields into two broad data types: dimensions and measures. You can tell them apart based on the icons Tableau assigns to each field. Dimensions are qualitative data values used to categorize and group data to reveal details about it. Measures are numeric values that can be aggregated based on calculations. The green and blue colors in Tableau indicate another aspect of the list of dimensions and measures in the Tableau worksheet data tab. Green indicates that the data field is continuous. Blue indicates that the field is discrete. The term continuous is a mathematical concept indicating that a measure or dimension has an infinite and uncountable number of outcomes. Discrete is a mathematical concept indicating that a measure or dimension has a finite and countable number of outcomes. With these definitions, we can learn about our list of data fields even before we plot them on a chart or graph. Keep in mind that Tableau may assign the wrong data type. You can always change data from measure to dimension, or from continuous to discrete, by right-clicking on the data field if the data type and its label do not match. Next, let's begin our visualization by plotting a line graph. Line graphs are useful for presenting time series data or tracking changes in data values over different periods of time. To start, drag date into the columns field. In the pop-up that displays, you can select different segments of time. Select year for this chart. Next, drag and drop number of strikes into the rows field. You now have a line graph. Now you know how to create line graphs. Next, we'll create a bar chart. Bar charts are useful when you want to compare things, like data from two different time periods. To get started, use the toolbar to duplicate your worksheet. In the new sheet, open the drop-down menu in the marks field. Select bar. Changing data visualization types is really that simple. Next, let's create a bar chart to compare two data sets. Go to the filters tab and edit the filter. Check 2009 and 2018. Leave all the other years unchecked. You'll find a comparison of the total number of lightning strikes in the years 2009 and 2018. Next, add a label to each bar by dragging number of strikes to the square in the marks field titled label. Now we know the exact difference between the number of lightning strikes in 2009 and 2018: 30.1 million and 44.6 million, respectively. For the last visualization, let's compare 2009 to 2018 by quarter. To start, duplicate the worksheet again. Drag date to the columns field. Tableau will automatically add quarters to the stacked bars, sectioning out Q1, Q2, Q3, and Q4 for both years. To make the side-by-side comparison of each quarter stand out, drag date to the color square in the marks field. The colors will adjust based on the year, with 2009 in the original color and 2018 in a different color. You can keep the chart like this, or click the drop-down arrow and select quarter instead. Now the quarters are segmented by different colors. To determine the color scheme, decide if you want to highlight the differences between quarters or between years. You're starting to discover how useful Tableau can really be. Often, data professionals use Tableau for their EDA work because it helps them quickly create visuals of their data. Soon you'll be able to do that too.
So far in this program, you've learned the basics of Tableau Public, like how to upload a data source and use discrete or continuous measures and dimensions to plot data visualizations. Next, you'll design more complex visualizations, including heat maps, box plots, and histograms. You'll also learn how to input calculations and write code within the design elements of a visualization. Let's get started. Go to Tableau Public and create a new data visualization. We'll use another data set from the NOAA lightning strike data. Once you've uploaded the data source, you'll start your data visualization on the data source page with the NOAA lightning strike data. The date, longitude, latitude, and number of strikes columns are all there. Let's open a new worksheet so that we can create a geographic map. This requires latitude and longitude data that Tableau can use to plot the points. First, drag longitude from the data list to our columns field. Next, drag latitude into the rows field. Now, we have a map. This is a great start. Next, filter the data so that you have fewer data points to work with. Drag date into the filters field and select year from the dropdown. Then, configure the filter to select only the year 2018. Next, use color gradients to help differentiate the denser locations on the map from the more scattered lightning strike locations. Drag number of strikes to the box labeled color. In the dropdown menu under marks, make sure that density is selected. Now the map is much better. The default color for this visualization is blue. The lighter blues indicate more scattered lightning strikes and the darker blues are heavier lightning strike areas for the year 2018. To change the color, click on the color square and change it from automatic to something else. Try one or two options. Consider which will make the map most accessible for yourself and for people who might have visual disabilities. Next, let's create a heat map. As discussed in a previous video, a heat map is a type of data visualization that depicts the magnitude of an instance or set of values based on two colors. To create a heat map in Tableau Public, you need one or more dimensions and one or two measures. To start, drag date to the rows field and select year. Next, you need to create a calculation that derives the month from the date string. In the date dropdown menu, click create and select calculated field. A pop-up will open. Let's give the calculation a title. For the calculated field name, input month. Use the LEFT function to pull the month names from the string in the column date. To pull the first three letters of the month from the string, type LEFT, an open parenthesis, DATENAME, another open parenthesis, month in single quotes, a comma, date in brackets, a closing parenthesis, a comma, and 3, then a final closing parenthesis, so the whole formula reads LEFT(DATENAME('month', [Date]), 3). Click OK. Now, there is a new field on the list of data on the left. Drag month to the columns field. Then drag number of strikes to the color square under the marks field. Be sure that in the marks dropdown, square is selected. Now, you have a heat map. Just like the density map, the default color for this heat map is blue. You can adjust the heat map to a color range that fits your needs. Remember to consider accessibility as you create your visualizations. Now, let's create a box plot. You'll recall from a previous video that a box plot is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles. You've already learned how to create a box plot chart in Python.
To create a box plot in Tableau Public, first drag date to the row field. Then drag number of strikes to the columns field and select year. If Tableau doesn't default to a box plot, select circle from the dropdown menu under the marks field, then scroll to the show me tab and select box-and-whisker plot. What you have so far is not much to look at. All you're seeing is a thin line of data points on a blank white background. However, once you drag date into the detail square under the marks field and select day, you'll see box plots appear in full. With box plots, most data pros will include a legend or annotation with the exact median and mean values for presentations. You can, of course, change the colors and size of the circles in this box plot using the color and size squares under the marks field. Our last complex plot is a histogram. A histogram is a data visualization that depicts an approximate representation of the distribution of values in a data set. To create a histogram, first you need to create bins. A bin is a Tableau term that describes the custom segments of data that values can be grouped into. Bins are an important part of histograms because the groupings determine how the data is segmented and compared, much like the groupings in heat maps. For a histogram of lightning strike data, group the total number of lightning strikes into bins. Do this by clicking on the number of strikes drop down. Select create and then bins. You can give the bin a name and then determine the size. If you select 50 for the size of the bins, there would be one bin for every 50 lightning strikes. Any value between one and 50 would be included in that bin. Because there are millions of values of lightning strikes, you should select a number between 5 and 10. This will display the distribution of values in much more detail. Next, drag the new bin to the columns field. Then drag number of strikes to the row field. Make sure the number of strikes in rows is set to count, or CNT, rather than sum or average. This presents the actual count in each bin. To give us more detail, drag number of strikes to the filter field and limit the values to between 1 and 200. Drag number of strikes to the label square under the marks field. Make sure it is set to count. You can adjust the color to make the data visualization more accessible. Coming up next, you'll learn how to create a series of data visualizations to use in a presentation. I'll meet you there. Hi, my name is Drew and I am a predictive modeling specialist here at Google. So originally I actually studied to be an accountant. I worked as an accountant for about 7 years. Most of what I did was working in finance. So most people who work in finance become incredibly good at Excel. And so I became better and better and better at Excel all the time and started working with different formulas. Formulas led me to figure out coding, and then coding kind of led me towards data analytics. Exploratory data analysis is everything. You can't do anything without understanding the data first. You need to first go in and you need to really understand what every single variable is, what the distributions look like, what the initial correlations look like. You can start to go deeper than that, building explanatory models to look at longer correlations and multi-correlations between multiple variables together. It really is everything. That is data science. I can remember a project where we were working with a client who was spending tons of money on takeover media.
Takeover media requires a large investment, and it basically blocks out all other media spend from other people during a day so that you're concentrated on that day. What we found is that there wasn't a reaction between takeover media and conversions on the other side. What we had to do when framing it to the client was first figure out all the potential objections the client might have. Once we got those inputs, we were able to put together a story that wasn't so destructive to what they're doing currently but was constructive and showed them the way forward. When delivering news about data, it's important to shift from a data mindset to a client mindset. You need to marry the ideas that you learned during your data analysis to how it can actually be used by the client. A lot of the ways that we do this is through visuals. We try and incorporate visuals that are very simple yet communicative. We use humor. Humor is a great way of engaging people and it makes it very memorable for people. We use personal stories. Personal stories help get your point across to others very quickly. I once had a professor who told me that you have 30 seconds to actually communicate with someone before they stop paying attention, so you need to be able to get in there, make sure they remember it, know what the takeaways are, and have very succinct next steps for what to do with the data that you have presented to them. Data professionals often need to present their data. Creating, sharing and discussing data visualizations is an important part of telling data stories. In this video you will create a series of data visualizations that work together to tell a story. This will allow you to lead an audience from one concept to the next and build additional context as you present each visualization. But first, let's consider a scenario. Imagine you're working for an organization that wants to learn more about lightning strikes in the United States. They request a series of data visualizations that illustrate the increase in lightning strikes over time and detail lightning strike data in three states: Texas, Oklahoma and Kansas. After you create your visualizations, you will need to share them with the organization's directors. With the audience in mind, you can get started. You need to consider three strategies for organizing and presenting data visualizations in a series: chronological, generic to specific, and specific to generic. A chronological approach to data visualizations is useful for data that is best understood in a time series. A generic to specific approach helps an audience consider an issue before describing how it affects them. And a specific to generic approach is useful to highlight impacts the data can have on a broader scale. Let's select an approach. A chronological approach, or time series, would be most helpful to consider data over time, but this does not address an important aspect of the request: visualizing strike data in three specific states. With a specific to generic approach, the climax of the presentation would focus on the United States as a whole rather than lightning strikes in Texas, Oklahoma and Kansas. This does not meet the needs of the organization. A generic to specific approach would allow you to illustrate the country wide increase in lightning strikes followed by specific data on each state. You can highlight the national trend before targeting the lightning strike data in Texas, Oklahoma and Kansas.
This meets the organization's needs, so a generic to specific approach is the best choice. Before you design your visualizations, you need to create three different Tableau Public worksheets. These will form a story in Tableau Public. A story is a Tableau term for a group of dashboards or worksheets built into a presentation. For the first worksheet, you will create a line graph showing an upward trend in lightning strikes over the decade starting with 2009. For the second worksheet, you will show that the greatest amounts of lightning strikes have shifted from the east coast of the United States to the south central mainland over that decade. For the third worksheet, you will show the number of lightning strikes in Texas, Oklahoma and Kansas in 2018. These three sheets will form the generic to specific organization for the story. Let's open Tableau Public and start designing. As always, start by uploading your data source. For the column field, drag in date, which is a discrete dimension, and select year for it. For the row field, add the number of strikes measure. A line graph should appear. If it doesn't, open the show me tab and find the lines (discrete) selection. You'll notice the line moves steadily upward from 2009. Next, we will make the data visualization easier to interpret. For example, add two annotations to the line graph. First, add 30.1M strikes in 2009 to the beginning of the graph to give the audience a clear starting point. Then add 45M strikes in 2018 at its applicable position. Next, let's create a geographic data visualization. Since we want to progress from generic to specific, it is important to show the trend of lightning strikes moving west over the decade. To do that, we need to create a time series map that shows the location of lightning strikes in the US for each year since 2009. In the new worksheet, place the x coordinate in the column field and the y coordinate in the row field. This will create a map focused on the eastern and central US. If you're still seeing a line chart, make sure that both your x and y coordinates are marked as continuous. Once you have the geographic map, click on the drop down for both the x and y coordinates and select measure, then average. This will give us a manageable view of the average number of lightning strikes in the US for each year. Next, add the date, a discrete dimension, to both the filters and the pages fields. Select year for the filter method. The filters field allows us to create a dynamic filter, which can be used to select just one year of lightning strikes. The pages field can create a distinct snapshot, or page, of lightning strikes for each year. Finally, add the number of strikes to the detail field. The map showing lightning strike locations is very responsive. We just need to add the interactive part to it. With the year dimension in the pages and filters fields, you can format interactive legends to the side of the data visualization. These allow users to self-select the years they wish to see. We now have two of the three worksheets we need for our presentation. The last thing to create is the most specific visualization: 2018 lightning strikes for Texas, Oklahoma and Kansas. To do this, we'll create three new worksheets, one for each state. Then we'll place the three snapshots into one dashboard. We'll work together on the Texas visualization. Then you can create the Oklahoma and Kansas visualizations all on your own.
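If you want to sanity-check the geographic view outside of Tableau, here is a hedged matplotlib sketch that approximates one "page" of the time series map by filtering to a single year. The file name and the longitude, latitude, date, and number_of_strikes column names are assumptions about how the data is stored.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names; adjust to your copy of the NOAA data
df = pd.read_csv("noaa_lightning_strikes.csv", parse_dates=["date"])
df["year"] = df["date"].dt.year

year = 2018  # change this value to "page" through other years
one_year = df[df["year"] == year]

# Scatter of strike locations, shaded by the number of strikes at each point
plt.scatter(one_year["longitude"], one_year["latitude"],
            s=2, c=one_year["number_of_strikes"], cmap="viridis")
plt.colorbar(label="Number of strikes")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title(f"Lightning strike locations, {year}")
plt.show()
```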
Our map of lightning strikes in the state of Texas starts in almost the same way as the geographic chart we just made. We put the X coordinate in the column field and the Y coordinate in the row field. We put the number of strikes measure in the detail field. This time we add number of strikes to the color field and make the color coincide with the number of strikes. Use red for the most strikes and yellow for the fewest. Because we will limit these state snapshots to just 2018, we drop the date into the filters field, select year, then uncheck all the years except 2018. Finally, we need to create a set in Tableau Public. A set is a Tableau term for a custom field of data created from a larger data set based on custom conditions. We can create a set by first selecting all the data points in Texas. To do this, we will use the lasso tool. You will follow the state line as best as you can all the way around Texas. Once complete, you will find a popup window. Select keep only, then click create set. Drag your newly created set to the filter field. Now only the average lightning strikes for Texas will show up on our map. We can now assemble everything into one story. First, add the two other state worksheets for Oklahoma and Kansas, which were created in the same way as the Texas worksheet. Then create a dashboard for the three states, illustrating the locations of their lightning strikes in 2018. You should also create dashboards for the other two worksheets created earlier: the line chart and the interactive map of the U.S. Now that we have three dashboards, we can create a story in Tableau. On the first page of the story, insert the dashboard with the line chart. Fill in the caption to give details about what the chart is showing. On pages two and three of the story, put the interactive map of the U.S. followed by the dashboard that includes all three states, Texas, Oklahoma and Kansas. With the three captions filled in, we have a complete story. Coming up, you'll learn how to create an interactive dashboard in Tableau. I'll meet you soon. So far in this program, you've been practicing essential Tableau skills. Now you will learn how to put these skills together and create an interactive dashboard. The interactive dashboard you create will display the quantity and the location of all lightning strikes in the United States from 2009 to 2018. The dashboard will be color coded based on the quantity of lightning strikes, and it will allow viewers to select any year or location to learn the count of the strikes. Let's get started. Begin by opening Tableau Public and uploading the data source provided. To build the dashboard, you will need to create three different worksheets. First, you'll create an interactive U.S. map with locations of lightning strikes for each year. Then you'll create a dynamic list of stats for each year. Finally, you'll design a way to pull and display the lightning strike metrics for the individual locations on each year's map. Let's start with the interactive lightning strike map. To begin, add the X coordinate to the column field and the Y coordinate to the row field. Next, drag date into the filters field and select year. Make sure the show filter option is checked. For the last part of this map, drag number of strikes into both the color square and detail square in the marks field. The color square will categorize the dots on the map with a color code based on the quantity of strikes. The detail square will plot the locations for each of the lightning strikes for each year.
Make sure when you do this that both number of strikes variables are set to continuous. For the next worksheet, you will create a list of three metrics that will change on the dashboard whenever a year is selected. The three metrics are total number of strikes, average number of strikes for each location for that year, and maximum number of strikes in any location. This worksheet will differ from our other visualizations. It will be text rather than a bar chart or line graph. It will sit to the side of the interactive map and update whenever a new year is selected. First, create a calculation from the number of strikes field. In the blank field space, type the field name in brackets: [Number of Strikes]. Next, drag the new calculation into the tool tip square in the marks field. In the drop down, select measure and average. Repeat this process twice more, but instead of selecting average, select sum and then maximum. This creates dynamic fields for three metrics: the total number of strikes, the average number of strikes for each location for that year, and the maximum number of strikes in any location. Finally, create a place to display these calculations. To keep it simple, go to worksheet and select show title. Input the title: Metrics for the year. Under that, you will include the three calculations in the tool tip. To do this, type an asterisk, then Total followed by a colon, and then the sum of number of strikes inside left and right angle brackets, so the line reads *Total: <SUM(Number of Strikes)>. Repeat this step for the maximum and average tool tips you created. And there you have it. For the last worksheet, drag number of strikes to the text square in the marks field. In the drop down, select measure and average. This creates a dynamic field for the dashboard we'll create next. Now, select new dashboard. You should find all your worksheets in the list on this page. Drag over the worksheet with the interactive map in it. It may take a few seconds to load. Next, drag your other two worksheets into the dashboard. Before we clean these up, we need to create an action for the dashboard. An action is a Tableau tool that helps a user interact with a visualization or dashboard by allowing control of a selection. When a year is selected from the filter bar, all the other fields will update with it. We will connect the three worksheets together with an action. Click on the map in the dashboard. Select worksheet, then actions. In the actions pop up, select add action, then filter. In the add filter action menu, in the source worksheet section, select your dashboard from the drop down. Check the box for your map. And for the run action on list, click select. For target sheets, select your dashboard from the drop down. Select the map and the recently completed worksheet that will gather metrics for individual locations. From the clearing the selection will list, select show all values. Lastly, select all fields for the filter section. Click OK. And now you have an interactive dashboard. Update the text and titles before you're done. Add a title to the filter section. You should also title the location average variable location metrics and write brief instructions to clarify what the tool does. For example, you might type choose a point on the map to view location-specific metrics. Take a few minutes to test your dashboard. Select different years and locations on the map to make sure the tools function properly.
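To double-check the numbers the dashboard should display, here is a minimal pandas sketch of the same three metrics for a chosen year. It assumes the data has one row per location with date and number_of_strikes columns, and the file name is a placeholder.

```python
import pandas as pd

# Assumed file and column names; adjust to your copy of the NOAA data
df = pd.read_csv("noaa_lightning_strikes.csv", parse_dates=["date"])

year = 2018  # the year selected on the dashboard
one_year = df[df["date"].dt.year == year]["number_of_strikes"]

metrics = {
    "total_strikes": one_year.sum(),     # matches the SUM tooltip
    "average_strikes": one_year.mean(),  # average per location for that year
    "max_strikes": one_year.max(),       # maximum in any single location
}
print(f"Metrics for the year {year}:", metrics)
```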
The tools and techniques you've learned in this video will help you communicate and present ideas more effectively throughout your career as a data professional. Good luck and happy designing. If you were to design a data visualization of what you've learned in this program, what might it look like? Would you plot your engagement level over time on a line graph? Or would you create a heat map showing the tasks you completed, with different colors representing the tasks that took the least and most time? No matter what visualization you choose, you've learned the skills you need to create an effective graphic or visualization that an audience can understand. You've learned to use Tableau Public to tell your data stories and to adjust the content depending on the audience. We also discussed the best ways to share technical concepts with non-technical audiences. You've learned to use Tableau at a high level and you've applied the techniques data professionals use to create visualizations. No data visualization discussion would be complete without considering ethics and accessibility. You've learned that as a data professional it is your responsibility to ensure visualizations accurately represent the data. You also learned how to make data visualizations accessible for people who have visual impairments. You learned a lot. Congratulations on completing this section and good luck on the rest of your journey. Hi, I'm Tiffany and it's great to be with you again. You've made a lot of progress in the program, so I'm back to tell you more about your portfolio projects and how you can use them in your future job search. Remember, your portfolio will be a collection of materials you can use to showcase your approach to solving data-oriented problems. In the portfolio project for this course, you will demonstrate your knowledge of telling stories using data. This is an incredibly important skill for a data professional to know on the job, but it's also critical for success in an interview. As potential employers assess you as a candidate, they might ask you for specific examples of how you approached cleaning, structuring and validating data in the past. You can use your portfolio as a way to discuss actual data challenges you have resolved and real stories you have presented. Additionally, some employers might ask you to create a presentation based on what they provide. The skills you've learned in Python and Tableau to create data visualizations will help you feel more comfortable and prepared for those interviews as well as build out your portfolio. You already learned about experiential learning, which is when people gain understanding through doing. Watching an instructor create a visualization is one thing; creating a visualization yourself expands your understanding of the concepts and improves your skills at presenting. The project is also a great opportunity to discover how organizations are using data analytics every day and show off your knowledge of how to tell data-driven stories. To complete the portfolio project, you'll be working with a database and an accompanying business scenario. You'll use instructions to complete a Jupyter notebook showing your EDA work and three to five visualizations in Tableau in response to the scenario. By the time you complete this project, you'll have work you can add to your data professional portfolio. In your PACE strategy document, you'll also have documentation of the steps you took along the way, which you can use to explain your work to future hiring managers.
You're almost finished with this course, which means you're advancing your understanding of what it means to be a data professional. Now it's time to demonstrate what you've learned. This portfolio project will help you practice and demonstrate the skills you've learned throughout the course. For example, you'll be able to show how to perform the basics of EDA in Python. Using Tableau, you'll demonstrate how to create data visualizations that accurately detail a data set's story. And you'll be able to demonstrate how to prepare and document a comprehensive workflow strategy using PACE. Ready? Let's go. In this course you've been learning about using the six steps of the EDA process to tell stories with data. Now it's time for an exciting next step: putting all this to work for your portfolio project. In the previous course, you learned the basic formatting and structure for writing code in Python. These coding skills will be critical to the completion of your next portfolio project, which will require you to clean, analyze and present data to technical and non-technical audiences. Now that you have some practice completing portfolio projects, think about how to reframe the data you're analyzing and cleaning into a well-thought-out story. In this part of the course, you'll take a data set and apply the six steps of EDA to formulate a useful document that can help you present key findings to stakeholders. In other sections of this program, you will work to develop additional skills to help you thrive in the data career space. There's so much more to learn about telling stories with data. As a data professional, a large part of your job is focused on transforming messy data into an organized, clear story that meets business goals and helps stakeholders understand important details needed for making business decisions. This portfolio project is a great opportunity to demonstrate to potential employers that you can do exactly that: turn messy data into a logical story. And remember, developing your skills as a data professional is an iterative process, so you can continue to improve as you have new ideas or learn new things. At this point in the program, you've done a lot of work toward better understanding data and how it can be useful for enacting change in a business. You completed a Jupyter notebook, created visualizations to support your work, and refined your presentation to meet the needs of your particular audience. As you continue to make progress in this program, remember that documenting your learning process and skills will help you communicate what you've done to potential employers and hiring managers in future interviews. You may recall from previous sections of this course that audience awareness is essential. During the interview process, knowing how to talk about your work process, transferable skills, and other achievements will lead to much greater success. In this course, you learned the importance of following the PACE structure in a data career. You practiced using Python to manipulate data, and you demonstrated how to organize and analyze a set of data to tell a compelling story. The portfolio projects were designed to help you thrive in the job market, and the transferable skills you applied contributed to the tangible artifacts you created. As you begin preparing for future interviews, you should be ready to answer questions like: What is your process for cleaning data? What tool do you use for creating data visualizations? How and why do data visualizations enhance the stories data tells?
And what considerations are top of mind when sharing data stories with non-technical stakeholders? Of course, there may be many other questions you are asked as you interview for a data professional role. Each portfolio project will help you prepare responses. For example, in the portfolio project you just completed, you used the EDA process to clean, organize, and analyze a data set. Then you turned your data into a presentation full of visualizations that will help stakeholders understand insights from the data story you discovered. Don't forget that you recorded all of your considerations, questions, process notes, and more in your PACE strategy document as well. Coming up, you are going to learn all about the power of statistics and data-driven work. Then you will have an opportunity to use statistical analysis to simulate an A/B test. By the end of these courses, you will have lots of artifacts in your portfolio.