I downloaded en.openfoodfacts.org.products.csv, an 8.2 GB file containing 2.9 million products with 203 attributes stored per product, such as name, origin, and sugar per 100 g.
To be able to work with this data, a couple of pre-processing steps and selections are executed, which we will lay out here with justifications where necessary (a sketch of the corresponding filters follows the list):
Products must be sold in Germany. (I am German and I wanted to at least know some of the products. It also made my life easier because it reduced the amount of data by roughly a factor of 10.)
Products must have a valid Nutriscore. (The whole point of this analysis is to have a look at the Nutriscore, so items with a null Nutriscore are discarded. This reduces the data by another 3x.)
Remove redundant information. (Some columns are redundant as they contain the same content, just in a different format. This drops about 25 columns.)
Remove products with non-unique IDs. (This drops another couple of rows.)
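Here is a minimal sketch of how these filters could look in Polars. The column names countries_en, nutriscore_grade, and code (the product ID) are assumptions based on the Open Food Facts CSV export, and the redundant-column cleanup is omitted for brevity.
Code
import polars as pl

# Hedged sketch of the pre-processing; the column names are assumptions
# based on the Open Food Facts export, which is tab-separated.
df = pl.read_csv(
    "en.openfoodfacts.org.products.csv", separator="\t", ignore_errors=True
)
df = (
    df.filter(pl.col("countries_en").str.contains("Germany"))  # sold in Germany
    .filter(pl.col("nutriscore_grade").is_not_null())  # valid Nutriscore
    .unique(subset="code", keep="none")  # drop products with non-unique IDs
)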
After executing all these steps, we are left with around 73k products and 175 columns corresponding to different data fields. Out of those 175 columns, 118 contain information about the nutrients (per 100 g).
Here’s how the number of products splits up across the different Nutriscore grades.
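For reference, those per-grade counts can be computed directly from the cleaned dataframe; a minimal sketch, assuming a recent Polars version (where group_by(...).len() is available; older versions spell it groupby):
Code
df.group_by("nutriscore_grade").len().sort("nutriscore_grade")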
It is important to note that products are listed in the database multiple times, often because they occur in different portion sizes. For example, here are all products named “Snickers”.
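A minimal sketch of such a lookup; product_name and quantity are assumed column names from the export:
Code
df.filter(pl.col("product_name") == "Snickers").select(
    ["product_name", "quantity", "nutriscore_grade"]
)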
The Nutriscore is meant to serve as a tool for quickly comparing products. The grade can be one of “A”, “B”, “C”, “D”, or “E”. However, there is also an actual numeric score underpinning the grade.
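To illustrate the relationship, here is a rough mapping from the numeric score to the grade for solid foods. The exact cut-offs are an assumption based on the published 2017 Nutriscore specification and should be checked against the official documents; beverages use different, stricter thresholds.
Code
def grade_from_score(score: int) -> str:
    # Assumed cut-offs for solid foods (2017 Nutriscore spec);
    # beverages are graded on a different, stricter scale.
    if score <= -1:
        return "A"
    if score <= 2:
        return "B"
    if score <= 10:
        return "C"
    if score <= 18:
        return "D"
    return "E"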
(Note that while the official Nutriscore has a green-to-red scale, we are using a blue-to-red scale to ensure better readability for people with colorblindness.)
This is shown in the next graph, which presents a histogram of these numeric scores, colored by their Nutriscore grade.
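A minimal sketch of how such a histogram can be built with Altair; using Vega’s “redblue” color scheme (reversed so that “A” maps to blue) as a stand-in for the scale described above:
Code
import altair as alt

alt.Chart(
    df.select(["nutriscore_score", "nutriscore_grade"]).to_pandas()
).mark_bar().encode(
    x=alt.X("nutriscore_score:Q", bin=alt.Bin(step=1), title="numeric Nutriscore"),
    y="count()",
    color=alt.Color(
        "nutriscore_grade:N", scale=alt.Scale(scheme="redblue", reverse=True)
    ),
)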
Interestingly, there are a number of products that have an “E” grade but a lower numeric Nutriscore. Looking at a random selection of those, it seems that these are mostly beverages containing sugar. That makes sense: beverages are graded on a stricter scale than solid foods, so a sugary drink already receives an “E” at a numeric score that would earn a solid food a better grade.
Very frequently I can’t believe how a specific product got an “A” grade. For example, how does some chocolate get an “A” even though it is high in sugar? Is there any way that companies are cheating the system?
Let’s have a look at all the products that contain the word “choco” and see if we find something.
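A minimal sketch of that lookup; sugars_100g is an assumed column name from the export:
Code
df.filter(
    pl.col("product_name").str.to_lowercase().str.contains("choco")
).select(["product_name", "sugars_100g", "nutriscore_grade"]).sort(
    "sugars_100g", descending=True
)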
Let’s see which of the nutrients correlate with the Nutriscore.
To do that, we pick all columns that contain information about the nutrients and where at most 10% of the entries are null.
Code
d = (
    (
        df.select([c for c in df.columns if c.endswith("_100g")]).null_count()
        / len(df)
        * 100
    )
    <= 10
)[0].to_dict()
columns = ["nutriscore_score"] + [key for key, values in d.items() if values[0]]
display(
    df.select(columns)
    .to_pandas()
    .corr()[["nutriscore_score"]]
    .iloc[1:-1, :]
    .sort_values("nutriscore_score")
    .style.background_gradient(cmap="RdBu", vmin=-1, vmax=1)
)
So (saturated) fat is the strongest single indicator of a high, i.e. bad, numeric Nutriscore.
The obvious next question is: how well can we predict this score? We’ll use a simple decision tree.
Code
# Target: the numeric Nutriscore; features: the nutrient columns selected above.
y = df.select("nutriscore_score").to_numpy().flatten()
logger.info(y.shape)
X = df.select(
    [
        c
        for c in columns_for_fitting
        if (c != "nutriscore_score" and c != "nutrition-score-fr_100g")
    ]
).to_numpy()
logger.info(X.shape)
logger.info("Keep only data that has no nans.")
select = np.sum(np.isnan(X), axis=1) == 0
y = y[select]
X = X[select, :]
logger.info(y.shape)
logger.info(X.shape)
transformer = Normalizer().fit(X)
X_train, X_test, y_train, y_test = train_test_split(
    transformer.transform(X), y, test_size=0.20, random_state=2023
)
clf = tree.DecisionTreeRegressor(max_depth=15)
clf = clf.fit(X_train, y_train)
# Collect actual vs. predicted scores for both splits into one dataframe.
df_tree = pl.concat(
    [
        pl.DataFrame(
            {
                "actual score": y_test,
                "predicted score": clf.predict(X_test),
                "label": "test",
            }
        ),
        pl.DataFrame(
            {
                "actual score": y_train,
                "predicted score": clf.predict(X_train),
                "label": "train",
            }
        ),
    ]
).with_columns(err=pl.col("predicted score") - pl.col("actual score"))
chart_v1 = (
    alt.Chart(df_tree)
    .mark_point(filled=True, opacity=0.02)
    .encode(
        x="actual score:Q",
        y="predicted score:Q",
        # y="err:Q",
        color="label:N",
        column="label:N",
    )
    .properties(width=300, height=300)
)
display(chart_v1)
A little underwhelming. Maybe adding information on the categories helps. Let’s find the top 20 labels and one-hot-encode them.
Code
lol_categories = (
    df.select("categories_en")
    .with_columns(pl.col("categories_en").str.split(","))["categories_en"]
    .to_list()
)
categories = []
# Note: only the first 100 products are scanned to build the histogram.
for l in lol_categories[:100]:
    categories.extend(l)
hist = dict(collections.Counter(categories))
df_hist = (
    pl.DataFrame(
        {
            "categorie": [key for key, _ in hist.items()],
            "count": [value for _, value in hist.items()],
        }
    )
    .sort("count", descending=True)
    .head(20)
)
print(df_hist["categorie"].to_list())
Alright, so we can now one-hot-encode these. Let’s run the regression again, but now with these labels added as features. We can also check the correlations again.
Code
df_one_hot = df.with_columns(
    (pl.col("categories_en").str.to_lowercase().str.contains(x.lower()))
    .cast(int)
    .alias(f"1hot__{x}")
    for x in df_hist["categorie"].to_list()
)
columns_one_hot = [c for c in df_one_hot.columns if "1hot" in c]
logger.info(columns_one_hot)
y = df_one_hot.select("nutriscore_score").to_numpy().flatten()
logger.info(y.shape)
X = df_one_hot.select(
    [
        c
        for c in columns_for_fitting + columns_one_hot
        if (c != "nutriscore_score" and c != "nutrition-score-fr_100g")
    ]
).to_numpy()
logger.info(X.shape)
logger.info("Keep only data that has no nans.")
select = np.sum(np.isnan(X), axis=1) == 0
y = y[select]
X = X[select, :]
logger.info(y.shape)
logger.info(X.shape)
d = (
    (
        df_one_hot.select(
            [c for c in df_one_hot.columns if c.endswith("_100g")]
        ).null_count()
        / len(df)
        * 100
    )
    <= 10
)[0].to_dict()
columns = (
    ["nutriscore_score"]
    + [key for key, values in d.items() if values[0]]
    + columns_one_hot
)
display(
    df_one_hot.select(columns)
    .to_pandas()
    .corr()[["nutriscore_score"]]
    .iloc[1:-1, :]
    .sort_values("nutriscore_score")
    .style.background_gradient(cmap="RdBu", vmin=-1, vmax=1)
)
logger.info(X.shape)
transformer = Normalizer().fit(X)
X_train, X_test, y_train, y_test = train_test_split(
    transformer.transform(X), y, test_size=0.20, random_state=2023
)
clf = tree.DecisionTreeRegressor(max_depth=15)
clf = clf.fit(X_train, y_train)
df_tree_one_hot = pl.concat(
    [
        pl.DataFrame(
            {
                "actual score": y_test,
                "predicted score": clf.predict(X_test),
                "label": "test",
            }
        ),
        pl.DataFrame(
            {
                "actual score": y_train,
                "predicted score": clf.predict(X_train),
                "label": "train",
            }
        ),
    ]
).with_columns(err=pl.col("predicted score") - pl.col("actual score"))
new_chart = (
    alt.Chart(df_tree_one_hot)
    .mark_point(filled=True, opacity=0.02)
    .encode(
        x="actual score:Q",
        y="predicted score:Q",
        # y="err:Q",
        color="label:N",
        column="label:N",
    )
    .properties(width=300, height=300)
)
chart_v1 & new_chart
This looks fine for now: with the one-hot-encoded features, the error is mostly acceptable.
However, let’s try one more time, this time with classification instead of regression.
Code
df_one_hot = df.with_columns(
    (pl.col("categories_en").str.to_lowercase().str.contains(x.lower()))
    .cast(int)
    .alias(f"1hot__{x}")
    for x in df_hist["categorie"].to_list()
)
columns_one_hot = [c for c in df_one_hot.columns if "1hot" in c]
y = df_one_hot.select("nutriscore_grade").to_numpy().flatten()
X = df_one_hot.select(
    [
        c
        for c in columns_for_fitting + columns_one_hot
        if (c != "nutriscore_score" and c != "nutrition-score-fr_100g")
    ]
).to_numpy()
select = np.sum(np.isnan(X), axis=1) == 0
y = y[select]
X = X[select, :]
logger.info(y.shape)
logger.info(X.shape)
clf = tree.DecisionTreeClassifier(max_depth=10)
transformer = Normalizer().fit(X)
# Note: test_size=0.80 means the classifier is trained on only 20% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    transformer.transform(X), y, test_size=0.80, random_state=2023
)
clf.fit(X_train, y_train)
y_test_predict = clf.predict(X_test)
y_train_predict = clf.predict(X_train)
logger.info("Training confusion matrix")
# cm = confusion_matrix(y_test, y_test_predict, labels=["A", "B", "C", "D", "E"])
cm = confusion_matrix(y_train, y_train_predict, labels=["A", "B", "C", "D", "E"])
# Confusion matrix: the (i, j) entry counts samples whose true label is the
# i-th class and whose predicted label is the j-th class.
letter = {4: "A", 3: "B", 2: "C", 1: "D", 0: "E"}
display(
    pl.DataFrame(cm)
    .rename(
        mapping={
            f"column_{i}": f"predicts {grade}"
            for i, grade in zip([0, 1, 2, 3, 4], ["A", "B", "C", "D", "E"])
        }
    )
    .with_columns(
        # Recover the true-label column; this relies on the "predicts A"
        # counts decreasing monotonically from true "A" to true "E".
        pl.col("predicts A")
        .rank()
        .cast(int)
        .apply(lambda r: letter[r - 1])
        .alias("true label")
    )
    .select(
        "true label",
        "predicts A",
        "predicts B",
        "predicts C",
        "predicts D",
        "predicts E",
    )
    .to_pandas()
    .style.background_gradient()
)
logger.info("Testing confusion matrix")
cm = confusion_matrix(y_test, y_test_predict, labels=["A", "B", "C", "D", "E"])
# cm = confusion_matrix(y_train, y_train_predict, labels=["A", "B", "C", "D", "E"])
display(
    pl.DataFrame(cm)
    .rename(
        mapping={
            f"column_{i}": f"predicts {grade}"
            for i, grade in zip([0, 1, 2, 3, 4], ["A", "B", "C", "D", "E"])
        }
    )
    .with_columns(
        pl.col("predicts A")
        .rank()
        .cast(int)
        .apply(lambda r: letter[r - 1])
        .alias("true label")
    )
    .select(
        "true label",
        "predicts A",
        "predicts B",
        "predicts C",
        "predicts D",
        "predicts E",
    )
    .to_pandas()
    .style.background_gradient()
)
Summary
This was a good first deep dive into the data. Nothing suspicious as of yet: no obvious greenwashing, and prediction works reasonably well (as expected). To be continued!