Python pandas column merging

Author
Discussion

zb

Original Poster:

2,924 posts

170 months

Tuesday 7th March 2023
quotequote all
Hello, I'm relatively new to Python and I'm attempting to scrape data generated from various sensors to use in modelling. I've run into an issue that no amount of YouTube tutorials or throwing stuff at the wall from Stack Overflow seems to address...

I have various csv files that I've cleaned up and merged. The sensor numbers are all five digits, but some were messed up and in the format 12345/12345 or 12345/12345/12345 instead of 12345. I successfully split them at the "/" and placed the extra parts in new columns 'Sensor1' and 'Sensor2'.
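The split step, roughly (sensor numbers invented for illustration), looked like this:

```python
import pandas as pd

df = pd.DataFrame({"Sensor_No": ["12345", "12345/43654", "12345/43654/67855"]})

# Split on "/" into separate columns; expand=True returns a DataFrame
parts = df["Sensor_No"].str.split("/", expand=True)

# Keep everything as integers; nullable Int64 leaves the blanks as <NA>
df["Sensor_No"] = parts[0].astype("int64")
df["Sensor1"] = pd.to_numeric(parts[1]).astype("Int64")
df["Sensor2"] = pd.to_numeric(parts[2]).astype("Int64")
```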

I now have:
df_a 'Sensor_No', 'a_Sensor1'
df_b 'Sensor_No', 'b_Sensor1', 'b_Sensor2'

I successfully merged them into one df_c

I'd like to merge all the sensor identification columns into one. The sticking point appears to be that I need these to remain as integers, and whenever I merge the columns, instead of:
12345
12345
12345

I get
12345 12345
12345 00000
12345 12345 12345

All the other columns are the data I want to extract ('temp', 'force', 'duration'); I just need to sort out the sensor numbers to use as an index.

I've been trying various concat, melt and stack approaches, but nothing has worked and I've hit a flat spot.

Anyone give me a nudge in the right direction?

Thank you.

buggalugs

9,243 posts

243 months

Tuesday 7th March 2023
quotequote all
I didn't completely follow what you want to merge and why but it sounds like an issue with strings vs integers - you want to keep them as integers but treat them like strings. "1234"+"1234"="12341234" not 1234+1234=2468.

Maybe throw a few type() checks in to make sure that everything is the data type you're expecting it to be everywhere and be very conscious of converting from int() to str() and back to do all your various operations on it. Does it make sense for it to be an int at rest or would it be less work to keep the data as str then int() it later instead?

Not really a python expert but I've fecked about with a few csv's in my time smile
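E.g. the difference in practice:

```python
# Strings concatenate, ints add - worth checking which you've actually got
value = "1234"
print(type(value))  # <class 'str'>

assert "1234" + "1234" == "12341234"
assert 1234 + 1234 == 2468
assert int("1234") + 1234 == 2468   # convert first if you want arithmetic
```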

zb

Original Poster:

2,924 posts

170 months

Tuesday 7th March 2023
quotequote all
buggalugs said:
I didn't completely follow what you want to merge and why but it sounds like an issue with strings vs integers - you want to keep them as integers but treat them like strings. "1234"+"1234"="12341234" not 1234+1234=2468.

Maybe throw a few type() checks in to make sure that everything is the data type you're expecting it to be everywhere and be very conscious of converting from int() to str() and back to do all your various operations on it. Does it make sense for it to be an int at rest or would it be less work to keep the data as str then int() it later instead?

Not really a python expert but I've fecked about with a few csv's in my time smile
Cheers for the reply, I'm not sure I follow what I'm trying to do either! Let's see if I can place a table here that explains it a bit better (should have done this in the first place).

Sensor_No a_Sensor1 b_Sensor1 b_Sensor2
12345 BLANK BLANK BLANK
12345 BLANK BLANK BLANK
12345 12345 BLANK BLANK
12345 BLANK 12345 BLANK
12345 BLANK BLANK BLANK
12345 BLANK 12345 12345


To join as

New_Sensor_No
12345
12345
12345
12345
12345
12345
12345
12345
12345
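In code terms, what I'm after (numbers invented, blanks as None) is roughly:

```python
import pandas as pd

df = pd.DataFrame({
    "Sensor_No": [11111, 22222, 33333],
    "a_Sensor1": [None, 44444, None],
    "b_Sensor1": [None, None, 55555],
    "b_Sensor2": [None, None, 66666],
})

# Melt every sensor column into one long column, drop the blanks,
# and keep the result as (nullable) integers
melted = df.melt(value_name="New_Sensor_No")["New_Sensor_No"]
new_sensors = melted.dropna().astype("Int64").reset_index(drop=True)
```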

Tycho

11,823 posts

279 months

Tuesday 7th March 2023
quotequote all
If the data in the first column is always there and the data in the other columns is always the same as the first column then can't you just read in the first column and use that?

zb

Original Poster:

2,924 posts

170 months

Tuesday 7th March 2023
quotequote all
Tycho said:
If the data in the first column is always there and the data in the other columns is always the same as the first column then can't you just read in the first column and use that?
I think you might be on to something. I just can't verify that the sensor numbers I've split off are all numbers for the same sensor, though I suspect they might well be.

Just to be clear, the ones I've split off have different digits from the original sensor; I hope using 12345 throughout hasn't confused the issue.

12345/43654/67855 is split into 12345 43654 67855, with the latter two sections in new columns.

I'm going to drop the other sensor columns I've created and try some data analysis, see if anything wild pops up.

TonyRPH

13,111 posts

174 months

Tuesday 7th March 2023
quotequote all
What about using an array?

Then you could access each element and put it exactly where you want it.

zb

Original Poster:

2,924 posts

170 months

Tuesday 7th March 2023
quotequote all
TonyRPH said:
What about using an array?

Then you could access each element and put it exactly where you want it.
Seemingly Python doesn't have Arrays per se, you have to use LISTS and the numpy library, I'll investigate.

To add, I can't just tidy this up in Excel and be done with it. It's taken me a lot to get to this point; the csv files were all over the place. I also have just under a hundred datasets, hence my keenness to get a script where I can simply change the csv file name for importing and it'll do all the nasty work.

eharding

14,097 posts

290 months

Tuesday 7th March 2023
quotequote all
I'm not really clear as to what you're trying to achieve, but if you want to create a new column in a Pandas DataFrame based on some logic applied to the columns in each row, then you can literally use the "apply" method, e.g.

(code screenshot not recovered)

....which results in....

(output screenshot not recovered)

You can perform whatever logic you like in the apply function and then tidy your dataframe up afterwards to keep just the columns you want....

Bizarrely, for some reason PH forbids posting if I mention "l a m b d a" functions, so I won't....
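A rough sketch of that kind of apply-based approach (column names and values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a_Sensor1": [11111, None, 33333],
    "b_Sensor1": [None, 22222, None],
})

# Row-wise function: take the first non-blank sensor value in the row
def first_sensor(row):
    for value in row:
        if pd.notna(value):
            return int(value)
    return None

# axis=1 applies the function across each row rather than each column
df["New_Sensor_No"] = df.apply(first_sensor, axis=1)
```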

Tycho

11,823 posts

279 months

Wednesday 8th March 2023
quotequote all
zb said:
TonyRPH said:
What about using an array?

Then you could access each element and put it exactly where you want it.
Seemingly Python doesn't have Arrays per se, you have to use LISTS and the numpy library, I'll investigate.

To add, I can't just tidy this up in Excel and be done with it. It's taken me a lot to get to this point, the csv files were all over the place. I also have just under a hundred datasets, hence my keenness to just get a script where I can just change the csv file name for importing, and it'll do all the nasty work.
You could use a list of lists with the first column as the list name. You can then reference them from there similar to an array.

selwonk

2,132 posts

231 months

Wednesday 8th March 2023
quotequote all
I'm also struggling to see exactly what is required, but it sounds like you have a Pandas data frame with three columns.

Column A - Always populated.
Column B - Sometimes populated.
Column C - Sometimes populated.

It sounds like you want to set columns B and C to None where they contain the same value as column A.

Is that correct?
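If so, something like this (toy values, column names assumed) would do it:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [12345, 22222, 33333],
    "B": [12345, None, 33333],
    "C": [99999, 22222, 33333],
})

# Blank out B and C wherever they just repeat the value in A
for col in ["B", "C"]:
    df[col] = df[col].mask(df[col] == df["A"])
```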

ATG

21,157 posts

278 months

Wednesday 8th March 2023
quotequote all
Tycho said:
zb said:
TonyRPH said:
What about using an array?

Then you could access each element and put it exactly where you want it.
Seemingly Python doesn't have Arrays per se, you have to use LISTS and the numpy library, I'll investigate.

To add, I can't just tidy this up in Excel and be done with it. It's taken me a lot to get to this point, the csv files were all over the place. I also have just under a hundred datasets, hence my keenness to just get a script where I can just change the csv file name for importing, and it'll do all the nasty work.
You could use a list of lists with the first column as the list name. You can then reference them from there similar to an array.
Yes, but that is fighting how lists are intended to be used. Python encourages you to iterate over list items; "for each" in other languages.

Writing stuff like:

for i in range(0, len(myData)):
    do_something(myData[i])

would be described as a "smell" or as being unpythonic.
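Whereas the pythonic version just iterates over the items themselves (data invented):

```python
my_data = ["12345", "43654", "67855"]

# Pythonic: iterate over the items directly, no index bookkeeping
sensors = []
for item in my_data:
    sensors.append(int(item))
```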


Anyway ... speaking as a Python developer who builds data analysis apps, I would suggest you try to use python's basic data types (lists, dicts, generators) FIRST and only start trying to use dataframes like Pandas if you really need to and they are bringing something specific to the table.

Reading data from csv files, and pivoting it, filtering it, and making sure you're treating numbers as numbers and labels as labels, is most easily done in pure, plain old Python. If you try to use something like Pandas to solve a problem, you have to understand both Python and Pandas. If you avoid Pandas until you really need it, you can concentrate on learning Python.
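For example, reading the csv and splitting the sensor column needs nothing beyond the standard library (file contents invented here, with io.StringIO standing in for an open file):

```python
import csv
import io

# Stand-in for open("data.csv"); values are invented
raw = io.StringIO(
    "Sensor_No,temp\n"
    "12345/43654,20.1\n"
    "67855,19.7\n"
)

rows = []
for record in csv.DictReader(raw):
    # Split the slash-separated IDs and keep them as proper ints
    sensors = [int(s) for s in record["Sensor_No"].split("/")]
    rows.append((sensors, float(record["temp"])))
```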

ATG

21,157 posts

278 months

Wednesday 8th March 2023
quotequote all
Also this looks like a shining example of where you could express the requirement as a test. E.g.

file1_data = "skjsfjgksfdjgkj"
file2_data = "skjsfjgksfdjgkj"

processed_data = munge_my_data(file1_data, file2_data)

assert processed_data == "sngdsfngjfndjgnfdjgnfjdngjfndj"

where the file1 and file2 data are taken from real examples

Tycho

11,823 posts

279 months

Wednesday 8th March 2023
quotequote all
ATG said:
Tycho said:
zb said:
TonyRPH said:
What about using an array?

Then you could access each element and put it exactly where you want it.
Seemingly Python doesn't have Arrays per se, you have to use LISTS and the numpy library, I'll investigate.

To add, I can't just tidy this up in Excel and be done with it. It's taken me a lot to get to this point, the csv files were all over the place. I also have just under a hundred datasets, hence my keenness to just get a script where I can just change the csv file name for importing, and it'll do all the nasty work.
You could use a list of lists with the first column as the list name. You can then reference them from there similar to an array.
Yes, but that is fighting how lists are intended to be used. Python encourages you to iterate over list items; "for each" in other languages.

Writing stuff like:

for i in range(0, len(myData)):
    do_something(myData[i])

would be described as a "smell" or as being unpythonic.
Thanks for explaining.

eharding

14,097 posts

290 months

Wednesday 8th March 2023
quotequote all
ATG said:
Reading data from csv files, and pivoting it, filtering it, making sure you're treating numbers as numbers and labels as labels is most easily done as pure, plain old python. If you try to use something like Pandas to solve a problem, you have to understand Python and Pandas. If you avoid using Pandas until you really need them, you can concentrate on learning Python.
That being said, there are plenty of folk with a Data Science background who are very familiar with the types of processing available with, for example, R data frames, for whom the natural progression into Python is via Pandas. They don't want to be Python developers as such, don't really want to dick about learning how slow the sorts of processing they're interested in can be with native Python collections, and just want to get to the good stuff. Of course, a basic understanding of datatypes is necessary, but you don't need to be a developer to be a good data scientist.

ATG

21,157 posts

278 months

Wednesday 8th March 2023
quotequote all
eharding said:
ATG said:
Reading data from csv files, and pivoting it, filtering it, making sure you're treating numbers as numbers and labels as labels is most easily done as pure, plain old python. If you try to use something like Pandas to solve a problem, you have to understand Python and Pandas. If you avoid using Pandas until you really need them, you can concentrate on learning Python.
That being said, there are plenty of folk with a Data Science background and who are very familiar with the types of processing available with, for example, R Data Frames for whom the natural progression into Python is via Pandas - they don't want to be Python developers as such, and don't really want to dick about learning how slow it is to perform the sorts of processing they're interested in doing with native Python collections, and just want to get to the good stuff. Of course, a basic understanding of datatypes is necessary, but you don't need to be a developer to be a good data scientist.
To repeat myself: if the dataframe framework brings something useful to the table (e.g. you want to do some statistical analysis or tensor bashing) then use it, but if all you're doing is reading some csv files and handling blanks or wonky representations of "data missing", then dataframes often make life harder and less reliable. Numpy, scipy and pandas are good for scientific computation, the kind of stuff I started doing in Fortran, and as you say are good alternatives to Matlab, R or S. For actual data science, Python is a perfectly good orchestration language, but you'd generally be using it to run Spark or some other distributed framework, not doing the number crunching in pandas.

zb

Original Poster:

2,924 posts

170 months

Wednesday 8th March 2023
quotequote all
Thanks for the input guys, it wasn't my intention to toss in a bit of a hand grenade and start a coding war!

eharding, I've taken the coward's way out and just removed the offending columns. I'll have a play around with your suggestion to see if I can understand it a little better.

As others have observed, I'm essentially trying to bullshit my way along some data wrangling. Believe me, there have been a lot of times over the past weekend, and the few days since, that I've dearly regretted not doing a basic 12-week foundation Python course rather than trying to shamble my way through with YouTube tutorials and the various coding teaching websites. There's absolutely no substitute for someone hitting you upside the head and saying "not like that, like this".


My latest Eiger is that I'm attempting to join together two dataframes. Sensor_No is a shared column, i.e. some sensor numbers duplicate, but the two frames have different property columns. Neither merge nor concatenate is giving me what I want so far.

e.g. I would like:
(sensor number) 12345 | property1 | property2 | property3

Instead, I'm getting something like:

(sensor number) 12345 | property1 | (blank) | property3
(sensor number) 12345 | (blank) | property2 | (blank)

banghead


NowWatchThisDrive

763 posts

110 months

Wednesday 8th March 2023
quotequote all
How exactly are you trying to merge the dataframes - pandas.merge? What values are you specifying for the parameters?

When you call .dtypes on the two dataframes, what dtypes do you see for the sensor number column in each?
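For illustration, a key-dtype mismatch (dataframes invented) looks like this:

```python
import pandas as pd

df_a = pd.DataFrame({"Sensor_No": ["12345", "43654"], "temp": [20.1, 19.7]})
df_b = pd.DataFrame({"Sensor_No": [12345, 43654], "force": [1.2, 3.4]})

# The key columns have different dtypes: object (str) vs int64,
# so "12345" will not line up with 12345 when merging
print(df_a.dtypes["Sensor_No"], df_b.dtypes["Sensor_No"])

# Align the key dtypes before merging
df_a["Sensor_No"] = df_a["Sensor_No"].astype("int64")
merged = pd.merge(df_a, df_b, on="Sensor_No", how="inner")
```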

zb

Original Poster:

2,924 posts

170 months

Thursday 9th March 2023
quotequote all
NowWatchThisDrive said:
How exactly are you trying to merge the dataframes - pandas.merge? What values are you specifying for the parameters?

When you call dataframe.dtypes on the two dataframes, what types do you see for the sensor number column in each?
Carefully and meticulously, through a series of bodges, I believe I've got something that works. I'll put my code up for the amusement of those who know what they're doing with Python.


# As Sensor_No is the common column across datasets, set it as the index
df_a.set_index("Sensor_No", inplace=True)
df_b.set_index("Sensor_No", inplace=True)

# axis=1 lines the dataframes up side by side on the shared Sensor_No index
all_sensors = pd.concat([df_a, df_b], axis=1)



Did a little more tidying up after, but that's the gist of it.